first midterm tonight and i will be one step closer to having a couple of weeks to be a person again
#i am so tired #i am so braindead #there is nothing up there except regression formulas and r code #michael says things
Linear Regression
Today we will be covering linear regression and coding up a program that performs simple linear regression.
Linear regression is one of the fundamental concepts in statistics. It has a plethora of uses, ranging from machine learning to economics to finance to even epidemiology (see: determining a correlation between smoking cigarettes and lung cancer). Its importance comes from how useful it is in modeling the relationship between a scalar dependent variable and, in general, one or more explanatory variables. Moreover, it gives one an idea of how to predict the outcome of certain events given a set of data, provided the data follows a roughly linear trend.
What I ended up coding today was only simple linear regression. In the real world, one often wants to model more than one explanatory variable. However, without going into matrices and developing a matrix module, which I plan on doing later, I will instead rely on the structure of lists. Because of the niceness of two-variable linear regression, we end up getting a closed formula for our two coefficients, which I will be abusing in the code.
The general formula for simple linear regression is as follows:
\[ y = x \beta + \epsilon, \text{ }\epsilon, \beta \in \mathbb{R}. \]
In other words, one can imagine the beta and epsilon to be the slope and y-intercept of a linear formula, respectively. (Note that epsilon more commonly denotes the error term; here I'm using it for the intercept.)
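For instance, taking \( \beta = 2 \) and \( \epsilon = 1 \) gives the line

\[ y = 2x + 1, \]

so an input of \( x = 3 \) predicts \( y = 7 \).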
One can derive the closed formulas for the beta and epsilon using the least squares estimates, but I won't be doing that here; for more information on the derivation, see here. Instead, I'll just be going over what my code does. To see the closed formulas for the beta and epsilon, see here.
The code breaks down as follows:
power(x, exp): Nothing too crazy here. While there is technically a power function in the math library (not to mention the built-in ** operator) which is probably much more efficient, for practice I ended up coding my own. Requires the exponent to be a nonnegative integer, since inverses will not be considered.

import warnings

def power(x, exp):
    if not isinstance(exp, int) or exp < 0:
        warnings.warn("exp needs to be a nonnegative integer")
    result = 1
    for i in range(exp):
        result = result * x
    return result
ffsum(values): This function sums up all the elements in a list. Once again, Python has a built-in function (sum) to do this, but for practice I decided to write my own.

def ffsum(values):
    total = 0
    for v in values:
        try:
            total += v
        except TypeError:
            warnings.warn("you probably don't have just numeric values in the list")
    return total
simple_linear_regression(x1, x2): The meat of our rather short script. Here x1 plays the role of x and x2 the role of y. We create three more lists, dubbed x1x2, x1sq, and x2sq, because we'll need the sums of these lists to find our a and b. Recall from the website that the closed formulas for a and b are
\[ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum x y)}{n(\sum x^2)-(\sum x)^2},\]
\[b = \frac{n(\sum x y) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}. \]
After finding the sums of x1x2, x1sq (representing x1 squared), and x2sq, it is just a matter of plugging in the numbers. Note that we must perform a type conversion on each of these values from int to float so that we get accurate values; otherwise Python 2's integer division will round our a and b, which we don't want. (In Python 3, / already performs true division, so the conversions are harmless but unnecessary there.)
def simple_linear_regression(x1, x2):
    if len(x1) != len(x2):
        warnings.warn("the lists are not the same size")
    x1x2 = []
    x1sq = []
    x2sq = []
    for i in range(len(x1)):
        x1x2.append(x1[i] * x2[i])
        x1sq.append(power(x1[i], 2))
        x2sq.append(power(x2[i], 2))  # not needed for a and b, but kept around
    # arbitrarily chose x1, could've used x2 as well
    n = len(x1)
    # Note we have to sloppily convert all of these values to float.
    # Probably should find out a way to do it better.
    denom = float(n) * float(ffsum(x1sq)) - float(power(ffsum(x1), 2))
    a = (float(ffsum(x2)) * float(ffsum(x1sq)) - float(ffsum(x1)) * float(ffsum(x1x2))) / denom
    b = (float(n) * float(ffsum(x1x2)) - float(ffsum(x1)) * float(ffsum(x2))) / denom
    print('a: ' + str(a))
    print('b: ' + str(b))
    return a, b
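As a sanity check on the closed formulas, here is a standalone sketch (using Python's built-in sum rather than the helper functions above, with variable names of my own) that plugs data lying exactly on the line y = 2x + 1 into the formulas for a and b. Since the data is perfectly linear, we should recover the intercept and slope exactly:

```python
# Data on the line y = 2x + 1, so we expect a = 1 (intercept) and b = 2 (slope).
x = [1, 2, 3, 4, 5]
y = [2 * xi + 1 for xi in x]  # [3, 5, 7, 9, 11]

n = len(x)
sx = sum(x)                                   # sum of x
sy = sum(y)                                   # sum of y
sxy = sum(xi * yi for xi, yi in zip(x, y))    # sum of x*y
sxx = sum(xi * xi for xi in x)                # sum of x^2

denom = n * sxx - sx ** 2
a = (sy * sxx - sx * sxy) / denom  # intercept
b = (n * sxy - sx * sy) / denom    # slope

print(a, b)  # prints: 1.0 2.0
```

This is exactly the arithmetic the script performs, just written with the built-in sum instead of ffsum and power.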
The code can be found here.