Python Program for Simple Univariate Regression
import pandas as pd
#
# Read in the original dataset, which is never modified.
# Create a subset of the data and write it out to another csv file for reloading
# during a new session.
#
# data = pd.read_csv('D:\\Education\\Coursera\\Regression_Modelling_Practice\\nesarc_pds.csv', nrows=10000000)
# print(data.shape)  # (43093, 3008): 43093 rows, 3008 variables
# print(data.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7, 8]])
# print(data.iloc[:, [9, 10, 11, 12, 13, 14, 15, 16, 17]])
#
# col_names = data.columns.tolist()
# print(col_names)
#
# Subset the dataframe to only some columns, e.g. df1 = df[['a', 'b']],
# or use the iloc method above after discovering the positions (starting from 0).
my_var_list = ["IDNUM", "WEIGHT", "BUILDTYP", "AGE", "SEX", "S1Q1F", "S1Q2C1", "S1Q2C3",
               "MARITAL", "S1Q6A", "S1Q7A1", "S1Q9A", "S1Q9B", "S1Q10B",
               "S1Q16", "S1Q212", "SMOKER", "S3AQ3B1", "CHECK322", "S3AQ3B2", "S2AQ5B",
               "S2AQ6B", "S2AQ7B"]
print(my_var_list)
# df2 = data[my_var_list]
# print(df2.shape)  # (43093, 23)
#
# Save the reduced dataset to save time later when restarting.
# index=False prevents prepending an index column; sep='\t' and
# encoding='utf-8' can be added if needed.
# df2.to_csv('D:\\Education\\Coursera\\Regression_Modelling_Practice\\vic_vars.csv', index=False)
#
df2 = pd.read_csv('D:\\Education\\Coursera\\Regression_Modelling_Practice\\vic_vars.csv')
print(df2.shape)    # (43093, 23)
print(df2.iloc[0])  # Check the first record
#
# For the first model, take as the response variable
#   S1Q10B TOTAL PERSONAL INCOME IN LAST 12 MONTHS.
# It too is categorical (0-17) but is ordinated and can
# act as a continuous variable (approximately).
#
# Use as the sole explanatory variable
#   S1Q6A HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
# and treat it as a continuous variable.
# It certainly is ordinated, from
#   1  = no formal schooling, to
#   14 = masters degree or higher.
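As an aside, the reduce-once-then-reload workflow described in the comments above can be made automatic. This is a minimal sketch, not the program's actual flow: it assumes the same file paths and my_var_list as in the listing, and os.path.exists is just one way to choose the branch.

import os
import pandas as pd

full_csv = 'D:\\Education\\Coursera\\Regression_Modelling_Practice\\nesarc_pds.csv'
subset_csv = 'D:\\Education\\Coursera\\Regression_Modelling_Practice\\vic_vars.csv'

if os.path.exists(subset_csv):
    df2 = pd.read_csv(subset_csv)        # fast path: reload the saved subset
else:
    data = pd.read_csv(full_csv)         # slow path: read all 3008 variables
    df2 = data[my_var_list]              # keep only the 23 variables of interest
    df2.to_csv(subset_csv, index=False)  # cache the subset for the next session
print(df2.shape)  # expect (43093, 23)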
#
# Step 1:
# Subtract 1 from S1Q6A HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
# so that it is 0-based.
df2['S1Q6A'] = df2['S1Q6A'] - 1
print(df2.iloc[0])  # Check the first record
#
# Step 2:
# Produce categorical summary tables as a check on the recoded variables.
# crosstabs are described in http://hamelg.blogspot.co.nz/2015/11/python-for-data-analysis-part-19_17.html
pd.crosstab(df2['S1Q6A'], columns="count")   # yes, runs from 0 to 13
pd.crosstab(df2['S1Q10B'], columns="count")  # yes, runs from 0 to 17
#
# Step 3:
# Fit a simple OLS linear model for income as a function of education grade.
# import statsmodels.formula.api as sm
# http://www.statsmodels.org/dev/examples/notebooks/generated/regression_plots.html
import statsmodels.api as sm
lin_mod = sm.formula.ols(formula="S1Q10B ~ S1Q6A", data=df2).fit()
print(lin_mod.params)
print(lin_mod.summary())
#
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 8))
fig = sm.graphics.plot_regress_exog(lin_mod, "S1Q6A", fig=fig)
fig  # Wait for a minute or so before the display comes up
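As a follow-on, the figures quoted in the write-up below can be pulled off the fitted model directly rather than read from the printed summary. A minimal sketch, using standard statsmodels RegressionResults attributes:

# Pull individual results off lin_mod instead of parsing the printed summary.
print(lin_mod.params)      # Intercept and S1Q6A slope
print(lin_mod.pvalues)     # per-coefficient p-values (P>|t|)
print(lin_mod.conf_int())  # 95% confidence intervals
print(lin_mod.rsquared)    # 0.189 for this fit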
Simple Univariate Regression
Quick Summary: NESARC Wave 1 data was used to study the effect of education level on annual income. The result was in the expected direction, i.e. higher levels of education indicated higher income (regression slope of 0.759889). However, the model fit was not strong (r-squared of 0.189), so factors other than education (measured or not measured) must also determine income. The fitted parameters (intercept and education) had p-values of 0 (to 3 decimal places), indicating highly significant departures from the null hypothesis of zero values.
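To make the fitted line concrete, the model predicts income category as intercept + slope * education grade. A small worked example using the reported coefficients (the grade values chosen are illustrative):

intercept, slope = 0.335664, 0.759889
for grade in (0, 7, 13):  # lowest, middle, and highest recoded education grades
    print(grade, round(intercept + slope * grade, 2))
# 0 -> 0.34, 7 -> 5.65, 13 -> 10.21 on the 0-17 income category scale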
Details: A univariate ordinary least squares (OLS) regression was conducted using S1Q6A (HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED) as the independent (explanatory) variable for predicting S1Q10B (TOTAL PERSONAL INCOME IN LAST 12 MONTHS). Both variables are categorical (14 and 18 categories respectively), but as stated in the exercise instructions they are multiclass (and in fact ordinated), so they can be treated as (roughly) continuous. For purposes of simple discussion, the S1Q6A variable will sometimes be referred to as "education" and the S1Q10B variable will sometimes be referred to as "income". Any outputs from computer programs will show the original names. In future work I will probably rename the variables.

As requested, the education variable was recoded so as to begin at 0 rather than 1. This was checked by creating the following table:

# pd.crosstab(df2['S1Q6A'], columns="count")
col_0  count
S1Q6A
0        218
1        137
2        421
3        931
4        414
5       1210
6       4518
7      10935
8       1612
9       8891
10      3772
11      5251
12      1526
13      3257
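For completeness, the same check can be done without a crosstab. A minimal sketch, assuming the df2 dataframe from the program post; value_counts and sort_index are standard pandas:

print(df2['S1Q6A'].value_counts().sort_index())  # counts per category, should run 0..13
print(df2['S1Q6A'].min(), df2['S1Q6A'].max())    # expect 0 and 13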
It shows the existence of the required 0 category (0 is the lowest ranked qualification, and 13 the highest, Masters+). The regression was conducted using statsmodels as follows:

import statsmodels.api as sm
lin_mod = sm.formula.ols(formula="S1Q10B ~ S1Q6A", data=df2).fit()
print(lin_mod.params)
print(lin_mod.summary())

where df2 is a NESARC subset (I kept only 23 of the original 3008 variables, but all of the 43093 Wave 1 records). The results were:

# print(lin_mod.params)
Intercept    0.335664
S1Q6A        0.759889
dtype: float64
print(lin_mod.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                 S1Q10B   R-squared:                       0.189
Model:                            OLS   Adj. R-squared:                  0.189
Method:                 Least Squares   F-statistic:                 1.004e+04
Date:                Tue, 21 Nov 2017   Prob (F-statistic):               0.00
Time:                        17:52:38   Log-Likelihood:            -1.2054e+05
No. Observations:               43093   AIC:                         2.411e+05
Df Residuals:                   43091   BIC:                         2.411e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.3357      0.067      5.020      0.000         0.205     0.467
S1Q6A          0.7599      0.008    100.220      0.000         0.745     0.775
==============================================================================
Omnibus:                      339.404   Durbin-Watson:                   1.973
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              232.025
Skew:                           0.041   Prob(JB):                     4.13e-51
Kurtosis:                       2.650   Cond. No.                         31.2
==============================================================================

As can be seen, the r-squared value is low (0.189). However, the model is simple, and in real-life social data modelling low r-squared values are the norm. The intercept and education coefficient estimates are both significant: their p-values (P>|t|) are both less than 0.05 and are, in fact, effectively 0. This is to be expected given the huge sample size (43,093 individuals). It is gratifying that higher education implies higher income, as this is expected from common sense. The relationship is not very strong, however, and this may partly be a result of "the spirit of free enterprise": some people do not need a high education to succeed, and some people with high levels of education might find that intellectually satisfying in itself and feel no need to pursue income.
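As a cross-check on the reported figure, the r-squared can be reproduced directly from the residuals. A minimal sketch, assuming lin_mod and df2 from the program post:

import numpy as np
# R-squared = 1 - SS_residual / SS_total: the share of income variance
# explained by education grade.
y = df2['S1Q10B']
ss_res = np.sum(lin_mod.resid ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)  # should reproduce ~0.189
print(lin_mod.rsquared)     # the value statsmodels reports directly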