hellodatascientist-blog
hellodatascientist-blog · 6 years ago
Coursera Data Analysis Tools HW4
Program:
import pandas
import numpy
import re
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))  # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

codebook = '''
# S2AQ8A HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8E HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ9 HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)
# S2AQ10 HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS
# S2AQ11 HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED
# S2AQ4B HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ5B HOW OFTEN DRANK BEER IN LAST 12 MONTHS
# S2AQ6B HOW OFTEN DRANK WINE IN LAST 12 MONTHS
# S2AQ7B HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
# S2AQ14 NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
# S2AQ15R1 NUMBER OF MONTHS SINCE LAST DRINK (ROUNDED TO NEAREST MONTH)
# S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
# S2AQ19 AGE AT START OF PERIOD OF HEAVIEST DRINKING
# S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
# S2AQ21A HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ22 HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ23 MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
# S2DQ1 BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ2 BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ3C2 ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ4C2 ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ5C2 ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ6C2 ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
'''

# extract code and description
m = re.findall(r'^# (\w+) (.*)$', codebook, re.MULTILINE)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
for code in m:
    data[code[0]].replace(r'\s+', numpy.nan, regex=True, inplace=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])

data['S2AQ8A'] = data['S2AQ8A'].replace(99, numpy.nan)
data['S2DQ1'] = data['S2DQ1'].replace(9, numpy.nan)
data['S2AQ16A'] = data['S2AQ16A'].replace(99, numpy.nan)
def DRINKFQ(row):
    if row['S2AQ8A'] < 4:
        return 'OFTEN'
    elif row['S2AQ8A'] < 7:
        return 'SOMETIMES'
    elif row['S2AQ8A'] <= 10:
        return 'SELDOM'
    else:
        return row['S2AQ8A']

data['S2DQ1'] = pandas.Categorical(data.S2DQ1)
data['S2DQ1'] = data['S2DQ1'].cat.rename_categories(['Yes', 'No'])

data['DRINKFQ'] = data.apply(lambda row: DRINKFQ(row), axis=1)

sub1 = data[['DRINKFQ', 'S2DQ1']].dropna()
# contingency table of observed counts
ct1 = pandas.crosstab(sub1['DRINKFQ'], sub1['S2DQ1'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

sub1["DRINKFQ_CAT"] = sub1["DRINKFQ"].astype('category')
sub1['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub1, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status')
# post hoc tests
print('post hoc tests')

recode2 = {'OFTEN': 'OFTEN', 'SOMETIMES': 'SOMETIMES'}
sub1['COMP1v2'] = sub1['DRINKFQ'].map(recode2)

# contingency table of observed counts
ct2 = pandas.crosstab(sub1['COMP1v2'], sub1['S2DQ1'])
print(ct2)

print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

recode3 = {'OFTEN': 'OFTEN', 'SELDOM': 'SELDOM'}
sub1['COMP1v3'] = sub1['DRINKFQ'].map(recode3)

# contingency table of observed counts
ct3 = pandas.crosstab(sub1['COMP1v3'], sub1['S2DQ1'])
print(ct3)

print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)

recode4 = {'SOMETIMES': 'SOMETIMES', 'SELDOM': 'SELDOM'}
sub1['COMP2v3'] = sub1['DRINKFQ'].map(recode4)

# contingency table of observed counts
ct4 = pandas.crosstab(sub1['COMP2v3'], sub1['S2DQ1'])
print(ct4)

print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)
def AGESTDRINK(row):
    if row['S2AQ16A'] < 12:
        return 'CHILD'
    elif row['S2AQ16A'] < 21:
        return 'TEEN'
    elif row['S2AQ16A'] < 30:
        return 'YOUNG'
    else:
        return 'MIDDLE'

data['AGESTDRINK'] = data.apply(lambda row: AGESTDRINK(row), axis=1)

data['S2AQ16A'].value_counts()

sub2 = data[data['AGESTDRINK'] == 'CHILD']
sub3 = data[data['AGESTDRINK'] == 'TEEN']
sub4 = data[data['AGESTDRINK'] == 'YOUNG']
sub5 = data[data['AGESTDRINK'] == 'MIDDLE']
print('association between drink frequency and father’s drinking status for age start drink child')
# contingency table of observed counts
ct2 = pandas.crosstab(sub2['DRINKFQ'], sub2['S2DQ1'])
print(ct2)

# chi-square
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

sub2["DRINKFQ_CAT"] = sub2["DRINKFQ"].astype('category')
sub2['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub2, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status for age start drink child')

print('association between drink frequency and father’s drinking status for age start drink teen')
# contingency table of observed counts
ct3 = pandas.crosstab(sub3['DRINKFQ'], sub3['S2DQ1'])
print(ct3)

# chi-square
print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)

sub3["DRINKFQ_CAT"] = sub3["DRINKFQ"].astype('category')
sub3['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub3, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status for age start drink teen')

print('association between drink frequency and father’s drinking status for age start drink young')
# contingency table of observed counts
ct4 = pandas.crosstab(sub4['DRINKFQ'], sub4['S2DQ1'])
print(ct4)

# chi-square
print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)

sub4["DRINKFQ_CAT"] = sub4["DRINKFQ"].astype('category')
sub4['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub4, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status for age start drink young')

print('association between drink frequency and father’s drinking status for age start drink middle')
# contingency table of observed counts
ct5 = pandas.crosstab(sub5['DRINKFQ'], sub5['S2DQ1'])
print(ct5)

# chi-square
print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct5)
print(cs5)

sub5["DRINKFQ_CAT"] = sub5["DRINKFQ"].astype('category')
sub5['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub5, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status for age start drink middle')
Outputs:
43093
3010
S2DQ1       Yes    No
DRINKFQ
OFTEN      1236  4130
SELDOM     2299  8447
SOMETIMES  1988  7282
S2DQ1           Yes        No
DRINKFQ
OFTEN      0.223791  0.207966
SELDOM     0.416259  0.425349
SOMETIMES  0.359949  0.366685
chi-square value, p value, expected counts
(6.499059891211013, 0.03879243810278014, 2, array([[1167.61555433, 4198.38444567],
       [2338.27744071, 8407.72255929],
       [2017.10700496, 7252.89299504]]))
C:\Users\asus\Anaconda3\lib\site-packages\seaborn\categorical.py:3666: UserWarning: The `factorplot` function has been renamed to `catplot`. The original name will be removed in a future release. Please update your code. Note that the default `kind` in `factorplot` (`'point'`) has changed `'strip'` in `catplot`.
  warnings.warn(msg)
S2DQ1       Yes    No
COMP1v2
OFTEN      1236  4130
SOMETIMES  1988  7282
post hoc tests
chi-square value, p value, expected counts
(4.900384738017596, 0.0268507126955822, 1, array([[1182.01585133, 4183.98414867],
       [2041.98414867, 7228.01585133]]))
S2DQ1     Yes    No
COMP1v3
OFTEN    1236  4130
SELDOM   2299  8447
chi-square value, p value, expected counts
(5.52445154721267, 0.018752480772151237, 1, array([[1177.30945879, 4188.69054121],
       [2357.69054121, 8388.30945879]]))
S2DQ1       Yes    No
COMP2v3
SELDOM     2299  8447
SOMETIMES  1988  7282
chi-square value, p value, expected counts
(0.005085182174825198, 0.9431506679292518, 1, array([[2301.56384892, 8444.43615108],
       [1985.43615108, 7284.56384892]]))
association between drink frequency and father’s drinking status for age start drink child
S2DQ1      Yes   No
DRINKFQ
OFTEN       52   82
SELDOM      35  110
SOMETIMES   37   81
chi-square value, p value, expected counts
(6.977204355874777, 0.030543536775082713, 2, array([[41.85390428, 92.14609572],
       [45.28967254, 99.71032746],
       [36.85642317, 81.14357683]]))
E:/python/coursera/class2.py:164: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
association between drink frequency and father’s drinking status for age start drink teen
S2DQ1       Yes    No
DRINKFQ
OFTEN       933  2897
SELDOM     1495  4823
SOMETIMES  1474  4996
chi-square value, p value, expected counts
(3.5231846022150157, 0.17177113458336468, 2, array([[ 899.30557227, 2930.69442773],
       [1483.5019858 , 4834.4980142 ],
       [1519.19244193, 4950.80755807]]))
association between drink frequency and father’s drinking status for age start drink young
S2DQ1      Yes    No
DRINKFQ
OFTEN      206   868
SELDOM     630  2673
SOMETIMES  404  1795
chi-square value, p value, expected counts
(0.512844330564573, 0.7738152219049121, 2, array([[ 202.51824818,  871.48175182],
       [ 622.82846715, 2680.17153285],
       [ 414.65328467, 1784.34671533]]))
association between drink frequency and father’s drinking status for age start drink middle
S2DQ1      Yes   No
DRINKFQ
OFTEN       45  283
SELDOM     139  841
SOMETIMES   73  410
chi-square value, p value, expected counts
(0.3574536470777455, 0.8363343350479611, 2, array([[ 47.06644333, 280.93355667],
       [140.62534897, 839.37465103],
       [ 69.30820771, 413.69179229]]))
Description:
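Age at first drink appears to moderate the association between drinking frequency and father’s drinking status. Only in the CHILD group (started drinking before age 12) is the p-value below 0.05 (p ≈ 0.031), so the null hypothesis is rejected for that group alone. In the TEEN, YOUNG, and MIDDLE groups (p ≈ 0.17, 0.77, and 0.84), there is no significant association between drinking frequency and father’s drinking status.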
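The per-group tests can be re-checked without pandas or scipy. The following is a minimal sketch, assuming the counts printed in the output above and using a hypothetical helper name (`chi2_3x2`); it relies on the fact that, for 2 degrees of freedom, the chi-square p-value is exp(-x/2).

```python
import math

# Observed counts per age-at-first-drink group, copied from the output above.
# Rows: OFTEN, SELDOM, SOMETIMES; columns: father problem drinker (Yes, No).
tables = {
    "CHILD": [(52, 82), (35, 110), (37, 81)],
    "TEEN": [(933, 2897), (1495, 4823), (1474, 4996)],
    "YOUNG": [(206, 868), (630, 2673), (404, 1795)],
    "MIDDLE": [(45, 283), (139, 841), (73, 410)],
}

def chi2_3x2(table):
    """Pearson chi-square statistic and p value for a 3x2 table (hypothetical helper)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    return stat, math.exp(-stat / 2)

p_values = {}
for group, table in tables.items():
    stat, p = chi2_3x2(table)
    p_values[group] = p
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{group}: chi2={stat:.3f}, p={p:.4f} -> {verdict}")
```

A difference in significance between strata is what suggests moderation here; the group sizes differ a lot, so the comparison is only indicative.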
hellodatascientist-blog · 6 years ago
Coursera Data Analysis Tools HW3
Program:
import pandas
import numpy
import re
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))  # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

codebook = '''
# S2AQ8A HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8E HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ9 HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)
# S2AQ10 HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS
# S2AQ11 HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED
# S2AQ4B HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ5B HOW OFTEN DRANK BEER IN LAST 12 MONTHS
# S2AQ6B HOW OFTEN DRANK WINE IN LAST 12 MONTHS
# S2AQ7B HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
# S2AQ14 NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
# S2AQ15R1 NUMBER OF MONTHS SINCE LAST DRINK (ROUNDED TO NEAREST MONTH)
# S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
# S2AQ19 AGE AT START OF PERIOD OF HEAVIEST DRINKING
# S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
# S2AQ21A HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ22 HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ23 MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
# S2DQ1 BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ2 BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ3C2 ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ4C2 ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ5C2 ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ6C2 ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
'''

# extract code and description
m = re.findall(r'^# (\w+) (.*)$', codebook, re.MULTILINE)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
for code in m:
    data[code[0]].replace(r'\s+', numpy.nan, regex=True, inplace=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])
# 99 = unknown for both drink-count variables
data['S2AQ8B'] = data['S2AQ8B'].replace(99, numpy.nan)
data['S2AQ8C'] = data['S2AQ8C'].replace(99, numpy.nan)
sub1 = data[['S2AQ8B', 'S2AQ8C']].dropna()

# scatterplot with a fitted regression line
scat1 = seaborn.regplot(x="S2AQ8B", y="S2AQ8C", fit_reg=True, data=sub1)
plt.xlabel('number of drinks')
plt.ylabel('largest number of drinks')
plt.title('Scatterplot for the Association Between number of drinks and largest number of drinks')

print('association between number of drinks and largest number of drinks')
print(scipy.stats.pearsonr(sub1['S2AQ8B'], sub1['S2AQ8C']))
Outputs:
43093
3010
C:\Users\asus\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
association between number of drinks and largest number of drinks
(0.45374669130106854, 0.0)
Description:
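The usual number of drinks and the largest number of drinks are positively and linearly correlated (r ≈ 0.45). The p-value prints as 0.0, meaning it is too small to be represented, so the correlation is statistically significant.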
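To make the coefficient more concrete, here is a minimal sketch of how Pearson's r is computed and what the reported value implies for shared variance. The function name `pearson_r` and the toy lists are hypothetical illustrations, not NESARC data; only `reported_r` comes from the output above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, only to show the function being called.
usual = [1, 2, 2, 3, 4]
largest = [2, 3, 5, 4, 8]
toy_r = pearson_r(usual, largest)
print(f"toy r = {toy_r:.3f}")

# r reported by scipy.stats.pearsonr in the output above.
reported_r = 0.45374669130106854
r_squared = reported_r ** 2
print(f"r^2 = {r_squared:.3f}")  # share of variance in one variable explained by the other
```

Squaring the reported r gives roughly 0.21, so about a fifth of the variability in the largest number of drinks is accounted for by the usual number of drinks.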
hellodatascientist-blog · 7 years ago
Coursera Data Analysis Tools HW2
Program:
import pandas
import numpy
import re
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))  # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

codebook = '''
# S2AQ8A HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8E HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ9 HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)
# S2AQ10 HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS
# S2AQ11 HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED
# S2AQ4B HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ5B HOW OFTEN DRANK BEER IN LAST 12 MONTHS
# S2AQ6B HOW OFTEN DRANK WINE IN LAST 12 MONTHS
# S2AQ7B HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
# S2AQ14 NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
# S2AQ15R1 NUMBER OF MONTHS SINCE LAST DRINK (ROUNDED TO NEAREST MONTH)
# S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
# S2AQ19 AGE AT START OF PERIOD OF HEAVIEST DRINKING
# S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
# S2AQ21A HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ22 HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ23 MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
# S2DQ1 BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ2 BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ3C2 ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ4C2 ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ5C2 ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ6C2 ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
'''

# extract code and description
m = re.findall(r'^# (\w+) (.*)$', codebook, re.MULTILINE)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
for code in m:
    data[code[0]].replace(r'\s+', numpy.nan, regex=True, inplace=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])

data['S2AQ8A'] = data['S2AQ8A'].replace(99, numpy.nan)
data['S2DQ1'] = data['S2DQ1'].replace(9, numpy.nan)
def DRINKFQ(row):
    if row['S2AQ8A'] < 4:
        return 'OFTEN'
    elif row['S2AQ8A'] < 7:
        return 'SOMETIMES'
    elif row['S2AQ8A'] <= 10:
        return 'SELDOM'
    else:
        return row['S2AQ8A']

data['S2DQ1'] = pandas.Categorical(data.S2DQ1)
data['S2DQ1'] = data['S2DQ1'].cat.rename_categories(['Yes', 'No'])

data['DRINKFQ'] = data.apply(lambda row: DRINKFQ(row), axis=1)

sub1 = data[['DRINKFQ', 'S2DQ1']].dropna()
# contingency table of observed counts
ct1 = pandas.crosstab(sub1['DRINKFQ'], sub1['S2DQ1'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

sub1["DRINKFQ_CAT"] = sub1["DRINKFQ"].astype('category')
sub1['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub1, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status')
recode2 = {'OFTEN': 'OFTEN', 'SOMETIMES': 'SOMETIMES'}
sub1['COMP1v2'] = sub1['DRINKFQ'].map(recode2)

# contingency table of observed counts
ct2 = pandas.crosstab(sub1['COMP1v2'], sub1['S2DQ1'])
print(ct2)

# column percentages
colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

recode3 = {'OFTEN': 'OFTEN', 'SELDOM': 'SELDOM'}
sub1['COMP1v3'] = sub1['DRINKFQ'].map(recode3)

# contingency table of observed counts
ct3 = pandas.crosstab(sub1['COMP1v3'], sub1['S2DQ1'])
print(ct3)

# column percentages
colsum = ct3.sum(axis=0)
colpct = ct3 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)

recode4 = {'SOMETIMES': 'SOMETIMES', 'SELDOM': 'SELDOM'}
sub1['COMP2v3'] = sub1['DRINKFQ'].map(recode4)

# contingency table of observed counts
ct4 = pandas.crosstab(sub1['COMP2v3'], sub1['S2DQ1'])
print(ct4)

# column percentages
colsum = ct4.sum(axis=0)
colpct = ct4 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)
Output:
43093
3010
S2DQ1       Yes    No
DRINKFQ
OFTEN      1236  4130
SELDOM     2299  8447
SOMETIMES  1988  7282
S2DQ1           Yes        No
DRINKFQ
OFTEN      0.223791  0.207966
SELDOM     0.416259  0.425349
SOMETIMES  0.359949  0.366685
chi-square value, p value, expected counts
(6.499059891211013, 0.03879243810278014, 2, array([[1167.61555433, 4198.38444567],
       [2338.27744071, 8407.72255929],
       [2017.10700496, 7252.89299504]]))
C:\Users\asus\Anaconda3\lib\site-packages\seaborn\categorical.py:3666: UserWarning: The `factorplot` function has been renamed to `catplot`. The original name will be removed in a future release. Please update your code. Note that the default `kind` in `factorplot` (`'point'`) has changed `'strip'` in `catplot`.
  warnings.warn(msg)
S2DQ1       Yes    No
COMP1v2
OFTEN      1236  4130
SOMETIMES  1988  7282
S2DQ1           Yes      No
COMP1v2
OFTEN      0.383375  0.3619
SOMETIMES  0.616625  0.6381
chi-square value, p value, expected counts
(4.900384738017596, 0.0268507126955822, 1, array([[1182.01585133, 4183.98414867],
       [2041.98414867, 7228.01585133]]))
S2DQ1     Yes    No
COMP1v3
OFTEN    1236  4130
SELDOM   2299  8447
S2DQ1         Yes        No
COMP1v3
OFTEN    0.349646  0.328377
SELDOM   0.650354  0.671623
chi-square value, p value, expected counts
(5.52445154721267, 0.018752480772151237, 1, array([[1177.30945879, 4188.69054121],
       [2357.69054121, 8388.30945879]]))
S2DQ1       Yes    No
COMP2v3
SELDOM     2299  8447
SOMETIMES  1988  7282
S2DQ1           Yes        No
COMP2v3
SELDOM     0.536272  0.537034
SOMETIMES  0.463728  0.462966
chi-square value, p value, expected counts
(0.005085182174825198, 0.9431506679292518, 1, array([[2301.56384892, 8444.43615108],
       [1985.43615108, 7284.56384892]]))
Description:
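Because the overall p-value (≈ 0.039) is less than 0.05, we can reject the null hypothesis and say that there is an association between drinking frequency and father’s drinking status.
However, in the post hoc tests the p-values of all three pairwise comparisons (≈ 0.027, 0.019, and 0.943) are larger than the Bonferroni-adjusted threshold of 0.05/3 ≈ 0.017, so we fail to reject the null hypothesis for every pair.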
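The Bonferroni comparison can be reproduced from the printed counts alone. This is a minimal sketch in plain Python, assuming the 2x2 tests use Yates' continuity correction (the default of `scipy.stats.chi2_contingency` for 1 degree of freedom); the helper name `chi2_2x2_yates` is hypothetical.

```python
import math
from itertools import combinations

# Observed counts (father problem drinker: Yes, No), copied from the output above.
observed = {
    "OFTEN": (1236, 4130),
    "SELDOM": (2299, 8447),
    "SOMETIMES": (1988, 7282),
}

def chi2_2x2_yates(row_a, row_b):
    """Chi-square statistic and p value for a 2x2 table with Yates' correction."""
    table = [row_a, row_b]
    row_tot = [sum(row) for row in table]
    col_tot = [row_a[j] + row_b[j] for j in range(2)]
    n = sum(row_tot)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / n
            # Shrink each |observed - expected| by 0.5, but never below zero.
            diff = max(abs(table[i][j] - expected) - 0.5, 0.0)
            stat += diff ** 2 / expected
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

alpha = 0.05
pairs = list(combinations(observed, 2))
adjusted_alpha = alpha / len(pairs)  # Bonferroni: divide by the number of comparisons

results = {}
for a, b in pairs:
    stat, p = chi2_2x2_yates(observed[a], observed[b])
    results[(a, b)] = (stat, p)
    verdict = "reject" if p < adjusted_alpha else "fail to reject"
    print(f"{a} vs {b}: chi2={stat:.3f}, p={p:.4f} -> {verdict} H0")
```

Dividing alpha by the number of comparisons keeps the family-wise error rate at or below 0.05, which is why two p-values that are under 0.05 still do not count as significant here.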
hellodatascientist-blog · 7 years ago
Coursera Data Analysis Tools HW1
Program:
import pandas
import numpy
import re
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))  # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

codebook = '''
# S2AQ8A HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8E HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ9 HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)
# S2AQ10 HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS
# S2AQ11 HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED
# S2AQ4B HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ5B HOW OFTEN DRANK BEER IN LAST 12 MONTHS
# S2AQ6B HOW OFTEN DRANK WINE IN LAST 12 MONTHS
# S2AQ7B HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
# S2AQ14 NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
# S2AQ15R1 NUMBER OF MONTHS SINCE LAST DRINK (ROUNDED TO NEAREST MONTH)
# S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
# S2AQ19 AGE AT START OF PERIOD OF HEAVIEST DRINKING
# S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
# S2AQ21A HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ22 HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ23 MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
# S2DQ1 BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ2 BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ3C2 ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ4C2 ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ5C2 ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ6C2 ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
'''

# extract code and description
m = re.findall(r'^# (\w+) (.*)$', codebook, re.MULTILINE)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
for code in m:
    data[code[0]].replace(r'\s+', numpy.nan, regex=True, inplace=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])

data['S2AQ8C'] = data['S2AQ8C'].replace(99, numpy.nan)
data['S2DQ1'] = data['S2DQ1'].replace(9, numpy.nan)

data['S2DQ1'] = pandas.Categorical(data.S2DQ1)
data['S2DQ1'] = data['S2DQ1'].cat.rename_categories(['Yes', 'No'])
# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='S2AQ8C ~ C(S2DQ1)', data=data)
results1 = model1.fit()
print(results1.summary())

sub1 = data[['S2AQ8C', 'S2DQ1']].dropna()

print('means for {0} by {1}'.format(code_dict['S2AQ8C'], code_dict['S2DQ1']))
m1 = sub1.groupby('S2DQ1').mean()
print(m1)

print('standard deviations for {0} by {1}'.format(code_dict['S2AQ8C'], code_dict['S2DQ1']))
sd1 = sub1.groupby('S2DQ1').std()
print(sd1)
Output:
43093
3010
                            OLS Regression Results
==============================================================================
Dep. Variable:                 S2AQ8C   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     223.9
Date:                Sat, 08 Dec 2018   Prob (F-statistic):           2.06e-50
Time:                        19:09:02   Log-Likelihood:                -74320.
No. Observations:               25265   AIC:                         1.486e+05
Df Residuals:                   25263   BIC:                         1.487e+05
Df Model:                           1
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          5.0211      0.062     81.110      0.000       4.900       5.142
C(S2DQ1)[T.No]    -1.0470      0.070    -14.965      0.000      -1.184      -0.910
==============================================================================
Omnibus:                    24732.885   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2443146.664
Skew:                           4.566   Prob(JB):                         0.00
Kurtosis:                      50.302   Cond. No.                         4.08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS by BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
         S2AQ8C
S2DQ1
Yes    5.021149
No     3.974166
standard deviations for LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS by BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
         S2AQ8C
S2DQ1
Yes    5.505642
No     4.294600
Description:
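The p-value for the F-statistic (2.06e-50) is far below 0.05, so we can reject the null hypothesis and conclude that the largest number of drinks consumed in the last 12 months is associated with the father’s drinking status. Respondents whose father was an alcoholic or problem drinker report a higher mean (5.02 drinks) than those whose father was not (3.97 drinks).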
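As a quick sanity check on the regression table: with a single two-level explanatory variable, the intercept should equal the mean of the reference group ('Yes'), the coefficient should equal the difference between the two group means, and the F-statistic should equal the squared t-statistic. The sketch below only reuses numbers printed in the output above; the variable names are illustrative, and the tolerances are an assumption chosen to allow for rounding in the printed summary.

```python
# Values copied from the OLS summary and the group means printed above.
mean_yes = 5.021149    # mean of S2AQ8C where S2DQ1 == 'Yes' (reference level)
mean_no = 3.974166     # mean of S2AQ8C where S2DQ1 == 'No'
intercept = 5.0211     # reported Intercept
coef_no = -1.0470      # reported coefficient for C(S2DQ1)[T.No]
t_no = -14.965         # reported t-statistic for C(S2DQ1)[T.No]
f_stat = 223.9         # reported F-statistic

# Difference between the group means, relative to the reference level.
diff = mean_no - mean_yes
# With one degree of freedom in the model, F equals t squared.
t_squared = t_no ** 2

# Tolerances are loose because the summary prints rounded values.
intercept_matches = abs(intercept - mean_yes) < 5e-4
coef_matches = abs(coef_no - diff) < 5e-4
f_matches = abs(t_squared - f_stat) < 0.1

print(intercept_matches, coef_matches, f_matches)
```

If all three checks hold, the table and the group means tell the same story, which is a cheap way to catch a mislabeled category or a subset mismatch.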
hellodatascientist-blog · 7 years ago
Coursera Data Management and Visualization HW4
Program:
import pandas
import numpy import re import seaborn import matplotlib.pyplot as plt # any additional libraries would be imported here
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)
codebook = '''
# S2AQ8A HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8E HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ9 HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)
# S2AQ10 HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS
# S2AQ11 HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED
# S2AQ4B HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ5B HOW OFTEN DRANK BEER IN LAST 12 MONTHS
# S2AQ6B HOW OFTEN DRANK WINE IN LAST 12 MONTHS
# S2AQ7B HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
# S2AQ14 NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
# S2AQ15R1 NUMBER OF MONTHS SINCE LAST DRINK (ROUNDED TO NEAREST MONTH)
# S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
# S2AQ19 AGE AT START OF PERIOD OF HEAVIEST DRINKING
# S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
# S2AQ21A HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ22 HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ23 MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
# S2DQ1 BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ2 BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ3C2 ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ4C2 ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ5C2 ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ6C2 ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
'''

# extract code and description
m = re.findall(r'^# (\w+) (.*)$', codebook, re.MULTILINE)
# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]
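# setting variables you will be working with to numeric
# (blank strings become NaN; the result is assigned back rather than
# relying on in-place modification of a column selection)
for code in m:
    data[code[0]] = data[code[0]].replace(r'\s+', numpy.nan, regex=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])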
# counts and percentages (i.e. frequency distributions) for each variable
for code in m:
    c = data[code[0]].value_counts(sort=False)
    p = data[code[0]].value_counts(sort=False, normalize=True)
coding_101 = ['S2AQ8A', 'S2AQ8B', 'S2AQ8C', 'S2AQ8D', 'S2AQ8E', 'S2AQ9', 'S2AQ10', 'S2AQ11',
              'S2AQ4B', 'S2AQ5B', 'S2AQ6B', 'S2AQ7B', 'S2AQ14', 'S2AQ16A']
coding_1001 = ['S2AQ15R1']
coding_drop = ['S2AQ19', 'S2AQ20']
coding_drop9 = ['S2DQ1', 'S2DQ2', 'S2DQ3C2', 'S2DQ4C2', 'S2DQ5C2', 'S2DQ6C2']
# coding in valid data
# recode missing values to numeric value, in this example replace NaN with 101
for code in coding_101:
    data[code] = data[code].fillna(101)
    # recode 99 values as missing
    data[code] = data[code].replace(99, numpy.nan)
    c = data[code].value_counts(sort=False, dropna=False)
# recode 9 (unknown) as missing for the family-history variables
for code in coding_drop9:
    data[code] = data[code].replace(9, numpy.nan)
    c = data[code].value_counts(sort=False, dropna=False)
# univariate count plot for each variable
for code in m:
    data[code[0]] = data[code[0]].astype('category')
    seaborn.countplot(x=code[0], data=data)
    plt.title(code[1])
    plt.show()
# creating 2 level drinking-group variable:
# 0 if the S2AQ21D code is above 5, otherwise 1
def DRINKGRP(row):
    if row['S2AQ21D'] > 5:
        return 0
    else:
        return 1
data['DRINKGRP'] = data.apply(lambda row: DRINKGRP(row), axis=1)
data['S2DQ1'] = data['S2DQ1'].astype('category')
data['S2DQ1'] = data['S2DQ1'].cat.rename_categories(['Yes', 'No'])
# factorplot was renamed to catplot in seaborn 0.9
seaborn.catplot(x="S2DQ1", y="DRINKGRP", data=data, kind="bar", ci=None)
plt.xlabel('BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER')
plt.ylabel('Frequent drinker')
univariate graph:
Tumblr media

Tumblr media Tumblr media Tumblr media
bivariate graph:
Tumblr media
Description:
X: whether the blood/natural father was ever an alcoholic or problem drinker
Y: proportion of frequent drinkers during the period of heaviest drinking
Whether the father was a problem drinker does not have much effect on how frequently respondents drank. The group with a problem-drinker father even shows a lower proportion of frequent drinkers.
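The height of each bar in the bivariate graph is the mean of the 0/1 DRINKGRP variable within a group, which is simply the proportion of frequent drinkers in that group. A minimal sketch of that calculation in plain Python follows. The records are made-up toy data, not NESARC values, and the helper name `drink_group` is hypothetical; only the cutoff of 5 comes from the program above.

```python
# Hypothetical toy records: (father status label, S2AQ21D frequency code).
rows = [
    ("Yes", 3), ("Yes", 7), ("Yes", 10), ("Yes", 5),
    ("No", 2), ("No", 4), ("No", 9), ("No", 1),
]


def drink_group(freq_code):
    """Return 0 when the frequency code is above 5, otherwise 1."""
    return 0 if freq_code > 5 else 1


# Accumulate (number of rows, number of frequent drinkers) per group.
totals = {}
for father, code in rows:
    n, frequent = totals.get(father, (0, 0))
    totals[father] = (n + 1, frequent + drink_group(code))

# The mean of a 0/1 variable is the share of ones in the group.
proportions = {group: frequent / n for group, (n, frequent) in totals.items()}
print(proportions)
```

Note that a comparison such as `NaN > 5` is false, so with this rule a missing frequency code would fall into group 1; filtering out missing codes first may be worth considering.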
hellodatascientist-blog · 7 years ago
Coursera Data Management and Visualization HW3
program:
import pandas
import numpy
import re
# any additional libraries would be imported here
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)
codebook = '''
# S2AQ8A HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8E HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ9 HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)
# S2AQ10 HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS
# S2AQ11 HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED
# S2AQ4B HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ5B HOW OFTEN DRANK BEER IN LAST 12 MONTHS
# S2AQ6B HOW OFTEN DRANK WINE IN LAST 12 MONTHS
# S2AQ7B HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
# S2AQ14 NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
# S2AQ15R1 NUMBER OF MONTHS SINCE LAST DRINK (ROUNDED TO NEAREST MONTH)
# S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
# S2AQ19 AGE AT START OF PERIOD OF HEAVIEST DRINKING
# S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
# S2AQ21A HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ22 HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ23 MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
# S2DQ1 BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ2 BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ3C2 ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ4C2 ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ5C2 ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ6C2 ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
'''

# extract code and description
m = re.findall(r'^# (\w+) (.*)$', codebook, re.MULTILINE)
# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]
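# setting variables you will be working with to numeric
# (convert_objects has been removed from pandas; to_numeric with
# errors='coerce' likewise turns unparseable values into NaN)
for code in m:
    data[code[0]] = pandas.to_numeric(data[code[0]], errors='coerce')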
# counts and percentages (i.e. frequency distributions) for each variable
for code in m:
    c = data[code[0]].value_counts(sort=False)
    p = data[code[0]].value_counts(sort=False, normalize=True)
coding_101 = ['S2AQ8A', 'S2AQ8B', 'S2AQ8C', 'S2AQ8D', 'S2AQ8E', 'S2AQ9', 'S2AQ10', 'S2AQ11',
              'S2AQ4B', 'S2AQ5B', 'S2AQ6B', 'S2AQ7B', 'S2AQ14', 'S2AQ16A']
coding_1001 = ['S2AQ15R1']
coding_drop = ['S2AQ19', 'S2AQ20']

# coding in valid data
# recode missing values to numeric value, in this example replace NaN with 101
for code in coding_101:
    data[code] = data[code].fillna(101)
    # recode 99 values as missing
    data[code] = data[code].replace(99, numpy.nan)
    c = data[code].value_counts(sort=False, dropna=False)
    print('counts for {0} - {1}'.format(code, code_dict[code]))
    print(c)
output:
counts for S2AQ7B - HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
10.000000      3110
8.000000       1154
4.000000        667
1.000000        388
2.000000        233
9.000000       2845
5.000000       1085
3.000000        494
6.000000       1536
101.000000    29768
7.000000       1813
Name: S2AQ7B, dtype: int64

counts for S2AQ14 - NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
1.000000       3106
8.000000        662
4.000000       1318
2.000000       2612
20.000000      2002
32.000000        65
24.000000       106
16.000000       175
64.000000         1
57.000000        13
25.000000       575
65.000000         9
9.000000        293
29.000000        46
11.000000       227
26.000000        77
61.000000         3
27.000000        83
47.000000        32
41.000000        25
10.000000      3192
18.000000       252
7.000000        627
80.000000         1
33.000000        53
58.000000        14
66.000000         1
62.000000         3
51.000000        12
67.000000         1

17.000000       181
59.000000         6
101.000000    16947
13.000000       230
50.000000       230
60.000000        67
72.000000         1
45.000000        67
19.000000       137
34.000000        51
69.000000         2
63.000000         2
43.000000        19
28.000000       110
30.000000       871
5.000000       2283
68.000000         5
14.000000       236
49.000000        19
21.000000       137
77.000000         1
35.000000       140
71.000000         3
15.000000      1274
38.000000        31
52.000000        16
53.000000        18
23.000000       110
70.000000        10
39.000000        19
Name: S2AQ14, Length: 76, dtype: int64

counts for S2AQ16A - AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
16.000000     3301
8.000000        76
12.000000      382
32.000000      105
10.000000      132
64.000000        5
57.000000        4
25.000000     1152
43.000000       27
65.000000        9
18.000000     7042
58.000000        8
11.000000       79
61.000000        6
27.000000      223
78.000000        1
50.000000       70
47.000000       18
49.000000        9
20.000000     2661
13.000000      532
7.000000        57
73.000000        4
33.000000       71
29.000000      102
66.000000        1
51.000000        9
75.000000        6
22.000000     1250
31.000000       59

59.000000        5
101.000000    9202
26.000000      262
34.000000       65
60.000000       39
72.000000        2
45.000000       68
83.000000        1
19.000000     2547
68.000000        3
69.000000        8
63.000000        5
70.000000       10
14.000000     1020
15.000000     1649
5.000000       240
28.000000      242
42.000000       26
82.000000        2
46.000000       17
35.000000      231
79.000000        2
30.000000      598
38.000000       44
52.000000        8
44.000000        9
53.000000        8
23.000000      721
39.000000       32
71.000000        5
Name: S2AQ16A, Length: 77, dtype: int64
Description:
Unknown values (code 99) are recoded as missing, and blank responses (former drinkers or lifetime abstainers) are coded as a new category, 101.
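The two recoding rules can be illustrated on a handful of values without loading the dataset. This is a sketch in plain Python rather than pandas; the `raw` list and the `recode` helper are hypothetical, and it assumes blank strings mark skipped questions, as in the raw CSV handling in the program above.

```python
import math

# Hypothetical raw responses for one variable: blank strings mean the
# question was skipped, "99" is the codebook's 'unknown' code.
raw = ["1", " ", "10", "99", "", "3"]


def recode(value):
    """Map a raw response to its recoded numeric value."""
    text = value.strip()
    if text == "":
        return 101.0       # skipped question -> new category 101
    number = float(text)
    if number == 99:
        return math.nan    # unknown -> missing
    return number


recoded = [recode(v) for v in raw]
print(recoded)
```

The order of the two steps matters in the pandas version: blanks are filled with 101 first, so that the later 99-to-NaN recode does not get filled in again.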
hellodatascientist-blog · 7 years ago
Coursera Data Management and Visualization HW2
Program:
import pandas
import numpy
import re
# any additional libraries would be imported here
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)
codebook = '''
# S2AQ8A HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8E HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ9 HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)
# S2AQ10 HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS
# S2AQ11 HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED
# S2AQ4B HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ5B HOW OFTEN DRANK BEER IN LAST 12 MONTHS
# S2AQ6B HOW OFTEN DRANK WINE IN LAST 12 MONTHS
# S2AQ7B HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
# S2AQ14 NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
# S2AQ15R1 NUMBER OF MONTHS SINCE LAST DRINK (ROUNDED TO NEAREST MONTH)
# S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
# S2AQ19 AGE AT START OF PERIOD OF HEAVIEST DRINKING
# S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
# S2AQ21A HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ22 HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ23 MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
# S2DQ1 BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ2 BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ3C2 ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ4C2 ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ5C2 ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ6C2 ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
'''

# extract code and description
m = re.findall(r'^# (\w+) (.*)$', codebook, re.MULTILINE)
print(m)
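# setting variables you will be working with to numeric
# (convert_objects has been removed from pandas; to_numeric with
# errors='coerce' likewise turns unparseable values into NaN)
for code in m:
    data[code[0]] = pandas.to_numeric(data[code[0]], errors='coerce')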
# counts and percentages (i.e. frequency distributions) for each variable
for code in m:
    c = data[code[0]].value_counts(sort=False)
    p = data[code[0]].value_counts(sort=False, normalize=True)
    print('counts for {0} - {1}'.format(code[0], code[1]))
    print(c)
    print('percentages for {0} - {1}'.format(code[0], code[1]))
    print(p)
Output for three variables:
counts for S2AQ22 - HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
11.000000    20698
8.000000       532
4.000000      1856
1.000000      2090
2.000000       957
9.000000       968
5.000000      1908
99.000000      498
6.000000      1208
10.000000     1330
3.000000      1764
7.000000      1018
Name: S2AQ22, dtype: int64

percentages for S2AQ22 - HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
11.000000   0.594309
8.000000    0.015276
4.000000    0.053292
1.000000    0.060011
2.000000    0.027479
9.000000    0.027795
5.000000    0.054785
99.000000   0.014299
6.000000    0.034686
10.000000   0.038189
3.000000    0.050650
7.000000    0.029230
Name: S2AQ22, dtype: float64

counts for S2AQ23 - MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
2.000000    12351
4.000000     6248
1.000000     1802
9.000000    10745
3.000000     3681
Name: S2AQ23, dtype: int64

percentages for S2AQ23 - MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
2.000000   0.354639
4.000000   0.179401
1.000000   0.051741
9.000000   0.308525
3.000000   0.105694
Name: S2AQ23, dtype: float64

counts for S2DQ1 - BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
1     8124
2    32445
9     2524
Name: S2DQ1, dtype: int64

percentages for S2DQ1 - BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
1   0.188522
2   0.752907
9   0.058571
Name: S2DQ1, dtype: float64
Description:
For HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
1. Every day
2. Nearly every day
3. 3 to 4 times a week
4. 2 times a week
5. Once a week
6. 2 to 3 times a month
7. Once a month
8. 7 to 11 times a year
9. 3 to 6 times a year
10. 1 or 2 times a year
11. Never
99. Unknown
‘Never’ is the largest category.
For MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
1. Coolers
2. Beer
3. Wine
4. Liquor
9. Unknown
‘Beer’ is the largest category.
For BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
1. Yes
2. No
9. Unknown
‘No’ is the largest category.
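The percentages printed for S2DQ1 can be reproduced from the counts alone, which also makes the "largest category" statement easy to verify. The sketch below uses the three counts from the output above and the labels from the description; the variable names are illustrative.

```python
from collections import Counter

# Category labels from the description, counts from the printed output.
labels = {1: "Yes", 2: "No", 9: "Unknown"}
counts = Counter({1: 8124, 2: 32445, 9: 2524})

# Each percentage is the category count divided by the total count.
total = sum(counts.values())
percentages = {labels[code]: n / total for code, n in counts.items()}

# The largest category is the label with the highest share.
largest = max(percentages, key=percentages.get)

for label, share in percentages.items():
    print('{0}: {1:.6f}'.format(label, share))
print('largest category:', largest)
```

The total comes to 43093, matching the number of observations reported for the dataset, so no rows are missing for this variable.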
hellodatascientist-blog · 7 years ago
Coursera Data Management and Visualization HW1
data set: NESARC dataset.
research question: alcohol usage
hypothesis: frequent alcohol use is associated with a family history of alcoholism or problem drinking
codebook:
HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS
NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)
HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS
HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED
HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS
HOW OFTEN DRANK BEER IN LAST 12 MONTHS
HOW OFTEN DRANK WINE IN LAST 12 MONTHS
HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
NUMBER OF MONTHS SINCE LAST DRINK
AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
AGE AT START OF PERIOD OF HEAVIEST DRINKING
DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS
ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
literature review:
Some studies that examined parent influence on adolescent alcohol use found that parent modeling and/or attitude are related to adolescent drinking. In a cross-sectional study of inner-city secondary school students.
literature reference:
“The influence of parent, sibling, and peer modeling and attitudes on adolescent use of alcohol”, Dennis V. Ary, Elizabeth Tildesley, Hyman Hops, and Judy Andrews, The International Journal of the Addictions, August 1993.
literature results:
The clearest finding of this study is that the best predictor of future adolescent alcohol use is current use. That is, current behavior predicts future behavior far better than do measures of social influence. With the exception of parent influence, all social influence on future alcohol use was mediated by current use. Another way of stating this finding is that the only social influence factor that directly influenced change in youth alcohol use was parent attitude and modeling. This is also a notable finding in that it underscores the importance of parent attitudes regarding alcohol use, and suggests that parents can influence the future use of alcohol by their children by communicating their attitudes about adolescent alcohol use and by modeling nonuse of alcohol. Consistent with previous work in this area, peer and sibling influences were significant factors; peer influences were greater than those of sibling(s). However, peer and sibling influences were concurrent: they had no significant influence on change in alcohol use by the focal adolescent. Parent attitudes and modeling, on the other hand, directly influenced change in adolescent alcohol use 1 year later.