hellodatascientist-blog - Tumblr blog

hellodatascientist-blog · 6 years ago

Coursera Data Analysis Tools HW4

Program:

import pandas
import numpy
import re
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

codebook = '''
# S2AQ8A HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS
# S2AQ8D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ8E HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS
# S2AQ9 HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)
# S2AQ10 HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS
# S2AQ11 HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED
# S2AQ4B HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS
# S2AQ5B HOW OFTEN DRANK BEER IN LAST 12 MONTHS
# S2AQ6B HOW OFTEN DRANK WINE IN LAST 12 MONTHS
# S2AQ7B HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
# S2AQ14 NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
# S2AQ15R1 NUMBER OF MONTHS SINCE LAST DRINK (ROUNDED TO NEAREST MONTH)
# S2AQ16A AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
# S2AQ19 AGE AT START OF PERIOD OF HEAVIEST DRINKING
# S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
# S2AQ21A HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21B NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21C LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ21D HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ22 HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
# S2AQ23 MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
# S2DQ1 BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ2 BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
# S2DQ3C2 ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ4C2 ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ5C2 ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS
# S2DQ6C2 ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS
'''

# extract code and description
m = re.findall(r'^# (\w+) (.*)$', codebook, re.MULTILINE)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
for code in m:
    data[code[0]].replace(r'\s+', numpy.nan, regex=True, inplace=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])

# recode the "unknown" codes as missing
data['S2AQ8A'] = data['S2AQ8A'].replace(99, numpy.nan)
data['S2DQ1'] = data['S2DQ1'].replace(9, numpy.nan)
data['S2AQ16A'] = data['S2AQ16A'].replace(99, numpy.nan)

# collapse the drinking frequency codes into three groups
def DRINKFQ(row):
    if row['S2AQ8A'] < 4:
        return 'OFTEN'
    elif row['S2AQ8A'] < 7:
        return 'SOMETIMES'
    elif row['S2AQ8A'] <= 10:
        return 'SELDOM'
    else:
        return row['S2AQ8A']

data['S2DQ1'] = pandas.Categorical(data.S2DQ1)
data['S2DQ1'] = data['S2DQ1'].cat.rename_categories(['Yes', 'No'])

data['DRINKFQ'] = data.apply(lambda row: DRINKFQ(row), axis=1)

sub1 = data[['DRINKFQ', 'S2DQ1']].dropna()

# contingency table of observed counts
ct1 = pandas.crosstab(sub1['DRINKFQ'], sub1['S2DQ1'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

sub1["DRINKFQ_CAT"] = sub1["DRINKFQ"].astype('category')
sub1['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub1, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status')

# post hoc tests
print('post hoc tests')

# OFTEN vs SOMETIMES (all other rows map to NaN and drop out of the crosstab)
recode2 = {'OFTEN': 'OFTEN', 'SOMETIMES': 'SOMETIMES'}
sub1['COMP1v2'] = sub1['DRINKFQ'].map(recode2)

# contingency table of observed counts
ct2 = pandas.crosstab(sub1['COMP1v2'], sub1['S2DQ1'])
print(ct2)

print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

# OFTEN vs SELDOM
recode3 = {'OFTEN': 'OFTEN', 'SELDOM': 'SELDOM'}
sub1['COMP1v3'] = sub1['DRINKFQ'].map(recode3)

# contingency table of observed counts
ct3 = pandas.crosstab(sub1['COMP1v3'], sub1['S2DQ1'])
print(ct3)

print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)

# SOMETIMES vs SELDOM
recode4 = {'SOMETIMES': 'SOMETIMES', 'SELDOM': 'SELDOM'}
sub1['COMP2v3'] = sub1['DRINKFQ'].map(recode4)

# contingency table of observed counts
ct4 = pandas.crosstab(sub1['COMP2v3'], sub1['S2DQ1'])
print(ct4)

print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)

# moderator: age when started drinking, collapsed into four groups
def AGESTDRINK(row):
    if row['S2AQ16A'] < 12:
        return 'CHILD'
    elif row['S2AQ16A'] < 21:
        return 'TEEN'
    elif row['S2AQ16A'] < 30:
        return 'YOUNG'
    else:
        return 'MIDDLE'

data['AGESTDRINK'] = data.apply(lambda row: AGESTDRINK(row), axis=1)

data['S2AQ16A'].value_counts()

sub2 = data[data['AGESTDRINK'] == 'CHILD']
sub3 = data[data['AGESTDRINK'] == 'TEEN']
sub4 = data[data['AGESTDRINK'] == 'YOUNG']
sub5 = data[data['AGESTDRINK'] == 'MIDDLE']

print('association between drink frequency and father’s drinking status for age start drink child')
# contingency table of observed counts
ct2 = pandas.crosstab(sub2['DRINKFQ'], sub2['S2DQ1'])
print(ct2)

# chi-square
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

sub2["DRINKFQ_CAT"] = sub2["DRINKFQ"].astype('category')
sub2['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub2, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status for age start drink child')

print('association between drink frequency and father’s drinking status for age start drink teen')
# contingency table of observed counts
ct3 = pandas.crosstab(sub3['DRINKFQ'], sub3['S2DQ1'])
print(ct3)

# chi-square
print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)

sub3["DRINKFQ_CAT"] = sub3["DRINKFQ"].astype('category')
sub3['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub3, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status for age start drink teen')

print('association between drink frequency and father’s drinking status for age start drink young')
# contingency table of observed counts
ct4 = pandas.crosstab(sub4['DRINKFQ'], sub4['S2DQ1'])
print(ct4)

# chi-square
print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)

sub4["DRINKFQ_CAT"] = sub4["DRINKFQ"].astype('category')
sub4['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub4, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status for age start drink young')

print('association between drink frequency and father’s drinking status for age start drink middle')
# contingency table of observed counts
ct5 = pandas.crosstab(sub5['DRINKFQ'], sub5['S2DQ1'])
print(ct5)

# chi-square
print('chi-square value, p value, expected counts')
cs5 = scipy.stats.chi2_contingency(ct5)
print(cs5)

sub5["DRINKFQ_CAT"] = sub5["DRINKFQ"].astype('category')
sub5['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub5, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status for age start drink middle')

Outputs:

43093
3010
S2DQ1       Yes    No
DRINKFQ
OFTEN      1236  4130
SELDOM     2299  8447
SOMETIMES  1988  7282
S2DQ1           Yes        No
DRINKFQ
OFTEN      0.223791  0.207966
SELDOM     0.416259  0.425349
SOMETIMES  0.359949  0.366685
chi-square value, p value, expected counts
(6.499059891211013, 0.03879243810278014, 2, array([[1167.61555433, 4198.38444567],
       [2338.27744071, 8407.72255929],
       [2017.10700496, 7252.89299504]]))
C:\Users\asus\Anaconda3\lib\site-packages\seaborn\categorical.py:3666: UserWarning: The `factorplot` function has been renamed to `catplot`. The original name will be removed in a future release. Please update your code. Note that the default `kind` in `factorplot` (`'point'`) has changed `'strip'` in `catplot`.
  warnings.warn(msg)
S2DQ1       Yes    No
COMP1v2
OFTEN      1236  4130
SOMETIMES  1988  7282
post hoc tests
chi-square value, p value, expected counts
(4.900384738017596, 0.0268507126955822, 1, array([[1182.01585133, 4183.98414867],
       [2041.98414867, 7228.01585133]]))
S2DQ1     Yes    No
COMP1v3
OFTEN    1236  4130
SELDOM   2299  8447
chi-square value, p value, expected counts
(5.52445154721267, 0.018752480772151237, 1, array([[1177.30945879, 4188.69054121],
       [2357.69054121, 8388.30945879]]))
S2DQ1       Yes    No
COMP2v3
SELDOM     2299  8447
SOMETIMES  1988  7282
chi-square value, p value, expected counts
(0.005085182174825198, 0.9431506679292518, 1, array([[2301.56384892, 8444.43615108],
       [1985.43615108, 7284.56384892]]))

association between drink frequency and father’s drinking status for age start drink child
S2DQ1      Yes   No
DRINKFQ
OFTEN       52   82
SELDOM      35  110
SOMETIMES   37   81
chi-square value, p value, expected counts
(6.977204355874777, 0.030543536775082713, 2, array([[41.85390428, 92.14609572],
       [45.28967254, 99.71032746],
       [36.85642317, 81.14357683]]))
E:/python/coursera/class2.py:164: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

association between drink frequency and father’s drinking status for age start drink teen
S2DQ1       Yes    No
DRINKFQ
OFTEN       933  2897
SELDOM     1495  4823
SOMETIMES  1474  4996
chi-square value, p value, expected counts
(3.5231846022150157, 0.17177113458336468, 2, array([[ 899.30557227, 2930.69442773],
       [1483.5019858 , 4834.4980142 ],
       [1519.19244193, 4950.80755807]]))

association between drink frequency and father’s drinking status for age start drink young
S2DQ1      Yes    No
DRINKFQ
OFTEN      206   868
SELDOM     630  2673
SOMETIMES  404  1795
chi-square value, p value, expected counts
(0.512844330564573, 0.7738152219049121, 2, array([[ 202.51824818,  871.48175182],
       [ 622.82846715, 2680.17153285],
       [ 414.65328467, 1784.34671533]]))

association between drink frequency and father’s drinking status for age start drink middle
S2DQ1      Yes   No
DRINKFQ
OFTEN       45  283
SELDOM     139  841
SOMETIMES   73  410
chi-square value, p value, expected counts
(0.3574536470777455, 0.8363343350479611, 2, array([[ 47.06644333, 280.93355667],
       [140.62534897, 839.37465103],
       [ 69.30820771, 413.69179229]]))

Description:

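Age when started drinking moderates the association between drink frequency and father’s drinking status. Only in the CHILD group (started drinking before age 12) is the p-value below 0.05 (p ≈ 0.031), so the null hypothesis of no association is rejected for that group. For the TEEN, YOUNG and MIDDLE groups (p ≈ 0.17, 0.77 and 0.84) the test finds no significant relationship between drink frequency and father’s drinking status.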
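The stratified tests can be re-run directly from the observed counts printed in the Outputs section. This is only a minimal sketch, not the original script: the `strata` dictionary and the `stratified_chi2` helper are names introduced here for illustration, and the counts are copied by hand from the four contingency tables.

```python
import scipy.stats

# Observed counts (rows: OFTEN, SELDOM, SOMETIMES; columns: Yes, No),
# copied from the contingency tables printed in the Outputs section.
strata = {
    'CHILD':  [[52, 82], [35, 110], [37, 81]],
    'TEEN':   [[933, 2897], [1495, 4823], [1474, 4996]],
    'YOUNG':  [[206, 868], [630, 2673], [404, 1795]],
    'MIDDLE': [[45, 283], [139, 841], [73, 410]],
}


def stratified_chi2(tables, alpha=0.05):
    """Run a chi-square test of independence within each stratum."""
    results = {}
    for name, observed in tables.items():
        chi2, p, dof, _expected = scipy.stats.chi2_contingency(observed)
        results[name] = {'chi2': chi2, 'p': p, 'dof': dof,
                         'significant': bool(p < alpha)}
    return results


results = stratified_chi2(strata)
for name, res in results.items():
    print(name, round(res['chi2'], 3), round(res['p'], 4), res['significant'])
```

Keeping the per-group results in one dictionary makes it easy to compare the p-values side by side, which is the comparison the moderation argument rests on.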

hellodatascientist-blog · 6 years ago

Coursera Data Analysis Tools HW3

Program:

import pandas
import numpy
import re
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
for code in m:
    data[code[0]].replace(r'\s+', numpy.nan, regex=True, inplace=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])

# recode "unknown" as missing
# note: 9 is replaced for S2AQ8C here, whereas HW1 treats 99 as its unknown code,
# so this line also discards respondents whose largest number of drinks was 9
data['S2AQ8B'] = data['S2AQ8B'].replace(99, numpy.nan)
data['S2AQ8C'] = data['S2AQ8C'].replace(9, numpy.nan)

sub1 = data[['S2AQ8B', 'S2AQ8C']].dropna()

scat1 = seaborn.regplot(x="S2AQ8B", y="S2AQ8C", fit_reg=True, data=sub1)
plt.xlabel('number of drinks')
plt.ylabel('largest number of drinks')
plt.title('Scatterplot for the Association Between number of drinks and largest number of drinks')

print('association between number of drinks and largest number of drinks')
print(scipy.stats.pearsonr(sub1['S2AQ8B'], sub1['S2AQ8C']))

Outputs:

43093
3010
C:\Users\asus\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
association between number of drinks and largest number of drinks
(0.45374669130106854, 0.0)

Description:

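The usual number of drinks and the largest number of drinks are positively linearly correlated (r ≈ 0.45). The p-value is reported as 0.0, so the correlation is statistically significant.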
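A common way to read a correlation coefficient is to square it, which gives the share of variance in one variable that a straight-line fit on the other accounts for. The sketch below does that for the r value printed in the output; the helper name `explained_variance` is introduced here, and the toy arrays are made up only to show the `scipy.stats.pearsonr` call.

```python
import numpy
import scipy.stats


def explained_variance(r):
    """Return r squared, the share of variance accounted for by a linear fit."""
    return r ** 2


# r as printed in the Outputs section
r_reported = 0.45374669130106854
print(round(explained_variance(r_reported), 3))

# Toy data (made up for illustration) to show the pearsonr call itself
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = numpy.array([2.0, 2.5, 4.5, 4.0, 6.0])
r_toy, p_toy = scipy.stats.pearsonr(x, y)
print(r_toy, p_toy)
```

With r ≈ 0.45, roughly a fifth of the variability in the largest number of drinks is accounted for by the usual number of drinks.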

hellodatascientist-blog · 7 years ago

Coursera Data Analysis Tools HW2

Program:

import pandas
import numpy
import re
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
for code in m:
    data[code[0]].replace(r'\s+', numpy.nan, regex=True, inplace=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])

# recode the "unknown" codes as missing
data['S2AQ8A'] = data['S2AQ8A'].replace(99, numpy.nan)
data['S2DQ1'] = data['S2DQ1'].replace(9, numpy.nan)

# collapse the drinking frequency codes into three groups
def DRINKFQ(row):
    if row['S2AQ8A'] < 4:
        return 'OFTEN'
    elif row['S2AQ8A'] < 7:
        return 'SOMETIMES'
    elif row['S2AQ8A'] <= 10:
        return 'SELDOM'
    else:
        return row['S2AQ8A']

data['S2DQ1'] = pandas.Categorical(data.S2DQ1)
data['S2DQ1'] = data['S2DQ1'].cat.rename_categories(['Yes', 'No'])

data['DRINKFQ'] = data.apply(lambda row: DRINKFQ(row), axis=1)

sub1 = data[['DRINKFQ', 'S2DQ1']].dropna()

# contingency table of observed counts
ct1 = pandas.crosstab(sub1['DRINKFQ'], sub1['S2DQ1'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

sub1["DRINKFQ_CAT"] = sub1["DRINKFQ"].astype('category')
sub1['S2DQ1_NUM'] = data.apply(lambda row: 0 if row['S2DQ1'] == 'No' else 1, axis=1)

seaborn.factorplot(x="DRINKFQ_CAT", y="S2DQ1_NUM", data=sub1, kind="bar", ci=None)
plt.xlabel('Drink frequency')
plt.ylabel('Father’s drinking status')

# OFTEN vs SOMETIMES
recode2 = {'OFTEN': 'OFTEN', 'SOMETIMES': 'SOMETIMES'}
sub1['COMP1v2'] = sub1['DRINKFQ'].map(recode2)

# contingency table of observed counts
ct2 = pandas.crosstab(sub1['COMP1v2'], sub1['S2DQ1'])
print(ct2)

# column percentages
colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

# OFTEN vs SELDOM
recode3 = {'OFTEN': 'OFTEN', 'SELDOM': 'SELDOM'}
sub1['COMP1v3'] = sub1['DRINKFQ'].map(recode3)

# contingency table of observed counts
ct3 = pandas.crosstab(sub1['COMP1v3'], sub1['S2DQ1'])
print(ct3)

# column percentages
colsum = ct3.sum(axis=0)
colpct = ct3 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)

# SOMETIMES vs SELDOM
recode4 = {'SOMETIMES': 'SOMETIMES', 'SELDOM': 'SELDOM'}
sub1['COMP2v3'] = sub1['DRINKFQ'].map(recode4)

# contingency table of observed counts
ct4 = pandas.crosstab(sub1['COMP2v3'], sub1['S2DQ1'])
print(ct4)

# column percentages
colsum = ct4.sum(axis=0)
colpct = ct4 / colsum
print(colpct)

print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)

Output:

43093
3010
S2DQ1       Yes    No
DRINKFQ
OFTEN      1236  4130
SELDOM     2299  8447
SOMETIMES  1988  7282
S2DQ1           Yes        No
DRINKFQ
OFTEN      0.223791  0.207966
SELDOM     0.416259  0.425349
SOMETIMES  0.359949  0.366685
chi-square value, p value, expected counts
(6.499059891211013, 0.03879243810278014, 2, array([[1167.61555433, 4198.38444567],
       [2338.27744071, 8407.72255929],
       [2017.10700496, 7252.89299504]]))
C:\Users\asus\Anaconda3\lib\site-packages\seaborn\categorical.py:3666: UserWarning: The `factorplot` function has been renamed to `catplot`. The original name will be removed in a future release. Please update your code. Note that the default `kind` in `factorplot` (`'point'`) has changed `'strip'` in `catplot`.
  warnings.warn(msg)
S2DQ1       Yes    No
COMP1v2
OFTEN      1236  4130
SOMETIMES  1988  7282
S2DQ1           Yes      No
COMP1v2
OFTEN      0.383375  0.3619
SOMETIMES  0.616625  0.6381
chi-square value, p value, expected counts
(4.900384738017596, 0.0268507126955822, 1, array([[1182.01585133, 4183.98414867],
       [2041.98414867, 7228.01585133]]))
S2DQ1     Yes    No
COMP1v3
OFTEN    1236  4130
SELDOM   2299  8447
S2DQ1         Yes        No
COMP1v3
OFTEN    0.349646  0.328377
SELDOM   0.650354  0.671623
chi-square value, p value, expected counts
(5.52445154721267, 0.018752480772151237, 1, array([[1177.30945879, 4188.69054121],
       [2357.69054121, 8388.30945879]]))
S2DQ1       Yes    No
COMP2v3
SELDOM     2299  8447
SOMETIMES  1988  7282
S2DQ1           Yes        No
COMP2v3
SELDOM     0.536272  0.537034
SOMETIMES  0.463728  0.462966
chi-square value, p value, expected counts
(0.005085182174825198, 0.9431506679292518, 1, array([[2301.56384892, 8444.43615108],
       [1985.43615108, 7284.56384892]]))

Description:

As p-value is less than 0.05, we can reject the null hypothesis and say that there is an association between the drink frequency and father’s drinking status.

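However, for the post hoc tests, the p-values in all three pairwise comparisons (0.027, 0.019 and 0.943) are larger than the Bonferroni-adjusted threshold of 0.05/3 ≈ 0.017, hence we fail to reject the null hypothesis for each pair.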
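The Bonferroni adjustment used for the post hoc comparisons can be sketched as follows, working from the observed counts printed in the Output section rather than from the raw data. The `counts`, `pairs` and `pairwise` names are introduced here for illustration only.

```python
import scipy.stats

# Observed counts (columns: Yes, No) copied from the Output section
counts = {
    'OFTEN': [1236, 4130],
    'SELDOM': [2299, 8447],
    'SOMETIMES': [1988, 7282],
}
pairs = [('OFTEN', 'SOMETIMES'), ('OFTEN', 'SELDOM'), ('SOMETIMES', 'SELDOM')]

# Bonferroni adjustment: divide the overall alpha by the number of comparisons
adjusted_alpha = 0.05 / len(pairs)

pairwise = {}
for a, b in pairs:
    # 2x2 table for this pair; chi2_contingency applies Yates' correction by default
    chi2, p, dof, _expected = scipy.stats.chi2_contingency([counts[a], counts[b]])
    pairwise[(a, b)] = p
    print(a, 'vs', b, round(chi2, 3), round(p, 4), bool(p < adjusted_alpha))
```

Dividing alpha by the number of comparisons keeps the overall chance of a false positive across the three tests at roughly 0.05.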

hellodatascientist-blog · 7 years ago

Coursera Data Analysis Tools HW1

Program:

import pandas
import numpy
import re
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
for code in m:
    data[code[0]].replace(r'\s+', numpy.nan, regex=True, inplace=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])

# recode the "unknown" codes as missing
data['S2AQ8C'] = data['S2AQ8C'].replace(99, numpy.nan)
data['S2DQ1'] = data['S2DQ1'].replace(9, numpy.nan)

data['S2DQ1'] = pandas.Categorical(data.S2DQ1)
data['S2DQ1'] = data['S2DQ1'].cat.rename_categories(['Yes', 'No'])

# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='S2AQ8C ~ C(S2DQ1)', data=data)
results1 = model1.fit()
print(results1.summary())

sub1 = data[['S2AQ8C', 'S2DQ1']].dropna()

print('means for {0} by {1}'.format(code_dict['S2AQ8C'], code_dict['S2DQ1']))
m1 = sub1.groupby('S2DQ1').mean()
print(m1)

print('standard deviations for {0} by {1}'.format(code_dict['S2AQ8C'], code_dict['S2DQ1']))
sd1 = sub1.groupby('S2DQ1').std()
print(sd1)

Output:

43093
3010
                            OLS Regression Results
==============================================================================
Dep. Variable:                 S2AQ8C   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     223.9
Date:                Sat, 08 Dec 2018   Prob (F-statistic):           2.06e-50
Time:                        19:09:02   Log-Likelihood:                -74320.
No. Observations:               25265   AIC:                         1.486e+05
Df Residuals:                   25263   BIC:                         1.487e+05
Df Model:                           1
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          5.0211      0.062     81.110      0.000       4.900       5.142
C(S2DQ1)[T.No]    -1.0470      0.070    -14.965      0.000      -1.184      -0.910
==============================================================================
Omnibus:                    24732.885   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2443146.664
Skew:                           4.566   Prob(JB):                         0.00
Kurtosis:                      50.302   Cond. No.                         4.08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS by BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
         S2AQ8C
S2DQ1
Yes    5.021149
No     3.974166
standard deviations for LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS by BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
         S2AQ8C
S2DQ1
Yes    5.505642
No     4.294600

Description:

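As the p-value (2.06e-50) is less than 0.05, we can reject the null hypothesis and say that there is an association between the largest number of drinks and father’s drinking status. Respondents whose father was a problem drinker report a higher mean (5.02 drinks) than those whose father was not (3.97 drinks).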
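Because the explanatory variable has only two levels, the ANOVA F-statistic is the square of the t-statistic for the group coefficient, which is why no post hoc test is needed here. The sketch below checks this against the numbers in the OLS summary and then shows the same identity with `scipy.stats` on synthetic data; the two groups are made up and are not NESARC values.

```python
import numpy
import scipy.stats

# Values printed in the OLS summary
t_reported = -14.965
f_reported = 223.9
print(t_reported ** 2, f_reported)  # agree up to the rounding in the summary

# Synthetic two-group data (made up) to show the same identity with scipy
rng = numpy.random.default_rng(0)
group_yes = rng.normal(loc=5.0, scale=2.0, size=200)
group_no = rng.normal(loc=4.0, scale=2.0, size=300)

f_stat, p_anova = scipy.stats.f_oneway(group_yes, group_no)
t_stat, p_ttest = scipy.stats.ttest_ind(group_yes, group_no)  # pooled variance
print(f_stat, t_stat ** 2)
```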

hellodatascientist-blog · 7 years ago

Coursera Data Management and Visualization HW4

Program:

import pandas
import numpy
import re
import seaborn
import matplotlib.pyplot as plt
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
for code in m:
    data[code[0]].replace(r'\s+', numpy.nan, regex=True, inplace=True)
    data[code[0]] = pandas.to_numeric(data[code[0]])

# counts and percentages (i.e. frequency distributions) for each variable
for code in m:
    c = data[code[0]].value_counts(sort=False)
    p = data[code[0]].value_counts(sort=False, normalize=True)

# coding in valid data
# recode missing values to numeric value, in this example replace NaN with 101
for code in coding_101:
    data[code].fillna(101, inplace=True)
    # recode 99 values as missing
    data[code] = data[code].replace(99, numpy.nan)
    c = data[code].value_counts(sort=False, dropna=False)

for code in coding_drop9:
    data[code] = data[code].replace(9, numpy.nan)
    c = data[code].value_counts(sort=False, dropna=False)

# univariate graphs: one count plot per variable
for code in m:
    data[code[0]] = data[code[0]].astype('category')
    seaborn.countplot(x=code[0], data=data)
    plt.title(code[1])
    plt.show()

# creating 2 level drink group variable
def DRINKGRP(row):
    if row['S2AQ21D'] > 5:
        return 0
    else:
        return 1

data['DRINKGRP'] = data.apply(lambda row: DRINKGRP(row), axis=1)
data['S2DQ1'] = data['S2DQ1'].astype('category')
data['S2DQ1'] = data['S2DQ1'].cat.rename_categories(['Yes', 'No'])

# bivariate graph: share of frequent drinkers by father's drinking status
seaborn.factorplot(x="S2DQ1", y="DRINKGRP", data=data, kind="bar", ci=None)
plt.xlabel('BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER')
plt.ylabel('Frequent drinker')

univariate graph:

bivariate graph:

Description:

X: whether the father was ever an alcoholic or problem drinker

Y: frequent drinker during the period of heaviest drinking

Whether the father was a problem drinker does not have much effect on the frequency of drinking. The group with a problem-drinker father even shows less frequent drinking.
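One detail of the row-wise `DRINKGRP` function as written: a comparison with NaN is always False, so respondents with a missing `S2AQ21D` fall through to the `else` branch and are coded 1. A possible vectorised variant that keeps missing values missing is sketched below. The helper name `drink_group` and the toy frame are assumptions for illustration, not part of the original program.

```python
import numpy
import pandas


def drink_group(freq):
    """Recode a frequency code: 0 if > 5, 1 if <= 5, NaN stays NaN."""
    grp = numpy.where(freq > 5, 0.0, 1.0)
    # .where keeps values where the condition holds and sets the rest to NaN
    return pandas.Series(grp, index=freq.index).where(freq.notna())


# Toy data (made up) with one missing response
toy = pandas.DataFrame({'S2AQ21D': [1, 5, 6, 10, numpy.nan]})
toy['DRINKGRP'] = drink_group(toy['S2AQ21D'])
print(toy)
```

With missing values excluded, the bar heights would be proportions of frequent drinkers among respondents who actually answered the question.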

hellodatascientist-blog · 7 years ago

Coursera Data Management and Visualization HW3

program:

import pandas
import numpy
import re
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# build code to description map
code_dict = {}
for code in m:
    code_dict[code[0]] = code[1]

# setting variables you will be working with to numeric
# (convert_objects was removed in pandas 1.0; to_numeric with errors='coerce'
# likewise turns unparseable values into NaN)
for code in m:
    data[code[0]] = pandas.to_numeric(data[code[0]], errors='coerce')

# counts and percentages (i.e. frequency distributions) for each variable
for code in m:
    c = data[code[0]].value_counts(sort=False)
    p = data[code[0]].value_counts(sort=False, normalize=True)

coding_101 = ['S2AQ8A', 'S2AQ8B', 'S2AQ8C', 'S2AQ8D', 'S2AQ8E', 'S2AQ9', 'S2AQ10', 'S2AQ11',
              'S2AQ4B', 'S2AQ5B', 'S2AQ6B', 'S2AQ7B', 'S2AQ14', 'S2AQ16A']
coding_1001 = ['S2AQ15R1']
coding_drop = ['S2AQ19', 'S2AQ20']

# coding in valid data
# recode missing values to numeric value, in this example replace NaN with 101
for code in coding_101:
    data[code].fillna(101, inplace=True)
    # recode 99 values as missing
    data[code] = data[code].replace(99, numpy.nan)
    c = data[code].value_counts(sort=False, dropna=False)
    print('counts for {0} - {1}'.format(code, code_dict[code]))
    print(c)

output:

counts for S2AQ7B - HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS
10.000000      3110
8.000000       1154
4.000000        667
1.000000        388
2.000000        233
9.000000       2845
5.000000       1085
3.000000        494
6.000000       1536
101.000000    29768
7.000000       1813
Name: S2AQ7B, dtype: int64
counts for S2AQ14 - NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS
1.000000       3106
8.000000        662
4.000000       1318
2.000000       2612
20.000000      2002
32.000000        65
24.000000       106
16.000000       175
64.000000         1
57.000000        13
25.000000       575
65.000000         9
9.000000        293
29.000000        46
11.000000       227
26.000000        77
61.000000         3
27.000000        83
47.000000        32
41.000000        25
10.000000      3192
18.000000       252
7.000000        627
80.000000         1
33.000000        53
58.000000        14
66.000000         1
62.000000         3
51.000000        12
67.000000         1
              ...
17.000000       181
59.000000         6
101.000000    16947
13.000000       230
50.000000       230
60.000000        67
72.000000         1
45.000000        67
19.000000       137
34.000000        51
69.000000         2
63.000000         2
43.000000        19
28.000000       110
30.000000       871
5.000000       2283
68.000000         5
14.000000       236
49.000000        19
21.000000       137
77.000000         1
35.000000       140
71.000000         3
15.000000      1274
38.000000        31
52.000000        16
53.000000        18
23.000000       110
70.000000        10
39.000000        19
Name: S2AQ14, Length: 76, dtype: int64
counts for S2AQ16A - AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS
16.000000      3301
8.000000         76
12.000000       382
32.000000       105
10.000000       132
64.000000         5
57.000000         4
25.000000      1152
43.000000        27
65.000000         9
18.000000      7042
58.000000         8
11.000000        79
61.000000         6
27.000000       223
78.000000         1
50.000000        70
47.000000        18
49.000000         9
20.000000      2661
13.000000       532
7.000000         57
73.000000         4
33.000000        71
29.000000       102
66.000000         1
51.000000         9
75.000000         6
22.000000      1250
31.000000        59
              ...
59.000000         5
101.000000     9202
26.000000       262
34.000000        65
60.000000        39
72.000000         2
45.000000        68
83.000000         1
19.000000      2547
68.000000         3
69.000000         8
63.000000         5
70.000000        10
14.000000      1020
15.000000      1649
5.000000        240
28.000000       242
42.000000        26
82.000000         2
46.000000        17
35.000000       231
79.000000         2
30.000000       598
38.000000        44
52.000000         8
44.000000         9
53.000000         8
23.000000       721
39.000000        32
71.000000         5
Name: S2AQ16A, Length: 77, dtype: int64

Description:

Unknown values (coded 99) are recoded as missing, and blank responses, which correspond to former drinkers or lifetime abstainers, are coded as a new category (101).
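The order of the two recoding steps matters, which a toy example makes visible. This is a small sketch on a made-up column (the values are not NESARC data), using the same `fillna` and `replace` calls as the program.

```python
import numpy
import pandas

# Toy column (made up): NaN = blank in the raw file, 99 = unknown
toy = pandas.Series([1, 10, numpy.nan, 99, 5, numpy.nan], name='S2AQ7B')

# Order matters: fill blanks with 101 first, then turn 99 into missing,
# otherwise the unknowns would also end up in the 101 category.
recoded = toy.fillna(101).replace(99, numpy.nan)
counts = recoded.value_counts(sort=False, dropna=False)
print(counts)
```

Passing `dropna=False` to `value_counts` keeps the recoded unknowns visible as a NaN row instead of silently dropping them from the table.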

hellodatascientist-blog · 7 years ago

Coursera Data Management and Visualization HW2

Program:

import pandas
import numpy
import re
# any additional libraries would be imported here

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# setting variables you will be working with to numeric
# (convert_objects was removed in pandas 1.0; to_numeric with errors='coerce'
# likewise turns unparseable values into NaN)
for code in m:
    data[code[0]] = pandas.to_numeric(data[code[0]], errors='coerce')

# counts and percentages (i.e. frequency distributions) for each variable
for code in m:
    c = data[code[0]].value_counts(sort=False)
    p = data[code[0]].value_counts(sort=False, normalize=True)
    print('counts for {0} - {1}'.format(code[0], code[1]))
    print(c)
    print('percentages for {0} - {1}'.format(code[0], code[1]))
    print(p)

Output for three variables:

counts for S2AQ22 - HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
11.000000    20698
8.000000       532
4.000000      1856
1.000000      2090
2.000000       957
9.000000       968
5.000000      1908
99.000000      498
6.000000      1208
10.000000     1330
3.000000      1764
7.000000      1018
Name: S2AQ22, dtype: int64
percentages for S2AQ22 - HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING
11.000000    0.594309
8.000000     0.015276
4.000000     0.053292
1.000000     0.060011
2.000000     0.027479
9.000000     0.027795
5.000000     0.054785
99.000000    0.014299
6.000000     0.034686
10.000000    0.038189
3.000000     0.050650
7.000000     0.029230
Name: S2AQ22, dtype: float64
counts for S2AQ23 - MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
2.000000    12351
4.000000     6248
1.000000     1802
9.000000    10745
3.000000     3681
Name: S2AQ23, dtype: int64
percentages for S2AQ23 - MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING
2.000000    0.354639
4.000000    0.179401
1.000000    0.051741
9.000000    0.308525
3.000000    0.105694
Name: S2AQ23, dtype: float64
counts for S2DQ1 - BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
1     8124
2    32445
9     2524
Name: S2DQ1, dtype: int64
percentages for S2DQ1 - BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER
1    0.188522
2    0.752907
9    0.058571
Name: S2DQ1, dtype: float64

Description:

For HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING

1. Every day
2. Nearly every day
3. 3 to 4 times a week
4. 2 times a week
5. Once a week
6. 2 to 3 times a month
7. Once a month
8. 7 to 11 times a year
9. 3 to 6 times a year
10. 1 or 2 times a year
11. Never
99. Unknown

‘Never’ is the largest category.

For MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING

1. Coolers
2. Beer
3. Wine
4. Liquor
9. Unknown

‘Beer’ is the largest category.

For BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER

1. Yes
2. No
9. Unknown

‘No’ is the largest category.
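The "largest category" statements above can be checked directly from the printed counts. Here is a minimal sketch using the S2DQ1 counts from the output and the labels from the description. The variable names are illustrative. Dropping "Unknown" before computing shares is an assumption about how one might handle that code, not something the original program does.

```python
import pandas as pd

# Counts for S2DQ1 copied from the output above; labels from the description.
counts = pd.Series({1: 8124, 2: 32445, 9: 2524}, name="S2DQ1")
labels = {1: "Yes", 2: "No", 9: "Unknown"}

# Replace numeric codes with readable labels in the index.
labeled = counts.rename(index=labels)

# The category with the highest count.
largest = labeled.idxmax()
print(largest)

# Assumption: treat "Unknown" as missing and recompute shares
# among respondents who gave a definite answer.
known = labeled.drop("Unknown")
known_share = known / known.sum()
print(known_share.round(3))
```

The same pattern should carry over to S2AQ22 and S2AQ23 by swapping in their counts and label dictionaries.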


hellodatascientist-blog · 7 years ago


Coursera Data Management and Visualization HW1

data set: NESARC dataset.

research question: alcohol usage

hypothesis: frequent alcohol use is associated with having family members who are alcoholics or problem drinkers

codebook:

HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS

NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS

LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS

HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS

HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS

HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)

HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS

HOW MANY DRINKS CAN HOLD WITHOUT FEELING INTOXICATED

HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS

HOW OFTEN DRANK BEER IN LAST 12 MONTHS

HOW OFTEN DRANK WINE IN LAST 12 MONTHS

HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS

NUMBER OF YEARS DRANK SAME AS IN LAST 12 MONTHS

NUMBER OF MONTHS SINCE LAST DRINK

AGE WHEN STARTED DRINKING, NOT COUNTING SMALL TASTES OR SIPS

AGE AT START OF PERIOD OF HEAVIEST DRINKING

DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING

HOW OFTEN DRANK ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING

NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING

LARGEST NUMBER OF DRINKS OF ANY ALCOHOL CONSUMED ON DAYS WHEN DRANK ALCOHOL DURING PERIOD OF HEAVIEST DRINKING

HOW OFTEN DRANK LARGEST NUMBER OF DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING

HOW OFTEN DRANK 5+ DRINKS OF ANY ALCOHOL DURING PERIOD OF HEAVIEST DRINKING

MAIN TYPE OF ALCOHOL CONSUMED DURING PERIOD OF HEAVIEST DRINKING

BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER

BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER

ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS

ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS

ANY NATURAL SONS EVER ALCOHOLICS OR PROBLEM DRINKERS

ANY NATURAL DAUGHTERS EVER ALCOHOLICS OR PROBLEM DRINKERS

literature review:

Some studies that examined parent influence on adolescent alcohol use found that parent modeling and/or attitude are related to adolescent drinking. In a cross-sectional study of inner-city secondary school students.

literature reference:

“The influence of parent, sibling, and peer modeling and attitudes on adolescent use of alcohol”, Dennis V. Ary, Elizabeth Tildesley, Hyman Hops, and Judy Andrews, The International Journal of the Addictions, August 1993.

literature results:

The clearest finding of this study is that the best predictor of future adolescent alcohol use is current use. That is, current behavior predicts future behavior far better than do measures of social influence. With the exception of parent influence, all social influence on future alcohol use was mediated by current use. Another way of stating this finding is that the only social influence factor that directly influenced change in youth alcohol use was parent attitude and modeling. This is also a notable finding in that it underscores the importance of parent attitudes regarding alcohol use, and suggests that parents can influence the future use of alcohol by their children by communicating their attitudes about adolescent alcohol use and by modeling nonuse of alcohol. Consistent with previous work in this area, peer and sibling influences were significant factors; peer influences were greater than those of sibling(s). However, peer and sibling influences were concurrent: they had no significant influence on change in alcohol use by the focal adolescent. Parent attitudes and modeling, on the other hand, directly influenced change in adolescent alcohol use 1 year later.
