codersarea-blog - Tumblr blog

codersarea-blog · 5 years ago

Visualizing Data

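import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r"C:\Users\Sai Kumar\PycharmProjects\Data Analysis\data_sets\nesarc_pds.csv", low_memory=False)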

# bug fix: print floats with a fixed '%f' format
pd.set_option('display.float_format', lambda x: '%f' % x)

# setting variables to numeric
data['TAB12MDX'] = pd.to_numeric(data['TAB12MDX'])
data['CHECK321'] = pd.to_numeric(data['CHECK321'])
data['S3AQ3B1'] = pd.to_numeric(data['S3AQ3B1'])
data['S3AQ3C1'] = pd.to_numeric(data['S3AQ3C1'])
data['AGE'] = pd.to_numeric(data['AGE'])

# subset data to young adults age 18 to 25 who have smoked in the past 12 months
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 25) & (data['CHECK321'] == 1)]

# make a copy of my new subsetted data
sub2 = sub1.copy()
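To make the subsetting step concrete, here is a minimal sketch on a few made-up rows. Only the column names AGE and CHECK321 come from the post; the values and the FLAG column are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for the NESARC file (values are illustrative only).
demo = pd.DataFrame({
    "AGE": [17, 18, 22, 25, 26, 21],
    "CHECK321": [1, 1, 2, 1, 1, 1],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are needed because & binds more tightly than >= and ==.
mask = (demo["AGE"] >= 18) & (demo["AGE"] <= 25) & (demo["CHECK321"] == 1)

# .copy() returns an independent frame, so new columns added to the subset
# do not affect the original data.
young_smokers = demo[mask].copy()
young_smokers["FLAG"] = True  # hypothetical column, only to show independence

print(young_smokers)
print("FLAG" in demo.columns)
```

Working on a copy keeps later recodes from raising chained-assignment warnings.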

# SETTING MISSING DATA

# recode missing values to python missing (NaN)
sub2['S3AQ3B1'] = sub2['S3AQ3B1'].replace(9, np.nan)

# recode missing values to python missing (NaN)
sub2['S3AQ3C1'] = sub2['S3AQ3C1'].replace(99, np.nan)
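Why recode the missing-value code at all? A small sketch with invented values shows the effect: left in place, the code 9 is treated as a real answer and distorts any summary, while NaN is skipped.

```python
import numpy as np
import pandas as pd

# Invented responses; 9 plays the role of the "unknown" code, as in the post.
s = pd.Series([1, 6, 9, 3, 9], name="S3AQ3B1")

cleaned = s.replace(9, np.nan)

# dropna=False keeps the NaN row in the frequency table.
print(cleaned.value_counts(dropna=False))

# The mean ignores NaN, so the two unknown answers no longer pull it upward.
print(s.mean(), cleaned.mean())
```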

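# recoding values for S3AQ3B1 into a new variable, USFREQ
recode1 = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}
sub2['USFREQ'] = sub2['S3AQ3B1'].map(recode1)

# recoding values for S3AQ3B1 into a new variable, USFREQMO
recode2 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQMO'] = sub2['S3AQ3B1'].map(recode2)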

# A secondary variable multiplying the number of days smoked/month and the approx number of cig smoked/day
sub2['NUMCIGMO_EST'] = sub2['USFREQMO'] * sub2['S3AQ3C1']
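The recode and the product can be traced by hand on a few rows. This sketch reuses the post's recode2 dictionary; the input values are invented.

```python
import numpy as np
import pandas as pd

# Invented rows: frequency code and cigarettes per day, with some gaps.
demo = pd.DataFrame({
    "S3AQ3B1": [1, 3, 6, np.nan],
    "S3AQ3C1": [20, 5, np.nan, 10],
})

# Same mapping as recode2 in the post: frequency code -> approximate days per month.
recode2 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
demo["USFREQMO"] = demo["S3AQ3B1"].map(recode2)

# Any NaN in either factor propagates into the product.
demo["NUMCIGMO_EST"] = demo["USFREQMO"] * demo["S3AQ3C1"]

print(demo)
```

Rows with a missing frequency or a missing cigarettes-per-day value end up with a missing monthly estimate, which is usually the desired behavior.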

# univariate bar graph for categorical variables
# First change format from numeric to categorical
sub2["TAB12MDX"] = sub2["TAB12MDX"].astype('category')

sns.countplot(x="TAB12MDX", data=sub2)
plt.xlabel('Nicotine Dependence past 12 months')
plt.title('Nicotine Dependence in the Past 12 Months Among Young Adult Smokers in the NESARC Study')
plt.show()  # display the figure when running as a script
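A count plot draws one bar per category, with the bar height equal to the number of rows in that category. The numbers behind the bars can be checked without plotting; this sketch uses invented values for TAB12MDX.

```python
import pandas as pd

# Invented 0/1 values standing in for the nicotine dependence indicator.
tab = pd.Series([0, 1, 1, 0, 1], name="TAB12MDX")

# Converting to category makes pandas treat the codes as labels, not quantities.
tab_cat = tab.astype("category")

# sort=False keeps the categories in their natural order instead of by frequency.
counts = tab_cat.value_counts(sort=False)
print(counts)
```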

outputs:

codersarea-blog · 5 years ago

Coding valid data

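import pandas as pd
import numpy as np

# raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r"C:\Users\Sai Kumar\PycharmProjects\Data Analysis\data_sets\nesarc_pds.csv", low_memory=False)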

# bug fix: print floats with a fixed '%f' format
pd.set_option('display.float_format', lambda x: '%f' % x)

# subset data to young adults age 18 to 25 who have smoked in the past 12 months
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 25) & (data['CHECK321'] == 1)]

# make a copy of my new subsetted data
sub2 = sub1.copy()
c1 = sub2['S3AQ3B1'].value_counts(sort=False, dropna=False)
print(c1)

# coding missing values
sub2['S3AQ3B1'] = sub2['S3AQ3B1'].replace(9, np.nan)
sub2['S3AQ3C1'] = sub2['S3AQ3C1'].replace(99, np.nan)

c2 = sub2['S3AQ3B1'].value_counts(sort=False, dropna=False)
print(c2)

# coding in valid data
# recoding missing values to numeric (assign the result back rather than using inplace=True on a column)
sub2['S2AQ8A'] = sub2['S2AQ8A'].fillna(11)
# recode 99 value as missing
sub2['S2AQ8A'] = sub2['S2AQ8A'].replace(99, np.nan)

# check coding
chk = sub2['S2AQ8A'].value_counts(sort=False, dropna=False)
print(chk)
ds = sub2['S2AQ8A'].describe()
print(ds)

# recoding values for S3AQ3B1 into a new variable, USFREQ
recode1 = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}
sub2['USFREQ'] = sub2['S3AQ3B1'].map(recode1)

# recoding values for S3AQ3B1 into a new variable, USFREQMO
recode2 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQMO'] = sub2['S3AQ3B1'].map(recode2)

# secondary variable multiplying the number of days smoked/month and the approx number of cig smoked/day
sub2['NUMCIGMO_EST'] = sub2['USFREQMO'] * sub2['S3AQ3C1']

# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('AGE - 4 categories - quartiles')
sub2['AGEGROUP4'] = pd.qcut(sub2.AGE, 4, labels=["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"])
c4 = sub2['AGEGROUP4'].value_counts(sort=False, dropna=True)
print(c4)
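As a small illustration of what qcut does, here is a sketch on eight invented ages with no ties. The labels Q1 to Q4 are placeholders, not the ones used in the post.

```python
import pandas as pd

# Eight distinct invented ages, so each quartile receives exactly two values.
ages = pd.Series([18, 19, 20, 21, 22, 23, 24, 25], name="AGE")

# qcut picks the bin edges from the data's own quantiles.
quartile = pd.qcut(ages, 4, labels=["Q1", "Q2", "Q3", "Q4"])

counts = quartile.value_counts(sort=False)
print(counts)
```

With real survey ages, many respondents share the same integer age, so the edges fall on tied values and the four groups need not be equal in size, which is what the AGEGROUP4 counts in this post show.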

# categorize quantitative variable based on customized splits using cut function
# splits into 3 groups (18-20, 21-22, 23-25); each bin excludes its left edge and includes its right edge,
# so the first edge is 17 in order to capture age 18
sub2['AGEGROUP3'] = pd.cut(sub2.AGE, [17, 20, 22, 25])
c5 = sub2['AGEGROUP3'].value_counts(sort=False, dropna=True)
print(c5)

# crosstabs evaluating which ages were put into which AGEGROUP3
print(pd.crosstab(sub2['AGEGROUP3'], sub2['AGE']))
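The choice of 17 as the first edge follows from how cut builds its intervals by default: open on the left, closed on the right. A minimal sketch with invented ages:

```python
import pandas as pd

ages = pd.Series([18, 20, 21, 22, 23, 25], name="AGE")  # invented values
edges = [17, 20, 22, 25]

# Default right=True gives the intervals (17, 20], (20, 22], (22, 25].
groups = pd.cut(ages, edges).rename("AGEGROUP3")
print(groups.astype(str).tolist())

# A value equal to the first edge falls outside every interval and becomes NaN.
print(pd.cut(pd.Series([17]), edges))

# The crosstab confirms which ages landed in which group.
table = pd.crosstab(groups, ages)
print(table)
```

If the edges were written as [18, 20, 22, 25], every 18-year-old would silently drop out of the first group.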

# frequency distribution for AGEGROUP3
print('counts for AGEGROUP3')
c10 = sub2['AGEGROUP3'].value_counts(sort=False)
print(c10)

print('percentages for AGEGROUP3')
p10 = sub2['AGEGROUP3'].value_counts(sort=False, normalize=True)
print(p10)
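With normalize=True, value_counts divides each count by the total number of non-missing rows. The AGEGROUP3 counts reported in this post (582, 467, 657) can be used to check that arithmetic by hand:

```python
import pandas as pd

# Counts copied from the post's AGEGROUP3 output.
counts = pd.Series({"(17, 20]": 582, "(20, 22]": 467, "(22, 25]": 657}, name="AGEGROUP3")

# Each proportion is the group's count over the total of all groups.
proportions = counts / counts.sum()

print(counts.sum())
print(proportions.round(6))
```

Note that these are proportions between 0 and 1; multiply by 100 for percentages.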

outputs

nan         25080
1.000000    14836
4.000000      747
2.000000      460
9.000000      102
5.000000      409
3.000000      687
6.000000      772
Name: S3AQ3B1, dtype: int64

1.000000    1320
2.000000      68
4.000000      88
3.000000      91
5.000000      65
6.000000      71
9.000000       3
Name: S3AQ3B1, dtype: int64

1.000000    1320
2.000000      68
4.000000      88
3.000000      91
5.000000      65
6.000000      71
nan            3
Name: S3AQ3B1, dtype: int64

99      8
9     134
8      85
4     229
6     248
1      76
      180
10    118
5     216
7     134
3     194
2      84
Name: S2AQ8A, dtype: int64

count     1706
unique      12
top          6
freq       248
Name: S2AQ8A, dtype: object

AGE - 4 categories - quartiles
1=0%tile     582
2=25%tile    467
3=50%tile    231
4=75%tile    426
Name: AGEGROUP4, dtype: int64

(17, 20]    582
(20, 22]    467
(22, 25]    657
Name: AGEGROUP3, dtype: int64

AGE         18   19   20   21   22   23   24   25
AGEGROUP3
(17, 20]   161  200  221    0    0    0    0    0
(20, 22]     0    0    0  239  228    0    0    0
(22, 25]     0    0    0    0    0  231  241  185

counts for AGEGROUP3
(17, 20]    582
(20, 22]    467
(22, 25]    657
Name: AGEGROUP3, dtype: int64

percentages for AGEGROUP3
(17, 20]   0.341149
(20, 22]   0.273740
(22, 25]   0.385111
Name: AGEGROUP3, dtype: float64
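In the S2AQ8A output, describe() reports dtype object, 99 still appears with 8 cases, and one value is blank. That pattern suggests the column was read as text, so fillna(11) and replace(99, np.nan) found nothing to match. A minimal sketch of one possible fix, assuming the blanks are whitespace strings (the sample values are invented):

```python
import numpy as np
import pandas as pd

# Invented text values mimicking a column that was read as strings.
raw = pd.Series(["1", "6", " ", "99", "10"], name="S2AQ8A")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Same two recodes as in the post, now applied to real numbers.
numeric = numeric.fillna(11)           # blank -> 11
numeric = numeric.replace(99, np.nan)  # 99 -> missing

print(numeric)
```

After the conversion, describe() would return numeric statistics (mean, quartiles) instead of unique/top/freq.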

codersarea-blog · 5 years ago

Frequency Analysis of selected Variables

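import pandas as pd

# raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
data1 = pd.read_csv(r"C:\Users\Sai Kumar\PycharmProjects\Data Analysis\data_sets\student-mat.csv", low_memory=False)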

# counts and proportions for workday alcohol consumption
c1 = data1['Dalc'].value_counts(sort=False)
print(c1)
p1 = data1['Dalc'].value_counts(sort=False, normalize=True)
print(p1)

# counts and proportions for weekend alcohol consumption
c2 = data1['Walc'].value_counts(sort=False)
print(c2)
p2 = data1['Walc'].value_counts(sort=False, normalize=True)
print(p2)

# counts and proportions for health status
c4 = data1['health'].value_counts(sort=False)
print(c4)
p4 = data1['health'].value_counts(sort=False, normalize=True)
print(p4)

# missing values per column, then (rows, columns)
print(data1.isnull().sum())
print(data1.shape)
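For readers unfamiliar with the null check, here is a small sketch of what isnull().sum() returns. The frame is invented and deliberately contains one gap, unlike the real file.

```python
import numpy as np
import pandas as pd

# Invented rows; the real student-mat.csv is not loaded here.
demo = pd.DataFrame({
    "Dalc": [1, 2, 1],
    "Walc": [1, np.nan, 3],
    "health": [5, 3, 4],
})

# isnull() marks each cell True/False; sum() counts the True values per column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# shape is (rows, columns).
print(demo.shape)
```

A column of zeros from this check means no value counted by pandas as missing, though coded placeholders such as 99 would not be caught.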

Output of selected variables

Dalc: workday alcohol consumption

1    276
2     75
3     26
4      9
5      9
Name: Dalc, dtype: int64

1    0.698734
2    0.189873
3    0.065823
4    0.022785
5    0.022785
Name: Dalc, dtype: float64

Walc: weekend alcohol consumption

1    151
2     85
3     80
4     51
5     28
Name: Walc, dtype: int64

1    0.382278
2    0.215190
3    0.202532
4    0.129114
5    0.070886
Name: Walc, dtype: float64

health: current health status

1     47
2     45
3     91
4     66
5    146
Name: health, dtype: int64

1    0.118987
2    0.113924
3    0.230380
4    0.167089
5    0.369620
Name: health, dtype: float64

The null check reports no NaN values: every column is fully populated.

Later in the analysis I may look at the correlation between these variables, and I will select other variables too if needed.

codersarea-blog · 5 years ago

Research Question

Hi all, hope you are all doing well.

Here I am using a data set of my own choosing, STUDENT-ALCOHOL-CONSUMPTION from Kaggle.

With it I want to predict the performance of a student based on the variables in the data set.

Codebook

Content:

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)

sex - student's sex (binary: 'F' - female or 'M' - male)

age - student's age (numeric: from 15 to 22)

address - student's home address type (binary: 'U' - urban or 'R' - rural)

famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)

Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)

Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')

guardian - student's guardian (nominal: 'mother', 'father' or 'other')

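traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)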

failures - number of past class failures (numeric: n if 1<=n<3, else 4)

schoolsup - extra educational support (binary: yes or no)

famsup - family educational support (binary: yes or no)

paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

activities - extra-curricular activities (binary: yes or no)

nursery - attended nursery school (binary: yes or no)

higher - wants to take higher education (binary: yes or no)

internet - Internet access at home (binary: yes or no)

romantic - with a romantic relationship (binary: yes or no)

famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

freetime - free time after school (numeric: from 1 - very low to 5 - very high)

goout - going out with friends (numeric: from 1 - very low to 5 - very high)

Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

health - current health status (numeric: from 1 - very bad to 5 - very good)

absences - number of school absences (numeric: from 0 to 93)

These grades are related with the course subject, Math or Portuguese:

G1 - first period grade (numeric: from 0 to 20)

G2 - second period grade (numeric: from 0 to 20)

G3 - final grade (numeric: from 0 to 20, output target)

Additional note: there are several (382) students that belong to both datasets. These students can be identified by searching for identical attributes that characterize each student, as shown in the annexed R file.
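The annexed R file is not reproduced in this post, but the idea of matching students on identical attributes can be sketched in pandas with an inner merge. The rows below are invented, and the four key columns are an assumption chosen only to keep the example short; a real match would need enough attributes to identify a student.

```python
import pandas as pd

# Assumed identifying columns (illustrative subset of the codebook attributes).
key_cols = ["school", "sex", "age", "address"]

# Invented rows standing in for student-mat.csv and student-por.csv.
mat = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [15, 16, 17],
    "address": ["U", "U", "R"],
    "G3": [10, 12, 8],
})
por = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "sex": ["F", "F", "M"],
    "age": [15, 17, 18],
    "address": ["U", "R", "R"],
    "G3": [13, 11, 9],
})

# An inner merge keeps only rows whose key columns match in both frames;
# suffixes distinguish the Math and Portuguese final grades.
both = mat.merge(por, on=key_cols, suffixes=("_mat", "_por"))
print(both)
```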

Source Information

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Fabio Pagnotta, Hossain Mohammad Amran. Email:[email protected], mohammadamra.hossain '@' studenti.unicam.it University Of Camerino

https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION

My research question: How does a student's alcohol consumption relate to their performance?

Second question: How does a student's alcohol consumption relate to their health?

Hypothesis: Based on what I have observed, a student's health and study performance change with their alcohol consumption.
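One possible first look at this hypothesis is to compare the average final grade and health rating across alcohol consumption levels. The sketch below uses invented numbers purely to show the mechanics; it says nothing about the real students, and the column names Dalc, G3 and health are taken from the codebook above.

```python
import pandas as pd

# Invented rows; not drawn from the actual data set.
demo = pd.DataFrame({
    "Dalc": [1, 1, 2, 3, 5, 5],
    "G3": [14, 12, 11, 10, 8, 6],
    "health": [5, 4, 3, 3, 2, 1],
})

# Mean final grade and mean health rating at each workday consumption level.
means = demo.groupby("Dalc")[["G3", "health"]].mean()
print(means)

# Pearson correlation between consumption level and final grade.
corr = demo["Dalc"].corr(demo["G3"])
print(corr)
```

A group comparison like this only describes an association; it cannot by itself show that alcohol consumption causes lower grades or poorer health.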
