codersarea-blog
Sai Kumar Dintakurthi
codersarea-blog · 5 years ago
Visualizing Data
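import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# raw string, so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r"C:\Users\Sai Kumar\PycharmProjects\Data Analysis\data_sets\nesarc_pds.csv", low_memory=False)

# bug fix for display formats
pd.set_option('display.float_format', lambda x: '%f' % x)

# setting variables to numeric (errors='coerce' turns blank entries into NaN)
data['TAB12MDX'] = pd.to_numeric(data['TAB12MDX'], errors='coerce')
data['CHECK321'] = pd.to_numeric(data['CHECK321'], errors='coerce')
data['S3AQ3B1'] = pd.to_numeric(data['S3AQ3B1'], errors='coerce')
data['S3AQ3C1'] = pd.to_numeric(data['S3AQ3C1'], errors='coerce')
data['AGE'] = pd.to_numeric(data['AGE'], errors='coerce')

# subset data to young adults age 18 to 25 who have smoked in the past 12 months
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 25) & (data['CHECK321'] == 1)]

# make a copy of my new subsetted data
sub2 = sub1.copy()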
# SETTING MISSING DATA
# recode missing values to python missing (NaN)
sub2['S3AQ3B1'] = sub2['S3AQ3B1'].replace(9, np.nan)
sub2['S3AQ3C1'] = sub2['S3AQ3C1'].replace(99, np.nan)

# recode S3AQ3B1 into USFREQ, so that higher values mean more frequent smoking
recode1 = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}
sub2['USFREQ'] = sub2['S3AQ3B1'].map(recode1)

# recode S3AQ3B1 into USFREQMO, the approximate number of days smoked per month
recode2 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQMO'] = sub2['S3AQ3B1'].map(recode2)

# a secondary variable multiplying the number of days smoked/month and the approx number of cig smoked/day
sub2['NUMCIGMO_EST'] = sub2['USFREQMO'] * sub2['S3AQ3C1']

# univariate bar graph for categorical variables
# first change format from numeric to categorical
sub2["TAB12MDX"] = sub2["TAB12MDX"].astype('category')

sns.countplot(x="TAB12MDX", data=sub2)
plt.xlabel('Nicotine Dependence past 12 months')
plt.title('Nicotine Dependence in the Past 12 Months Among Young Adult Smokers in the NESARC Study')
plt.show()
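outputs:
[Two figures, images not preserved. The code above draws a count plot of TAB12MDX titled 'Nicotine Dependence in the Past 12 Months Among Young Adult Smokers in the NESARC Study'.]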
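The recoding steps in this post are easier to follow on a handful of rows. The sketch below uses a made-up four-row table (the values are hypothetical, only the column names and codes come from the post) to show how the unknown codes 9 and 99 become NaN and how NaN then carries through into the monthly estimate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the codes mimic S3AQ3B1 (smoking frequency)
# and S3AQ3C1 (cigarettes per day) from the post.
toy = pd.DataFrame({
    "S3AQ3B1": [1, 3, 6, 9],      # 9 = unknown
    "S3AQ3C1": [20, 5, 99, 10],   # 99 = unknown
})

# Unknown codes become NaN so they drop out of later arithmetic.
toy["S3AQ3B1"] = toy["S3AQ3B1"].replace(9, np.nan)
toy["S3AQ3C1"] = toy["S3AQ3C1"].replace(99, np.nan)

# Same mapping as recode2 in the post: frequency code -> days smoked per month.
days_per_month = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
toy["USFREQMO"] = toy["S3AQ3B1"].map(days_per_month)

# Days per month times cigarettes per day; NaN in either factor gives NaN.
toy["NUMCIGMO_EST"] = toy["USFREQMO"] * toy["S3AQ3C1"]
print(toy)
```

Only the first two rows get an estimate; the other two have an unknown value in one of the inputs, so the product stays missing.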
codersarea-blog · 5 years ago
Coding valid data
import numpy as np
import pandas as pd

# raw string, so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r"C:\Users\Sai Kumar\PycharmProjects\Data Analysis\data_sets\nesarc_pds.csv", low_memory=False)

# bug fix for display formats
pd.set_option('display.float_format', lambda x: '%f' % x)

# setting variables to numeric (errors='coerce' turns blank entries into NaN)
data['TAB12MDX'] = pd.to_numeric(data['TAB12MDX'], errors='coerce')
data['CHECK321'] = pd.to_numeric(data['CHECK321'], errors='coerce')
data['S3AQ3B1'] = pd.to_numeric(data['S3AQ3B1'], errors='coerce')
data['S3AQ3C1'] = pd.to_numeric(data['S3AQ3C1'], errors='coerce')
data['AGE'] = pd.to_numeric(data['AGE'], errors='coerce')

# subset data to young adults age 18 to 25 who have smoked in the past 12 months
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 25) & (data['CHECK321'] == 1)]

# make a copy of my new subsetted data
sub2 = sub1.copy()
c1 = sub2['S3AQ3B1'].value_counts(sort=False, dropna=False)
print(c1)

# coding missing values
sub2['S3AQ3B1'] = sub2['S3AQ3B1'].replace(9, np.nan)
sub2['S3AQ3C1'] = sub2['S3AQ3C1'].replace(99, np.nan)

c2 = sub2['S3AQ3B1'].value_counts(sort=False, dropna=False)
print(c2)

# coding in valid data
# note: S2AQ8A is not converted with pd.to_numeric above; in the output it is still
# dtype object, and the blank and 99 categories remain after these two recodes
# recoding missing values to numeric
sub2['S2AQ8A'] = sub2['S2AQ8A'].fillna(11)
# recode 99 value as missing
sub2['S2AQ8A'] = sub2['S2AQ8A'].replace(99, np.nan)

# check coding
chk = sub2['S2AQ8A'].value_counts(sort=False, dropna=False)
print(chk)
ds = sub2['S2AQ8A'].describe()
print(ds)

# recoding values for S3AQ3B1 into a new variable, USFREQ
recode1 = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}
sub2['USFREQ'] = sub2['S3AQ3B1'].map(recode1)

# recoding values for S3AQ3B1 into a new variable, USFREQMO
recode2 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQMO'] = sub2['S3AQ3B1'].map(recode2)

# secondary variable multiplying the number of days smoked/month and the approx number of cig smoked/day
sub2['NUMCIGMO_EST'] = sub2['USFREQMO'] * sub2['S3AQ3C1']
# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('AGE - 4 categories - quartiles')
sub2['AGEGROUP4'] = pd.qcut(sub2.AGE, 4, labels=["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"])
c4 = sub2['AGEGROUP4'].value_counts(sort=False, dropna=True)
print(c4)

# categorize quantitative variable based on customized splits using cut function
# splits into 3 groups (18-20, 21-22, 23-25); each bin excludes its left edge and
# includes its right edge, so the first edge is 17 rather than 18
sub2['AGEGROUP3'] = pd.cut(sub2.AGE, [17, 20, 22, 25])
c5 = sub2['AGEGROUP3'].value_counts(sort=False, dropna=True)
print(c5)

# crosstabs evaluating which ages were put into which AGEGROUP3
print(pd.crosstab(sub2['AGEGROUP3'], sub2['AGE']))

# frequency distribution for AGEGROUP3
print('counts for AGEGROUP3')
c10 = sub2['AGEGROUP3'].value_counts(sort=False)
print(c10)

print('percentages for AGEGROUP3')
p10 = sub2['AGEGROUP3'].value_counts(sort=False, normalize=True)
print(p10)
outputs
nan         25080
1.000000    14836
4.000000      747
2.000000      460
9.000000      102
5.000000      409
3.000000      687
6.000000      772
Name: S3AQ3B1, dtype: int64

c2 = sub2['S3AQ3B1'].value_counts(sort=False, dropna = False)
print(c2)

1.000000    1320
2.000000      68
4.000000      88
3.000000      91
5.000000      65
6.000000      71
9.000000       3
Name: S3AQ3B1, dtype: int64

1.000000    1320
2.000000      68
4.000000      88
3.000000      91
5.000000      65
6.000000      71
nan            3
Name: S3AQ3B1, dtype: int64

99      8
9     134
8      85
4     229
6     248
1      76
      180
10    118
5     216
7     134
3     194
2      84
Name: S2AQ8A, dtype: int64

ds = sub2['S2AQ8A'].describe()
print(ds)

count     1706
unique      12
top          6
freq       248
Name: S2AQ8A, dtype: object

AGE - 4 categories - quartiles
1=0%tile     582
2=25%tile    467
3=50%tile    231
4=75%tile    426
Name: AGEGROUP4, dtype: int64

(17, 20]    582
(20, 22]    467
(22, 25]    657
Name: AGEGROUP3, dtype: int64

AGE         18   19   20   21   22   23   24   25
AGEGROUP3
(17, 20]   161  200  221    0    0    0    0    0
(20, 22]     0    0    0  239  228    0    0    0
(22, 25]     0    0    0    0    0  231  241  185

counts for AGEGROUP3
(17, 20]    582
(20, 22]    467
(22, 25]    657
Name: AGEGROUP3, dtype: int64

percentages for AGEGROUP3
(17, 20]   0.341149
(20, 22]   0.273740
(22, 25]   0.385111
Name: AGEGROUP3, dtype: float64
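The interval labels in the AGEGROUP3 output, such as (17, 20], follow from how pd.cut treats bin edges: by default the left edge is excluded and the right edge is included. A small sketch with made-up ages shows this, along with the optional labels argument for anyone who prefers readable group names. The ages and the label strings are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ages covering every bin boundary.
ages = pd.Series([18, 20, 21, 22, 23, 25], name="AGE")

# Edges 17, 20, 22, 25 give the intervals (17, 20], (20, 22], (22, 25].
groups = pd.cut(ages, [17, 20, 22, 25]).rename("AGEGROUP3")
print(pd.crosstab(groups, ages))

# Same bins, with readable labels instead of interval notation.
labelled = pd.cut(ages, [17, 20, 22, 25], labels=["18-20", "21-22", "23-25"])
print(labelled.value_counts(sort=False))
```

Age 20 lands in the first group and age 22 in the second, which matches the crosstab in the output above.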
codersarea-blog · 5 years ago
Frequency Analysis of selected Variables
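import pandas as pd

# raw string, so the backslashes in the Windows path are not read as escape sequences
data1 = pd.read_csv(r"C:\Users\Sai Kumar\PycharmProjects\Data Analysis\data_sets\student-mat.csv", low_memory=False)

# counts and proportions for workday alcohol consumption
c1 = data1['Dalc'].value_counts(sort=False)
print(c1)
p1 = data1['Dalc'].value_counts(sort=False, normalize=True)
print(p1)

# counts and proportions for weekend alcohol consumption
c2 = data1['Walc'].value_counts(sort=False)
print(c2)
p2 = data1['Walc'].value_counts(sort=False, normalize=True)
print(p2)

# counts and proportions for health status
c4 = data1['health'].value_counts(sort=False)
print(c4)
p4 = data1['health'].value_counts(sort=False, normalize=True)
print(p4)

# number of missing values per column, and the shape of the data set
print(data1.isnull().sum())
print(data1.shape)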
Output of selected variables

Dalc: workday alcohol consumption
1    276
2     75
3     26
4      9
5      9
Name: Dalc, dtype: int64
1    0.698734
2    0.189873
3    0.065823
4    0.022785
5    0.022785
Name: Dalc, dtype: float64

Walc: weekend alcohol consumption
1    151
2     85
3     80
4     51
5     28
Name: Walc, dtype: int64
1    0.382278
2    0.215190
3    0.202532
4    0.129114
5    0.070886
Name: Walc, dtype: float64

health: current health status
1     47
2     45
3     91
4     66
5    146
Name: health, dtype: int64
1    0.118987
2    0.113924
3    0.230380
4    0.167089
5    0.369620
Name: health, dtype: float64
The null-value check shows no NaN values: every column is fully populated.
In later steps I may look at the correlation between these variables, and I will select other variables too if needed.
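The counts and proportions above are printed as separate series. As a minimal sketch, they can also be combined into one table per variable. The toy values below are hypothetical and only borrow the Dalc column name and its 1 to 5 scale.

```python
import pandas as pd

# Hypothetical scores on the 1-5 scale used by Dalc, Walc and health.
toy = pd.DataFrame({"Dalc": [1, 1, 1, 2, 2, 3, 1, 5]})

# Counts and proportions, both ordered by category value.
counts = toy["Dalc"].value_counts(sort=False).sort_index()
shares = toy["Dalc"].value_counts(normalize=True).sort_index()

# One table with a count column and a percent column.
summary = pd.DataFrame({"count": counts, "percent": (shares * 100).round(1)})
print(summary)
print("missing values:", toy["Dalc"].isnull().sum())
```

Levels that never occur (4 in this toy data) simply do not appear in the table, which is worth remembering when comparing variables side by side.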
codersarea-blog · 5 years ago
Research Question
Hi all, hope all are doing well.
Here I'm selecting my own data set, named STUDENT-ALCOHOL-CONSUMPTION, from Kaggle.
With it I want to predict the performance of a student based on the variables in the data set.
Codebook
Content:
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
sex - student's sex (binary: 'F' - female or 'M' - male)
age - student's age (numeric: from 15 to 22)
address - student's home address type (binary: 'U' - urban or 'R' - rural)
famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
guardian - student's guardian (nominal: 'mother', 'father' or 'other')
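traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)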
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
These grades are related with the course subject, Math or Portuguese:
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)
Additional note: there are several (382) students that belong to both datasets. These students can be identified by searching for identical attributes that characterize each student, as shown in the annexed R file.
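The annexed R file is not reproduced here, but the matching idea in the note can be sketched in pandas: join the two tables on the attributes that describe a student and keep the rows that appear in both. The two miniature tables and the choice of key columns below are assumptions for illustration; the real files have many more columns, and the R file may use a different key list.

```python
import pandas as pd

# Hypothetical miniature versions of student-mat.csv and student-por.csv.
mat = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [17, 16, 18],
    "address": ["U", "U", "R"],
    "G3": [10, 14, 8],
})
por = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "sex": ["F", "F", "M"],
    "age": [17, 18, 17],
    "address": ["U", "R", "R"],
    "G3": [12, 11, 15],
})

# Assumption: in this toy data these columns are enough to identify a student.
keys = ["school", "sex", "age", "address"]

# An inner merge keeps only students whose key values appear in both tables.
both = mat.merge(por, on=keys, suffixes=("_mat", "_por"))
print(both)
print("students in both tables:", len(both))
```

With too few key columns, different students can match by coincidence, so the key list should be as long as the shared attributes allow.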
Source Information
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
Fabio Pagnotta, Hossain Mohammad Amran. Email:[email protected], mohammadamra.hossain '@' studenti.unicam.it University Of Camerino
https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION
My research question: how does student performance vary with alcohol consumption?
Second question: how does students' health vary with their alcohol consumption?
Hypothesis: based on what I have observed, a student's health and study performance change with their alcohol consumption.
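A first look at this hypothesis could compare the final grade and health score across consumption levels. The sketch below runs on made-up rows that were deliberately built with a negative relationship, so it only shows the mechanics and says nothing about the real data. Column names follow the codebook.

```python
import pandas as pd

# Hypothetical rows using the codebook's column names; not real data.
toy = pd.DataFrame({
    "Dalc": [1, 1, 2, 3, 4, 5],
    "Walc": [1, 2, 2, 4, 4, 5],
    "health": [5, 4, 4, 3, 2, 1],
    "G3": [15, 14, 12, 10, 9, 6],
})

# Mean final grade and health score at each workday-consumption level.
by_dalc = toy.groupby("Dalc")[["G3", "health"]].mean()
print(by_dalc)

# Pairwise Pearson correlations between consumption, health and grade.
print(toy.corr().round(2))
```

On the real file, the same two calls applied to data1 would give a starting point, keeping in mind that correlation on 1 to 5 ordinal scales is only a rough summary.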