codingwiz-blog1 - Tumblr blog

codingwiz-blog1 · 5 years ago


Assignment-3

import pandas as pd
import numpy as np

In [2]:

data = pd.read_csv("nesarc_pds.csv", low_memory=False)

In [3]:

# Upper-case all column names
data.columns = data.columns.str.upper()

In [4]:

data.head()

Out[4]:

   UNNAMED: 0  ETHRACE2A  ETOTLCA2  IDNUM    PSU  STRATUM       WEIGHT  CDAY  CMON  CYEAR  …  SOLP12ABDEP  HAL12ABDEP  HALP12ABDEP  MAR12ABDEP  MARP12ABDEP  HER12ABDEP  HERP12ABDEP  OTHB12ABDEP  OTHBP12ABDEP  NDSYMPTOMS
0           0          5                1   4007      403  3928.613505    14     8   2001  …            0           0            0           0            0           0            0            0             0         NaN
1           1          5    0.0014      2   6045      604  3638.691845    12     1   2002  …            0           0            0           0            0           0            0            0             0         NaN
2           2          5                3  12042     1218  5779.032025    23    11   2001  …            0           0            0           0            0           0            0            0             0         NaN
3           3          5                4  17099     1704  1071.754303     9     9   2001  …            0           0            0           0            0           0            0            0             0         NaN
4           4          2                5  17099     1704  4986.952377    18    10   2001  …            0           0            0           0            0           0            0            0             0         NaN

5 rows × 3010 columns

Variables I'll be taking from the codebook:

CONSUMER = Drinking status.

S1Q2C2 = Raised by relatives before age 18.

SMOKER = Tobacco use status.

S3AQ52 = Age started smoking cigars every day.

S2AQ19 = Age at start of period of heaviest drinking.

NOTE:

Since I have not used the Spyder IDE, the code syntax differs slightly from the video lectures. I have added comments for reference wherever needed, so each step should be easy to follow.

In [5]:

sub= data[['CONSUMER','S1Q2C2', 'SMOKER', 'S3AQ52', 'S2AQ19']]

In [6]:

sub1 = sub.copy()

In [7]:

sub1.head()

Out[7]:

   CONSUMER  S1Q2C2  SMOKER  S3AQ52  S2AQ19
0         3               3
1         1               3              21
2         3               3
3         2               3              16
4         2               3              18

CONSUMER:

1. Current drinker
2. Ex-drinker
3. Lifetime abstainer

Since this ordering is counterintuitive, we can recode it to:

0) Lifetime abstainer (someone who has never drunk)
1) Ex-drinker
2) Current drinker

In [8]:

print("Before labels (CONSUMER) : ")
print(sorted(sub1['CONSUMER'].unique()))

Before labels (CONSUMER) :
[1, 2, 3]

In [9]:

# Reverse the coding: 1 -> 2, 2 -> 1, 3 -> 0
def recode1(val):
    if val == 1:
        return 2
    if val == 2:
        return 1
    if val == 3:
        return 0

In [10]:

sub1['CONSUMER_NEWL'] = sub1['CONSUMER'].apply(recode1)

In [11]:

print("After labels (CONSUMER_NEWL) : ")
print(sorted(sub1['CONSUMER_NEWL'].unique()))

After labels (CONSUMER_NEWL) :
[0, 1, 2]
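The same recode can also be written as a dictionary lookup instead of a chain of `if` statements. This is only a minimal sketch on a small made-up Series (the values are illustrative, not NESARC rows); `recode_map` is a hypothetical name, and `Series.map` is swapped in here for `apply`.

```python
import pandas as pd

# Illustrative stand-in for sub1['CONSUMER'] (made-up values, not NESARC rows).
consumer = pd.Series([3, 1, 3, 2, 2], name="CONSUMER")

# Hypothetical lookup table: old code -> new code.
recode_map = {1: 2, 2: 1, 3: 0}

# map() looks each value up in the dictionary.
consumer_newl = consumer.map(recode_map)
print(sorted(consumer_newl.unique()))
```

With `map`, any code that is missing from the dictionary becomes NaN, which makes unexpected values easy to spot.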

In [12]:

sub1.head()

Out[12]:

   CONSUMER  S1Q2C2  SMOKER  S3AQ52  S2AQ19  CONSUMER_NEWL
0         3               3                              0
1         1               3              21              2
2         3               3                              0
3         2               3              16              1
4         2               3              18              1

SMOKER:

1. Current user
2. Ex-user
3. Lifetime nonsmoker

Since this ordering is also counterintuitive, we can recode it to:

0) Lifetime nonsmoker
1) Ex-user
2) Current user

In [13]:

print("Before labels (SMOKER) : ")
print(sorted(sub1['SMOKER'].unique()))

Before labels (SMOKER) :
[1, 2, 3]

In [14]:

# Reusing the 'recode1' function defined above.
sub1['SMOKER_NEWL'] = sub1['SMOKER'].apply(recode1)

In [15]:

sub1.head()

Out[15]:

   CONSUMER  S1Q2C2  SMOKER  S3AQ52  S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3               3                              0            0
1         1               3              21              2            0
2         3               3                              0            0
3         2               3              16              1            0
4         2               3              18              1            0

In [16]:

print("After labels (SMOKER_NEWL) : ")
print(sorted(sub1['SMOKER_NEWL'].unique()))

After labels (SMOKER_NEWL) :
[0, 1, 2]

In [17]:

# Reorder the columns so the original and recoded variables are grouped together
columnsTitles = ['CONSUMER', 'SMOKER', 'S1Q2C2', 'S3AQ52', 'S2AQ19', 'CONSUMER_NEWL', 'SMOKER_NEWL']
sub1 = sub1.reindex(columns=columnsTitles)

In [18]:

sub1.head()

Out[18]:

   CONSUMER  SMOKER  S1Q2C2  S3AQ52  S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3       3                                      0            0
1         1       3                      21              2            0
2         3       3                                      0            0
3         2       3                      16              1            0
4         2       3                      18              1            0

In [19]:

sub1['CONSUMER_NEWL'].value_counts(sort=False)

Out[19]:

0     8266
1     7881
2    26946
Name: CONSUMER_NEWL, dtype: int64

In [20]:

sub1['SMOKER_NEWL'].value_counts(sort=False)

Out[20]:

0    23901
1     8074
2    11118
Name: SMOKER_NEWL, dtype: int64
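Alongside raw counts, a frequency table often reports proportions as well. Below is a minimal sketch on made-up values, assuming the recoded column holds the codes 0, 1 and 2 as above; `sort_index()` is added so the rows come out in code order.

```python
import pandas as pd

# Illustrative stand-in for sub1['SMOKER_NEWL'] (made-up values).
smoker_newl = pd.Series([0, 0, 2, 1, 0, 2, 0, 1], name="SMOKER_NEWL")

# Counts per code, ordered by code rather than by frequency.
counts = smoker_newl.value_counts().sort_index()

# Same table as proportions of all rows.
shares = smoker_newl.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```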

Managing variable - S3AQ52 (AGE STARTED SMOKING CIGARS EVERY DAY)

In [21]:

sub1['S3AQ52'].unique()

Out[21]:

array([' ', '21', '16', '20', '30', '40', '17', '25', '15', '35', '38', '37', '26', '53', '24', '54', '18', '28', '55', '45', '32', '22', '48', '39', '50', '34', '99', '36', '12', '60', '42', '51', '23', '64', '47', '29', '19', '9', '70', '41', '52', '33', '46', '31', '59', '8', '10', '44', '43', '65', '57', '69', '58', '27', '66', '14', '84', '5', '11', '13', '49', '62', '63', '80', '56'], dtype=object)

In [22]:

sub1[sub1['S3AQ52']==" "]

Out[22]:

       CONSUMER  SMOKER  S1Q2C2  S3AQ52  S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
    0         3       3                                      0            0
    1         1       3                      21              2            0
    2         3       3                                      0            0
    3         2       3                      16              1            0
    4         2       3                      18              1            0
    …
43088         3       3                                      0            0
43089         1       3                      18              2            0
43090         1       1                      17              2            2
43091         1       1       2              24              2            2
43092         2       3                      17              1            0

42374 rows × 7 columns

In [23]:

# Convert blank values (people who never smoked) to 0
sub1.loc[sub1['S3AQ52']==" ", 'S3AQ52'] = 0

In [24]:

# Convert the string values in the column to numeric
sub1['S3AQ52'] = pd.to_numeric(sub1['S3AQ52'])

In [25]:

sub1['S3AQ52'].unique()

Out[25]:

array([ 0, 21, 16, 20, 30, 40, 17, 25, 15, 35, 38, 37, 26, 53, 24, 54, 18, 28, 55, 45, 32, 22, 48, 39, 50, 34, 99, 36, 12, 60, 42, 51, 23, 64, 47, 29, 19, 9, 70, 41, 52, 33, 46, 31, 59, 8, 10, 44, 43, 65, 57, 69, 58, 27, 66, 14, 84, 5, 11, 13, 49, 62, 63, 80, 56], dtype=int64)

In [26]:

# Convert 99 (people who did not answer this question in the survey) to NaN.
sub1.loc[sub1['S3AQ52']==99, 'S3AQ52'] = np.nan

In [27]:

sub1['S3AQ52'].unique()

Out[27]:

array([ 0., 21., 16., 20., 30., 40., 17., 25., 15., 35., 38., 37., 26., 53., 24., 54., 18., 28., 55., 45., 32., 22., 48., 39., 50., 34., nan, 36., 12., 60., 42., 51., 23., 64., 47., 29., 19., 9., 70., 41., 52., 33., 46., 31., 59., 8., 10., 44., 43., 65., 57., 69., 58., 27., 66., 14., 84., 5., 11., 13., 49., 62., 63., 80., 56.])

In [28]:

sub1.head()

Out[28]:

   CONSUMER  SMOKER  S1Q2C2  S3AQ52  S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3       3             0.0                      0            0
1         1       3             0.0      21              2            0
2         3       3             0.0                      0            0
3         2       3             0.0      16              1            0
4         2       3             0.0      18              1            0

Now, Column S3AQ52 is managed and prepared.
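The three steps above (blank to 0, string to number, 99 to NaN) can also be chained into one expression. This is only a sketch on made-up values, not the notebook's actual cells; `raw` and `cleaned` are hypothetical names, and `mask` is used here in place of the `.loc` assignment.

```python
import pandas as pd

# Illustrative stand-in for the raw S3AQ52 column (made-up values).
raw = pd.Series([" ", "21", "16", "99", " "], name="S3AQ52")

cleaned = (
    pd.to_numeric(raw.replace(" ", "0"))  # blank -> "0", then strings -> integers
      .mask(lambda s: s == 99)            # 99 (no answer) -> NaN; dtype becomes float
)
print(cleaned)
```

Because NaN is a floating-point value, the column switches from integer to float once a NaN is introduced, which is why the ages display as `21.0`, `16.0` and so on afterwards.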

Managing variable - S2AQ19 (AGE AT START OF PERIOD OF HEAVIEST DRINKING)

In [29]:

sub1['S2AQ19'].unique()

Out[29]:

array([' ', '21', '16', '18', '30', '17', '28', '43', '26', '23', '20', '51', '19', '40', '35', '27', '42', '22', '15', '36', '25', '24', '68', '99', '29', '52', '31', '33', '57', '38', '39', '32', '90', '49', '50', '37', '34', '59', '63', '58', '55', '53', '79', '56', '77', '41', '64', '8', '73', '6', '70', '13', '72', '44', '47', '54', '14', '46', '48', '61', '65', '10', '76', '69', '5', '45', '71', '60', '67', '12', '62', '74', '86', '66', '81', '82', '9', '75', '83', '80', '78', '7', '87', '11', '85', '84', '91', '88'], dtype=object)

In [30]:

# People who are lifetime abstainers
sub1[sub1['S2AQ19']==" "]

Out[30]:

       CONSUMER  SMOKER  S1Q2C2  S3AQ52  S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
    0         3       3             0.0                      0            0
    2         3       3             0.0                      0            0
   22         3       1             0.0                      0            2
   23         3       3             0.0                      0            0
   26         3       3             0.0                      0            0
    …
43070         3       1             0.0                      0            2
43071         3       3             0.0                      0            0
43072         3       3             0.0                      0            0
43082         3       3             0.0                      0            0
43088         3       3             0.0                      0            0

8266 rows × 7 columns

In [31]:

# Convert blank values (people who are lifetime abstainers) to 0
sub1.loc[sub1['S2AQ19']==" ", 'S2AQ19'] = 0

In [32]:

# Convert the string values in the column to numeric
sub1['S2AQ19'] = pd.to_numeric(sub1['S2AQ19'])

In [33]:

sub1['S2AQ19'].unique()

Out[33]:

In [34]:

sub1.head()

Out[34]:

   CONSUMER  SMOKER  S1Q2C2  S3AQ52  S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3       3             0.0       0              0            0
1         1       3             0.0      21              2            0
2         3       3             0.0       0              0            0
3         2       3             0.0      16              1            0
4         2       3             0.0      18              1            0

Now, Column S2AQ19 is also managed and prepared.
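The values listed in Out[29] include '99'. Assuming 99 marks an unknown answer for S2AQ19 as it does for S3AQ52 (an assumption worth checking against the codebook), the same NaN conversion could be applied here too. A minimal sketch on made-up values, with `n_unknown` as a hypothetical helper variable:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the numeric S2AQ19 column (made-up values).
s2aq19 = pd.Series([0, 21, 99, 16, 18], name="S2AQ19")

# Count how many rows carry the assumed "unknown" code before replacing it.
n_unknown = int((s2aq19 == 99).sum())

# Replace the assumed "unknown" code with NaN.
s2aq19 = s2aq19.mask(s2aq19 == 99, np.nan)

print(n_unknown)
print(s2aq19)
```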

Managing variable - S1Q2C2 (RAISED BY RELATIVES BEFORE AGE 18)

In [35]:

sub1['S1Q2C2'].unique()

Out[35]:

array([' ', '1', '2', '9'], dtype=object)

RAISED BY RELATIVES BEFORE AGE 18

1. Yes
2. No
9. Unknown
BL. NA, lived with biological parent(s) before age 18

We can reduce this to three categories, since we only care about whether a person was raised by relatives:

1 -> 1 (Yes)
BL and 2 -> 0 (No, including NA: lived with biological parent(s) before age 18)
9 -> NaN (did not answer this question in the survey)

In [36]:

sub1['S1Q2C2'].value_counts(dropna=False)

Out[36]:

     41679
1      649
2      553
9      212
Name: S1Q2C2, dtype: int64

In [47]:

# Convert blank values (people raised by their biological parent(s)) to 0.
sub1.loc[sub1['S1Q2C2']==" ", 'S1Q2C2'] = 0

In [38]:

sub1['S1Q2C2'].value_counts(dropna=False)

Out[38]:

0    41679
1      649
2      553
9      212
Name: S1Q2C2, dtype: int64

In [39]:

# Convert value "2" (No) to 0.
sub1.loc[sub1['S1Q2C2']=="2", 'S1Q2C2'] = 0

In [40]:

# Convert value "9" (Unknown) to NaN.
sub1.loc[sub1['S1Q2C2']=="9", 'S1Q2C2'] = np.nan

In [43]:

sub1['S1Q2C2'].unique()

Out[43]:

array([0, 1, nan], dtype=object)

In [45]:

sub1['S1Q2C2'].value_counts(dropna=False)

Out[45]:

0.0    42232
1.0      649
NaN      212
Name: S1Q2C2, dtype: int64

In [46]:

sub1.head()

Out[46]:

   CONSUMER  SMOKER  S1Q2C2  S3AQ52  S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3       3       0     0.0       0              0            0
1         1       3       0     0.0      21              2            0
2         3       3       0     0.0       0              0            0
3         2       3       0     0.0      16              1            0
4         2       3       0     0.0      18              1            0

Now, Column S1Q2C2 is also managed and prepared.
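The S1Q2C2 recode can also be done in a single pass with a lookup table that covers all four raw codes. This is a sketch on made-up values rather than the notebook's own cells; `codes` and `recoded` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the raw S1Q2C2 column (made-up values).
raw = pd.Series([" ", "1", "2", "9", " "], name="S1Q2C2")

# Hypothetical lookup: blank and "2" -> 0 (No), "1" -> 1 (Yes), "9" -> NaN (Unknown).
codes = {" ": 0, "1": 1, "2": 0, "9": np.nan}

# The result is a float column because it contains NaN.
recoded = raw.map(codes)
print(recoded.value_counts(dropna=False))
```

Mapping everything at once also leaves the column numeric, so no separate string-to-number conversion is needed afterwards.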

We can now drop the original CONSUMER and SMOKER columns for further analysis.
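Removing the two original columns could look like the sketch below. It uses a tiny made-up frame in place of `sub1`, and `sub2` is a hypothetical name for the result.

```python
import pandas as pd

# Tiny made-up frame standing in for sub1.
sub1 = pd.DataFrame({
    "CONSUMER": [3, 1],
    "SMOKER": [3, 3],
    "CONSUMER_NEWL": [0, 2],
    "SMOKER_NEWL": [0, 0],
})

# drop() returns a new frame; sub1 itself is left unchanged.
sub2 = sub1.drop(columns=["CONSUMER", "SMOKER"])
print(sub2.columns.tolist())
```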

