bartimeux - Tumblr blog

bartimeux · 2 years ago

Final week!

Now let's display some charts to end this course.

First, an easy one:

    import pandas as pd
    import numpy as np
    import seaborn as sb
    import matplotlib.pyplot as plt

    data = pd.read_csv('nesarc_pds.csv', low_memory=False)

    # Blank answers and '99' answers become missing values
    data['S2AQ10'] = data['S2AQ10'].replace(' ', np.nan)
    data['S2AQ10'] = data['S2AQ10'].replace('99', np.nan)
    data['S2AQ10'] = data['S2AQ10'].astype(float)

    sb.countplot(x="S2AQ10", data=data)
    plt.show()

The other one is even easier:

    data['NUMREL'] = data['NUMREL'].astype(float)
    sb.countplot(x="NUMREL", data=data)
    plt.show()

Now let's draw a two-variable graph with these two variables:

    # Scatter plot only, without the fitted regression line
    cor = sb.regplot(x='NUMREL', y='S2AQ10', fit_reg=False, data=data)
    plt.show()

Here the aim was to show that there is no correlation between the number of relatives under the same roof and alcohol consumption. The graph is not very helpful, because there are neither enough people with a high number of relatives nor enough people with a heavy drinking habit to draw solid conclusions, but I think we can still say that no correlation is visible.

The only tentative conclusion we could draw is the following: the more relatives you live with, the lower your chance of being a heavy drinker.
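One way to check that last impression without squinting at a scatter plot would be to compute, for each household size, the share of respondents who got drunk at least once a week. This is only a sketch on a made-up toy table (the `toy` frame and the `weekly` column are invented for illustration, not NESARC data), and it assumes that codes 1 to 5 of S2AQ10 mean "at least once a week".

```python
import pandas as pd

# Made-up toy data, NOT taken from nesarc_pds.csv
toy = pd.DataFrame({
    "NUMREL": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "S2AQ10": [1, 5, 10, 11, 3, 9, 11, 11, 11, 10],
})

# Assumption: codes 1 to 5 stand for "at least once a week"
toy["weekly"] = toy["S2AQ10"] <= 5

# Share of weekly drinkers and group size for each number of relatives
summary = toy.groupby("NUMREL")["weekly"].agg(["mean", "size"])
print(summary)
```

The `size` column matters as much as the `mean`: with only a handful of people in the large households, the shares there are too noisy to support a firm conclusion.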

bartimeux · 2 years ago

Processing and displaying data

Following up on the previous post (https://www.tumblr.com/bartimeux/724463236514234368/frequency-distribution?source=share), let's now work on understanding and displaying the data in a clean way.

    d1 = df['NUMREL'].value_counts()
    print(d1)

    NUMREL
    1     13952
    2     12591
    3      6736
    4      5708
    5      2572
    6       963
    7       325
    8       152
    9        50
    10       25
    11        8
    12        6
    13        2
    15        1
    14        1
    17        1
    Name: count, dtype: int64

Here we can notice two things: the values are listed by frequency rather than in numerical order, and the value 16 doesn't appear at all. Let's fix this.

    d1 = df['NUMREL'].value_counts(normalize=True)
    d1[16] = 0
    print(d1.sort_index())

    NUMREL
    1     0.323765
    2     0.292182
    3     0.156313
    4     0.132458
    5     0.059685
    6     0.022347
    7     0.007542
    8     0.003527
    9     0.001160
    10    0.000580
    11    0.000186
    12    0.000139
    13    0.000046
    14    0.000023
    15    0.000023
    16    0.000000
    17    0.000023
    Name: proportion, dtype: float64
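Setting `d1[16]=0` by hand works when you already know which value is missing. A more general sketch, assuming the valid values form a continuous range, is to `reindex` the result over that range, which puts it in order and fills every gap at once. The `sizes` series below is toy data invented for the example.

```python
import pandas as pd

# Made-up household sizes with a gap at 3 (NOT NESARC data)
sizes = pd.Series([1, 1, 2, 4, 4, 4, 5, 1])

freq = sizes.value_counts(normalize=True)

# Reindex over the full range: absent values get a proportion of 0
full = freq.reindex(range(1, 6), fill_value=0.0)
print(full)
```

For the real column the range would presumably be `range(1, 18)`, since 17 is the largest value seen in the output above.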

Now let's move on to the other variable:

    d2 = df['S2AQ10'].value_counts()
    print(d2)

    S2AQ10
    11    17436
          16147
    10     4336
    9      1690
    7       902
    8       601
    6       539
    5       510
    4       272
    99      200
    3       184
    1       167
    2       109
    Name: count, dtype: int64

Here, the blank value could be mapped to 12 because it corresponds to someone who doesn't drink. And the value 99 can be dropped as it teaches us nothing.

    d2 = df['S2AQ10'].value_counts(normalize=True)
    d2 = d2.rename(index={' ':'12'})
    d2 = d2.drop('99')
    print(d2.sort_index())

    S2AQ10
    1     0.003875
    10    0.100620
    11    0.404613
    12    0.374701
    2     0.002529
    3     0.004270
    4     0.006312
    5     0.011835
    6     0.012508
    7     0.020931
    8     0.013947
    9     0.039218
    Name: proportion, dtype: float64

Here we can see that the values are sorted as strings, in alphanumerical order, and not as integers. Let's fix this.

    d2.index = d2.index.astype(int)
    print(d2.sort_index())

    S2AQ10
    1     0.003875
    2     0.002529
    3     0.004270
    4     0.006312
    5     0.011835
    6     0.012508
    7     0.020931
    8     0.013947
    9     0.039218
    10    0.100620
    11    0.404613
    12    0.374701
    Name: proportion, dtype: float64
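Note that these proportions were computed before 99 was dropped, so they add up to about 0.995 rather than 1. A possible alternative, sketched here on a made-up series rather than on the real column, is to clean the raw answers first and only then compute the distribution. It reuses the same choices as above (blank becomes 12, 99 is dropped); the names `raw` and `codes` are just for illustration.

```python
import pandas as pd

# Made-up raw answers stored as strings, NOT the real S2AQ10 column
raw = pd.Series(["11", " ", "10", "99", "2", " ", "11"])

codes = raw.replace(" ", "12")   # blank -> 12, same choice as in the post
codes = codes[codes != "99"]     # drop the uninformative 99 answers
codes = codes.astype(int)        # integers, so sorting is numeric

dist = codes.value_counts(normalize=True).sort_index()
print(dist)
```

Because the conversion to integers happens before counting, the index is already numeric and `sort_index()` gives the expected order directly.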

bartimeux · 2 years ago

Frequency distribution

Here we go again. Following this post https://www.tumblr.com/bartimeux/724462312027799552/lets-try-some-code?source=share I have another assignment. Now I have to examine frequency distributions.

So let's do it!

Here is my code:

    d1 = df['NUMREL'].value_counts(sort=True)
    print(d1)

    NUMREL
    1     8240
    2     7977
    3     4363
    4     3861
    5     1665
    6      540
    7      177
    8       74
    9       25
    10      15
    11       3
    12       3
    15       1
    14       1
    17       1
    Name: count, dtype: int64

As expected, living with a high number of relatives is rare. Most people live with few relatives.

    d2 = df['S2AQ10'].value_counts(sort=True)
    print(d2)

    S2AQ10
    11    17436
    10     4336
    9      1690
    7       902
    8       601
    6       539
    5       510
    4       272
    99      200
    3       184
    1       167
    2       109
    Name: count, dtype: int64

Now if I may say, 4.6% of the respondents counted here (about one in 20 people) seem to get drunk at least once a week. Wow, you guys have a problem.
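For anyone who wants to redo the arithmetic, here is a small sketch that recomputes that figure from the counts printed above. It assumes that codes 1 to 5 are the "at least once a week" answers, which is the reading that reproduces the 4.6%.

```python
# Counts copied from the S2AQ10 output above (code -> number of respondents)
counts = {11: 17436, 10: 4336, 9: 1690, 7: 902, 8: 601, 6: 539,
          5: 510, 4: 272, 99: 200, 3: 184, 1: 167, 2: 109}

# Assumption: codes 1 to 5 mean "at least once a week"
weekly = sum(n for code, n in counts.items() if code <= 5)
total = sum(counts.values())
share = 100 * weekly / total

print(f"{weekly} of {total} respondents = {share:.1f}%")
```

The total here still includes the 200 answers coded 99; leaving them out moves the share only slightly.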

bartimeux · 2 years ago

Let's try some code

Ok, so I've been following this Coursera course to prove to my team that I know how to code for data in Python, and now I have to perform a stupid assignment that anyone can review. You can tell by now that I'm pretty happy to start this blog, can't you?

Oh, also, my passion is music, so no offense but I'm only here for the money.

Now let's dive into the assignment itself. I chose to use the NESARC data. Don't ask me why, the choice was totally random.

Now what I want to check is whether the number of related persons in the household (NUMREL) has an effect on alcohol consumption, and more precisely on how often the respondent drank enough to feel intoxicated in the last 12 months (S2AQ10).

We could tend to think that when people live with a lot of relatives they may be in an unstable situation that could push them to drown themselves in alcohol. Let's try to work on this.

Here is my code:

    import pandas as pd
    import numpy as np

    # Importing data
    df = pd.read_csv('nesarc_pds.csv')

    # Focusing on interesting columns
    df = df[['NUMREL','S2AQ10']]

    # Cleaning the data: blanks become NaN, then incomplete rows are dropped
    df = df.replace(' ', np.nan)
    df = df.dropna()
    # Making sure both columns are numeric before correlating
    df = df.astype(float)

    # Processing the correlation
    df.corr()

              NUMREL    S2AQ10
    NUMREL  1.000000 -0.011635
    S2AQ10 -0.011635  1.000000

The coefficient is about -0.01, so we can see that there is essentially no linear correlation: in this data the number of relatives in the household does not appear to have an impact on alcohol consumption. Good!
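One caveat: the cleaning above only removes blank answers, so the 99 codes are still in the data when `df.corr()` runs, and a value that large can pull the coefficient around. Here is a minimal sketch of the effect on made-up numbers (the `toy` frame is hypothetical, not NESARC data):

```python
import pandas as pd

# Made-up data: S2AQ10 falls as NUMREL rises, plus one 99 answer
toy = pd.DataFrame({
    "NUMREL": [1, 2, 3, 4, 5],
    "S2AQ10": [11, 9, 7, 5, 99],
})

# Pearson correlation with the 99 code left in
r_with_99 = toy["NUMREL"].corr(toy["S2AQ10"])

# Same computation after dropping the rows coded 99
clean = toy[toy["S2AQ10"] != 99]
r_without_99 = clean["NUMREL"].corr(clean["S2AQ10"])

print(r_with_99, r_without_99)
```

On the toy data the sign of the coefficient flips once the 99 row is removed. Whether it changes anything on the real file is something to check, not something this sketch proves.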
