varun-data-analysis
varun-data-analysis
Travelling_Through_Data_Analysis
16 posts
Don't wanna be here? Send us removal request.
varun-data-analysis · 5 years ago
Text
Variables
Response Variable:
Outcome: Diabetes is positive or negative
Class variable (0 or 1)
0-negative 1-Positive
Explanatory Variable:
Pregnancies: Number of times pregnant
Blood Pressure: Diastolic blood pressure (mm Hg)
           scale: generally 0-200
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test 
          scale- 0-200
Age: Age (years) 
BMI: Body mass index (weight in kg/(height in m)^2) 
Insulin: 2-Hour serum insulin (mu U/ml)
Skin Thickness:Triceps skin fold thickness (mm)
Diabetes pedigree function : a function which scores likelihood of diabetes based on family history. It is in percentage.
           0-indicates no one in family is diabetic
           1- everyone in family is diabetic
0 notes
varun-data-analysis · 5 years ago
Text
Procedure
The NIDDK Data repository is a web-enabled resource cataloguing  clinical trial data and supporting information from NIDDK supported  studies. The Data Repository allows for the co-location of multiple  electronic datasets that were created as part of clinical  investigations. The Data Repository does not serve the role of a Data  Coordinating Center, but rather as a warehouse for the clinical findings  once the trials have been completed.
samples are collected from many of the studies, a data  management system for the cataloguing and retrieval of samples was  developed.
Basically, it is collected from multiple hospitals and stored in the repository. It is an observational data not Experimental data.
Keypoints:
The study that designed the data is through data reporting in various healthcare centers.
The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
samples are collected from many of the health care centres and stored in NIDDK repository and some statistical tools are used to shortlist the current observations.
The data was collected during 1995-1998.
0 notes
varun-data-analysis · 5 years ago
Text
Sample Description
Sample:
This dataset is originally from the National Institute of Diabetes and  Digestive and Kidney Diseases. The objective of the dataset is to  diagnostically predict whether or not a patient has diabetes, based on  certain diagnostic measurements included in the dataset. Several  constraints were placed on the selection of these instances from a  larger database. In particular, all patients in this data analytic sample are females at least  21 years old of Pima Indian heritage(N=768).
keypoints:
Diabetes was studied.
Females of age 1 and above of Pima Indian Heritage
NUmber of obervations, N=768
0 notes
varun-data-analysis · 5 years ago
Text
Course-3- Writing about your Data set
Assignment-1
Dataset: Pima Indians Diabetes Database.
Sample:
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients in this data analytic sample are females at least 21 years old of Pima Indian heritage(N=768).
Procedure:
The NIDDK Data repository is a web-enabled resource cataloguing clinical trial data and supporting information from NIDDK supported studies. The Data Repository allows for the co-location of multiple electronic datasets that were created as part of clinical investigations. The Data Repository does not serve the role of a Data Coordinating Center, but rather as a warehouse for the clinical findings once the trials have been completed.
samples are collected from many of the studies, a data management system for the cataloguing and retrieval of samples was developed.
Basically, it is collected from multiple hospitals and stored in the repository. It is an observational data not Experimental data.
Measures:
The datasets consists of several medical predictor variables and one target variable,Outcome. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
0 notes
varun-data-analysis · 5 years ago
Text
Regression Modelling in Practice
Hey guys!! It’s been a while since my last update I have successfully completed two courses in Data Analysis Master course. Now, I am stating a new course named”Regression Modelling In Practice”
Course content-
Introduction To Regression
Basics of Linear Regression
Multiple Regression
Logistics Regression.
First Assignment is writing about the data your analyzing. It includes:
Describing the Sample
Procedure of sample collection
Measures
I am selecting a new dataset of Diabetes.
0 notes
varun-data-analysis · 5 years ago
Text
Course2-Final Assignment-Moderator
Research Question- The effect of alcohol consumption(moderator) on association between Income of a person(Explanatory) and life expectancy(response).
Dataset: Gapminder
Income of a person(Explanatory) - Quantitative
life expectancy(response)- Quantitative
Alcohol Consumption- Quantitative---->binned into two groups.
1. Non-addicted
2.Addicted
Analysis type: Pearson Correlation (Q-Q)
Program:
import pandas as pd import seaborn import scipy.stats import matplotlib.pyplot as plt
data = pd.read_csv('gapfinder_pds.csv', low_memory=False) data['incomeperperson']=pd.to_numeric(data['incomeperperson'], errors='coerce') data['alcconsumption']=pd.to_numeric(data['alcconsumption'], errors='coerce') data['lifeexpectancy']=pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data_clean=data.dropna()
print ('association between incomeperperson and lifeexpectancy') print(scipy.stats.pearsonr(data_clean['incomeperperson'],data_clean['lifeexpectancy']))
def alcohol_group (row):    if row['alcconsumption'] <= 8:        return 'non-addicted'    elif row['alcconsumption'] > 8:        return 'addicted'
data_clean['drinkertype'] = data_clean.apply(lambda row:alcohol_group(row), axis=1)
chk1 = data_clean['drinkertype'].value_counts(sort=False, dropna=False) print(chk1)
sub1=data_clean[(data_clean['drinkertype']=='non-addicted')] sub2=data_clean[(data_clean['drinkertype']=='addicted')]
print ('association between incomeperperson and lifeexpectancy for non-alcohol addicted countries') print (scipy.stats.pearsonr(sub1['incomeperperson'], sub1['lifeexpectancy'])) print ('       ') print ('association between incomeperperson and lifeexpectancy for alcohol addicted countries') print (scipy.stats.pearsonr(sub2['incomeperperson'], sub2['lifeexpectancy']))
scat1 = seaborn.regplot(x="incomeperperson", y="lifeexpectancy", fit_reg=True,data=sub1) plt.xlabel('Incomeperperson') plt.ylabel('lifeexpectancy') plt.title('Scatterplot for the Association Between Income and life expectancy Rate for non-alcohol addicted countries') print (scat1)
scat2 = seaborn.regplot(x="incomeperperson", y="lifeexpectancy", fit_reg=True, data=sub2) plt.xlabel('Incomeperperson') plt.ylabel('lifeexpectancy') plt.title('Scatterplot for the Association Between Income and life expectancy Rate for alcohol addicted countries') print (scat2)
Output:
Tumblr media
Bi-variate graphs:
Tumblr media Tumblr media
Discussion:
1) The association between income of a person and life expectancy independent of moderator yielded:
r=0.5953 ----> Positive and moderately strong relationship
p-value= 8.87e-18 ---->Significant
2) The association between income of a person and life expectancy independent for non-alcohol addicted countries.
r=0.508----> Positive and moderately strong relationship
p-value=1.64e-08 ------> Significant
3) The association between income of a person and life expectancy independent for alcohol addicted countries.
r=0.609----> Positive and moderately strong relationship
p-value=1.47e-07 ------> Significant.
Conclusion:
1) Overall, all the three types of analyses are positive and strongly related. So, with or without alcohol addiction association between income and life expectancy are not effected much.
2) By observing both the plots, one can say that parabola would fit the points more effective than straight line.
0 notes
varun-data-analysis · 5 years ago
Text
Assignment3- Pearson Correlation
Research Question- Association between Income of a person and type of Alcohol consumer.
Dataset: Gapminder
Null Hypothesis: 1)There is no association between income of a person and quantity of alcohol consumed.
2) There is no association between employment rate of a country and average quantity of alcohol consumed per person.
Alternate Hypothesis: There is an association between them.
Program:
Tumblr media
Bivariate scatter plots:
1) scatter plot -1
Tumblr media
2) Scatter plot-2
Tumblr media
Output:
association between incomeperperson and alcoholconsumption (0.28864262546557856, 0.00019533523307706355) association between employrate and alcoholconsumption (-0.118889517968538, 0.13184952360331226)
Conclusion:
1) r=0.2886-------> Positive but Weak relationship
    p-value= 0.00009 means Significant
r^2=0.083means only a 8% variability van be predicted by one of the known variable.
  So Income per person and alcohol consumption are related but weakly.
2)r=-0.1188--------> Negative and weak relationship
   p-value=0.13>0.05 means insignificant.
So employment and alcohol consumption are not related.
overall, income and alcohol consumption are weakly related whereas employment rate and alcohol consumption are not related they have no association.
0 notes
varun-data-analysis · 5 years ago
Text
Assignment-2-chi-square Test
Research Question- Association between Income of a person and type of Alcohol consumer.
Dataset: Gapminder
Null Hypothesis: There is no association between income of a person and type of alcohol consumer.
Alternate Hypothesis: There is an association between them.
Explanatory/Independent Variable: ‘Income_group‘ categorized into four groups
1- < $1000
2- <$12000
3- <$25000
4 - <$105200
Response/Dependent variable:’alcohol_group’ in litres/year categorized into two groups.
0-> 0-8 litres/year->Non-addicted drinker
1->8-25 litres/year-> Addicted drinker
Program:
# -*- coding: utf-8 -*- """ Created on Thu Sep 10 11:54:45 2020
@author: Varun """
import pandas as pd import scipy.stats import seaborn import matplotlib.pyplot as plt import itertools
data_import = pd.read_csv('gapfinder_pds.csv', low_memory=False) data=pd.DataFrame()
data = data_import[['country','incomeperperson', 'alcconsumption', 'employrate']].copy() print("Number of Countries = ", len(data)) #number of observations (rows) print ("Number of Variables =", len(data.columns)) # number of variables (columns)
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce') data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce') data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce')
labels_income = ['1','2','3','4'] data['income_group'] = pd.cut(data['incomeperperson'], bins=[0,1000,12000,25000,105200], labels=labels_income)
labels_alcohol = ['0','1'] data['alcohol_group'] = pd.cut(data['alcconsumption'], bins=[0,8,25], labels=labels_alcohol)
#Contigency table for observed counts ct1=pd.crosstab(data['alcohol_group'],data['income_group']) print(ct1)
#column percentages colsum=ct1.sum(axis=0) colpct=ct1/colsum print(colpct)
print('chi-sqaure value, p-value and expected values') cs1=scipy.stats.chi2_contingency(ct1) print(cs1)
# set variable types data["income_group"] = data["income_group"].astype('category') # new code for setting variables to numeric: data['alcohol_group'] = pd.to_numeric(data['alcohol_group'], errors='coerce')
seaborn.catplot(x="income_group", y="alcohol_group", data=data, kind="bar", ci=None) plt.xlabel('Income categroies') plt.ylabel('Proportion of highly alcohol addicted people')
# To perform post hoc tests of Berferroni adjustments, good bye to ardous lines of coding for tests using itertools
data['income_group']=pd.to_numeric(data['income_group'],errors='coerce') for pair in itertools.combinations([1,2,3,4],2):    ct=pd.crosstab(data['income_group'].isin(pair), data['alcohol_group'])    print("chi sq test of subcategory: {}".format(pair))    print (scipy.stats.chi2_contingency(ct))
Output:
Tumblr media
Conclusion:
Model Interpretation for Chi-Square Tests:
When examining the association between type of alcohol consumer (categorical response) and type of income_group (categorical explanatory), a chi-square test of independence revealed that , those belonging to income_group -4 are more addicted drinkers (77%) next to catergory-3(73%) and least addicted drinkers are in category-1(%). It has chi-square of 36 and p-value of 5.06e-08.The p-value is significant and as it is 4 category explanatory variable I conducted post hoc test of Bonferroni Adjustments.
Model Interpretation for post hoc Chi-Square Test results:
Four categories so adjusted p-value is 0.05/6=0.008.
(1,2)---->chi-square=16.589  p-value=4.639e-05 <<0.008
(1,3)---->chi-square=6.802    p-value= 0.00910 >0.008
(1,4)---->chi-square=4.014    p-value=  0.0451>0.008
(2,3)---->chi-square=5.172    p-value= 0.022951>0.008
(2,4)---->chi-square=8.194    p-value= 0.0042<0.008
(3,4)---->chi-square=23.182   p-value= 1.47e-06<<<0.008
(1,2), (2,4), (3,4) are significant and other are not significant.
Below is the same shown graphically
Tumblr media
0 notes
varun-data-analysis · 5 years ago
Text
Course2- Assignment1-ANOVA
Research Question- Association between Income of a person and quantity of Alcohol consumption.
Dataset: Gapminder
Null Hypothesis: There is no association between income of a person and quantity of alcohol consumed by him
Alternate Hypothesis: There is an association between them.
Explanatory/Independent Variable: ‘Incomeperperson‘ categorized into four groups
1- < $1000
2- <$12000
3- <$25000
4 - <$105200
Response/Dependent variable:’alcconsumption’ in litres/year
Program:
Tumblr media Tumblr media
Output:
Tumblr media Tumblr media
Summary:
Model Interpretation for ANOVA:
When examining the association between quantity of alcohol consumed in liters/year(quantitative response) and income of the person (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among four income groups, 4thcategory-high income reported alcohol consumption  significantly more liters/year (Mean=9.87, s.d. 3.68) compared to those with 1st category- low income (Mean=4.30, s.d. ±4.20), F(3, 175)=9.783, p=5.32e-06.
So, seeing the value of p<<<<<<0.05 we can say that it is significant and we can reject the null hypothesis.
Model Interpretation for post hoc ANOVA results:
Post hoc results generated using Tukey’s Honestly Significant Difference method show 6 possible possible combinations. Out of six, three cases are safe to reject the null hypothesis whereas last three cannot reject null hypothesis.
Careful observation shows that only combinations involving income group-1, the null hypothesis is rejected and in other cases it cannot be. So, least income group is removed null hypothesis cannot be rejected
0 notes
varun-data-analysis · 5 years ago
Text
Start of Course 2 in Data Analysis
Course name: Data Analysis Tools
In the first course of Data Management and Visualization, we dealt with descriptive statistics like frequency tables, center and spread and basic visualization like uni-variate and bi-variate graphs.
Now in this course, it is based on Inferential Statistics.
Statistical Hypothesis Testing: Assessing evidence provided by the data in favor or against each hypothesis about population.
Inferential statistics is based on drawing results from means of ‘n’ number of samples.
NULL Hypothesis: There is no relationship between Explanatory and Response variables.
Alternate Hypothesis: There is a relationship between both of them.
Steps in Testing Hypothesis:
1. Specify the null and the alternate hypothesis
2. Choose a Sample
3. Assess the evidence
4. Draw Conclusions.
Other Important topics:
P-value
F-value
Types of statistical Tests:
ANOVA- Analysis of variance
X2- chi-square Test of Independence
r- Pearson correlation coefficient
The first assignment is based on ANOVA
0 notes
varun-data-analysis · 5 years ago
Text
Final Assignment - Graphs Visualization
Research Questions-
1. Relationship between Income Per Person and Alcohol Consumption
2. Relationship between Employment rate of a country and Alcohol Consumption
Explanatory/Independent Variables: Income per person,Employment rate
Response/Dependent Variable: Alcohol Consumption
Code:
Tumblr media Tumblr media Tumblr media
Output:
Number of Countries =  213 Number of Variables = 4 count       190.000000 mean       8740.966076 std       14262.809083 min         103.775857 25%         748.245151 50%        2553.496056 75%        9379.891166 max      105147.437700 Name: incomeperperson, dtype: float64 count    187.000000 mean       6.689412 std        4.899617 min        0.030000 25%        2.625000 50%        5.920000 75%        9.925000 max       23.010000 Name: alcconsumption, dtype: float64 count    178.000000 mean      58.635955 std       10.519454 min       32.000000 25%       51.225000 50%       58.699999 75%       64.975000 max       83.199997 Name: employrate, dtype: float64 Income per person - 4 categories <1000       54 <$12000     95 <$25000     17 <$105200    24 NaN         23 Name: income_group, dtype: int64
Graphs:
Note: Q-Quantitative
          C- Categorical
Univariate Histograms as my question is relationship between Q-Q
1. Univariate  Q histogram of Income per person:
Tumblr media
It show that most of the countries are among Low income group. So, it is skewed to right.
2. Univariate Histogram of Alcohol Consumption:
Tumblr media
The trend shows that most of the countries consume lower quantity of alcohol.Skewed to the right distribution.
3. univariate Histogram of Employment rate
Tumblr media
Unimodal and symmetric distribution with peak at 50% employment.
Bivariate  Q-Q scatter plots:
1. Scatter plot between Income per person and Alcohol Consumption
Tumblr media
2. Scatter plot between Employment rate and alcohol consumption:
Tumblr media
Two scatter plots show no signs of any relationship between two plotted variables. It will foolish to draw even the trend line as it is evident from the distributions which are very scattered.
So, let us categorize the explanatory variable and try C-Q bar chart.
Bivariate Categorical to Quantitative Bar charts:
1. Bar chart between Income and Alcohol:
Tumblr media
This bar chart is some what revealing which shows that higher income countries consume more alcohol.
2.Bar chart between employment rate and alcohol
Tumblr media
This bar chart shows that alcohol consumption of a country does not depend on the employment rate of the country.
Conclusion of the Project:
1. The relationship between income and alcohol consumption are linear and positive. So, Higher the income Higher the consumption of the alcohol.
2. There is no association between employment and alcohol consumption. One cannot become the explanation of the other.
0 notes
varun-data-analysis · 5 years ago
Text
Third Assignment
Title:
Association between income per person and average alcohol consumption with or without dependency on average employment rate
Introduction:
People  in countries where average income per person is high are more likely to  have inclination for greater alcohol consumption  although there are other factors(like mental, psychological). We all  know unemployment  rate also has a positive effect on drinking behavior. high  People with  higher income have access to better facilities resulting in higher life  expectancy. The more the money a person has he has more tendency and  option to consume more alcohol and it is also equally true that  unemployed person out of grief and anxiety consumes more alcohol  although the cheaper brands. so my research questions are
Research Question:
Is there an association between average income and alcohol consumption
Is there an association between alcohol consumption and employment rate.
My dataset is GapMinder. This is not a population survey based. It is the data of important parameters of a country like income per person, suicide rate and so on. So, there is no possibility of finding frequency of a certain value of certain category.(example- Income per person of a country is quantitative and   different  countries have diverse value, so even if we do frequency analysis hardly two values are exact, so it is no use) So, I created categorical variable myself and grouped my choice of variables into three categories( like high, medium, low). This is certainly a type of data management ( GROUPING VALUES WITHIN INDIVIDUAL VARIABLE). There is no unknown value in the options for any variable. The answer is either something or left blank(means missing). So, these missing values are already addressed using NAN value in the previous program. So, my point is that data management required for my research is already done to the extent necessary.
What To do!!!?
So I decided to modify my program program so that it is concise and short(like I have used pandas.cut to categorise, now, i will use pandas.qcut). And also to have some fun in learning the operations taught in this week which may not be much relevant or important to the problem but make me learn (which is exactly why we are here!!! ;) )
Program:
# -*- coding: utf-8 -*- """ Created on Tue Sep  1 13:59:29 2020@author: Varun """import pandas as pddata_import = pd.read_csv('gapfinder_pds.csv', low_memory=False) data=pd.DataFrame()data = data_import[['country','incomeperperson', 'alcconsumption', 'employrate']].copy() print("Number of Countries = ", len(data)) #number of observations (rows) print ("Number of Variables =", len(data.columns)) # number of variables (columns)data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce') data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce') data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce')# Grouping values with in Individual Variablelabels_income = ['1=0%tile','2=25%tile','3=50%tile','4=75%tile'] data['income_group'] = pd.qcut(data['incomeperperson'], 4, labels=labels_income)print ('Income per person - 4 categories') c1 = data['income_group'].value_counts(sort=False,dropna=False) print (c1)print ('percentages for Income per person') p1 = data['income_group'].value_counts(sort=False,dropna=False, normalize=True) print (p1)labels_alcohol = ['1=0%tile','2=25%tile','3=50%tile','4=75%tile'] data['alcohol_group'] = pd.cut(data['alcconsumption'], 4, labels=labels_alcohol)print (' "alcconsumption" quantity consumed per person- 4 categories') c2 = data['alcohol_group'].value_counts(sort=False,dropna=False) print(c2)print( 'percentages for "alcconsumption" quantity consumed per person') p2 = data['alcohol_group'].value_counts(sort=False,dropna=False, normalize=True) print (p2)labels_employ = ['1=0%tile','2=25%tile','3=50%tile','4=75%tile'] data['employ_group'] = pd.cut(data['employrate'], 4, labels=labels_employ) print ('counts for employment rate - 4 categories') c3 = data['employ_group'].value_counts(sort=False,dropna=False) print(c3)print ('percentages for employment rate - 4 categories') p3 = data['employ_group'].value_counts(sort=False,dropna=False, normalize=True) print (p3)#secondary variable multiplying the employment rate and the income per person data['total_employees_income_of_country_in_Million$']=data['incomeperperson'] * data['employrate']/(1e+6) print(data[['country', 'total_employees_income_of_country_in_Million$']].head(25))data['total_income_group'] = pd.cut(data['total_employees_income_of_country_in_Million$'], 4, labels=labels_employ) print ('total employees income of country in Million$ - 4 categories') c4 = data['total_income_group'].value_counts(sort=False,dropna=False) print(c4)print ('percentages total employees income of country in Million$ - 4 categories') p4 = data['total_income_group'].value_counts(sort=False,dropna=False, normalize=True) print (p4)
Output:
Number of Countries =  213 Number of Variables = 4 Income per person - 4 categories 1=0%tile     48 2=25%tile    47 3=50%tile    47 4=75%tile    48 NaN          23 Name: income_group, dtype: int64 percentages for Income per person 1=0%tile     0.225352 2=25%tile    0.220657 3=50%tile    0.220657 4=75%tile    0.225352 NaN          0.107981 Name: income_group, dtype: float64 "alcconsumption" quantity consumed per person- 4 categories 1=0%tile     90 2=25%tile    64 3=50%tile    29 4=75%tile     4 NaN          26 Name: alcohol_group, dtype: int64 percentages for "alcconsumption" quantity consumed per person 1=0%tile     0.422535 2=25%tile    0.300469 3=50%tile    0.136150 4=75%tile    0.018779 NaN          0.122066 Name: alcohol_group, dtype: float64 counts for employment rate - 4 categories 1=0%tile     18 2=25%tile    64 3=50%tile    69 4=75%tile    27 NaN          35 Name: employ_group, dtype: int64 percentages for employment rate - 4 categories 1=0%tile     0.084507 2=25%tile    0.300469 3=50%tile    0.323944 4=75%tile    0.126761 NaN          0.164319 Name: employ_group, dtype: float64 total employees income of country in Million$ - 4 categories 1=0%tile     133 2=25%tile     13 3=50%tile     12 4=75%tile      8 NaN           47 Name: total_income_group, dtype: int64 percentages total employees income of country in Million$ - 4 categories 1=0%tile     0.624413 2=25%tile    0.061033 3=50%tile    0.056338 4=75%tile    0.037559 NaN          0.220657 Name: total_income_group, dtype: float64
Summary:
1.  Presence of missing data- addressed using NAN value
2.  Coding in valid data and re-coding values-  Not possible in my dataset as there is no such data(mostly there are found in surveys)
3. Creating Secondary variable -Yes, a new variable is created multiplying “employment rate” and “income per person”
4. Grouping Values within Individual variables - Yes, three such variables are grouped.
5. Discussion of Important Points-
   Based on quartile groping of all four variables,
- Income group- All groups are equally sized i.e, around 22% and remaining is missing data
- Alcohol_group- First two groups i.e, first 50% is dominating with 72%
- Employment group- middle two groups i.e, 25-75% are dominating with 63%
1 note · View note
varun-data-analysis · 5 years ago
Text
Second Assignment
1Title:
Association between income per person and average alcohol consumption with or without dependency on average employment rate
Introduction:
People in countries where average income per person is high are more likely to have inclination for greater alcohol consumption although there are other factors(like mental, psychological). We all know unemployment rate also has a positive effect on drinking behavior. high  People with higher income have access to better facilities resulting in higher life expectancy. The more the money a person has he has more tendency and option to consume more alcohol and it is also equally true that unemployed person out of grief and anxiety consumes more alcohol although the cheaper brands. so my research questions are
Research Question:
Is there an association between average income and alcohol consumption
Is there an association between alcohol consumption and employment rate.
I have created a categorical variables(one each for income,employment, alcohol consumption) in gap minder to analyze data statistically as done in NESARC data file.
Income_group-low(income<$1025), medium($1025<income<$25000), high($25000<income<$110000)
alcohol_group-low(<=5 liters/year ), average(>5 liters/year and ,=15 liters/year ), high(>15 liters/year)
Employ_group- low(<=32%), average(>32% and<=50%) , high(>50%)
  All the classification and limits of the groups are based on some research and personal judgement keeping in view the data available.
  Creating Categorical variables in Python is tedious and difficult task but eventually achieved after long search for attributes and objects of Pandas and so many iterations.
Program:
As the program is lengthy in python I have uploaded it in another post for which the link is https://varun-data-analysis.tumblr.com/post/627783146358915072/first-program-in-python-in-data-analysis
Output:
Nan= no information is available
Number of Countries =  213 Number of Variables = 4
Frequency Table 1:
counts for "income per person" in countries based on high medium and low income, high-income       54 medium-income    112 low-income        24 NaN               23 Name: income_group, dtype: int64 percentages for  countries based on developed,developing and under-developed high-income      0.253521 medium-income    0.525822 low-income       0.112676 NaN              0.107981 Name: income_group, dtype: float64
Frequency Table 2:
counts for "alcconsumption" quantity consumed per person low        80 average    95 high       12 NaN        26 Name: alcohol_group, dtype: int64 percentages for "alcconsumption" quantity consumed per person low        0.375587 average    0.446009 high       0.056338 NaN        0.122066 Name: alcohol_group, dtype: float64
Frequency Table 3:
counts for employment rate low         37 average    114 high        27 NaN         35 Name: employ_group, dtype: int64 percentages for employment rate low        0.173709 average    0.535211 high       0.126761 NaN        0.164319 Name: employ_group, dtype: float64
Summary:
Variables income per person, alcohol consumption, employment rate are categorized to carry out frequency calculations. 52% countries fall under medium income group where as 25% and 11% under high and low income respectively.Similar correlation is observed in alcohol consumption and employment rate too, 44% which is the highest among the group fall under average consumers of alcohol and 53% of the countries are moderately employed. The missing data is also accounted for using Nan, 10, 12, 16 % data is missing in income,alcohol and employment variables respectively.
0 notes
varun-data-analysis · 5 years ago
Text
First Program in Python in Data Analysis
""" Created on Sun Aug 23 16:27:43 2020
@author: Varun """
import pandas as pd
data_import = pd.read_csv('gapfinder_pds.csv', low_memory=False) data=pd.DataFrame()
data = data_import[['country','incomeperperson', 'alcconsumption', 'employrate']].copy() print("Number of Countries = ", len(data)) #number of observations (rows) print ("Number of Variables =", len(data.columns)) # number of variables (columns)
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce') data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce') data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce') #ADDING MORE DESCRIPTIVE TITLES
labels_income = ['high-income','medium-income','low-income'] data['income_group'] = pd.cut(data['incomeperperson'], bins=[0,1025,25000,110000], right=False, labels=labels_income)
print ('counts for "income per person" in countries based on high medium and low income,') c1 = data['income_group'].value_counts(sort=False,dropna=False) print (c1)
print ('percentages for  countries based on developed,developing and under-developed') p1 = data['income_group'].value_counts(sort=False,dropna=False, normalize=True) print (p1)
labels_alcohol = ['low','average','high'] data['alcohol_group'] = pd.cut(data['alcconsumption'], bins=[0,5,15,25], right=False, labels=labels_alcohol)
print ('counts for "alcconsumption" quantity consumed per person') c2 = data['alcohol_group'].value_counts(sort=False,dropna=False) print(c2)
print( 'percentages for "alcconsumption" quantity consumed per person') p2 = data['alcohol_group'].value_counts(sort=False,dropna=False, normalize=True) print (p2)
labels_employ = ['low','average','high'] data['employ_group'] = pd.cut(data['employrate'], bins=[0,50,70,100], right=False, labels=labels_employ) print ('counts for employment rate') c3 = data['employ_group'].value_counts(sort=False,dropna=False) print(c3)
print ('percentages for employment rate') p3 = data['employ_group'].value_counts(sort=False,dropna=False, normalize=True) print (p3) print(data)
0 notes
varun-data-analysis · 5 years ago
Text
My First Assignment-choosing a research question.
I went through all the code books provided in the course. I choose to take Gap Finder to frame my research question. After going through the Gap Finder Code book I found my interest in the alcohol consumption and Internet usage rate. Eventually I narrowed it down to Alcohol Consumption.
Usually alcohol consumed by a person depends on many factors like Socio-economic status, employment status, Housing status, Religion, age, mental status and so on. During my graduation, I found some of my friends got addicted to alcohol which mainly kick started as soon as they got employment. They felt a sense of relief as they landed in a job and that none can stop them as they need stretch one’s hands for money to buy it. So, it triggered in my mind that income of a person very much defines whether one fall to this addiction or not. So my second variable is Per capita Income of a person.A study of association between alcohol consumption and Per capita Income is worth a shot.
Literature Review
1. World Health Organization (WHO) comes handy in these cases. In Global Status report on alcohol and health 2018, they categorized countries into High, middle and low income countries. It is reported that High income countries consume more alcohol compared to middle and low.
source: Table 3.11
https://apps.who.int/iris/bitstream/handle/10665/274603/9789241565639-eng.pdf?ua=1
2. An article by Our World in Data, shows a chart describing association between Alcohol consumption and  GDP Per capita of different countries. The trend is  found to be linear and positive with some exception in Asian countries.
Tumblr media
Source: World Bank- World Development Indicators.(2016)
https://ourworldindata.org/grapher/alcohol-consumption-vs-gdp-per-capita
Hypothesis:
After going through Literature, I predict that the association will be linear and positive. The Asian countries will be out of trend with more distributed like a bell-shaped curve. It means alcohol consumption is high in moderate income countries and low in rich and poor countries of Asia.
0 notes
varun-data-analysis · 5 years ago
Text
Here Begins my Journey
Hello friends,
I am your dearest friend, Varun Vidiyala. Data analyst and data scientist are the top seeking jobs in this era. So, I thought like why shouldn’t I have the skill sets required for those profiles.
I started searching online for skills required. The top most skills required is Statistics and Programming. I have a fair understanding of Statistics in my graduation. So, I researched for Programming language required which pointed at two languages- SAS and Python. Owing to the universal acceptability and diverse applications of Python I did a basic course in Python. Hurray!!!! It is my internal feeling inside me after completing my course in Python. But the job is not done yet as the statistics used and application of programming is different in data analysis. It needs some supervise learning through Courses to acquire acquaintance with data sampling, cleaning, algorithms, regressing analysis etc..etc..etc..
I enrolled in Data Management and Visualization offered by Wesleyan University through Coursera.
It is a nice course. Learning through application is the motto of the course. For further updates regarding my journey through data analysis.
Stay Tuned !!!!!
0 notes