utkarshere - Tumblr blog

utkarshere · 7 years ago


Statistical Results

From the GapMinder dataset, life expectancy was taken as the response variable. The explanatory variables are incomeperperson, alcconsumption, co2emissions, employrate, and urbanrate. The initial OLS regression results are shown below:

SUMMARY

The R-squared value is 0.543, meaning the model explains 54.3% of the variability in our response variable.

P-VALUES:

The explanatory variables incomeperperson, employrate and urbanrate have significant p-values (less than 0.05), whereas co2emissions and alcconsumption have non-significant p-values (greater than 0.05).

BETA COEFFICIENTS:

The beta coefficients of co2emissions and employrate are negative, and all the others are positive. This means that co2emissions and employrate are negatively associated with the response variable in this model.

CONFIDENCE INTERVALS:

The confidence intervals of the significant variables do not contain 0, whereas those of the non-significant variables do. An interval that contains 0 means that a coefficient of 0, i.e. no effect on the response variable, is plausible, so such a variable is non-significant in this multiple regression model.
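The link between p-values and confidence intervals described above can be checked in code. The following is only a minimal sketch on synthetic data: the column names mimic the GapMinder ones, but the values, the coefficients and the choice of three predictors are assumptions made for illustration, not the data or the model from this post.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for a few GapMinder columns (NOT the real data).
rng = np.random.default_rng(0)
n = 150
df = pd.DataFrame({
    "incomeperperson": rng.normal(0, 1, n),
    "urbanrate": rng.normal(0, 1, n),
    "co2emissions": rng.normal(0, 1, n),
})
# In this made-up response only income and urban rate have a real effect.
df["lifeexpectancy"] = (
    70 + 3.0 * df["incomeperperson"] + 2.0 * df["urbanrate"]
    + rng.normal(0, 4, n)
)

results = smf.ols(
    "lifeexpectancy ~ incomeperperson + urbanrate + co2emissions", data=df
).fit()

ci = results.conf_int(alpha=0.05)  # column 0 = lower limit, column 1 = upper limit
table = pd.DataFrame({
    "beta": results.params,
    "p_value": results.pvalues,
    "ci_low": ci[0],
    "ci_high": ci[1],
})
table["significant"] = table["p_value"] < 0.05
table["ci_contains_0"] = (table["ci_low"] <= 0) & (table["ci_high"] >= 0)

print("R-squared:", round(results.rsquared, 3))
print(table)
```

In the printed table, a row is marked significant exactly when its 95% interval excludes 0, which is the same rule used to read the summary output.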

utkarshere · 7 years ago

Test a Multiple Regression

From the GapMinder dataset, life expectancy was taken as the response variable. The explanatory variables are incomeperperson, alcconsumption, co2emissions, employrate, and urbanrate. The initial OLS regression results are shown below:

SUMMARY

The R-squared value is 0.543, meaning the model explains 54.3% of the variability in our response variable.

P-VALUES:

The explanatory variables incomeperperson, employrate and urbanrate have significant p-values (less than 0.05), whereas co2emissions and alcconsumption have non-significant p-values (greater than 0.05).

BETA COEFFICIENTS:

The beta coefficients of co2emissions and employrate are negative, and all the others are positive. This means that co2emissions and employrate are negatively associated with the response variable in this model.

CONFIDENCE INTERVALS:

The confidence intervals of the significant variables do not contain 0, whereas those of the non-significant variables do. An interval that contains 0 means that a coefficient of 0, i.e. no effect on the response variable, is plausible, so such a variable is non-significant in this multiple regression model.

Analysis of Response variable with the Primary Explanatory variable:

Next, we consider our primary explanatory variable, alcconsumption, on its own. We first draw a scatter plot with the linear line of best fit. The graph obtained is shown below:

The graph shows no clear association between the two variables.

To try for a better fit, we add a quadratic (second-order polynomial) line. The new line obtained is shown in the graph below:

The fitted line is no longer straight but curved, because the order of the polynomial was changed to 2.

The regression table for alcconsumption alone against the response lifeexpectancy is shown below:

Here the R-squared value is just 0.098, which means the model explains only 9.8% of the variability. The p-value is reported as 0, so it is significant. Also, the beta coefficient of 0.618 indicates a positive association.

Next, we add a second-order polynomial term to try to improve the model. The squared value of alcconsumption is added to the model, and the new results are as follows:

Here the R-squared value increases to 10%, but the new term has a negative beta coefficient, indicating a negative association, and it is not significant.
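The comparison of a linear and a quadratic fit can be sketched with NumPy alone. The data below are randomly generated and purely hypothetical, and centering the predictor before squaring it is an assumption of this sketch rather than something stated in the post.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical data standing in for alcconsumption and lifeexpectancy.
rng = np.random.default_rng(1)
x = rng.uniform(0, 20, 200)
y = 65 + 1.2 * x - 0.04 * x**2 + rng.normal(0, 6, 200)

x_c = x - x.mean()                   # center the predictor (assumption of this sketch)
lin = np.polyfit(x_c, y, deg=1)      # first-order (straight line) fit
quad = np.polyfit(x_c, y, deg=2)     # second-order fit adds the squared term

r2_lin = r_squared(y, np.polyval(lin, x_c))
r2_quad = r_squared(y, np.polyval(quad, x_c))
print("linear R-squared:   ", round(r2_lin, 3))
print("quadratic R-squared:", round(r2_quad, 3))
```

Because the quadratic model contains the linear one as a special case, its R-squared can never be lower. A small increase therefore does not by itself show that the squared term is useful, which is why its p-value is checked as well.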

Adding an additional explanatory variable:

Next, we add another explanatory variable to check for confounding. The variable urbanrate is added here. The new results are as follows:

Both of the previous terms become non-significant once the new explanatory variable is added, which suggests that urbanrate confounds the association between alcconsumption and life expectancy.

PLOTS:

Q-Q PLOT:

This plot compares the quantiles of the residuals, i.e. the differences between observed and predicted values, with the quantiles of a normal distribution. Because the points deviate from the straight line, the residuals do not follow a normal distribution. This means the curvilinear association we observed may not be fully captured by the quadratic term.

STANDARDIZED RESIDUAL OF ALL OBSERVATIONS:

This plot shows how far each standardized residual deviates from the mean of 0. Most of the points lie within 2 standard deviations, which is consistent with about 95% of them falling between -2 and +2 SD. The points outside that range are treated as outliers here.
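The 2 SD rule above can be sketched numerically. The example uses made-up data and a plain z-score of the residuals, which is a simplification: regression packages usually report studentized residuals that also account for leverage.

```python
import numpy as np

# Hypothetical data with normally distributed noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 1000)
y = 50 + 2.0 * x + rng.normal(0, 5, 1000)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Simple standardization: subtract the mean and divide by the sample SD.
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)

within_2sd = np.mean(np.abs(standardized) <= 2)       # share inside -2..+2
outliers = np.flatnonzero(np.abs(standardized) > 2)   # row positions outside that range

print("share within 2 SD:", round(within_2sd, 3))
print("number of points beyond 2 SD:", len(outliers))
```

With well-behaved residuals the share printed first should be close to 0.95. A much lower share, or points beyond 3 SD, would be a reason to look at the individual observations.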

LEVERAGE PLOT:

This plot is also called the influence plot. In this case we see a few outlier points that lie away from the main cluster of observations. The graph obtained is shown below:

utkarshere · 7 years ago

About Variables

For our research, we have picked 2 variables from the data set: the income per person and the internet usage rate (per 100 people) of each country. In the dataset, income is listed under the variable name “incomeperperson” and the internet usage rate under “internetuserate”. The country name, listed as “country”, is the unique identifier for both indicators. Throughout our research, we will study these three variables (incomeperperson, internetuserate, and country) together to find the relationship (if any) between internet use rate and income per person across countries.

1. Income per person (Explanatory Quantitative Variable):
Description of Indicator: 2010 Gross Domestic Product per capita in constant 2000 US$. Inflation, but not differences in the cost of living between countries, has been taken into account.
Original Source: World Bank World Development Indicators

2. Internet use rate (Response Quantitative Variable):
Description of Indicator: 2010 Internet users (per 100 people). Internet users are people with access to the worldwide network.
Original Source: World Bank

utkarshere · 7 years ago

Data Collection

This data is itself a compilation of datasets from different sources, including the US Census Bureau’s International Database, the Institute for Health Metrics and Evaluation, the World Bank and the United Nations Statistics Division. The combined data for 215 countries from all over the world is presented in a single spreadsheet, so the unique identifier in this data is the country name. The data covers 15 variables spanning economic, health, living-standard and environmental parameters. A detailed description of all 15 variables can be accessed from the links mentioned above.

utkarshere · 7 years ago

About Sample

The data used comes from the GapMinder dataset. GapMinder has collected data from 192 United Nations members and 24 other areas, and the dataset covers a total of 215 countries. The level of analysis is therefore aggregate, at the country level.

utkarshere · 7 years ago

Writing about your data

The data used comes from the GapMinder dataset. GapMinder has collected data from 192 United Nations members and 24 other areas, and the dataset covers a total of 215 countries. The original link to the data file is Gapminder.csv. The codebook for this dataset can be found at GapminderCodebook.pdf.

This data is itself a compilation of datasets from different sources, including the US Census Bureau’s International Database, the Institute for Health Metrics and Evaluation, the World Bank and the United Nations Statistics Division. The combined data for 215 countries from all over the world is presented in a single spreadsheet, so the unique identifier in this data is the country name. The data covers 15 variables spanning economic, health, living-standard and environmental parameters. A detailed description of all 15 variables can be accessed from the links mentioned above.

For our research, we have picked 2 variables from the data set: the income per person and the internet usage rate (per 100 people) of each country. In the dataset, income is listed under the variable name “incomeperperson” and the internet usage rate under “internetuserate”. The country name, listed as “country”, is the unique identifier for both indicators. Throughout our research, we will study these three variables (incomeperperson, internetuserate, and country) together to find the relationship (if any) between internet use rate and income per person across countries.

Understanding the variables involved is an essential phase of any research project. Each variable needs to be studied closely before it is used in any operation. So we will go through the variables one by one and look at them from various angles to get a clear view.

1. Income per person (Explanatory Quantitative Variable):
Description of Indicator: 2010 Gross Domestic Product per capita in constant 2000 US$. Inflation, but not differences in the cost of living between countries, has been taken into account.
Original Source: World Bank World Development Indicators

2. Internet use rate (Response Quantitative Variable):
Description of Indicator: 2010 Internet users (per 100 people). Internet users are people with access to the worldwide network.
Original Source: World Bank

utkarshere · 7 years ago

Testing a Potential Moderator

The data set chosen by me for this research is: NESARC Dataset

In this research, we will study the relation between nicotine dependence and a person’s smoking habit. The variables taken into consideration from the dataset are “AGE”, “S3AQ3B1”, “S3AQ3C1”, “CHECK321” and “TABLE12MDX”.

Description of variables:

1. “AGE”: This variable holds the ages of the individuals included in our data. I renamed this variable to “age” in my self-filtered data set.

2. “S3AQ3B1”: This variable describes the smoking habit of the individual. It has 7 different levels. I renamed this variable to “days” in my self-filtered data set.

3. “S3AQ3C1”: This variable gives the quantity of cigarettes smoked per day by the individual. It is a quantitative variable and ranges from 1 to 98. I renamed this variable to “quantity” in my self-filtered data set.

4. “CHECK321”: This variable describes the past smoking status of the individual. It has 3 different levels. I renamed this variable to “past” in my self-filtered data set.

5. “TABLE12MDX”: This variable describes the nicotine dependence of the person. It has 2 levels: dependent or not. I renamed this variable to “nicotine” in my self-filtered data set.

Potential Moderator: This variable is denoted by “freq” and represents the number of cigarettes smoked per day.

Response Variable: Nicotine dependence is our response variable, i.e. “nicotine”.

INPUT CODE:

# Import the libraries used below
import pandas
import numpy
import seaborn
import matplotlib.pyplot as p
import scipy.stats

# Read the source CSV file with pandas.read_csv and store it in 'data'
data = pandas.read_csv("my_nesarc.csv", low_memory=False)

# Convert 'age', 'days', 'quantity' and 'past' to numeric type
# (pandas.to_numeric replaces convert_objects, which newer pandas versions no longer provide)
data["age"] = pandas.to_numeric(data["age"], errors="coerce")
data["days"] = pandas.to_numeric(data["days"], errors="coerce")
data["quantity"] = pandas.to_numeric(data["quantity"], errors="coerce")
data["past"] = pandas.to_numeric(data["past"], errors="coerce")

# Subset of young adults aged 18 to 25 who have smoked in the past 12 months
sub1 = data[(data["age"] >= 18) & (data["age"] <= 25) & (data["past"] == 1)]
sub2 = sub1.copy()

# Set the missing-data codes to NaN
sub2["days"] = sub2["days"].replace(9, numpy.nan)
sub2["quantity"] = sub2["quantity"].replace(99, numpy.nan)

# Recode the smoking-frequency category to an approximate number of days per month
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2["dayfreq"] = sub2["days"].map(recode1)

# Group the number of cigarettes smoked per day; non-daily smokers are coded 0
def freq(row):
    if row["days"] != 1:
        return 0
    elif row["quantity"] <= 5:
        return 3
    elif row["quantity"] <= 10:
        return 8
    elif row["quantity"] <= 15:
        return 13
    elif row["quantity"] <= 20:
        return 18
    elif row["quantity"] > 20:
        return 37

sub2["freq"] = sub2.apply(freq, axis=1)
c5 = sub2["freq"].value_counts(sort=False, dropna=True)
print(c5)

# Contingency table of observed counts
ct1 = pandas.crosstab(sub2["nicotine"], sub2["freq"])
print(ct1)

# Column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# Chi-square test
print("chi-square value, p-value, expected counts")
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

# Set variable types
sub2["freq"] = sub2["freq"].astype("category")
sub2["nicotine"] = pandas.to_numeric(sub2["nicotine"], errors="coerce")

# Bivariate bar graph (catplot replaces the older factorplot;
# errorbar=None needs seaborn 0.12 or later, older versions use ci=None)
seaborn.catplot(x="freq", y="nicotine", data=sub2, kind="bar", errorbar=None)

OUTPUT:

The frequency of each category of the variable ‘freq’ is shown below:

The counts for each category, split by the 0 and 1 responses, are shown in the image below:

The column percentages, split by the 0 and 1 responses, are shown in the image below:

The chi-square value and the p-value obtained are displayed in the image below:

The following graph shows nicotine dependence increasing with the number of cigarettes smoked per day by an individual.

This suggests that our third variable ‘freq’, representing the number of cigarettes smoked per day, is a potential moderator.

utkarshere · 7 years ago

Generating a Correlation Coefficient

Younger, more-educated and higher-income people everywhere have greater access to the web

The data set chosen by me for this research is: GapMinder Dataset

This portion of the GapMinder data includes one year of numerous country-level indicators of health, wealth and development. You can go to www.gapminder.org for more information.

In this research, we will study the Relation of Income per Person and Internet Usage for various countries. The variables taken into consideration from the dataset are “country”, “incomeperperson” and “internetuserate”.

Description of variables:

1. “country”: This variable consists of the various country names for whose population the relationship will be studied. This variable is also the unique identifier in our dataset.

2. “incomeperperson”: This variable gives the income per person in that country, in US dollars ($). Inflation, but not differences in the cost of living between countries, has been taken into account.

3. “internetuserate”: This variable gives us the information of how many people have access to the internet (per 100 persons).

Research Question: Is there a direct relationship between the Income of a person and the Internet usage?

Hypothesis: The null hypothesis is that there is no association between income and internet usage. The alternative hypothesis is that more income per person means more internet usage in countries.

INPUT :

# Import the libraries used below
import pandas
import numpy
import seaborn
import matplotlib.pyplot as p
import scipy.stats

# Read the source CSV file with pandas.read_csv and store it in 'data'
data = pandas.read_csv("gapminder_data.csv", low_memory=False)

# Convert 'incomeperperson' and 'internetuserate' to numeric type
# (pandas.to_numeric replaces convert_objects, which newer pandas versions no longer provide)
data["incomeperperson"] = pandas.to_numeric(data["incomeperperson"], errors="coerce")
data["internetuserate"] = pandas.to_numeric(data["internetuserate"], errors="coerce")

# Scatter plot with a fitted regression line
seaborn.regplot(x="incomeperperson", y="internetuserate", fit_reg=True, data=data)

# Drop rows with missing values before computing the correlation
data_clean = data.dropna()
print("Association between incomeperperson and internetuserate")
print(scipy.stats.pearsonr(data_clean["incomeperperson"], data_clean["internetuserate"]))

OUTPUT :

RESULTS:

1. The Pearson r value is 0.751.

2. The p-value is 1.894e-34, which is very close to 0.

CONCLUSIONS:

Since the r value of 0.751 is fairly close to 1, there is a strong positive linear relationship between incomes and internet use rates.

Since the p-value is very close to 0 and falls in the critical region, we reject the null hypothesis.

Therefore, our null hypothesis is rejected and hence, “There is a direct relationship between the Income per person and Internet Use Rate.”

Value of r-square:

Squaring r = 0.751 gives r^2 = 0.564.

This means that income per person accounts for 56.4% of the variability in internet use rate. This is a substantial share, since more than half of the variability is accounted for.
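The step from r to r^2 can be reproduced with a few lines of plain Python. The helper `pearson_r` and the five data points below are hypothetical and only illustrate the formula; the value 0.751 is the one reported in this post.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, not taken from the GapMinder file.
income = [500, 1500, 4000, 9000, 25000]
internet = [2, 8, 20, 45, 80]

r = pearson_r(income, internet)
print("r for the toy data:", round(r, 3))
print("r^2 for the toy data:", round(r ** 2, 3))

# The value reported in this post: squaring r gives the share of variability.
reported_r = 0.751
print(round(reported_r ** 2, 3))  # 0.564
```

Squaring removes the sign, so r^2 says how much of the variability is shared but not whether the relationship is positive or negative. The direction still has to be read from r itself.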

utkarshere · 7 years ago

Research on Smoking Habit and Nicotine Dependence

The data set chosen by me for this research is: NESARC Dataset

In this research, we will study the relation between nicotine dependence and a person’s smoking habit. The variables taken into consideration from the dataset are “AGE”, “S3AQ3B1”, “S3AQ3C1”, “CHECK321” and “TABLE12MDX”.

Description of variables:

1. “AGE”: This variable holds the ages of the individuals included in our data. I renamed this variable to “age” in my self-filtered data set.

2. “S3AQ3B1”: This variable describes the smoking habit of the individual. It has 7 different levels. I renamed this variable to “days” in my self-filtered data set.

3. “S3AQ3C1”: This variable gives the quantity of cigarettes smoked per day by the individual. It is a quantitative variable and ranges from 1 to 98. I renamed this variable to “quantity” in my self-filtered data set.

4. “CHECK321”: This variable describes the past smoking status of the individual. It has 3 different levels. I renamed this variable to “past” in my self-filtered data set.

5. “TABLE12MDX”: This variable describes the nicotine dependence of the person. It has 2 levels: dependent or not. I renamed this variable to “nicotine” in my self-filtered data set.

Research Question: Is there an association between the number of days a person smokes per month and nicotine dependence?

Hypothesis: There is no association between the number of days a person smokes per month and nicotine dependence.

Explanatory Variable: The smoking frequency is our explanatory variable, i.e. “dayfreq”, with 6 levels showing the number of days a person smokes in a month.

Response Variable: Nicotine dependence is our response variable, i.e. “nicotine”.

PYTHON CODE:

# Import the libraries used below
import pandas
import numpy
import seaborn
import matplotlib.pyplot as p
import scipy.stats

# Read the source CSV file with pandas.read_csv and store it in 'data'
data = pandas.read_csv("my_nesarc.csv", low_memory=False)

# Convert 'age', 'days', 'quantity' and 'past' to numeric type
# (pandas.to_numeric replaces convert_objects, which newer pandas versions no longer provide)
data["age"] = pandas.to_numeric(data["age"], errors="coerce")
data["days"] = pandas.to_numeric(data["days"], errors="coerce")
data["quantity"] = pandas.to_numeric(data["quantity"], errors="coerce")
data["past"] = pandas.to_numeric(data["past"], errors="coerce")

# Subset of young adults aged 18 to 25 who have smoked in the past 12 months
sub1 = data[(data["age"] >= 18) & (data["age"] <= 25) & (data["past"] == 1)]
sub2 = sub1.copy()

# Set the missing-data codes to NaN
sub2["days"] = sub2["days"].replace(9, numpy.nan)
sub2["quantity"] = sub2["quantity"].replace(99, numpy.nan)

# Recode the smoking-frequency category to an approximate number of days per month
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2["dayfreq"] = sub2["days"].map(recode1)

# Contingency table of observed counts
ct1 = pandas.crosstab(sub2["nicotine"], sub2["dayfreq"])
print(ct1)

# Column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# Chi-square test
print("chi-square value, p-value, expected counts")
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

# Set variable types
sub2["dayfreq"] = sub2["dayfreq"].astype("category")
sub2["nicotine"] = pandas.to_numeric(sub2["nicotine"], errors="coerce")

# Bivariate bar graph (catplot replaces the older factorplot;
# errorbar=None needs seaborn 0.12 or later, older versions use ci=None)
seaborn.catplot(x="dayfreq", y="nicotine", data=sub2, kind="bar", errorbar=None)

OUTPUT:

These are the tables obtained; the first table shows the counts and the second table shows the percentages.

The graph obtained is:

This shows nicotine dependence increasing with the number of days of smoking.

RESULTS:

1. The chi-square statistic is 165.27.

2. The p-value is 7.436e-34, which is very close to 0.

CONCLUSIONS:

This small p-value leads us to reject our null hypothesis in favor of the alternative.

Therefore, we say that smoking habits and nicotine dependence are significantly related.
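The chi-square test of independence used here can be sketched on a small table. The counts below are invented for illustration (three frequency groups instead of the six in this post), so only the mechanics carry over, not the numbers.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts.
# Rows: nicotine dependence (0 = no, 1 = yes); columns: three smoking-frequency groups.
observed = np.array([
    [90, 60, 30],
    [10, 40, 70],
])

# Column percentages: share of each frequency group in each dependence row.
col_pct = observed / observed.sum(axis=0)
print(col_pct)

# Test of independence between the row and column variables.
chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square:", round(chi2, 2))
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts under independence:")
print(expected)
```

The expected counts show what the table would look like if dependence were unrelated to smoking frequency. The statistic grows as the observed counts move away from them.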

POST HOC COMPARISONS:

Our explanatory variable has 6 levels, so we need to conduct post hoc comparisons with a BONFERRONI ADJUSTMENT.

With 6 categories, the levels have to be compared two at a time, so a total of 15 comparisons need to be done.

For 15 comparisons the Bonferroni-adjusted p-value threshold is 0.05/15=0.003

We fail to reject the null hypothesis for a comparison if its p-value is greater than 0.003.
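The count of 15 comparisons and the adjusted threshold can be checked with the standard library. The six level values below are the ones produced by the recode dictionary in the code above; everything else is a plain illustration of the arithmetic.

```python
from itertools import combinations

# The six day-frequency levels produced by the recode dictionary.
levels = [1, 2.5, 5, 14, 22, 30]

# Every unordered pair of levels is one post hoc comparison.
pairs = list(combinations(levels, 2))

alpha = 0.05
adjusted_alpha = alpha / len(pairs)  # Bonferroni: divide alpha by the number of tests

print(len(pairs))                # 15
print(round(adjusted_alpha, 4))  # 0.0033
for low, high in pairs[:3]:
    print("compare level", low, "with level", high)
```

Each pair would then get its own 2x2 chi-square test, judged against the adjusted threshold instead of 0.05.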

One comparison, between the first two levels, is shown below:

This gives the following output table:

Since the p-value for this comparison is greater than 0.003, we fail to reject the null hypothesis for this pair.

Similarly, we conduct the test for the remaining 14 pairs.

utkarshere · 7 years ago

Research on the Association of Depression and Person’s Smoking Habit

The data set chosen by me for this research is: NESARC Dataset

In this research, we will study the relation between depression and a person’s smoking habit. The variables taken into consideration from the dataset are “AGE”, “S3AQ3B1”, “S3AQ3C1”, "CHECK321" and “MAJORDEPLIFE”.

Description of variables:

1. “AGE”: This variable consists of the ages of individuals included in our data. I renamed this variable to "age" in my self-filtered data set.

2. “S3AQ3B1”: This variable tells us about the smoking habit of that individual. It has 7 different levels. I renamed this variable to "days" in my self-filtered data set.

3. “S3AQ3C1”: This variable gives the quantity of cigarettes smoked per day by the individual. It is a quantitative variable and ranges from 1 to 98. I renamed this variable to "quantity" in my self-filtered data set.

4. "CHECK321": This variable describes the past smoking status of the individual. It has 3 different levels. I renamed this variable to "past" in my self-filtered data set.

5. "MAJORDEPLIFE": This variable tells whether the smoker is in a state of depression or not. It has 2 different levels. I renamed this variable to "depression" in my self-filtered data set.

Research Question: Is there an association between the Depression State of a Person and the number of cigarettes smoked?

Hypothesis: There is no association between the Depression State of a Person and the number of cigarettes smoked.

Explanatory Variable: The depression state is our explanatory variable, i.e. "depression".

Response Variable: The number of cigarettes smoked is our response variable. We'll introduce a new simplified variable for this ahead in our report.

PYTHON CODE:

# Import the libraries used below
import pandas
import numpy
import statsmodels.formula.api as smf

# Read the source CSV file with pandas.read_csv and store it in 'data'
# low_memory=False makes pandas read the whole file at once so column types are inferred consistently
data = pandas.read_csv("my_nesarc.csv", low_memory=False)

# Convert 'age', 'days', 'quantity' and 'past' to numeric type
# (pandas.to_numeric replaces convert_objects, which newer pandas versions no longer provide)
data["age"] = pandas.to_numeric(data["age"], errors="coerce")
data["days"] = pandas.to_numeric(data["days"], errors="coerce")
data["quantity"] = pandas.to_numeric(data["quantity"], errors="coerce")
data["past"] = pandas.to_numeric(data["past"], errors="coerce")

# Subset of young adults aged 18 to 25 who have smoked in the past 12 months
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
sub1 = data[(data["age"] >= 18) & (data["age"] <= 25) & (data["past"] == 1)].copy()

# Set the missing-data codes to NaN
sub1["days"] = sub1["days"].replace(9, numpy.nan)
sub1["quantity"] = sub1["quantity"].replace(99, numpy.nan)

# Recode the smoking-frequency category to an approximate number of days per month
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub1["dayfreq"] = sub1["days"].map(recode1)

# Total cigarettes per month = days smoked per month * cigarettes per day
sub1["totfreq"] = sub1["dayfreq"] * sub1["quantity"]

ct1 = sub1.groupby("totfreq").size()
print(ct1)

# Use the ols function from statsmodels to calculate the F-statistic and p-value
model1 = smf.ols(formula="totfreq ~ C(depression)", data=sub1)
results1 = model1.fit()
print(results1.summary())

sub2 = sub1[["totfreq", "depression"]].dropna()

print("means for totfreq by depression status")
m1 = sub2.groupby("depression").mean()
print(m1)

print("SD for totfreq by depression status")
sd1 = sub2.groupby("depression").std()
print(sd1)

OUTPUT:

1. RESULTS OF ct1:

2. OUTPUT OF OLS:

RESULTS:

1. The F-statistic is 3.55.

2. The p-value is 0.0597, which is approximately 0.06.

CONCLUSIONS:

Since the p-value is greater than 0.05 and therefore does not fall in the critical region, we cannot reject the null hypothesis.

Therefore, we fail to reject our null hypothesis: the data do not show a significant association between the depression state of a person and the number of cigarettes smoked.
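For a two-level explanatory variable such as depression, the F-test from the OLS summary is the same test as a one-way ANOVA across the two groups. The sketch below uses scipy.stats.f_oneway instead of the statsmodels call from this post, and the monthly cigarette counts are invented for illustration.

```python
from scipy.stats import f_oneway, ttest_ind

# Hypothetical cigarettes-per-month values for the two depression groups.
not_depressed = [30, 60, 90, 150, 300, 20, 45, 120]
depressed = [60, 150, 300, 450, 90, 200, 75, 360]

# One-way ANOVA across the two groups.
f_stat, p_anova = f_oneway(not_depressed, depressed)

# Two-sample t-test with equal variances assumed (the scipy default).
t_stat, p_t = ttest_ind(not_depressed, depressed)

print("F-statistic:", round(f_stat, 3), "p-value:", round(p_anova, 4))
print("t squared:  ", round(t_stat ** 2, 3), "p-value:", round(p_t, 4))

alpha = 0.05
if p_anova < alpha:
    print("reject the null hypothesis")
else:
    print("fail to reject the null hypothesis")
```

With only two groups, F equals the square of the t statistic and both tests give the same p-value, so no post hoc comparison is needed in this case.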

utkarshere · 7 years ago

Creating Graphs for Data

First, univariate graphs are shown separately for the two variables: income ("incomeperperson", grouped as "incomegroups") and internet use rate ("internetuserate").

1. Univariate graphs for Income:

To give an overview of the variable “incomeperperson”, the following code was written:

And the following output is obtained:

1.A: Categorical count plot:

Input Code:

# Bin income per person and internet use rate into groups and print the counts
sub2["incomegroups"] = pandas.cut(sub2.incomeperperson, [0, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 82000])
p4 = sub2["incomegroups"].value_counts(sort=False, dropna=True)
print(p4)
sub2["internetgroups"] = pandas.cut(sub2.internetuserate, [0, 20, 40, 60, 80, 100])
p5 = sub2["internetgroups"].value_counts(sort=False, dropna=True)
print(p5)

# Count plot of the income groups
sub2["incomegroups"] = sub2["incomegroups"].astype("category")
seaborn.countplot(x="incomegroups", data=sub2)

Output Frequency:

Output Graph:

1.B: Distribution plot:

Input:

Output:

2. Univariate graphs for Internet usage:

To give an overview of the variable “internetuserate”, the following code was written:

And the following output is obtained:

2.A: Categorical count plot:

Input Code:

# Bin income per person and internet use rate into groups and print the counts
sub2["incomegroups"] = pandas.cut(sub2.incomeperperson, [0, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 82000])
p4 = sub2["incomegroups"].value_counts(sort=False, dropna=True)
print(p4)
sub2["internetgroups"] = pandas.cut(sub2.internetuserate, [0, 20, 40, 60, 80, 100])
p5 = sub2["internetgroups"].value_counts(sort=False, dropna=True)
print(p5)

# Count plot of the internet use rate groups
sub2["internetgroups"] = sub2["internetgroups"].astype("category")
seaborn.countplot(x="internetgroups", data=sub2)

Output Frequency:

Output Graph:

2.B: Distribution plot:

Input:

Output:

3. Bivariate graph for income and internet:

Input:

Output:

4. Summary:

The univariate graphs clearly show that, for both income and internet use rate, most of the counts fall in the lowest groups. This means that countries with lower incomes are far more numerous and that income is not distributed evenly across the world. The graphs for both income and internet use rate support this.

In the bivariate scatter plot, the points lie close to the regression line over most of the plot, except at the very beginning. The reason is the large number of points crowded into that region, which makes it difficult to see a trend there. Further along, in the middle of the range, the points are close to the line. At the upper end, a few points lie quite far from the line; they represent extreme values and are treated as outliers, so they are not a concern here.

Therefore, an overall relationship can be observed between income per person and internet use rate across countries: as economies grow, the internet usage rate per 100 people also increases.

utkarshere · 7 years ago

Making data management decisions

INPUT:

In this program, two parts have been added to the code to divide the values into groups by quartile. “incomeperperson” and “internetuserate” are categorized into 2 new variables: “incomegroups” and “internetgroups”.

The detailed output has been shown and discussed below:

Discussions: This shows the values of both variables divided into 4 parts.

1. Income groups: Each group contains approximately 46 countries. Note that income per person for the first 46 countries lies in a narrow range from 103 to 775, i.e. a width of 672. In the subsequent groups the width increases to 1800, 6800 and 72,300. This shows how unevenly incomes are distributed across economies.

2. Internet groups: Each group contains approximately 46 countries. Note that the internet use rate for the first 46 countries lies in a narrow range from 0.2 to 9.95, i.e. a width of 9.74. In the subsequent groups the width increases to 21, 24 and 40. This roughly indicates that the internet use rate increases as economies grow.

Summary: Missing data: The outputs do not show any NaN entries for either variable, because the subset was built only from rows with positive values for both variables. This left 183 countries with no missing data. Otherwise, there were 9 missing entries in the variables concerned.

Conclusions: So, it can be roughly seen that the internet use rate increases as economies grow.
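The quartile grouping discussed above can be sketched with pandas.qcut, which splits values into groups of roughly equal size. The eight income values and the labels Q1 to Q4 are hypothetical, and the post does not show which call was actually used, so treat this as one possible approach.

```python
import pandas as pd

# Hypothetical income values, not taken from the GapMinder file.
income = pd.Series(
    [100, 300, 700, 1500, 3000, 8000, 20000, 60000],
    name="incomeperperson",
)

# Split into 4 groups with (approximately) equal numbers of countries.
groups = pd.qcut(income, 4, labels=["Q1", "Q2", "Q3", "Q4"])
counts = groups.value_counts(sort=False)
print(counts)

# Range covered by each quartile group: equal counts, very different widths.
ranges = income.groupby(groups, observed=True).agg(["min", "max"])
ranges["width"] = ranges["max"] - ranges["min"]
print(ranges)
```

Each group holds the same number of values, while the widths grow quickly from the lowest to the highest group. This is the same pattern the discussion describes for the income ranges.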

utkarshere · 7 years ago

MY FIRST PYTHON PROGRAM!!

INPUT :

# Import two libraries: pandas and numpy
import pandas
import numpy

# Read the source CSV file with pandas.read_csv and store it in 'data'
# low_memory=False makes pandas read the whole file at once so column types are inferred consistently
data = pandas.read_csv("gapminder_data.csv", low_memory=False)

# Convert 'incomeperperson' and 'internetuserate' to numeric type
# (pandas.to_numeric replaces convert_objects, which newer pandas versions no longer provide)
data["incomeperperson"] = pandas.to_numeric(data["incomeperperson"], errors="coerce")
data["internetuserate"] = pandas.to_numeric(data["internetuserate"], errors="coerce")

# Make a subset 'sub1' of data with only positive values in both columns
sub1 = data[(data["incomeperperson"] > 0) & (data["internetuserate"] > 0)]

# Copy data from 'sub1' into 'sub2'
sub2 = sub1.copy()

# Store the filtered columns of the subset in 3 variables and print them
p1 = sub2["country"]
p2 = sub2["incomeperperson"]
p3 = sub2["internetuserate"]
print(p1, p2, p3)

OUTPUTS :

In the data considered here, there is no variable for which frequency counts would yield a useful initial result, because neither ‘incomeperperson’ nor ‘internetuserate’ has specific values that repeat and can be grouped into counts. Therefore, in place of frequency distributions, the refined results are shown below after removing the blank entries. Only 183 of the 213 rows have positive values for both ‘incomeperperson’ and ‘internetuserate’.
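A short sketch shows why frequency counts say little for variables like these. The values below are hypothetical stand-ins for ‘internetuserate’; since a continuous measurement rarely repeats, value_counts reports a count of 1 for every value.

```python
import pandas as pd

# Hypothetical internet use rates; each value occurs only once.
rates = pd.Series([0.72, 9.95, 31.05, 44.0, 81.0], name="internetuserate")

# A frequency table of distinct values: every count is 1,
# so the table is as long as the data and summarises nothing.
counts = rates.value_counts()
print(counts)
```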

1. Countries:

These outputs list the countries that remain after the blank entries are removed. This is why gaps in the numbering appear in many places, e.g. between 7 and 9, and between 180 and 182.
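The gaps in the numbering come from pandas keeping the original row labels when rows are filtered out. The sketch below uses a hypothetical toy table to show this, and also shows reset_index as one optional way to renumber the rows; the program in this post does not do that, which is why its output keeps the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; row 1 has a blank (NaN) income.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "incomeperperson": [500.0, np.nan, 1200.0, 800.0],
})

# Filtering keeps the original row labels, so label 1 is now missing.
sub = toy[toy["incomeperperson"] > 0]
print(list(sub.index))

# Optional: renumber the remaining rows from 0 without gaps.
renumbered = sub.reset_index(drop=True)
print(list(renumbered.index))
```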

2. Income per person:

These outputs list the income per person (in US dollars) of the respective countries after the blank entries are removed.

3. Internet usage rate:

These outputs list the internet use rate (per 100 people) of the respective countries after the blank entries are removed.

Conclusion:

In all three outputs, the total number of entries, shown by the length at the end, is 183. The new subset drops the rows with blank entries. Because the variables included cannot be grouped to get frequency counts, no frequency tables are shown in the output.

utkarshere · 7 years ago

Research on Internet Usage and Income of a Person

Younger, more-educated and higher-income people everywhere have greater access to the web.

The data set I chose for this research is the GapMinder dataset.

This portion of the GapMinder data includes one year of numerous country-level indicators of health, wealth and development. You can go to www.gapminder.org for more information.

In this research, we will study the Relation of Income per Person and Internet Usage for various countries. The variables taken into consideration from the dataset are “country”, “incomeperperson” and “internetuserate”.

Description of variables:

1. “country”: This variable holds the names of the countries whose populations are studied. It is also the unique identifier in our dataset.

2. “incomeperperson”: This variable gives the income per person in each country, in US dollars ($). The figures take inflation into account, but not the differences in the cost of living between countries.

3. “internetuserate”: This variable gives the number of people with access to the internet per 100 persons.

Research Question: Is there a direct relationship between the Income of a person and the Internet usage?

Hypothesis: Yes, there is an association between Income and Internet Usage i.e. More Income per person means more Internet usage in countries.

Literature Review:

1. From the previous research by Pew Research Center on the inter-dependency of economy and internet usage, it has been found that they both are strongly related to each other.

Reference:

http://www.pewglobal.org/2016/02/22/internet-access-growing-worldwide-but-remains-higher-in-advanced-economies/technology-report-02-06c/

The source of the above research is Spring 2015 Global Attitudes Survey Q70 and Q72. The major findings are summarized below:

1. Internet use increasing in emerging and developing economies
2. Younger, more-educated and higher-income people everywhere have greater access to the web
3. Men have greater access to the internet than women in many nations
4. Daily internet use is fairly common globally

2. In another study, available on ResearchGate, the same relationship was examined for one emerging economy in particular, Malaysia.

Reference:

https://www.researchgate.net/publication/301549358_The_Relationship_between_Internet_Usage_and_Gross_National_Income_of_an_Emerging_Economy

Conclusion: The findings from this study show that there is a significant long-term and short-term relationship between gross income and internet usage rate in Malaysia.
