btbarc-blog - Tumblr blog

btbarc-blog · 5 years ago

Creating graphs for your data

As I noted previously, I chose to work with the Gapminder data set and will focus on three variables:

income per person, measured in terms of GDP per capita (variable: incomeperperson);

rate of Internet use, measured in terms of Internet users per 100 people (variable: internetuserate);

urbanisation rate, measured as a share of the population living in urban areas (variable: urbanrate).

Specifically, I chose to examine the relationship between income per person and the rate of Internet use as well as the relationship between income per person and the urbanization rate. I chose to use Python for this course. The program I wrote to describe the data for this assignment and its full output are at the bottom of this post after the summaries.

Univariate Summaries

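For income per person, the mean is US$8,741 with a standard deviation of US$14,263, which suggests a large spread. As the univariate chart/histogram generated from Python shows, the data is right-skewed: the median (US$2,553) is well below the mean, and the long tail extends towards the high-income end.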

For the rate of Internet use, the mean is 35.6 Internet users per 100 people with a standard deviation of 27.8, which suggests a large spread. As the chart shows, the data is also right-skewed, with the median (31.8) below the mean.

For the urbanization rate, the mean is 56.8% with a standard deviation of 23.8%, which suggests a large spread. As the chart shows, the data is somewhat bimodal, with two peaks around 40 and 70.
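A rough numerical check on the direction of skew, as a complement to the histograms, is to compare each mean with the corresponding median (the 50% row of the describe() output). The sketch below is not part of the original program; it uses Pearson's second skewness coefficient, 3 * (mean - median) / std, where positive values point to a longer right tail and negative values to a longer left tail. The numbers are copied from the Output section of this post.

```python
# Summary statistics copied from the describe() output in this post.
summary = {
    "incomeperperson": {"mean": 8740.966076, "median": 2553.496056, "std": 14262.809083},
    "internetuserate": {"mean": 35.632716, "median": 31.810121, "std": 27.780285},
    "urbanrate": {"mean": 56.769360, "median": 57.940000, "std": 23.844933},
}


def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


# One coefficient per variable; the sign gives the direction of the longer tail.
skewness = {name: pearson_median_skewness(**stats) for name, stats in summary.items()}

for name, value in skewness.items():
    print(f"{name}: {value:.2f}")
```

This is only a rule of thumb based on three summary numbers; the histogram remains the better guide to the overall shape, including the two peaks in the urbanization rate.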

Bivariate Summaries

Using Python to create a scatterplot between the rate of Internet use (as the independent variable on the x-axis) and income per person (as the dependent variable on the y-axis), I find there seems to be a positive relationship between the two variables, as suggested by the upward-sloping best-fit line. This appears to support my hypothesis that a higher rate of Internet use is associated with higher income per person.

I also create a scatterplot between the urbanization rate (as the independent variable on the x-axis) and income per person (as the dependent variable on the y-axis). Again, there seems to be a positive relationship, with the best-fit line sloping upwards. This also appears to support my second hypothesis that a higher urbanization rate is associated with higher income per person.
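The plotting code is not reproduced in the program at the bottom of this post, so here is a minimal sketch of how a scatterplot with a best-fit line could be drawn with seaborn.regplot. The data values are made up for illustration and are not from gapminder.csv; the axis labels, the title, and the choice of ci=None (no confidence band) are my own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import pandas
import seaborn

# Made-up illustrative values, NOT taken from gapminder.csv.
sample = pandas.DataFrame({
    "internetuserate": [2.0, 10.5, 25.0, 40.0, 55.5, 70.0, 85.0, 93.0],
    "incomeperperson": [300.0, 900.0, 2500.0, 5200.0, 9800.0, 21000.0, 33000.0, 41000.0],
})

# regplot draws the scatter points and overlays a linear best-fit line.
ax = seaborn.regplot(x="internetuserate", y="incomeperperson", data=sample, ci=None)
ax.set_xlabel("Internet users per 100 people")
ax.set_ylabel("Income per person (GDP per capita)")
ax.set_title("Income per person vs. rate of Internet use")
```

Swapping "internetuserate" for "urbanrate" (with a matching x-axis label) would give the second scatterplot in the same way.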

Output

describe income per person

count 190.000000

mean 8740.966076

std 14262.809083

min 103.775857

25% 748.245151

50% 2553.496056

75% 9379.891166

max 105147.437700

Name: incomeperperson, dtype: float64

describe rate of Internet use

count 192.000000

mean 35.632716

std 27.780285

min 0.210066

25% 9.999604

50% 31.810121

75% 56.416046

max 95.638113

Name: internetuserate, dtype: float64

describe urbanization rate

count 203.000000

mean 56.769360

std 23.844933

min 10.400000

25% 36.830000

50% 57.940000

75% 74.210000

max 100.000000

Name: urbanrate, dtype: float64

Program

import pandas

import numpy

import seaborn

import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)

pandas.set_option('display.max_columns', None)

pandas.set_option('display.max_rows', None)

# converting text data to numeric data

data["incomeperperson"] = data["incomeperperson"].apply(pandas.to_numeric,errors="coerce")

data["internetuserate"] = data["internetuserate"].apply(pandas.to_numeric,errors="coerce")

data["urbanrate"] = data["urbanrate"].apply(pandas.to_numeric,errors="coerce")

# describing the data

print("describe income per person")

desc1=data['incomeperperson'].describe()

print(desc1)

print("\n")

print("describe rate of Internet use")

desc2=data['internetuserate'].describe()

print(desc2)

print("\n")

print("describe urbanization rate")

desc3=data['urbanrate'].describe()

print(desc3)

print("\n")

btbarc-blog · 5 years ago

Making Data Management Decisions

As I noted previously, I chose to work with the Gapminder data set and will focus on three variables:

income per person, measured in terms of GDP per capita (variable: incomeperperson);

rate of Internet use, measured in terms of Internet users per 100 people (variable: internetuserate);

urbanisation rate, measured as a share of the population living in urban areas (variable: urbanrate).

Specifically, I chose to examine the relationship between income per person and the rate of Internet use as well as the relationship between income per person and the urbanization rate. I decided that I did not need to bin or replace any data. However, it would be easier for me to examine the data if I grouped the data.

I chose to use Python for this course. The program I wrote for this assignment and its full output are at the bottom of this post after the summary.

Summary

The earlier Week 2 lessons on creating frequency tables seem to be geared towards categorical data, where there are a limited number of possible responses, rather than quantitative data, which could result in virtually limitless readings. As I was studying quantitative data, this meant that the frequencies of almost all of my data were 1 – because it’s quite rare, for example, that a country would have the exact same income per person as another. The most notable exception was the urbanisation rate, where six countries had an urbanisation rate of 100%: Bermuda, Cayman Islands, Hong Kong, Macao, Monaco, and Singapore.

This is obviously not very useful in helping us to understand the frequency distribution of the data. To make the results more useful, I grouped the data.
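To illustrate the difference, here is a small sketch that builds a raw frequency table and a grouped one from the same values. The readings are made up for illustration (they are not from gapminder.csv); the bin edges are the ones my program uses for the urbanisation rate.

```python
import pandas

# Made-up urbanisation rates for illustration, not values from gapminder.csv.
urbanrate = pandas.Series([12.5, 18.9, 35.2, 47.7, 58.1, 66.3, 79.4, 100.0, 100.0])

# Raw frequency table: nearly every distinct value occurs exactly once.
raw_counts = urbanrate.value_counts()

# Grouped frequency table: pandas.cut assigns each value to an interval,
# and sort_index() lists the intervals from lowest to highest.
grouped = pandas.cut(urbanrate, [0, 21, 41, 61, 81, 100])
grouped_counts = grouped.value_counts().sort_index()

print(raw_counts)
print(grouped_counts)
```

The raw table has almost as many rows as there are readings, while the grouped table has one row per interval, which is much easier to read as a distribution.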

For income per person (variable: incomeperperson), I grouped the data into four groups (incomegroup4) corresponding to the World Bank’s income groups. The World Bank defines low-income economies as those with income per person of USD1,025 or less; lower middle-income economies as those between USD1,026 and USD3,995; upper middle-income economies as those between USD3,996 and USD12,375; and high-income economies as those at USD12,376 or more (see link). Based on these thresholds, there were 54 low-income economies (25.4%), 54 lower middle-income economies (25.4%), 41 upper middle-income economies (19.2%), and 41 high-income economies (19.2%). There were 23 countries with no data.
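One detail worth checking is which group a value sitting exactly on a threshold ends up in. By default pandas.cut builds right-closed intervals such as (0, 1026], so a reading of exactly 1,026 is counted as low-income, whereas the World Bank wording puts 1,026 in the lower middle-income group. The sketch below compares the default with right=False, which builds [0, 1026), [1026, 3996) and so on. The boundary values and the group labels are chosen for illustration; they are not taken from gapminder.csv or from my program.

```python
import pandas

bins = [0, 1026, 3996, 12376, 150000]
labels = ["low", "lower middle", "upper middle", "high"]  # illustrative labels

# Boundary values chosen for illustration, not taken from gapminder.csv.
income = pandas.Series([1025.0, 1026.0, 3995.0, 3996.0, 12375.0, 12376.0])

# Default right=True gives (0, 1026], (1026, 3996], ...: 1026 lands in "low".
right_closed = pandas.cut(income, bins, labels=labels)

# right=False gives [0, 1026), [1026, 3996), ...: 1026 lands in "lower middle".
left_closed = pandas.cut(income, bins, labels=labels, right=False)

print(pandas.DataFrame({
    "income": income,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

With continuous readings, values landing exactly on a threshold are rare, so the choice may not change the counts at all; it matters mainly for matching the published definitions precisely.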

For the rate of Internet use (variable: internetuserate), I grouped the data into five groups (internetgroup5) corresponding to readings of 0-20, 21-40, 41-60, 61-80, and 81-100 Internet users per 100 people. 79 countries had 0-20 Internet users per 100 people (37.1%), 37 countries had 21-40 (17.4%), 30 countries had 41-60 (14.1%), 31 countries had 61-80 (14.6%), while 15 countries had 81-100 (7.0%). There were 21 countries with no data.

For the urbanization rate (variable: urbanrate), I also grouped the data into five groups (urbangroup5) corresponding to readings of 0-20%, 21-40%, 41-60%, 61-80%, and 81-100%. 14 countries had an urbanization rate of 0-20% (6.6%), 46 countries had 21-40% (21.6%), 52 countries had 41-60% (24.4%), 53 countries had 61-80% (24.9%), while 38 countries had 81-100% (17.8%). There were 10 countries with no data.
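The percentages above are shares of all 213 countries, including those with no data, which is why they do not add up to 100%. As a worked example using the income counts reported in this post, the sketch below shows both denominators side by side. The second set of shares (countries with data only) is an alternative I did not use in my program.

```python
# Group counts and totals as reported in this post.
total_countries = 213
income_counts = {"low": 54, "lower middle": 54, "upper middle": 41, "high": 41}

with_data = sum(income_counts.values())  # countries with an income reading
missing = total_countries - with_data    # countries with no data

# Share of ALL countries (what size() * 100 / len(data) computes).
share_of_all = {g: 100 * n / total_countries for g, n in income_counts.items()}

# Share of countries WITH data (an alternative denominator).
share_of_valid = {g: 100 * n / with_data for g, n in income_counts.items()}

print(f"missing: {missing} ({100 * missing / total_countries:.1f}% of all countries)")
for g in income_counts:
    print(f"{g}: {share_of_all[g]:.1f}% of all, {share_of_valid[g]:.1f}% of those with data")
```

Either denominator is defensible; what matters is saying which one is being reported.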

Output

frequency counts for data groups

incomegroup4

(0, 1026] 54

(1026, 3996] 54

(3996, 12376] 41

(12376, 150000] 41

dtype: int64

internetgroup5

(0, 21] 79

(21, 41] 37

(41, 61] 30

(61, 81] 31

(81, 100] 15

dtype: int64

urbangroup5

(0, 21] 14

(21, 41] 46

(41, 61] 52

(61, 81] 53

(81, 100] 38

dtype: int64

percentages for data groups

incomegroup4

(0, 1026] 25.352113

(1026, 3996] 25.352113

(3996, 12376] 19.248826

(12376, 150000] 19.248826

dtype: float64

internetgroup5

(0, 21] 37.089202

(21, 41] 17.370892

(41, 61] 14.084507

(61, 81] 14.553991

(81, 100] 7.042254

dtype: float64

urbangroup5

(0, 21] 6.572770

(21, 41] 21.596244

(41, 61] 24.413146

(61, 81] 24.882629

(81, 100] 17.840376

dtype: float64

Program

import pandas

import numpy

data = pandas.read_csv('gapminder.csv', low_memory=False)

# converting text data to numeric data

data["incomeperperson"] = data["incomeperperson"].apply(pandas.to_numeric,errors="coerce")

data["internetuserate"] = data["internetuserate"].apply(pandas.to_numeric,errors="coerce")

data["urbanrate"] = data["urbanrate"].apply(pandas.to_numeric,errors="coerce")

# grouping the data

data['incomegroup4'] = pandas.cut(data.incomeperperson, [0, 1026, 3996, 12376, 150000])

data['internetgroup5'] = pandas.cut(data.internetuserate, [0, 21, 41, 61, 81, 100])

data['urbangroup5'] = pandas.cut(data.urbanrate, [0, 21, 41, 61, 81, 100])

# displaying frequency counts for the data groups

print("frequency counts for data groups")

c1 = data.groupby('incomegroup4').size()

print(c1)

print("\n")

c2 = data.groupby('internetgroup5').size()

print(c2)

print("\n")

c3 = data.groupby('urbangroup5').size()

print(c3)

print("\n")

# displaying percentages for the data groups

print("percentages for data groups")

p1 = data.groupby('incomegroup4').size() * 100/len(data)

print(p1)

print("\n")

p2 = data.groupby('internetgroup5').size() * 100/len(data)

print(p2)

print("\n")

p3 = data.groupby('urbangroup5').size() * 100/len(data)

print(p3)

print("\n")

btbarc-blog · 5 years ago

Running Your First Program

As I noted in the first assignment, I chose to work with the Gapminder data set and will focus on three variables:

income per person, measured in terms of GDP per capita (variable: incomeperperson);

rate of Internet use, measured in terms of Internet users per 100 people (variable: internetuserate);

urbanisation rate, measured as a share of the population living in urban areas (variable: urbanrate).

I chose to use Python for this course. I ran into a few problems. I got a “404 not found” error message when I tried to open the link https://continuum.io/downloads provided in the course readings to download Anaconda. In the end, I used Google to find somewhere else to download Anaconda from. As I was writing the program, I then discovered that I was unable to convert the text data to numeric data using the syntax described in the video lessons for the course. Upon further investigation, it appears that the syntax is outdated and no longer in use. I’m not sure if other students encountered the same problem, but it may have been related to my downloading a newer version of Anaconda instead of the one available from the link provided in the course readings. This was incredibly frustrating and disappointing as it felt like the course was teaching outdated material and it seemed impossible to move forward with the assignment. In the end, I found a workable solution from another student’s post in the discussion forums.
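For anyone hitting the same conversion problem, here is a minimal sketch of how pandas.to_numeric with errors="coerce" behaves. The raw strings are made up for illustration. I call the function on the whole column here; my program passes the same function through .apply, which should give the same result.

```python
import pandas

# Made-up raw values; the blank and the text entry stand in for missing readings.
raw = pandas.Series(["8740.97", " ", "2553.5", "not available", "105147.4377"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
numeric = pandas.to_numeric(raw, errors="coerce")

print(numeric)
print("missing after conversion:", numeric.isna().sum())
```

Once the column is numeric, functions such as describe() and pandas.cut treat the NaN entries as missing rather than failing on text.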

The program I wrote for this assignment and its full output are at the bottom of this post after the summary.

Summary

First, as suggested in the Week 2 video lessons, I used Python to confirm that the data set contains 213 observations and 16 columns.

Then I calculated the frequency distributions of the three variables. There was missing data in all three variables I studied: 23 for income per person (variable: incomeperperson), 21 for rate of Internet use (variable: internetuserate), and 10 for urbanisation rate (variable: urbanrate).
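As a more direct alternative to reading these numbers off the frequency tables, the size of the data set and the number of missing values per variable can be obtained as sketched below. The tiny data frame is made up and stands in for gapminder.csv; my program uses len(data) and len(data.columns) rather than data.shape.

```python
import pandas

# Tiny made-up frame standing in for gapminder.csv.
data = pandas.DataFrame({
    "country": ["A", "B", "C", "D"],
    "incomeperperson": [1200.5, None, 30500.0, None],
    "urbanrate": [35.2, 100.0, None, 61.8],
})

# shape is a (number of observations, number of columns) tuple.
n_rows, n_columns = data.shape

# isna() flags missing cells; summing the flags counts them per column.
missing = data[["incomeperperson", "urbanrate"]].isna().sum()

print(n_rows, n_columns)
print(missing)
```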

The Week 2 lessons on creating frequency tables seem to be geared towards categorical data, where there are a limited number of possible responses, rather than quantitative data, which could result in virtually limitless readings. As I was studying quantitative data, this meant that the frequencies of almost all of my data were 1 – because it’s quite rare, for example, that a country would have the exact same income per person as another. The most notable exception was the urbanisation rate (variable: urbanrate), where six countries had an urbanisation rate of 100%: Bermuda, Cayman Islands, Hong Kong, Macao, Monaco, and Singapore. (The output below shows 194 distinct values across the 203 available readings, so a few other values also repeat.)

This is obviously not very useful in helping us to understand the frequency distribution of the data. To make the results more useful, I watched the Week 3 video lessons and learned how to group variables. I included this additional step in my program.

For income per person, I grouped the data into four groups (incomegroup4) corresponding to the World Bank’s income groups. The World Bank defines low-income economies as those with income per person of USD1,025 or less; lower middle-income economies as those between USD1,026 and USD3,995; upper middle-income economies as those between USD3,996 and USD12,375; and high-income economies as those at USD12,376 or more (see link). Based on these thresholds, there were 54 low-income economies (25.4%), 54 lower middle-income economies (25.4%), 41 upper middle-income economies (19.2%), and 41 high-income economies (19.2%). As mentioned above, there were 23 countries with no data.

For the rate of Internet use, I grouped the data into five groups (internetgroup5) corresponding to readings of 0-20, 21-40, 41-60, 61-80, and 81-100 Internet users per 100 people. 79 countries had 0-20 Internet users per 100 people (37.1%), 37 countries had 21-40 (17.4%), 30 countries had 41-60 (14.1%), 31 countries had 61-80 (14.6%), while 15 countries had 81-100 (7.0%). There were 21 countries with no data.

For the urbanization rate, I also grouped the data into five groups (urbangroup5) corresponding to readings of 0-20%, 21-40%, 41-60%, 61-80%, and 81-100%. 14 countries had an urbanization rate of 0-20% (6.6%), 46 countries had 21-40% (21.6%), 52 countries had 41-60% (24.4%), 53 countries had 61-80% (24.9%), while 38 countries had 81-100% (17.8%). There were 10 countries with no data.

Frequency tables

frequency counts for data groups

incomegroup4

(0, 1026] 54

(1026, 3996] 54

(3996, 12376] 41

(12376, 150000] 41

dtype: int64

internetgroup5

(0, 21] 79

(21, 41] 37

(41, 61] 30

(61, 81] 31

(81, 100] 15

dtype: int64

urbangroup5

(0, 21] 14

(21, 41] 46

(41, 61] 52

(61, 81] 53

(81, 100] 38

dtype: int64

percentages for data groups

incomegroup4

(0, 1026] 25.352113

(1026, 3996] 25.352113

(3996, 12376] 19.248826

(12376, 150000] 19.248826

dtype: float64

internetgroup5

(0, 21] 37.089202

(21, 41] 17.370892

(41, 61] 14.084507

(61, 81] 14.553991

(81, 100] 7.042254

dtype: float64

urbangroup5

(0, 21] 6.572770

(21, 41] 21.596244

(41, 61] 24.413146

(61, 81] 24.882629

(81, 100] 17.840376

dtype: float64

Program

import pandas

import numpy

data = pandas.read_csv('gapminder.csv', low_memory=False)

# displaying the number of observations and columns in the data set

print("number of observations")

print(len(data))

print("\n")

print("number of columns")

print(len(data.columns))

print("\n")

# converting text data to numeric data

data["incomeperperson"] = data["incomeperperson"].apply(pandas.to_numeric,errors="coerce")

data["internetuserate"] = data["internetuserate"].apply(pandas.to_numeric,errors="coerce")

data["urbanrate"] = data["urbanrate"].apply(pandas.to_numeric,errors="coerce")

# displaying frequency counts

print("counts")

c1 = data["incomeperperson"].value_counts(sort=False, dropna=False)

print(c1)

print("\n")

c2 = data["internetuserate"].value_counts(sort=False, dropna=False)

print(c2)

print("\n")

c3 = data["urbanrate"].value_counts(sort=False, dropna=False)

print(c3)

print("\n")

# displaying percentages

print("percentages")

p1 = data["incomeperperson"].value_counts(sort=False, normalize=True)

print(p1)

print("\n")

p2 = data["internetuserate"].value_counts(sort=False, normalize=True)

print(p2)

print("\n")

p3 = data["urbanrate"].value_counts(sort=False, normalize=True)

print(p3)

print("\n")

# displaying frequency count table

print("frequency table")

ct1 = data.groupby('incomeperperson').size()

print(ct1)

print("\n")

ct2 = data.groupby('internetuserate').size()

print(ct2)

print("\n")

ct3 = data.groupby('urbanrate').size()

print(ct3)

print("\n")

# displaying percentage table

print("percentage table")

pt1 = data.groupby('incomeperperson').size() * 100/len(data)

print(pt1)

print("\n")

pt2 = data.groupby('internetuserate').size() * 100/len(data)

print(pt2)

print("\n")

pt3 = data.groupby('urbanrate').size() * 100/len(data)

print(pt3)

print("\n")

# grouping the data

data['incomegroup4'] = pandas.cut(data.incomeperperson, [0, 1026, 3996, 12376, 150000])

data['internetgroup5'] = pandas.cut(data.internetuserate, [0, 21, 41, 61, 81, 100])

data['urbangroup5'] = pandas.cut(data.urbanrate, [0, 21, 41, 61, 81, 100])

# displaying frequency counts for the data groups

print("frequency counts for data groups")

ct4 = data.groupby('incomegroup4').size()

print(ct4)

print("\n")

ct5 = data.groupby('internetgroup5').size()

print(ct5)

print("\n")

ct6 = data.groupby('urbangroup5').size()

print(ct6)

print("\n")

# displaying percentages for the data groups

print("percentages for data groups")

pt4 = data.groupby('incomegroup4').size() * 100/len(data)

print(pt4)

print("\n")

pt5 = data.groupby('internetgroup5').size() * 100/len(data)

print(pt5)

print("\n")

pt6 = data.groupby('urbangroup5').size() * 100/len(data)

print(pt6)

print("\n")

Output

number of observations

213

number of columns

16

counts

NaN 23

18982.269290 1

786.700098 1

1714.942890 1

561.708585 1

...

180.083376 1

1143.831514 1

27595.091350 1

372.728414 1

377.039699 1

Name: incomeperperson, Length: 191, dtype: int64

81.000000 1

66.000000 1

45.000000 1

NaN 21

65.000000 1

...

28.430033 1

6.965038 1

36.562553 1

29.879921 1

31.568098 1

Name: internetuserate, Length: 193, dtype: int64

92.00 1

100.00 6

74.50 1

NaN 10

73.50 1

...

56.02 1

57.18 1

73.92 1

25.46 1

28.38 1

Name: urbanrate, Length: 195, dtype: int64

percentages

220.891248 0.005263

18982.269290 0.005263

786.700098 0.005263

1714.942890 0.005263

561.708585 0.005263

...

180.083376 0.005263

1143.831514 0.005263

27595.091350 0.005263

372.728414 0.005263

377.039699 0.005263

Name: incomeperperson, Length: 190, dtype: float64

81.000000 0.005208

66.000000 0.005208

45.000000 0.005208

65.000000 0.005208

80.000000 0.005208

...

28.430033 0.005208

6.965038 0.005208

36.562553 0.005208

29.879921 0.005208

31.568098 0.005208

Name: internetuserate, Length: 192, dtype: float64

92.00 0.004926

100.00 0.029557

74.50 0.004926

73.50 0.004926

17.00 0.004926

...

56.02 0.004926

57.18 0.004926

73.92 0.004926

25.46 0.004926

28.38 0.004926

Name: urbanrate, Length: 194, dtype: float64

frequency table

incomeperperson

103.775857 1

115.305996 1

131.796207 1

155.033231 1

161.317137 1

...

39972.352770 1

52301.587180 1

62682.147010 1

81647.100030 1

105147.437700 1

Length: 190, dtype: int64

internetuserate

0.210066 1

0.720009 1

0.749996 1

0.829997 1

0.999959 1

...

90.016190 1

90.079527 1

90.703555 1

93.277508 1

95.638113 1

Length: 192, dtype: int64

urbanrate

10.40 1

12.54 1

12.98 1

13.22 1

14.32 1

...

95.64 1

97.36 1

98.32 1

98.36 1

100.00 6

Length: 194, dtype: int64

percentage table

incomeperperson

103.775857 0.469484

115.305996 0.469484

131.796207 0.469484

155.033231 0.469484

161.317137 0.469484

...

39972.352770 0.469484

52301.587180 0.469484

62682.147010 0.469484

81647.100030 0.469484

105147.437700 0.469484

Length: 190, dtype: float64

internetuserate

0.210066 0.469484

0.720009 0.469484

0.749996 0.469484

0.829997 0.469484

0.999959 0.469484

...

90.016190 0.469484

90.079527 0.469484

90.703555 0.469484

93.277508 0.469484

95.638113 0.469484

Length: 192, dtype: float64

urbanrate

10.40 0.469484

12.54 0.469484

12.98 0.469484

13.22 0.469484

14.32 0.469484

...

95.64 0.469484

97.36 0.469484

98.32 0.469484

98.36 0.469484

100.00 2.816901

Length: 194, dtype: float64

frequency counts for data groups

incomegroup4

(0, 1026] 54

(1026, 3996] 54

(3996, 12376] 41

(12376, 150000] 41

dtype: int64

internetgroup5

(0, 21] 79

(21, 41] 37

(41, 61] 30

(61, 81] 31

(81, 100] 15

dtype: int64

urbangroup5

(0, 21] 14

(21, 41] 46

(41, 61] 52

(61, 81] 53

(81, 100] 38

dtype: int64

percentages for data groups

incomegroup4

(0, 1026] 25.352113

(1026, 3996] 25.352113

(3996, 12376] 19.248826

(12376, 150000] 19.248826

dtype: float64

internetgroup5

(0, 21] 37.089202

(21, 41] 17.370892

(41, 61] 14.084507

(61, 81] 14.553991

(81, 100] 7.042254

dtype: float64

urbangroup5

(0, 21] 6.572770

(21, 41] 21.596244

(41, 61] 24.413146

(61, 81] 24.882629

(81, 100] 17.840376

dtype: float64

btbarc-blog · 5 years ago

Getting Your Research Project Started

I chose to work with the Gapminder data set and will focus on the factors affecting income per person in a country, measured in terms of GDP per capita. There appear to be a number of factors in the Gapminder data set/codebook which could determine income per person (variable: incomeperperson).

Rate of Internet use

I am interested in examining whether a higher rate of Internet use – measured in terms of Internet users per 100 people – (variable: internetuserate) is associated with a higher income per person. The causation potentially runs both ways. Greater Internet use is likely linked to greater digitalisation of the economy, which in turn should improve labour productivity and therefore the income that each person can earn. At the same time, greater income per person would likely also allow an economy to accumulate more savings that can be invested in digitalising the economy and society, thus increasing the rate of Internet use.

A literature review by the World Bank, for example, highlighted that increased Internet penetration tends to correspond with higher GDP per capita but acknowledged – as noted above – that the direction of causation is unclear and could go both ways (Minges, 2015). Edquist et al (2017) have also estimated that a 10% increase in mobile broadband penetration tends to correspond with GDP that is 0.6-2.8% higher. That said, Mayer et al (2020) have argued instead that Internet speed may be more relevant than the rate of Internet use in boosting economic growth. Unfortunately, the Gapminder data set does not include data on Internet speed, so an examination of its influence is outside the scope of this study for now.

Urbanisation rate

The second research topic I am interested in is whether a higher urbanisation rate – measured as a share of the population living in urban areas – (variable: urbanrate) is also associated with a higher income per person. As more people live closer together, they likely enjoy agglomeration effects that boost labour productivity and thus income per person. For example, the larger number of firms and households in a certain area would imply greater demand or a larger market for goods and services produced in that area. The larger number of employers would also make it easier for workers to find jobs, especially more specialised skilled workers who are able to ask for higher wages as more employers compete for their skills. Greater population density may also imply more resources and demand for investment in infrastructure – including Internet access, in line with my first research topic – which would boost labour productivity and thus income per person.

Notably, the World Bank has highlighted a strong correlation between urbanisation rates and GDP per capita, noting that almost all economies hit urbanisation rates of at least 50% before they achieved middle-income status and that all of the economies considered to be high-income have urbanisation rates of 70-80% (Spence et al, 2009). Chen et al (2014) also acknowledge the relationship between urbanisation rates and GDP per capita, but caution that simply increasing the number of people living in urban areas does not guarantee economic growth if agglomeration effects such as institutional and educational development fail to materialise.

Bibliography

Chen, M., Zhang, H., Liu, W., & Zhang, W. (2014). The Global Pattern of Urbanization and Economic Growth: Evidence from the Last Three Decades. PLoS ONE, 9(8), e103799. https://doi.org/10.1371/journal.pone.0103799

Edquist, H., Goodridge, P. R., Haskel, J., Li, X., & Lindquist, E. (2017). How important are mobile broadband networks for global economic development? Imperial College Business School. http://hdl.handle.net/10044/1/46208

Mayer, W., Madden, G., & Wu, C. (2020). Broadband and economic growth: a reassessment. Information Technology for Development, 26(1), 128-145.

Minges, M. (2015, January). Exploring the relationship between broadband and economic growth. The World Bank. http://documents.worldbank.org/curated/en/178701467988875888/pdf/102955-WP-Box394845B-PUBLIC-WDR16-BP-Exploring-the-Relationship-between-Broadband-and-Economic-Growth-Minges.pdf

Spence, M., Annez, P. C., & Buckley, R. M. (2009). Urbanization and growth. The World Bank.
