Creating graphs for your data
As I noted previously, I chose to work with the Gapminder data set and will focus on three variables:
income per person, measured in terms of GDP per capita (variable: incomeperperson);
rate of Internet use, measured in terms of Internet users per 100 people (variable: internetuserate);
urbanisation rate, measured as a share of the population living in urban areas (variable: urbanrate).
Specifically, I chose to examine the relationship between income per person and the rate of Internet use as well as the relationship between income per person and the urbanization rate. I chose to use Python for this course. The program I wrote to describe the data for this assignment and its full output are at the bottom of this post after the summaries.
Univariate Summaries
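For income per person, the mean is US$8,741 with a standard deviation of US$14,263, which suggests a large spread. As the univariate chart/histogram generated from Python shows, the data is right-skewed: most countries sit at the low end, and a long tail of high-income countries pulls the mean well above the median of US$2,553.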

For the rate of Internet use, the mean is 35.6 Internet users per 100 people with a standard deviation of 27.8, which suggests a large spread. As the chart shows, the data is also right-skewed, though less strongly, with the mean above the median of 31.8.

For the urbanization rate, the mean is 56.8% with a standard deviation of 23.8%, which suggests a large spread. As the chart shows, the data is somewhat bimodal, with two peaks around 40 and 70.
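A quick way to sanity-check the direction of skew without the charts is to compare each mean with its median (the 50% row of the describe() output at the bottom of this post). The sketch below does that with the reported figures and also computes Pearson's second skewness coefficient, 3 * (mean - median) / std. The function name and the dictionary layout are my own choices for illustration, and the coefficient is only a rough rule of thumb, not the sample skewness that pandas would report.

```python
# Summary figures copied from the describe() output: (mean, median, std).
summaries = {
    "incomeperperson": (8740.966076, 2553.496056, 14262.809083),
    "internetuserate": (35.632716, 31.810121, 27.780285),
    "urbanrate": (56.769360, 57.940000, 23.844933),
}


def pearson_median_skew(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


skews = {name: pearson_median_skew(*stats) for name, stats in summaries.items()}

for name, value in skews.items():
    # A positive value means the mean sits above the median (right skew).
    side = "above" if value > 0 else "below"
    print(f"{name}: {value:+.2f} (mean is {side} the median)")
```

The coefficient comes out clearly positive for income per person, mildly positive for the rate of Internet use, and slightly negative for the urbanization rate, whose mean and median are close together.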

Bivariate Summaries
Using Python to create a scatterplot between the rate of Internet use (as the independent variable on the x-axis) and income per person (as the dependent variable on the y-axis), I find there seems to be a positive relationship between the two variables, as suggested by the upward-sloping best-fit line. This appears to support my hypothesis that a higher rate of Internet use is associated with higher income per person.

I also create a scatterplot between the urbanization rate (as the independent variable on the x-axis) and income per person (as the dependent variable on the y-axis). Again, there seems to be a positive relationship, with the best-fit line sloping upwards. This also appears to support my second hypothesis that a higher urbanization rate is associated with higher income per person.

Output
describe income per person
count 190.000000
mean 8740.966076
std 14262.809083
min 103.775857
25% 748.245151
50% 2553.496056
75% 9379.891166
max 105147.437700
Name: incomeperperson, dtype: float64
describe rate of Internet use
count 192.000000
mean 35.632716
std 27.780285
min 0.210066
25% 9.999604
50% 31.810121
75% 56.416046
max 95.638113
Name: internetuserate, dtype: float64
describe urbanization rate
count 203.000000
mean 56.769360
std 23.844933
min 10.400000
25% 36.830000
50% 57.940000
75% 74.210000
max 100.000000
Name: urbanrate, dtype: float64
Program
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)
# converting text data to numeric data
data["incomeperperson"] = data["incomeperperson"].apply(pandas.to_numeric,errors="coerce")
data["internetuserate"] = data["internetuserate"].apply(pandas.to_numeric,errors="coerce")
data["urbanrate"] = data["urbanrate"].apply(pandas.to_numeric,errors="coerce")
# describing the data
print("describe income per person")
desc1=data['incomeperperson'].describe()
print(desc1)
print("\n")
print("describe rate of Internet use")
desc2=data['internetuserate'].describe()
print(desc2)
print("\n")
print("describe urbanization rate")
desc3=data['urbanrate'].describe()
print(desc3)
print("\n")
Making Data Management Decisions
As I noted previously, I chose to work with the Gapminder data set and will focus on three variables:
income per person, measured in terms of GDP per capita (variable: incomeperperson);
rate of Internet use, measured in terms of Internet users per 100 people (variable: internetuserate);
urbanisation rate, measured as a share of the population living in urban areas (variable: urbanrate).
Specifically, I chose to examine the relationship between income per person and the rate of Internet use as well as the relationship between income per person and the urbanization rate. I decided that I did not need to bin or replace any data. However, it would be easier for me to examine the data if I grouped the data.
I chose to use Python for this course. The program I wrote for this assignment and its full output are at the bottom of this post after the summary.
Summary
The earlier Week 2 lessons on creating frequency tables seem to be geared towards categorical data, where there are a limited number of possible responses, rather than quantitative data which could result in virtually limitless readings. As I was studying quantitative data, this meant that the frequencies of almost all of my data were 1 – because it’s quite rare, for example, that a country would have the exact same income per person as another. The main exception was the urbanisation rate, where six countries had an urbanisation rate of 100%: Bermuda, Cayman Islands, Hong Kong, Macao, Monaco, and Singapore.
This is obviously not very useful in helping us to understand the frequency distribution of the data. To make the results more useful, I grouped the data.
For income per person (variable: incomeperperson), I grouped the data into four groups (incomegroup4) corresponding to the World Bank’s income groups. The World Bank defines low-income economies as those with income per person of USD1,025 or less; lower middle-income economies as those between USD1,026 and USD3,995; upper middle-income economies as those between USD3,996 and USD12,375; and high-income economies as those at USD12,376 or more (see link). Based on these thresholds, there were 54 low-income economies (25.4%), 54 lower middle-income economies (25.4%), 41 upper middle-income economies (19.2%), and 41 high-income economies (19.2%). There were 23 countries with no data.
For the rate of Internet use (variable: internetuserate), I grouped the data into five groups (internetgroup5) corresponding to readings of 0-20, 21-40, 41-60, 61-80, and 81-100 Internet users per 100 people. 79 countries had 0-20 Internet users per 100 people (37.1%), 37 countries had 21-40 (17.4%), 30 countries had 41-60 (14.1%), 31 countries had 61-80 (14.6%), while 15 countries had 81-100 (7.0%). There were 21 countries with no data.
For the urbanization rate (variable: urbanrate), I also grouped the data into five groups (urbangroup5) corresponding to readings of 0-20%, 21-40%, 41-60%, 61-80%, and 81-100%. 14 countries had an urbanization rate of 0-20% (6.6%), 46 countries had 21-40% (21.6%), 52 countries had 41-60% (24.4%), 53 countries had 61-80% (24.9%), while 38 countries had 81-100% (17.8%). There were 10 countries with no data.
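One detail worth noting about the grouping: pandas.cut uses right-closed bins by default, so the first income bin in my program is (0, 1026] and a value of exactly 1026 would be counted as low-income, whereas the thresholds above put 1,026 in the lower middle-income group. With continuous data this rarely changes any counts, but the sketch below shows how right=False and labels could line the bins up with the stated thresholds. The sample incomes and the label names are hypothetical.

```python
import pandas

# Hypothetical incomes chosen to sit on or near the thresholds; None is missing data.
income = pandas.Series([500.0, 1026.0, 3995.5, 12376.0, None])

edges = [0, 1026, 3996, 12376, 150000]
labels = ["low", "lower middle", "upper middle", "high"]

# right=False makes each bin include its lower edge and exclude its upper edge,
# e.g. [1026, 3996), so a value of exactly 1026 lands in "lower middle".
groups = pandas.cut(income, edges, right=False, labels=labels)

# dropna=False keeps the missing observation visible in the table.
print(groups.value_counts(sort=False, dropna=False))
```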
Output
frequency counts for data groups
incomegroup4
(0, 1026] 54
(1026, 3996] 54
(3996, 12376] 41
(12376, 150000] 41
dtype: int64
internetgroup5
(0, 21] 79
(21, 41] 37
(41, 61] 30
(61, 81] 31
(81, 100] 15
dtype: int64
urbangroup5
(0, 21] 14
(21, 41] 46
(41, 61] 52
(61, 81] 53
(81, 100] 38
dtype: int64
percentages for data groups
incomegroup4
(0, 1026] 25.352113
(1026, 3996] 25.352113
(3996, 12376] 19.248826
(12376, 150000] 19.248826
dtype: float64
internetgroup5
(0, 21] 37.089202
(21, 41] 17.370892
(41, 61] 14.084507
(61, 81] 14.553991
(81, 100] 7.042254
dtype: float64
urbangroup5
(0, 21] 6.572770
(21, 41] 21.596244
(41, 61] 24.413146
(61, 81] 24.882629
(81, 100] 17.840376
dtype: float64
Program
import pandas
import numpy
data = pandas.read_csv('gapminder.csv', low_memory=False)
# converting text data to numeric data
data["incomeperperson"] = data["incomeperperson"].apply(pandas.to_numeric,errors="coerce")
data["internetuserate"] = data["internetuserate"].apply(pandas.to_numeric,errors="coerce")
data["urbanrate"] = data["urbanrate"].apply(pandas.to_numeric,errors="coerce")
# grouping the data
data['incomegroup4'] = pandas.cut(data.incomeperperson, [0, 1026, 3996, 12376, 150000])
data['internetgroup5'] = pandas.cut(data.internetuserate, [0, 21, 41, 61, 81, 100])
data['urbangroup5'] = pandas.cut(data.urbanrate, [0, 21, 41, 61, 81, 100])
# displaying frequency counts for the data groups
print("frequency counts for data groups")
c1 = data.groupby('incomegroup4').size()
print(c1)
print("\n")
c2 = data.groupby('internetgroup5').size()
print(c2)
print("\n")
c3 = data.groupby('urbangroup5').size()
print(c3)
print("\n")
# displaying percentages for the data groups
print("percentages for data groups")
p1 = data.groupby('incomegroup4').size() * 100/len(data)
print(p1)
print("\n")
p2 = data.groupby('internetgroup5').size() * 100/len(data)
print(p2)
print("\n")
p3 = data.groupby('urbangroup5').size() * 100/len(data)
print(p3)
print("\n")
Running Your First Program
As I noted in the first assignment, I chose to work with the Gapminder data set and will focus on three variables:
income per person, measured in terms of GDP per capita (variable: incomeperperson);
rate of Internet use, measured in terms of Internet users per 100 people (variable: internetuserate);
urbanisation rate, measured as a share of the population living in urban areas (variable: urbanrate).
I chose to use Python for this course. I ran into a few problems. I got a “404 not found” error message when I tried to open the link https://continuum.io/downloads provided in the course readings to download Anaconda. In the end, I used Google to find somewhere else to download Anaconda from. As I was writing the program, I then discovered that I was unable to convert the text data to numeric data using the syntax described in the video lessons for the course. Upon further investigation, it appears that the syntax is outdated and no longer in use. I’m not sure if other students encountered the same problem, but it may have been related to my downloading a newer version of Anaconda instead of the one available from the link provided in the course readings. This was incredibly frustrating and disappointing as it felt like the course was teaching outdated material and it seemed impossible to move forward with the assignment. In the end, I found a workable solution from another student’s post in the discussion forums.
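For reference, the conversion that ended up working relies on pandas.to_numeric with errors="coerce". A minimal sketch is below; it passes the whole column to pandas.to_numeric in one call, which should give the same result as applying it element by element as my program does. The sample values are made up for illustration.

```python
import pandas

# Hypothetical column read in as text, with one entry that is not a number.
raw = pandas.Series(["1.5", "not available", "3"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
numeric = pandas.to_numeric(raw, errors="coerce")

print(numeric)
print("missing values:", numeric.isna().sum())
```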
The program I wrote for this assignment and its full output are at the bottom of this post after the summary.
Summary
First, following the Week 2 video lessons, I used Python to count the observations and columns in the data set: there were 213 observations and 16 columns.
Then I calculated the frequency distributions of the three variables. There was missing data in all three variables I studied: 23 for income per person (variable: incomeperperson), 21 for rate of Internet use (variable: internetuserate), and 10 for urbanisation rate (variable: urbanrate).
The Week 2 lessons on creating frequency tables seem to be geared towards categorical data, where there are a limited number of possible responses, rather than quantitative data which could result in virtually limitless readings. As I was studying quantitative data, this meant that the frequencies of almost all of my data were 1 – because it’s quite rare, for example, that a country would have the exact same income per person as another. The main exception was the urbanisation rate (variable: urbanrate), where six countries had an urbanisation rate of 100%: Bermuda, Cayman Islands, Hong Kong, Macao, Monaco, and Singapore. (The frequency table lists 194 distinct values for the 203 countries with data, so a few other urbanisation rates are also shared by more than one country.)
This is obviously not very useful in helping us to understand the frequency distribution of the data. To make the results more useful, I watched the Week 3 video lessons and learned how to group variables. I included this additional step in my program.
For income per person, I grouped the data into four groups (incomegroup4) corresponding to the World Bank’s income groups. The World Bank defines low-income economies as those with income per person of USD1,025 or less; lower middle-income economies as those between USD1,026 and USD3,995; upper middle-income economies as those between USD3,996 and USD12,375; and high-income economies as those at USD12,376 or more (see link). Based on these thresholds, there were 54 low-income economies (25.4%), 54 lower middle-income economies (25.4%), 41 upper middle-income economies (19.2%), and 41 high-income economies (19.2%). As mentioned above, there were 23 countries with no data.
For the rate of Internet use, I grouped the data into five groups (internetgroup5) corresponding to readings of 0-20, 21-40, 41-60, 61-80, and 81-100 Internet users per 100 people. 79 countries had 0-20 Internet users per 100 people (37.1%), 37 countries had 21-40 (17.4%), 30 countries had 41-60 (14.1%), 31 countries had 61-80 (14.6%), while 15 countries had 81-100 (7.0%). There were 21 countries with no data.
For the urbanization rate, I also grouped the data into five groups (urbangroup5) corresponding to readings of 0-20%, 21-40%, 41-60%, 61-80%, and 81-100%. 14 countries had an urbanization rate of 0-20% (6.6%), 46 countries had 21-40% (21.6%), 52 countries had 41-60% (24.4%), 53 countries had 61-80% (24.9%), while 38 countries had 81-100% (17.8%). There were 10 countries with no data.
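Note that the percentages above are shares of all 213 rows, including the countries with no data, which is why they do not add up to 100%. The sketch below redoes the arithmetic for the income groups using the counts from the frequency table, and also shows the share among countries that have data. The variable names and group labels are my own.

```python
# Counts copied from the incomegroup4 frequency table; the data set has 213 rows.
total_rows = 213
income_counts = {"low": 54, "lower middle": 54, "upper middle": 41, "high": 41}

with_data = sum(income_counts.values())  # countries that fall into some group
missing = total_rows - with_data         # countries with no income data

# Share of all rows (what my program prints) versus share of rows with data.
share_of_all = {k: 100 * v / total_rows for k, v in income_counts.items()}
share_of_valid = {k: 100 * v / with_data for k, v in income_counts.items()}

print(missing, round(share_of_all["low"], 1), round(share_of_valid["low"], 1))
# prints: 23 25.4 28.4
```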
Frequency tables
frequency counts for data groups
incomegroup4
(0, 1026] 54
(1026, 3996] 54
(3996, 12376] 41
(12376, 150000] 41
dtype: int64
internetgroup5
(0, 21] 79
(21, 41] 37
(41, 61] 30
(61, 81] 31
(81, 100] 15
dtype: int64
urbangroup5
(0, 21] 14
(21, 41] 46
(41, 61] 52
(61, 81] 53
(81, 100] 38
dtype: int64
percentages for data groups
incomegroup4
(0, 1026] 25.352113
(1026, 3996] 25.352113
(3996, 12376] 19.248826
(12376, 150000] 19.248826
dtype: float64
internetgroup5
(0, 21] 37.089202
(21, 41] 17.370892
(41, 61] 14.084507
(61, 81] 14.553991
(81, 100] 7.042254
dtype: float64
urbangroup5
(0, 21] 6.572770
(21, 41] 21.596244
(41, 61] 24.413146
(61, 81] 24.882629
(81, 100] 17.840376
dtype: float64
Program
import pandas
import numpy
data = pandas.read_csv('gapminder.csv', low_memory=False)
# displaying the number of observations and columns in the data set
print("number of observations")
print(len(data))
print("\n")
print("number of columns")
print(len(data.columns))
print("\n")
# converting text data to numeric data
data["incomeperperson"] = data["incomeperperson"].apply(pandas.to_numeric,errors="coerce")
data["internetuserate"] = data["internetuserate"].apply(pandas.to_numeric,errors="coerce")
data["urbanrate"] = data["urbanrate"].apply(pandas.to_numeric,errors="coerce")
# displaying frequency counts
print("counts")
c1 = data["incomeperperson"].value_counts(sort=False, dropna=False)
print(c1)
print("\n")
c2 = data["internetuserate"].value_counts(sort=False, dropna=False)
print(c2)
print("\n")
c3 = data["urbanrate"].value_counts(sort=False, dropna=False)
print(c3)
print("\n")
# displaying percentages
print("percentages")
p1 = data["incomeperperson"].value_counts(sort=False, normalize=True)
print(p1)
print("\n")
p2 = data["internetuserate"].value_counts(sort=False, normalize=True)
print(p2)
print("\n")
p3 = data["urbanrate"].value_counts(sort=False, normalize=True)
print(p3)
print("\n")
# displaying frequency count table
print("frequency table")
ct1 = data.groupby('incomeperperson').size()
print(ct1)
print("\n")
ct2 = data.groupby('internetuserate').size()
print(ct2)
print("\n")
ct3 = data.groupby('urbanrate').size()
print(ct3)
print("\n")
# displaying percentage table
print("percentage table")
pt1 = data.groupby('incomeperperson').size() * 100/len(data)
print(pt1)
print("\n")
pt2 = data.groupby('internetuserate').size() * 100/len(data)
print(pt2)
print("\n")
pt3 = data.groupby('urbanrate').size() * 100/len(data)
print(pt3)
print("\n")
# grouping the data
data['incomegroup4'] = pandas.cut(data.incomeperperson, [0, 1026, 3996, 12376, 150000])
data['internetgroup5'] = pandas.cut(data.internetuserate, [0, 21, 41, 61, 81, 100])
data['urbangroup5'] = pandas.cut(data.urbanrate, [0, 21, 41, 61, 81, 100])
# displaying frequency counts for the data groups
print("frequency counts for data groups")
ct4 = data.groupby('incomegroup4').size()
print(ct4)
print("\n")
ct5 = data.groupby('internetgroup5').size()
print(ct5)
print("\n")
ct6 = data.groupby('urbangroup5').size()
print(ct6)
print("\n")
# displaying percentages for the data groups
print("percentages for data groups")
pt4 = data.groupby('incomegroup4').size() * 100/len(data)
print(pt4)
print("\n")
pt5 = data.groupby('internetgroup5').size() * 100/len(data)
print(pt5)
print("\n")
pt6 = data.groupby('urbangroup5').size() * 100/len(data)
print(pt6)
print("\n")
Output
number of observations
213
number of columns
16
counts
NaN 23
18982.269290 1
786.700098 1
1714.942890 1
561.708585 1
..
180.083376 1
1143.831514 1
27595.091350 1
372.728414 1
377.039699 1
Name: incomeperperson, Length: 191, dtype: int64
81.000000 1
66.000000 1
45.000000 1
NaN 21
65.000000 1
..
28.430033 1
6.965038 1
36.562553 1
29.879921 1
31.568098 1
Name: internetuserate, Length: 193, dtype: int64
92.00 1
100.00 6
74.50 1
NaN 10
73.50 1
..
56.02 1
57.18 1
73.92 1
25.46 1
28.38 1
Name: urbanrate, Length: 195, dtype: int64
percentages
220.891248 0.005263
18982.269290 0.005263
786.700098 0.005263
1714.942890 0.005263
561.708585 0.005263
180.083376 0.005263
1143.831514 0.005263
27595.091350 0.005263
372.728414 0.005263
377.039699 0.005263
Name: incomeperperson, Length: 190, dtype: float64
81.000000 0.005208
66.000000 0.005208
45.000000 0.005208
65.000000 0.005208
80.000000 0.005208
28.430033 0.005208
6.965038 0.005208
36.562553 0.005208
29.879921 0.005208
31.568098 0.005208
Name: internetuserate, Length: 192, dtype: float64
92.00 0.004926
100.00 0.029557
74.50 0.004926
73.50 0.004926
17.00 0.004926
56.02 0.004926
57.18 0.004926
73.92 0.004926
25.46 0.004926
28.38 0.004926
Name: urbanrate, Length: 194, dtype: float64
frequency table
incomeperperson
103.775857 1
115.305996 1
131.796207 1
155.033231 1
161.317137 1
..
39972.352770 1
52301.587180 1
62682.147010 1
81647.100030 1
105147.437700 1
Length: 190, dtype: int64
internetuserate
0.210066 1
0.720009 1
0.749996 1
0.829997 1
0.999959 1
..
90.016190 1
90.079527 1
90.703555 1
93.277508 1
95.638113 1
Length: 192, dtype: int64
urbanrate
10.40 1
12.54 1
12.98 1
13.22 1
14.32 1
..
95.64 1
97.36 1
98.32 1
98.36 1
100.00 6
Length: 194, dtype: int64
percentage table
incomeperperson
103.775857 0.469484
115.305996 0.469484
131.796207 0.469484
155.033231 0.469484
161.317137 0.469484
39972.352770 0.469484
52301.587180 0.469484
62682.147010 0.469484
81647.100030 0.469484
105147.437700 0.469484
Length: 190, dtype: float64
internetuserate
0.210066 0.469484
0.720009 0.469484
0.749996 0.469484
0.829997 0.469484
0.999959 0.469484
90.016190 0.469484
90.079527 0.469484
90.703555 0.469484
93.277508 0.469484
95.638113 0.469484
Length: 192, dtype: float64
urbanrate
10.40 0.469484
12.54 0.469484
12.98 0.469484
13.22 0.469484
14.32 0.469484
95.64 0.469484
97.36 0.469484
98.32 0.469484
98.36 0.469484
100.00 2.816901
Length: 194, dtype: float64
frequency counts for data groups
incomegroup4
(0, 1026] 54
(1026, 3996] 54
(3996, 12376] 41
(12376, 150000] 41
dtype: int64
internetgroup5
(0, 21] 79
(21, 41] 37
(41, 61] 30
(61, 81] 31
(81, 100] 15
dtype: int64
urbangroup5
(0, 21] 14
(21, 41] 46
(41, 61] 52
(61, 81] 53
(81, 100] 38
dtype: int64
percentages for data groups
incomegroup4
(0, 1026] 25.352113
(1026, 3996] 25.352113
(3996, 12376] 19.248826
(12376, 150000] 19.248826
dtype: float64
internetgroup5
(0, 21] 37.089202
(21, 41] 17.370892
(41, 61] 14.084507
(61, 81] 14.553991
(81, 100] 7.042254
dtype: float64
urbangroup5
(0, 21] 6.572770
(21, 41] 21.596244
(41, 61] 24.413146
(61, 81] 24.882629
(81, 100] 17.840376
dtype: float64
Getting Your Research Project Started
I chose to work with the Gapminder data set and will focus on the factors affecting income per person in a country, measured in terms of GDP per capita. There appear to be a number of factors in the Gapminder data set/codebook which could determine income per person (variable: incomeperperson).
Rate of Internet use
I am interested in examining whether a higher rate of Internet use – measured in terms of Internet users per 100 people – (variable: internetuserate) is associated with a higher income per person. The causation potentially runs both ways. Greater Internet use is likely linked to greater digitalisation of the economy, which in turn should improve labour productivity and therefore the income that each person can earn. At the same time, greater income per person would likely also allow for an economy to accumulate more savings that can be invested in digitalising the economy and society, thus increasing the rate of Internet use.
A World Bank literature review, for example, highlighted that increased Internet penetration tends to correspond with higher GDP per capita but acknowledged – as noted above – that the direction of causation is unclear and could go both ways (Minges, 2015). Edquist et al (2017) have also estimated that a 10% increase in mobile broadband penetration tends to correspond with higher GDP of 0.6-2.8%. That said, Mayer et al (2020) have argued instead that Internet speed may be more relevant than the rate of Internet use in boosting economic growth. Unfortunately, the Gapminder data set does not include data on Internet speed, so an examination of the influence of Internet speed is outside the scope of this study for now.
Urbanisation rate
The second research topic I am interested in is whether a higher urbanisation rate – measured as a share of the population living in urban areas – (variable: urbanrate) is also associated with a higher income per person. As more people live closer together, they likely enjoy agglomeration effects that boost labour productivity and thus income per person. For example, the larger number of firms and households in a certain area would imply greater demand or a larger market for goods and services produced in that area. The larger number of employers would also make it easier for workers to find jobs, especially more specialised skilled workers who are able to ask for higher wages as more employers compete for their skills. Greater population density may also imply more resources and demand for investment in infrastructure – including Internet access, in line with my first research topic – which would boost labour productivity and thus income per person.
Notably, the World Bank has highlighted a strong correlation between urbanisation rates and GDP per capita, noting that almost all economies hit urbanisation rates of at least 50% before they achieved middle-income status and that all of the economies considered to be high-income have urbanisation rates of 70-80% (Spence et al, 2009). Chen et al (2014) also acknowledge the relationship between urbanisation rates and GDP per capita, but caution that simply increasing the number of people living in urban areas does not guarantee economic growth if agglomeration effects such as institutional and educational development fail to materialise.
Bibliography
Chen, M., Zhang, H., Liu, W., & Zhang, W. (2014). The Global Pattern of Urbanization and Economic Growth: Evidence from the Last Three Decades. PLoS ONE, 9(8), e103799. https://doi.org/10.1371/journal.pone.0103799
Edquist, H., Goodridge, P. R., Haskel, J., Li, X., & Lindquist, E. (2017). How important are mobile broadband networks for global economic development? Imperial College Business School. http://hdl.handle.net/10044/1/46208
Mayer, W., Madden, G., & Wu, C. (2020). Broadband and economic growth: a reassessment. Information Technology for Development, 26(1), 128-145.
Minges, M. (2015, January). Exploring the relationship between broadband and economic growth. The World Bank. http://documents.worldbank.org/curated/en/178701467988875888/pdf/102955-WP-Box394845B-PUBLIC-WDR16-BP-Exploring-the-Relationship-between-Broadband-and-Economic-Growth-Minges.pdf
Spence, M., Annez, P. C., & Buckley, R. M. (2009). Urbanization and growth. The World Bank.