miguelsorensen-blog
Data Science Rookie
4 posts
miguelsorensen-blog 8 years ago
Week 2, first program (Python)
It was hard to catch all the errors, but I think I finally got it. I don't know how to attach my .py file, so I have pasted the code here. This is the code I wrote in Python:
import pandas
import numpy

pandas.set_option('display.float_format', lambda x:'%f'%x)

mydata = pandas.read_csv('science_dataset.csv', low_memory=False)
print(len(mydata))
print(len(mydata.columns))

# Convert numeric data loaded as text variables to numeric variables
mydata['SP.POP.SCIE.RD.P6'] = pandas.to_numeric(mydata['SP.POP.SCIE.RD.P6'])
mydata['SP.POP.TECH.RD.P6'] = pandas.to_numeric(mydata['SP.POP.TECH.RD.P6'])
mydata['IP.JRN.ARTC.SC'] = pandas.to_numeric(mydata['IP.JRN.ARTC.SC'])
mydata['BX.GSR.ROYL.CD'] = pandas.to_numeric(mydata['BX.GSR.ROYL.CD'])
mydata['BM.GSR.ROYL.CD'] = pandas.to_numeric(mydata['BM.GSR.ROYL.CD'])
mydata['IP.PAT.RESD'] = pandas.to_numeric(mydata['IP.PAT.RESD'])
mydata['LO.PISA.MAT.0'] = pandas.to_numeric(mydata['LO.PISA.MAT.0'])
mydata['LO.PISA.REA.0'] = pandas.to_numeric(mydata['LO.PISA.REA.0'])
mydata['LO.PISA.SCI.0'] = pandas.to_numeric(mydata['LO.PISA.SCI.0'])
mydata['SE.PRM.ENRR'] = pandas.to_numeric(mydata['SE.PRM.ENRR'])
mydata['SE.SEC.ENRR'] = pandas.to_numeric(mydata['SE.SEC.ENRR'])

# Data counts and percentages
print("counts for Researchers per million people") # act as label for data group
c1=mydata["SP.POP.SCIE.RD.P6"].value_counts(sort=False, dropna=False) # counts values and group them
print(c1) # prints the groups
print() #this statement put a blank line between groups of data

print("percentage for Researchers per million people")
p1=mydata["SP.POP.SCIE.RD.P6"].value_counts(sort=False, dropna=False, normalize=True)
print(p1)
print()

print("counts for Technicians per million people")
c2=mydata["SP.POP.TECH.RD.P6"].value_counts(sort=False, dropna=False)
print(c2)
print()

print("percentage for Technicians per million people")
p2=mydata["SP.POP.TECH.RD.P6"].value_counts(sort=False, dropna=False, normalize=True)
print(p2)
print()

print("counts for Journal articles")
c3=mydata["IP.JRN.ARTC.SC"].value_counts(sort=False, dropna=False)
print(c3)
print()

print("percentage for Journal articles")
p3=mydata["IP.JRN.ARTC.SC"].value_counts(sort=False, dropna=False, normalize=True)
print(p3)
print()

print("counts for Royalties receipts, millions of USD")
c4=mydata["BX.GSR.ROYL.CD"].value_counts(sort=False, dropna=False)
print(c4)
print()

print("percentage for Royalties receipts, millions of USD")
p4=mydata["BX.GSR.ROYL.CD"].value_counts(sort=False, dropna=False, normalize=True)
print(p4)
print()

print("counts for Royalties payments, millions of USD")
c5=mydata["BM.GSR.ROYL.CD"].value_counts(sort=False, dropna=False)
print(c5)
print()

print("percentage for Royalties payments, millions of USD")
p5=mydata["BM.GSR.ROYL.CD"].value_counts(sort=False, dropna=False, normalize=True)
print(p5)
print()

print("counts for patents per year")
c6=mydata["IP.PAT.RESD"].value_counts(sort=False, dropna=False)
print(c6)
print()

print("percentage for patents per year")
p6=mydata["IP.PAT.RESD"].value_counts(sort=False, dropna=False, normalize=True)
print(p6)
print()

print("counts for Students at lowest proficiency on PISA, math")
c7=mydata["LO.PISA.MAT.0"].value_counts(sort=False, dropna=False)
print(c7)
print()

print("percentage for Students at lowest proficiency on PISA, math")
p7=mydata["LO.PISA.MAT.0"].value_counts(sort=False, dropna=False, normalize=True)
print(p7)
print()

print("counts for Students at lowest proficiency on PISA, reading")
c8=mydata["LO.PISA.REA.0"].value_counts(sort=False, dropna=False)
print(c8)
print()

print("percentage for Students at lowest proficiency on PISA, reading")
p8=mydata["LO.PISA.REA.0"].value_counts(sort=False, dropna=False, normalize=True)
print(p8)
print()

print("counts for Students at lowest proficiency on PISA, science")
c9=mydata["LO.PISA.SCI.0"].value_counts(sort=False, dropna=False)
print(c9)
print()

print("percentage for Students at lowest proficiency on PISA, science")
p9=mydata["LO.PISA.SCI.0"].value_counts(sort=False, dropna=False, normalize=True)
print(p9)
print()

print("counts for primary enrollment (as %)")
c10=mydata["SE.PRM.ENRR"].value_counts(sort=False, dropna=False)
print(c10)
print()

print("percentage for primary enrollment (%)")
p10=mydata["SE.PRM.ENRR"].value_counts(sort=False, dropna=False, normalize=True)
print(p10)
print()

print("counts for secondary enrollment (as %)")
c11=mydata["SE.SEC.ENRR"].value_counts(sort=False, dropna=False)
print(c11)
print()

print("percentage for secondary enrollment (%)")
p11=mydata["SE.SEC.ENRR"].value_counts(sort=False, dropna=False, normalize=True)
print(p11)

# I wanted to use groupby() because it orders my values, but it doesn't take NaNs, so I stuck with value_counts().

My dataset isn't very big, but it is highly variable. All the outputs are long because few values repeat. Here are the first three (number of researchers, number of technicians and number of scientific articles):

counts for Researchers per million people
nan            104
157.000000     2
168.000000     1
101.000000     1
52.000000      1
362.000000     1
4176.000000    1
166.000000     1
267.000000     1
165.000000     2
698.000000     1
47.000000      1
50.000000      2
428.000000     1
3136.000000    1
152.000000     1
358.000000     1
70.000000      1
750.000000     1
180.000000     1
682.000000     1
45.000000      1
34.000000      1
39.000000      3
303.000000     1
27.000000      1
90.000000      2
68.000000      1
734.000000     1
231.000000     1
6658.000000    1
585.000000     1
3418.000000    1
2716.000000    1
691.000000     1
6868.000000    1
3732.000000    1
2052.000000    1
673.000000     1
4577.000000    1
857.000000     1
597.000000     1
1026.000000    1
1282.000000    1
6986.000000    1
4201.000000    1
2719.000000    1
6899.000000    1
4481.000000    1
8255.000000    1
2007.000000    1
4478.000000    1
1053.000000    1
1803.000000    1
1465.000000    1
1157.000000    1
7198.000000    1
2133.000000    1
4019.000000    1
4519.000000    1
Name: SP.POP.SCIE.RD.P6, Length: 115, dtype: int64

percentage for Researchers per million people
nan            0.460177
157.000000     0.008850
168.000000     0.004425
101.000000     0.004425
52.000000      0.004425
362.000000     0.004425
4176.000000    0.004425
166.000000     0.004425
267.000000     0.004425
165.000000     0.008850
698.000000     0.004425
47.000000      0.004425
50.000000      0.008850
428.000000     0.004425
3136.000000    0.004425
152.000000     0.004425
358.000000     0.004425
70.000000      0.004425
750.000000     0.004425
180.000000     0.004425
682.000000     0.004425
45.000000      0.004425
34.000000      0.004425
39.000000      0.013274
303.000000     0.004425
27.000000      0.004425
90.000000      0.008850
68.000000      0.004425
734.000000     0.004425
231.000000     0.004425
6658.000000    0.004425
585.000000     0.004425
3418.000000    0.004425
2716.000000    0.004425
691.000000     0.004425
6868.000000    0.004425
3732.000000    0.004425
2052.000000    0.004425
673.000000     0.004425
4577.000000    0.004425
857.000000     0.004425
597.000000     0.004425
1026.000000    0.004425
1282.000000    0.004425
6986.000000    0.004425
4201.000000    0.004425
2719.000000    0.004425
6899.000000    0.004425
4481.000000    0.004425
8255.000000    0.004425
2007.000000    0.004425
4478.000000    0.004425
1053.000000    0.004425
1803.000000    0.004425
1465.000000    0.004425
1157.000000    0.004425
7198.000000    0.004425
2133.000000    0.004425
4019.000000    0.004425
4519.000000    0.004425
Name: SP.POP.SCIE.RD.P6, Length: 115, dtype: float64

counts for Technicians per million people
nan            122
40.000000      1
34.000000      2
39.000000      1
319.000000     1
17.000000      2
26.000000      1
48.000000      1
69.000000      1
444.000000     1
37.000000      1
8.000000       1
314.000000     1
290.000000     1
762.000000     1
676.000000     1
178.000000     1
78.000000      1
355.000000     1
33.000000      1
421.000000     1
134.000000     1
30.000000      1
568.000000     1
18.000000      1
101.000000     2
186.000000     1
25.000000      1
1.000000       2
998.000000     1
1479.000000    1
61.000000      1
2394.000000    1
2028.000000    1
6.000000       1
193.000000     1
9.000000       1
58.000000      1
207.000000     1
14.000000      1
191.000000     1
1248.000000    1
175.000000     1
63.000000      1
866.000000     1
384.000000     1
1134.000000    1
2126.000000    1
645.000000     1
1882.000000    1
681.000000     1
1822.000000    1
1722.000000    1
691.000000     1
543.000000     1
743.000000     1
597.000000     1
2765.000000    1
1379.000000    1
1241.000000    1
Name: SP.POP.TECH.RD.P6, Length: 97, dtype: int64

percentage for Technicians per million people
nan            0.539823
40.000000      0.004425
34.000000      0.008850
39.000000      0.004425
319.000000     0.004425
17.000000      0.008850
26.000000      0.004425
48.000000      0.004425
69.000000      0.004425
444.000000     0.004425
37.000000      0.004425
8.000000       0.004425
314.000000     0.004425
290.000000     0.004425
762.000000     0.004425
676.000000     0.004425
178.000000     0.004425
78.000000      0.004425
355.000000     0.004425
33.000000      0.004425
421.000000     0.004425
134.000000     0.004425
30.000000      0.004425
568.000000     0.004425
18.000000      0.004425
101.000000     0.008850
186.000000     0.004425
25.000000      0.004425
1.000000       0.008850
998.000000     0.004425
1479.000000    0.004425
61.000000      0.004425
2394.000000    0.004425
2028.000000    0.004425
6.000000       0.004425
193.000000     0.004425
9.000000       0.004425
58.000000      0.004425
207.000000     0.004425
14.000000      0.004425
191.000000     0.004425
1248.000000    0.004425
175.000000     0.004425
63.000000      0.004425
866.000000     0.004425
384.000000     0.004425
1134.000000    0.004425
2126.000000    0.004425
645.000000     0.004425
1882.000000    0.004425
681.000000     0.004425
1822.000000    0.004425
1722.000000    0.004425
691.000000     0.004425
543.000000     0.004425
743.000000     0.004425
597.000000     0.004425
2765.000000    0.004425
1379.000000    0.004425
1241.000000    0.004425
Name: SP.POP.TECH.RD.P6, Length: 97, dtype: float64

counts for Journal articles
27.000000         1
184.000000        1
nan               22
6.000000          3
23.000000         1
2.000000          3
482.000000        1
16.000000         4
210.000000        2
39.000000         2
10.000000         1
187.000000        1
29.000000         1
89.000000         1
437.000000        1
165.000000        1
120.000000        1
177.000000        1
84.000000         2
8.000000          3
12.000000         4
7.000000          4
14.000000         1
137.000000        1
277.000000        1
169.000000        1
1548.000000       1
11.000000         3
256.000000        1
35.000000         1
..
9679.000000       1
5169.000000       1
720932.000000     1
48622.000000      1
12031.000000      1
1971.000000       1
646082.000000     1
7636.000000       1
20164.000000      1
7244.000000       1
6994.000000       1
3514.000000       1
7772.000000       1
1679.000000       1
80219.000000      1
10659.000000      1
19362.000000      1
16511.000000      1
47806.000000      1
11164.000000      1
30412.000000      1
6874.000000       1
470427.000000     1
8631.000000       1
1400796.000000    1
4359.000000       1
85554.000000      1
4207.000000       1
141887.000000     1
412542.000000     1
Name: IP.JRN.ARTC.SC, Length: 171, dtype: int64

percentage for Journal articles
27.000000         0.004425
184.000000        0.004425
nan               0.097345
6.000000          0.013274
23.000000         0.004425
2.000000          0.013274
482.000000        0.004425
16.000000         0.017699
210.000000        0.008850
39.000000         0.008850
10.000000         0.004425
187.000000        0.004425
29.000000         0.004425
89.000000         0.004425
437.000000        0.004425
165.000000        0.004425
120.000000        0.004425
177.000000        0.004425
84.000000         0.008850
8.000000          0.013274
12.000000         0.017699
7.000000          0.017699
14.000000         0.004425
137.000000        0.004425
277.000000        0.004425
169.000000        0.004425
1548.000000       0.004425
11.000000         0.013274
256.000000        0.004425
35.000000         0.004425
9679.000000       0.004425
5169.000000       0.004425
720932.000000     0.004425
48622.000000      0.004425
12031.000000      0.004425
1971.000000       0.004425
646082.000000     0.004425
7636.000000       0.004425
20164.000000      0.004425
7244.000000       0.004425
6994.000000       0.004425
3514.000000       0.004425
7772.000000       0.004425
1679.000000       0.004425
80219.000000      0.004425
10659.000000      0.004425
19362.000000      0.004425
16511.000000      0.004425
47806.000000      0.004425
11164.000000      0.004425
30412.000000      0.004425
6874.000000       0.004425
470427.000000     0.004425
8631.000000       0.004425
1400796.000000    0.004425
4359.000000       0.004425
85554.000000      0.004425
4207.000000       0.004425
141887.000000     0.004425
412542.000000     0.004425
Name: IP.JRN.ARTC.SC, Length: 171, dtype: float64

At this point I can't get much out of the displayed data. The function I used doesn't order the values, so I can't see the range of the data (i.e. the lowest and highest values).
I could use describe(), head(), tail() and sort(), but I'm not sure whether I can "cheat" by using functions not yet covered in the course. The most obvious characteristic of the data is its wide spread: among the values shown, researchers and technicians run from near zero up to the thousands, and journal articles from single digits to over a million (the rest of my variables are similarly spread out). Since almost every value is unique (apart from the NaNs), I should find a way to group the values, but I can't do that yet because we haven't covered those techniques.
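For what it's worth, here is a rough sketch of how the range and the grouping could be looked at in pandas. It runs on a small invented Series, not on the real dataset, and the bin edges are arbitrary assumptions chosen only for illustration. It uses sort_index() and sort_values() because, as far as I know, recent pandas versions no longer provide sort(), and it relies on the dropna=False option of groupby(), which I believe needs pandas 1.1 or later.

```python
import numpy
import pandas

# Invented stand-in for one column of the dataset; the real values differ.
researchers = pandas.Series(
    [157.0, 39.0, 39.0, 4176.0, numpy.nan, 8255.0, 27.0, numpy.nan, 698.0, 39.0],
    name="SP.POP.SCIE.RD.P6",
)

# Frequency table ordered by value rather than by order of appearance,
# so the lowest and highest values sit at the two ends (NaN goes last).
counts_sorted = researchers.value_counts(dropna=False).sort_index()
print(counts_sorted)

# groupby() with dropna=False (pandas 1.1 or later) keeps NaN as its own group.
by_value = researchers.groupby(researchers, dropna=False).size()
print(by_value)

# Summary statistics and the three smallest values; describe() skips NaNs.
print(researchers.describe())
print(researchers.sort_values().head(3))

# Collapse the many unique values into a few ranges; these edges are arbitrary.
edges = [0, 100, 1000, 5000, 10000]
binned = pandas.cut(researchers, bins=edges)
print(binned.value_counts(sort=False, dropna=False))
```

With the values grouped into a handful of ranges, the frequency table becomes short enough to read, and the NaN row still shows how much data is missing.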
miguelsorensen-blog 8 years ago
Coursera Data Visualization - First Assignment
Specific topic of interest: science and technology production
Data source: The World Bank - http://www.worldbank.org/
Dataset: 5.13_Science_and_technology.xls - http://wdi.worldbank.org/table/5.13
Second topic of interest: quality of primary and secondary education
Datasets: 2.8_Participation_in_education.xls (http://wdi.worldbank.org/table/2.8) and 2.10_Education_completion_and_outcomes.xls (http://wdi.worldbank.org/table/2.10)
My data are subsets of these three datasets.
Innovative countries tend to have higher living standards. Although many factors are involved, innovation is sustained through Research and Development (R&D), which is measured by the number of people working in R&D and by their output in terms of publications, especially in scientific journals. Much of this R&D is funded with public money, which is seen as a beneficial use of that money (The economic benefits of publicly funded basic research: a critical review https://doi.org/10.1016/S0048-7333(00)00091-3). There are at least four types or paradigms of innovation, and R&D is one of them (Four types of R&D http://www.uis.no/getfile.php/Forskning/Senter%20for%20Innovasjonsforskning/Presentation%20Four%20Types%20of%20R%26D%20Darius.pdf).
Work on the relationship between education and research has focused mainly on higher education (The relationship between research and education: typologies and indicators. A literature review https://brage.bibsys.no/xmlui/bitstream/id/418864/NIFUreport2016-8.pdf). I want to know whether there is a relationship between R&D and primary and secondary education. It seems plausible to me, since higher education must be built on basic education. I suppose that poor financial support for basic education will correspond to low R&D and innovation output, and that low levels of enrollment in basic education will likewise correspond to low R&D and innovation output.
The datasets I downloaded have variables with long names, so I had to make up my own names, each one a string without spaces. For the first topic I chose the variables 'researcherspermill', 'techniciaspermill', 'journalarticles', 'patents', 'intelpropreceipts' and 'intelproppayments'. That is a lot of variables, I know, but I can't decide which one most closely represents the innovation/R&D state of a country. It is probably a mix of all of them. For the second topic I chose the variables 'pisamath15-1', 'pisareading15-1b', 'pisascience15-1', 'schoolenrollprigross', and 'schoolenrollsecgross'. My codebook is accessible at https://drive.google.com/open?id=0B-VoPHU1OD_FU3BUNnpoY2NiLWM
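As a small illustration of the renaming step, the sketch below maps long indicator codes to the short names listed above using DataFrame.rename. The pairing of codes and names is my assumption, based on the labels used in the Week 2 program, and the tiny frame is invented only so that the example runs on its own.

```python
import pandas

# Assumed pairing of World Bank indicator codes with the short names above.
short_names = {
    "SP.POP.SCIE.RD.P6": "researcherspermill",
    "SP.POP.TECH.RD.P6": "techniciaspermill",
    "IP.JRN.ARTC.SC": "journalarticles",
    "IP.PAT.RESD": "patents",
    "BX.GSR.ROYL.CD": "intelpropreceipts",
    "BM.GSR.ROYL.CD": "intelproppayments",
}

# Invented two-row frame standing in for the downloaded data.
wdi = pandas.DataFrame(
    {"SP.POP.SCIE.RD.P6": [157.0, 39.0], "IP.JRN.ARTC.SC": [27.0, 184.0]}
)

# rename() returns a new frame; codes missing from the frame are ignored.
renamed = wdi.rename(columns=short_names)
print(list(renamed.columns))
```

Keeping the mapping in one dictionary means the codebook and the code can be checked against each other in a single place.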
miguelsorensen-blog 8 years ago
First post
This is a dummy post just to see what it looks like. This blog was created as a requirement for the Data Visualization course offered by Wesleyan University via Coursera.