ghorbelma - Tumblr blog

ghorbelma · 6 years ago


Machine Learning for Data Analysis, Week 4: K-means cluster analysis

Please find the k-means cluster assignment at the link below

https://drive.google.com/file/d/1be470fUZbsKaxqYMOOvclEh5NByda4FW/view?usp=sharing

ghorbelma · 6 years ago

Machine Learning Week 3: Lasso regression

Please find the lasso regression assignment at the link below

https://drive.google.com/file/d/1J4MObHyFt6Iev7yXHDIXDmYTVgx4iSuv/view?usp=sharing

ghorbelma · 6 years ago

Regression Modeling in Practice Week 4: Logistic regression

Please find the assignment at the link below:

https://drive.google.com/file/d/1T56FqqvDy2_rjKPJfIxalgIjSja61k4v/view?usp=sharing

ghorbelma · 6 years ago

Regression Modeling in Practice Week 3 Multiple regression

Please find the multiple regression assignment at the link below

https://drive.google.com/file/d/1DiYr4gSxSrYBF8tQk-iNtiHFxWD5B7WX/view?usp=sharing

ghorbelma · 6 years ago

Machine Learning for Data Analysis Week 2 Random Forest

Please find the assignment at the Google Drive link below

https://drive.google.com/file/d/1-2IbyRi_wLN76v_-iq5nA-_VCi8qCaOJ/view?usp=sharing

ghorbelma · 6 years ago

Machine Learning for Data Analysis Week 1: Decision tree

Please find the assignment at the Google Drive link below

https://drive.google.com/file/d/1Mp5jCsURutxcXkmQv2ZNrHwh0ccdtz8W/view?usp=sharing

ghorbelma · 6 years ago

Regression Modeling in Practice Week 2 Test a Basic Linear Regression Model

Please find the assignment at the Google Drive link below

Thank you

https://drive.google.com/file/d/1Q1TyuvUF558QU86bi0YBJaaTz7oP_ra1/view?usp=sharing

ghorbelma · 6 years ago

Regression Modeling in Practice Week 1: Writing About Your Data

Codebook: The Outlook on Life Survey (OOL)

Sample

The sample comes from the two 2012 Outlook on Life Surveys, which were designed to study political and social attitudes in the United States.

There were 2,294 participants in the first survey and 1,601 in the second.

The target population was non-institutionalized adults, 18 years of age and older, residing in the United States. It comprised four groups: African American/Black males, African American/Black females, White/other race males, and White/other race females.

The survey considered the ways in which social class, ethnicity, marital status, feminism, religiosity, political orientation, sexual behavior, and cultural beliefs or stereotypes influence opinion and behavior.

Procedure

Participants were drawn from the GfK Knowledge Network, a web panel designed to be representative of the United States population. Panel members are randomly recruited through probability-based sampling, and households are provided with access to the Internet and hardware if needed. Random-digit dialing and address-based sampling methodologies are used. The data were collected in a cross-sectional design with a cross-sectional ad-hoc follow-up.

Measures

The explanatory variable is the group (one of the four listed above) to which the participant belongs.

The response variable is the political or social behaviour of the participant.

Possible answers were divided into categories and presented to participants. Answers could take the form of a frequency, a quantity, or yes/no. For instance, for the question:

How many days in the past week did you watch national news programs on television or on the Internet?

the answers could be none, a number from 1 to 7, or refused.

For the question:

Do you approve or disapprove of the way Barack Obama is handling his job as President?

the answers are yes/no/refused.
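As a rough illustration of how such categorical answers might be prepared for analysis, the sketch below recodes the "days per week" question into numbers and treats refusals as missing. The raw labels ("None", "Refused") are assumptions made for the example; the actual OOL codebook may code these answers differently.

```python
import pandas as pd

# Hypothetical raw answers; the real OOL data may use different labels or numeric codes.
answers = pd.Series(["None", "3", "7", "Refused", "1"])

# "None" means zero days. Anything that is not a number (here "Refused")
# becomes NaN because of errors="coerce".
days = pd.to_numeric(answers.replace("None", "0"), errors="coerce")

# Frequency table that keeps the missing answers visible.
print(days.value_counts(dropna=False).sort_index())
```

Keeping refusals as NaN rather than as a numeric code means they are left out of means and correlations by default.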

ghorbelma · 6 years ago

Data Analysis Tools Week 4 Potential Moderator

https://drive.google.com/file/d/1fGqrD5IvAp_e-BdkTtX-fj5LnlhOahIP/view?usp=sharing

ghorbelma · 6 years ago

Data Analysis Tools Week 3: Pearson Correlation

Please find on Google Drive the week 3 assignment of Data Analysis Tools, on Pearson correlation

https://drive.google.com/open?id=1V2q2HBPt4qFwV0ZOkjCMnHx6ZEUgxEbq

ghorbelma · 6 years ago

Data Analysis Tools: Running a chi-square (χ²) test

Please find at the link below the week 2 assignment of Data Analysis Tools

https://drive.google.com/file/d/1IRiwea4RYksyeYm4Clocg1I03YAn5X-Q/view?usp=sharing

ghorbelma · 6 years ago

Data Analysis Tools: Running an analysis of variance

Please find in the document below the assignment of week 1 of data analysis tools

https://drive.google.com/file/d/1FsskHzW2IzWil8vrPc9cflS2VU8sJOls/view?usp=sharing

ghorbelma · 6 years ago

Data Management and Visualization Week 4 Creating graphs for your data

Please find attached below the document showing the results

https://drive.google.com/file/d/1fmdYIn_TLOlYYvHvDX_QAh8ri8iecund/view?usp=sharing

ghorbelma · 6 years ago

Data Management and Visualization: Making Data Management Decisions

DataSet: Gapminder

Variables: country, democracy score, employ rate and female employ rate.

Some democracy scores are missing; I displayed the frequency by quantile.

Employ rate values are correct (between 0 and 100%), but some rows are empty.

Female employ rate has some dummy values (when I displayed the quantiles, the upper bound of the last quantile was greater than 100%), and some data is missing.

I checked column by column for empty cells (I can analyze the relationship between the variables only if, for a given country, all the figures are non-empty).

I checked for dummy values (employ rate and female employ rate should be between 0 and 100%) and replaced them with NaN.

Finally, I constructed a new dataframe containing only the rows with non-empty and valid data.

######THE CODE####################

import pandas
import numpy

# The file uses ';' as the field separator
data = pandas.read_csv("Gapminder.csv", sep=";")

print("Total Contries: " + str(len(data)))
print("Total Columns " + str(len(data.columns)))

# convert_objects is deprecated; use to_numeric instead
# data["employrate"] = data["employrate"].convert_objects(convert_numeric=True)
# NaN values are kept
# First browse all the columns, then look at a subgroup (sample)
# Frequency tables are sorted by value (index), not by frequency of occurrence

##############POLITYSCORE##############

data["polityscore"] = pandas.to_numeric(data["polityscore"], errors='coerce')

democracy_Score_Data = data["polityscore"].copy()

# Count the countries in each quartile of the democracy score
democracy_Score_Frequency = pandas.qcut(
    democracy_Score_Data, 4, duplicates='drop',
    labels=["0-25%", "25%-50%", "50%-75%", "75%-100%"]
).value_counts(sort=True, dropna=True).sort_index()
print('\n')
print('Democracy score summary data')
print(democracy_Score_Data.describe())
print('\n')
print("Democracy score quantile")
print(democracy_Score_Frequency)

##############EMPLOY RATE##############
print('\n')
print("Count for employrate: 2007 total employees age 15+ (% of population)")
data["employrate"] = pandas.to_numeric(data["employrate"], errors='coerce')
# Discretize into 10 equal-width intervals
data_employRate_discrete = pandas.cut(data["employrate"], 10)

employ_Rate_Frequency = data_employRate_discrete.value_counts(sort=True, dropna=False).sort_index()

print(employ_Rate_Frequency)

##############FEMALEEMPLOYRATE##############
print("Count for female employrate: 2007 total female employees age 15+ (% of population)")
data["femaleemployrate"] = pandas.to_numeric(data["femaleemployrate"], errors='coerce')

femaleEmployRateData = data["femaleemployrate"].copy()

# Deciles before correction: the last interval reveals the dummy values
data_femaleemployRate_beforeCorrection = pandas.qcut(data["femaleemployrate"], 10)
female_employ_Rate_Frequency_beforeCorrection = data_femaleemployRate_beforeCorrection.value_counts(sort=True, dropna=False).sort_index()
print('\n')
print("Female employ rate, should be between 0 and 100")
print(female_employ_Rate_Frequency_beforeCorrection)

# Replace every rate above 100% with NaN (boolean-mask assignment on the copy)
femaleEmployRateData[femaleEmployRateData > 100] = numpy.nan
data_femaleemployRate_afterCorrection = pandas.qcut(femaleEmployRateData, 10)

female_employ_Rate_Frequency_afterCorrection = data_femaleemployRate_afterCorrection.value_counts(sort=True, dropna=False).sort_index()
print('\n')
print("Female employ rate, all dummy values greater than 100 were removed")
print(female_employ_Rate_Frequency_afterCorrection)

######### BUILDING The DATA FRAME TO USE########################
print('\n')
print("Constructing the data frame to be used from the gapminder data")
# Subset the variables into a new data frame, sub1 (a copy, so it can be modified safely)
sub1 = data[['country', 'polityscore', 'employrate', 'femaleemployrate']].copy()
print('\n')
print('Check the last 10 element of the dataset, some dummies or missing are there')
print(sub1.tail(10))

def CorrectEmployRate(row):
    # A rate above 100% is a dummy value: replace it with NaN
    if row['femaleemployrate'] > 100:
        return numpy.nan
    else:
        return row['femaleemployrate']

sub1['femaleemployrate'] = sub1.apply(lambda row: CorrectEmployRate(row), axis=1)
print('\n')
print('Dummy female rate after correction again displaying last 10 elements')
print(sub1.tail(10))

print('\n')
print('The Data Frame with all the correct Data')
print('Keep only data with finite value')
# Keep only the rows where all three numeric variables are present
sub2 = sub1[numpy.isfinite(sub1['femaleemployrate'])
            & numpy.isfinite(sub1['polityscore'])
            & numpy.isfinite(sub1['employrate'])]
print(sub2.tail(10))

print('\n')
print('--------REMAINING DATA SUMMARY------------------')
print(sub2.describe())

############The OUTPUT###########

Total Contries: 213
Total Columns 17

Democracy score summary data
count    157.000000
mean       3.766691
std        6.244515
min      -10.000000
25%       -1.000000
50%        6.000000
75%        9.000000
max       10.000000
Name: polityscore, dtype: float64

Democracy score quantile
0-25%       43
25%-50%     36
50%-75%     45
75%-100%    33
Name: polityscore, dtype: int64

Count for employrate: 2007 total employees age 15+ (% of population)
(4.657, 12.582]      5
(12.582, 20.429]     1
(20.429, 28.275]     1
(28.275, 36.121]     1
(36.121, 43.968]    12
(43.968, 51.814]    32
(51.814, 59.661]    50
(59.661, 67.507]    44
(67.507, 75.354]    18
(75.354, 83.2]      13
NaN                 36
Name: employrate, dtype: int64
Count for female employrate: 2007 total female employees age 15+ (% of population)

Female employ rate, should be between 0 and 100
(11.299000000000001, 30.34]    18
(30.34, 37.5]                  18
(37.5, 41.16]                  18
(41.16, 45.2]                  17
(45.2, 48.45]                  18
(48.45, 51.3]                  20
(51.3, 54.69]                  15
(54.69, 60.82]                 18
(60.82, 73.12]                 18
(73.12, 9666891666.667]        18
NaN                            35
Name: femaleemployrate, dtype: int64

Female employ rate, all dummy values greater than 100 were removed
(11.299000000000001, 30.19]    17
(30.19, 37.2]                  17
(37.2, 40.24]                  17
(40.24, 43.92]                 17
(43.92, 47.1]                  18
(47.1, 50.64]                  16
(50.64, 53.66]                 17
(53.66, 58.22]                 17
(58.22, 66.7]                  17
(66.7, 83.3]                   17
NaN                            43
Name: femaleemployrate, dtype: int64

Constructing the data frame to be used from the gapminder data

Check the last 10 element of the dataset, some dummies or missing are there
                country  polityscore  employrate  femaleemployrate
203       United States    10.000000   62.299999         56.000000
204             Uruguay    10.000000   57.500000         46.000000
205          Uzbekistan    -9.000000   57.500000         52.599998
206             Vanuatu          nan         nan               nan
207           Venezuela    -3.000000   59.900002         45.799999
208             Vietnam    -7.000000   71.000000         67.599998
209  West Bank and Gaza          nan   32.000000         11.300000
210               Yemen          nan    6.265789  234864666.666667
211              Zambia     7.000000   61.000000         53.500000
212            Zimbabwe     1.000000   66.800003         58.099998

Dummy female rate after correction again displaying last 10 elements
                country  polityscore  employrate  femaleemployrate
203       United States    10.000000   62.299999         56.000000
204             Uruguay    10.000000   57.500000         46.000000
205          Uzbekistan    -9.000000   57.500000         52.599998
206             Vanuatu          nan         nan               nan
207           Venezuela    -3.000000   59.900002         45.799999
208             Vietnam    -7.000000   71.000000         67.599998
209  West Bank and Gaza          nan   32.000000         11.300000
210               Yemen          nan    6.265789               nan
211              Zambia     7.000000   61.000000         53.500000
212            Zimbabwe     1.000000   66.800003         58.099998

The Data Frame with all the correct Data
Keep only data with finite value
                  country  polityscore  employrate  femaleemployrate
200               Ukraine     7.000000   54.400002         49.400002
201  United Arab Emirates    -8.000000   75.199997         37.299999
202        United Kingdom    10.000000   59.299999         53.099998
203         United States    10.000000   62.299999         56.000000
204               Uruguay    10.000000   57.500000         46.000000
205            Uzbekistan    -9.000000   57.500000         52.599998
207             Venezuela    -3.000000   59.900002         45.799999
208               Vietnam    -7.000000   71.000000         67.599998
211                Zambia     7.000000   61.000000         53.500000
212              Zimbabwe     1.000000   66.800003         58.099998

--------REMAINING DATA SUMMARY------------------
       polityscore  employrate  femaleemployrate
count   152.000000  152.000000        152.000000
mean      3.736842   59.563816         48.292763
std       6.319042   10.127999         14.745075
min     -10.000000   37.400002         12.400000
25%      -2.000000   52.650001         39.599998
50%       6.500000   58.900002         48.549999
75%       9.000000   65.025000         56.325001
max      10.000000   83.199997         83.300003
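The same cleaning decisions (coerce to numbers, blank out rates outside 0 to 100%, keep only complete rows) could probably be written more compactly. The sketch below is one possible variant, not the code used for the assignment: the helper name `clean_rates` and the toy rows are made up for illustration, and only the column names come from the post.

```python
import pandas as pd

def clean_rates(df, rate_cols=("employrate", "femaleemployrate")):
    """Return the rows whose rates lie in [0, 100] and that have no missing value."""
    out = df.copy()
    for col in rate_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
        # Series.where keeps values where the condition holds and puts NaN elsewhere.
        out[col] = out[col].where(out[col].between(0, 100))
    out["polityscore"] = pd.to_numeric(out["polityscore"], errors="coerce")
    return out.dropna(subset=["polityscore", *rate_cols])

# Toy rows (hypothetical, only loosely modelled on the output above; not the real file).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "polityscore": ["10", " ", "7", "-3"],
    "employrate": ["62.3", "6.27", "61.0", " "],
    "femaleemployrate": ["56.0", "234864666.67", "53.5", "45.8"],
})
cleaned = clean_rates(toy)
print(cleaned)
```

Compared with the row-by-row `apply`, `where` and `dropna` work on whole columns, which tends to be shorter and faster on larger files.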

ghorbelma · 6 years ago

Data Management & Visualisation Week 2 Assignment

I chose a subset of the Gapminder data containing four columns (country, polityscore, employ rate and female employ rate).

The polity score is a political score ranging from -10 (dictatorship) to 10 (democracy). 56 entries are missing, and the extreme values (both positive and negative) are the most frequent.

The employ rate and female employ rate are specific to each country, which is why almost all of the figures occur only once. I again chose to include the NaN values in the frequency tables, but I used pandas cut and qcut to discretize the data into intervals (some values in female employ rate are complete dummies, which is why I used qcut there: it automatically splits the data between quantiles).

In the last part, to select a sample from the entire population, I selected the countries (and their data) having both a polity score > 5.0 and an employ rate > 70.0, then displayed their frequency.
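To make the difference between cut and qcut concrete, here is a small sketch on made-up numbers (not Gapminder values). It assumes one absurd outlier, similar in spirit to the dummy female employ rates, and compares equal-width bins with quantile bins.

```python
import pandas as pd

# Toy values (made up for illustration): nine plausible rates and one absurd outlier.
rates = pd.Series([12.0, 25.0, 33.0, 41.0, 48.0, 52.0, 57.0, 63.0, 71.0, 9_000_000.0])

# cut: 3 equal-width intervals between min and max. The outlier stretches the
# range, so every plausible value falls into the first interval.
equal_width = pd.cut(rates, 3).value_counts().sort_index()

# qcut: 3 intervals holding roughly the same number of observations.
# The outlier only widens the last interval.
equal_count = pd.qcut(rates, 3).value_counts().sort_index()

print(equal_width)
print(equal_count)
```

With equal-width bins the frequency table says almost nothing about the plausible values, whereas the quantile bins stay informative and the outlier shows up as an implausible upper bound on the last interval.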

#######################################

2/The Code

import pandas
import numpy

# The file uses ';' as the field separator
data = pandas.read_csv("Gapminder.csv", sep=";")

print("Total Rows: " + str(len(data)))
print("Total Columns " + str(len(data.columns)))

# convert_objects is deprecated; use to_numeric instead
# data["employrate"] = data["employrate"].convert_objects(convert_numeric=True)
# NaN values are kept
# First browse all the columns, then look at a subgroup (sample)
# Frequency tables are sorted by value (index), not by frequency of occurrence

##############POLITYSCORE##############

print("Count for polityscore: 2009 Democracy score (Polity),-10 (dictatorship) to 10 (democracy)")
data["polityscore"] = pandas.to_numeric(data["polityscore"], errors='coerce')

democracy_Score_Frequency = data["polityscore"].value_counts(sort=True, dropna=False).sort_index()

print(democracy_Score_Frequency)

print("percentage of polityscore: 2009 Democracy score (Polity),-10 to 10")
democracy_Score_Percentage = data["polityscore"].value_counts(sort=True, normalize=True, dropna=True)

print(democracy_Score_Percentage)

##############EMPLOYRATE##############
print("Count for employrate: 2007 total employees age 15+ (% of population)")
data["employrate"] = pandas.to_numeric(data["employrate"], errors='coerce')
# Discretize into 10 equal-width intervals
data_employRate_discrete = pandas.cut(data["employrate"], 10)

employ_Rate_Frequency = data_employRate_discrete.value_counts(sort=True, dropna=False).sort_index()

print(employ_Rate_Frequency)

print("percentage of employ rate: 2007 total employees age 15+ (% of population)")
employ_Rate_Percentage = data_employRate_discrete.value_counts(sort=True, normalize=True, dropna=False).sort_index()

print(employ_Rate_Percentage)

##############FEMALEEMPLOYRATE##############
print("Count for female employrate: 2007 total female employees age 15+ (% of population)")
data["femaleemployrate"] = pandas.to_numeric(data["femaleemployrate"], errors='coerce')
# Discretize into deciles (10 quantile-based intervals)
data_femaleemployRate_discrete = pandas.qcut(data["femaleemployrate"], 10)

female_employ_Rate_Frequency = data_femaleemployRate_discrete.value_counts(sort=True, dropna=False).sort_index()

print(female_employ_Rate_Frequency)

print("percentage of female employrate: 2007 total female employees age 15+ (% of population)")
female_employ_Rate_Percentage = data_femaleemployRate_discrete.value_counts(sort=True, normalize=True, dropna=False).sort_index()

print(female_employ_Rate_Percentage)

#######polity score only for democracy (score>5) with employ rate>70

print('Countries having polity score >5 and emply rate>70')

democracy_sample = data[(data['polityscore'] > 5.0) & (data['employrate'] > 70.0)]

sample = democracy_sample.copy()

sample_frequency = sample['polityscore'].value_counts(sort=True, dropna=False).sort_index()

print(sample)
print('printing the sample frequency')
print(sample_frequency)

#######################################

3/The Output

Total Rows: 213
Total Columns 4
Count for polityscore: 2009 Democracy score (Polity),-10 (dictatorship) to 10 (democracy)
-10.000000     2
-9.000000      3
-8.000000      2
-7.000000     12
-6.000000      3
-5.000000      2
-4.000000      5
-3.000000      6
-2.000000      4
-1.000000      4
 0.000000      6
 1.000000      3
 2.000000      3
 2.087848      1
 2.282655      1
 3.000000      2
 4.000000      4
 5.000000      6
 6.000000     10
 7.000000     13
 8.000000     18
 9.000000     14
 10.000000    33
NaN           56
Name: polityscore, dtype: int64
percentage of polityscore: 2009 Democracy score (Polity),-10 to 10
 10.000000    0.210191
 8.000000     0.114650
 9.000000     0.089172
 7.000000     0.082803
-7.000000     0.076433
 6.000000     0.063694
 0.000000     0.038217
-3.000000     0.038217
 5.000000     0.038217
-4.000000     0.031847
-1.000000     0.025478
 4.000000     0.025478
-2.000000     0.025478
 1.000000     0.019108
-6.000000     0.019108
 2.000000     0.019108
-9.000000     0.019108
-5.000000     0.012739
 3.000000     0.012739
-8.000000     0.012739
-10.000000    0.012739
 2.087848     0.006369
 2.282655     0.006369
Name: polityscore, dtype: float64
Count for employrate: 2007 total employees age 15+ (% of population)
(4.657, 12.582]      5
(12.582, 20.429]     1
(20.429, 28.275]     1
(28.275, 36.121]     1
(36.121, 43.968]    12
(43.968, 51.814]    32
(51.814, 59.661]    50
(59.661, 67.507]    44
(67.507, 75.354]    18
(75.354, 83.2]      13
NaN                 36
Name: employrate, dtype: int64
percentage of employ rate: 2007 total employees age 15+ (% of population)
(4.657, 12.582]     0.023474
(12.582, 20.429]    0.004695
(20.429, 28.275]    0.004695
(28.275, 36.121]    0.004695
(36.121, 43.968]    0.056338
(43.968, 51.814]    0.150235
(51.814, 59.661]    0.234742
(59.661, 67.507]    0.206573
(67.507, 75.354]    0.084507
(75.354, 83.2]      0.061033
NaN                 0.169014
Name: employrate, dtype: float64
Count for female employrate: 2007 total female employees age 15+ (% of population)
(11.299000000000001, 30.34]    18
(30.34, 37.5]                  18
(37.5, 41.16]                  18
(41.16, 45.2]                  17
(45.2, 48.45]                  18
(48.45, 51.3]                  20
(51.3, 54.69]                  15
(54.69, 60.82]                 18
(60.82, 73.12]                 18
(73.12, 9666891666.667]        18
NaN                            35
Name: femaleemployrate, dtype: int64
percentage of female employrate: 2007 total female employees age 15+ (% of population)
(11.299000000000001, 30.34]    0.084507
(30.34, 37.5]                  0.084507
(37.5, 41.16]                  0.084507
(41.16, 45.2]                  0.079812
(45.2, 48.45]                  0.084507
(48.45, 51.3]                  0.093897
(51.3, 54.69]                  0.070423
(54.69, 60.82]                 0.084507
(60.82, 73.12]                 0.084507
(73.12, 9666891666.667]        0.084507
NaN                            0.164319
Name: femaleemployrate, dtype: float64
Countries having polity score >5 and emply rate>70
      country  femaleemployrate  employrate  polityscore
19      Benin         58.200001   71.599998          7.0
22    Bolivia         61.599998   70.400002          7.0
29    Burundi         83.300003   83.199997          6.0
97      Kenya         66.599998   73.199997          7.0
115    Malawi         69.000000   71.800003          6.0
150  Paraguay         65.300003   73.099998          8.0
printing the sample frequency
6.0    2
7.0    3
8.0    1
Name: polityscore, dtype: int64

ghorbelma · 6 years ago

Develop a research question

1/ Codebook chosen: Gapminder

Data Set:

*polityscore (2009 Democracy score, Polity)

*femaleemployrate (Percentage of female population, age above 15, that has been employed during the given year)

*employrate (Percentage of total population, age above 15, that has been employed during the given year)

2/ Research Question: Is democracy associated with gender equality (in terms of employment)?

Hypothesis: My belief is that in democracies the female employment rate and the total employment rate should be very close (women have the same access to the labor market).
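One way this hypothesis might be examined later is to compute, per country, the gap between the total and the female employment rate and see how it moves with the democracy score. The sketch below is only an illustration under that assumption: the helper `employment_gap` is hypothetical and the rows are made-up numbers, not Gapminder figures.

```python
import pandas as pd

def employment_gap(df):
    """Add a 'gap' column: total employment rate minus female employment rate."""
    out = df.dropna(subset=["polityscore", "employrate", "femaleemployrate"]).copy()
    out["gap"] = out["employrate"] - out["femaleemployrate"]
    return out

# Made-up rows for illustration only; they are not Gapminder figures.
toy = pd.DataFrame({
    "polityscore": [-8, -3, 2, 7, 10],
    "employrate": [60.0, 58.0, 61.0, 59.0, 62.0],
    "femaleemployrate": [40.0, 44.0, 50.0, 52.0, 58.0],
})
scored = employment_gap(toy)

# Pearson correlation between the democracy score and the gap.
# A negative value would be consistent with the hypothesis (smaller gap in democracies).
r = scored["polityscore"].corr(scored["gap"])
print(scored[["polityscore", "gap"]])
print(round(r, 3))
```

A correlation on real data would still need the dummy and missing values removed first, and it would show association only, not causation.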

3/ Literature review: Keywords: democracy, gender equality, women's labor participation

Many academic works have studied this field: democracy and women's labor participation, the effect of switching from dictatorship to democracy...

Among the studies:

* UNI ScholarWorks from University of Northern Iowa: Women's labor force participation in Spain: An analysis from dictatorship to democracy

* Democracy and Gender Equality (Caroline Beer)

4/ Findings:

Existing research provides conflicting evidence about the relationship between democracy and gender equality. Various studies have found that the level of democracy, as measured by Freedom House, is not significant in determining the percentage of women participating in the labor market, whereas others have provided evidence for the importance of long-term democracy in women's participation in the labor market.

References

Abrams BA, Settle RF. Women's suffrage and the growth of the welfare state. Public Choice. 1999;100(3-4):289–300.

Alvarez SE. Engendering democracy in Brazil: women's movements in politics. Princeton: Princeton University Press; 1990.

Barro RJ, Lee J-W. International data on educational attainment: updates and implications. CID Working Paper No. 042. Harvard University: Cambridge, MA; 2000.

Bollen KA. Cross-national indicators of liberal democracy, 1950-1990. Codebook. University of North Carolina: Chapel Hill, NC; 1998.

Bouvard MG. Revolutionizing motherhood: the mothers of the Plaza de Mayo. Wilmington: Scholarly Resources; 1994.

Brown DS. Democracy and gender inequality in education: a cross-national examination. Br J Polit Sci. 2004;34(1):137–52.

ghorbelma · 6 years ago

Starting Data Analysis and Interpretation Specialization
