ghorbelma
Data Management & Visualisation
17 posts
ghorbelma · 6 years ago
Machine Learning for Data Analysis Week 4: K-means cluster analysis
Please find the k-means cluster assignment at the link below:
https://drive.google.com/file/d/1be470fUZbsKaxqYMOOvclEh5NByda4FW/view?usp=sharing
ghorbelma · 6 years ago
Machine Learning Week 3: Lasso regression
Please find the lasso regression assignment at the link below:
https://drive.google.com/file/d/1J4MObHyFt6Iev7yXHDIXDmYTVgx4iSuv/view?usp=sharing
ghorbelma · 6 years ago
Regression Modeling in Practice Week 4: Logistic regression
Please find the assignment at the link below:
https://drive.google.com/file/d/1T56FqqvDy2_rjKPJfIxalgIjSja61k4v/view?usp=sharing
ghorbelma · 6 years ago
Regression Modeling in Practice Week 3: Multiple regression
Please find the multiple regression assignment at the link below:
https://drive.google.com/file/d/1DiYr4gSxSrYBF8tQk-iNtiHFxWD5B7WX/view?usp=sharing
ghorbelma · 6 years ago
Machine Learning for Data Analysis Week 2: Random Forest
Please find the assignment at the Google Drive link below:
https://drive.google.com/file/d/1-2IbyRi_wLN76v_-iq5nA-_VCi8qCaOJ/view?usp=sharing
ghorbelma · 6 years ago
Machine Learning for Data Analysis Week 1: Decision tree
Please find the assignment at the Google Drive link below:
https://drive.google.com/file/d/1Mp5jCsURutxcXkmQv2ZNrHwh0ccdtz8W/view?usp=sharing
ghorbelma · 6 years ago
Regression Modeling in Practice Week 2: Test a Basic Linear Regression Model
Please find the assignment at the Google Drive link below.
Thank you
https://drive.google.com/file/d/1Q1TyuvUF558QU86bi0YBJaaTz7oP_ra1/view?usp=sharing
ghorbelma · 6 years ago
Regression Modeling in Practice Week 1: Writing About Your Data
Codebook: The Outlook on Life Survey (OOL)
Sample
The sample comes from the two 2012 Outlook Surveys, which were designed to study political and social attitudes in the United States.
There were 2,294 participants in the first survey and 1,601 in the second.
The target population was non-institutionalized adults aged 18 and older residing in the United States. It comprised four groups: African American/Black males, African American/Black females, White/other race males, and White/other race females.
The survey considered the ways in which social class, ethnicity, marital status, feminism, religiosity, political orientation, sexual behavior, and cultural beliefs or stereotypes influence opinion and behavior.
Procedure
Participants were drawn from the GfK Knowledge Network, a web panel designed to be representative of the United States population. Panel members are randomly recruited through probability-based sampling, using random-digit dialing and address-based sampling methodologies, and households are provided with access to the Internet and hardware if needed. The method of collection was cross-sectional, with a cross-sectional ad-hoc follow-up.
Measures
The explanatory variable is which of the four groups the participant belongs to.
The response variable is the political or social behaviour of the participant.
Participants chose from predefined answer categories. Answers could take the form of a frequency, a quantity or a yes/no response. For instance, for the question:
How many days in the past week did you watch national news programs on television or on the Internet?
the answers could be none, a number from 1 to 7, or refused.
For the question:
Do you approve or disapprove of the way Barack Obama is handling his job as President?
the answers are yes/no/refused.
ghorbelma · 6 years ago
Data Analysis Tools Week 4 Potential Moderator
https://drive.google.com/file/d/1fGqrD5IvAp_e-BdkTtX-fj5LnlhOahIP/view?usp=sharing
ghorbelma · 6 years ago
Data Analysis Tools Week 3: Pearson Correlation
Please find on Google Drive the week 3 assignment for Data Analysis Tools, on Pearson correlation:
https://drive.google.com/open?id=1V2q2HBPt4qFwV0ZOkjCMnHx6ZEUgxEbq
ghorbelma · 6 years ago
Data Analysis Tools: Running a chi-square (χ²) test
Please find below the week 2 assignment for Data Analysis Tools:
https://drive.google.com/file/d/1IRiwea4RYksyeYm4Clocg1I03YAn5X-Q/view?usp=sharing
ghorbelma · 6 years ago
Data Analysis Tools: Running an analysis of variance
Please find in the document below the assignment of week 1 of data analysis tools
https://drive.google.com/file/d/1FsskHzW2IzWil8vrPc9cflS2VU8sJOls/view?usp=sharing
ghorbelma · 6 years ago
Data Management and Visualization Week 4 Creating graphs for your data
Please find attached below the document showing the results
https://drive.google.com/file/d/1fmdYIn_TLOlYYvHvDX_QAh8ri8iecund/view?usp=sharing
ghorbelma · 6 years ago
Data Management and Visualization Making Data Management Decisions
Dataset: Gapminder
Variables: country, democracy score (polityscore), employ rate and female employ rate.
Some democracy scores are missing. I displayed the frequency by quantile.
Employ rate values are valid (between 0 and 100%), but some rows are empty.
Female employ rate has some dummy values (when I displayed the quantiles, the upper bound of the last quantile was greater than 100%), and some data are missing.
I checked column by column for empty cells (I can only analyse the relationship between the variables for countries where all the figures are present).
I checked for dummy values: employ rate and female employ rate should be between 0 and 100%, so I replaced any dummy value with NaN.
Finally, I constructed a new data frame containing only the rows with non-empty and valid data.
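The steps above can be sketched in a few lines of pandas. This is a minimal sketch, not the code used for the assignment: the toy data frame, its values and the helper name clean_rates are assumptions made for illustration, and only the column names come from the Gapminder subset described above.

```python
import pandas as pd

# Toy stand-in for the Gapminder subset (values are illustrative, not real data)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "polityscore": ["10", "-3", "", "7"],
    "employrate": ["62.3", "59.9", "32.0", ""],
    "femaleemployrate": ["56.0", "45.8", "234864666.67", "53.5"],
})

def clean_rates(df):
    """Hypothetical helper: coerce to numbers, blank out impossible rates, drop incomplete rows."""
    out = df.copy()
    numeric_cols = ["polityscore", "employrate", "femaleemployrate"]
    for col in numeric_cols:
        # Anything that cannot be parsed as a number becomes NaN
        out[col] = pd.to_numeric(out[col], errors="coerce")
    for col in ["employrate", "femaleemployrate"]:
        # A percentage outside 0-100 is treated as missing
        out[col] = out[col].where(out[col].between(0, 100))
    # Keep only the countries where every figure is present
    return out.dropna(subset=numeric_cols)

cleaned = clean_rates(toy)
print(cleaned)
```

Using where with between expresses the validity rule once for both rate columns, and dropna replaces the three separate finiteness checks.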
######THE CODE####################
import pandas
import numpy

# The file is semicolon-separated; sep must be passed by keyword in recent pandas
data = pandas.read_csv("Gapminder.csv", sep=";")
print("Total Contries: " + str(len(data)))
print("Total Columns " + str(len(data.columns)))

# convert_objects seems deprecated, use to_numeric instead
# data["employrate"] = data["employrate"].convert_objects(convert_numeric=True)
# NaN values are kept
# First just browse all column data, then look for some subgroup (sample)
# Data are sorted by value (index) and not by frequency of occurrence
##############POLITYSCORE##############
data["polityscore"] = pandas.to_numeric(data["polityscore"], errors='coerce')
democracy_Score_Data = data["polityscore"].copy()
# Split the scores into quartiles and count the countries in each one
democracy_Score_Frequency = pandas.qcut(
    democracy_Score_Data, 4, duplicates='drop',
    labels=["0-25%", "25%-50%", "50%-75%", "75%-100%"]
).value_counts(sort=True, dropna=True).sort_index()
print('\n')
print('Democracy score summary data')
print(democracy_Score_Data.describe())
print('\n')
print("Democracy score quantile")
print(democracy_Score_Frequency)
##############EMPLOY RATE##############
print('\n')
print("Count for employrate: 2007 total employees age 15+ (% of population)")
data["employrate"] = pandas.to_numeric(data["employrate"], errors='coerce')
# Ten equal-width bins
data_employRate_discrete = pandas.cut(data["employrate"], 10)
employ_Rate_Frequency = data_employRate_discrete.value_counts(sort=True, dropna=False).sort_index()
print(employ_Rate_Frequency)
##############FEMALEEMPLOYRATE##############
print("Count for female employrate: 2007 total female employees age 15+ (% of population)")
data["femaleemployrate"] = pandas.to_numeric(data["femaleemployrate"], errors='coerce')
femaleEmployRateData = data["femaleemployrate"].copy()
# Ten quantile-based bins, before removing the dummy values
data_femaleemployRate_beforeCorrection = pandas.qcut(data["femaleemployrate"], 10)
female_employ_Rate_Frequency_beforeCorrection = data_femaleemployRate_beforeCorrection.value_counts(sort=True, dropna=False).sort_index()
print('\n')
print("Female employ rate, should be between 0 and 100")
print(female_employ_Rate_Frequency_beforeCorrection)

# Replace impossible rates (above 100%) with NaN
femaleEmployRateData = femaleEmployRateData.mask(femaleEmployRateData > 100)
data_femaleemployRate_afterCorrection = pandas.qcut(femaleEmployRateData, 10)
female_employ_Rate_Frequency_afterCorrection = data_femaleemployRate_afterCorrection.value_counts(sort=True, dropna=False).sort_index()
print('\n')
print("Female employ rate, all dummy values greater than 100 were removed")
print(female_employ_Rate_Frequency_afterCorrection)
#########  BUILDING The DATA FRAME TO USE########################
print('\n')
print("Constructing the data frame to be used from the gapminder data")
# Subset the variables into a new data frame, sub1 (a copy, so it can be modified safely)
sub1 = data[['country', 'polityscore', 'employrate', 'femaleemployrate']].copy()
print('\n')
print('Check the last 10 element of the dataset, some dummies or missing are there')
print(sub1.tail(10))

def CorrectEmployRate(row):
    # A rate above 100% is impossible, so treat it as missing
    if row['femaleemployrate'] > 100:
        return numpy.nan
    else:
        return row['femaleemployrate']

sub1['femaleemployrate'] = sub1.apply(CorrectEmployRate, axis=1)
print('\n')
print('Dummy female rate after correction again displaying last 10 elements')
print(sub1.tail(10))

print('\n')
print('The Data Frame with all the correct Data')
print('Keep only data with finite value')
# Keep only the rows where all three numeric columns hold a finite value
sub2 = sub1[numpy.isfinite(sub1['femaleemployrate']) & numpy.isfinite(sub1['polityscore']) & numpy.isfinite(sub1['employrate'])]
print(sub2.tail(10))

print('\n')
print('--------REMAINING DATA SUMMARY------------------')
print(sub2.describe())
############The OUTPUT###########
Total Contries: 213
Total Columns 17

Democracy score summary data
count   157.000000
mean      3.766691
std       6.244515
min     -10.000000
25%      -1.000000
50%       6.000000
75%       9.000000
max      10.000000
Name: polityscore, dtype: float64

Democracy score quantile
0-25%       43
25%-50%     36
50%-75%     45
75%-100%    33
Name: polityscore, dtype: int64

Count for employrate: 2007 total employees age 15+ (% of population)
(4.657, 12.582]      5
(12.582, 20.429]     1
(20.429, 28.275]     1
(28.275, 36.121]     1
(36.121, 43.968]    12
(43.968, 51.814]    32
(51.814, 59.661]    50
(59.661, 67.507]    44
(67.507, 75.354]    18
(75.354, 83.2]      13
NaN                 36
Name: employrate, dtype: int64
Count for female employrate: 2007 total female employees age 15+ (% of population)

Female employ rate, should be between 0 and 100
(11.299000000000001, 30.34]    18
(30.34, 37.5]                  18
(37.5, 41.16]                  18
(41.16, 45.2]                  17
(45.2, 48.45]                  18
(48.45, 51.3]                  20
(51.3, 54.69]                  15
(54.69, 60.82]                 18
(60.82, 73.12]                 18
(73.12, 9666891666.667]        18
NaN                            35
Name: femaleemployrate, dtype: int64

Female employ rate, all dummy values greater than 100 were removed
(11.299000000000001, 30.19]    17
(30.19, 37.2]                  17
(37.2, 40.24]                  17
(40.24, 43.92]                 17
(43.92, 47.1]                  18
(47.1, 50.64]                  16
(50.64, 53.66]                 17
(53.66, 58.22]                 17
(58.22, 66.7]                  17
(66.7, 83.3]                   17
NaN                            43
Name: femaleemployrate, dtype: int64

Constructing the data frame to be used from the gapminder data

Check the last 10 element of the dataset, some dummies or missing are there
                country  polityscore  employrate  femaleemployrate
203       United States    10.000000   62.299999         56.000000
204             Uruguay    10.000000   57.500000         46.000000
205          Uzbekistan    -9.000000   57.500000         52.599998
206             Vanuatu          nan         nan               nan
207           Venezuela    -3.000000   59.900002         45.799999
208             Vietnam    -7.000000   71.000000         67.599998
209  West Bank and Gaza          nan   32.000000         11.300000
210               Yemen          nan    6.265789  234864666.666667
211              Zambia     7.000000   61.000000         53.500000
212            Zimbabwe     1.000000   66.800003         58.099998

Dummy female rate after correction again displaying last 10 elements
                country  polityscore  employrate  femaleemployrate
203       United States    10.000000   62.299999         56.000000
204             Uruguay    10.000000   57.500000         46.000000
205          Uzbekistan    -9.000000   57.500000         52.599998
206             Vanuatu          nan         nan               nan
207           Venezuela    -3.000000   59.900002         45.799999
208             Vietnam    -7.000000   71.000000         67.599998
209  West Bank and Gaza          nan   32.000000         11.300000
210               Yemen          nan    6.265789               nan
211              Zambia     7.000000   61.000000         53.500000
212            Zimbabwe     1.000000   66.800003         58.099998

The Data Frame with all the correct Data
Keep only data with finite value
                  country  polityscore  employrate  femaleemployrate
200               Ukraine     7.000000   54.400002         49.400002
201  United Arab Emirates    -8.000000   75.199997         37.299999
202        United Kingdom    10.000000   59.299999         53.099998
203         United States    10.000000   62.299999         56.000000
204               Uruguay    10.000000   57.500000         46.000000
205            Uzbekistan    -9.000000   57.500000         52.599998
207             Venezuela    -3.000000   59.900002         45.799999
208               Vietnam    -7.000000   71.000000         67.599998
211                Zambia     7.000000   61.000000         53.500000
212              Zimbabwe     1.000000   66.800003         58.099998

--------REMAINING DATA SUMMARY------------------
       polityscore  employrate  femaleemployrate
count   152.000000  152.000000        152.000000
mean      3.736842   59.563816         48.292763
std       6.319042   10.127999         14.745075
min     -10.000000   37.400002         12.400000
25%      -2.000000   52.650001         39.599998
50%       6.500000   58.900002         48.549999
75%       9.000000   65.025000         56.325001
max      10.000000   83.199997         83.300003
ghorbelma · 6 years ago
Data Management & Visualisation Week 2 Assignment
I chose a subset of the Gapminder data containing four columns (country, polityscore, employ rate and female employ rate).
The polity score is a political score ranging from -10 (dictatorship) to 10 (democracy). 56 entries are missing, and the extreme values (both positive and negative) are the most frequent.
The employ rate and female employ rate are specific to each country, which is why almost all of the figures occur only once. I again chose to include NaN in the frequency table, but I used pandas cut and qcut to discretize the data into intervals. Some values in female employ rate are completely dummy, which is why I used qcut there: it automatically splits the data between quantiles.
In the last part, to select a sample from the entire population, I selected the countries (and their data) having both a polity score > 5.0 and an employ rate > 70.0, then displayed their frequency.
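To illustrate why qcut copes better than cut when a column contains dummy values, here is a small sketch. The rates series is made up for the example (one absurdly large entry stands in for the corrupted values), and the bin counts of 4 and 3 are arbitrary choices, not the ones used in the assignment.

```python
import pandas as pd

# Hypothetical rates with one absurd value, mimicking a corrupted entry
rates = pd.Series([12.0, 35.5, 41.2, 48.0, 52.3, 58.1, 66.7, 83.3, 9.6e9])

# cut builds equal-width bins over the full range, so the outlier
# stretches the range and almost every value lands in the first bin
equal_width = pd.cut(rates, 4).value_counts().sort_index()

# qcut builds bins with (roughly) equal numbers of observations,
# so the outlier only affects the upper edge of the last bin
equal_count = pd.qcut(rates, 3).value_counts().sort_index()

print(equal_width)
print(equal_count)
```

With qcut the outlier still shows up as an implausible upper bound on the last interval, which is a convenient way to spot dummy values before replacing them with NaN.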
#######################################
2/The Code
import pandas
import numpy

# The file is semicolon-separated; sep must be passed by keyword in recent pandas
data = pandas.read_csv("Gapminder.csv", sep=";")
print("Total Rows: " + str(len(data)))
print("Total Columns " + str(len(data.columns)))

# convert_objects seems deprecated, use to_numeric instead
# data["employrate"] = data["employrate"].convert_objects(convert_numeric=True)
# NaN values are kept
# First just browse all column data, then look for some subgroup (sample)
# Data are sorted by value (index) and not by frequency of occurrence
##############POLITYSCORE##############
print("Count for polityscore: 2009 Democracy score (Polity),-10 (dictatorship) to 10 (democracy)")
data["polityscore"] = pandas.to_numeric(data["polityscore"], errors='coerce')
# Counts per score, NaN included, ordered by score
democracy_Score_Frequency = data["polityscore"].value_counts(sort=True, dropna=False).sort_index()
print(democracy_Score_Frequency)
print("percentage of polityscore: 2009 Democracy score (Polity),-10 to 10")
# Proportions per score, NaN excluded, ordered by frequency
democracy_Score_Percentage = data["polityscore"].value_counts(sort=True, normalize=True, dropna=True)
print(democracy_Score_Percentage)
##############EMPLOYRATE##############
print("Count for employrate: 2007 total employees age 15+ (% of population)")
data["employrate"] = pandas.to_numeric(data["employrate"], errors='coerce')
# Ten equal-width bins
data_employRate_discrete = pandas.cut(data["employrate"], 10)
employ_Rate_Frequency = data_employRate_discrete.value_counts(sort=True, dropna=False).sort_index()
print(employ_Rate_Frequency)
print("percentage of employ rate: 2007 total employees age 15+ (% of population)")
employ_Rate_Percentage = data_employRate_discrete.value_counts(sort=True, normalize=True, dropna=False).sort_index()
print(employ_Rate_Percentage)
##############FEMALEEMPLOYRATE##############
print("Count for female employrate: 2007 total female employees age 15+ (% of population)")
data["femaleemployrate"] = pandas.to_numeric(data["femaleemployrate"], errors='coerce')
# Ten quantile-based bins, because some values are dummy
data_femaleemployRate_discrete = pandas.qcut(data["femaleemployrate"], 10)
female_employ_Rate_Frequency = data_femaleemployRate_discrete.value_counts(sort=True, dropna=False).sort_index()
print(female_employ_Rate_Frequency)
print("percentage of female employrate: 2007 total female employees age 15+ (% of population)")
female_employ_Rate_Percentage = data_femaleemployRate_discrete.value_counts(sort=True, normalize=True, dropna=False).sort_index()
print(female_employ_Rate_Percentage)
#######polity score only for democracy (score>5) with employ rate>70
print ('Countries having polity score >5 and emply rate>70')
democracy_sample = data[(data['polityscore'] > 5.0) & (data['employrate'] > 70.0)]
sample = democracy_sample.copy()
sample_frequency = sample['polityscore'].value_counts(sort=True, dropna=False).sort_index()
print(sample)
print('printing the sample frequency')
print(sample_frequency)
#######################################
3/The Output
Total Rows: 213
Total Columns 4
Count for polityscore: 2009 Democracy score (Polity),-10 (dictatorship) to 10 (democracy)
-10.000000     2
-9.000000      3
-8.000000      2
-7.000000     12
-6.000000      3
-5.000000      2
-4.000000      5
-3.000000      6
-2.000000      4
-1.000000      4
 0.000000      6
 1.000000      3
 2.000000      3
 2.087848      1
 2.282655      1
 3.000000      2
 4.000000      4
 5.000000      6
 6.000000     10
 7.000000     13
 8.000000     18
 9.000000     14
 10.000000    33
 NaN          56
Name: polityscore, dtype: int64
percentage of polityscore: 2009 Democracy score (Polity),-10 to 10
 10.000000    0.210191
 8.000000     0.114650
 9.000000     0.089172
 7.000000     0.082803
-7.000000     0.076433
 6.000000     0.063694
 0.000000     0.038217
-3.000000     0.038217
 5.000000     0.038217
-4.000000     0.031847
-1.000000     0.025478
 4.000000     0.025478
-2.000000     0.025478
 1.000000     0.019108
-6.000000     0.019108
 2.000000     0.019108
-9.000000     0.019108
-5.000000     0.012739
 3.000000     0.012739
-8.000000     0.012739
-10.000000    0.012739
 2.087848     0.006369
 2.282655     0.006369
Name: polityscore, dtype: float64
Count for employrate: 2007 total employees age 15+ (% of population)
(4.657, 12.582]      5
(12.582, 20.429]     1
(20.429, 28.275]     1
(28.275, 36.121]     1
(36.121, 43.968]    12
(43.968, 51.814]    32
(51.814, 59.661]    50
(59.661, 67.507]    44
(67.507, 75.354]    18
(75.354, 83.2]      13
NaN                 36
Name: employrate, dtype: int64
percentage of employ rate: 2007 total employees age 15+ (% of population)
(4.657, 12.582]     0.023474
(12.582, 20.429]    0.004695
(20.429, 28.275]    0.004695
(28.275, 36.121]    0.004695
(36.121, 43.968]    0.056338
(43.968, 51.814]    0.150235
(51.814, 59.661]    0.234742
(59.661, 67.507]    0.206573
(67.507, 75.354]    0.084507
(75.354, 83.2]      0.061033
NaN                 0.169014
Name: employrate, dtype: float64
Count for female employrate: 2007 total female employees age 15+ (% of population)
(11.299000000000001, 30.34]    18
(30.34, 37.5]                  18
(37.5, 41.16]                  18
(41.16, 45.2]                  17
(45.2, 48.45]                  18
(48.45, 51.3]                  20
(51.3, 54.69]                  15
(54.69, 60.82]                 18
(60.82, 73.12]                 18
(73.12, 9666891666.667]        18
NaN                            35
Name: femaleemployrate, dtype: int64
percentage of female employrate: 2007 total female employees age 15+ (% of population)
(11.299000000000001, 30.34]    0.084507
(30.34, 37.5]                  0.084507
(37.5, 41.16]                  0.084507
(41.16, 45.2]                  0.079812
(45.2, 48.45]                  0.084507
(48.45, 51.3]                  0.093897
(51.3, 54.69]                  0.070423
(54.69, 60.82]                 0.084507
(60.82, 73.12]                 0.084507
(73.12, 9666891666.667]        0.084507
NaN                            0.164319
Name: femaleemployrate, dtype: float64
Countries having polity score >5 and emply rate>70
      country  femaleemployrate  employrate  polityscore
19      Benin         58.200001   71.599998          7.0
22    Bolivia         61.599998   70.400002          7.0
29    Burundi         83.300003   83.199997          6.0
97      Kenya         66.599998   73.199997          7.0
115    Malawi         69.000000   71.800003          6.0
150  Paraguay         65.300003   73.099998          8.0
printing the sample frequency
6.0    2
7.0    3
8.0    1
Name: polityscore, dtype: int64
ghorbelma · 6 years ago
Develop a research question
1/ Codebook chosen: Gapminder
Data set:
*polityscore (2009 Democracy score, Polity)
*femaleemployrate (Percentage of female population, age above 15, that has been employed during the given year)
*employrate (Percentage of total population, age above 15, that has been employed during the given year)
2/ Research Question: Is democracy associated with gender equality (in terms of employment)?
Hypothesis: My belief is that in democracies, the female employment rate and the total employment rate should be very close (women have the same access to the labor market).
3/ Literature review: Keywords: democracy, gender equality, women's labor participation
Many academic works have studied this field: democracy and women's labor participation, the effect of switching from dictatorship to democracy...
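One way to make this hypothesis testable is to compute the gap between the total and the female employment rate for each country and compare it across regime types. The sketch below is only an illustration under stated assumptions: the four rows are invented, the employment_gap and regime columns are hypothetical names, and the polity score cut-off of 5 for calling a country a democracy is an assumption rather than part of the codebook.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; these are not Gapminder values
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "polityscore": [10, 8, -7, -9],
    "employrate": [60.0, 58.0, 55.0, 50.0],
    "femaleemployrate": [55.0, 52.0, 30.0, 20.0],
})

# Gap, in percentage points, between total and female employment rates
toy["employment_gap"] = toy["employrate"] - toy["femaleemployrate"]

# Assumed cut-off: a polity score above 5 counts as a democracy
toy["regime"] = np.where(toy["polityscore"] > 5, "democracy", "other")

# Under the hypothesis, the mean gap should be smaller for democracies
mean_gap = toy.groupby("regime")["employment_gap"].mean()
print(mean_gap)
```

On real data the same gap column could also be correlated directly with polityscore, which avoids having to pick a cut-off.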
Among the studies:
* UNI ScholarWorks, University of Northern Iowa: Women's labor force participation in Spain: An analysis from dictatorship to democracy
* Democracy and Gender Equality (Caroline Beer)
4/Findings:
Existing research provides conflicting evidence about the relationship between democracy and gender equality. Various studies have found that the level of democracy, as measured by Freedom House, is not significant in determining women's participation in the labor market, whereas others have provided evidence for the importance of long-term democracy in women's participation in the labor market.
References
Abrams BA, Settle RF. Women's suffrage and the growth of the welfare state. Public Choice. 1999;100(3-4):289–300.
Alvarez SE. Engendering democracy in Brazil: women's movements in politics. Princeton: Princeton University Press; 1990.
Barro RJ, Lee J-W. International data on educational attainment: updates and implications. CID Working Paper No. 042. Harvard University: Cambridge, MA; 2000.
Bollen KA. Cross-national indicators of liberal democracy, 1950-1990. Codebook. University of North Carolina: Chapel Hill, NC; 1998.
Bouvard MG. Revolutionizing motherhood: the mothers of the Plaza de Mayo. Wilmington: Scholarly Resources; 1994.
Brown DS. Democracy and gender inequality in education: a cross-national examination. Br J Polit Sci. 2004;34(1):137–52.
ghorbelma · 6 years ago
Starting Data Analysis and Interpretation Specialization