ASSIGNMENT WEEK 4
Python Program
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

# Load the Gapminder data and normalise the column names to lower case
data = pd.read_csv('C://Users/SHYAMKRISHNAN SUDHIR/Desktop/gapminder.csv', low_memory=False)
data.columns = data.columns.str.lower()
pd.set_option('display.float_format', lambda x: '%f' % x)

# Convert the variables of interest to numeric; unparseable entries become NaN
# (pd.to_numeric replaces convert_objects, which was removed from pandas)
for col in ['suicideper100th', 'breastcancerper100th', 'hivrate', 'employrate']:
    data[col] = pd.to_numeric(data[col], errors='coerce')
print("Statistics for a Suicide Rate")
print(data['suicideper100th'].describe())

sub = data[(data['suicideper100th'] > 12)]
sub_copy = sub.copy()
# Histograms of each variable for the high-suicide-rate subset
# (histplot replaces the deprecated distplot; it draws no KDE curve by default)
plt.figure(1)
sb.histplot(sub_copy["breastcancerper100th"].dropna())
plt.xlabel('Breast Cancer Rate')
plt.ylabel('Frequency')
plt.title('Breast Cancer Rate for People with a High Suicide Rate')

plt.figure(2)
sb.histplot(sub_copy["hivrate"].dropna())
plt.xlabel('HIV Rate')
plt.ylabel('Frequency')
plt.title('HIV Rate for People with a High Suicide Rate')

plt.figure(3)
sb.histplot(sub_copy["employrate"].dropna())
plt.xlabel('Employment Rate')
plt.ylabel('Frequency')
plt.title('Employment Rate for People with a High Suicide Rate')

# Scatter plot of breast cancer rate against HIV rate, without a fitted line
plt.figure(4)
sb.regplot(x="hivrate", y="breastcancerper100th", fit_reg=False, data=sub_copy)
plt.xlabel('HIV Rate')
plt.ylabel('Breast Cancer Rate')
plt.title('Breast Cancer Rate vs. HIV Rate for People with a High Suicide Rate')

plt.show()
Output
The breast cancer rate histogram is unimodal, with its highest peak at a rate of 0-20 per 100,000. It appears skewed to the right, as frequencies are higher in the lower categories than in the higher ones.
The HIV rate histogram is unimodal, with its highest peak at an HIV rate of 0-1%. It also appears skewed to the right, as frequencies are higher in the lower categories than in the higher ones.
The employment rate histogram is unimodal, with its highest peak around the median, at an employment rate of 55-60%. It appears roughly symmetric, as frequencies fall off in both the lower and the higher categories.
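The scatter plot shows the breast cancer rate against the HIV rate for countries with a high suicide rate. It suggests that countries with a high breast cancer rate tend to have a low HIV rate. Because the data are country-level, it cannot show whether individual people with breast cancer are infected with HIV.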
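The "skewed to the right" and "symmetric" readings above are made by eye. One way to cross-check them is the sample skewness, which is positive for a long right tail and near zero for a symmetric shape. The sketch below is only an illustration: the helper name describe_shape, the 0.5 cutoff and the two value lists are assumptions made for this example and are not taken from gapminder.csv.

```python
import pandas as pd

def describe_shape(values, tol=0.5):
    """Label a distribution from its sample skewness.

    tol is an arbitrary cutoff chosen for this sketch, not a standard value.
    """
    skew = pd.Series(values).dropna().skew()
    if skew > tol:
        return "right-skewed"
    if skew < -tol:
        return "left-skewed"
    return "roughly symmetric"

# Hypothetical values, NOT taken from gapminder.csv
right_tail = [1, 1, 2, 2, 3, 3, 4, 20, 35]          # a few large values form a right tail
balanced = [50, 53, 55, 57, 58, 59, 61, 63, 66]      # spread evenly around 58

print(describe_shape(right_tail))
print(describe_shape(balanced))
```

On the real data the same call could be applied to a column such as sub_copy["hivrate"], assuming the subset has been built as in the program above.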
ASSIGNMENT WEEK 3
Python Program
import pandas as pd
# Load the Gapminder data and normalise the column names to lower case
df = pd.read_csv('C://Users/SHYAMKRISHNAN SUDHIR/Desktop/gapminder.csv', low_memory=False)
df.columns = df.columns.str.lower()
pd.set_option('display.float_format', lambda x: '%f' % x)

# Convert the variables of interest to numeric; unparseable entries become NaN
# (pd.to_numeric replaces convert_objects, which was removed from pandas)
for col in ['suicideper100th', 'breastcancerper100th', 'hivrate', 'employrate']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
print("Statistics for a Suicide Rate")
print(df['suicideper100th'].describe())

sub1 = df[(df['suicideper100th'] > 12)]
sub2 = sub1.copy()
bc_max = sub2['breastcancerper100th'].max()
sub2['bcgroup4'] = pd.cut(sub2.breastcancerper100th, [0*bc_max, 0.25*bc_max, 0.5*bc_max, 0.75*bc_max, 1*bc_max])
bc = sub2['bcgroup4'].value_counts(sort=False, dropna=False)
pbc = sub2['bcgroup4'].value_counts(sort=False, dropna=False, normalize=True)*100

bc1 = []
pbc1 = []
cf = 0
cp = 0
for freq in bc:
    cf = cf + freq
    bc1.append(cf)
    pf = cf*100/len(sub2)
    pbc1.append(pf)

print('Number of Breast Cancer Cases with a High Suicide Rate')
fmt1 = '%10s %9s %9s %12s %13s'
fmt2 = '%9s %9.d %10.2f %9.d %13.2f'
print(fmt1 % ('# of Cases', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(bc.keys(), bc, pbc, bc1, pbc1)):
    print(fmt2 % (key, var1, var2, var3, var4))
sub2['hcgroup4'] = pd.qcut(sub2.hivrate, 4, labels=["0% tile", "25% tile", "50% tile", "75% tile"])
hc = sub2['hcgroup4'].value_counts(sort=False, dropna=False)
phc = sub2['hcgroup4'].value_counts(sort=False, dropna=False, normalize=True)*100

hc1 = []
phc1 = []
cf = 0
cp = 0
for freq in hc:
    cf = cf + freq
    hc1.append(cf)
    pf = cf*100/len(sub2)
    phc1.append(pf)

print('HIV Rate with a High Suicide Rate')
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(hc.keys(), hc, phc, hc1, phc1)):
    print(fmt2 % (key, var1, var2, var3, var4))
def ecgroup4(row):
    if row['employrate'] >= 32 and row['employrate'] < 51:
        return 1
    elif row['employrate'] >= 51 and row['employrate'] < 59:
        return 2
    elif row['employrate'] >= 59 and row['employrate'] < 65:
        return 3
    elif row['employrate'] >= 65 and row['employrate'] < 84:
        return 4
    else:
        return 5

sub2['ecgroup4'] = sub2.apply(lambda row: ecgroup4(row), axis=1)
ec = sub2['ecgroup4'].value_counts(sort=False, dropna=False)
pec = sub2['ecgroup4'].value_counts(sort=False, dropna=False, normalize=True)*100

ec1 = []
pec1 = []
cf = 0
cp = 0
for freq in ec:
    cf = cf + freq
    ec1.append(cf)
    pf = cf*100/len(sub2)
    pec1.append(pf)

print('Employment Rate with a High Suicide Rate')
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(ec.keys(), ec, pec, ec1, pec1)):
    print(fmt2 % (key, var1, var2, var3, var4))
Summary of Statistics for a Suicide Rate
count   191.000000
mean      9.640839
std       6.300178
min       0.201449
25%       4.988449
50%       8.262893
75%      12.328551
max      35.752872
Number of Breast Cancer Cases with a High Suicide Rate
No of Cases    Freq.   Percent   Cum. Freq.   Cum. Percent
(1, 23]           18     33.96           18          33.96
(23, 46]          15     28.30           33          62.26
(46, 69]          10     18.87           43          81.13
(69, 92]           8     15.09           51          96.23
nan                2      3.77           53         100.00

HIV Rate with a High Suicide Rate
Rate           Freq.   Percent   Cum. Freq.   Cum. Percent
0% tile           18     33.96           18          33.96
25% tile           8     15.09           26          49.06
50% tile          11     20.75           37          69.81
75% tile          12     22.64           49          92.45
nan                4      7.55           53         100.00

Employment Rate with a High Suicide Rate
Rate           Freq.   Percent   Cum. Freq.   Cum. Percent
1                 10     18.87           10          18.87
2                 24     45.28           34          64.15
3                  5      9.43           39          73.58
4                 13     24.53           52          98.11
5                  1      1.89           53         100.00
Frequency Distributions Conclusions
bcgroup4, hcgroup4 and ecgroup4 are built with three different grouping methods: equal-width bins (pd.cut), quartiles (pd.qcut) and a custom function with hand-picked cut points. Each frequency table also includes a count for missing data.
1) The breast cancer rate is split into 4 groups (1-23, 24-46, 47-69, 70-92). Countries with a high suicide rate fall mostly in the lower breast cancer groups.
2) The HIV rate is split into 4 quartile groups. The lowest group holds the most countries with a high suicide rate (18 of 53).
3) The employment rate is split into 5 categorical groups (1: 32-50, 2: 51-58, 3: 59-64, 4: 65-83, 5: NaN). The largest group of countries with a high suicide rate (24 of 53) has an employment rate between 51% and 58%.
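The three grouping methods used for bcgroup4, hcgroup4 and ecgroup4 can be compared side by side on a small series. This is a minimal sketch, not the assignment's code: the values in rates, the bin edges, the labels Q1 to Q4 and the helper by_hand are all assumptions made up for the illustration.

```python
import pandas as pd

# Hypothetical rates, NOT taken from gapminder.csv
rates = pd.Series([2.0, 5.0, 9.0, 14.0, 20.0, 27.0, 35.0, 80.0])

# 1) pd.cut: four equal-width bins between 0 and the maximum (80)
equal_width = pd.cut(rates, [0, 20, 40, 60, 80])

# 2) pd.qcut: four bins that each hold about the same number of rows
quartiles = pd.qcut(rates, 4, labels=["Q1", "Q2", "Q3", "Q4"])

# 3) custom function: hand-picked cut points returning a group code
def by_hand(rate):
    if rate < 10:
        return 1
    elif rate < 30:
        return 2
    else:
        return 3

manual = rates.apply(by_hand)

print(equal_width.value_counts(sort=False))
print(quartiles.value_counts(sort=False))
print(manual.value_counts().sort_index())
```

Equal-width bins can leave some groups nearly empty when the data have a long tail, while quartile bins keep the group sizes balanced at the cost of uneven bin widths. A custom function gives full control over the cut points.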
ASSIGNMENT WEEK 2
Python Program
import pandas as pd
import numpy as np
# Load the Gapminder data and normalise the column names to lower case
df = pd.read_csv('C://Users/SHYAMKRISHNAN SUDHIR/Desktop/gapminder.csv', low_memory=False)
df.columns = df.columns.str.lower()
pd.set_option('display.float_format', lambda x: '%f' % x)

# Convert the variables of interest to numeric; unparseable entries become NaN
# (pd.to_numeric replaces convert_objects, which was removed from pandas)
for col in ['suicideper100th', 'breastcancerper100th', 'hivrate', 'employrate']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
print("Summary of the Statistics for a Suicide Rate")
print(df['suicideper100th'].describe())

print("Higher suicide rate statistics")
sub1 = df[(df['suicideper100th'] > 12)]
sub2 = sub1.copy()
print("Frequency for number of breast cancer cases with a high suicide rate")
fbc = sub2['breastcancerper100th'].value_counts(sort=False, bins=10)

# The subset is named sub2 (sub_copy is not defined in this program)
print("Percentage for number of breast cancer cases with a high suicide rate")
pbc = sub2['breastcancerper100th'].value_counts(sort=False, bins=10, normalize=True)*100

# Running totals for the cumulative columns
fbc1 = []
pbc1 = []
cf = 0
cp = 0
for freq in fbc:
    cf = cf + freq
    fbc1.append(cf)
    pf = cf*100/len(sub2)
    pbc1.append(pf)

print('Number of Breast Cancer Cases with a High Suicide Rate')
fmt1 = '%s %7s %9s %12s %12s'
fmt2 = '%5.2f %10.d %10.2f %10.d %12.2f'
print(fmt1 % ('# of Cases', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(fbc.keys(), fbc, pbc, fbc1, pbc1)):
    # Current pandas returns Interval keys for binned counts; print the left edge
    print(fmt2 % (getattr(key, 'left', key), var1, var2, var3, var4))
fmt3 = '%5s %10s %10s %10s %12s'
print(fmt3 % ('NA', '2', '3.77', '53', '100.00'))
print("Frequency for HIV rate with a high suicide rate")
fhc = sub2['hivrate'].value_counts(sort=False, bins=7)

print("Percentage for HIV rate with a high suicide rate")
phc = sub2['hivrate'].value_counts(sort=False, bins=7, normalize=True)*100

# Running totals over the HIV counts (fhc), not the breast cancer counts
fhc1 = []
phc1 = []
cf = 0
cp = 0
for freq in fhc:
    cf = cf + freq
    fhc1.append(cf)
    pf = cf*100/len(sub2)
    phc1.append(pf)

print('HIV Rate with a High Suicide Rate')
fmt1 = '%5s %12s %9s %12s %12s'
fmt2 = '%5.2f %10.d %10.2f %10.d %12.2f'
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(fhc.keys(), fhc, phc, fhc1, phc1)):
    # Current pandas returns Interval keys for binned counts; print the left edge
    print(fmt2 % (getattr(key, 'left', key), var1, var2, var3, var4))
fmt3 = '%5s %10s %10s %10s %12s'
print(fmt3 % ('NA', '2', '3.77', '53', '100.00'))
print("Frequency for employment rate with a high suicide rate")
fec = sub2['employrate'].value_counts(sort=False, bins=10)

print("Percentage for employment rate with a high suicide rate")
pec = sub2['employrate'].value_counts(sort=False, bins=10, normalize=True)*100

# Running totals over the employment counts (fec), not the breast cancer counts
fec1 = []
pec1 = []
cf = 0
cp = 0
for freq in fec:
    cf = cf + freq
    fec1.append(cf)
    pf = cf*100/len(sub2)
    pec1.append(pf)

print("Employment Rate with a High Suicide Rate")
fmt1 = '%5s %12s %9s %12s %12s'
fmt2 = '%5.2f %10.d %10.2f %10.d %12.2f'
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for i, (key, var1, var2, var3, var4) in enumerate(zip(fec.keys(), fec, pec, fec1, pec1)):
    # Current pandas returns Interval keys for binned counts; print the left edge
    print(fmt2 % (getattr(key, 'left', key), var1, var2, var3, var4))
fmt3 = '%5s %10s %10s %10s %12s'
print(fmt3 % ('NA', '2', '3.77', '53', '100.00'))
Summary of Statistics for a Suicide Rate
count   191.000000
mean      9.640839
std       6.300178
min       0.201449
25%       4.988449
50%       8.262893
75%      12.328551
max      35.752872
Number of Breast Cancer Cases with a High Suicide Rate
No.of Cases    Freq.   Percent   Cum. Freq.   Cum. Percent
 6.51              6     11.32            6          11.32
15.14             14     26.42           20          37.74
23.68              5      9.43           25          47.17
32.22              7     13.21           32          60.38
40.76              2      3.77           34          64.15
49.30              4      7.55           38          71.70
57.84              5      9.43           43          81.13
66.38              1      1.89           44          83.02
74.92              3      5.66           47          88.68
83.46              4      7.55           51          96.23
NA                 2      3.77           53         100.00

HIV Rate with a High Suicide Rate
Rate           Freq.   Percent   Cum. Freq.   Cum. Percent
 0.03             39     73.58            6          11.32
 2.64              4      7.55           20          37.74
 5.23              2      3.77           25          47.17
 7.81              0      0.00           32          60.38
10.40              0      0.00           34          64.15
12.98              2      3.77           38          71.70
15.56              1      1.89           43          81.13
18.15              0      0.00           44          83.02
20.73              0      0.00           47          88.68
23.32              1      1.89           51          96.23
NA                 2      3.77           53         100.00

Employment Rate with a High Suicide Rate
Rate           Freq.   Percent   Cum. Freq.   Cum. Percent
37.35              2      3.77            6          11.32
41.98              2      3.77           20          37.74
46.56              7     13.21           25          47.17
51.14              8     15.09           32          60.38
55.72             16     30.19           34          64.15
60.30              4      7.55           38          71.70
64.88              5      9.43           43          81.13
69.46              2      3.77           44          83.02
74.04              3      5.66           47          88.68
78.62              3      5.66           51          96.23
NA                 2      3.77           53         100.00
Frequency Distributions Conclusions
Countries with a high suicide rate mostly have a lower number of breast cancer cases.
Countries with a high suicide rate mostly have a low HIV rate.
Countries with a high suicide rate are most often found at an employment rate of about 55%.
ASSIGNMENT WEEK 1
Data Set: gapminder.csv
Research Question: Is a country's employment rate associated with its suicide rate?
Columns required for the CodeBook:
1. Employment rate per country
2. Suicide rate
Literature Review:
According to the following sources,
https://jech.bmj.com/content/57/8/594
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7064437/
as the employment rate increases, the number of suicides in a country tends to decrease. This will be examined through analysis of the gapminder.csv data set.
Hypothesis: the higher the employment rate, the lower the suicide rate.
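One simple way to probe this hypothesis is the Pearson correlation between the two columns: a clearly negative coefficient would be consistent with it. The sketch below only shows the mechanics. The numbers in toy are invented for the illustration (they are constructed to move in opposite directions) and say nothing about what gapminder.csv actually shows.

```python
import pandas as pd

# Hypothetical country-level values, NOT taken from gapminder.csv
toy = pd.DataFrame({
    "employrate": [45.0, 50.0, 55.0, 60.0, 65.0, 70.0],
    "suicideper100th": [14.0, 12.5, 11.0, 9.0, 8.5, 6.0],
})

# Pearson's r lies between -1 and 1; a negative value means the two
# rates tend to move in opposite directions
r = toy["employrate"].corr(toy["suicideper100th"])
print(f"r = {r:.2f}")
```

On the real data the same call could be run on the employrate and suicideper100th columns after converting them to numeric. A correlation describes an association between countries and does not by itself show cause.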