shyamsudhir225 - Tumblr blog

shyamsudhir225 · 5 years ago


ASSIGNMENT WEEK 4

Python Program

import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

# Load the gapminder data and lowercase the column names
data = pd.read_csv('C://Users/SHYAMKRISHNAN SUDHIR/Desktop/gapminder.csv', low_memory=False)
data.columns = data.columns.str.lower()
pd.set_option('display.float_format', lambda x: '%f' % x)

# Convert the variables of interest to numeric; non-numeric entries such as blanks become NaN
for col in ['suicideper100th', 'breastcancerper100th', 'hivrate', 'employrate']:
    data[col] = pd.to_numeric(data[col], errors='coerce')

print("Statistics for a Suicide Rate")
print(data['suicideper100th'].describe())

# Keep only the rows with a high suicide rate (more than 12 per 100,000)
sub = data[(data['suicideper100th'] > 12)]
sub_copy = sub.copy()

# Histograms of each variable within the high-suicide-rate subset.
# distplot is deprecated in seaborn >= 0.11; sb.histplot is its replacement.
plt.figure(1)
sb.distplot(sub_copy["breastcancerper100th"].dropna(), kde=False)
plt.xlabel('Breast Cancer Rate')
plt.ylabel('Frequency')
plt.title('Breast Cancer Rate for People with a High Suicide Rate')

plt.figure(2)
sb.distplot(sub_copy["hivrate"].dropna(), kde=False)
plt.xlabel('HIV Rate')
plt.ylabel('Frequency')
plt.title('HIV Rate for People with a High Suicide Rate')

plt.figure(3)
sb.distplot(sub_copy["employrate"].dropna(), kde=False)
plt.xlabel('Employment Rate')
plt.ylabel('Frequency')
plt.title('Employment Rate for People with a High Suicide Rate')

# Scatterplot of breast cancer rate against HIV rate, without a fitted line
plt.figure(4)
sb.regplot(x="hivrate", y="breastcancerper100th", fit_reg=False, data=sub_copy)
plt.xlabel('HIV Rate')
plt.ylabel('Breast Cancer Rate')
plt.title('Breast Cancer Rate vs. HIV Rate for People with a High Suicide Rate')

plt.show()

Output

This histogram is unimodal, with its highest peak at a breast cancer rate of 0-20 cases per 100,000. The distribution is skewed to the right: frequencies are higher in the lower categories than in the higher ones.

This histogram is unimodal, with its highest peak at an HIV rate of 0-1%. It is also skewed to the right, with higher frequencies in the lower categories than in the higher ones.

This histogram is unimodal, with its highest peak around the median, at an employment rate of 55-60%. The distribution looks roughly symmetric, with frequencies falling off toward both the lower and the higher categories.
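The shape descriptions above are read off the histograms by eye. One rough numeric cross-check is the sample skewness from pandas `Series.skew()`: a clearly positive value points to a right skew, and a value near zero to a roughly symmetric distribution. The sketch below uses two made-up series (`right_skewed` and `symmetric` are hypothetical names and values, not the gapminder data).

```python
import pandas as pd

# Hypothetical values, NOT taken from gapminder.csv.
right_skewed = pd.Series([0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 1.0, 3.0, 12.0, 25.0])
symmetric = pd.Series([40, 45, 50, 55, 55, 60, 60, 65, 70, 75])

# skew() returns the sample skewness; missing values are skipped by default.
skew_right = right_skewed.skew()
skew_sym = symmetric.skew()

print("right-skewed sample:", round(skew_right, 2))
print("symmetric sample:   ", round(skew_sym, 2))
```

On the real data the same call would be applied to each column of the high-suicide-rate subset, for example `sub_copy['hivrate'].skew()`.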

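This scatterplot shows the breast cancer rate against the HIV rate for countries with a high suicide rate. It suggests that countries with higher breast cancer rates tend to have low HIV rates. Because each point is a country, the plot says nothing about whether individual breast cancer patients are infected with HIV.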
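A scatterplot on its own does not put a number on the association. One possible follow-up, sketched here under the assumption that a linear measure is good enough, is the Pearson correlation from `Series.corr()`. The frame `toy` and its values are hypothetical; each row stands for a country.

```python
import pandas as pd

# Hypothetical country-level rows, NOT taken from gapminder.csv.
toy = pd.DataFrame({
    "hivrate": [0.1, 0.1, 0.2, 0.4, 1.0, 5.0, 12.0, None],
    "breastcancerper100th": [80.0, 60.0, 35.0, 50.0, 20.0, 25.0, 15.0, 30.0],
})

# corr() uses Pearson by default and ignores rows where either value is missing.
r = toy["hivrate"].corr(toy["breastcancerper100th"])
print("Pearson r = %.2f" % r)
```

A negative value means that higher HIV rates go with lower breast cancer rates across countries; it still describes countries, not individual people.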


shyamsudhir225 · 5 years ago


ASSIGNMENT 3

Python Program

import pandas as pd

# Load the gapminder data and lowercase the column names
df = pd.read_csv('C://Users/SHYAMKRISHNAN SUDHIR/Desktop/gapminder.csv', low_memory=False)
df.columns = df.columns.str.lower()
pd.set_option('display.float_format', lambda x: '%f' % x)

# Convert the variables of interest to numeric; non-numeric entries such as blanks become NaN
for col in ['suicideper100th', 'breastcancerper100th', 'hivrate', 'employrate']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print("Statistics for a Suicide Rate")
print(df['suicideper100th'].describe())

# Keep only the rows with a high suicide rate (more than 12 per 100,000)
sub1 = df[(df['suicideper100th'] > 12)]
sub2 = sub1.copy()

# Breast cancer rate: 4 equal-width bins from 0 up to the maximum
bc_max = sub2['breastcancerper100th'].max()
sub2['bcgroup4'] = pd.cut(sub2.breastcancerper100th, [0*bc_max, 0.25*bc_max, 0.5*bc_max, 0.75*bc_max, 1*bc_max])
bc = sub2['bcgroup4'].value_counts(sort=False, dropna=False)
pbc = sub2['bcgroup4'].value_counts(sort=False, dropna=False, normalize=True) * 100

# Cumulative frequency and cumulative percent
bc1 = []
pbc1 = []
cf = 0
for freq in bc:
    cf = cf + freq
    bc1.append(cf)
    pf = cf * 100 / len(sub2)
    pbc1.append(pf)

print('Number of Breast Cancer Cases with a High Suicide Rate')
fmt1 = '%10s %9s %9s %12s %13s'
fmt2 = '%9s %9.d %10.2f %9.d %13.2f'
print(fmt1 % ('# of Cases', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for key, var1, var2, var3, var4 in zip(bc.keys(), bc, pbc, bc1, pbc1):
    print(fmt2 % (key, var1, var2, var3, var4))

# HIV rate: 4 quantile-based groups
sub2['hcgroup4'] = pd.qcut(sub2.hivrate, 4, labels=["0% tile", "25% tile", "50% tile", "75% tile"])
hc = sub2['hcgroup4'].value_counts(sort=False, dropna=False)
phc = sub2['hcgroup4'].value_counts(sort=False, dropna=False, normalize=True) * 100

hc1 = []
phc1 = []
cf = 0
for freq in hc:
    cf = cf + freq
    hc1.append(cf)
    pf = cf * 100 / len(sub2)
    phc1.append(pf)

print('HIV Rate with a High Suicide Rate')
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for key, var1, var2, var3, var4 in zip(hc.keys(), hc, phc, hc1, phc1):
    print(fmt2 % (key, var1, var2, var3, var4))

# Employment rate: hand-written categories; missing values fall into group 5
def ecgroup4(row):
    if row['employrate'] >= 32 and row['employrate'] < 51:
        return 1
    elif row['employrate'] >= 51 and row['employrate'] < 59:
        return 2
    elif row['employrate'] >= 59 and row['employrate'] < 65:
        return 3
    elif row['employrate'] >= 65 and row['employrate'] < 84:
        return 4
    else:
        return 5

sub2['ecgroup4'] = sub2.apply(ecgroup4, axis=1)
# sort_index() keeps the groups in the order 1 to 5
ec = sub2['ecgroup4'].value_counts(sort=False, dropna=False).sort_index()
pec = sub2['ecgroup4'].value_counts(sort=False, dropna=False, normalize=True).sort_index() * 100

ec1 = []
pec1 = []
cf = 0
for freq in ec:
    cf = cf + freq
    ec1.append(cf)
    pf = cf * 100 / len(sub2)
    pec1.append(pf)

print('Employment Rate with a High Suicide Rate')
print(fmt1 % ('Rate', 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
for key, var1, var2, var3, var4 in zip(ec.keys(), ec, pec, ec1, pec1):
    print(fmt2 % (key, var1, var2, var3, var4))

Summary of Statistics for a Suicide Rate
count   191.000000
mean      9.640839
std       6.300178
min       0.201449
25%       4.988449
50%       8.262893
75%      12.328551
max      35.752872
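The program keeps rows with `suicideper100th` above 12, which sits close to the 75th percentile (12.33) in the summary above. As a sketch of how such a cutoff could be taken from the data instead of typed in, `Series.quantile()` can supply it. The series `rates` below is hypothetical and only illustrates the call.

```python
import pandas as pd

# Hypothetical suicide rates per 100,000, NOT the gapminder values.
rates = pd.Series([2.0, 4.5, 5.0, 7.0, 8.3, 9.1, 11.0, 12.5, 14.0, 20.0, 35.0, None])

# quantile() ignores missing values; 0.75 gives the upper-quartile cutoff.
cutoff = rates.quantile(0.75)

# Rows above the cutoff form the "high rate" subset; NaN never passes the comparison.
high = rates[rates > cutoff]

print("cutoff:", cutoff)
print("rows above cutoff:", len(high))
```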

Number of Breast Cancer Cases with a High Suicide Rate
No of Cases   Freq.   Percent   Cum. Freq.   Cum. Percent
(1, 23]          18     33.96           18          33.96
(23, 46]         15     28.30           33          62.26
(46, 69]         10     18.87           43          81.13
(69, 92]          8     15.09           51          96.23
nan               2      3.77           53         100.00

HIV Rate with a High Suicide Rate
Rate       Freq.   Percent   Cum. Freq.   Cum. Percent
0% tile       18     33.96           18          33.96
25% tile       8     15.09           26          49.06
50% tile      11     20.75           37          69.81
75% tile      12     22.64           49          92.45
nan            4      7.55           53         100.00

Employment Rate with a High Suicide Rate
Rate   Freq.   Percent   Cum. Freq.   Cum. Percent
1         10     18.87           10          18.87
2         24     45.28           34          64.15
3          5      9.43           39          73.58
4         13     24.53           52          98.11
5          1      1.89           53         100.00

Frequency Distributions Conclusions

bcgroup4, hcgroup4 and ecgroup4 are grouped using three different methods: equal-width bins (pd.cut), quartile groups (pd.qcut) and a custom function. Each frequency table also includes the count of missing data.

1) For the breast cancer rate, the data are grouped into 4 equal-width bins (1-23, 24-46, 47-69, 70-92). Countries with a high suicide rate are concentrated in the lower bins: 62% have a breast cancer rate of 46 or below.
2) For the HIV rate, I grouped the data into 4 quartile groups. The lowest group is the largest (18 of 53 countries, 34%), so a high suicide rate mostly goes with a low HIV rate.
3) For the employment rate, the data are grouped into 5 categories (1: 32-50, 2: 51-58, 3: 59-64, 4: 65-83, 5: NaN). The most common category among countries with a high suicide rate is 51%-58% (24 of 53 countries, 45%).
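The three grouping methods behave differently on the same numbers, which is easiest to see side by side. The sketch below is an illustration on made-up values, not the assignment's data: `values`, `by_rule` and the bin edges are hypothetical, while `pd.cut`, `pd.qcut` and `value_counts(dropna=False)` are the same pandas calls the program uses.

```python
import pandas as pd

# Hypothetical rates, NOT taken from gapminder.csv; None marks a missing value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 50.0, 100.0, None])

# 1) Equal-width bins: every interval is equally wide, counts can be very uneven.
by_width = pd.cut(values, bins=[0, 25, 50, 75, 100])

# 2) Quantile bins: roughly equal counts per group, interval widths differ.
by_quantile = pd.qcut(values, 4, labels=["Q1", "Q2", "Q3", "Q4"])

# 3) Hand-written rule (hypothetical): missing values get their own code, 0.
def by_rule(x):
    if pd.isna(x):
        return 0
    return 1 if x < 5 else 2

by_hand = values.apply(by_rule)

# dropna=False keeps the missing value as its own row in the frequency table.
width_counts = by_width.value_counts(sort=False, dropna=False)
quantile_counts = by_quantile.value_counts(sort=False, dropna=False)
hand_counts = by_hand.value_counts(sort=False)

print(width_counts)
print(quantile_counts)
print(hand_counts)
```

With skewed values like these, the equal-width bins put most rows in the first interval, while the quantile groups stay balanced. That is why the two tables above look so different even though both use four groups.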


shyamsudhir225 · 5 years ago


ASSIGNMENT WEEK 2

Python Program

import pandas as pd
import numpy as np

# Load the gapminder data and lowercase the column names
df = pd.read_csv('C://Users/SHYAMKRISHNAN SUDHIR/Desktop/gapminder.csv', low_memory=False)
df.columns = df.columns.str.lower()
pd.set_option('display.float_format', lambda x: '%f' % x)

# Convert the variables of interest to numeric; non-numeric entries such as blanks become NaN
for col in ['suicideper100th', 'breastcancerper100th', 'hivrate', 'employrate']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print("Summary of the Statistics for a Suicide Rate")
print(df['suicideper100th'].describe())

# Keep only the rows with a high suicide rate (more than 12 per 100,000)
print("Higher suicide rate statistics")
sub1 = df[(df['suicideper100th'] > 12)]
sub2 = sub1.copy()

fmt1 = '%5s %12s %9s %12s %12s'
fmt2 = '%5.2f %10d %10.2f %10d %12.2f'
fmt3 = '%5s %10d %10.2f %10d %12.2f'

def print_freq_table(title, label, series, bins=10):
    """Print frequency, percent and cumulative columns for one binned variable."""
    total = len(series)
    freq = series.value_counts(sort=False, bins=bins)
    pct = freq * 100 / total
    print(title)
    print(fmt1 % (label, 'Freq.', 'Percent', 'Cum. Freq.', 'Cum. Percent'))
    cf = 0
    for key, f, p in zip(freq.keys(), freq, pct):
        cf = cf + f
        # key is a pandas Interval; print its left edge
        print(fmt2 % (key.left, f, p, cf, cf * 100 / total))
    # Missing values are not binned, so count them for the final row
    missing = int(series.isnull().sum())
    print(fmt3 % ('NA', missing, missing * 100 / total, total, 100.0))

print_freq_table('Number of Breast Cancer Cases with a High Suicide Rate', '# of Cases', sub2['breastcancerper100th'])
print_freq_table('HIV Rate with a High Suicide Rate', 'Rate', sub2['hivrate'])
print_freq_table('Employment Rate with a High Suicide Rate', 'Rate', sub2['employrate'])

Summary of Statistics for a Suicide Rate
count   191.000000
mean      9.640839
std       6.300178
min       0.201449
25%       4.988449
50%       8.262893
75%      12.328551
max      35.752872

Number of Breast Cancer Cases with a High Suicide Rate
No. of Cases   Freq.   Percent   Cum. Freq.   Cum. Percent
 6.51              6     11.32            6          11.32
15.14             14     26.42           20          37.74
23.68              5      9.43           25          47.17
32.22              7     13.21           32          60.38
40.76              2      3.77           34          64.15
49.30              4      7.55           38          71.70
57.84              5      9.43           43          81.13
66.38              1      1.89           44          83.02
74.92              3      5.66           47          88.68
83.46              4      7.55           51          96.23
NA                 2      3.77           53         100.00

HIV Rate with a High Suicide Rate
Rate    Freq.   Percent   Cum. Freq.   Cum. Percent
 0.03      39     73.58           39          73.58
 2.64       4      7.55           43          81.13
 5.23       2      3.77           45          84.91
 7.81       0      0.00           45          84.91
10.40       0      0.00           45          84.91
12.98       2      3.77           47          88.68
15.56       1      1.89           48          90.57
18.15       0      0.00           48          90.57
20.73       0      0.00           48          90.57
23.32       1      1.89           49          92.45
NA          4      7.55           53         100.00

Employment Rate with a High Suicide Rate
Rate    Freq.   Percent   Cum. Freq.   Cum. Percent
37.35       2      3.77            2           3.77
41.98       2      3.77            4           7.55
46.56       7     13.21           11          20.75
51.14       8     15.09           19          35.85
55.72      16     30.19           35          66.04
60.30       4      7.55           39          73.58
64.88       5      9.43           44          83.02
69.46       2      3.77           46          86.79
74.04       3      5.66           49          92.45
78.62       3      5.66           52          98.11
NA          1      1.89           53         100.00

Frequency Distributions Conclusions

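Among countries with a high suicide rate, lower breast cancer rates are more common: about 60% of them fall in the four lowest bins.

Most countries with a high suicide rate have a low HIV rate: 39 of 53 (74%) fall in the lowest bin.

The most common employment rate among countries with a high suicide rate is in the bin starting at about 55.7% (16 of 53 countries, 30%).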
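The cumulative columns in the tables above come from running totals. As an alternative to a hand-written loop, `cumsum()` produces them directly from the frequency column, which avoids reusing the wrong counts by accident. This is a minimal sketch on made-up values; `rates`, the bin edges and the column names of `table` are hypothetical.

```python
import pandas as pd

# Hypothetical rates, NOT taken from gapminder.csv; None marks a missing value.
rates = pd.Series([1.0, 2.0, 2.5, 3.0, 6.0, 7.0, 9.5, 10.0, None, None])

# Bin the values, then count each bin; dropna=False keeps a row for missing data.
binned = pd.cut(rates, bins=[0, 5, 10])
freq = binned.value_counts(sort=False, dropna=False)

# Build the table; each cumulative column is a running total of the column before it.
table = pd.DataFrame({"freq": freq})
table["percent"] = table["freq"] * 100 / len(rates)
table["cum_freq"] = table["freq"].cumsum()
table["cum_percent"] = table["percent"].cumsum()

print(table)
```

Because every column is derived from `freq`, the last row of `cum_freq` always equals the number of rows and `cum_percent` ends at 100.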
