We want to represent all 10 crater categories in histograms. We organised the craters into 10 groups based on their diameter in km (0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100). For comparison, pandas "qcut" discretizes a variable into equal-sized buckets based on rank or sample quantiles, whereas "cut" bins by value ranges. Now, let us see all those groups visualized.
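As a quick aside, here is a minimal sketch of the difference between value-range binning ("pandas.cut") and quantile binning ("pandas.qcut"), using hypothetical diameters rather than the real dataset:

```python
import pandas as pd

# Hypothetical diameters in km; the real column is DIAM_CIRCLE_IMAGE
diam = pd.Series([2, 8, 15, 22, 35, 47, 58, 66, 79, 95])

# cut: two bins of equal WIDTH over the value range
width_bins = pd.cut(diam, bins=2)

# qcut: two bins with (roughly) EQUAL COUNTS, split at the median
quantile_bins = pd.qcut(diam, q=2)

print(width_bins.value_counts())     # counts depend on where values fall
print(quantile_bins.value_counts())  # 5 and 5 by construction
```

With skewed data like crater diameters, qcut keeps every group equally populated, while cut can leave some groups nearly empty.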
Python output:
Out[176]: Text(0.5, 1.0, 'Number of craters for each category')
Python output:
Describe Table
count 384343
unique 10
top 31-40
freq 40849
Name: DIAM_CIRCLE_GROUPS, dtype: object
Conclusion 1: The table contains 384343 elements (count), divided into 10 categories (unique). Among these 10 groups, the most populated is the 31-40 group (top), which contains craters whose diameter varies between 31 and 40 km, with 40849 craters in total (freq). From the graph, the histogram is roughly uniformly distributed across the groups.
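The describe() fields referenced above (count, unique, top, freq) are what pandas reports for an object or categorical column; a tiny hypothetical example:

```python
import pandas as pd

# Hypothetical diameter groups; the real column is DIAM_CIRCLE_GROUPS
groups = pd.Series(['0-10', '31-40', '31-40', '11-20', '31-40'], dtype='object')

desc = groups.describe()
print(desc)
# For object/categorical data, describe() reports:
#   count  - number of non-null entries
#   unique - number of distinct categories
#   top    - most frequent category
#   freq   - how many times 'top' occurs
```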
Conclusion 2: From all the histograms,
the LATITUDE_CIRCLE_IMAGE & LONGITUDE_CIRCLE_IMAGE histograms are bell-shaped,
the DEPTH_RIMFLOOR_TOPOG & NUMBER_LAYERS histograms are right-skewed, and
the CRATER_ID1 histogram is multimodal.
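The skew claims can be checked numerically as well as visually; a sketch on a hypothetical right-skewed sample (the real column is DEPTH_RIMFLOOR_TOPOG):

```python
import pandas as pd

# Hypothetical right-skewed sample standing in for DEPTH_RIMFLOOR_TOPOG:
# most depths are small, with a long tail of deeper craters
depths = pd.Series([0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 1.0, 2.5])

# A positive skew coefficient supports the "right-skewed" reading of a histogram
print(depths.skew())
```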
Graphs of Variables with Diameter of Crater:
Is there any direct relationship between the crater diameter and the rim-floor depth and the other variables? To check this, we will use scatter plots.
Relation between Depth Rim Floor and the Crater Diameter:
The scatter plot is too crowded to interpret as plotted; additional data manipulation is required.
Relation between Longitude and the Crater Diameter:
The scatter plot is too crowded to interpret as plotted; additional data manipulation is required.
Relation between Latitude and the Crater Diameter:
The scatter plot is too crowded to interpret as plotted; additional data manipulation is required.
Relation between Number of Layers and Crater Diameter:
The scatter plot is too crowded to interpret as plotted; additional data manipulation is required.
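A likely reason these scatter plots are unreadable is the heavy right skew of crater diameters: most points collapse near the origin. A hedged sketch of two common manipulations, log-scaling the axis and shrinking/fading the markers, using synthetic data standing in for the crater dataframe (column names taken from the code below):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic skewed stand-ins for the real crater columns
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "DIAM_CIRCLE_IMAGE": rng.lognormal(mean=1.5, sigma=1.0, size=5000),
    "DEPTH_RIMFLOOR_TOPOG": rng.exponential(scale=0.3, size=5000),
})

fig, ax = plt.subplots()
# Small, semi-transparent markers reduce overplotting
ax.scatter(data["DIAM_CIRCLE_IMAGE"], data["DEPTH_RIMFLOOR_TOPOG"], s=3, alpha=0.2)
ax.set_xscale("log")  # spread out the mass of small diameters
ax.set_xlabel("DIAM_CIRCLE_IMAGE (log scale)")
ax.set_ylabel("DEPTH_RIMFLOOR_TOPOG")
ax.set_title("Diameter vs. rim-floor depth, log-scaled diameter")
```

The same idea applies to the other three scatter plots; only the diameter axis needs the log scale.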
Python Code :
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
sns.countplot(x='DIAM_CIRCLE_GROUPS', data=sub6)
plt.xlabel('Size of circle diameter for each crater category')
plt.title('Number of craters for each category')
plt.show()
print('Describe Table')
desc1=sub6['DIAM_CIRCLE_GROUPS'].describe()
# DIAM_CIRCLE_GROUPS is categorical; if a numeric conversion were needed,
# pd.to_numeric(..., errors='coerce') replaces the deprecated convert_objects
print(desc1)
scat1= sns.regplot(x='DEPTH_RIMFLOOR_TOPOG', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('DEPTH_RIMFLOOR_TOPOG')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship Crater Diameter and Depth Rimfloor')
plt.show()
scat2= sns.regplot(x='LONGITUDE_CIRCLE_IMAGE', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('LONGITUDE_CIRCLE_IMAGE')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship Longitude and Crater Diameter')
plt.show()
scat3= sns.regplot(x='LATITUDE_CIRCLE_IMAGE', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('LATITUDE_CIRCLE_IMAGE')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship Latitude and Crater Diameter')
plt.show()
scat4= sns.regplot(x='NUMBER_LAYERS', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('NUMBER_LAYERS')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship No. of Layers and Crater Diameter')
plt.show()
plt.hist(data['DIAM_CIRCLE_IMAGE'], bins=10)
plt.xlabel('DIAM_CIRCLE_IMAGE')
plt.ylabel('Frequency')
plt.title('Histogram of Crater Diameter (10 bins)')
plt.show()
plt.hist(data['LATITUDE_CIRCLE_IMAGE'], bins=10)
plt.xlabel('LATITUDE_CIRCLE_IMAGE')
plt.ylabel('Frequency')
plt.title('Histogram of LATITUDE_CIRCLE_IMAGE (10 bins)')
plt.show()
plt.hist(data['LONGITUDE_CIRCLE_IMAGE'], bins=10)
plt.xlabel('LONGITUDE_CIRCLE_IMAGE')
plt.ylabel('Frequency')
plt.title('Histogram of LONGITUDE_CIRCLE_IMAGE (10 bins)')
plt.show()
plt.hist(data['DEPTH_RIMFLOOR_TOPOG'], bins=6)
plt.xlabel('DEPTH_RIMFLOOR_TOPOG')
plt.ylabel('Frequency')
plt.title('Histogram of DEPTH_RIMFLOOR_TOPOG (6 bins)')
plt.show()
plt.hist(data['NUMBER_LAYERS'], bins=5)
plt.xlabel('NUMBER_LAYERS')
plt.ylabel('Frequency')
plt.title('Histogram of NUMBER_LAYERS (5 bins)')
plt.show()
Week 3
- Updated blank/missing values to NaN.
- Removed rows with blank values to avoid any unintended impact on the distributions.
- Used "pandas.cut" to divide the values in each variable column into 3 bins (low, medium and high) based on the range of the values.
- Used "max()" to identify the maximum value of 'incomeperperson' and decide the splits to be made.
- Used "pandas.cut" to divide the values of 'incomeperperson' into 4 splits, i.e. 0-10000, 10001-20000, 20001-30000 and 30001-40000.
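The two "pandas.cut" usages above can be sketched on a small hypothetical income series (the real column is 'incomeperperson'):

```python
import pandas as pd

# Hypothetical income values; the real column is 'incomeperperson'
income = pd.Series([300, 4500, 9800, 15000, 26000, 39000])

# Three relative bins of equal width, as in the low/medium/high step
bins3 = pd.cut(income, 3, labels=['low', 'medium', 'high'])

# Four explicit splits, chosen after checking the max value (~40000)
splits = pd.cut(income, [0, 10000, 20000, 30000, 40000])

print(bins3.value_counts())
print(splits.value_counts(sort=False))
```

Passing an integer to pd.cut divides the observed value range into equal-width intervals; passing a list of edges gives full control over the split points.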
Python Code:
# importing libraries
import pandas
import numpy as np
# loading the gapminder dataset in csv format to python using pandas library,
# gapminder_data is the name chosen for my data frame
# low_memory=False stops pandas from inferring column dtypes chunk by chunk,
# so the whole data set is read before types are determined
gapminder_data = pandas.read_csv('gapminder.csv', low_memory=False)
# to convert all variable titles to lower case to avoid any error in variable calling
# due to case sensitivity
gapminder_data.columns=map(str.lower, gapminder_data.columns)
# to see the number of rows or observations
print(len(gapminder_data))
# to see the number of columns or variables
print(len(gapminder_data.columns))
data_subset=gapminder_data.copy()
# to replace blank cells with NaN values
data_subset.replace(' ', np.nan, inplace=True)
# to omit the whole row if any of these columns has a blank cell
data_subset.dropna(subset=['incomeperperson', 'breastcancerper100th', 'hivrate',
                           'internetuserate', 'lifeexpectancy', 'oilperperson',
                           'relectricperperson', 'suicideper100th', 'employrate',
                           'urbanrate'], inplace=True)
# to see the updated number of rows or observations
print(len(data_subset))
# to see the updated number of columns or variables
print(len(data_subset.columns))
# converting the data/variable values to numeric as
# some cells are blank and python may read the values as text/string type
variable_columns = ['incomeperperson', 'breastcancerper100th', 'hivrate',
                    'internetuserate', 'lifeexpectancy', 'oilperperson',
                    'relectricperperson', 'suicideper100th', 'employrate',
                    'urbanrate']
for column in variable_columns:
    data_subset[column] = data_subset[column].astype(float)
# dividing the values in the variable columns into bins/subsets as the
# values are continuous and not categorical
for column in variable_columns:
    data_subset[column + '_bins'] = pandas.cut(data_subset[column], 3,
                                               labels=['low', 'medium', 'high'])
# print frequency distribution (counts and percentages) for each binned variable
for column in [c for c in data_subset.columns if c.endswith('_bins')]:
    variable = column[:-len('_bins')]
    print('counts for bins in ' + variable)
    print(data_subset[column].value_counts())
    print('percentage for bins ' + variable)
    print(data_subset[column].value_counts(normalize=True))
# use of "pandas.cut" function to create 4 splits in 'incomeperperson' variable
print('Max value in incomeperperson')
print(data_subset['incomeperperson'].max())
data_subset['incomeperperson_split']=pandas.cut(data_subset['incomeperperson'],[0,10000,20000,30000,40000])
print('counts for splits in incomeperperson')
c11_ipp=data_subset['incomeperperson_split'].value_counts(sort=False)
print(c11_ipp)
print('percentage for splits incomeperperson')
p11_ipp=data_subset['incomeperperson_split'].value_counts(normalize=True, sort=False)
print(p11_ipp)
Output - Frequency Distribution
counts for bins in incomeperperson
low 35
high 11
medium 10
Name: incomeperperson_bins, dtype: int64
percentage for bins incomeperperson
low 0.625000
high 0.196429
medium 0.178571
Name: incomeperperson_bins, dtype: float64
counts for bins in breastcancerper100th
low 27
high 16
medium 13
Name: breastcancerper100th_bins, dtype: int64
percentage for bins breastcancerper100th
low 0.482143
high 0.285714
medium 0.232143
Name: breastcancerper100th_bins, dtype: float64
counts for bins in hivrate
low 55
high 1
medium 0
Name: hivrate_bins, dtype: int64
percentage for bins hivrate
low 0.982143
high 0.017857
medium 0.000000
Name: hivrate_bins, dtype: float64
counts for bins in internetuserate
high 23
medium 18
low 15
Name: internetuserate_bins, dtype: int64
percentage for bins internetuserate
high 0.410714
medium 0.321429
low 0.267857
Name: internetuserate_bins, dtype: float64
counts for bins in lifeexpectancy
high 41
medium 14
low 1
Name: lifeexpectancy_bins, dtype: int64
percentage for bins lifeexpectancy
high 0.732143
medium 0.250000
low 0.017857
Name: lifeexpectancy_bins, dtype: float64
counts for bins in oilperperson
low 54
high 1
medium 1
Name: oilperperson_bins, dtype: int64
percentage for bins oilperperson
low 0.964286
high 0.017857
medium 0.017857
Name: oilperperson_bins, dtype: float64
counts for bins in relectricperperson
low 47
medium 8
high 1
Name: relectricperperson_bins, dtype: int64
percentage for bins relectricperperson
low 0.839286
medium 0.142857
high 0.017857
Name: relectricperperson_bins, dtype: float64
counts for bins in suicideper100th
low 36
medium 16
high 4
Name: suicideper100th_bins, dtype: int64
percentage for bins suicideper100th
low 0.642857
medium 0.285714
high 0.071429
Name: suicideper100th_bins, dtype: float64
counts for bins in employrate
medium 33
low 15
high 8
Name: employrate_bins, dtype: int64
percentage for bins employrate
medium 0.589286
low 0.267857
high 0.142857
Name: employrate_bins, dtype: float64
counts for bins in urbanrate
medium 31
high 18
low 7
Name: urbanrate_bins, dtype: int64
percentage for bins urbanrate
medium 0.553571
high 0.321429
low 0.125000
Name: urbanrate_bins, dtype: float64
Max value in incomeperperson
39972.3527684608
counts for splits in incomeperperson
(0, 10000] 32
(10000, 20000] 7
(20000, 30000] 9
(30000, 40000] 8
Name: incomeperperson_split, dtype: int64
percentage for splits incomeperperson
(0, 10000] 0.571429
(10000, 20000] 0.125000
(20000, 30000] 0.160714
(30000, 40000] 0.142857
Name: incomeperperson_split, dtype: float64
About Frequency Distribution
As the values in the variable columns are continuous rather than categorical, each variable column has been individually sub-divided into three relative categories/bins (low, medium and high) based on the spread of the values within that column.
The frequency distribution captures the counts over these three bins for each variable column. If a more sensitive analysis of any variable is needed, the number of bins can be increased for that variable. For example, 4 splits were created for incomeperperson, [(0, 10000] < (10000, 20000] < (20000, 30000] < (30000, 40000]], based on the maximum value identified with the "max()" function.
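A sketch of what increasing the bin count looks like, on hypothetical values standing in for one of the continuous columns:

```python
import pandas as pd

# Hypothetical values standing in for a continuous variable column
values = pd.Series(range(0, 100, 5))  # 0, 5, ..., 95

# Three coarse bins versus five finer bins over the same data
coarse = pd.cut(values, 3, labels=['low', 'medium', 'high'])
fine = pd.cut(values, 5, labels=['very low', 'low', 'medium', 'high', 'very high'])

print(coarse.value_counts(sort=False))
print(fine.value_counts(sort=False))
```

More bins give a finer-grained view of where values cluster, at the cost of smaller, noisier counts per bin.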
Determinants of how an ideal romantic relationship depends on one's personality and family
After looking through the codebook of the National Longitudinal Study of Adolescent Health (AddHealth), I have decided to look at determinants of how an ideal romantic relationship depends on one's personality and family.
While deciding which factors play a part in this, I need to select the appropriate variables for my study. The AddHealth data set contains information on many indicators of independence, closeness in family, and personality. Some of these indicators would probably not strongly impact the outcome and can therefore be dropped from my study.
I will be using the following variables from the codebook:
1. Personality
How one takes hold of different situations when given options, and how well they are handled
2. Mindset
The process of approaching a problem, and the mindset towards the resulting scenario
3. Relationship with mother
Family bonding of the individual with their mother and the amount of openness.
4. Relationship with father
Family bonding of the individual with their father and the amount of openness.
5. Level of comfort with their partners
My research questions are:
1. How does an ideal romantic relationship depend on one's personality and family?
2. How do family support and parental attitude affect the relationship level?
My code book is:
Personality trait
Relationship with mother
Relationship with father
Bonding with partner
Level of comfort
Safety and precaution
Literature review:
In most countries, children aren't very comfortable talking to their parents about their relationship status for a variety of reasons, such as ethics, social norms, family values and fear.
One study found that when parents are open to accepting relationships, young people experience less mental stress and find it much easier to solve their problems than in households with strict ground rules.
Another study reveals that fear of being punished or of bringing shame to the family stops children from communicating about these things with their parents, which further exacerbates the problem among the young generation.
Considering the review, I have developed the following hypotheses:
1. Family relations play a huge role in one's mindset towards relationships.
2. A person's mindset and awareness also play a vital role.
Bibliography
1. Journal of Adolescence, Volume 37, Issue 4, June 2014, pages 433-440 (Elsevier).
2. Wyndol Furman and Laura Shaffer, "The Role of Romantic Relationships in Adolescent Development", in Adolescent Romantic Relations and Sexual Behavior: Theory, Research.