We want to represent all 10 crater categories in histograms. We organised the craters into 10 groups based on their diameter in km (0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100). For comparison, pandas "qcut" discretizes a variable into equal-sized buckets based on rank or sample quantiles, whereas "cut" bins by value ranges. Now, let us see all those groups visualized.
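As a quick aside, here is a minimal sketch of the difference between value-range binning ("pandas.cut") and quantile binning ("pandas.qcut"), using hypothetical diameters rather than the real dataset:

```python
import pandas as pd

# Hypothetical diameters in km; the real column is DIAM_CIRCLE_IMAGE
diam = pd.Series([2, 8, 15, 22, 35, 47, 58, 66, 79, 95])

# cut: two bins of equal WIDTH over the value range
width_bins = pd.cut(diam, bins=2)

# qcut: two bins with (roughly) EQUAL COUNTS, split at the median
quantile_bins = pd.qcut(diam, q=2)

print(width_bins.value_counts())     # counts depend on where values fall
print(quantile_bins.value_counts())  # 5 and 5 by construction
```

With skewed data like crater diameters, qcut keeps every group equally populated, while cut can leave some groups nearly empty.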
Python output:
Out[176]: Text(0.5, 1.0, 'Number of craters for each category')
Python output:
Describe Table
count 384343
unique 10
top 31-40
freq 40849
Name: DIAM_CIRCLE_GROUPS, dtype: object
Conclusion 1: The table contains 384343 elements (count), divided into 10 categories (unique). Among these 10 groups, the most populated is the 31-40 group (top), which contains craters whose diameter varies between 31 and 40 km, with 40849 craters in total (freq). From the graph, the histogram is roughly uniformly distributed across the groups.
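The describe() fields referenced above (count, unique, top, freq) are what pandas reports for an object or categorical column; a tiny hypothetical example:

```python
import pandas as pd

# Hypothetical diameter groups; the real column is DIAM_CIRCLE_GROUPS
groups = pd.Series(['0-10', '31-40', '31-40', '11-20', '31-40'], dtype='object')

desc = groups.describe()
print(desc)
# For object/categorical data, describe() reports:
#   count  - number of non-null entries
#   unique - number of distinct categories
#   top    - most frequent category
#   freq   - how many times 'top' occurs
```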
Conclusion 2: From all the histograms,
the LATITUDE_CIRCLE_IMAGE & LONGITUDE_CIRCLE_IMAGE histograms are bell-shaped,
the DEPTH_RIMFLOOR_TOPOG & NUMBER_LAYERS histograms are right-skewed, and
the CRATER_ID1 histogram is multimodal.
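The skew claims can be checked numerically as well as visually; a sketch on a hypothetical right-skewed sample (the real column is DEPTH_RIMFLOOR_TOPOG):

```python
import pandas as pd

# Hypothetical right-skewed sample standing in for DEPTH_RIMFLOOR_TOPOG:
# most depths are small, with a long tail of deeper craters
depths = pd.Series([0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 1.0, 2.5])

# A positive skew coefficient supports the "right-skewed" reading of a histogram
print(depths.skew())
```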
Graphs of Variables with Diameter of Crater:
Is there any direct relationship between the crater diameter and the rim-floor depth and the other variables? To check this, we will use scatter plots.
Relation between Depth Rim Floor and the Crater Diameter:
The scatter plot is too crowded to interpret as plotted; additional data manipulation is required.
Relation between Longitude and the Crater Diameter:
The scatter plot is too crowded to interpret as plotted; additional data manipulation is required.
Relation between Latitude and the Crater Diameter:
The scatter plot is too crowded to interpret as plotted; additional data manipulation is required.
Relation between Number of Layers and Crater Diameter:
The scatter plot is too crowded to interpret as plotted; additional data manipulation is required.
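A likely reason these scatter plots are unreadable is the heavy right skew of crater diameters: most points collapse near the origin. A hedged sketch of two common manipulations, log-scaling the axis and shrinking/fading the markers, using synthetic data standing in for the crater dataframe (column names taken from the code below):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic skewed stand-ins for the real crater columns
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "DIAM_CIRCLE_IMAGE": rng.lognormal(mean=1.5, sigma=1.0, size=5000),
    "DEPTH_RIMFLOOR_TOPOG": rng.exponential(scale=0.3, size=5000),
})

fig, ax = plt.subplots()
# Small, semi-transparent markers reduce overplotting
ax.scatter(data["DIAM_CIRCLE_IMAGE"], data["DEPTH_RIMFLOOR_TOPOG"], s=3, alpha=0.2)
ax.set_xscale("log")  # spread out the mass of small diameters
ax.set_xlabel("DIAM_CIRCLE_IMAGE (log scale)")
ax.set_ylabel("DEPTH_RIMFLOOR_TOPOG")
ax.set_title("Diameter vs. rim-floor depth, log-scaled diameter")
```

The same idea applies to the other three scatter plots; only the diameter axis needs the log scale.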
Python Code :
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
sns.countplot(x='DIAM_CIRCLE_GROUPS', data=sub6)
plt.xlabel('Size of circle diameter for each crater category')
plt.title('Number of craters for each category')
plt.show()
print('Describe Table')
desc1=sub6['DIAM_CIRCLE_GROUPS'].describe()
# DIAM_CIRCLE_GROUPS is categorical; if a numeric conversion were needed,
# pd.to_numeric(..., errors='coerce') replaces the deprecated convert_objects
print(desc1)
scat1= sns.regplot(x='DEPTH_RIMFLOOR_TOPOG', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('DEPTH_RIMFLOOR_TOPOG')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship Crater Diameter and Depth Rimfloor')
plt.show()
scat2= sns.regplot(x='LONGITUDE_CIRCLE_IMAGE', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('LONGITUDE_CIRCLE_IMAGE')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship Longitude and Crater Diameter')
plt.show()
scat3= sns.regplot(x='LATITUDE_CIRCLE_IMAGE', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('LATITUDE_CIRCLE_IMAGE')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship Latitude and Crater Diameter')
plt.show()
scat4= sns.regplot(x='NUMBER_LAYERS', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('NUMBER_LAYERS')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship No. of Layers and Crater Diameter')
plt.show()
plt.hist(data['DIAM_CIRCLE_IMAGE'], bins=10)
plt.xlabel('DIAM_CIRCLE_IMAGE')
plt.ylabel('Frequency')
plt.title('Histogram of Crater Diameter (10 bins)')
plt.show()
plt.hist(data['LATITUDE_CIRCLE_IMAGE'], bins=10)
plt.xlabel('LATITUDE_CIRCLE_IMAGE')
plt.ylabel('Frequency')
plt.title('Histogram of LATITUDE_CIRCLE_IMAGE (10 bins)')
plt.show()
plt.hist(data['LONGITUDE_CIRCLE_IMAGE'], bins=10)
plt.xlabel('LONGITUDE_CIRCLE_IMAGE')
plt.ylabel('Frequency')
plt.title('Histogram of LONGITUDE_CIRCLE_IMAGE (10 bins)')
plt.show()
plt.hist(data['DEPTH_RIMFLOOR_TOPOG'], bins=6)
plt.xlabel('DEPTH_RIMFLOOR_TOPOG')
plt.ylabel('Frequency')
plt.title('Histogram of DEPTH_RIMFLOOR_TOPOG (6 bins)')
plt.show()
plt.hist(data['NUMBER_LAYERS'], bins=5)
plt.xlabel('NUMBER_LAYERS')
plt.ylabel('Frequency')
plt.title('Histogram of NUMBER_LAYERS (5 bins)')
plt.show()
Week 3
- Updated blank/missing values to NaN.
- Removed rows with blank values to avoid any unintended impact on the distributions.
- Used "pandas.cut" to divide the values in each variable column into 3 bins (low, medium and high) based on the range of the values.
- Used "max()" to identify the maximum value of 'incomeperperson' and decide the splits to be made.
- Used "pandas.cut" to divide the values of 'incomeperperson' into 4 splits, i.e. 0-10000, 10001-20000, 20001-30000 and 30001-40000.
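The two "pandas.cut" usages above can be sketched on a small hypothetical income series (the real column is 'incomeperperson'):

```python
import pandas as pd

# Hypothetical income values; the real column is 'incomeperperson'
income = pd.Series([300, 4500, 9800, 15000, 26000, 39000])

# Three relative bins of equal width, as in the low/medium/high step
bins3 = pd.cut(income, 3, labels=['low', 'medium', 'high'])

# Four explicit splits, chosen after checking the max value (~40000)
splits = pd.cut(income, [0, 10000, 20000, 30000, 40000])

print(bins3.value_counts())
print(splits.value_counts(sort=False))
```

Passing an integer to pd.cut divides the observed value range into equal-width intervals; passing a list of edges gives full control over the split points.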
Python Code:
# importing libraries
import pandas
import numpy as np
# loading the gapminder dataset in csv format to python using pandas library,
# gapminder_data is the name chosen for my data frame
# low_memory=False stops pandas from inferring column dtypes chunk by chunk,
# so the whole data set is read before types are determined
gapminder_data = pandas.read_csv('gapminder.csv', low_memory=False)
# to convert all variable titles to lower case to avoid any error in variable calling
# due to case sensitivity
gapminder_data.columns=map(str.lower, gapminder_data.columns)
# to see the number of rows or observations
print(len(gapminder_data))
# to see the number of columns or variables
print(len(gapminder_data.columns))
data_subset=gapminder_data.copy()
# to replace blank cells with NaN values
data_subset.replace(' ', np.nan, inplace=True)
# to omit the whole row if any of these columns has a blank cell
data_subset.dropna(subset=['incomeperperson', 'breastcancerper100th', 'hivrate',
                           'internetuserate', 'lifeexpectancy', 'oilperperson',
                           'relectricperperson', 'suicideper100th', 'employrate',
                           'urbanrate'], inplace=True)
# to see the updated number of rows or observations
print(len(data_subset))
# to see the updated number of columns or variables
print(len(data_subset.columns))
# converting the data/variable values to numeric as
# some cells are blank and python may read the values as text/string type
variable_columns = ['incomeperperson', 'breastcancerper100th', 'hivrate',
                    'internetuserate', 'lifeexpectancy', 'oilperperson',
                    'relectricperperson', 'suicideper100th', 'employrate',
                    'urbanrate']
for column in variable_columns:
    data_subset[column] = data_subset[column].astype(float)
# dividing the values in the variable columns into bins/subsets as the
# values are continuous and not categorical
for column in variable_columns:
    data_subset[column + '_bins'] = pandas.cut(data_subset[column], 3,
                                               labels=['low', 'medium', 'high'])
# print frequency distribution (counts and percentages) for each binned variable
for column in [c for c in data_subset.columns if c.endswith('_bins')]:
    variable = column[:-len('_bins')]
    print('counts for bins in ' + variable)
    print(data_subset[column].value_counts())
    print('percentage for bins ' + variable)
    print(data_subset[column].value_counts(normalize=True))
# use of "pandas.cut" function to create 4 splits in 'incomeperperson' variable
print('Max value in incomeperperson')
print(data_subset['incomeperperson'].max())
data_subset['incomeperperson_split']=pandas.cut(data_subset['incomeperperson'],[0,10000,20000,30000,40000])
print('counts for splits in incomeperperson')
c11_ipp=data_subset['incomeperperson_split'].value_counts(sort=False)
print(c11_ipp)
print('percentage for splits incomeperperson')
p11_ipp=data_subset['incomeperperson_split'].value_counts(normalize=True, sort=False)
print(p11_ipp)
Output - Frequency Distribution
counts for bins in incomeperperson
low 35
high 11
medium 10
Name: incomeperperson_bins, dtype: int64
percentage for bins incomeperperson
low 0.625000
high 0.196429
medium 0.178571
Name: incomeperperson_bins, dtype: float64
counts for bins in breastcancerper100th
low 27
high 16
medium 13
Name: breastcancerper100th_bins, dtype: int64
percentage for bins breastcancerper100th
low 0.482143
high 0.285714
medium 0.232143
Name: breastcancerper100th_bins, dtype: float64
counts for bins in hivrate
low 55
high 1
medium 0
Name: hivrate_bins, dtype: int64
percentage for bins hivrate
low 0.982143
high 0.017857
medium 0.000000
Name: hivrate_bins, dtype: float64
counts for bins in internetuserate
high 23
medium 18
low 15
Name: internetuserate_bins, dtype: int64
percentage for bins internetuserate
high 0.410714
medium 0.321429
low 0.267857
Name: internetuserate_bins, dtype: float64
counts for bins in lifeexpectancy
high 41
medium 14
low 1
Name: lifeexpectancy_bins, dtype: int64
percentage for bins lifeexpectancy
high 0.732143
medium 0.250000
low 0.017857
Name: lifeexpectancy_bins, dtype: float64
counts for bins in oilperperson
low 54
high 1
medium 1
Name: oilperperson_bins, dtype: int64
percentage for bins oilperperson
low 0.964286
high 0.017857
medium 0.017857
Name: oilperperson_bins, dtype: float64
counts for bins in relectricperperson
low 47
medium 8
high 1
Name: relectricperperson_bins, dtype: int64
percentage for bins relectricperperson
low 0.839286
medium 0.142857
high 0.017857
Name: relectricperperson_bins, dtype: float64
counts for bins in suicideper100th
low 36
medium 16
high 4
Name: suicideper100th_bins, dtype: int64
percentage for bins suicideper100th
low 0.642857
medium 0.285714
high 0.071429
Name: suicideper100th_bins, dtype: float64
counts for bins in employrate
medium 33
low 15
high 8
Name: employrate_bins, dtype: int64
percentage for bins employrate
medium 0.589286
low 0.267857
high 0.142857
Name: employrate_bins, dtype: float64
counts for bins in urbanrate
medium 31
high 18
low 7
Name: urbanrate_bins, dtype: int64
percentage for bins urbanrate
medium 0.553571
high 0.321429
low 0.125000
Name: urbanrate_bins, dtype: float64
Max value in incomeperperson
39972.3527684608
counts for splits in incomeperperson
(0, 10000] 32
(10000, 20000] 7
(20000, 30000] 9
(30000, 40000] 8
Name: incomeperperson_split, dtype: int64
percentage for splits incomeperperson
(0, 10000] 0.571429
(10000, 20000] 0.125000
(20000, 30000] 0.160714
(30000, 40000] 0.142857
Name: incomeperperson_split, dtype: float64
About Frequency Distribution
As the values in the variable columns are continuous rather than categorical, each variable column has been individually sub-divided into three relative categories/bins (low, medium and high) based on the spread of the values within that column.
The frequency distribution captures the counts over these three bins for each variable column. If a more sensitive analysis of any variable is needed, the number of bins can be increased for that variable. For example, 4 splits were created for incomeperperson, [(0, 10000] < (10000, 20000] < (20000, 30000] < (30000, 40000]], based on the maximum value identified with the "max()" function.
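A sketch of what increasing the bin count looks like, on hypothetical values standing in for one of the continuous columns:

```python
import pandas as pd

# Hypothetical values standing in for a continuous variable column
values = pd.Series(range(0, 100, 5))  # 0, 5, ..., 95

# Three coarse bins versus five finer bins over the same data
coarse = pd.cut(values, 3, labels=['low', 'medium', 'high'])
fine = pd.cut(values, 5, labels=['very low', 'low', 'medium', 'high', 'very high'])

print(coarse.value_counts(sort=False))
print(fine.value_counts(sort=False))
```

More bins give a finer-grained view of where values cluster, at the cost of smaller, noisier counts per bin.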
Determinants of how an ideal romantic relationship depends on one's personality and family
After looking through the codebook of the National Longitudinal Study of Adolescent Health (AddHealth), I have decided to look at determinants of how an ideal romantic relationship depends on one's personality and family.
While deciding which factors play a part in this, I need to select the appropriate variables for my study. The AddHealth data set contains information on many indicators of independence, closeness in family, and personality. Some of these indicators would probably not strongly impact the outcome and can therefore be dropped from my study.
I will be using the following variables from the codebook:
1. Personality
How one takes hold of different situations when given options, and how well they are handled
2. Mindset
The process of approaching a problem, and the mindset towards the resulting scenario
3. Relationship with mother
Family bonding of the individual with their mother and the amount of openness.
4. Relationship with father
Family bonding of the individual with their father and the amount of openness.
5. Level of comfort with their partners
My research questions are:
1. How does an ideal romantic relationship depend on one's personality and family?
2. How do family support and parental attitude affect the relationship level?
My code book is:
Personality trait
Relationship with mother
Relationship with father
Bonding with partner
Level of comfort
Safety and precaution
Literature review:
In most countries, children aren't very comfortable talking to their parents about their relationship status for a variety of reasons, such as ethics, social norms, family values and fear.
One study found that when parents are open to accepting relationships, young people experience less mental stress and find it much easier to solve their problems than in households with strict ground rules.
Another study reveals that fear of being punished or of bringing shame to the family stops children from communicating about these things with their parents, which further exacerbates the problem among the young generation.
Considering the review, I have developed the following hypotheses:
1. Family relations play a huge role in one's mindset towards relationships.
2. A person's mindset and awareness also play a vital role.
Bibliography
1. Journal of Adolescence, Volume 37, Issue 4, June 2014, pages 433-440 (Elsevier).
2. Wyndol Furman and Laura Shaffer, "The Role of Romantic Relationships in Adolescent Development", in Adolescent Romantic Relations and Sexual Behavior: Theory, Research.