jahnvijoshijj-blog · 4 years
We want to represent all 10 crater categories as histograms. We organised the craters into 10 groups based on their diameter (0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100). The pandas qcut function discretizes a variable into equal-sized buckets based on rank or on sample quantiles. Now I want to see all those groups visualized as histograms.
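As a minimal sketch of what qcut does, the snippet below splits a synthetic diameter column into 10 quantile-based groups. The DataFrame `craters`, the random values and the seed are assumptions made for illustration; only the column names DIAM_CIRCLE_IMAGE and DIAM_CIRCLE_GROUPS come from the analysis itself.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the crater data (assumption: not the real data set)
rng = np.random.default_rng(0)
craters = pd.DataFrame({'DIAM_CIRCLE_IMAGE': rng.uniform(1, 100, size=1000)})

# qcut chooses the bin edges from the sample quantiles, so every group
# ends up with (almost) the same number of rows
craters['DIAM_CIRCLE_GROUPS'] = pd.qcut(craters['DIAM_CIRCLE_IMAGE'], 10)

# sort=False keeps the groups in bin order instead of ordering by count
group_counts = craters['DIAM_CIRCLE_GROUPS'].value_counts(sort=False)
print(group_counts)
```

Because the edges are taken from quantiles rather than fixed widths, a count plot of qcut groups tends to look flat, whatever the shape of the underlying diameter distribution.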
Python output:
Out[176]: Text(0.5, 1.0, 'Number of craters for each category')
Python output:
Describe Table
count     384343
unique        10
top        31-40
freq       40849
Name: DIAM_CIRCLE_GROUPS, dtype: object
Conclusion 1: The table contains 384343 elements (count), divided into 10 categories (unique). Among these 10 groups, the most populated is the 31-40 group (top), which contains craters whose diameter varies between 31 and 40 km. This most frequent group contains 40849 craters in total (freq). From the graph, the histogram is approximately uniformly distributed.
Conclusion 2: From all the histograms,
the LATITUDE_CIRCLE_IMAGE and LONGITUDE_CIRCLE_IMAGE histograms are bell-shaped curves,
the DEPTH_RIMFLOOR_TOPOG and NUMBER_LAYERS histograms are right-skewed, and
the CRATER_ID1 histogram is multimodal.
Graphs of Variables with Diameter of Crater:
Is there any direct relationship between the crater diameter and the rim-floor depth or the other variables? To find out, we will use scatter plots.
Relation between Depth Rim Floor and the Crater Diameter:
The scatter plot fails to display a graph that can be interpreted. Additional data manipulation is required.
Relation between Longitude and the Crater Diameter:
The scatter plot fails to display a graph that can be interpreted. Additional data manipulation is required.
Relation between Latitude and the Crater Diameter:
The scatter plot fails to display a graph that can be interpreted. Additional data manipulation is required.
Relation between Number of Layers and the Crater Diameter:
The scatter plot fails to display a graph that can be interpreted. Additional data manipulation is required.
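One possible form of the additional data manipulation mentioned above is to group the craters by diameter and compare the average depth per group, which is easier to read than a cloud of several hundred thousand points. The sketch below runs on synthetic data: the DataFrame `craters`, the made-up link between depth and diameter, and the choice of 4 groups are all assumptions, not results from the real data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
diam = rng.uniform(1, 100, size=2000)
# assumption for illustration only: depth grows with diameter, plus noise
depth = 0.02 * diam + rng.normal(0, 0.3, size=2000)
craters = pd.DataFrame({'DIAM_CIRCLE_IMAGE': diam,
                        'DEPTH_RIMFLOOR_TOPOG': depth})

# split the diameters into 4 quantile groups, then average the depth per group
craters['DIAM_GROUP'] = pd.qcut(craters['DIAM_CIRCLE_IMAGE'], 4)
mean_depth = (craters
              .groupby('DIAM_GROUP', observed=True)['DEPTH_RIMFLOOR_TOPOG']
              .mean())
print(mean_depth)
```

A bar chart of `mean_depth` (for example with `mean_depth.plot(kind='bar')`) would then show one bar per diameter group instead of one point per crater.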
Python Code:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
# `data` is the full crater data frame and `sub6` is the subset that holds
# the DIAM_CIRCLE_GROUPS column; both are created in earlier steps
# bar chart with the number of craters in each diameter group
sns.countplot(x='DIAM_CIRCLE_GROUPS', data=sub6)
plt.xlabel('Size of circle diameter for each crater category')
plt.title('Number of craters for each category')
plt.show()
print('Describe Table')
desc1 = sub6['DIAM_CIRCLE_GROUPS'].describe()
#sub6['DIAM_CIRCLE_GROUPS']=sub6['DIAM_CIRCLE_GROUPS'].convert_objects(convert_numeric=True)
print(desc1)
# scatter plots of each variable against the crater diameter
# (fit_reg=False draws only the points, without a regression line)
scat1 = sns.regplot(x='DEPTH_RIMFLOOR_TOPOG', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('DEPTH_RIMFLOOR_TOPOG')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship Crater Diameter and Depth Rimfloor')
plt.show()
scat2 = sns.regplot(x='LONGITUDE_CIRCLE_IMAGE', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('LONGITUDE_CIRCLE_IMAGE')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship Longitude and Crater Diameter')
plt.show()
scat3 = sns.regplot(x='LATITUDE_CIRCLE_IMAGE', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('LATITUDE_CIRCLE_IMAGE')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship Latitude and Crater Diameter')
plt.show()
scat4 = sns.regplot(x='NUMBER_LAYERS', y='DIAM_CIRCLE_IMAGE', fit_reg=False, data=data)
plt.xlabel('NUMBER_LAYERS')
plt.ylabel('DIAM_CIRCLE_IMAGE')
plt.title('Relationship No. of Layers and Crater Diameter')
plt.show()
# histograms: the variable is on the x-axis, the number of craters on the y-axis
plt.hist(data['DIAM_CIRCLE_IMAGE'], bins=10)
plt.xlabel('DIAM_CIRCLE_IMAGE (bins=10)')
plt.ylabel('Number of craters')
plt.title('Histogram curve of Crater Diameter')
plt.show()
plt.hist(data['LATITUDE_CIRCLE_IMAGE'], bins=10)
plt.xlabel('LATITUDE_CIRCLE_IMAGE (bins=10)')
plt.ylabel('Number of craters')
plt.title('Histogram curve of LATITUDE_CIRCLE_IMAGE')
plt.show()
plt.hist(data['LONGITUDE_CIRCLE_IMAGE'], bins=10)
plt.xlabel('LONGITUDE_CIRCLE_IMAGE (bins=10)')
plt.ylabel('Number of craters')
plt.title('Histogram curve of LONGITUDE_CIRCLE_IMAGE')
plt.show()
plt.hist(data['DEPTH_RIMFLOOR_TOPOG'], bins=6)
plt.xlabel('DEPTH_RIMFLOOR_TOPOG (bins=6)')
plt.ylabel('Number of craters')
plt.title('Histogram curve of DEPTH_RIMFLOOR_TOPOG')
plt.show()
plt.hist(data['NUMBER_LAYERS'], bins=5)
plt.xlabel('NUMBER_LAYERS (bins=5)')
plt.ylabel('Number of craters')
plt.title('Histogram curve of NUMBER_LAYERS')
plt.show()
jahnvijoshijj-blog · 4 years
Week 3
-        Updated blank/missing values to NaN
-        Removed rows with blank values to avoid any unintended impact on the distribution.
-        Used the “pandas.cut” function to divide the values in each variable column into 3 bins (low, medium and high) based on the range of the values
-        Used the “max()” function to identify the max value in ‘incomeperperson’ and evaluate the splits to be made
-        Used the “pandas.cut” function to divide the values in the ‘incomeperperson’ column into 4 splits, i.e. (0, 10000], (10000, 20000], (20000, 30000] and (30000, 40000].
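The two uses of pandas.cut listed above behave differently: an integer asks for that many equal-width bins over the observed range, while a list of edges gives explicit intervals. The sketch below shows both on a handful of made-up income values; the Series `income` and its numbers are assumptions for illustration, not gapminder data.

```python
import pandas as pd

# made-up income values, only for illustration
income = pd.Series([500.0, 2500.0, 9000.0, 15000.0, 26000.0, 39000.0])

# an integer asks for that many equal-width bins over the observed range
three_bins = pd.cut(income, 3, labels=['low', 'medium', 'high'])

# a list of edges gives explicit, right-closed intervals such as (0, 10000]
four_splits = pd.cut(income, [0, 10000, 20000, 30000, 40000])

print(three_bins.value_counts(sort=False))
print(four_splits.value_counts(sort=False))
```

With equal-width bins, a few very large values stretch the range and push most observations into the lowest bin, which is why skewed variables often end up mostly in "low".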
Python Code:
# importing libraries
import pandas
import numpy as np
# loading the gapminder dataset in csv format to python using pandas library,
# gapminder_data is the name chosen for my data frame
# low_memory=False makes pandas read the whole file before inferring each
# column's data type, which avoids columns with mixed types
gapminder_data = pandas.read_csv('gapminder.csv', low_memory=False)
# to convert all variable titles to lower case to avoid any error in variable calling
# due to case sensitivity
gapminder_data.columns = gapminder_data.columns.str.lower()
# to see the number of rows or observations
print(len(gapminder_data))
# to see the number of columns or variables
print(len(gapminder_data.columns))
data_subset = gapminder_data.copy()
# the ten variables used in this analysis
variables = ['incomeperperson', 'breastcancerper100th', 'hivrate',
             'internetuserate', 'lifeexpectancy', 'oilperperson',
             'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate']
# to replace blank cells with NaN values
data_subset.replace(' ', np.nan, inplace=True)
# to omit the whole row if any of these columns has a blank cell
data_subset.dropna(subset=variables, inplace=True)
# to see the updated number of rows or observations
print(len(data_subset))
# to see the updated number of columns or variables
print(len(data_subset.columns))
# converting the data/variable values to numeric as
# some cells are blank and python may set the values as text/string type
# upon reading the blank cells
for var in variables:
    data_subset[var] = data_subset[var].astype(float)
# dividing the values in the variable columns into bins/subsets as the values are continuous and not categorical
for var in variables:
    data_subset[var + '_bins'] = pandas.cut(data_subset[var], 3, labels=['low', 'medium', 'high'])
# print frequency distribution (counts, then proportions) for each variable
for var in variables:
    print('counts for bins in ' + var)
    print(data_subset[var + '_bins'].value_counts())
    print('percentage for bins ' + var)
    print(data_subset[var + '_bins'].value_counts(normalize=True))
# use of "pandas.cut" function to create 4 splits in 'incomeperperson' variable
print('Max value in incomeperperson')
print(data_subset['incomeperperson'].max())
data_subset['incomeperperson_split'] = pandas.cut(data_subset['incomeperperson'], [0, 10000, 20000, 30000, 40000])
print('counts for splits in incomeperperson')
c11_ipp = data_subset['incomeperperson_split'].value_counts(sort=False)
print(c11_ipp)
print('percentage for splits incomeperperson')
p11_ipp = data_subset['incomeperperson_split'].value_counts(normalize=True, sort=False)
print(p11_ipp)
Output - Frequency Distribution
counts for bins in incomeperperson
low       35
high      11
medium    10
Name: incomeperperson_bins, dtype: int64
percentage for bins incomeperperson
low       0.625000
high      0.196429
medium    0.178571
Name: incomeperperson_bins, dtype: float64
counts for bins in breastcancerper100th
low       27
high      16
medium    13
Name: breastcancerper100th_bins, dtype: int64
percentage for bins breastcancerper100th
low       0.482143
high      0.285714
medium    0.232143
Name: breastcancerper100th_bins, dtype: float64
counts for bins in hivrate
low       55
high       1
medium     0
Name: hivrate_bins, dtype: int64
percentage for bins hivrate
low       0.982143
high      0.017857
medium    0.000000
Name: hivrate_bins, dtype: float64
counts for bins in internetuserate
high      23
medium    18
low       15
Name: internetuserate_bins, dtype: int64
percentage for bins internetuserate
high      0.410714
medium    0.321429
low       0.267857
Name: internetuserate_bins, dtype: float64
counts for bins in lifeexpectancy
high      41
medium    14
low        1
Name: lifeexpectancy_bins, dtype: int64
percentage for bins lifeexpectancy
high      0.732143
medium    0.250000
low       0.017857
Name: lifeexpectancy_bins, dtype: float64
counts for bins in oilperperson
low       54
high       1
medium     1
Name: oilperperson_bins, dtype: int64
percentage for bins oilperperson
low       0.964286
high      0.017857
medium    0.017857
Name: oilperperson_bins, dtype: float64
counts for bins in relectricperperson
low       47
medium     8
high       1
Name: relectricperperson_bins, dtype: int64
percentage for bins relectricperperson
low       0.839286
medium    0.142857
high      0.017857
Name: relectricperperson_bins, dtype: float64
counts for bins in suicideper100th
low       36
medium    16
high       4
Name: suicideper100th_bins, dtype: int64
percentage for bins suicideper100th
low       0.642857
medium    0.285714
high      0.071429
Name: suicideper100th_bins, dtype: float64
counts for bins in employrate
medium    33
low       15
high       8
Name: employrate_bins, dtype: int64
percentage for bins employrate
medium    0.589286
low       0.267857
high      0.142857
Name: employrate_bins, dtype: float64
counts for bins in urbanrate
medium    31
high      18
low        7
Name: urbanrate_bins, dtype: int64
percentage for bins urbanrate
medium    0.553571
high      0.321429
low       0.125000
Name: urbanrate_bins, dtype: float64
Max value in incomeperperson
39972.3527684608
counts for splits in incomeperperson
(0, 10000]        32
(10000, 20000]     7
(20000, 30000]     9
(30000, 40000]     8
Name: incomeperperson_split, dtype: int64
percentage for splits incomeperperson
(0, 10000]       0.571429
(10000, 20000]   0.125000
(20000, 30000]   0.160714
(30000, 40000]   0.142857
Name: incomeperperson_split, dtype: float64
About Frequency Distribution
As the values in the variable columns are continuous and not categorical, each variable column has been individually sub-divided into three relative categories/bins (low, medium and high), based on the spread of the values within that variable column.
The frequency distribution captures the distribution over these three bins for each variable column. If a more sensitive analysis of any variable column is needed, the number of bins can be increased for that individual variable. For example, 4 splits are identified for incomeperperson, [(0, 10000] < (10000, 20000] < (20000, 30000] < (30000, 40000]], based on the max value identified using the “max()” function.
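The "<" signs in the list of splits reflect that pandas.cut returns an ordered categorical, so the bins have a defined order. The sketch below shows how this can be inspected; the Series `income` holds made-up values (apart from the maximum quoted in the output above) and is an assumption for illustration.

```python
import pandas as pd

# made-up incomes; the last value is the maximum reported in the output
income = pd.Series([800.0, 12000.0, 25000.0, 39972.35])
splits = pd.cut(income, [0, 10000, 20000, 30000, 40000])

# the result is an ordered categorical: the bins know their order
print(splits.cat.ordered)
print(list(splits.cat.categories))

# because the categories are ordered, min and max over bins are meaningful
print(splits.min(), splits.max())
```

This ordering is also what lets value_counts(sort=False) list the splits from the lowest interval to the highest rather than by frequency.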
jahnvijoshijj-blog · 4 years
Each value of the variable country occurs only once (i.e. its frequency is 1). There are a few rows in the columns femaleemployrate and urbanrate where the frequency can go up to 3. I have removed all rows with missing values for the variable femaleemployrate.
Program:
libname mydata "/courses/d1406ae5ba27fe300" access=readonly;
run;
data gap_new;
set mydata.gapminder;
where femaleemployrate ne .;
run;
proc sort data=gap_new;
by country;
run;
proc freq data=gap_new;
tables femaleemployrate country urbanrate;
run;
Output:
The FREQ Procedure
femaleemployrate    Frequency
11.300000199        1
12.399999619        1
13                  1
16.700000763        1
17.700000763        1
18.200000763        1
19                  1
Country             Frequency
Afghanistan         1
Albania             1
Algeria             1
Angola              1
Argentina           1
Armenia             1
urbanrate           Frequency
10.4                1
12.54               1
12.98               1
13.22               1
15.1                1
16.54               1
17                  1
Frequency Missing = 5
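For readers who prefer Python, the SAS steps above can be approximated in pandas. This is only a rough sketch: the small DataFrame `gap` is made up and stands in for the gapminder table, and only the column names are taken from the program.

```python
import numpy as np
import pandas as pd

# made-up rows standing in for the gapminder table (assumption)
gap = pd.DataFrame({
    'country': ['Chile', 'Albania', 'Brazil', 'Denmark'],
    'femaleemployrate': [39.0, 42.1, np.nan, 58.3],
    'urbanrate': [88.4, 46.7, 85.6, np.nan],
})

# like "where femaleemployrate ne .": keep rows where the value is present,
# then sort by country as proc sort does
gap_new = gap.dropna(subset=['femaleemployrate']).sort_values('country')

# like proc freq: one frequency table per variable, plus the missing count
for column in ['femaleemployrate', 'country', 'urbanrate']:
    freq = gap_new[column].value_counts(sort=False)
    missing = gap_new[column].isna().sum()
    print(freq)
    print('Frequency Missing =', missing)
```

Note that value_counts leaves out missing values by default, so the missing count is computed separately with isna().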
jahnvijoshijj-blog · 4 years
Determinants of how an ideal romantic relationship depends on one's personality and family
After looking through the codebook of the National Longitudinal Study of Adolescent Health (AddHealth), I have decided to look at how an ideal romantic relationship depends on one's personality and family.
Having decided to look at which factors play a part in this, I need to select the appropriate variables for my study. The AddHealth data set contains information on many indicators of independence, closeness in the family and personality. Some of these indicators would probably not strongly affect how an ideal romantic relationship depends on one's personality and family, and can therefore be dropped from my study.
I will be looking at how an individual's personality and family relationships affect their ideal romantic relationship. I will be using the following variables in my code book.
1.          Personality
How one takes hold of different situations when given options, and how well they are handled
2.          Mindset
The way one approaches a problem and the mindset towards the resulting situation
3.          Relationship with mother
Family bonding of the individual with their mother and the amount of openness.
4.          Relationship with father
Family bonding of the individual with their father and the amount of openness.
5.          Level of comfort with their partners
My research questions are:
1.      How does an ideal romantic relationship depend on one's personality and family?
2.      How do family support and parents' attitudes affect the relationship level?
My code book is:
Personality trait
 Relationship with mother
 Relationship with father
 Bonding with partner
 Level of comfort
 Safety and precaution
Literature review:
In most countries, children aren't very comfortable talking to their parents about their relationship status, for a variety of reasons such as ethics, social norms, family values and fear.
One study found that if parents are open to accepting the relationship, young people experience less mental stress and solve their problems more easily than those whose parents enforce strict household ground rules.
Another study reveals that the fear of being punished or of bringing shame to the family stops young people from communicating about these things with their parents, which further exacerbates the problem among the young generation.
Considering the review, I have developed the following hypotheses:
1.      Family relations play a huge role in one's mindset towards relationships.
2.      A person's mindset and awareness also play a vital role.
Bibliography
1. Journal of Adolescence (Elsevier), Volume 37, Issue 4, June 2014, Pages 433-440
2. Wyndol Furman and Laura Shaffer, The Role of Romantic Relationships in Adolescent Development. In: Adolescent Romantic Relations and Sexual Behavior: Theory, Research