willemgijsbers - Tumblr blog

willemgijsbers · 5 years ago


Peer-graded Assignment: Creating graphs for your data

Original research questions: Are craters evenly distributed across latitude and longitude? Is there a correlation between the latitude / longitude of a crater and its size?

The Python code and graphs below suggest that craters are not evenly distributed across Mars, with particular spikes around -25° latitude and between -50° and +50° longitude. However, the scatterplots show no strong correlation between crater size and either latitude or longitude.
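One point to keep in mind when reading the latitude histogram: equal-width latitude bins do not cover equal surface area on a sphere, so even a perfectly uniform crater density would give fewer craters per 5° bin near the poles. The sketch below is not part of the original assignment code. It computes the expected share of a sphere's surface per latitude band using the standard formula (sin(lat_hi) - sin(lat_lo)) / 2. The function name is hypothetical, and Mars is assumed to be a perfect sphere.

```python
import math


def expected_band_fraction(lat_lo_deg, lat_hi_deg):
    """Fraction of a sphere's surface lying between two latitudes (in degrees)."""
    lo = math.radians(lat_lo_deg)
    hi = math.radians(lat_hi_deg)
    return (math.sin(hi) - math.sin(lo)) / 2.0


# 5° bands from -90° to +90°, matching the bin width used for the histograms
edges = list(range(-90, 95, 5))
fractions = [expected_band_fraction(a, b) for a, b in zip(edges[:-1], edges[1:])]

equator_band = expected_band_fraction(0, 5)
polar_band = expected_band_fraction(85, 90)
print(f"Share of surface between 0° and 5°:   {equator_band:.4f}")
print(f"Share of surface between 85° and 90°: {polar_band:.4f}")
```

Comparing the observed bin counts against these shares (multiplied by the total number of craters) would give a fairer picture of "evenness" than comparing raw counts between bins.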

Graphs:

Python code:

# -*- coding: utf-8 -*-
"""
Created on Tue Apr 28 15:24:11 2020

@author: willem gijsbers
"""

# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set pandas to show all columns in a DataFrame
pd.set_option('display.max_columns', None)
# Set pandas to show all rows in a DataFrame
pd.set_option('display.max_rows', None)

# Bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x: '%f' % x)

# Load the Mars crater data set
data = pd.read_csv("Marscrater_pds.csv", low_memory=False)

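# Convert variables to numeric (pd.to_numeric returns a new Series, so assign it back)
data["DIAM_CIRCLE_IMAGE"] = pd.to_numeric(data["DIAM_CIRCLE_IMAGE"])
data["LATITUDE_CIRCLE_IMAGE"] = pd.to_numeric(data["LATITUDE_CIRCLE_IMAGE"])
data["LONGITUDE_CIRCLE_IMAGE"] = pd.to_numeric(data["LONGITUDE_CIRCLE_IMAGE"])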

# As per the week 2 assignment, latitude and longitude need to be binned to answer the correlation question
# To keep the number of bins manageable, 5 degrees (both for latitude and longitude) is used as the bin size
# Crater sizes are split using the same principle but using a step of 10 km
latitudeBins = np.arange(-90, 95, 5, dtype=np.int64)
# Step set to 5 to match the comment above and the plot title (the original step was 10)
longitudeBins = np.arange(-180, 185, 5, dtype=np.int64)
# Stop raised to 1171 so the last edge (1170) covers the largest crater (1164.22 km)
sizeBins = np.arange(0, 1171, 10, dtype=np.int64)

#Output univariate graphs

# Latitude
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot is its replacement
sns.distplot(data["LATITUDE_CIRCLE_IMAGE"].dropna(), kde=False, bins=latitudeBins)
plt.xlabel('Latitude')
plt.title('Number of Craters per latitude bin of 5°')
plt.figure()

# Longitude
sns.distplot(data["LONGITUDE_CIRCLE_IMAGE"].dropna(), kde=False, bins=longitudeBins)
plt.xlabel('Longitude')
plt.title('Number of Craters per longitude bin of 5°')
plt.figure()

# Size
sns.distplot(data["DIAM_CIRCLE_IMAGE"].dropna(), kde=False, bins=sizeBins)
plt.xlabel('Crater size')
plt.title('Size of the craters per 10 kilometers')
plt.figure()

# Output bivariate graphs
# Basic scatterplot: Q->Q
scat1 = sns.regplot(x="LATITUDE_CIRCLE_IMAGE", y="DIAM_CIRCLE_IMAGE", fit_reg=False, data=data)
plt.xlabel('Crater latitude')
plt.ylabel('Crater size')
plt.title('Scatterplot for the Association Between crater latitude and crater size')
plt.figure()

scat2 = sns.regplot(x="LONGITUDE_CIRCLE_IMAGE", y="DIAM_CIRCLE_IMAGE", fit_reg=False, data=data)
plt.xlabel('Crater longitude')
plt.ylabel('Crater size')
plt.title('Scatterplot for the Association Between crater longitude and crater size')
plt.figure()


willemgijsbers · 5 years ago


Peer-graded Assignment: Making Data Management Decisions

The subsequent assignment is to perform data management on the Mars crater data. As the objective is to identify correlation between crater size and position on Mars, and given the large number of data points, I have chosen (for this exercise) to bin the values.

In practice this might not be required as correlation can work perfectly well on continuous variables.
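As a minimal sketch of that point, the Pearson correlation can be computed directly on the unbinned columns. The example below uses randomly generated stand-in values rather than the real data set (the column names match the codebook, everything else is an assumption), so the resulting coefficients say nothing about Mars.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the crater data: positions and sizes drawn independently
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "LATITUDE_CIRCLE_IMAGE": rng.uniform(-90, 90, n),
    "LONGITUDE_CIRCLE_IMAGE": rng.uniform(-180, 180, n),
    "DIAM_CIRCLE_IMAGE": 1 + rng.exponential(2.0, n),
})

# Pairwise Pearson correlation on the continuous variables, no binning needed
corr = demo.corr(method="pearson")
r_lat = corr.loc["LATITUDE_CIRCLE_IMAGE", "DIAM_CIRCLE_IMAGE"]
r_lon = corr.loc["LONGITUDE_CIRCLE_IMAGE", "DIAM_CIRCLE_IMAGE"]
print(corr)
```

With the real file loaded into `data`, the same `corr` call on the three columns would give the coefficients for the actual craters.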

The code bins latitude and longitude variables into buckets of 5° and the size in buckets of 1 kilometer.

Unlike the survey examples in the lessons, the data does not contain any missing values as it is not based on surveys.
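A quick way to confirm that there are no missing values is to count the nulls per column after the numeric conversion. The sketch below uses a tiny made-up frame in which one diameter is deliberately unparseable, to show what a missing value would look like. The placeholder string "unknown" and the use of `errors="coerce"` are assumptions for illustration and do not appear in the assignment code.

```python
import pandas as pd

# Tiny made-up sample; the last diameter is deliberately not a number
demo = pd.DataFrame({
    "LATITUDE_CIRCLE_IMAGE": [-86.7, 10.2, 85.702],
    "DIAM_CIRCLE_IMAGE": ["1.00", "1164.22", "unknown"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
demo["DIAM_CIRCLE_IMAGE"] = pd.to_numeric(demo["DIAM_CIRCLE_IMAGE"], errors="coerce")

# Number of missing values per column; all zeros would mean nothing is missing
missing = demo.isna().sum()
print(missing)
```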

Used Python code:

# -*- coding: utf-8 -*-
"""
Created on Tue Apr 28 13:17:37 2020

@author: Willem Gijsbers
"""

# Import required libraries
import pandas as pd
import numpy as np

# Load the Mars crater data set
data = pd.read_csv("Marscrater_pds.csv", low_memory=False)

# Convert variables to numeric (pd.to_numeric returns a new Series, so assign it back)
data["DIAM_CIRCLE_IMAGE"] = pd.to_numeric(data["DIAM_CIRCLE_IMAGE"])
data["LATITUDE_CIRCLE_IMAGE"] = pd.to_numeric(data["LATITUDE_CIRCLE_IMAGE"])
data["LONGITUDE_CIRCLE_IMAGE"] = pd.to_numeric(data["LONGITUDE_CIRCLE_IMAGE"])

# As per the week 2 assignment, latitude and longitude need to be binned to answer the correlation question
# To keep the number of bins manageable, 5 degrees (both for latitude and longitude) is used as the bin size
# Crater sizes are binned in steps of 1 kilometer
latitudeBins = np.arange(-90, 95, 5, dtype=np.int64)
longitudeBins = np.arange(-180, 185, 5, dtype=np.int64)
sizeBins = np.arange(0, 1166, 1, dtype=np.int64)

data["LATITUDE_BIN"] = pd.cut(data["LATITUDE_CIRCLE_IMAGE"], latitudeBins)
data["LONGITUDE_BIN"] = pd.cut(data["LONGITUDE_CIRCLE_IMAGE"], longitudeBins)
data["SIZE_BIN"] = pd.cut(data["DIAM_CIRCLE_IMAGE"], sizeBins)

# Output
print("Updated latitude frequency table")
print(data["LATITUDE_BIN"].value_counts().sort_index())

print("Updated longitude frequency table")
print(data["LONGITUDE_BIN"].value_counts().sort_index())

print("Updated size frequency table")
print(data["SIZE_BIN"].value_counts().sort_index())

Output

[-90 -85 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90]
[-180 -175 -170 -165 -160 -155 -150 -145 -140 -135 -130 -125 -120 -115 -110 -105 -100 -95 -90 -85 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180]
Latitude and bins cross-tab for validation
LATITUDE_CIRCLE_IMAGE  -86.700  -86.560  -85.988  ...   85.085   85.097   85.702
LATITUDE_BIN                                      ...
(-90, -85]                   1        1        1  ...        0        0        0
(-85, -80]                   0        0        0  ...        0        0        0
(-80, -75]                   0        0        0  ...        0        0        0
(-75, -70]                   0        0        0  ...        0        0        0
(-70, -65]                   0        0        0  ...        0        0        0
(-65, -60]                   0        0        0  ...        0        0        0
(-60, -55]                   0        0        0  ...        0        0        0
(-55, -50]                   0        0        0  ...        0        0        0
(-50, -45]                   0        0        0  ...        0        0        0
(-45, -40]                   0        0        0  ...        0        0        0
(-40, -35]                   0        0        0  ...        0        0        0
(-35, -30]                   0        0        0  ...        0        0        0
(-30, -25]                   0        0        0  ...        0        0        0
(-25, -20]                   0        0        0  ...        0        0        0
(-20, -15]                   0        0        0  ...        0        0        0
(-15, -10]                   0        0        0  ...        0        0        0
(-10, -5]                    0        0        0  ...        0        0        0
(-5, 0]                      0        0        0  ...        0        0        0
(0, 5]                       0        0        0  ...        0        0        0
(5, 10]                      0        0        0  ...        0        0        0
(10, 15]                     0        0        0  ...        0        0        0
(15, 20]                     0        0        0  ...        0        0        0
(20, 25]                     0        0        0  ...        0        0        0
(25, 30]                     0        0        0  ...        0        0        0
(30, 35]                     0        0        0  ...        0        0        0
(35, 40]                     0        0        0  ...        0        0        0
(40, 45]                     0        0        0  ...        0        0        0
(45, 50]                     0        0        0  ...        0        0        0
(50, 55]                     0        0        0  ...        0        0        0
(55, 60]                     0        0        0  ...        0        0        0
(60, 65]                     0        0        0  ...        0        0        0
(65, 70]                     0        0        0  ...        0        0        0
(70, 75]                     0        0        0  ...        0        0        0
(75, 80]                     0        0        0  ...        0        0        0
(80, 85]                     0        0        0  ...        0        0        0
(85, 90]                     0        0        0  ...        1        1        1

[36 rows x 129197 columns]
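A cross-tab like the one above can be produced with `pd.crosstab`, which tabulates each original value against the bin it was assigned to. The sketch below shows one way this might be done on a handful of made-up latitudes. It is an assumption about how such a validation could be written, not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# A few made-up latitudes, including one that sits exactly on a bin edge (0.0)
demo = pd.DataFrame(
    {"LATITUDE_CIRCLE_IMAGE": [-86.7, -86.56, -2.5, 0.0, 85.085, 85.702]}
)

latitudeBins = np.arange(-90, 95, 5)
demo["LATITUDE_BIN"] = pd.cut(demo["LATITUDE_CIRCLE_IMAGE"], latitudeBins)

# Rows are bins, columns are the original values; each column should hold a single 1
check = pd.crosstab(demo["LATITUDE_BIN"], demo["LATITUDE_CIRCLE_IMAGE"])
print(check)

# A more compact validation: every value must lie inside its assigned interval
inside = [row.LATITUDE_CIRCLE_IMAGE in row.LATITUDE_BIN for row in demo.itertuples()]
print(all(inside))
```

Because `pd.cut` uses right-closed intervals by default, a value of exactly 0.0 lands in the (-5, 0] bin rather than (0, 5].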


willemgijsbers · 5 years ago


Peer-graded Assignment: Running Your First Program

The second assignment consists of developing a Python program that uses the selected data set (in my case the Mars craters) and explores the selected variables using frequency counts.

I have selected the following 3 variables:

DIAM_CIRCLE_IMAGE
LATITUDE_CIRCLE_IMAGE
LONGITUDE_CIRCLE_IMAGE

The following Python code is used to calculate the outputs:

# -*- coding: utf-8 -*-
"""
Created on Tue Apr 28 10:56:21 2020

@author: Willem Gijsbers
"""

# Import required libraries - numpy not imported as not used
import pandas as pd

# Load the Mars crater data set
data = pd.read_csv("Marscrater_pds.csv", low_memory=False)

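# Convert variables to numeric (pd.to_numeric returns a new Series, so assign it back)
data["DIAM_CIRCLE_IMAGE"] = pd.to_numeric(data["DIAM_CIRCLE_IMAGE"])
data["LATITUDE_CIRCLE_IMAGE"] = pd.to_numeric(data["LATITUDE_CIRCLE_IMAGE"])
data["LONGITUDE_CIRCLE_IMAGE"] = pd.to_numeric(data["LONGITUDE_CIRCLE_IMAGE"])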

# Calculate frequency counts and sort by key rather than frequency count
diameterValueCount = data["DIAM_CIRCLE_IMAGE"].value_counts().sort_index()
latitudeValueCount = data["LATITUDE_CIRCLE_IMAGE"].value_counts().sort_index()
longitudeValueCount = data["LONGITUDE_CIRCLE_IMAGE"].value_counts().sort_index()

print("Freqency distribution for crater diameters")
print(diameterValueCount)

print("Freqency distribution for latitude")
print(latitudeValueCount)

print("Freqency distribution for longitude")
print(longitudeValueCount)

# For practice only select craters with diameter between 1 and 2 kilometers
selection = data[(data["DIAM_CIRCLE_IMAGE"] >= 1) & (data["DIAM_CIRCLE_IMAGE"] < 2)]
selection = selection.copy()

selectedDiameterValueCount = selection["DIAM_CIRCLE_IMAGE"].value_counts().sort_index()

print("Freqency distribution for selected crater diameters between 1 and 2 kilometers")
print(selectedDiameterValueCount)

The output is the following:

Freqency distribution for crater diameters
1.00       3129
1.01       6298
1.02       6077
1.03       6035
1.04       5941
           ..
467.25        1
512.75        1
624.50        1
1096.65       1
1164.22       1
Name: DIAM_CIRCLE_IMAGE, Length: 6240, dtype: int64
Freqency distribution for latitude
-86.700    1
-86.560    1
-85.988    1
-85.973    1
-85.560    1
          ..
84.969     1
85.008     1
85.085     1
85.097     1
85.702     1
Name: LATITUDE_CIRCLE_IMAGE, Length: 129197, dtype: int64
Freqency distribution for longitude
-179.997    1
-179.993    1
-179.992    2
-179.991    2
-179.990    1
           ..
179.992     1
179.993     1
179.994     1
179.996     1
179.997     1
Name: LONGITUDE_CIRCLE_IMAGE, Length: 231245, dtype: int64
Freqency distribution for selected crater diameters between 1 and 2 kilometers
1.00    3129
1.01    6298
1.02    6077
1.03    6035
1.04    5941
        ..
1.95     942
1.96     926
1.97     918
1.98     957
1.99     876
Name: DIAM_CIRCLE_IMAGE, Length: 100, dtype: int64

Observations:

Craters appear to be mostly small (with the frequency count dropping quickly between 1 and 2 km) and are clearly not normally distributed.

The longitude and latitude distributions are more challenging to judge; answering the original research question would require grouping the craters into larger buckets (e.g., of one degree).

Frequency tables do not suggest any missing values (both looking at the index sort and the overall length).
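The grouping into one-degree buckets mentioned in the observations could be sketched as follows. The latitudes are randomly generated stand-ins for the real column, and the names `oneDegreeBins` and `LATITUDE_BIN_1DEG` are hypothetical.

```python
import numpy as np
import pandas as pd

# Random stand-in latitudes within the range seen in the frequency table
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    {"LATITUDE_CIRCLE_IMAGE": rng.uniform(-86.7, 85.7, 500).round(3)}
)

# One-degree bucket edges from -90 to +90
oneDegreeBins = np.arange(-90, 91, 1)
demo["LATITUDE_BIN_1DEG"] = pd.cut(demo["LATITUDE_CIRCLE_IMAGE"], oneDegreeBins)

# Frequency count per bucket, sorted by bucket rather than by count
counts = demo["LATITUDE_BIN_1DEG"].value_counts().sort_index()
print(counts.head())
```

The same two lines (`pd.cut` followed by `value_counts().sort_index()`) would apply to the longitude column with edges from -180 to +180.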


willemgijsbers · 5 years ago


Week 1 - Getting Your Research Project Started

STEP 1: Choose a data set that you would like to work with

After reviewing the data sets, I’ve selected the Mars crater data set as the one I would like to use for this course on data management & visualization. I have an exact sciences background and have always been passionate about astronomy.

STEP 2: Identify a specific topic of interest & STEP 4: Identify a second topic that you would like to explore in terms of its association with your original topic

As a topic I would like to explore the randomness of the craters on Mars: Is there any correlation between the number and size of the craters and their position?

STEP 3 & 5: Prepare a codebook of your own & Add questions/items/variables documenting this second topic to your personal codebook

Relevant variables selected from the official code book (https://d396qusza40orc.cloudfront.net/phoenixassets/data-management-visualization/Mars%20Crater%20Codebook.pdf):

CRATER_ID
LATITUDE_CIRCLE_IMAGE
LONGITUDE_CIRCLE_IMAGE
DIAM_CIRCLE_IMAGE

STEP 6: Perform a literature review to see what research has been previously done on this topic

I used Google Scholar to search for “Randomness crates MARS”: craters typically occur in clusters or as isolated crater doublets (https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19710020643.pdf). Additionally, I searched for “mars crater distribution”:

Research performed in 1974 suggests that depending on the size of the crater, distribution is more or less random. Large craters tend to form in clusters whereas smaller ones cluster around specific bands (https://www.sciencedirect.com/science/article/abs/pii/0019103574901754).

Overall, the literature review suggests that craters typically occur in clusters due to the geological processes that create them and that the distribution is not random.

STEP 7: Hypothesis

Based on the literature review, my current hypothesis is that craters are not evenly or randomly distributed and that there will be correlations with longitude and latitude.
