#frequency_distribution | Explore Tumblr posts and blogs

chaolinchang · 11 months ago

Analyzing Tobacco Use Patterns and Experiences

(Creating Graphs for Your Data, Week 4)

Code:

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

# Load the NESARC dataset

data = pd.read_csv('/Users/ccl/Documents/Coursera/Data Management and Visualization/datasets/nesarc_pds.csv', low_memory=False)

# Select the relevant columns for nicotine dependence and smoking

columns_of_interest = [

'SMOKER', # TOBACCO USE STATUS

'S3AQ2A1', # Age when first smoked a cigarette

'S3AQ3C1', # Usual quantity when smoked cigarettes..

'S3AQ3B1', # USUAL FREQUENCY WHEN SMOKED CIGARETTES

'S3AQ91', # SMOKED CIGARETTES WHEN HAD SOME OF THESE EXPERIENCES WITH TOBACCO IN LAST 12 MONTHS

]

colnames = [

"TOBACCO USE STATUS",

"AGE WHEN SMOKED FIRST FULL CIGARETTE",

"USUAL QUANTITY WHEN SMOKED CIGARETTES",

"USUAL FREQUENCY WHEN SMOKED CIGARETTES",

"SMOKED CIGARETTES WHEN HAD SOME OF THESE EXPERIENCES WITH TOBACCO IN LAST 12 MONTHS"

]

# Extract the relevant columns

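# .copy() makes the subset an independent DataFrame, so the column assignments below do not trigger SettingWithCopyWarning
nesarc_subset = data[columns_of_interest].copy()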

# Convert the selected columns to numeric, handling non-numeric values

for col in columns_of_interest:
    nesarc_subset[col] = pd.to_numeric(nesarc_subset[col], errors='coerce')

# Display basic information about the subset

print(nesarc_subset.info())

# Display the first few rows of the subset

print(nesarc_subset.head())

# Calculate the frequency distribution for each variable

frequency_distributions = {col: nesarc_subset[col].value_counts(dropna=False) for col in columns_of_interest}

# Display the frequency distributions

# Pair each readable name with its frequency table (both follow the order of columns_of_interest)
for name, (col, freq_dist) in zip(colnames, frequency_distributions.items()):
    print(f'Frequency distribution for {name}:')
    print(freq_dist)
    print('\n')

# Drop rows with NaN values to avoid issues in plotting and correlation calculation

nesarc_subset_clean = nesarc_subset.dropna()

# Plotting associations between the variables

# Pair plot to see relationships between variables

sns.pairplot(nesarc_subset_clean)

plt.suptitle("Pair Plot of Selected Variables", y=1.02)

plt.show()

# Heatmap to show correlation between the variables

corr_matrix = nesarc_subset_clean.corr()

plt.figure(figsize=(10, 8))

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)

plt.title("Correlation Heatmap")

plt.show()

# Individual scatter plots for specific pairs of variables

plt.figure(figsize=(14, 8))

plt.subplot(2, 2, 1)

sns.scatterplot(x=nesarc_subset_clean['S3AQ2A1'], y=nesarc_subset_clean['S3AQ3C1'])

plt.xlabel(colnames[1])

plt.ylabel(colnames[2])

plt.title(f'{colnames[1]} vs {colnames[2]}')

plt.subplot(2, 2, 2)

sns.scatterplot(x=nesarc_subset_clean['S3AQ2A1'], y=nesarc_subset_clean['S3AQ3B1'])

plt.xlabel(colnames[1])

plt.ylabel(colnames[3])

plt.title(f'{colnames[1]} vs {colnames[3]}')

plt.subplot(2, 2, 3)

sns.scatterplot(x=nesarc_subset_clean['S3AQ3B1'], y=nesarc_subset_clean['S3AQ3C1'])

plt.xlabel(colnames[3])

plt.ylabel(colnames[2])

plt.title(f'{colnames[3]} vs {colnames[2]}')

plt.subplot(2, 2, 4)

sns.scatterplot(x=nesarc_subset_clean['SMOKER'], y=nesarc_subset_clean['S3AQ91'])

plt.xlabel(colnames[0])

plt.ylabel(colnames[4])

plt.title(f'{colnames[0]} vs {colnames[4]}')

plt.tight_layout()

plt.show()

Plots:

From the graphs we can tell that:

A younger AGE WHEN SMOKED FIRST FULL CIGARETTE tends to go with a higher USUAL QUANTITY WHEN SMOKED CIGARETTES.

A younger AGE WHEN SMOKED FIRST FULL CIGARETTE tends to go with a higher USUAL FREQUENCY WHEN SMOKED CIGARETTES.

In summary, there seems to be a strong association between the age at which the first cigarette was smoked and both smoking quantity and smoking frequency. The plots show an association only; they do not show that starting younger causes heavier smoking.
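One way to put a number on an association like this is a rank correlation. The sketch below is only an illustration: the values are made up (they are not NESARC records), and treating 99 as an "unknown" code for S3AQ2A1 is an assumption that should be checked against the codebook.

```python
import pandas as pd

# Made-up values for illustration only; these are NOT NESARC records.
toy = pd.DataFrame({
    "S3AQ2A1": [12, 14, 15, 16, 18, 21, 25, 99],  # age at first cigarette
    "S3AQ3C1": [30, 20, 20, 15, 10, 10, 5, 20],   # usual cigarettes per day
})

# Assumption: 99 is an "unknown" code rather than a real age, so drop it first.
known = toy[toy["S3AQ2A1"] != 99]

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rho = known["S3AQ2A1"].rank().corr(known["S3AQ3C1"].rank())

print(f"Spearman rho = {rho:.2f}")
```

A value close to -1 would support the reading that a younger starting age goes with a higher quantity; a value near 0 would not. Leaving placeholder codes such as 99 in the data would distort the result, which is why they are filtered out before ranking.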

freeavenuefox · 5 years ago

Data Management: Mars Craters

The following Python program arranges the data in a more logical and readable form.

The result of the program is shown below.

The steps performed in the program are:

Subset the data to craters with a diameter of 50 km or more

Assign a category to each crater based on its size and latitudinal position

Create another data set containing only the required data
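The three steps above can be sketched in pandas roughly as follows. This is not the original program from the post. The rows are made up for illustration, and the bin edges and category labels are assumptions; only the column names DIAM_CIRCLE_IMAGE and LATITUDE_CIRCLE_IMAGE and the new variable names Latitude_range and Diameter_of_craters come from the data set and the post.

```python
import pandas as pd

# Made-up rows for illustration; real data would be read from the Mars craters file.
craters = pd.DataFrame({
    "CRATER_ID": ["a", "b", "c", "d", "e"],
    "LATITUDE_CIRCLE_IMAGE": [-62.0, -10.5, 5.2, 48.9, 71.3],
    "DIAM_CIRCLE_IMAGE": [12.0, 55.0, 130.0, 75.5, 260.0],
})

# Step 1: keep only craters with a diameter of 50 km or more.
large = craters[craters["DIAM_CIRCLE_IMAGE"] >= 50].copy()

# Step 2: assign categories (bin edges and labels are assumptions).
large["Diameter_of_craters"] = pd.cut(
    large["DIAM_CIRCLE_IMAGE"],
    bins=[50, 100, 200, float("inf")],
    labels=["50-100 km", "100-200 km", "200+ km"],
    right=False,  # intervals are [50, 100), [100, 200), [200, inf)
)
large["Latitude_range"] = pd.cut(
    large["LATITUDE_CIRCLE_IMAGE"],
    bins=[-90, -30, 30, 90],
    labels=["South (-90 to -30)", "Equatorial (-30 to 30)", "North (30 to 90)"],
    include_lowest=True,
)

# Step 3: keep only the columns needed for the frequency table.
summary = large[["CRATER_ID", "Latitude_range", "Diameter_of_craters"]]

# Frequency table of latitude range against diameter category.
freq = pd.crosstab(summary["Latitude_range"], summary["Diameter_of_craters"])
print(freq)
```

Grouping the continuous measurements into a few labelled ranges is what keeps the resulting frequency table short enough to read.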

The inferences from the program are:

Subsetting the data improves readability

Assigning logically understandable values to the data helps with data management

Assigning categories to the data made the frequency table on Latitude_range and Diameter_of_craters easier to read

#mars #craters #data_management #python #frequency_distribution

dipayan-banerjee · 8 years ago

Frequency Distribution for Marscrater_pds & Insights

SAS Code:

/*Library referencing the dataset*/

libname mydata '/courses/d1406ae5ba27fe300' access=readonly;

/*using the dataset marscrater_pds*/

data new;

set mydata.marscrater_pds;

/* Labelling the variables of the dataset*/

label NUMBER_LAYERS= 'Layers'

DIAM_CIRCLE_IMAGE= 'Diameter'

DEPTH_RIMFLOOR_TOPOG= 'Depth';

run;

/*sorting the data according to the NUMBER_LAYERS*/

proc sort;

by NUMBER_LAYERS;

run;

/*frequency distribution tables for the 3 variables*/

proc freq;

tables NUMBER_LAYERS DIAM_CIRCLE_IMAGE DEPTH_RIMFLOOR_TOPOG;

run;

Variables Considered:

NUMBER_LAYERS

DIAM_CIRCLE_IMAGE

DEPTH_RIMFLOOR_TOPOG

Frequency Distribution Insights & Output

NUMBER_LAYERS:

The frequency is highest for craters whose NUMBER_LAYERS is 0.

The frequency drops abruptly as NUMBER_LAYERS increases from 0 to 4.

DEPTH_RIMFLOOR_TOPOG:

DEPTH_RIMFLOOR_TOPOG is less than 0 for a few observations. A crater's depth cannot really be negative, so these values are misleading and should be treated with caution.

The frequency of DEPTH_RIMFLOOR_TOPOG decreases abruptly as the depth increases from 0 to 4, and it is highest for depths between 0 and 1.

DIAM_CIRCLE_IMAGE:

The frequency decreases gradually as the diameter of the crater increases.
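Because depth and diameter are continuous, a frequency table is easier to read once the values are grouped into ranges, and negative depths are worth counting separately. The sketch below does this in pandas rather than SAS. The depth values are made up (they are not taken from marscrater_pds), and the bin edges 0, 1, 2, 3, 4 are an assumption chosen to match the ranges discussed above.

```python
import pandas as pd

# Made-up depth values for illustration only.
depth = pd.Series(
    [-0.2, 0.0, 0.1, 0.4, 0.8, 0.9, 1.3, 2.1, 3.7],
    name="DEPTH_RIMFLOOR_TOPOG",
)

# Count the physically implausible negative depths before binning.
n_negative = int((depth < 0).sum())

# Group the remaining depths into ranges [0, 1), [1, 2), [2, 3), [3, 4).
valid = depth[depth >= 0]
binned = pd.cut(valid, bins=[0, 1, 2, 3, 4], right=False)

# Frequency per range, listed in order of increasing depth.
freq = binned.value_counts().sort_index()

print("negative depths:", n_negative)
print(freq)
```

The same approach would apply to DIAM_CIRCLE_IMAGE with bin edges suited to the diameter range.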

chaolinchang · 11 months ago

Analyzing Tobacco Use Patterns and Experiences

(Running Your First Program, Week 2)

Code:

import pandas as pd

# NESARC dataset
data = pd.read_csv('/Users/ccl/Documents/Coursera/Data Management and Visualization/datasets/nesarc_pds.csv', low_memory=False)

print(data.head())

# Select the relevant columns for nicotine dependence and smoking
columns_of_interest = [
    'SMOKER',   # TOBACCO USE STATUS
    'S3AQ2A1',  # Age when first smoked a cigarette
    'S3AQ3C1',  # Usual quantity when smoked cigarettes
    'S3AQ3B1',  # USUAL FREQUENCY WHEN SMOKED CIGARETTES
    'S3AQ91',   # SMOKED CIGARETTES WHEN HAD SOME OF THESE EXPERIENCES WITH TOBACCO IN LAST 12 MONTHS
]

colnames = [
    "TOBACCO USE STATUS",
    "AGE WHEN SMOKED FIRST FULL CIGARETTE",
    "USUAL QUANTITY WHEN SMOKED CIGARETTES",
    "USUAL FREQUENCY WHEN SMOKED CIGARETTES",
    "SMOKED CIGARETTES WHEN HAD SOME OF THESE EXPERIENCES WITH TOBACCO IN LAST 12 MONTHS",
]

# Extract the relevant columns
nesarc_subset = data[columns_of_interest]

# Display basic information about the subset
print(nesarc_subset.info())

# Display the first few rows of the subset
print(nesarc_subset.head())

'''
# setting variables you will be working with to numeric
data['SMOKER'] = pd.to_numeric(data['SMOKER'])
data['S3AQ2A1'] = pd.to_numeric(data['S3AQ2A1'])
data['S3AQ3C1'] = pd.to_numeric(data['S3AQ3C1'])
data['S3AQ3B1'] = pd.to_numeric(data['S3AQ3B1'])
data['S3AQ91'] = pd.to_numeric(data['S3AQ91'])
'''

# Calculate the frequency distribution for each variable
frequency_distributions = {col: nesarc_subset[col].value_counts(dropna=False) for col in columns_of_interest}

# Display the frequency distributions
namesIndex = 0
for col, freq_dist in frequency_distributions.items():
    print('Frequency distribution for:', colnames[namesIndex])
    #print(f'Frequency distribution for {col}:')
    print(freq_dist)
    print('\n')
    namesIndex += 1

Outputs:

Frequency distribution for: TOBACCO USE STATUS
SMOKER
3    23901
1    11118
2     8074
Name: count, dtype: int64

Frequency distribution for: USUAL FREQUENCY WHEN SMOKED CIGARETTES
S3AQ3B1
NaN    25080
1.0    14836
6.0      772
4.0      747
3.0      687
2.0      460
5.0      409
9.0      102
Name: count, dtype: int64

Frequency distribution for: SMOKED CIGARETTES WHEN HAD SOME OF THESE EXPERIENCES WITH TOBACCO IN LAST 12 MONTHS
S3AQ91
     38081
1     4743
2      269
Name: count, dtype: int64

Frequency Distributions:

Tobacco Use Status (SMOKER):

Lifetime non-user (code 3): 23,901 occurrences

Current user (code 1): 11,118 occurrences

Former user (code 2): 8,074 occurrences

Analysis: Reading the codes as in the NESARC codebook (1 = current user, 2 = ex-user, 3 = lifetime non-user), the majority of participants have never used tobacco, followed by current users, with former users forming the smallest group.

Usual Frequency When Smoked Cigarettes (S3AQ3B1):

Every day (code 1): 14,836 occurrences

Once a month or less (code 6): 772 occurrences

1 to 2 days a week (code 4): 747 occurrences

3 to 4 days a week (code 3): 687 occurrences

5 to 6 days a week (code 2): 460 occurrences

2 to 3 days a month (code 5): 409 occurrences

Unknown (code 9): 102 occurrences

Missing (NaN): 25,080 occurrences

Analysis: Among respondents with a recorded answer, the most common smoking frequency by far is daily smoking. There is also a substantial number of missing values (25,080). These are likely respondents to whom the question did not apply, for example people who never smoked, rather than people who refused to answer; this should be checked against the codebook.
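Raw counts are easier to compare as percentages, and the picture changes depending on whether the missing values are included in the denominator. The sketch below reuses the S3AQ3B1 counts printed in the output above, keyed by their numeric codes; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the printed S3AQ3B1 frequency distribution, keyed by code.
counts = pd.Series(
    {1: 14836, 6: 772, 4: 747, 3: 687, 2: 460, 5: 409, 9: 102},
    name="S3AQ3B1",
)
n_missing = 25080  # the NaN row in the output

total = counts.sum() + n_missing

# Share of every respondent in the data set, missing values included.
pct_of_all = (counts / total * 100).round(1)

# Share of respondents who have a recorded answer.
pct_of_valid = (counts / counts.sum() * 100).round(1)

pct_missing = round(n_missing / total * 100, 1)

print(pct_of_all)
print(pct_of_valid)
print("missing (%):", pct_missing)
```

With a real column, `value_counts(normalize=True)` gives the shares directly, and the `dropna` argument controls whether missing values count toward the total.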

Smoked Cigarettes When Had Some of These Experiences with Tobacco in Last 12 Months (S3AQ91):

Yes (code 1): 4,743 occurrences

No (code 2): 269 occurrences

Blank: 38,081 occurrences

Analysis: Most records (38,081) are blank for this item, which likely means the question did not apply to those respondents. Among those who answered, most reported smoking cigarettes when they had these tobacco-related experiences in the last 12 months, and very few reported not smoking.

By examining these frequency distributions, we gain insights into the smoking behaviors and patterns among the participants. This information is crucial for understanding nicotine dependence and its association with various factors such as age and smoking habits.
