data-analysis-and-interprtn
Data analysis and Interpretation
20 posts
data-analysis-and-interprtn · 2 years ago
Running a k-Means Cluster Analysis
A k-means cluster analysis was conducted to identify underlying subgroups of countries based on their similarity of responses on 7 variables that represent characteristics that could have an impact on internet use rates. Clustering variables included quantitative variables measuring income per person, employment rate, female employment rate, polity score, alcohol consumption, life expectancy, and urban rate. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Because the GapMinder dataset I am using is relatively small (N < 250), I did not split the data into training and test sets. A series of k-means cluster analyses was conducted specifying k = 1-9 clusters, using Euclidean distance. The proportion of variance in the clustering variables accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions as an elbow curve, to provide guidance for choosing the number of clusters to interpret.
Load the data, set the variables to numeric, and clean the data of NA values
Subset the clustering variables
Standardize the clustering variables to have mean = 0 and standard deviation = 1
Split the data into train and test sets (skipped in this analysis, since the dataset is small)
Perform k-means cluster analysis for 1-9 clusters
Plot the average distance of observations from their cluster centroid and use the Elbow Method to identify the number of clusters to choose
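Since the code above was posted as screenshots, here is a minimal sketch of the same steps in Python; the file name and column names are assumptions, not the exact code used.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv('gapminder.csv')  # hypothetical file name

# Subset and clean the clustering variables (assumed column names)
cluster_vars = ['incomeperperson', 'employrate', 'femaleemployrate', 'polityscore',
                'alcconsumption', 'lifeexpectancy', 'urbanrate']
clus = data[cluster_vars].apply(pd.to_numeric, errors='coerce').dropna()

# Standardize to mean = 0 and standard deviation = 1
clus_std = preprocessing.scale(clus)

# k-means for k = 1..9, recording the average distance to the nearest centroid
mean_dist = []
for k in range(1, 10):
    model = KMeans(n_clusters=k).fit(clus_std)
    dists = cdist(clus_std, model.cluster_centers_, 'euclidean')
    mean_dist.append(np.min(dists, axis=1).mean())

# Elbow curve
plt.plot(range(1, 10), mean_dist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance to centroid')
plt.title('Elbow Method for selecting k')
plt.show()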
data-analysis-and-interprtn · 2 years ago
Running A Lasso Regression Model
LASSO regression stands for Least Absolute Shrinkage and Selection Operator. The algorithm is another variation of linear regression, just like ridge regression. We use lasso regression when we have a large number of predictor variables. Lasso regression is a parsimonious model that performs L1 regularization: it adds a penalty equivalent to the sum of the absolute values of the regression coefficients and tries to minimize them. The lasso objective is similar to that of ridge regression and is given below.
LS Obj + λ (sum of the absolute values of the coefficients)
Here λ controls the penalty strength: if λ = 0, we get the same coefficients as linear regression; if λ is very large, all coefficients are shrunk towards zero.
The two models, lasso and ridge regression, are closely related. However, in lasso the coefficients responsible for large variance can be set exactly to zero, whereas in ridge regression coefficients are only shrunk and never made zero. Lasso regression is therefore also used for variable selection, as the model forces the coefficients of some variables to shrink towards zero. The following diagram is a visual comparison of OLS and lasso regression.
[Diagram: visual comparison of OLS and lasso regression]
TRAINING THE LASSO REGRESSION MODEL
Training the lasso regression model is exactly the same as training the ridge regression model: we need to identify the optimal lambda value and then use that value to train the model. To achieve this, we can use the same glmnet function and pass alpha = 1. When we pass alpha = 0, glmnet() runs a ridge regression, and when we pass alpha = 0.5, glmnet runs elastic net, a combination of ridge and lasso regression. The steps are:
- Use the cv.glmnet() function to identify the optimal lambda value
- Extract the best lambda and best model
- Rebuild the model using the glmnet() function
- Use the predict function to predict values on future data
For this example, we will use the swiss dataset to predict fertility based upon socioeconomic indicators for the year 1888.
# Loading the library
library(glmnet)

# Loading the data
data(swiss)

x_vars <- model.matrix(Fertility~. , swiss)[,-1]
y_var <- swiss$Fertility
lambda_seq <- 10^seq(2, -2, by = -.1)
# Splitting the data into test and train
set.seed(86)
train = sample(1:nrow(x_vars), nrow(x_vars)/2)
x_test = (-train)  # negative indices: select the rows not in the training set
y_test = y_var[x_test]

cv_output <- cv.glmnet(x_vars[train,], y_var[train], alpha = 1, lambda = lambda_seq, nfolds = 5)
# Identifying the best lambda
best_lam <- cv_output$lambda.min
best_lam
Output
[1] 0.3981072
Using this value, let us train the lasso model again.
# Rebuilding the model with the best lambda value identified
lasso_best <- glmnet(x_vars[train,], y_var[train], alpha = 1, lambda = best_lam)
pred <- predict(lasso_best, s = best_lam, newx = x_vars[x_test,])
Finally, we combine the predicted values and actual values to see the two side by side; you can then use the R-squared formula to check the model performance. Note: you must calculate the R-squared values for both the train and test datasets.
final <- cbind(y_var[x_test], pred)
# Checking the first six obs
head(final)
Output
              Actual     Pred
Courtelary      80.2 66.54744
Delemont        83.1 76.92662
Franches-Mnt    92.5 81.01839
Moutier         85.8 72.23535
Neuveville      76.9 61.02462
Broye           83.8 79.25439

SHARING THE R-SQUARED FORMULA
The function provided below is just indicative; you must supply the actual and predicted values based upon your dataset.
actual <- test$actual
preds <- test$predicted
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq

GETTING THE LIST OF IMPORTANT VARIABLES
To get the list of important variables, we just need to investigate the beta coefficients of the final best model.
# Inspecting beta coefficients
coef(lasso_best)
Output
6 x 1 sparse Matrix of class "dgCMatrix"
                         s0
(Intercept)      66.5365304
Agriculture      -0.0489183
Examination       .
Education        -0.9523625
Catholic          0.1188127
Infant.Mortality  0.4994369
The model indicates that the coefficient of Examination has been shrunk to zero, while Agriculture, Education, Catholic, and Infant.Mortality are retained with non-zero coefficients. In this chapter, we learned how to build a lasso regression using the same glmnet package that we used for ridge regression, and we saw what the difference between ridge and lasso is. In the next chapter, we will discuss how to predict a dichotomous variable using logistic regression.
data-analysis-and-interprtn · 2 years ago
Running Random Forest
# Run a Random Forest
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the model to the training data
rf_model.fit(X_train, y_train) # X_train contains explanatory variables, y_train contains the response variable
from sklearn.metrics import accuracy_score, classification_report
# Make predictions on the testing data
y_pred = rf_model.predict(X_test) # X_test contains the explanatory variables for the testing set
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)
import matplotlib.pyplot as plt
# Get feature importances
feature_importances = rf_model.feature_importances_
feature_names = X_train.columns  # assumes X_train is a pandas DataFrame; labels the bars

# Plot feature importances
plt.barh(range(len(feature_importances)), feature_importances, tick_label=feature_names)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest Feature Importance')
plt.show()
data-analysis-and-interprtn · 2 years ago
Running a Classification Tree
[Image: classification tree output]
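The output above was posted only as an image; as a companion, here is a minimal sketch of running a classification tree with scikit-learn, assuming train/test splits named as in the random forest post above (X_train, X_test, y_train, y_test, with X_train a pandas DataFrame).

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Fit a classification tree on the training data
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

# Evaluate on the test data
y_pred = tree_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Text rendering of the fitted tree (assumes X_train is a pandas DataFrame)
print(export_text(tree_model, feature_names=list(X_train.columns)))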
data-analysis-and-interprtn · 2 years ago
Testing a Logistic Regression Model
Output of the Logistic Regression Model
[Images: logistic regression model output]
data-analysis-and-interprtn · 2 years ago
Testing a Logistic Regression Model
Whether or not there was evidence of confounding
Based on the analysis, there does not appear to be strong evidence of confounding in the association between the primary explanatory variable (income per person) and the response variable (life expectancy). Here's why:
Introduction of HIV Rate as a Second Explanatory Variable: I mentioned that when I introduced HIV rate as a second explanatory variable, it did not change the statistics of income per person much. This observation suggests that HIV rate is not acting as a significant confounding variable in the relationship between income per person and life expectancy.
Consistency of Results: The p-value for income per person remained very low (0.000) after introducing HIV rate into the analysis. This indicates that the association between income per person and life expectancy remained highly significant even in the presence of the second variable. If HIV rate were a strong confounder, it could have substantially altered the significance of the income per person-life expectancy relationship.
Odds Ratios: The odds ratio for income per person remained close to its original value (1.000518) when HIV rate was added to the model. If there were confounding, you might have expected a more substantial change in the odds ratio for income per person.
Independence of Effects: The odds ratio for HIV rate (0.313940) is substantially different from 1 and is statistically significant (p-value of 0.007), indicating that HIV rate has an independent effect on life expectancy. This suggests that HIV rate is not merely a confounder but a relevant explanatory variable itself.
There is no strong evidence of confounding between income per person and life expectancy. It appears that the relationship between income per person and life expectancy is robust and remains significant even when accounting for the influence of HIV rate. However, it's essential to consider that confounding can be a complex issue, and additional factors not included in the analysis could still potentially confound the relationship.
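For reference, here is a minimal sketch of how a model like this can be fit and its odds ratios extracted with statsmodels; the binary response indicator and the variable names are assumptions.

import numpy as np
import statsmodels.formula.api as smf

# Logistic regression of a binary life-expectancy indicator on two predictors
# (assumed DataFrame `data` and column names)
model = smf.logit('lifeexpectancy_high ~ incomeperperson + hivrate', data=data).fit()
print(model.summary())

# Odds ratios with 95% confidence intervals
conf = model.conf_int()
conf['OR'] = model.params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(np.exp(conf))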
data-analysis-and-interprtn · 2 years ago
Testing a Logistic Regression Model
Whether the Results are supporting the Hypothesis
Based on the analysis, it appears that the results support the hypothesis regarding the association between the primary explanatory variable (income per person) and the response variable (life expectancy). Here's a summary:
Income per Person: I found that as income per person increases, life expectancy also increases. The odds ratio for income per person is slightly greater than 1 (1.000518), indicating a positive association. Although the odds ratio is close to 1, it is statistically significant with a very low p-value (0.000), and the 95% confidence interval is narrow, suggesting a reliable relationship.
HIV Rate: I introduced HIV rate as a second explanatory variable and found that it is negatively associated with life expectancy: lower HIV rates are associated with higher life expectancy. The odds ratio for HIV rate is less than 1 (0.313940), and it is statistically significant with a p-value of 0.007. The confidence intervals for both income per person and HIV rate are narrow, indicating the reliability of these results.
The results support the hypothesis that increasing income per person is associated with higher life expectancy, and decreasing the HIV rate is also associated with higher life expectancy. These findings are consistent with the expectations, as higher income can lead to better access to healthcare and improved living conditions, while a lower HIV rate indicates better public health outcomes, both of which contribute to increased life expectancy.
data-analysis-and-interprtn · 2 years ago
Testing a Logistic Regression Model
Summary of the Findings: The analysis describes a logistic regression model that examines the relationship between life expectancy (categorized as low or high) and two explanatory variables: income per person and HIV rate. Here are the key findings and interpretations:
Life Expectancy vs. Income per Person:
The initial analysis shows that the relationship between life expectancy and income per person is highly significant, with a p-value of 0.000.
The odds ratio for income per person is slightly above 1 (1.000518), indicating that as income per person increases, the odds of having high life expectancy also increase. Because income is measured in single currency units, the per-unit odds ratio is necessarily close to 1, so the per-unit effect looks small even though the association is highly significant.
The narrow 95% confidence interval for the odds ratio suggests that the estimate is precise.
Adding HIV Rate as an Explanatory Variable:
The introduction of HIV rate as a second explanatory variable did not significantly change the statistics for income per person. This suggests that HIV rate is not a confounding variable.
With both income per person and HIV rate included in the model, the p-value for income per person remained highly significant (0.000), and the p-value for HIV rate was 0.007, indicating that both variables are statistically significant.
The odds ratio for income per person (1.000496) suggests that increasing income per person leads to higher life expectancy, and the odds ratio for HIV rate (0.313940) suggests that decreasing HIV rates also lead to higher life expectancy.
The narrow confidence intervals for both odds ratios indicate precise estimates.
Interpretation:
These findings align with previous analysis, indicating that increasing income per person is associated with higher life expectancy, while increasing the HIV rate is associated with lower life expectancy.
The logistic regression model provides a way to quantify these associations and assess their significance.
The odds ratios indicate the change in the odds of having high life expectancy for a one-unit change in each explanatory variable, holding other variables constant.
data-analysis-and-interprtn · 2 years ago
Testing a Regression Model (3)
Regression diagnostic plots are essential for assessing the assumptions and performance of a regression model. They help us understand the distribution of residuals, model fit, influential observations, and the presence of outliers. Without access to the specific data and software output, I describe below the typical diagnostic plots and what they might reveal (a code sketch for producing them follows the list):
Residual vs. Fitted Values Plot:
This plot examines the relationship between the residuals (the differences between observed and predicted values) and the fitted values (the values predicted by the model).
What to look for:
Homoscedasticity: If the points are randomly scattered around a horizontal line, it suggests that the assumption of constant variance (homoscedasticity) is met.
Heteroscedasticity: If the points form a funnel shape or show a clear pattern, it indicates heteroscedasticity, suggesting that the variance of residuals is not constant across the range of fitted values.
Normal Q-Q Plot:
This plot assesses whether the residuals follow a normal distribution.
What to look for:
If the points closely follow a straight diagonal line, it suggests that the residuals are approximately normally distributed.
Deviations from the line at the tails indicate departures from normality.
Residual vs. Predictor Plot:
These plots show the relationship between individual predictors and the corresponding residuals.
What to look for:
Outliers: Points far from the horizontal line may represent influential observations or outliers.
Patterns: Systematic patterns in the plot may indicate model misspecification.
Leverage-Residuals Plot:
This plot displays leverage (influence) values against standardized residuals.
What to look for:
Outliers: Points with high leverage and large residuals are influential observations that can significantly affect the regression coefficients.
Influential Observations: Observations that exert considerable influence on the model's fit are identified.
Cook's Distance Plot:
Cook's distance measures the influence of each observation on the regression coefficients.
What to look for:
Observations with high Cook's distances are influential; they may need further examination for potential outliers or influential data points.
Partial Regression Plots:
These plots show the relationship between a single predictor and the response variable while controlling for the effects of other predictors.
What to look for:
Patterns in the partial regression plots can help identify influential observations or outliers.
Outlier Detection Plot:
This plot specifically highlights potential outliers.
What to look for:
Data points that fall far from the cluster of other points may be outliers.
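As a companion to the list above, here is a minimal sketch of producing several of these plots with statsmodels, assuming a fitted OLS results object named `results` and an illustrative predictor name.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Normal Q-Q plot of the residuals
fig1 = sm.qqplot(results.resid, line='r')

# Standardized residuals vs. observation number
stdres = pd.DataFrame(results.resid_pearson)
plt.figure()
plt.plot(stdres, 'o', ls='None')
plt.axhline(y=0, color='r')
plt.xlabel('Observation number')
plt.ylabel('Standardized residual')

# Leverage vs. studentized residuals; point size reflects Cook's distance
fig2 = sm.graphics.influence_plot(results, size=8)

# Partial regression plots for one predictor (assumed variable name)
fig3 = sm.graphics.plot_regress_exog(results, 'incomeperperson')
plt.show()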
data-analysis-and-interprtn · 2 years ago
Testing a Regression Model (2)
The analysis suggests that there is evidence of confounding when considering the association between the primary explanatory variable (income per person) and the response variable (life expectancy). Here's why:
Introduction of an Additional Variable (Internet Use Rate): When the variable "internet use rate" was added to the analysis, it exhibited a p-value that fell outside the range of significance. This suggests that "internet use rate" did not have a significant independent effect on life expectancy.
Impact on Other Variables: Notably, the inclusion of "internet use rate" as an additional variable led to several other variables having higher (but still significant) p-values. This indicates that the introduction of "internet use rate" affected the relationships between the primary explanatory variable (income per person) and other variables in the model.
Confounding Explanation: Confounding occurs when an additional variable (in this case, "internet use rate") is associated with both the primary explanatory variable and the response variable. It can distort the true relationship between the primary explanatory variable and the response variable. In this analysis, it appears that "internet use rate" is associated with both income per person and life expectancy, leading to confounding.
Interpretation of Confounding: The presence of confounding implies that the observed relationship between income per person and life expectancy may be influenced by "internet use rate." This means that the effect of income per person on life expectancy might be overestimated or underestimated when "internet use rate" is not considered as a confounding variable.
Addressing Confounding: To address confounding, it is essential to either control for the confounding variable (e.g., by including it as a covariate in the analysis) or conduct further analyses to understand the relationship between the confounding variable and the primary explanatory and response variables. Failing to account for confounding can lead to biased or incorrect conclusions.
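As a concrete illustration of the covariate approach, here is a minimal sketch with statsmodels, assuming a DataFrame `data` with the variable names used on this blog.

import statsmodels.formula.api as smf

# Model without and with the potential confounder
crude = smf.ols('lifeexpectancy ~ incomeperperson', data=data).fit()
adjusted = smf.ols('lifeexpectancy ~ incomeperperson + internetuserate', data=data).fit()

print('Crude coefficient:   ', crude.params['incomeperperson'])
print('Adjusted coefficient:', adjusted.params['incomeperperson'])
# A substantial shift in the coefficient after adjustment suggests confounding.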
data-analysis-and-interprtn · 2 years ago
Testing a Multiple Regression Model (1)
Correlations and Coefficients: The analysis identifies significant correlations between life expectancy and three variables: income per person, employment rate, and HIV rate, all with very low p-values (p-value = 0.000). Income per person shows a slight positive correlation with life expectancy, with a coefficient of 0.0005. Employment rate is negatively correlated with life expectancy, with a coefficient of -0.2952. HIV rate is strongly negatively associated with life expectancy, with a coefficient of -1.1005. These findings support the hypothesis that life expectancy can be predicted from these variables.
Impact of Additional Variable (Internet Use Rate): The introduction of the variable "internet use rate" into the analysis results in a p-value that falls outside the range of significance. Additionally, it causes several other variables to have higher (but still significant) p-values. This suggests that "internet use rate" may be a confounding variable that is associated with the others but does not provide new information when included in the analysis.
Choice of Model: The analysis suggests that a curved line (rather than a linear one) is a better fit for the relationship between income per person and life expectancy. While a 2-degree polynomial was considered, the data appears to align better with a logarithmic curve. This is supported by the Q-Q plots, where the actual data deviates from predicted values, especially at the extremes and in the middle (a code sketch comparing the two fits appears at the end of this post).
Residual Analysis: The plot of residuals reveals one data point that falls outside of -3 standard deviations of the mean, indicating an extreme outlier. Many other points fall within the range of -2 to -3 standard deviations. The poor fit of the polynomial line compared to a logarithmic line is considered a possible reason for this pattern.
Influence Analysis: Two alarming points (labeled 111 and 57 in the influence plot) are identified as extreme outliers in terms of both residual value and influence. These points are also prominent in the plot of residuals versus income per person and in the partial regression plot. They need to be examined and potentially excluded from further analysis.
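Here is the sketch referenced above comparing the two candidate fits, assuming a DataFrame `data` with the variable names used in this post.

import numpy as np
import statsmodels.formula.api as smf

# 2-degree polynomial fit vs. logarithmic fit of life expectancy on income
poly = smf.ols('lifeexpectancy ~ incomeperperson + I(incomeperperson**2)', data=data).fit()
logfit = smf.ols('lifeexpectancy ~ np.log(incomeperperson)', data=data).fit()

print('Quadratic R-squared:  ', poly.rsquared)
print('Logarithmic R-squared:', logfit.rsquared)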
data-analysis-and-interprtn · 2 years ago
Testing Multiple Regression Model
[Images: regression output tables, Q-Q plot, residual plots, partial regression plots, and influence plot]
The analysis shows that life expectancy is significantly correlated with incomeperperson (p-value = 0.000), employrate (p-value = 0.000) and hivrate (p-value = 0.000). With a coefficient of 0.0005, incomeperperson is slightly positively correlated with lifeexpectancy, while employrate is negatively correlated with a coefficient of -0.2952 and hivrate is strongly negatively associated with a coefficient of -1.1005. This supports my hypothesis that lifeexpectancy can be predicted from incomeperperson, employrate and hivrate.

When I added internetuserate to my analysis, it exhibited a p-value outside the range of significance, and it also pushed several other variables into higher (but still significant) p-value ranges. This suggests that internetuserate is a confounding variable that is associated with the others, and adding it to my analysis adds no new information.

Examining the plots posted above indicates that a curved line is a much better fit for the relationship between incomeperperson and lifeexpectancy. However, I do not believe a 2-degree polynomial line is the best fit; the data appears to match a logarithmic curve better. Indeed, the Q-Q plot shows that the actual data is lower than predicted at the extremes and in the middle, which matches my theory that a logarithmic curve would be a better fit.

The plot of residuals has one data point falling outside -3 standard deviations of the mean; however, I am concerned that so many points fall within -2 to -3 deviations. I attribute this to the poor fit of the polynomial line as compared to a logarithmic one. The regression plots and the influence plot show alarming points (labeled 111 and 57 in the influence plot) that are extreme outliers in terms of both residual value and influence. These points show up again in the plot of residuals versus incomeperperson and in the partial regression plot. I must examine what these points are and possibly exclude them from the rest of my analysis.
data-analysis-and-interprtn · 2 years ago
Description of Variables
a) Measurement of Explanatory and Response Variables:
Explanatory Variables:
Commodity: This is a nominal variable that measures the name or type of commodity being traded. It serves as a descriptor of the traded goods.
Quantity Name: Another nominal variable, it describes the unit or name of the quantity being measured (e.g., liters, tons).
Category: This nominal variable categorizes the commodities into broader groups or categories.
Response Variables:
Commodity Code: A nominal variable that provides a code or identifier for the specific commodity. It serves as a unique identifier for each commodity.
Year: An interval variable that measures the year in which the trade occurred. It provides a chronological dimension to the data.
Flow: A nominal variable indicating the direction of trade (export, import, or re-export). It describes the flow of commodities between countries.
Weight in kg: A ratio variable measuring the weight of the traded commodities in kilograms. It quantifies the mass of the goods.
Quantity: Another ratio variable that quantifies the quantity of the traded commodities.
b) Response Scales:
Explanatory Variables: The explanatory variables, which include "Commodity," "Quantity Name," and "Category," are measured on a nominal scale. Nominal variables represent categories or labels without any inherent order or ranking.
Response Variables:
"Commodity Code" is measured on a nominal scale, serving as a unique identifier for commodities.
"Year" is measured on an interval scale, representing time in years. It has a meaningful order and consistent intervals between values.
"Flow" is measured on a nominal scale, representing trade direction without a specific order.
"Weight in kg" and "Quantity" are measured on a ratio scale, as they have a meaningful zero point (absence of weight or quantity) and consistent intervals. These variables can be subjected to arithmetic operations.
c) Management of Explanatory and Response Variables:
The management of explanatory and response variables would typically involve data preprocessing and analysis tasks. Here are some common steps in managing these variables:
Data Cleaning: Ensuring that all data entries are valid and consistent, with no missing values or outliers. This may involve handling cases where "No Quantity" is specified in the "Quantity Name" variable but there is a non-zero trade value.
Encoding Categorical Variables: Converting nominal categorical variables like "Commodity," "Quantity Name," "Category," and "Flow" into numerical representations (e.g., one-hot encoding or label encoding) if needed for modeling purposes (a small sketch follows this list).
Time Series Analysis: Given that "Year" is an interval variable representing time, time series analysis techniques can be applied to explore trends, seasonality, and patterns over the years.
Exploratory Data Analysis (EDA): Conducting EDA to understand the distribution, relationships, and potential correlations between explanatory and response variables. Visualization techniques can be helpful in this phase.
Statistical Modeling: Depending on the research objectives, statistical models, such as regression analysis, can be used to explore the relationships between explanatory and response variables.
Feature Engineering: Creating new features or transformations based on the explanatory variables to improve model performance or gain deeper insights.
Data Visualization: Creating visualizations to communicate findings and insights effectively, especially when presenting results to stakeholders.
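For the encoding step in particular, here is a minimal sketch with pandas; the file name and column names are assumptions.

import pandas as pd

trade = pd.read_csv('commodity_trade_statistics.csv')  # hypothetical file name
encoded = pd.get_dummies(trade, columns=['Flow', 'Category'])  # one-hot encoding
print(encoded.head())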
data-analysis-and-interprtn · 2 years ago
Description of the procedures used for data collection
a) Study Design The data appears to have been generated through data reporting and observation. The United Nations Statistics Department (UNSD) collected the data by gathering national accounts from member states. It seems to be an observational study since the data spans across different countries and years without any indication of controlled experiments or interventions.
b) Original Purpose The original purpose of collecting this data was likely for the analysis of global commodity trade statistics. It serves various applications, such as studying the flow of commodities in the market, assessing the economic health of nations, and providing insights into international trade dynamics.
c) Data Collection The data collection process involved the UNSD collecting national accounts data from member states. It's common for such data to be reported by countries through customs records, trade agencies, or other relevant authorities. The data includes information on commodity codes, commodity names, trade flows (exports, imports, or re-exports), trade values in USD, weights in kilograms, quantity names, quantities, and categories. The collection process likely required collaboration between international organizations and national authorities to compile a comprehensive dataset.
d) Data Collection Period The data collection period spans from 1962 to 2022. This suggests that data was collected over several decades, providing a long-term perspective on global commodity trade trends.
e) Data Collection Location The data was collected from every United Nations (UN) member state. Therefore, it was collected globally, covering countries from around the world. The dataset represents a wide geographic range, allowing for a comprehensive analysis of international trade patterns.
data-analysis-and-interprtn · 2 years ago
Description of Sample
a) Study Population: The dataset sample to be studied contains global commodity trade statistics. It covers the imports and exports of many countries of the world, with more than 1 million observed commodity trades of all types. The level of analysis is aggregate, since the observed data spans the whole globe.
b) Data Collection: All the data was collected by the United Nations Statistics Department (UNSD) by gathering national accounts from member states. The data collected by UNSD can have many applications: it can help analyze the market flow of commodities, the economic health of a nation, and much more. Data was collected from every UN member state between 1962 and 2022.
c) Variables
The dataset contains the following variables:
Year - Interval
Commodity Code - Nominal
Commodity (name of commodity) - Nominal
Flow (Export, Import or Re-Export) - Nominal
Trade value in USD - Ratio
Weight in kg - Ratio
Quantity name - Nominal
Quantity - Ratio
Category - Nominal
Response variables are: Commodity Code, Year, Flow, Weight, Quantity
Explanatory variables are: Commodity, Quantity name, Category
No data entries have missing values, but some trades have "No Quantity" specified in the quantity_name variable while still having a non-zero trade value.
data-analysis-and-interprtn · 2 years ago
Creating Graphs For Data
The assignment was to create univariate graphs on the one hand, and bivariate graphs on the other hand.
Univariate graphs for each selected variable:
1. MAJORDEP12 and AGE
2. S4AQ4A14
3. MAJORDEP12 and S4AQ4A14
4. S4AQ54
5. MAJORDEP12 and S4AQ54
6. MAR12ABDEP
7. S4AQ6B
8. MAR12ABDEP and S4AQ6B
[Graphs were posted as images]
Bivariate graphs:
[Graphs were posted as images]
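Since the graphs were posted as images, here is a minimal sketch of how univariate and bivariate graphs like these can be produced, assuming the NESARC file name and the usual 1 = yes, 2 = no, 9 = unknown codebook coding (both are assumptions).

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('nesarc.csv', low_memory=False)  # hypothetical file name

# Univariate: counts of past-12-month major depression
sns.countplot(x='MAJORDEP12', data=data)
plt.xlabel('Major depression in last 12 months')
plt.show()

# Bivariate: proportion with trouble concentrating (S4AQ4A14) by depression status
data['S4AQ4A14'] = pd.to_numeric(data['S4AQ4A14'], errors='coerce')
data['S4AQ4A14'] = data['S4AQ4A14'].replace({2: 0, 9: np.nan})  # assumed codebook coding
sns.barplot(x='MAJORDEP12', y='S4AQ4A14', data=data, ci=None)
plt.xlabel('Major depression in last 12 months')
plt.ylabel('Proportion with trouble concentrating')
plt.show()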
Summary:
As the variables were all categorical, it was not possible to calculate centers and spreads.
The graphs showed that 20-year-olds have the highest number of major depression cases compared to the other young adults between 18 and 30 years old. Although there are some peaks, major depression among young adults seems to be fairly evenly distributed, and no clear trend is visible.
The symptom of having trouble concentrating/keeping one's mind on things among young people with major depression shows no direct association, as there are fewer people with that symptom compared to the total number of people (all ages).
However, the symptom of having trouble doing things one is supposed to do is somewhat more widespread among young people, though compared to the whole sample it is not strongly pronounced.
The graphs revealed that there is a very low number of people with any kind of cannabis diagnosis, and also a low number of young adults who had an onset of a first episode in the last 12 months.
The fewest people who had their onset of a first episode had a cannabis diagnosis.
Some had an onset during cannabis abuse/dependence, but the majority of those who consumed cannabis in the last 12 months had no onset.
data-analysis-and-interprtn · 2 years ago
Making Data Management Decisions
First, subset the data to people aged 19 to 30 who have smoked in the past 12 months, then recode missing values to NaN; note that there are 3 NaN values.
Coding Valid Data:
I use 'fillna' to replace the NaN values with 19; notice that the NaN entries no longer appear in the output.
Recode missing values for 'S3AQ3B1' and create recoded variables from 'S3AQ3B1'.
Create a secondary variable that multiplies the number of days smoked per month by the approximate number of cigarettes smoked per day.
Show counts and percentages for age.
Here I split age into 4 groups, (17, 20), (20, 22), (22, 25), (25, 29), and show counts and percentages (the frequency distribution).
Then show the frequency distribution for 'S3AQ3C1'.
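Since the code here was posted as screenshots, a consolidated sketch of the steps above follows; the file name, the CHECK321 smoking indicator, and the frequency-to-days recode are assumptions based on the usual NESARC codebook.

import numpy as np
import pandas as pd

data = pd.read_csv('nesarc.csv', low_memory=False)  # hypothetical file name
data['AGE'] = pd.to_numeric(data['AGE'], errors='coerce')

# Subset: ages 19-30 who smoked in the past 12 months (CHECK321 == 1 assumed)
sub = data[(data['AGE'] >= 19) & (data['AGE'] <= 30) & (data['CHECK321'] == 1)].copy()

# Recode unknown codes to NaN (9 and 99 assumed to mean 'unknown' in the codebook)
sub['S3AQ3B1'] = pd.to_numeric(sub['S3AQ3B1'], errors='coerce').replace(9, np.nan)
sub['S3AQ3C1'] = pd.to_numeric(sub['S3AQ3C1'], errors='coerce').replace(99, np.nan)

# Recode smoking frequency into estimated days smoked per month (assumed mapping)
freq_to_days = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub['USFREQMO'] = sub['S3AQ3B1'].map(freq_to_days)

# Secondary variable: estimated cigarettes smoked per month
sub['NUMCIGMO_EST'] = sub['USFREQMO'] * sub['S3AQ3C1']

# Split age into four groups and show the frequency distribution
sub['AGEGROUP'] = pd.cut(sub['AGE'], [17, 20, 22, 25, 29])
print(sub['AGEGROUP'].value_counts(sort=False))
print(sub['AGEGROUP'].value_counts(sort=False, normalize=True))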