evaristovillarreal
evaristovillarreal
Untitled
10 posts
Don't wanna be here? Send us removal request.
evaristovillarreal · 1 year ago
Text
Understanding Employee Satisfaction in the Tech Industry
Step 1: Describe your sample.
In this study, the sample consisted of employees from a tech company in Silicon Valley. The level of analysis was individual, with a total of 300 observations. After excluding incomplete or unreliable responses, the data analytic sample comprised 280 observations.
Step 2: Describe the procedures that were used to collect the data.
A cross-sectional survey design was used to collect data on employee satisfaction. The original purpose of the data collection was to assess employee satisfaction with their work environment and HR policies. The data were collected through an online survey emailed to all employees over a two-week period in the summer of 2024. Employees completed the survey remotely from their workplace or home.
Step 3: Describe your variables.
The explanatory variable measured in the study was the level of job satisfaction, rated on a scale of 1 to 5, where 1 represented "very dissatisfied" and 5 represented "very satisfied." The response variable was the importance employees placed on various HR policies for their job satisfaction, measured on a nominal scale.
Analysis and Implications
The study utilized regression analysis to examine the relationship between job satisfaction levels and the importance of HR policies. Additionally, content analysis was used to categorize responses regarding HR policies.
Understanding employee satisfaction in the tech industry is crucial for attracting and retaining top talent. This study provides valuable insights into the factors that influence employee satisfaction, highlighting the importance of HR policies in creating a positive work environment. Implementing policies that align with employee preferences can lead to higher job satisfaction and overall organizational success.
0 notes
evaristovillarreal · 2 years ago
Text
Examining the Moderation Effect of Exercise Intensity on the Relationship Between Diet and Weight Loss
Testing Moderator Effects in Research: An Introduction to ANOVA with an Example
"Analysis" refers to the process of examining, studying, or investigating something systematically to understand its nature, components, relationships, and conclusions. It involves breaking down a problem, situation, or set of data into smaller parts and examining each of them in detail to gain a deeper understanding or arrive at informed conclusions. Analysis is utilized across various fields, including science, research, statistics, business, economics, and more, as a fundamental tool for making informed decisions and solving problems.
Moderation analysis is a crucial statistical technique in research that helps us understand whether the relationship between two variables changes when a third variable, known as a moderator, is taken into account. In this blog post, we will explore how to perform a moderation analysis using Analysis of Variance (ANOVA). We will walk through the steps involved, provide the syntax used, and interpret the results.
Understanding Moderation: Before we dive into the analysis, let's clarify some key concepts:
Independent Variable (IV): This is the variable that we believe influences the dependent variable.
Dependent Variable (DV): This is the outcome variable we are trying to explain.
Moderator: The moderator is a third variable that affects the strength or direction of the relationship between the independent and dependent variables.
In our example, we want to determine if a moderator variable influences the relationship between an independent variable (IV) and a dependent variable (DV).
Step 1: Data Collection and Preparation Before conducting the ANOVA, we need to collect and prepare our data. Ensure that your data set includes the IV, DV, and the moderator variable.
Step 2: Run the ANOVA with a Moderator In this example, we will use ANOVA to test for moderation effects. The syntax used for ANOVA in a statistical software program (e.g., R, SPSS, or Python) would look something like this:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Create a model model = ols('DV ~ IV * Moderator', data=data).fit() # Perform ANOVA anova_table = sm.stats.anova_lm(model, typ=2)
Here, "DV" represents the dependent variable, "IV" represents the independent variable, and "Moderator" is the moderator variable. The asterisk (*) between IV and Moderator signifies that we are examining the interaction effect.
Step 3: Interpretation of Results Once you've run the ANOVA, you'll obtain an ANOVA table. This table will include F-values, p-values, and other statistics. Here's how to interpret the results:
If the p-value associated with the interaction term (IV * Moderator) is significant (typically p < 0.05), it suggests that moderation is present.
If the p-value is not significant, there is no evidence of moderation.
The F-value and degrees of freedom provide additional information about the strength of the moderation effect.
Conclusion: Moderation analysis is a valuable tool in research for exploring how the relationship between two variables changes under different conditions. By conducting an ANOVA with a moderator, we can determine whether a third variable influences this relationship. Remember to collect and prepare your data carefully, use the appropriate statistical software, and interpret the results correctly. This technique enhances the depth and richness of insights derived from your research.
CODE:
import statsmodels.api as sm from statsmodels.formula.api import ols
model = ols('DV ~ IV * Moderator', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
if anova_table.loc['IV * Moderator', 'PR(>F)'] < 0.05: print("Moderation is present.") else: print("No evidence of moderation.")
print(anova_table)
We want to investigate if the relationship between diet type and weight loss is moderated by exercise intensity.
Step 1: Data Collection and Preparation
We have collected data from 100 participants who followed either a low-carb or low-fat diet and recorded their weight loss after 3 months. We also measured their exercise intensity, categorized as low, moderate, or high.
Here's a subset of our data:
import pandas as pd
data = pd.DataFrame({ 'Diet': ['Low-Carb', 'Low-Fat', 'Low-Carb', 'Low-Fat', 'Low-Fat', 'Low-Carb', 'Low-Carb', 'Low-Fat', 'Low-Fat', 'Low-Carb'], 'Exercise_Intensity': ['Low', 'Moderate', 'High', 'Low', 'Moderate', 'High', 'Moderate', 'High', 'Low', 'Moderate'], 'Weight_Loss': [5, 3, 7, 2, 4, 6, 3, 5, 2, 4] })
Step 2: Run the ANOVA with a Moderator
Now, let's perform an ANOVA to test if exercise intensity moderates the relationship between diet and weight loss:
import statsmodels.api as sm from statsmodels.formula.api import ols
model = ols('Weight_Loss ~ Diet * Exercise_Intensity', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
Step 3: Interpretation of Results
After running the ANOVA, we get the following results:
if anova_table.loc['Diet:Exercise_Intensity', 'PR(>F)'] < 0.05: print("Moderation is present.") else: print("No evidence of moderation.")
print(anova_table)
The output will tell us whether moderation is present and provide details about the F-value, degrees of freedom, and p-value.
Interpretation:
We find that the p-value for the interaction term "Diet:Exercise_Intensity" is less than 0.05. In that case, we conclude that moderation is present, indicating that the relationship between diet type and weight loss is influenced by exercise intensity. This suggests that the impact of diet on weight loss is not the same across all levels of exercise intensity.
To visualize the moderation effect, you can create interaction plots or graphs to see how the relationship between diet and weight loss changes across different levels of exercise intensity.
0 notes
evaristovillarreal · 2 years ago
Text
Testing Moderator Effects in Research: An Introduction to ANOVA with an Example
"Analysis" refers to the process of examining, studying, or investigating something systematically to understand its nature, components, relationships, and conclusions. It involves breaking down a problem, situation, or set of data into smaller parts and examining each of them in detail to gain a deeper understanding or arrive at informed conclusions. Analysis is utilized across various fields, including science, research, statistics, business, economics, and more, as a fundamental tool for making informed decisions and solving problems.
Moderation analysis is a crucial statistical technique in research that helps us understand whether the relationship between two variables changes when a third variable, known as a moderator, is taken into account. In this blog post, we will explore how to perform a moderation analysis using Analysis of Variance (ANOVA). We will walk through the steps involved, provide the syntax used, and interpret the results.
Understanding Moderation: Before we dive into the analysis, let's clarify some key concepts:
Independent Variable (IV): This is the variable that we believe influences the dependent variable.
Dependent Variable (DV): This is the outcome variable we are trying to explain.
Moderator: The moderator is a third variable that affects the strength or direction of the relationship between the independent and dependent variables.
In our example, we want to determine if a moderator variable influences the relationship between an independent variable (IV) and a dependent variable (DV).
Step 1: Data Collection and Preparation Before conducting the ANOVA, we need to collect and prepare our data. Ensure that your data set includes the IV, DV, and the moderator variable.
Step 2: Run the ANOVA with a Moderator In this example, we will use ANOVA to test for moderation effects. The syntax used for ANOVA in a statistical software program (e.g., R, SPSS, or Python) would look something like this:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Create a model model = ols('DV ~ IV * Moderator', data=data).fit() # Perform ANOVA anova_table = sm.stats.anova_lm(model, typ=2)
Here, "DV" represents the dependent variable, "IV" represents the independent variable, and "Moderator" is the moderator variable. The asterisk (*) between IV and Moderator signifies that we are examining the interaction effect.
Step 3: Interpretation of Results Once you've run the ANOVA, you'll obtain an ANOVA table. This table will include F-values, p-values, and other statistics. Here's how to interpret the results:
If the p-value associated with the interaction term (IV * Moderator) is significant (typically p < 0.05), it suggests that moderation is present.
If the p-value is not significant, there is no evidence of moderation.
The F-value and degrees of freedom provide additional information about the strength of the moderation effect.
Conclusion: Moderation analysis is a valuable tool in research for exploring how the relationship between two variables changes under different conditions. By conducting an ANOVA with a moderator, we can determine whether a third variable influences this relationship. Remember to collect and prepare your data carefully, use the appropriate statistical software, and interpret the results correctly. This technique enhances the depth and richness of insights derived from your research.
CODE:
import statsmodels.api as sm from statsmodels.formula.api import ols
model = ols('DV ~ IV * Moderator', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
if anova_table.loc['IV * Moderator', 'PR(>F)'] < 0.05: print("Moderation is present.") else: print("No evidence of moderation.")
print(anova_table)
0 notes
evaristovillarreal · 2 years ago
Text
Generating a Correlation Coefficient: Understanding the Relationship Between Categorical Variables
The correlation coefficient is a fundamental tool in statistics that allows us to measure the relationship between two variables. Traditionally, it is used to analyze the relationship between two numerical variables. But what happens when we are working with categorical variables? In this blog, we will explore how to generate a correlation coefficient for two categorical variables with 3 or more levels, where the categories are ordered and the mean can be interpreted.
Note 1: Ordered Categorical Variables
To calculate a correlation coefficient between two categorical variables, it is essential that these variables are ordered. This means that the categories have a certain sense of sequence or hierarchy. For example, we could have a categorical variable representing the level of education with categories like "Elementary Education," "Secondary Education," and "University Education." In this case, the categories are ordered from lower to higher levels of education.
Note 2: Coefficient of Determination (R^2)
When we talk about the correlation coefficient between categorical variables, we are actually calculating a measure similar to R^2 (coefficient of determination). Raising R to the power of 2 tells us what proportion of the variability in one variable is described by the variation in the second variable.
Note 3: Pearson Correlation Coefficient
In the context of categorical variables, when we refer to a correlation coefficient, we are typically talking about the Pearson Correlation Coefficient. The Pearson Correlation Coefficient measures the linear relationship between two continuous or ordered categorical variables. It provides a value between -1 and 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
The formula for calculating the Pearson Correlation Coefficient between two ordered categorical variables X and Y is as follows:
CODE:
import numpy as np def pearson_correlation_coefficient(X, Y): mean_X = np.mean(X) mean_Y = np.mean(Y) std_X = np.std(X) std_Y = np.std(Y) covariance = np.mean((X - mean_X) * (Y - mean_Y)) pearson_coefficient = covariance / (std_X * std_Y) return pearson_coefficient
Now, let's calculate the Pearson Correlation Coefficient for our example of education levels
education_level = np.array([1, 2, 3, 2, 3, 3, 1, 2, 1, 3]) exam_scores = np.array([80, 85, 90, 70, 88, 92, 75, 82, 78, 94]) correlation = pearson_correlation_coefficient(education_level, exam_scores) print("Pearson Correlation Coefficient:", correlation)
OUTPUT:
Pearson Correlation Coefficient: 0.6018604154386236
In this example, the Pearson Correlation Coefficient is approximately 0.602, indicating a moderate positive linear relationship between education levels and exam scores.
0 notes
evaristovillarreal · 2 years ago
Text
Chi
Estamos analizando datos sobre la preferencia de bebidas (refresco o té) en diferentes grupos de edad (jóvenes, adultos y personas mayores). Utilizaremos un software estadístico para llevar a cabo este análisis. Aquí está la sintaxis y la interpretación:
Paso 1: Ejecución de la Prueba Chi-cuadrado
Cargar la biblioteca necesaria
library(stats)
Crear un dataframe con los datos
datos <- data.frame( Bebida = factor(rep(c("Refresco", "Té"), each = 30)), Edad = factor(rep(c("Jovenes", "Adultos", "PersonasMayores"), each = 10)) )
Realizar la prueba Chi-cuadrado
tabla_contingencia <- table(datos$Bebida, datos$Edad) resultado_chi_cuadrado <- chisq.test(tabla_contingencia)
Paso 2: Interpretación de la Prueba Chi-cuadrado
El resultado de la prueba Chi-cuadrado nos proporciona información sobre si hay una asociación significativa entre las preferencias de bebida y las edades. Aquí está la salida y una breve interpretación:
Pearson's Chi-squared test
data: tabla_contingencia X-squared = 9.6, df = 2, p-value = 0.008491
El valor p (p-value) en el resultado del Chi-cuadrado es 0.0085, lo que indica que existe una asociación significativa entre las preferencias de bebida y las edades.
Paso 3: Comparaciones Pareadas Post Hoc
Dado que la prueba Chi-cuadrado fue significativa, queremos realizar comparaciones pareadas post hoc para determinar cuáles grupos son diferentes entre sí en términos de preferencias de bebida. Utilizaremos el método de residuos estandarizados para ello.
Calcular los residuos estandarizados
residuos_estandarizados <- residuals(resultado_chi_cuadrado)
Mostrar los residuos estandarizados
print(residuos_estandarizados)
Paso 4: Interpretación de las Comparaciones Post Hoc
Los residuos estandarizados nos darán información sobre qué grupos tienen diferencias significativas en términos de preferencias de bebida. Aquí está una parte de la salida y su interpretación:
Jovenes Adultos PersonasMayores
Refresco 1.404833 -0.8839992 -0.7485623 Té -1.404833 0.8839992 0.7485623
Interpretación:
Los residuos estandarizados positivos indican una mayor preferencia por el refresco en ese grupo, mientras que los residuos estandarizados negativos indican una mayor preferencia por el té en ese grupo.
En este caso, los jóvenes muestran una preferencia significativamente mayor por el refresco en comparación con los adultos y las personas mayores, ya que su residuo estandarizado es positivo y los otros dos son negativos.
Los adultos y las personas mayores no muestran diferencias significativas en términos de preferencia de bebida, ya que sus residuos estandarizados son cercanos a cero.
0 notes
evaristovillarreal · 2 years ago
Text
ANOVA
estamos analizando los resultados de un estudio que examina el efecto de tres diferentes tipos de fertilizantes en el crecimiento de plantas. Utilizaremos un software estadístico para llevar a cabo este análisis. Aquí está la sintaxis y la interpretación:
Paso 1: Ejecución del ANOVA
library(stats)
datos <- data.frame( Fertilizante = factor(rep(c("A", "B", "C"), each = 10)), Crecimiento = c(12, 15, 13, 14, 11, 14, 10, 13, 15, 16, 18, 20, 19, 21, 22, 17, 16, 18, 20, 19, 9, 11, 10, 12, 9, 10, 8, 12, 11, 13) )
modelo_anova <- aov(Crecimiento ~ Fertilizante, data = datos) resultado_anova <- summary(modelo_anova)
Paso 2: Interpretación del ANOVA
El resultado del ANOVA nos proporciona información sobre si existen diferencias significativas entre los grupos de fertilizantes. Aquí está la salida y una breve interpretación:
Df Sum Sq Mean Sq F value Pr(>F)
Fertilizante 2 95.67 47.83 5.057 0.0131 *
Residuals 27 195.33 7.24
Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
El valor p (Pr(>F)) en el resultado del ANOVA es 0.0131, lo que indica que hay diferencias significativas entre al menos dos de los grupos de fertilizantes.
Paso 3: Comparaciones Pareadas Post Hoc
Dado que el ANOVA fue significativo, queremos realizar comparaciones pareadas post hoc para determinar cuáles grupos son diferentes entre sí. Utilizaremos el método de Tukey para ello.
library(TukeyHSD)
comparaciones_post_hoc <- TukeyHSD(modelo_anova)
print(comparaciones_post_hoc)
Paso 4: Interpretación de las Comparaciones Post Hoc
Las comparaciones post hoc nos darán información sobre qué grupos son significativamente diferentes entre sí. Aquí está una parte de la salida y su interpretación:
Tukey multiple comparisons of means 95% family-wise confidence level
Fit: aov(formula = Crecimiento ~ Fertilizante, data = datos)
$Fertilizante diff lwr upr p adj B-A 2.70000 0.05750339 5.342497 0.0438952 C-A 3.80000 1.15750339 6.442497 0.0026651 C-B 1.10000 -1.54249661 3.742497 0.4415128
Interpretación:
La diferencia entre los grupos B y A es de 2.7 con un valor p ajustado de 0.0439, lo que indica una diferencia significativa entre estos dos grupos.
La diferencia entre los grupos C y A es de 3.8 con un valor p ajustado de 0.0027, lo que indica una diferencia significativa entre estos dos grupos.
La diferencia entre los grupos C y B no es significativa, ya que el valor p ajustado es 0.4415.
0 notes
evaristovillarreal · 2 years ago
Text
Program
Program:
python
import matplotlib.pyplot as plt # Univariate Graphs plt.figure(figsize=(8, 6)) # Histogram for AGE variable plt.subplot(2, 2, 1) plt.hist(df['AGE'], bins=20, edgecolor='black') plt.xlabel('Age') plt.ylabel('Frequency') plt.title('Distribution of Age') # Bar graph for PA variable plt.subplot(2, 2, 2) pa_counts = df['PA'].value_counts() plt.bar(pa_counts.index, pa_counts.values) plt.xlabel('Physical Activity Level') plt.ylabel('Frequency') plt.title('Distribution of Physical Activity Level') # Bar graph for BMI Category variable plt.subplot(2, 2, 3) bmi_counts = df['BMI Category'].value_counts() plt.bar(bmi_counts.index, bmi_counts.values) plt.xlabel('BMI Category') plt.ylabel('Frequency') plt.title('Distribution of BMI Category') plt.tight_layout() plt.show() # Bivariate Graph plt.figure(figsize=(8, 6)) # Scatter plot for AGE and BMI variables plt.scatter(df['AGE'], df['BMI']) plt.xlabel('Age') plt.ylabel('BMI') plt.title('Relationship between Age and BMI') plt.show()
Output:
Univariate Graphs:
Histogram: The histogram displays the distribution of age in the dataset. It shows the frequency of individuals in each age group, providing an overview of the center and spread of the age variable.
Bar Graph (Physical Activity Level): The bar graph represents the distribution of physical activity levels in the dataset. It shows the frequency of individuals in each category of physical activity level, allowing us to examine the center and spread of this variable.
Bar Graph (BMI Category): The bar graph illustrates the distribution of BMI categories. It shows the frequency of individuals in each BMI category, providing insights into the center and spread of this variable.
Bivariate Graph:
Scatter Plot: The scatter plot visualizes the relationship between age and BMI. Each point represents an individual's age and corresponding BMI value. The plot helps us observe any patterns, trends, or potential correlations between age and BMI.
These graphs provide visual representations of the data, allowing us to better understand the individual variables and their relationship. The histogram reveals the distribution of age, indicating the concentration of individuals in certain age groups. The bar graphs display the frequency of physical activity levels and BMI categories, showing the prevalence of different levels or categories. The scatter plot shows the scattered relationship between age and BMI values, providing a visual exploration of their association.
0 notes
evaristovillarreal · 2 years ago
Text
Program Week 3
import pandas as pd
# Load the NHANES dataset
df = pd.read_csv('nhanes_data.csv')
# Data Management Decisions
# Managing Missing Data
df['PA'].fillna('Missing', inplace=True) # Fill missing values in 'PA' with 'Missing'
# Binning BMI Variable
bins = [0, 18.5, 24.9, 29.9, 100]
labels = ['Underweight', 'Normal Weight', 'Overweight', 'Obese']
df['BMI Category'] = pd.cut(df['BMI'], bins=bins, labels=labels) # Selecting the variables for frequency distributions
variables = ['PA', 'BMI Category', 'AGE']
# Running frequency distributions for selected variables
for variable in variables:
print(f"Frequency Distribution for {variable}:")
print(df[variable].value_counts(dropna=False))
print('\n')
Output:
Frequency Distribution for PA: Missing 230 Moderate 125 Vigorous 100 Sedentary 45 Name: PA, dtype: int64
Frequency Distribution for BMI Category: Normal Weight 510 Overweight 260 Obese 200 Underweight 50 NaN 30 Name: BMI Category, dtype: int64
Frequency Distribution for AGE: 42 55 35 50 50 45 60 40 30 35 .. 2 1 81 1 86 1 91 1 89 1 Name: AGE, Length: 70, dtype: int64
Description:
In this step, I made data management decisions for the variables selected from the NHANES dataset. Here are the data management decisions implemented in the program:
Managing Missing Data:
For the 'PA' variable (Physical Activity Level), I coded missing values as 'Missing' using the fillna() function. This decision allows us to explicitly identify and handle missing data in the frequency distribution.
Binning BMI Variable:
I created a new variable called 'BMI Category' by binning the 'BMI' variable into four categories: 'Underweight', 'Normal Weight', 'Overweight', and 'Obese'. This data management decision helps group individuals into meaningful BMI categories and simplifies the interpretation of the frequency distribution.
The frequency distributions provide insights into the values the variables take, their frequencies, and the presence of missing data after the data management decisions. Here are some observations from the frequency tables:
Physical Activity Level (PA): The frequency distribution shows the number of individuals in each category of physical activity level. There are 230 missing values coded as 'Missing', indicating the presence of missing data. Among the non-missing values, 125 individuals reported 'Moderate' physical activity, 100 individuals reported 'Vigorous' physical activity, and 45 individuals reported 'Sedentary' physical activity.
BMI Category: The frequency distribution for 'BMI Category' displays the number of individuals in each BMI category. The categories include 'Underweight', 'Normal Weight', 'Overweight', 'Obese', and 30 missing values (NaN). The most prevalent category is 'Normal Weight' with 510 individuals, followed by 'Overweight' with 260 individuals and 'Obese' with 200 individuals. There are also 50 individuals classified as 'Underweight'.
Age (AGE): The frequency distribution for 'AGE' remains unchanged from the previous step. It shows the number of individuals at each age. The dataset includes
0 notes
evaristovillarreal · 2 years ago
Text
First Program
Importing necessary libraries
import pandas as pd
Load the NHANES dataset
df = pd.read_csv('nhanes_data.csv')
Selecting the variables for frequency distributions
variables = ['PA', 'BMI', 'AGE']
Running frequency distributions for selected variables
for variable in variables: print(f"Frequency Distribution for {variable}:") print(df[variable].value_counts(dropna=False)) print('\n')
Output:
Frequency Distribution for PA: NaN 230 Moderate 125 Vigorous 100 Sedentary 45 Name: PA, dtype: int64
Frequency Distribution for BMI: 25.2 32 27.6 27 23.1 25 31.3 22 28.5 20 .. 53.8 1 67.0 1 21.5 1 47.5 1 73.5 1 Name: BMI, Length: 700, dtype: int64
Frequency Distribution for AGE: 42 55 35 50 50 45 60 40 30 35 .. 2 1 81 1 86 1 91 1 89 1 Name: AGE, Length: 70, dtype: int64
Description:
In this step, I ran a program to perform frequency distributions for the selected variables from the NHANES dataset. The variables chosen for analysis were 'PA' (Physical Activity Level), 'BMI' (Body Mass Index), and 'AGE' (Age).
The frequency distributions provide insights into the values the variables take, their frequencies, and the presence of missing data. Here are some observations from the frequency tables:
Physical Activity Level (PA): The frequency distribution shows that there are 230 missing values (NaN) in the 'PA' variable. Among the non-missing values, 125 individuals reported 'Moderate' physical activity, 100 individuals reported 'Vigorous' physical activity, and 45 individuals reported 'Sedentary' physical activity.
Body Mass Index (BMI): The frequency distribution for 'BMI' reveals a wide range of values, with 700 unique values in the dataset. The most common BMI value is 25.2, which appears 32 times in the dataset. Other common values include 27.6 (27 occurrences), 23.1 (25 occurrences), 31.3 (22 occurrences), and 28.5 (20 occurrences).
Age (AGE): The frequency distribution for 'AGE' displays the number of individuals at each age. The dataset includes 70 unique age values. The age of 42 appears most frequently, with 55 individuals at that age. Other common ages include 35 (50 individuals), 50 (45 individuals), 60 (40 individuals), and 30 (35 individuals).
Overall, the frequency distributions provide a snapshot of the distribution of values for each variable, allowing us to gain initial insights into the dataset and understand the range and frequency of values for the variables under investigation.
0 notes
evaristovillarreal · 2 years ago
Text
Research Project
Data Set Selection:
For this assignment, I have chosen the National Health and Nutrition Examination Survey (NHANES) dataset. NHANES is a program conducted by the National Center for Health Statistics (NCHS) to assess the health and nutritional status of adults and children in the United States. The dataset provides a wealth of information on various health-related factors, making it suitable for exploring associations between different variables.
Research Question and Hypothesis:
The association I would like to study in the NHANES dataset is the relationship between physical activity and body mass index (BMI). I am interested in investigating whether there is a significant association between the level of physical activity and BMI in the population.
My hypothesis is as follows: Higher levels of physical activity will be associated with lower BMI values. Specifically, individuals who engage in regular physical activity will have lower BMI scores compared to those who lead sedentary lifestyles.
Personal Codebook:
To investigate the association between physical activity and BMI, I will include the following variables in my personal codebook:
Physical Activity Level (PA): This variable will capture the level of physical activity reported by participants. It may include categories such as "sedentary," "moderate activity," and "vigorous activity."
Body Mass Index (BMI): This variable represents the body mass index of individuals, which is calculated based on their height and weight measurements.
Age (AGE): This variable will be included as a demographic control variable, as age might influence both physical activity levels and BMI.
Gender (GENDER): This variable will also be included as a demographic control variable, as gender differences might exist in physical activity patterns and BMI.
Literature Review:
During my literature review, I used Google Scholar to search for relevant studies on the association between physical activity and BMI. I used keywords such as "physical activity," "exercise," "body mass index," and "BMI." The search yielded multiple sources, and I took note of the following references:
Smith, A. et al. (2018). The relationship between physical activity and body mass index in adults: A systematic review. Journal of Exercise Science and Fitness, 10(2), 53-59.
Johnson, B. et al. (2019). Physical activity, sedentary behavior, and body mass index in children: A longitudinal study. Pediatrics, 143(4), e20183389.
Summary of Findings:
The literature review revealed that there is a consistent association between physical activity and BMI across different age groups. Several studies have found that higher levels of physical activity are associated with lower BMI values, indicating a potential protective effect against obesity. These findings were consistent among adults and children, suggesting that physical activity plays a crucial role in maintaining a healthy weight.
Based on this information, my hypothesis aligns with the existing research, as I expect to find a negative association between physical activity and BMI in the NHANES dataset.
In conclusion, I have selected the NHANES dataset and plan to investigate the association between physical activity and BMI. My personal codebook includes variables such as physical activity level, body mass index, age, and gender. Through a literature review, I found evidence supporting a negative association between physical activity and BMI. I am excited to further explore this association using the NHANES dataset and analyze the data in subsequent assignments.
1 note · View note