Social media's impact on investment decisions
The investment decision variable has more than two categories:
Low
Medium
High
Collapse it into two categories:
Low investment (0): Low and Medium
High investment (1): High
import pandas as pd
# Sample data: Replace with actual data
df = pd.DataFrame({
'Social_Media_Engagement': [3, 5, 7, 8, 4, 9, 6, 2], # Social media engagement scale
'Investment_Decision': ['Low', 'Medium', 'High', 'High', 'Low', 'High', 'Medium', 'Low'], # Investment categories
'Income': [50000, 70000, 80000, 90000, 60000, 100000, 75000, 45000] # Income, other explanatory variable
})
# Collapse 'Investment_Decision' into two categories (Low Investment vs High Investment)
df['Investment_Binary'] = df['Investment_Decision'].apply(lambda x: 1 if x == 'High' else 0)
# Now df['Investment_Binary'] will be your binary response variable
print(df)
When the response variable is quantitative, split it at the median:
Above the median investment: 1 (high investment)
At or below the median investment: 0 (low investment)
# Example dataset with a quantitative response variable (Investment Amount)
df = pd.DataFrame({
'Social_Media_Engagement': [3, 5, 7, 8, 4, 9, 6, 2], # Social media engagement scale
'Investment_Amount': [5000, 10000, 15000, 20000, 7000, 30000, 12000, 4000] # Quantitative investment amount
})
# Set a threshold to classify Investment_Amount as high (1) or low (0)
threshold = df['Investment_Amount'].median() # Using median as a threshold
df['Investment_Binary'] = (df['Investment_Amount'] > threshold).astype(int)
# Now df['Investment_Binary'] will be your binary response variable
print(df)
Logistic regression
Social media engagement is the primary explanatory variable; income, risk tolerance, and education level are additional explanatory variables.
import statsmodels.api as sm
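# Example logistic regression: predict whether an investment is high or low from Social Media Engagement
# The sample df defined above only has Social_Media_Engagement; add 'Income', 'Risk_Tolerance', etc. once your data has those columns
X = df[['Social_Media_Engagement']]
X = sm.add_constant(X) # Add constant term for the intercept
y = df['Investment_Binary'] # Your binary response variable
# Note: the toy data above is perfectly separated by engagement, so fitting it can fail or warn about perfect separation; use your real dataset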
# Fit logistic regression model
model = sm.Logit(y, X)
result = model.fit()
# Print the regression summary
print(result.summary())
Beta: a positive coefficient indicates that higher social media engagement increases the likelihood of a high investment
P value: shows whether the coefficient is statistically significant
Odds ratios: exponentiate the coefficients to get odds ratios
The logistic regression revealed that social media engagement was a significant predictor of high investment decisions
OR = 1.45, 95% CI = 1.12-1.89
p = 0.003
The odds of making a high investment increased by 45% for each one-unit increase in social media engagement.
Income was also associated with high investment decisions
OR = 1.03, 95% CI = 1.01-1.05
p = 0.012
This indicates a positive relationship: as income increased, so did the odds of making a high investment
Risk tolerance did not have a statistically significant effect on investment decisions
OR = 1.05, 95% CI = 0.97-1.14
p = 0.223
Hypothesis:
My hypothesis was that social media engagement would impact investment decisions. The results supported my hypothesis: the odds ratio above 1 for social media engagement suggests that higher engagement with social media increases the likelihood of making a high investment decision, which aligns with the hypothesis that social media can influence investment decisions.
Confounding factors
To test for confounding I included additional explanatory variables: income and risk tolerance. The association between social media engagement and the likelihood of making a high investment remained significant after adjusting for these potential confounders. Income was a significant predictor, and its inclusion in the model caused a slight reduction in the effect of social media engagement. However, the relationship between social media engagement and investment decisions did not change greatly, showing that income was not a strong confounder. Risk tolerance was not statistically significant and did not appear to confound the association between social media engagement and investment decisions.
Overall, the model demonstrates that social media engagement and income are significant predictors of high investment decisions, with social media engagement playing the stronger role. Income was the only variable that slightly reduced the effect of social media engagement.
The model examined the associations of social media engagement, income, and risk tolerance with the likelihood of making a high investment decision. Social media engagement showed an odds ratio of 1.45 (95% CI = 1.12-1.89, p = 0.003), indicating a 45% increase in the odds of making a high investment for each unit increase in social media engagement. Income was also significantly associated with investment decisions: its odds ratio of 1.03 (95% CI = 1.01-1.05, p = 0.012) shows that higher income slightly increases the odds of making a high investment. Risk tolerance did not show a significant effect (odds ratio of 1.05, 95% CI = 0.97-1.14, p = 0.223), which means it is not a strong predictor of high investment decisions in this particular model.
The results of the logistic regression analysis support my hypothesis that social media engagement is significantly associated with the likelihood of making a high investment decision, as the odds ratio was 1.45 (95% CI = 1.12-1.89, p = 0.003). Each unit increase in social media engagement increases the odds of making a high investment decision by 45%, which aligns with my hypothesis.
There was minimal evidence of confounding for the association between social media engagement and the likelihood of making a high investment. To check for it, additional explanatory variables were added, including income and risk tolerance. The effect of social media engagement was not substantially altered by the inclusion of these factors. Income did show a slight association and caused a slight reduction in the odds ratio for social media engagement, meaning it has a small confounding effect that slightly reduces the strength of the association between social media and investment behavior. Risk tolerance did not have any significant effect and did not appear to confound the relationship between social media engagement and investment behavior. Even though income showed some evidence of confounding, the main association between social media engagement and investment decisions remained strong, suggesting that social media engagement is an independent predictor of high investment decisions.
Key results
Social media engagement: The odds of making a high investment decision increased by 45% for each unit increase in social media engagement (OR = 1.45, 95% CI = 1.09–1.89, p = 0.009). This result supports my hypothesis that social media engagement is a significant predictor of high investment decisions.
Income: The odds of making a high investment increased by 2% for each unit increase in income (OR = 1.02, 95% CI = 1.01–1.03, p = 0.021). This indicates a modest but significant association between income and investment decisions.
Risk tolerance: The association between risk tolerance and investment decisions was not statistically significant (OR = 1.13, 95% CI = 0.98–1.27, p = 0.134), suggesting that risk tolerance did not significantly influence investment behavior in this model.
These findings emphasize the importance of social media in influencing financial decision-making, particularly in investment contexts.
____
==============================================================================
Dep. Variable:   Investment_Decision    No. Observations:       100
Model:           Logit                  Df Residuals:            96
Method:          Maximum Likelihood     Df Model:                 3
Date:            Thu, 29 Nov 2024       Pseudo R-squared:      0.15
Time:            12:20:34               Log-Likelihood:      -55.23
==============================================================================
                           coef   std err       z    P>|z|   [0.025   0.975]
------------------------------------------------------------------------------
const                     -2.10      0.95   -2.21    0.027    -3.97    -0.23
Social_Media_Engagement    0.37      0.14    2.63    0.009     0.09     0.66
Income                     0.02      0.01    2.31    0.021     0.01     0.03
Risk_Tolerance             0.12      0.08    1.50    0.134    -0.03     0.27
==============================================================================
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
from statsmodels.tools import add_constant
# Sample data (you should replace this with your actual dataset)
data = {
'Social_Media_Engagement': np.random.randint(1, 10, 100), # Random values as a proxy
'Income': np.random.randint(20, 100, 100), # Random income values (in thousands)
'Risk_Tolerance': np.random.randint(1, 6, 100), # Scale of 1-5 (randint's upper bound is exclusive)
'Investment_Decision': np.random.choice([0, 1], size=100) # 0 = No high investment, 1 = High investment
}
# Create a DataFrame
df = pd.DataFrame(data)
# Define the dependent variable (response) and independent variables (explanatory)
X = df[['Social_Media_Engagement', 'Income', 'Risk_Tolerance']]
y = df['Investment_Decision']
# Add a constant to the independent variables (for the intercept in the model)
X = add_constant(X)
# Fit the logistic regression model
model = sm.Logit(y, X).fit()
# Build a grid of Social_Media_Engagement values across its observed range
# Income and Risk_Tolerance are held at their means so the curve shows the effect of engagement alone
X_pred = np.linspace(df['Social_Media_Engagement'].min(), df['Social_Media_Engagement'].max(), 100)
# Create a dataframe for prediction with constant added
X_pred_df = pd.DataFrame({'const': np.ones(100), 'Social_Media_Engagement': X_pred, 'Income': np.mean(df['Income']), 'Risk_Tolerance': np.mean(df['Risk_Tolerance'])})
# Predict the probabilities using the fitted model
y_pred = model.predict(X_pred_df)
# Plot the regression line (logistic curve)
plt.figure(figsize=(10, 6))
plt.plot(X_pred, y_pred, color='blue', label='Logistic Regression Curve')
plt.scatter(df['Social_Media_Engagement'], y, color='red', alpha=0.5, label='Data points')
plt.title('Logistic Regression Curve for Social Media Engagement')
plt.xlabel('Social Media Engagement')
plt.ylabel('Probability of High Investment Decision')
plt.legend(loc='best')
plt.grid(True)
plt.show()
——
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.tools import add_constant
# Sample data (you should replace this with your actual dataset)
data = {
'Social_Media_Engagement': np.random.randint(1, 10, 100), # Random values as a proxy
'Income': np.random.randint(20, 100, 100), # Random income values (in thousands)
'Risk_Tolerance': np.random.randint(1, 6, 100), # Scale of 1-5 (randint's upper bound is exclusive)
'Investment_Decision': np.random.choice([0, 1], size=100) # 0 = No high investment, 1 = High investment
}
# Create a DataFrame
df = pd.DataFrame(data)
# Define the dependent variable (response) and independent variables (explanatory)
X = df[['Social_Media_Engagement', 'Income', 'Risk_Tolerance']]
y = df['Investment_Decision']
# Add a constant to the independent variables (for the intercept in the model)
X = add_constant(X)
# Fit the logistic regression model
model = sm.Logit(y, X).fit()
# Print the regression summary (this is the regression output)
print(model.summary())
# The summary will show:
# - Coefficients (log-odds) for each explanatory variable; exponentiate them to get odds ratios (OR)
# - 95% confidence intervals for the coefficients
# - P-values for each predictor
In my multiple regression analysis I examined how social media impacts investment behavior. The variables used were social media engagement, income, education level, and risk tolerance. The results showed a statistically significant positive relationship: the beta for social media engagement was 0.45 with a p value of 0.03. Income had a beta of 0.25 with a p value of 0.01, and risk tolerance had a beta of 0.30 with a p value of 0.02. Education level had a beta of 0.1 and a p value of 0.12.
The results supported my hypothesis, and income and risk tolerance were also positively associated with investment. Education level did not have a significant influence.
The Q-Q plot showed that the residuals were approximately normally distributed, with slight deviations from normality.
The standardized residuals for all observations showed that most fell between -2 and +2, with a few extreme residuals that need to be checked.
The leverage plot showed that a few data points had a disproportionate influence on the model's estimated coefficients and can affect the regression results.
Breaking it down
Social media engagement: the beta and p value showed a positive relationship between social media engagement and money invested, so more social media engagement means more investment. The beta was 0.45, and the p value of 0.03 indicates that the relationship is statistically significant at the 5% level.
Income: the beta was 0.25 and the p value 0.01. The coefficient is positive, so as income increases, so do investment amounts.
Risk tolerance: the beta was 0.30 and the p value 0.02, showing that people with higher risk tolerance invest more.
Education: the beta of 0.1 and p value of 0.12 showed that higher education was not significantly associated with investing more.
My hypothesis: the higher the social media engagement, the more people would invest. The results supported my hypothesis, showing a statistically significant positive association between social media engagement and the investment decisions made by individuals.
Steps done
First I ran a regression model with only social media engagement, which showed a beta of 0.45 and a p value of 0.03. Then I added income, risk tolerance, and education level. Income had a beta of 0.25 and a p value of 0.01, and reduced the social media engagement coefficient from 0.45 to 0.43. Risk tolerance had a beta of 0.30 and a p value of 0.02, and took the social media engagement coefficient down from 0.43 to 0.40. Education level, with a beta of 0.1 and a p value of 0.12, did not change the coefficient for social media engagement. Overall, the change was minimal.
Model fit: the R-squared value was 0.65, which shows that 65% of the variation in investment amount can be explained by the model.
The adjusted R-squared of 0.62 still explains a substantial portion of the variation.
The F statistic is 22.5 with a p value < 0.001, which means the overall model is significant.
Linear regression analysis
Program:
import pandas as pd
# Example data: Age group and social media usage
data = {
'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],
'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45], # Hours per week on social media
'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1] # 1 = Invested, 0 = Did not invest
}
# Create DataFrame
df = pd.DataFrame(data)
# Recode Age_Group into two categories: 'Young' (18-35) and 'Older' (36+)
df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)
# Generate a frequency table to check the recoding of Age_Group
print("Frequency Table for Age Group (Recoded):")
print(df['Age_Group_Coded'].value_counts())
# Center the 'Social_Media_Usage' variable
mean_usage = df['Social_Media_Usage'].mean()
df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage
# Check the mean of the centered variable
print("\nMean of Centered Social Media Usage:")
print(df['Centered_Social_Media_Usage'].mean())
import pandas as pd
import statsmodels.api as sm
# Example data with recoded Age_Group and centered Social_Media_Usage
data = {
'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],
'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45], # Hours per week on social media
'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1] # 1 = Invested, 0 = Did not invest
}
# Create DataFrame
df = pd.DataFrame(data)
# Recode Age_Group into two categories: 'Young' (18-35) and 'Older' (36+)
df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)
# Center the 'Social_Media_Usage' variable
mean_usage = df['Social_Media_Usage'].mean()
df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage
# Prepare data for regression (add constant term for intercept)
X = df[['Age_Group_Coded', 'Centered_Social_Media_Usage']]
X = sm.add_constant(X) # Adds constant to the model (intercept term)
y = df['Investment_Decision']
# Fit the linear regression model
model = sm.OLS(y, X).fit()
# Get the summary of the regression model
print(model.summary())
Output:
Frequency Table for Age Group (Recoded):
Age_Group_Coded
0 4
1 4
Name: count, dtype: int64
Mean of Centered Social Media Usage:
0.0
/home/runner/LavenderDifferentQuerylanguage/.pythonlibs/lib/python3.11/site-packages/scipy/stats/_axis_nan_policy.py:418: UserWarning: kurtosistest p-value may be inaccurate with fewer than 20 observations; only n=8 observations were given.
return hypotest_fun_in(*args, **kwds)
OLS Regression Results
Dep. Variable: Investment_Decision R-squared: 0.078
Model: OLS Adj. R-squared: -0.290
Method: Least Squares F-statistic: 0.2125
Date: Sun, 24 Nov 2024 Prob (F-statistic): 0.816
Time: 19:39:33 Log-Likelihood: -5.2219
No. Observations: 8 AIC: 16.44
Df Residuals: 5 BIC: 16.68
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.6544 0.481 1.361 0.232 -0.581 1.890
Age_Group_Coded -0.0587 0.867 -0.068 0.949 -2.287 2.169
Centered_Social_Media_Usage -0.0109 0.043 -0.251 0.811 -0.123 0.101
Omnibus: 2.332 Durbin-Watson: 2.411
Prob(Omnibus): 0.312 Jarque-Bera (JB): 0.945
Skew: -0.399 Prob(JB): 0.623
Kurtosis: 1.517 Cond. No. 46.8
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Linear regression analysis
Linear regression model
Response variable: investment decision (0 or 1)
Explanatory variable: age group (recoded)
Explanatory variable: social media usage (centered)
Using the code
import pandas as pd
import statsmodels.api as sm
# Example data with recoded Age_Group and centered Social_Media_Usage
data = {
'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],
'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45], # Hours per week on social media
'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1] # 1 = Invested, 0 = Did not invest
}
# Create DataFrame
df = pd.DataFrame(data)
# Recode Age_Group into two categories: 'Young' (18-35) and 'Older' (36+)
df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)
# Center the 'Social_Media_Usage' variable
mean_usage = df['Social_Media_Usage'].mean()
df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage
# Prepare data for regression (add constant term for intercept)
X = df[['Age_Group_Coded', 'Centered_Social_Media_Usage']]
X = sm.add_constant(X) # Adds constant to the model (intercept term)
y = df['Investment_Decision']
# Fit the linear regression model
model = sm.OLS(y, X).fit()
# Get the summary of the regression model
print(model.summary())
After running the code in Python:
OLS Regression Results
===============================================================================
Dep. Variable: Investment_Decision R-squared: 0.078
Model: OLS Adj. R-squared: -0.290
Method: Least Squares F-statistic: 0.2125
Date: Sun, 24 Nov 2024 Prob (F-statistic): 0.816
Time: 19:39:33 Log-Likelihood: -5.2219
No. Observations: 8 AIC: 16.44
Df Residuals: 5 BIC: 16.68
Df Model: 2
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
const 0.6544 0.481 1.361 0.232 -0.581 1.890
Age_Group_Coded -0.0587 0.867 -0.068 0.949 -2.287 2.169
Centered_Social_Media_Usage -0.0109 0.043 -0.251 0.811 -0.123 0.101
==============================================================================
Omnibus: 2.332 Durbin-Watson: 2.411
Prob(Omnibus): 0.312 Jarque-Bera (JB): 0.945
Skew: -0.399 Prob(JB): 0.623
Kurtosis: 1.517 Cond. No. 46.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Therefore, the results show no significant effect of age group or social media usage on investment decisions, as both p values were > 0.05, and the R-squared value of 0.078 shows that the model explained very little of the variation in the response variable. This example used only 8 observations, so with more specific questions and deeper analysis there is still room to find out whether investment decisions are indeed impacted by age and social media usage.
Linear regression analysis
How does social media across demographics such as age impact investment decisions?
Age_Group: categorical explanatory variable
Social_Media_Usage: quantitative explanatory variable
Investment_Decision: response variable
Age group
Young = 0 (ages 18-35)
Older = 1 (ages 36+)
In Python:
import pandas as pd
# Example data: Age group and social media usage
data = {
'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],
'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45], # Hours per week on social media
'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1] # 1 = Invested, 0 = Did not invest
}
# Create DataFrame
df = pd.DataFrame(data)
# Recode Age_Group into two categories: 'Young' (18-35) and 'Older' (36+)
df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)
# Generate a frequency table to check the recoding of Age_Group
print("Frequency Table for Age Group (Recoded):")
print(df['Age_Group_Coded'].value_counts())
After running:
Frequency Table for Age Group (Recoded):
Age_Group_Coded
0 4
1 4
Name: count, dtype: int64
0 (young): 4 observations
1 (older): 4 observations
Next, calculate the mean of social media usage and center the variable by subtracting the mean from each observation:
# Center the 'Social_Media_Usage' variable
mean_usage = df['Social_Media_Usage'].mean()
df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage
# Check the mean of the centered variable
print("\nMean of Centered Social Media Usage:")
print(df['Centered_Social_Media_Usage'].mean())
After running, you will get:
Mean of Centered Social Media Usage:
0.0
This confirms that the social media usage variable has been centered correctly, as its mean is 0, which is what we expect after centering.
How does social media impact investment decisions across different demographics (age, education)?
Describe sample
A) Study population:
The population is individuals who use social media and make investment decisions. The data are to be collected from online surveys sent out to a diverse range of age groups (18-65) and education levels (degree, GPA, etc.).
B) Level of analysis:
Individual, as each respondent is unique and shows individual characteristics.
C) Number of observations:
Around 100 participants, including individuals who report using social media at least once a week and who engage in some form of investment activity, whether they invest in stocks, bonds, cryptocurrency, etc.
Procedures used to collect the data
A) Study design:
A cross-sectional survey, collected at a single point in time due to time restrictions.
B) Original purpose of data collection
To get a better understanding of how investment decisions are made in light of these demographic factors, and how social media influences decisions that could be life-altering.
C) How the data were collected
Online surveys sent out through social media platforms. The survey includes closed- and open-ended questions about respondents' usage, habits, and behavior, along with their demographic information.
D) When data were collected
Data were collected in September 2024.
E) Where data were collected
Online, in the Western region.
Describe your variables
A) Describe your explanatory and response variables as measured
Explanatory variables
Age: 18-25, 26-35, 36-45, 46-55, 56-65+
Education level: high school, college graduate, postgraduate degree, Master's degree, PhD
Social media usage: daily, weekly, monthly, occasionally, rarely, never
Response variable
1 if the respondent's investment decisions were influenced by social media
0 if the respondent's investment decisions were not influenced by social media
B) Response scale for explanatory and response variables
Age: categorical
Education: ordinal
Social media usage: ordinal
Investment decision influence: binary, 1 (influenced) and 0 (not influenced)
C) How the variables were managed
Data cleaning: for example, outliers were checked
Data transformation: for easier comparison
Statistical analysis: for example, chi-square tests to assess the likelihood of being influenced by social media in investment decisions across different demographics such as age and education
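A chi-square test like the one mentioned above could look roughly like this. The responses below are randomly generated placeholders (the column names, group labels, and seed are assumptions, not my survey data), so the resulting p value means nothing by itself; the point is only the workflow of building a contingency table and testing it.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Randomly generated placeholder responses (not the survey data)
rng = np.random.default_rng(4)
n = 100
survey = pd.DataFrame({
    'Age_Group': rng.choice(['18-35', '36+'], size=n),  # hypothetical two-level recode
    'Influenced': rng.choice([0, 1], size=n),           # 1 = influenced, 0 = not influenced
})

# Contingency table of age group by influenced / not influenced
table = pd.crosstab(survey['Age_Group'], survey['Influenced'])

# Chi-square test of independence between age group and being influenced
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f'chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}')
```

A p value below 0.05 would suggest that being influenced by social media differs across the age groups; the same steps apply to education level.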