reema00321 - Tumblr blog

reema00321 · 7 months ago


Social media's impact on investment decisions

The investment decision variable has more than two categories:

Low

Medium

High

Collapse it into two categories:

Low investment (0), which covers Low and Medium

High investment (1)

import pandas as pd

# Sample data: Replace with actual data

df = pd.DataFrame({

'Social_Media_Engagement': [3, 5, 7, 8, 4, 9, 6, 2], # Social media engagement scale

'Investment_Decision': ['Low', 'Medium', 'High', 'High', 'Low', 'High', 'Medium', 'Low'], # Investment categories

'Income': [50000, 70000, 80000, 90000, 60000, 100000, 75000, 45000] # Income, other explanatory variable

})

# Collapse 'Investment_Decision' into two categories (Low Investment vs High Investment)

df['Investment_Binary'] = df['Investment_Decision'].apply(lambda x: 1 if x == 'High' else 0)

# Now df['Investment_Binary'] will be your binary response variable

print(df)

When the response variable is quantitative

Above-median investment = 1 (high investment)

Investment at or below the median = 0 (low investment)

# Example dataset with a quantitative response variable (Investment Amount)

df = pd.DataFrame({

'Social_Media_Engagement': [3, 5, 7, 8, 4, 9, 6, 2], # Social media engagement scale

'Investment_Amount': [5000, 10000, 15000, 20000, 7000, 30000, 12000, 4000] # Quantitative investment amount

})

# Set a threshold to classify Investment_Amount as high (1) or low (0)

threshold = df['Investment_Amount'].median() # Using median as a threshold

df['Investment_Binary'] = (df['Investment_Amount'] > threshold).astype(int)

# Now df['Investment_Binary'] will be your binary response variable

print(df)

Logistic regression

Social media engagement is the primary explanatory variable, with income, risk tolerance, and education level as additional explanatory variables.

import statsmodels.api as sm

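# Example logistic regression: predict whether an investment is high or low from social media engagement

# Note: df must contain every column listed below; the second sample DataFrame above has no 'Income' column

# The 8-row sample data are perfectly separated by engagement, so the fit may fail or warn; use your actual data

X = df[['Social_Media_Engagement', 'Income']] # You can include other variables like Risk_Tolerance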

X = sm.add_constant(X) # Add constant term for the intercept

y = df['Investment_Binary'] # Your binary response variable

# Fit logistic regression model

model = sm.Logit(y, X)

result = model.fit()

# Print the regression summary

print(result.summary())

Beta: a positive coefficient indicates that higher social media engagement increases the likelihood of a high investment.

P value: shows whether the coefficient is statistically significant.

Odds ratios: exponentiate the coefficients to get odds ratios.

The logistic regression revealed that social media engagement was a significant predictor of high investment decisions

OR= 1.45, 95% CI= 1.12-1.89

p=0.003

The odds of making a high investment increased by 45% for each unit increase in social media engagement.

Income was also associated with high investment decisions

OR =1.03, 95% CI=1.01-1.05

p=0.012

This indicates a positive relationship: as income increases, so does the likelihood of making a high investment

Risk tolerance did not have a statistically significant effect on investment decisions

OR=1.05,95% CI=0.97-1.14

p=0.223

Hypothesis:

My hypothesis was that social media engagement would impact investment decisions. The results supported my hypothesis: the odds ratio for social media engagement is greater than 1, which suggests that higher engagement with social media increases the likelihood of making a high investment decision. This aligns with the hypothesis that social media can influence investment decisions.

Confounding factors

To test for confounding, I included additional explanatory variables: income and risk tolerance. The association between social media engagement and the likelihood of making a high investment remained significant after adjusting for these potential confounders. Income was a significant predictor, and its inclusion in the model caused a slight reduction in the effect of social media engagement. However, the relationship between social media engagement and investment decisions did not change greatly, showing that income was not a strong confounder. Risk tolerance was not statistically significant and did not appear to confound the association between social media engagement and investment decisions.

The model demonstrates that social media engagement and income are significant predictors of high investment decisions, with social media engagement playing a somewhat strong role. Income was the only variable that slightly reduced the effect of social media engagement.

The analysis examined the associations of social media engagement, income, and risk tolerance with the likelihood of making high investment decisions. Social media engagement showed an odds ratio of 1.45 (95% CI = 1.12-1.89, p=0.003), indicating a 45% increase in the odds of making a high investment for each unit increase in social media engagement. Income was also significantly associated with investment decisions: its odds ratio of 1.03 (95% CI = 1.01-1.05, p=0.012) shows that higher income slightly increases the likelihood of making a high investment. Risk tolerance did not show a significant effect, with an odds ratio of 1.05 (95% CI = 0.97-1.14, p=0.223), which means it is not a strong predictor of high investment decisions in this particular model.

The results of the logistic regression analysis support my hypothesis that social media engagement is significantly associated with the likelihood of making a high investment decision. The odds ratio was 1.45 (95% CI = 1.12-1.89, p=0.003), so each unit increase in social media engagement raises the odds of making a high investment decision by 45%.

There was minimal evidence of confounding for the association between social media engagement and the likelihood of making high investments. To check for it, additional explanatory variables (income and risk tolerance) were added to the model. The effect of social media engagement was not substantially altered by including them. Income showed a slight association and caused a slight reduction in the odds ratio for social media engagement, meaning it has a small confounding effect that slightly reduces the strength of the association between social media and investment behavior. Risk tolerance did not have a significant effect and did not appear to confound the relationship between social media engagement and investment behavior. Even though income showed some evidence of confounding, the main association between social media engagement and investment decisions remained strong, suggesting that social media engagement is an independent predictor of high investment decisions.

Key results

Social media engagement: The odds of making a high investment decision increased by 45% for each unit increase in social media engagement (OR = 1.45, 95% CI = 1.09–1.89, p = 0.009). This result supports my hypothesis that social media engagement is a significant predictor of high investment decisions.

Income: The odds of making a high investment increased by 2% for each unit increase in income (OR = 1.02, 95% CI = 1.01–1.03, p = 0.021). This indicates a modest but significant association between income and investment decisions.

Risk tolerance: The association between risk tolerance and investment decisions was not statistically significant (OR = 1.13, 95% CI = 0.98–1.27, p = 0.134), suggesting that risk tolerance did not significantly influence investment behavior in this model.

These findings emphasize the importance of social media in influencing financial decision-making, particularly in investment contexts.

____

==============================================================================
Dep. Variable:    Investment_Decision   No. Observations:        100
Model:            Logit                 Df Residuals:            96
Method:           Maximum Likelihood    Df Model:                3
Date:             Thu, 29 Nov 2024      Pseudo R-squared:        0.15
Time:             12:20:34              Log-Likelihood:          -55.23
==============================================================================
                             coef   std err        z    P>|z|   [0.025   0.975]
------------------------------------------------------------------------------
const                       -2.10      0.95    -2.21    0.027    -3.97    -0.23
Social_Media_Engagement      0.37      0.14     2.63    0.009     0.09     0.66
Income                       0.02      0.01     2.31    0.021     0.01     0.03
Risk_Tolerance               0.12      0.08     1.50    0.134    -0.03     0.27
==============================================================================
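As a quick arithmetic check, the coefficients and interval bounds printed in the summary above can be exponentiated by hand to get odds ratios. This sketch only uses the two-decimal values shown in the table, so the results are approximate and may not match intervals derived from unrounded output.

```python
import math

# (coef, lower bound, upper bound) copied from the summary table above
coefs = {
    'Social_Media_Engagement': (0.37, 0.09, 0.66),
    'Income': (0.02, 0.01, 0.03),
    'Risk_Tolerance': (0.12, -0.03, 0.27),
}

# Exponentiate each log-odds value to get the odds ratio and its 95% CI
odds_ratios = {
    name: tuple(round(math.exp(value), 2) for value in values)
    for name, values in coefs.items()
}

for name, (or_value, ci_low, ci_high) in odds_ratios.items():
    print(f"{name}: OR = {or_value}, 95% CI = {ci_low} to {ci_high}")
```

The point estimates come out at 1.45, 1.02, and 1.13. The upper bounds computed this way (1.93 for engagement, 1.31 for risk tolerance) are a little wider than some intervals quoted in the text, so those are worth re-deriving from the full model output.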

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import pandas as pd

import statsmodels.api as sm

from statsmodels.tools import add_constant

# Sample data (you should replace this with your actual dataset)

data = {

'Social_Media_Engagement': np.random.randint(1, 10, 100), # Random values as a proxy

'Income': np.random.randint(20, 100, 100), # Random income values (in thousands)

'Risk_Tolerance': np.random.randint(1, 6, 100), # Scale of 1-5 (randint's upper bound is exclusive)

'Investment_Decision': np.random.choice([0, 1], size=100) # 0 = No high investment, 1 = High investment

}

# Create a DataFrame

df = pd.DataFrame(data)

# Define the dependent variable (response) and independent variables (explanatory)

X = df[['Social_Media_Engagement', 'Income', 'Risk_Tolerance']]

y = df['Investment_Decision']

# Add a constant to the independent variables (for the intercept in the model)

X = add_constant(X)

# Fit the logistic regression model

model = sm.Logit(y, X).fit()

# Predict probabilities using the fitted model

# Vary Social_Media_Engagement over its observed range while holding Income and Risk_Tolerance at their means

X_pred = np.linspace(df['Social_Media_Engagement'].min(), df['Social_Media_Engagement'].max(), 100)

# Create a dataframe for prediction with constant added

X_pred_df = pd.DataFrame({'const': np.ones(100), 'Social_Media_Engagement': X_pred, 'Income': np.mean(df['Income']), 'Risk_Tolerance': np.mean(df['Risk_Tolerance'])})

# Predict the probabilities using the fitted model

y_pred = model.predict(X_pred_df)

# Plot the regression line (logistic curve)

plt.figure(figsize=(10, 6))

plt.plot(X_pred, y_pred, color='blue', label='Logistic Regression Curve')

plt.scatter(df['Social_Media_Engagement'], y, color='red', alpha=0.5, label='Data points')

plt.title('Logistic Regression Curve for Social Media Engagement')

plt.xlabel('Social Media Engagement')

plt.ylabel('Probability of High Investment Decision')

plt.legend(loc='best')

plt.grid(True)

plt.show()

——

import pandas as pd

import numpy as np

import statsmodels.api as sm

from statsmodels.tools import add_constant

# Sample data (you should replace this with your actual dataset)

data = {

'Social_Media_Engagement': np.random.randint(1, 10, 100), # Random values as a proxy

'Income': np.random.randint(20, 100, 100), # Random income values (in thousands)

'Risk_Tolerance': np.random.randint(1, 6, 100), # Scale of 1-5 (randint's upper bound is exclusive)

'Investment_Decision': np.random.choice([0, 1], size=100) # 0 = No high investment, 1 = High investment

}

# Create a DataFrame

df = pd.DataFrame(data)

# Define the dependent variable (response) and independent variables (explanatory)

X = df[['Social_Media_Engagement', 'Income', 'Risk_Tolerance']]

y = df['Investment_Decision']

# Add a constant to the independent variables (for the intercept in the model)

X = add_constant(X)

# Fit the logistic regression model

model = sm.Logit(y, X).fit()

# Print the regression summary (this is the regression output)

print(model.summary())

# The summary will show:

# - Coefficients (log-odds) for each explanatory variable; exponentiate them to get odds ratios (OR)

# - 95% confidence intervals for the coefficients

# - P-values for each predictor


reema00321 · 7 months ago


In my multiple regression analysis I examined how social media impacts investment behavior. The explanatory variables were social media engagement, income, education level, and risk tolerance. The results showed a statistically significant positive relationship: the social media engagement beta was 0.45 (p value 0.03). Income had a beta of 0.25 (p value 0.01) and risk tolerance had a beta of 0.30 (p value 0.02). Education level had a beta of 0.1 (p value 0.12), which is not significant.

The results supported my hypothesis. Income and risk tolerance were positively associated with both social media engagement and investment, while education level did not have a great influence.

The Q-Q plot showed that the residuals were approximately normally distributed, with slight deviations from normality.

The standardized residuals for all observations showed that most residuals fell in the range -2 to +2, with minimal outliers. The few extreme residuals need to be checked.

The leverage plot showed that certain data points had a disproportionate influence on the model's estimated coefficients. These few influential points can affect the regression results.

Breaking it down

Social media engagement: the beta and p value showed a positive relationship between social media engagement and money invested, so more social media engagement goes with more investment. The beta was 0.45, and the p value of 0.03 indicates that the relationship is statistically significant at the 5% level.

Income: the beta was 0.25 with a p value of 0.01. Because it is positive, investment amounts increase as income increases.

Risk tolerance: the beta was 0.30 with a p value of 0.02, showing that people with higher risk tolerance invest more.

Education: the beta of 0.1 with a p value of 0.12 showed that higher education did not mean individuals invested more.

My hypothesis: the higher the social media engagement, the more people would invest. The results supported my hypothesis, showing a statistically significant positive association between social media engagement and the investment decisions made by individuals.

Steps done

First I ran a regression model with only social media engagement, which showed a beta of 0.45 and a p value of 0.03. Then I added income, risk tolerance, and education level. Income (beta 0.25, p value 0.01) reduced the social media engagement beta from 0.45 to 0.43. Risk tolerance (beta 0.30, p value 0.02) took it down from 0.43 to 0.40. Education level (beta 0.1, p value 0.12) did not change the coefficient for social media engagement. Overall, the change was minimal.

Model fit: the R-squared value was 0.65, which shows that 65% of the variation in investment amount can be explained by the model.

The adjusted R-squared of 0.62 still explains a substantial portion of the variation.

The F statistic is 22.5 with a p value <0.001, which means the overall model is significant.


reema00321 · 7 months ago


Linear regression analysis

Program:

import pandas as pd

# Example data: Age group and social media usage

data = {
    'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],
    'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45], # Hours per week on social media
    'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1] # 1 = Invested, 0 = Did not invest
}

# Create DataFrame

df = pd.DataFrame(data)

# Recode Age_Group into two categories: 'Young' (18-35) and 'Older' (36+)

df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)

# Generate a frequency table to check the recoding of Age_Group

print("Frequency Table for Age Group (Recoded):")
print(df['Age_Group_Coded'].value_counts())

# Center the 'Social_Media_Usage' variable

mean_usage = df['Social_Media_Usage'].mean()
df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage

# Check the mean of the centered variable

print("\nMean of Centered Social Media Usage:")
print(df['Centered_Social_Media_Usage'].mean())

import pandas as pd
import statsmodels.api as sm

# Example data with recoded Age_Group and centered Social_Media_Usage

# Create DataFrame

df = pd.DataFrame(data)

# Recode Age_Group into two categories: 'Young' (18-35) and 'Older' (36+)

df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)

# Center the 'Social_Media_Usage' variable

mean_usage = df['Social_Media_Usage'].mean()
df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage

# Prepare data for regression (add constant term for intercept)

X = df[['Age_Group_Coded', 'Centered_Social_Media_Usage']]
X = sm.add_constant(X) # Adds constant to the model (intercept term)
y = df['Investment_Decision']

# Fit the linear regression model

model = sm.OLS(y, X).fit()

# Get the summary of the regression model

print(model.summary())

Output:

Frequency Table for Age Group (Recoded):
Age_Group_Coded
0    4
1    4
Name: count, dtype: int64

Mean of Centered Social Media Usage:
0.0
/home/runner/LavenderDifferentQuerylanguage/.pythonlibs/lib/python3.11/site-packages/scipy/stats/_axis_nan_policy.py:418: UserWarning: kurtosistest p-value may be inaccurate with fewer than 20 observations; only n=8 observations were given.
  return hypotest_fun_in(*args, **kwds)

                            OLS Regression Results
==============================================================================
Dep. Variable:    Investment_Decision   R-squared:                0.078
Model:            OLS                   Adj. R-squared:          -0.290
Method:           Least Squares         F-statistic:              0.2125
Date:             Sun, 24 Nov 2024      Prob (F-statistic):       0.816
Time:             19:39:33              Log-Likelihood:          -5.2219
No. Observations: 8                     AIC:                      16.44
Df Residuals:     5                     BIC:                      16.68
Df Model:         2
Covariance Type:  nonrobust
===============================================================================================
                                  coef    std err          t      P>|t|     [0.025     0.975]
-----------------------------------------------------------------------------------------------
const                           0.6544      0.481      1.361      0.232     -0.581      1.890
Age_Group_Coded                -0.0587      0.867     -0.068      0.949     -2.287      2.169
Centered_Social_Media_Usage    -0.0109      0.043     -0.251      0.811     -0.123      0.101
==============================================================================
Omnibus:          2.332    Durbin-Watson:       2.411
Prob(Omnibus):    0.312    Jarque-Bera (JB):    0.945
Skew:            -0.399    Prob(JB):            0.623
Kurtosis:         1.517    Cond. No.            46.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


reema00321 · 7 months ago


Linear regression analysis

Linear regression model

Response variable: investment decision (0 or 1)

Explanatory variable: age group (recoded)

Explanatory variable: social media usage (centered)

Using the code

import pandas as pd

import statsmodels.api as sm

# Example data with recoded Age_Group and centered Social_Media_Usage

data = {

'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],

'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45], # Hours per week on social media

'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1] # 1 = Invested, 0 = Did not invest

}

# Create DataFrame

df = pd.DataFrame(data)

# Recode Age_Group into two categories: 'Young' (18-35) and 'Older' (36+)

df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)

# Center the 'Social_Media_Usage' variable

mean_usage = df['Social_Media_Usage'].mean()

df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage

# Prepare data for regression (add constant term for intercept)

X = df[['Age_Group_Coded', 'Centered_Social_Media_Usage']]

X = sm.add_constant(X) # Adds constant to the model (intercept term)

y = df['Investment_Decision']

# Fit the linear regression model

model = sm.OLS(y, X).fit()

# Get the summary of the regression model

print(model.summary())

After running the code in Python:

                            OLS Regression Results
==============================================================================
Dep. Variable:    Investment_Decision   R-squared:                0.078
Model:            OLS                   Adj. R-squared:          -0.290
Method:           Least Squares         F-statistic:              0.2125
Date:             Sun, 24 Nov 2024      Prob (F-statistic):       0.816
Time:             19:39:33              Log-Likelihood:          -5.2219
No. Observations: 8                     AIC:                      16.44
Df Residuals:     5                     BIC:                      16.68
Df Model:         2
Covariance Type:  nonrobust
===============================================================================================
                                  coef    std err          t      P>|t|     [0.025     0.975]
-----------------------------------------------------------------------------------------------
const                           0.6544      0.481      1.361      0.232     -0.581      1.890
Age_Group_Coded                -0.0587      0.867     -0.068      0.949     -2.287      2.169
Centered_Social_Media_Usage    -0.0109      0.043     -0.251      0.811     -0.123      0.101
==============================================================================
Omnibus:          2.332    Durbin-Watson:       2.411
Prob(Omnibus):    0.312    Jarque-Bera (JB):    0.945
Skew:            -0.399    Prob(JB):            0.623
Kurtosis:         1.517    Cond. No.            46.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Therefore, the model shows no significant effect of age group or social media usage on investment decisions, as both p values were > 0.05. The R-squared value of 0.078 shows that the model explains very little of the variation in the response variable, and the sample has only 8 observations. With more specific questions and deeper analysis, there is still room to find out whether investment decisions are impacted by age and social media usage.


reema00321 · 7 months ago


Linear regression analysis

How does social media usage across demographics such as age impact investment decisions?

Age_Group: categorical explanatory variable

Social_Media_Usage: quantitative explanatory variable

Investment_Decision: response variable

Age group

Young = 0 (ages 18-35)

Older = 1 (ages 36+)

In Python:

import pandas as pd

# Example data: Age group and social media usage

data = {

'Age_Group': ['18-25', '26-35', '36-45', '46-55', '56+', '18-25', '26-35', '36-45'],

'Social_Media_Usage': [15, 25, 35, 40, 30, 18, 22, 45], # Hours per week on social media

'Investment_Decision': [1, 1, 0, 0, 1, 1, 0, 1] # 1 = Invested, 0 = Did not invest

}

# Create DataFrame

df = pd.DataFrame(data)

# Recode Age_Group into two categories: 'Young' (18-35) and 'Older' (36+)

df['Age_Group_Coded'] = df['Age_Group'].apply(lambda x: 0 if x in ['18-25', '26-35'] else 1)

# Generate a frequency table to check the recoding of Age_Group

print("Frequency Table for Age Group (Recoded):")

print(df['Age_Group_Coded'].value_counts())

After running:

Frequency Table for Age Group (Recoded):

Age_Group_Coded

0 4

1 4

Name: count, dtype: int64

0 (young): 4 observations

1 (older): 4 observations

Next, calculate the mean of social media usage and center the variable by subtracting the mean from each observation:

# Center the 'Social_Media_Usage' variable

mean_usage = df['Social_Media_Usage'].mean()

df['Centered_Social_Media_Usage'] = df['Social_Media_Usage'] - mean_usage

# Check the mean of the centered variable

print("\nMean of Centered Social Media Usage:")

print(df['Centered_Social_Media_Usage'].mean())

After running, you will get:

Mean of Centered Social Media Usage:

0.0

This confirms that the social media usage variable has been centered correctly, as the mean is 0, which is expected after centering.


reema00321 · 7 months ago


How does social media impact investment decisions across different demographics (age, education)?

Describe sample

A) Study population:

The population is individuals who use social media and make investment decisions. The data are to be collected from online surveys sent out to a diverse range of age groups (18-65), as well as a range of education levels (degree, GPA, etc.).

B) Level of analysis:

Individual, as each participant is unique and shows individual characteristics.

C) Number of observations:

Around 100 participants, including individuals who report using social media at least once a week and who engage in some form of investment activity, whether they invest in stocks, bonds, cryptocurrency, etc.

Procedures used to collect the data

A) Study design:

Cross-sectional survey, collected at a single point in time due to time restrictions.

B) Original purpose of data collection

To get a better understanding of how investment decisions are being made in relation to demographic factors, and how social media is influencing decisions that could be life altering.

C) How the data were collected

Online surveys sent out through social media platforms. The survey includes closed and open-ended questions about participants' social media usage, habits, and behavior, along with their demographic information.

D) When data were collected

Data were collected in September 2024

E) Where data were collected

Online in the Western region

Describe your variables

A) Describe your explanatory and response variables

Explanatory variables

Age: 18-25, 26-35, 36-45, 46-55, 56-65+

Education level: high school, college graduate, postgraduate degree, Master's degree, PhD

Social media usage: daily, weekly, monthly, occasionally, rarely, never

Response Variable

1 if the participant's investment decisions were influenced by social media

0 if the participant's investment decisions were not influenced by social media

B) Response scale for explanatory and response variables

Age: categorical

Education: ordinal

Social media usage: ordinal

Investment decision influence: binary 1(influenced) and 0(not influenced)

C) How the variables were managed

Data cleaning: for example, outliers were checked.

Data transformation: variables were recoded for easy comparison.

Statistical analysis: chi-square tests to assess the likelihood of being influenced by social media in investment decisions across different demographics such as age and education.
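A chi-square test of independence of this kind could look like the sketch below. The counts are invented for illustration (an assumption, not survey data), and the column name `Influenced` is a hypothetical label for the binary response.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical responses: 20 participants per age group (not real survey data)
df = pd.DataFrame({
    'Age_Group': ['18-25'] * 20 + ['26-35'] * 20 + ['36-45'] * 20,
    'Influenced': [1] * 15 + [0] * 5      # 18-25: 15 influenced, 5 not
                + [1] * 10 + [0] * 10     # 26-35: 10 influenced, 10 not
                + [1] * 5 + [0] * 15,     # 36-45: 5 influenced, 15 not
})

# Cross-tabulate age group against the binary response
table = pd.crosstab(df['Age_Group'], df['Influenced'])
print(table)

# Test whether being influenced is independent of age group
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

A p value below 0.05 would suggest that the share of influenced participants differs across age groups; the same pattern applies to education level by swapping the grouping column.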
