mscdatasciencecoursercourse1
Week 1: Regression Modelling in Practice- Writing About Your Data
Study Population: The study focuses on understanding the factors that influence life expectancy across various countries. The population of interest comprises countries worldwide, representing diverse economic, social, and health conditions. This broad range of countries includes developed, developing, and underdeveloped nations, allowing for a comprehensive analysis of global health determinants.
Level of Analysis: The level of analysis in this study is aggregate, as it examines country-level data rather than individual or group data. The variables analyzed, such as life expectancy, health expenditure, and access to essential services, are measured at the national level.
Number of Observations: The dataset contains 278 observations, each representing a unique country. After data cleaning and handling of missing values, the analysis used a sample of 176 countries.
Data Analytic Sample: The analytic sample comprises 176 countries for which complete data was available on key variables, including life expectancy, health expenditure, access to electricity, improved sanitation and water sources, fertility rates, under-5 mortality rates, rural population percentages, and GDP per capita. This sample was used to explore the relationship between these variables and life expectancy, providing insights into the factors that significantly influence health outcomes across different regions.
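The reduction from the full dataset to the analytic sample can be sketched as a complete-case filter. This is only an illustration: the post does not state exactly how missing values were handled, so dropping incomplete rows is an assumption, and the toy data and column names below are hypothetical rather than the study's real fields.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
raw = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "life_expectancy": [71.2, np.nan, 80.5, 62.3],
    "health_exp_per_capita": [310.0, 95.0, 4200.0, np.nan],
    "gdp_per_capita": [5400.0, 1200.0, 46000.0, 900.0],
})

key_vars = ["life_expectancy", "health_exp_per_capita", "gdp_per_capita"]

# Complete-case filter: keep only rows with no missing value in any key variable.
analytic = raw.dropna(subset=key_vars).reset_index(drop=True)

print(f"{len(raw)} observations -> {len(analytic)} in the analytic sample")
```

In this toy example, countries B and D are dropped because each is missing one key variable.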
Data Collection Procedures
Study Design: The data used in this study come from routine reporting mechanisms and were collected and aggregated primarily by international organizations such as the World Bank and the World Health Organization. These organizations compile global data on various indicators related to health, economy, and social factors.
Original Purpose of Data Collection: The original purpose of the data collection was to provide comprehensive, standardized, and accessible information on key health, economic, and demographic indicators across countries. This data serves as a valuable resource for policymakers, researchers, and international organizations to monitor progress, design interventions, and formulate policies aimed at improving global health and socio-economic conditions.
Data Collection Methods: The data was collected through various means, including national surveys, administrative records, and reports submitted by individual countries to international organizations. These organizations then aggregate, standardize, and validate the data to ensure consistency and accuracy across countries. For example, health expenditure data might come from government financial records, while life expectancy and mortality data could be derived from national health surveys and vital statistics.
Data Collection Period: The data were collected over several years, and the year of collection varies by variable, with the most recent data points typically representing the year closest to the time of analysis. The data used in my research study are those reported for the year 2012.
Geographic Scope of Data Collection: The data were collected globally, covering a wide range of countries across different continents. This comprehensive geographical coverage ensures that the analysis reflects diverse socio-economic conditions, health outcomes, and developmental stages. The data includes countries from all major regions, including Africa, Asia, Europe, North and South America, and Oceania.
Variables:
Response Variable: Life Expectancy at Birth (Total Years)
Predictor Variables: The initial dataset included 86 potential predictor variables, but due to the complexity of handling such a large number of features, a systematic feature selection process was employed.
Feature Selection Process:
Initial Filtering:
Correlation Analysis: A Pearson correlation analysis was conducted to assess the linear relationship between each predictor variable and life expectancy. Features whose correlation had a p-value greater than 0.05 were considered statistically insignificant and were dropped from further analysis. Variables with weak correlations (absolute coefficient below 0.5) were also removed.
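A minimal sketch of this correlation filter, using `scipy.stats.pearsonr` on synthetic data. The variable names and the generated values are hypothetical; only the two cut-offs (p-value of 0.05 and absolute correlation of 0.5) and the sample size of 176 come from the description above.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 176  # size of the analytic sample

# Synthetic stand-ins for the response and two candidate predictors.
life_exp = rng.normal(70, 8, n)
candidates = {
    "strong_predictor": 0.9 * life_exp + rng.normal(0, 3, n),
    "noise_predictor": rng.normal(0, 1, n),
}

kept = []
for name, values in candidates.items():
    r, p = pearsonr(values, life_exp)
    # Keep a feature only if it is significant AND at least moderately correlated.
    if p <= 0.05 and abs(r) >= 0.5:
        kept.append(name)
    print(f"{name}: r = {r:.2f}, p = {p:.3g}")

print("kept:", kept)
```

With 176 observations, any correlation of 0.5 or more in absolute value is already significant well below the 0.05 level, so in a sample of this size the magnitude cut-off is the stricter of the two conditions.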
Multicollinearity Check: Variance Inflation Factor (VIF) scores were calculated to detect multicollinearity among the remaining predictor variables. Features with high VIF scores were dropped to reduce redundancy among the predictors.
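The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it with NumPy through an equivalent identity (the diagonal of the inverse correlation matrix). The data are synthetic and the helper name `vif_scores` is hypothetical. The post does not say which cut-off counted as "high"; values of 5 or 10 are common rules of thumb, and 10 is used here purely as an assumption.

```python
import numpy as np

def vif_scores(X):
    """Return the VIF of each column of X (rows = observations, columns = predictors)."""
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal entry of the inverse correlation matrix.
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(1)
a = rng.normal(size=176)
b = rng.normal(size=176)
c = a + 0.1 * rng.normal(size=176)  # nearly a copy of a, so a and c are collinear

vifs = vif_scores(np.column_stack([a, b, c]))

threshold = 10.0  # assumed cut-off, not taken from the study
for name, v in zip(["a", "b", "c"], vifs):
    flag = "high" if v > threshold else "ok"
    print(f"{name}: VIF = {v:.1f} ({flag})")
```

Here `a` and `c` receive very large VIF scores while `b` stays close to 1, so one of the collinear pair would be a candidate for removal.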
LASSO Regression:
A LASSO (Least Absolute Shrinkage and Selection Operator) regression model was applied to further refine the list of predictor variables. This technique shrinks all coefficients toward zero and sets those of less important variables exactly to zero, effectively selecting only the most influential predictors.
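As an illustration of how LASSO performs selection, the sketch below fits scikit-learn's `Lasso` to synthetic data in which only the first two of four standardized predictors affect the response. The data, the coefficients, and the penalty strength `alpha=0.5` are all assumptions made for the example; the post does not report the penalty actually used.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n = 176

# Four synthetic predictors, standardized so the penalty treats them equally.
X = rng.normal(size=(n, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Only columns 0 and 1 influence the response; columns 2 and 3 are pure noise.
y = 70 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, n)

model = Lasso(alpha=0.5).fit(X, y)

# Predictors whose coefficient was not shrunk to exactly zero are "selected".
selected = [j for j, coef in enumerate(model.coef_) if coef != 0.0]

print("coefficients:", np.round(model.coef_, 2))
print("selected columns:", selected)
```

The noise columns end up with coefficients of exactly zero, while the two real predictors are kept with coefficients shrunk somewhat toward zero. In practice the penalty is usually chosen by cross-validation, for example with scikit-learn's `LassoCV`, rather than fixed by hand.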
Final Predictor Variables: The variables retained after the feature selection process are listed below, with the response variable first for reference:
Life Expectancy at Birth (Years): This is the dependent (response) variable, not a predictor, and serves as the primary indicator of population health.
Health Expenditure per Capita (Current US$): Represents the financial resources spent on healthcare services per person.
Access to Electricity (% of Population): Used as a proxy for infrastructure development, which is essential for effective healthcare delivery.
Improved Sanitation Facilities (% of Population with Access): Indicates the percentage of the population with access to basic sanitation, crucial for preventing disease and improving health outcomes.
Improved Water Source (% of Population with Access): Represents the percentage of the population with access to clean drinking water.
Fertility Rate, Total (Births per Woman): Included as a control variable, as higher fertility rates can influence health outcomes and resource allocation.
Mortality Rate, Under-5 (Per 1,000 Live Births): Serves as an additional health indicator, reflecting the overall health environment.
Fixed Broadband Subscriptions (Per 100 People): Reflects technological advancement and access to information, which may indirectly affect health outcomes.
Survival to Age 65, Female (% of Cohort): Indicates the percentage of a cohort of newborn girls that would survive to age 65, highlighting longevity and the effectiveness of healthcare services.
Rural Population (% of Total Population): The proportion of a country's population that lives in rural rather than urban areas, expressed as a percentage of the total population.
GDP per Capita (Current US$): Used as a control variable for economic development, which can influence both healthcare access and life expectancy.