planetatkinson-blog
Distracted
40 posts
Australian white man, almost 60, flat out like a lizard drinking
planetatkinson-blog · 2 years ago
Photo
[Image]
Elon goes hardcore
0 notes
planetatkinson-blog · 2 years ago
Text
Preference whisperer “stung” by Animal Justice Party
Like a teenager playing Diplomacy, the Animal Justice Party submitted a different set of preferences to the Victorian Electoral Commission after agreeing to a preference swap with Victoria’s “preference whisperer” Glen Druery.  Reminds me of the time I nearly got a win with Italy.
The AJP’s lead candidate in my region (Southern Metropolitan), Ben Schultz, said “The Animal Justice party does not agree with the wheelings and dealings of a preference whisperer and the backroom deals of predominantly older, white males. That time has come to an end”.
I love the gesture of showing Druery false orders.  So adolescent!  Such a great comment on the whole Victorian preference thing!  Bring it on. 
Fellow Vics might like to know that if you vote for Animal Justice above the line, your preferences will go to
1. The Greens
2. The Socialists
3. Fiona Patten’s Reason Party
4. Legalise Cannabis Victoria
5. The ALP
Looks like a pretty good option!
You can find your seat and region here and all the party’s preferences are listed here.
8 notes
planetatkinson-blog · 2 years ago
Text
Dumbest Mastodon server yet
Not only can you talk like a dolphin, you must.
https://dolphin.town/about/more
0 notes
planetatkinson-blog · 2 years ago
Text
Octopuses throwing things at each other
[Image]
Possibly the second most intelligent species on the planet.  I don’t think so.
0 notes
planetatkinson-blog · 7 years ago
Text
Predicting Breast Cancer
Assignment 3 of the capstone course of the Coursera specialisation.
The mission is to predict breast cancer rates from other fields in the Gapminder dataset.  Step one is to run single-variable Ordinary Least Squares for all fourteen of the explanatory fields I have chosen to look at.  
Here’s a quick summary of the fields I’m using:
[Image: summary table of the fourteen explanatory fields]
And here they are with summary statistics relating to the single-variable model in which that field is used to provide an Ordinary Least Squares estimate of breast cancer rates, ranked in order of r-squared values:
[Image: single-variable OLS summary statistics, ranked by r-squared]
As you can see, the percentage of people who have internet access provides a strong single-variable model, with a p-value of practically zero and an r-squared of more than 0.5.  Here is what that model looks like:
[Image: single-variable model of breast cancer rate from internet use rate]
You might be wondering what sort of relationship there could be between internet use and breast cancer.  My preferred explanation is that internet use is a marker for inactivity and obesity, and that those things are what cause higher cancer rates.  I don’t believe that there is a causal relationship between surfing the net itself and getting cancer.
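For the record, here's a minimal sketch of the ranking step, assuming df is the Gapminder frame; the field list is illustrative, with all fourteen used in practice.

import pandas as pd
import statsmodels.formula.api as smf

fields = ['internetuserate', 'lifeexpectancy', 'urbanrate']  # ... all fourteen in practice
results = []
for field in fields:
    sub = df[['breastcancerper100th', field]].dropna()
    fit = smf.ols('breastcancerper100th ~ ' + field, data=sub).fit()
    results.append((field, fit.rsquared, fit.pvalues[field]))

# rank the candidate explanatory variables by r-squared, best first
ranking = pd.DataFrame(results, columns=['field', 'rsquared', 'pvalue'])
print(ranking.sort_values('rsquared', ascending=False))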
Three of the variables (electricity use, income per person and CO2 emissions) had so little relationship with breast cancer rates that they actually generated “ill-conditioned system” warnings from the Ordinary Least Squares solver.  What does an ill-conditioned system look like?  One image of a perfectly ill-conditioned single-variable system would be a circular region evenly filled with data points - there are many possible lines that bisect the circle and the solver cannot decide which one to use. This is roughly illustrated in the case of using electricity consumption to predict breast cancer:
[Image: electricity consumption vs breast cancer rate]
Another style of ill-conditioned system would be one in which most of the data points line up vertically.  This is illustrated with CO2 emissions as the explanatory variable:
[Image: CO2 emissions vs breast cancer rate]
Multivariate Failure.
My process for building a multivariate model was to work my way down the list in order of descending r-squared (or ascending p-values, the two rankings are only slightly different) looking for models with p-values of less than 0.05 on each variable.  After forming my collection of acceptable multivariate models, I would take the one with the lowest overall p-value and/or highest r-squared value as my preferred model.
There were very few multivariate models whose fields all carried p-values below 0.05 and of those that did, none had better model-level p-values or r-squared values than the simple single variable model based on internet usage.  It looks like the addition of extra variables added more noise than signal.
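A rough sketch of that manual process, assuming ranked_fields holds the fourteen fields in descending r-squared order:

import statsmodels.formula.api as smf

kept = []
for field in ranked_fields:
    trial = kept + [field]
    sub = df[['breastcancerper100th'] + trial].dropna()
    fit = smf.ols('breastcancerper100th ~ ' + ' + '.join(trial), data=sub).fit()
    # keep the new field only if every term stays below p = 0.05
    if (fit.pvalues.drop('Intercept') < 0.05).all():
        kept = trial
print(kept)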
In my earlier assignment I had intended to take my multivariate model and form new fields by cross-multiplying the columns. This set of columns plus cross-products would then be winnowed by Lasso regression, perhaps to reveal a multivariate model incorporating fields that were the products of some two fields from the original dataset with superior performance to the original multivariate model.
With no multivariate model to start from, I was unable to do this.
Lasso Regression Failure.
My next planned step was to do a Lasso Regression and see if that revealed some combination of fields that was more highly predictive than the largely manual process with which I had searched for a multivariate model.  Putting all fourteen fields into Lasso led to the following weightings as the best mix (the other nine fields were given weightings of zero by Lasso):
[Image: Lasso weightings for the five retained fields]
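For reference, a minimal sketch of the fit, assuming sklearn's LassoLarsCV and the same placeholder names as above, with missing values already dropped:

from sklearn.linear_model import LassoLarsCV
from sklearn.preprocessing import scale

X = scale(sub[fields])                # standardise so weightings are comparable
y = sub['breastcancerper100th']
lasso = LassoLarsCV(cv=10, precompute=False).fit(X, y)
for field, coef in zip(fields, lasso.coef_):
    print(field, coef)                # zero weighting = field dropped by Lasso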
The inclusion of electricity consumption is surprising, since on its own it leads to a system that Ordinary Least Squares flags as ill-conditioned.  The heavy (negative) weighting on Armed Forces Rate is also strange, as it is quite impossible to get an Ordinary Least Squares model worth looking at that includes that field.
The notion of Lasso as a quick way to generate a pretty good shortlist of fields to concentrate on, perhaps feeding them into some highly sophisticated modelling technique (random forest, neural net, etc.), is looking shaky.  If Lasso cannot give me a useful shortlist of fields when I am able to check things using Ordinary Least Squares, how can I trust it when I'm using a sophisticated technique that I don't understand as well as my old friend OLS?
Or maybe I’ve missed something obvious ...
Data Management.
The Gapminder dataset has 213 countries in it. Setting aside 40% of the data for validation gives us a training data set of 127 countries.  Removing rows with missing values in any of the fourteen fields (as I did before the big Lasso effort) reduces this to 33.  So I was throwing away almost three-quarters of my data in the data management step.  
Redoing my Lasso Regression and feeding it, first, just internet use rate (the strongest candidate variable in my single-variable inspection of the data), then feeding it the first two such variables (internet use rate and life expectancy) and so on, shows me that Lasso’s behaviour changes when the total amount of data drops below a threshold:
[Image: Lasso results as fields are added one at a time]
For 1 to 5 fields, Lasso essentially used all of them in its preferred solution.  For 6 and 7 fields, it started assigning zero weighting to 2 and then 3 of them.  For 8 fields it settled on two fields (internet use rate and polity score). The two-field solution remained stable until it went a bit crazy on 12 fields and started recommending junk including the armed forces rate.  The sudden extreme improvement in error and r-squared from the solution on 11 fields to the 12-field solution is a sign of junk.
So, don’t just throw everything at Lasso, particularly if you have a lot of missing values.  Pass it a growing list of fields in order of single-variable reliability and watch for a period where it gives a stable solution with few fields.
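In code, the growing-list experiment looks roughly like this (same placeholder names as the sketches above):

from sklearn.linear_model import LassoLarsCV
from sklearn.preprocessing import scale

for n in range(1, len(ranked_fields) + 1):
    trial = ranked_fields[:n]
    sub = df[['breastcancerper100th'] + trial].dropna()   # rows shrink as n grows
    lasso = LassoLarsCV(cv=10, precompute=False).fit(scale(sub[trial]),
                                                     sub['breastcancerper100th'])
    used = (lasso.coef_ != 0).sum()
    print(n, 'fields offered,', used, 'used,', len(sub), 'rows available')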
I’ll compare models on validation data for the final assignment.
0 notes
planetatkinson-blog · 7 years ago
Text
Predicting breast cancer rates
Assignment 2 of the capstone course of the data specialisation on Coursera.
The data.
I will be using the Gapminder dataset to predict breast cancer rates in different countries based on fourteen other data fields in the dataset.  The Gapminder dataset covers 213 countries, listing a variety of health, social and demographic indicators.  This exploration will confine itself to the following indicators: 
[Image: table of the fourteen explanatory indicators]
Gapminder is a charitable organisation that provides data and data tools to dispel ignorance about the world.  Further details can be found at https://www.gapminder.org/ 
Statistical methods
40% of the data will be set aside for verification of predictive models.
The fourteen explanatory variables will be ranked by the strength of their relationship (p-value, r-squared) with the target variable (breast cancer rate) and then used, in rank order, to build a multivariable Ordinary Least Squares model.
Lasso regression will be used to find a model from all 14 explanatory variables.
A second Lasso regression will be used to find a model built from the same variables that were used in the Ordinary Least Squares model, with additional fields built by cross-multiplying columns.
The Lasso regressions will be considered for implications for additional features that might add value to the Ordinary Least Squares approach.  If such implications are found, a second Ordinary Least Squares model will be built using features identified by Lasso.  
The three or four models built will be compared for predictive power using the 40% of data that was set aside at the start, and compared in a more subjective way for explanatory value.
Data Wrangling Considerations
Although the dataset contains 213 countries, missing values will reduce the number of countries covered as the quantity of fields grows.  Attention will be paid to this aspect of the data during the project, and I will be alert to the possibility that a simpler model can be better if its simplicity allows it more data as input.
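A minimal sketch of the kind of check I have in mind, assuming df is the Gapminder frame and fields is the candidate list in rank order:

for n in range(1, len(fields) + 1):
    complete = df[['breastcancerper100th'] + fields[:n]].dropna()
    print(n, 'fields ->', len(complete), 'complete countries')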
0 notes
planetatkinson-blog · 7 years ago
Text
Predicting Blood Donations
This is a friendly competition hosted by DrivenData.  The aim is to predict whether a blood donor will return to donate blood given their donation history.  Successful prediction of donor behaviour enables forward planning of the entire blood supply chain and allows managers to identify shortfalls in blood products before they happen.
The training dataset contains the histories of 576 blood donors, with the response variable being whether or not each donor gave blood in March 2007.  Predictor variables are:
Months since the donor’s last donation
Months since the donor’s first donation
Number of donations given by that donor
A fourth variable, volume of blood donated, is simply the number of donations multiplied by a constant (250 cc).  This field adds no information to the number of donations given and is being left out of my analysis.
Two other features have been added to the predictor variables:
Number of months between first and last donations (length of career as a donor)
Rate of donations (number of donations divided by length of career as a donor)
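In pandas, the two derived features look roughly like this - the column names are my paraphrases of the DrivenData headers and the filename is hypothetical:

import pandas as pd

df = pd.read_csv('blood_train.csv')   # hypothetical filename
df['career_months'] = df['months_since_first'] - df['months_since_last']
# guard against donors whose first and last donation fall in the same month
df['donation_rate'] = df['num_donations'] / df['career_months'].clip(lower=1)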
This is a clean and very simple dataset which allows me to focus on the algorithmic side of data analysis rather than data cleaning and working through many columns.  The predictor variables are highly skewed and the response variable is biased with 138 donors giving blood in March 2007 and 428 not doing so (a roughly 3:1 imbalance) - these are two issues I intend to look at soon in the project.
0 notes
planetatkinson-blog · 8 years ago
Text
K-means - the Final Frenzy
As so often happens, my ambitions leapt well ahead of my capabilities.  I had hoped to:
1. Run ANOVA and Tukey’s significance test on all values of k (where k is the number of clusters in my k-means clustering) and see if the p-values, etc., moved in a fashion that corresponded to the quality of clustering in the plots.
2. Play with imputing missing values via the column average of all rows in the same cluster, rather than the column average of all rows in the whole dataset, and just write some code using the sklearn.preprocessing.Imputer module.
But instead I had a prolonged brain spasm over the indices of my DataFrames and the motivation for the .reset_index() command in the course’s official code.  So most of my energy this week went into picking my way through this.  
But before we start spasming around, we need to deal with the preliminaries. We’re working with a subset of the AddHealth data.  The response variable is the student’s grade in mathematics, and we are looking at various attitudinal and family relationship measures as explanatory variables.  After importing all the usual libraries, we do a bunch of data management:
[Image: data management code]
There are two things to note here.  Firstly, there are a LOT of np.nan values leading to rows being dropped in the df.dropna() command.  Secondly, printing the length of my data before and after the df.dropna() step was the last and most useful thing that I did in the course of getting my head around the indexes in Pandas.  
Before the .dropna() the dataset was 6504 rows long and its index ran from 0 to 6503.  After the .dropna() it was just 3939 rows long, but its index still contained values between 0 and 6503, inherited from the original index.  Now I get it.  The index starts as a list of consecutive numbers, but it doesn’t have to stay that way; it functions more like the keys of a dictionary.
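A tiny demonstration with made-up data:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan, 5.0]})
clean = toy.dropna()
print(len(clean))     # 3 rows ...
print(clean.index)    # ... but the labels are still [0, 2, 4], not [0, 1, 2]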
Note also that I have recoded the explanatory variables so that high values are associated with “good” answers.  This might also make interpretation easier later on.
Next we sort out our columns a bit more and then do the test-train split:
[Image: column selection and test-train split code]
The training data has 2757 rows and the test data has 1182.  Both of these datasets are indexed with values inherited from the original 0-6503 range in the initial dataset.
Next we do the “elbow plot”:
[Image: elbow plot code]
which looks like this:
[Image: elbow plot]
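The loop behind a plot like this is short - a sketch, assuming train holds the clustering variables:

import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

ks = range(1, 10)
mean_dist = []
for k in ks:
    model = KMeans(n_clusters=k).fit(train)
    # average distance from each row to its nearest cluster centre
    mean_dist.append(cdist(train, model.cluster_centers_).min(axis=1).mean())
plt.plot(ks, mean_dist)
plt.xlabel('number of clusters k')
plt.ylabel('average distance to nearest centroid')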
So we have a broad elbow between 2 and 4.  My inclination is to pick 3 as the “right” value to use for k-means.  But first I’ll run a loop and look at the PCA breakdown into 2 canonical variables for all of the values between 1 and 9.  Here’s the code:
[Image: code looping over k with PCA plots]
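Roughly what that loop does, assuming train as above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(train)   # the 2 canonical variables
for k in range(1, 10):
    labels = KMeans(n_clusters=k).fit_predict(train)
    plt.figure()
    plt.scatter(coords[:, 0], coords[:, 1], c=labels)  # colour by cluster
    plt.title('k = %d' % k)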
The outputs for k = 2, 3 and 4 are of interest:
[Images: canonical-variable scatter plots for k = 2, 3 and 4]
The k=2 model has two distinct clusters when viewed via the canonical variables and the addition of another cluster maintains distinct clusters, but the addition of a fourth cluster causes confused interpenetration of clusters.  So k=3 is the best pick.
[Image: code and printout of the clustered DataFrame]
And in the code above we see my print of what we have and a comment on what it is.  model is a DataFrame and its ‘labels_’ column contains the cluster assignment.  The index still has values running from 0 to 6503.
Next you see a lot of code with multiple print statements as my poor brain tries to work out what is going on.  I’m not going to say a lot about this, but there are a couple of points worth noting:
1. The .reset_index() command creates a new index that runs from 0 to the length of the data frame in order.  It also puts the old index values into a column named “index”.  So there’s the index and there’s the column whose name is “index”.  Confusing, or what?
2. I suspect the entire data manipulation process is excessively long-winded.  I think a complete rewrite is the way to go here.
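A tiny demo of point 1, with made-up data:

import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30]}, index=[4, 7, 9])
flat = toy.reset_index()
print(flat.index)       # fresh index: 0, 1, 2
print(flat['index'])    # old labels 4, 7, 9, now in a column named 'index'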
[Image: index-manipulation code with print statements]
The value counts at the end are:
[Image: cluster value counts]
The largest cluster is about twice the size of the smallest.  This does not seem excessively lopsided to me.  All good.
Looking at the clusters in detail:
[Images: cluster detail tables]
The “index” column is meaningless, an artefact of the mindless index manipulations.  
We see cluster 1 (the largest, with 1274 members) gives the highest result for marks in maths (the response var, h1ed12) and is associated with positive values for all the other fields (remember, positive values were associated with “good” answers in the data management stage).  These are the successful, well-adjusted, young people.
There are two distinct clusters of students who are not so successful at maths.  I couldn’t be bothered describing them in detail at this stage, save to observe that they do differ significantly on quite a few fields.  
Now to check that the clusters are statistically meaningful:
[Image: ANOVA and Tukey test code]
Which gives us the output:
[Image: ANOVA and Tukey test output]
So, yes.  The grouping by clusters has a solid p-value and all three of the two-way cluster-partitions are significant by Tukey’s Test.  Whew!
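For reference, the significance check follows the standard course pattern - a sketch, assuming merged holds the response (h1ed12) and a 'cluster' column with the assignments:

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

anova = smf.ols('h1ed12 ~ C(cluster)', data=merged).fit()
print(anova.summary())                 # model-level F-test and p-value

mc = multi.MultiComparison(merged['h1ed12'], merged['cluster'])
print(mc.tukeyhsd().summary())         # all pairwise cluster comparisons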
1 note
planetatkinson-blog · 8 years ago
Text
Assignment 3: Lasso
I decided to use exactly the same data as I did in assignment 2, to see if Lasso differs much from Random Forest in its choice of most significant variables (spoiler alert: it differs a LOT).
At the start I had a few questions about Lasso.  The comments in my code include:
'''
main topics of interest
1. Will Lasso agree with Random Forest about the order of importance of my predictors (answer: NO)
2. Does it matter if I don't set sex to male as per the lecture? (answer: NO)
3. What happens with correlated predictors? (answer: don't have time to investigate)
4. Can I plot MSE or R-squared as a function of alpha operating on test/validation data? (answer: don't have time to make the attempt)
5. Will I actually have time to look at these issues? (answer: not really)
'''
None of the above topics will be addressed today.  Lasso only eliminated 2 of the 12 variables, which made me wonder if lasso was overfitting.  But the real kicker is that the final result has an R-squared of only 3.3%.  I am modelling a set of explanatory variables that have almost no connection with the response variable.  What a waste of time.  Anyway, here’s the write up done in the prescribed manner ...
A lasso regression analysis was conducted to identify a subset of variables from a pool of 12 categorical and quantitative predictor variables that best predicted a quantitative response variable measuring major depression.  Categorical predictors included census division, building type, sex (binary), parental death during childhood (binary), respondent having drunk at least 1 drink in life (binary), respondent having drunk at least 12 drinks in the last 12 months (binary), drinking status category, alcoholic father (binary) and alcoholic mother (binary).  Quantitative predictors were age, number of persons in household and number of persons over 18 in household.  All predictors were standardised to have a mean of zero and a standard deviation of one.
Data were randomly split into a training set that included 70% of the observations (n = 23933 observations) and a test set of 10258 observations.  The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set.  The change in cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
Figure 1.  Change in the validation MSE at each step
[Image: Figure 1]
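A sketch of the estimation step as described (predictors standardised, least angle regression, 10-fold CV); predictors, target and predictor_names are placeholders:

from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

X = scale(predictors)                  # mean 0, sd 1, as stated above
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3)
model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)
print(dict(zip(predictor_names, model.coef_)))   # zeros = eliminated by lasso
print(model.score(X_test, y_test))               # R-squared on the test set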
Of the 12 predictor variables, 10 were retained in the selected model.  Sex was the variable most strongly correlated with major depression.  The correlation was negative, indicating a strong relationship between femaleness and depression, because the field had been recoded to 0 = female and 1 = male.  Next most strongly correlated were the two ‘alcoholic parent’ fields - if I had more time I would cross-reference this against sex to determine whether it is same-sex versus opposite-sex or mother versus father that matters.
Next, in declining order of importance, were: number of persons over 18 in the household, drank at least 12 drinks in the past 12 months, and total number of persons in household.
Of minimal importance, but not actually eliminated by lasso, were census division, age and building type.  Lasso assigned zero significance to alcohol consumption style (the drinking status category) and drank at least 1 drink ever.
The 10 variables accounted for a miserly 3.3% of variation (the R-square value on the test data) in depression.  I think this miserable statistic tells us that the explanatory variables have basically no relationship with the response variable.  The model has been a waste of time, apart from the practice it has given me in writing code.
I’m not going back to do it again or to explore the questions I posed at the start of this post.  It’s time to move on.
1 note
planetatkinson-blog · 8 years ago
Text
Assignment 2 Random Forest
Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable.  The response variable was MAJORDEPLIFE from the NESARC dataset, which informs us whether or not the subject has clinical depression.  
Explanatory variables were:
'CENDIV': 'Census division (location)',
'BUILDTYP': 'Building type',
'NUMPERS': 'Number persons in house',
'NUMPER18': 'Number persons over 18 in house',
'AGE': 'Age',
'SEX': 'Sex',
'S1Q2K': 'Parental death during childhood (stress/trauma)',
'S2AQ1': 'Drank at least 1 drink in life',
'S2AQ2': 'Drank at least 12 drinks in last 12 months',
'CONSUMER': 'Drinking status category',
'S2BQ1A1': 'Alcoholic father',
'S2BQ1A4': 'Alcoholic mother',
The focus of the enquiry was on whether alcohol consumption is linked to depression.  
Random forests with varying numbers of classifiers were tested with accuracy noted.  The chart of accuracy as a function of number of classifiers follows.
[Image: accuracy as a function of number of classifiers]
The curve of this chart is essentially flat for values of 20+, so 20 was chosen as the number of classifiers to use.  The resulting random forest had an accuracy of 77.9%.  The five most important variables for depression were (in order of importance): age, census division, building type, number of persons in household, number of persons over 18 in household.  
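A sketch of the loop and the importance extraction, assuming the usual train/test split and a feature_names list:

from sklearn.ensemble import RandomForestClassifier

for n in range(1, 51):
    rf = RandomForestClassifier(n_estimators=n).fit(X_train, y_train)
    print(n, rf.score(X_test, y_test))     # accuracy as a function of forest size

rf = RandomForestClassifier(n_estimators=20).fit(X_train, y_train)
ranked = sorted(zip(feature_names, rf.feature_importances_), key=lambda t: -t[1])
for name, importance in ranked:
    print(name, importance)                # most important variables first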
So, we see that demographic factors such as age and where you live are more important than anything to do with alcohol as far as depression is concerned. The motivating idea that alcohol consumption may lead to depression was not supported by this data.  Rather, it seems that people are more likely to get depressed as time goes by (age) and that social and circumstantial factors like neighbourhood and social class are the stronger drivers of depression.
0 notes
planetatkinson-blog · 8 years ago
Text
My first tree
[Image: decision tree]
I used a decision tree to model the relationship between a binary categorical output and five continuous explanatory variables.  Modelling was done with the Gapminder dataset, with the response variable set to 0 for countries with below-median rates of breast cancer and 1 for countries with above-median rates.  The Gini criterion was used to grow the tree.  Since Python’s library lacks tree pruning, I ran through a range of values for the maximum number of leaves, looking for the one that gave maximum accuracy at the lowest number of leaves.  This optimum number of leaves for my data was 10.
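A sketch of that workaround, sweeping max_leaf_nodes and keeping the smallest tree with the best test accuracy:

from sklearn.tree import DecisionTreeClassifier

best_acc, best_leaves = 0.0, None
for leaves in range(2, 30):
    tree = DecisionTreeClassifier(criterion='gini',
                                  max_leaf_nodes=leaves).fit(X_train, y_train)
    acc = tree.score(X_test, y_test)
    if acc > best_acc:        # strict '>' keeps the lowest leaf count on ties
        best_acc, best_leaves = acc, leaves
print(best_leaves, best_acc)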
The relationship between the explanatory variables and the output picture above is:
[Image]
In statistical modelling (done in the previous course), alcohol consumption had the strongest statistical association with breast cancer, so it is no surprise that this variable was the first to separate the sample.  
Overall, the decision tree classified 83 countries correctly and 11 incorrectly on the test data (just over 88% accuracy).
I notice that all the splits are at the value of 0.5, which seems like an artefact of the algorithm.  This is disappointing, and I feel that the standard Python toolkit is lacking in this area.  If time permits I may experiment with alternatives, such as the xgboost algorithm as described in the tutorial here.
Other things that I learned were:
1.  For many consecutive values of max_leaves the output using test data did not change in accuracy.  My feeling is that there weren’t enough rows in the test data to comprehensively probe the different branches of the trees.  I will switch to a larger dataset for work on machine learning.  94 rows is not exactly “big data”!
2.  Results varied considerably when I chose different values for the random seed in the train_test_split() call.  Again, I think this is probably due to my dataset being too limited, but I’m not sure.
0 notes
planetatkinson-blog · 8 years ago
Text
Assignment 4 - Logistic Regression
Once again, I’m using the Gapminder dataset to explore causation of breast cancer.  This time with logistic regression.  
To make the problem fit with logistic regression, I’ve binned all my variables into two-valued fields.  A value of 1 means the country has an above-median level of the field in question, a value of 0 means the country has a below-median value.
The code should be here, but I’m having trouble getting it uploaded to bitbucket.  If you don’t see binning for all variables at the start, it means I have been defeated by bitbucket and the real code will appear in a later post on this blog.
First, I’m trying out alcohol consumption in a basic 1-factor model.  Here’s the result:
[Image: one-factor logistic regression summary]
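Roughly how a model like this is fitted and the odds ratio pulled out, assuming the binned 0/1 fields described above (the column names here are mine):

import numpy as np
import statsmodels.formula.api as smf

model = smf.logit('high_breastcancer ~ high_alcohol', data=binned).fit()
print(model.summary())
print(np.exp(model.params))        # exponentiated coefficients = odds ratios
print(np.exp(model.conf_int()))    # 95% confidence intervals for the ORs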
So, the association between alcohol consumption and breast cancer is statistically significant (p < 0.05).  People in countries with above-median alcohol consumption are 11 times more likely (OR = 11.46, 95% CI = 5.98 - 21.96) to suffer above-median breast cancer rates than people in countries with below-median alcohol consumption. 
Following my process from last week, I added incomeperperson into the model, giving this:
[Image: two-factor logistic regression summary]
Both alcohol consumption and income are significantly associated with breast cancer (p < 0.05 for both of these variables).  The Odds Ratios are almost identical for the two explanatory variables (for alcohol OR = 6.55, 95% CI = 3.23 - 13.31 after controlling for income; for income OR = 6.32, 95% CI = 3.11 - 12.85 after controlling for alcohol) and the overlapping confidence intervals for the Odds Ratios mean we cannot say which is more significant.
For a third factor, I’ve chosen life expectancy.  The resulting three-factor model’s summary is:
[Image: three-factor logistic regression summary]
All three variables have p-values that indicate statistical significance.  So all three of these variables are separately associated with breast cancer rates.  
After adjusting for potential confounders (income and life expectancy), the odds of a citizen of a country with above-median alcohol consumption having above-median breast cancer risk are five times those of a person with below-median alcohol consumption (OR = 5.83, 95% CI = 2.76 - 12.32, p = 0.000).  After adjusting for alcohol consumption and life expectancy, the odds of a citizen of a country with above-median income having breast cancer are three times those of a citizen of a below-median income country (OR = 3.01, 95% CI = 1.33 - 6.82, p = 0.008).  After adjusting for alcohol consumption and income, the odds of a citizen of a country with above-median life expectancy having breast cancer are five times those of a citizen of a country with below-median life expectancy (OR = 5.08, 95% CI = 2.27 - 11.36, p = 0.000).  This is going to get really boring for the four-factor model!
For the four-factor model, I threw in polityscore, which is an attempt to measure freedom/democracy across different countries.  Here’s the four-factor model’s summary:
[Image: four-factor logistic regression summary]
All four fields have p-values below 0.05.  They’re all statistically significant.  They all have overlapping confidence intervals for their odds ratios, so we cannot say at the 95% confidence level if any one is more statistically significant than any other one.  
I am required by Coursera to write out all of the Odds Ratios, etc., in the “official” manner.  God help me, here we go.  No!  I won't do it!  Look at the bloody numbers for yourself.
[Image: odds ratios and confidence intervals]
Let’s get on to the five-factor model.  I’ve thrown in urbanrate in the hope that it will give me something interesting to talk about.  Here we go -
[Image: five-factor logistic regression summary]
At last we have something different!  Urbanrate (p = 0.052) and income (p = 0.179) are not statistically significant in the five-factor model.  This means that those two variables are confounding each other.  We could drop urbanrate and revert to the previous four-factor model or we could drop income and end up with a different (and perhaps slightly better) four-factor model.  Given that the p-value associated with income is higher, I suspect that the model without income will be better than the model without urbanrate.  Let’s see if I’m right ...
[Image: four-factor model (without income) summary]
The resulting model has all four fields statistically significant (p-values below 0.05).  Its pseudo r-square of 0.4101 beats the other four-factor model’s 0.4034, so it is slightly better.  This will be my final model for this exercise.
With the highest Odds Ratio (OR = 5.23, 95% CI = 2.42 - 11.30, p = 0.000), the primary explanatory variable I chose for my modelling effort (alcohol consumption) is indeed supported (lucky guess!).  After controlling for urbanrate, life expectancy and polity score, a person in a country with above-median alcohol consumption has about five times the chance of developing breast cancer compared to a person in a country with below-median alcohol consumption.  
Lucky for me I don’t have breasts.
The next highest risk factor for breast cancer in this model is life expectancy.  I guess the longer you live the more chance you have of getting cancer.  After controlling for alcohol consumption, urbanrate and polity score, a person in a country with above-median life expectancy has about four times higher risk of developing breast cancer than a person in a country with below-median life expectancy.  OR = 4.19, 95% CI = 1.84 - 9.54, p = 0.001.
Polity score and urban rate also impact the odds of getting breast cancer.  I was a little surprised that these two fields did not confound, as I had imagined a correlation between the two.  Now that I think about it, the planet has several giant metropolises in poorly governed countries, so their independence does make sense.  A person in a country with above-median polity score has about 3-4 times the risk of developing breast cancer as a person in a country with below-median polity score (OR = 3.69, 95% CI = 1.67 - 8.14, p = 0.001) after controlling for alcohol consumption, urban rate and life expectancy.  A person in a country with above-median urban rate has about three times the risk of developing breast cancer as a person in a country with below-median urban rate, after controlling for alcohol consumption, life expectancy and polity score (OR = 3.10, 95% CI = 1.37 - 7.04, p = 0.007).
If the pseudo r-square of 0.4101 is meant to be interpreted in the same way as r-square in least squares regression, then the model lacks explanatory power.  Almost 59% of variation in breast cancer rates is not explained by this model.  
Question
That ends the assessable part of the assignment.  I have a question, and if someone can give me a clue, please contact me through Tumblr’s messaging system (whatever that is).  
I am familiar with least squares generating a hyperplane 
y = f(x1, x2, x3, ...) where f is a linear function
that interpolates the dots that are the observations in a way that minimises the least squares measure of error
error = SUM((f(X) - y)^2)
I have encountered logistic regression as being a process of mapping the hyperplane’s y-value onto the unit interval (0,1) by some sigmoid function.  (I had tanh in mind, though tanh actually maps onto (-1,1); the standard choice is the logistic function p = 1/(1 + e^-y), which does map onto (0,1).)  And then you can interpret p as a probability.
The business of taking the exponent of the coef to obtain the Odds Ratio resembles this, but I can’t quite join the dots.  My effort to put the intercept and coef values into a tanh in my notes started like this:
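One way to join those dots, assuming the standard logistic link rather than tanh: if

p = 1 / (1 + e^-y),  y = b0 + b1*x1 + b2*x2 + ...

then rearranging gives

p / (1 - p) = e^y = e^b0 * e^(b1*x1) * e^(b2*x2) * ...

so increasing x1 by one unit multiplies the odds p/(1-p) by the constant factor e^b1.  That factor would be the Odds Ratio, which is why exponentiating the coef produces the OR.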
[Image: start of my notes]
And ended like this:
[Image: end of my notes]
Can anyone help?
0 notes
planetatkinson-blog · 8 years ago
Text
more on assignment 3
As bloody usual, I’ve done a report, posted it and then when I get to putting the URL into Coursera I get a bunch of tick-a-box assessment rubrics that I must retrospectively address.
Report whether or not your results supported your hypothesis for the association between your primary explanatory and response variables.
Yes, the primary explanatory variable was alcohol consumption, and the results of the simple (single factor) model supported the hypothesis of association between it and the response variable (breast cancer), with p < 0.05 and an r-squared of approximately 0.3.
Discuss whether or not there was evidence of confounding for the association between your primary explanatory and response variable.
The two-factor model had strong statistical significance between both explanatory variables (alcohol consumption and per capita income) and the response variable.  There was no confounding relationship; it was a genuine two-factor situation.
Generate regression diagnostic plots and write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outliers.
Covered in the previous post.
Include your multiple regression output in your blog.
Yeah, I did that in the previous post.
(Optional) Help your peer to refine the report.
What to do about kurtosis?  I found some articles via Google, but they are behind something called ResearchGate, which requires me to present myself as a researcher or something.  What do I have to do to get into ResearchGate?
0 notes
planetatkinson-blog · 8 years ago
Text
Assignment 3 - assessing a multiple regression model
At last the course is getting seriously “statistical”, with multiple regression appearing for the first time, plus a slew of confusing tools to assess and improve our multiple regression models.  This assignment shows how much I still have to learn.
I noticed last week that my model for breast cancer as a function of alcohol consumption based on the Gapminder dataset had Australia as an outlier, with a much higher breast cancer rate than would be expected based on the country’s alcohol consumption.  The model had a good p-value but a very ordinary r-squared (a bit under 0.3 as I recall), indicating that breast cancer depends on something else besides alcohol consumption.  
If there’s one thing that stands out about Australia, it is that we are bloody rich.  So I decided to try adding income per capita into my model as a second variable as my first excursion into the land of multiple regression.  
First, a quick review of the original single-factor model:
[Image: single-factor model summary]
And now a look at the two-factor model:
[Image: two-factor model summary]
As you can see, the two-factor model is much better, with an improved p-value and a greatly improved r-squared.  Both alcohol consumption and per capita income are significantly associated with breast cancer rate.
Now the assignment gets interesting - what do we make of the mysterious Q-Q plot?
[Image: Q-Q plot of residuals]
The main thing about the Q-Q plot is the divergence above the line for high quantiles and below it for low quantiles.  Although there is also some wiggling in the middle section, I don’t think that is as significant.
I did some googling and found a nice picture in the first graphic of a YouTube video about Q-Q plots here.  I agree with the YouTuber that histograms are easier to get one’s head around than Q-Q plots, at least at my level of familiarity with them.  His Q-Q plot summary indicates that this type of Q-Q plot means fat tails:
[Image: Q-Q plot shapes summary from the video]
The histogram of my own model’s residuals looks like this:
[Image: histogram of residuals]
And now we see why we use Q-Q plots as well as histograms.  I cannot tell just by looking at the histogram that the tails are fatter than in a standard normal distribution.  I could probably pick it if it were skewed, but kurtosis is a bit much.  The Q-Q plot makes kurtosis more observable.
Moving on to the next tool, let’s have a look at the scatter plot of standardised residuals:
[Image: scatter plot of standardised residuals]
Theory says that about 1% of the points should be beyond +/-2.5 standard deviations - that would be 1.7 observations.  This dataset has heaps more than that many dots at far-away positions.  This confirms what we saw in the Q-Q plot: the model has fat tails.  There are maybe a few more dots just below the central line than above, suggesting weak positive skew, but really it is the kurtosis that stands out.
The leverage plot looks like this:
[Image: leverage plot]
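For reference, a sketch of how these three diagnostics come out of statsmodels, assuming reg_two is the fitted two-factor model (the name is mine):

import matplotlib.pyplot as plt
import statsmodels.api as sm

sm.qqplot(reg_two.resid, line='r')                 # Q-Q plot of residuals

stdres = (reg_two.resid - reg_two.resid.mean()) / reg_two.resid.std()
plt.figure()
plt.plot(stdres, 'o', ls='None')                   # standardised residuals
plt.axhline(y=0, color='r')

fig = plt.figure(figsize=(12, 8))
sm.graphics.influence_plot(reg_two, ax=fig.gca(), size=8)   # leverage plot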
Observation number 111 has a lot of influence on the model parameters and its standardised residual is well off centre at -2 standard deviations.  The model may well be improved by excluding this outlier.
So What?
So we have a better model than before (better p-value, better r-squared), and we can see that it still has flaws - it exhibits high kurtosis and there is one definite outlier in the data.
Can we sort out the kurtosis by “going quadratic”?  Here’s the summary of my quadratic model in the two explanatory variables:
[Image: quadratic model summary]
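For the record, "going quadratic" with the statsmodels formula API looks roughly like this - a sketch assuming centered columns like those in my earlier code (centered_income is hypothetical):

import statsmodels.formula.api as smf

quad = smf.ols('breastcancerper100th ~ centered_alcohol + I(centered_alcohol**2)'
               ' + centered_income + I(centered_income**2)', data=sub1).fit()
print(quad.summary())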
In this model, alcohol squared has a non-significant p-value but all the other variables are significant.  The model has a slightly better p-value and r-squared.  The Q-Q plot is basically unchanged (I’m not going to bother posting it here, it really does look the same) and indicates the same high kurtosis as before.  So I would need to do something else to get a lower-kurtosis model.
The other issue with the two-factor model was observation number 111.  Dropping that observation from the dataset and then rerunning the two-factor model gives the following result:
[Image: two-factor model summary, outlier removed]
Again, a very slight improvement.  The leverage chart looks just like the previous one, except that the outlier isn’t there (again, there’s no point in posting a repetitive chart).
So, we have tools.  We can use them to achieve improved models.  But we don’t know how to fix high kurtosis.
0 notes
planetatkinson-blog · 8 years ago
Text
breast cancer and alcohol - code listing
Here’s the boring bit.  The code for the previous post.
import os
import pandas as pd
import seaborn as sns             # needed for the plots below
import matplotlib.pyplot as plt   # (these two imports were missing)
import statsmodels.api as sm
import statsmodels.formula.api as smf

data_dir = os.environ['COURSERA_DATA_DIR']
df = pd.read_csv(os.path.join(data_dir, 'gapminder.csv'), low_memory = False)

sub1 = pd.DataFrame()
sub1['breastcancerper100th'] = pd.to_numeric(df['breastcancerper100th'],
    errors = 'coerce')
sub1['alcconsumption'] = pd.to_numeric(df['alcconsumption'],
    errors = 'coerce')
sub1 = sub1.dropna()

# create two-value explanatory var
mean_alcconsumption = sub1['alcconsumption'].mean()
print(mean_alcconsumption)
def is_high_alcohol(x):
    if x > mean_alcconsumption:
        return 1
    return 0
sub1['high_alcohol'] = sub1['alcconsumption'].map(is_high_alcohol)

# frequency table for explanatory var
group = sub1.groupby('high_alcohol')
print(group.count())

reg1 = smf.ols('breastcancerper100th ~ high_alcohol', data = sub1).fit()
print(reg1.summary())

sns.factorplot(x = 'high_alcohol', y = 'breastcancerper100th',
               data = sub1, kind = 'bar', ci = None)
plt.xlabel('high alcohol')
plt.ylabel('breast cancer per 100 thousand')

# quantitative explanatory var, centered at the mean
sub1['centered_alcohol'] = sub1['alcconsumption'] - mean_alcconsumption
reg2 = smf.ols('breastcancerper100th ~ centered_alcohol', data = sub1).fit()
print(reg2.summary())
plt.figure()
sns.regplot(x = 'centered_alcohol', y = 'breastcancerper100th',
            data = sub1)
plt.xlabel('alcohol consumption')
plt.ylabel('breast cancer per 100 thousand')
0 notes
planetatkinson-blog · 8 years ago
Text
alcohol consumption and breast cancer
I’m taking a quick look at national alcohol consumption rate as an explanation of breast cancer rates, using the Gapminder dataset.  Countries without valid values in both the alcconsumption and breastcancerper100th fields have been excluded, which leaves me with 168 countries.
I’ve collapsed alcohol consumption into a two-category field which is called high_alcohol in my code.  Countries with alcconsumption higher than the mean have high_alcohol = 1 and countries with alcconsumption below the mean have high_alcohol = 0.  The frequency distribution of this explanatory variable looks like this:
[Image: frequency distribution of high_alcohol]
The mean rate of alcohol consumption was 6.616, measured in litres per person per year.
Taking a linear regression of breast cancer rate as a function of alcohol consumption gives a p-value of 2.6e-15, which indicates a strong statistical relationship (p is practically zero), but an r-squared of only 0.315, so only about 31.5% of variation in breast cancer rates is actually explained by this variable.  I suspect there is some other factor that has more to do with breast cancer rates than alcohol consumption, but at this stage I don’t know what.
The model has an intercept of 26.0221 and the high_alcohol field has a coefficient of 25.8245.  In other words, it predicts a breast cancer rate of about 26 per 100,000 for low-alcohol countries and about 52 (26.0 + 25.8) for high-alcohol countries.
[Image: regression summary]
Because I want the practice, I also set this up with a quantitative variable, alcohol consumption, centered at the mean.  This quantitative explanatory model also has a very strong p-value (1.09e-11) and a lowish r-squared (0.243).
Have a look at the scatter chart.  This is what a statistically significant but incomplete explanation of an effect looks like:
[Image: centered alcohol consumption vs breast cancer rate]
For what it’s worth, Australia has a centered_alcohol of about 4 (above zero, so the typical Aussie drinks more than the average global citizen) and a breast cancer rate of a bit above 80.  So there’s a lot more breast cancer in Australia than is explained by the rate of drinking.
That country way over to the right with the highest rate of boozing on planet Earth is Moldova.
0 notes
planetatkinson-blog · 8 years ago
Text
Bitbucket link problem solved
In the previous post I put a link to the current version of my code on bitbucket and then whined about how it didn't work.  It works fine in Chrome and Firefox, but Safari doesn’t like the blanks in the link URL.  
I also dumped the link into the post as a piece of text, but Tumblr itself stumbled on the blank space in the link that time, resulting in a truly broken link.
So I can either rename everything with no blank spaces (lots of underscores, I guess) or I can stop using Safari and tell anyone viewing this blog not to use Safari.  Fuuuuuuck!
0 notes