#anyway about the data overfitting issue ...
lochnestfarm · 1 year ago
Text
Things I could say but probably shouldn't: "Sorry I'm late to this directed feedback Zoom meeting; I do have thoughts about our RLHF approach to this turn, but I had to reverse a vent prolapse in a pullet."
Part-time farming is weird.
0 notes
thosearentcrimes · 3 years ago
Text
Let’s say you’re a dipshit who has stumbled into a field of study you know nothing about, but of which you consider yourself the savior. You know that in order to really sell your novel approach (overfitting vaguely-defined and arbitrary metrics onto the already observed data) to the field, you will need a groundbreaking new theory. But wait, your approach can’t actually create such a novel hypothesis. There is a tradition beloved of semi-frauds, where you take some research done in another language, translate it to your own, and then pass it off as original work. But that won’t work in this case, there probably aren’t any novel theories big enough for your purposes out there, and anyway translating is hard work and you’ll probably just get caught.
What if, instead, you translate from your own language? Say you’ve decided to definitively explain, once and for all, the causes of social unrest and upheaval. Well, there are some venerable (read: boring to historians) theories explaining it in terms of a struggle for political power between an economically and socially ascending but politically marginalized class, such as the bourgeoisie, and an existing ruling class, such as the old aristocracy. A nice and simple theory, but tragically, everyone already knows about it, and it’s been displaced by shiny new theories.
Much as the scholastic essay plagiarist, you exchange all the words for synonyms, and pass it off as new work. You’ll have to work a little harder, but it should be simple enough. First, confuse everyone by redefining terms to mean things they obviously do not mean. For example, you could claim that the “elite” does not consist of those people who wield power within a society (even though this is the definition you yourself use elsewhere), but those who feel entitled to wield power within a society (this could be a bourgeois manufactory owner, or a barista with an English Literature degree, doesn’t really matter). Then, you reframe the issue around that word, now bereft of its former meaning.
So you say that the problem, rather than a politically marginalized class seeking power proportionate to its economic and social influence, is that there is “elite overproduction” (this will play well in the press, everyone loves being anti-elitist), and since anyone who believes they should exercise political power is an elite according to your definition, this means more or less the same thing except much more vague. Hopefully nobody notices that all you’ve managed to figure out is that political crises are caused by more people wanting power than there is power to go around.
Now for all I know, this isn’t actually what Peter Turchin did. It’s possible he’s more ignorant than he is a fraud. There’s a fair chance that many of the people out there inventing the wheel genuinely aren’t copying existing wheels, and have figured out for themselves that a round object rolling has much less friction than a flat object against the ground. But I think it would nonetheless be a bad idea to give research institutes or scholarly journals to people on the merit of their successful invention of the wheel.
4 notes · View notes
wayneradinsky · 4 years ago
Link
'A statistician teaches deep learning'. He says, 'Statisticians have different training and instincts than computer scientists.' OMG is that so true. I realized that when I first tried to use the R programming language. R is a programming language made by and for statisticians, not computer scientists. I wish I had made a list of all the weird things about it (which I can't remember any more because I don't use it on a regular basis, and probably won't as more and more of its functionality gets replicated in the Python world), as it is one of the wackiest programming languages in the world. I guess my mind is more "computer sciency"; statisticians seem twisted to me. Which is not to say I'm not going to learn more statistics (always have to learn more statistics).
Anyway, so, what does this statistician say the differences between statisticians and computer scientists are? Well, not computer scientists in general; he focuses on "deep learning computer scientists". He says:
"One contrast between statisticians and deep learning computer scientists is that we generally make predictions and classifications using linear combinations of basis elements -- this is the cornerstone of nearly all regression and much classification, including support vector machines, boosting, bagging, and stacking (nearest-neighbor methods are a rare exception). But deep learning uses compositions of activation functions of weighted linear combinations of inputs. Function composition enables one to train the network through the chain rule for derivatives, which is the heart of the backpropagation algorithm."
"The statistical approach generally enables more interpretability, but function composition increases the space of models that can be fit. If the activation functions in the deep learning networks were linear, then deep learning in a regression application would be equivalent to multiple linear regression, and deep learning in a classification application would be equivalent to linear discriminant analysis. In that same spirit, a deep variational autoencoder with one hidden layer and linear activation functions is equivalent to principal components analysis. But use of nonlinear activation functions enables much more flexible models. Statisticians are taught to be wary of highly flexible models."
I'm going to assume you all don't want to download the paper (19-page PDF) with equations so I'm going to continue on and pull out some more choice quotes.
"Statisticians worry about overparameterization and issues related to the Curse of Dimensionality. But deep learning frequently has millions of parameters. Even though deep learning typically trains on very large datasets, their size still cannot provide sufficient information to overcome the Curse of Dimensionality. Statisticians often try to address overparameterization through variable selection, removing terms that do not contribute significantly to the inference. This generally improves interpretability, but often makes relatively strong use of modeling assumptions. In contrast, deep learning wants all the nodes in the network to contribute only slightly to the final output, and uses dropout (and other methods) to ensure that result. Part of the justification for that is a biological metaphor -- genome-wide association studies find that most traits are controlled by many, many genes, each of which has only small effect."
"And, of course, with the exception of Chen et al., deep learning typically is not concerned with interpretability, which is generally defined as the ability to identify the reason that a particular input leads to a specific output. deep learning does devote some attention to explainability, which concerns the extent to which the parameters in deep networks can shape the output. To statisticians, this is rather like assessing variable importance.
"Statistics has its roots in mathematics, so theory has primacy in our community. Computer science comes from more of a laboratory science tradition, so the emphasis is upon performance. This is not say that there is no theory behind deep learning; it is being built out swiftly. But much of it is done after some technique has been found to work, rather than as a guide to discovering new methodology. One of the major theorems in deep learning concerns the universal approximation property."
"One of the most important theoretical results for deep learning are upper bounds on its generalization error (also known as expected loss). Such bounds indicate how accurately deep learning is able to predict outcomes for previously unseen data. Sometimes the bounds are unhelpfully large, but in other cases they are usefully small."
"In statistics, we worry about the bias-variance tradeoff, and know that overtraining causes performance on the test sample to deteriorate after a certain point, so the error rate on the test data is U-shaped as a function of the amount of training. But in deep learning, there is a double descent curve; the test error decreases, rises, then decreases again, possibly finding a lower minimum before increasing again. This is thought to be because the deep learning network is learning to interpolate the data. The double descent phenomenon is one of the most striking results in modern deep learning research in recent years. It occurs in CNNs, ResNets, and transformers."
"Deep learning models are often viewed as deterministic functions, which limits use in many applications. First, model uncertainty cannot be assessed; statisticians know this can lead to poor prediction. Second, most network architectures are designed through trial and error, or are based upon high level abstractions. Thus, the process of finding an optimal network architecture for the task at hand, given the training data, can be cumbersome. Third, deep models have many parameters and thus require huge amounts of training data. Such data are not always available, and so deep models are often hampered by overfitting. Potentially, these issues could be addressed using Bayesian statistics for uncertainty quantification."
"It is important to recognize that deep learning can fail badly. In particular, deep learning networks can be blindsided by small, imperceptible perturbations in the inputs, also known as data poisoning. The attackers need not even know the details of the machine learning model used by defenders.
"One important case is autonomous vehicles, where deep learning networks decide, e.g., whether the latest sequence of images implies that the car should apply its brakes. There are now many famous cases in which changing a few pixels in training data can creates holes in deep learning network performance. Symmetrically, a small perturbation of an image, such as applying a post-it note to a stop sign, can fool the deep learning network into classifying it has a billboard and not recognizing that the vehicle should brake."
I skipped a few things like Kullback-Leibler divergence reversal, which pertains to how a statistician would look at generative adversarial networks (GANs), which are the neural networks that generate fake things, like images of people that don't exist. But that section was pretty much all equations. He goes on to outline a curriculum that he thinks would be ideal for statisticians who want to learn deep learning.
#ai
0 notes
planetatkinson-blog · 8 years ago
Text
Assignment 3: Lasso
I decided to use exactly the same data as I did in assignment 2, to see if Lasso differs much from Random Forest in its choice of most significant variables (spoiler alert: it differs a LOT).
At the start I had a few questions about Lasso.  The comments in my code include:
''' main topics of interest
1. Will Lasso agree with Random Forest about the order of importance of my predictors (answer: NO)
2. Does it matter if I don't set sex to male as per the lecture? (answer: NO)
3. What happens with correlated predictors? (answer: don’t have time to investigate)
4. Can I plot MSE or R-squared as a function of alpha operating on test/validation data? (answer: don’t have time to make the attempt)
5. Will I actually have time to look at these issues? (answer: not really)
'''
None of the above topics will be addressed today.  Lasso only eliminated 2 of the 12 variables, which made me wonder if lasso was overfitting.  But the real kicker is that the final result has an R-squared of only 3.3%.  I am modelling a set of explanatory variables that have almost no connection with the response variable.  What a waste of time.  Anyway, here’s the write-up, done in the prescribed manner ...
A lasso regression analysis was conducted to identify a subset of variables from a pool of 12 categorical and quantitative predictor variables that best predicted a quantitative response variable measuring major depression.  Categorical predictors included census division, building type, sex (binary), parental death during childhood (binary), respondent having drunk at least 1 drink in life (binary), respondent having drunk at least 12 drinks in the last 12 months (binary), drinking status category, alcoholic father (binary) and alcoholic mother (binary).  Quantitative predictors were age, number of persons in household and number of persons over 18 in household.  All predictors were standardised to have a mean of zero and a standard deviation of one.
Data were randomly split into a training set that included 70% of the observations (n = 23933) and a test set of 10258 observations.  The least angle regression algorithm with k=10 fold cross-validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set.  The change in cross-validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
Figure 1.  Change in the validation MSE at each step
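For anyone wanting to reproduce this, the fit would look roughly like the following in scikit-learn. The file name and column names are placeholders of mine, not the actual dataset fields:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.linear_model import LassoLarsCV

# Hypothetical file and column names -- substitute the real survey extract.
data = pd.read_csv("survey_subset.csv").dropna()
predictor_columns = [
    "CENSUS_DIV", "BUILDING_TYPE", "SEX", "PARENT_DIED", "ANY_DRINK",
    "DRANK_12PLUS", "DRINK_STATUS", "FATHER_ALC", "MOTHER_ALC",
    "AGE", "HOUSEHOLD_N", "HOUSEHOLD_ADULTS",
]
X = scale(data[predictor_columns].astype("float64"))   # standardise: mean 0, sd 1
y = data["MAJOR_DEPRESSION"]                           # quantitative response

# 70/30 train/test split, then lasso via least angle regression with 10-fold CV.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)
model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)

# Coefficients ranked by absolute size; variables lasso dropped have coefficient 0.
coefs = pd.Series(model.coef_, index=predictor_columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))

print("test R^2:", model.score(X_test, y_test))        # the 3.3% quoted below
```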
Of the 12 predictor variables, 10 were retained in the selected model.  Sex was the variable most strongly associated with major depression.  Its coefficient was negative, which, because the field had been recoded to 0 = female and 1 = male, indicates a strong relationship between being female and depression.  Next most strongly associated were the two ‘alcoholic parent’ fields - if I had more time I would cross-reference these against sex to determine whether it is same-sex versus opposite-sex, or mother versus father, that matters.
Continuing in declining order of importance were: number of persons over 18 in the household, having drunk at least 12 drinks in the past 12 months, and total number of persons in the household.
Of minimal importance, but not actually eliminated by lasso, were census division, age and building type.  Lasso assigned zero coefficients to alcohol consumption style (the drinking status category) and to having drunk at least 1 drink ever.
The 10 variables accounted for a miserly 3.3% of the variation in depression (the R-squared value on the test data).  I think this miserable statistic tells us that the explanatory variables have basically no relationship with the response variable.  The model has been a waste of time, apart from the practice it has given me in writing code.
I’m not going back to do it again or to explore the questions I posed at the start of this post.  It’s time to move on.
1 note · View note