Simply Statistics
458 posts
About us: We are three biostatistics professors (Jeff Leek, Roger Peng, and Rafa Irizarry) who are fired up about the new era where data are abundant and statisticians are scientists.
Follow us: @simplystats
About this blog: We’ll be posting ideas we find interesting, contributing to discussion of science/popular writing, linking to articles that inspire us, and sharing advice with up-and-coming statisticians.
Why “Simply Statistics”: We needed a title. Plus, we like the idea of using simple statistics to solve real, important problems. We aren’t fans of unnecessary complication - that just leads to lies, damn lies and something else.
We've Moved!
Simply Statistics has moved to a new platform and so if you've been following our blog on Tumblr, you'll have to update your links/RSS feeds to the new web site. Apologies for the disruption, but this move allows us to add some new features to the blog that will be rolled out soon. Thanks for following us!
Sunday Data/Statistics Link Roundup (11/18/12)
An interview with Brad Efron about scientific writing. I haven't watched the whole interview, but I do know that Efron is one of my favorite writers among statisticians.
Slidify, another approach for making HTML5 slides directly from R. I love the idea of making HTML slides and would definitely do this regularly. But there are a couple of issues I feel still aren't resolved: (1) it is still just a little too hard to change the theme/feel of the slides, in my opinion. It is just CSS, but that's still just enough of a hurdle that it is keeping me away. And (2) the placement/insertion of images is still a little clunky. Google Docs has figured this out; I'd love it if they integrated the best features of Slidify, LaTeX, etc. into that system. (A rough sketch of the Slidify workflow appears after this list.)
Statistics is still the new hotness. Here is a Business Insider list about 5 statistics problems that will "change the way you think about the world". 
I love this one in the New Yorker, especially the line, "statisticians are the new sexy vampires, only even more pasty" (via Brooke A.)
We've hit the big time! We have been linked to by a real (Forbes) blogger. 
If you haven't noticed, we have a new logo. We are going to be making a few other platform-related changes over the next week or so. If you have any trouble, let us know!
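For those curious about the Slidify item above, here is a rough sketch of the basic workflow as I understand it; the deck name is made up and the exact calls may differ across versions, so treat this as an assumption rather than a recipe:

# Slidify is installed from GitHub (ramnathv/slidify); then, roughly:
library(slidify)
author("mydeck")        # creates a skeleton deck with an index.Rmd file
# ...edit index.Rmd, writing slides in R Markdown separated by "---"...
slidify("index.Rmd")    # compiles the R Markdown into an HTML5 slide deck (index.html)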
Logo Contest Winner
Congratulations to Bradley Saul, the winner of the Simply Statistics Logo contest! We had some great entries which made it difficult to choose between them. You can see the new logo to the right of our home page or the full sized version here:
[Full-sized logo image]
I made some slight modifications to Bradley's original code (apologies!). The code for his original version is here:
Here’s the code:

#########################################################
# Project: Simply Statistics Logo Design
# Date: 10/17/12
# Version: 0.00001
# Author: Bradley Saul
# Built in R Version: 2.15.0
#########################################################

# Set graphical parameters
par(mar=c(0, 0, 0, 0), pty='s', cex=3.5, pin=c(6,6))
# Note: I had to hard code the size, so that the text would scale
# on resizing the image. Maybe there is another way to get around font
# scaling issues - I couldn't figure it out.

make_logo <- function(color){
  x1     <- seq(0, 1, .001)
  ncps   <- seq(0, 10, 1)
  shapes <- seq(5, 15, 1)

  # Plot Beta distributions to make purty lines.
  plot(x1, pbeta(x1, shape1=10, shape2=.1, ncp=0), type='l',
       xlab='', ylab='', frame.plot=FALSE, axes=FALSE)
  for(i in 1:length(ncps)){
    lines(x1, pbeta(x1, shape1=.1, shape2=10, ncp=ncps[i]), col=color)
  }

  # Shade in area under curve.
  coord.x <- c(0, x1, 1)
  coord.y <- c(0, pbeta(x1, shape1=.1, shape2=10, ncp=10), 0)
  polygon(coord.x, coord.y, col=color, border="white")

  # Lazy way to get area between curves shaded, rather than just area under curve.
  coord.y2 <- c(0, pbeta(x1, shape1=10, shape2=.1, ncp=0), 0)
  polygon(coord.x, coord.y2, col="white", border="white")

  # Add text
  text(.98, .4, 'Simply', col="white", adj=1, family='HersheySerif')
  text(.98, .25, 'St\\*atistics', col="white", adj=1, family="HersheySerif")
}
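If you want to render the logo yourself, a minimal sketch would be something like the following; the output file name, size, and fill color are my own guesses, not part of Bradley's code:

png("simply_statistics_logo.png", width = 600, height = 600)   # hypothetical file name and size
par(mar = c(0, 0, 0, 0), pty = 's', cex = 3.5, pin = c(6, 6))   # same graphical parameters as above
make_logo("darkred")                                            # any fill color should work
dev.off()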
Thanks to Bradley for the great logo and congratulations!
Reproducible Research: With Us or Against Us?
Last night this article by Chris Drummond of the Canadian National Research Council (Conseil national de recherches Canada) popped up in my Google Scholar alert. The title of the article, "Reproducible Research: a Dissenting Opinion" would seem to indicate that he disagrees with much that has been circulating out there about reproducible research.
Drummond singles out the Declaration published by a Yale Law School Roundtable on Data and Code Sharing (I was not part of the roundtable) as an example of the main arguments in favor of reproducibility, and he raises four main objections. What I found interesting about his piece is that I think I more or less agree with all his objections and yet draw the exact opposite conclusion from him. In his abstract, he concludes that "I would also contend that the effort necessary to meet the [reproducible research] movement’s aims, and the general attitude it engenders, would not serve any of the research disciplines well."
Let's take his objections one by one:
Reproducibility, at least in the form proposed, is not now, nor has it ever been, an essential part of science. I would say that with the exception of mathematics, this is true. In math, usually you state a theorem and provide the proof. The proof shows you how to obtain the result, so it is a form of reproducibility. But beyond that I would argue that the need for reproducibility is a more recent phenomenon arising from the great complexity and cost of modern data analyses and the lack of funding for full replication. The rise of "consortium science" (think ENCODE project) diminishes our ability to fully replicate (what he calls "Scientific Replication") an experiment in any reasonable amount of time.
The idea of a single well defined scientific method resulting in an incremental, and cumulative, scientific process is highly debatable. He argues that the idea of a forward-moving process by which science builds on top of previous results in an orderly and incremental fashion is a fiction. In particular, there is no single "scientific method" into which you can drop reproducibility as a key component. I think most scientists would agree with this. Science is not some orderly process--it's messy and can seem haphazard, and discoveries come at unpredictable times. But that doesn't mean that people shouldn't provide the details of what they've done so that others don't have to essentially reverse engineer the process. I don't see how the disorderly reality of science is an argument against reproducibility.
Requiring the submission of data and code will encourage a level of distrust among researchers and promote the acceptance of papers based on narrow technical criteria. I don't agree with this statement at all. First, I don't think it will happen. If a journal required code/data, it would be burdensome for some, but it would just be one of the many requirements that journals have. Second, I don't think good science is about "trust". Sure, it's important to be civilized, but if you claim a finding, I'm not going to just trust it because we're both scientists. Finally, he says "Submitting code -- in whatever language, for whatever system -- will simply result in an accumulation of questionable software. There may be some cases where people would be able to use it but I would doubt that they would be frequent." I think this is true, but it's not necessarily an argument against submitting code. Think of all the open source/free software packages out there. I would bet that most of that code has only been looked at by one person--the developer. But does that mean open source software as a whole is not valuable?
Misconduct has always been part of science with surprisingly little consequence. The public’s distrust is likely more to do with the apparent variability of scientific conclusions. I agree with the first part and am not sure about the second. I've tried to argue previously that reproducible research is not just about preventing fraud/misconduct. If someone wants to commit fraud, it's easy to make the fraud reproducible. 
In the end, I see reproducibility as not necessarily a new concept, but really an adaptation of an old concept, that is, describing materials and methods. The problem is that the standard format for publication--journal articles--has simply not caught up with the growing complexity of data analysis. And so we need to update the standards a bit.
I think the benefit of reproducibility is that if someone wants to question or challenge the findings of a study, they have the materials with which to do so. Providing people with the means to ask questions is how science moves forward.
Pro-tips for graduate students (Part 4)
This is part of the ongoing series of pro tips for graduate students; check out parts one, two, and three for the original installments. 
You can never underestimate how little your audience knows/cares about what you are talking about (so be clear and start with the "why").
Perfect is the enemy of good (so do something good and perfect it later).
Learn about as many different areas as you can. You have to focus on one problem to get a Ph.D. (your dissertation) but the best way to get new ideas is to talk to people in areas with different problems than you have. This is the source of many of the "Big Impact" papers. Resources for talking about new ideas ranked according to formality: seminar, working groups, meeting with faculty/other students, going for a beer with some friends.
Here are some ways to come up with a new method: (i) create a new method for a new data type, (ii) adapt an old/useful method to a new data type, (iii) tackle an overlooked problem, (iv) change the assumptions of a current method, and (v) generalize a known method. Any of these can be impactful, but in my experience the highest probability of high impact comes from (ii). 
Some Thoughts on Teaching R to 50,000 Students
Two weeks ago I finished teaching my course Computing for Data Analysis through Coursera. Since then I've had some time to think about how it went, what I learned, and what I'd do differently.
First off, let me say that it was a lot of fun. Seeing thousands of people engaged in the material you've developed is an incredible experience and unlike any I've seen before. I initially had a number of fears about teaching this course, the primary one being that it would be a lot of work. Managing the needs of 50,000 students seemed like it would be a nightmare and making sure everything worked for every single person seemed impossible.
These fears were ultimately unfounded. The Coursera platform is quite nice and is well-designed to scale to very large MOOCs. Everything is run off of Amazon S3 and so scalability is not an issue (although Hurricanes are a different story!) and there are numerous tools provided to help with automatic grading. Quizzes were multiple choice for me, so that gave instant feedback to students, but there are options to grade via regular expressions. For programming assignments, grading was done via unit tests, so students would feed pre-selected inputs into their R functions and the output would be checked on the Coursera server. Again, this allowed for automatic instant feedback without any intervention on my part. Designing programming assignments that would be graded by unit tests was a bit restrictive for me, but I think that was mostly because I wasn't that used to it. On my end, I had to learn about video editing and screen capture, which wasn't too bad. I mostly used Camtasia for Mac (highly recommended) for the lecture videos and occasionally used Final Cut Pro X.
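To give a flavor of what grading by unit tests looks like, here is a toy sketch in R; the function name, inputs, and expected values are hypothetical, not the actual course grading code:

# Hypothetical student-submitted function: compute the column means of a data frame
column_means <- function(df) {
  sapply(df, mean)
}

# Grader-style check: feed a pre-selected input and compare against the expected output
test_input <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
expected   <- c(x = 2, y = 5)
if (isTRUE(all.equal(column_means(test_input), expected))) cat("PASS\n") else cat("FAIL\n")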
Coursera is working hard on their platform and so I imagine there will be many improvements in the near future (some of which were actually rolled out as the course was running). The system feels like it was designed and written by a bunch of Stanford CS grad students--and lo and behold it was! I think it's a great platform for teaching computing, but I don't know how well it'll work for, say, Modern Poetry. But we'll see, I guess.
Here is some of what I took away from this experience:
50,000 students is in some ways easier than 50 students. When I teach my in-class version of this course, I try to make sure everyone's keeping up and doing well. I learn everyone's names. I read all their homeworks. With 50,000 students there's no pretense of individual attention. Everyone's either on their own or has to look to the community for help. I did my best to participate in the discussion forums, but the reality was that the class community was incredibly helpful, and participating in it was probably a better experience for some students than just having me to talk to.
Clarity and specificity are necessary. I've never taught a course online before, so I was used to the way I create assignments in-class. I just jot down some basic goals and problems and then clarify things in class if needed. But here, the programming assignments really had to be clear (akin to legal documents) because trying to clear up confusion afterwards often led to more confusion. The result is that it took a lot more time to write homework assignments for this class than for the same course in-person (even if it was the same homework) because I was basically writing a software specification.
Modularity is key to overcoming heterogeneity. This was a lesson that I didn't figure out until the middle of the course when it was basically too late. In any course, there's heterogeneity in the backgrounds of the students. In programming classes, some students have programmed in other languages before while some have never programmed at all. Handling heterogeneity is a challenge in any course. Now, just multiply that by 10,000 and that's what this course was. Breaking everything down into very small pieces is key to letting people across the skill spectrum move at their own pace. I thought I'd done this but in reality I hadn't broken things down into small enough pieces. The result was that the first homework was a beast of a problem for those who had little programming experience. 
Time and content are more loosely connected. Preparing for this course exposed a feature of in-class courses that I'd not thought about. In-class courses for me are very driven by the clock and the calendar. I teach twice a week, each period is 1.5 hours, and there are 8 weeks in the term. So I need to figure out how to fit material into exact 1.5 hour blocks. If something only takes 1 hour to cover then I need to cover part of the next topic, find a topic that's short, or just fill for half an hour. While preparing for this course, I found myself just thinking about what content I wanted to cover and just doing it. I tried to target about 2 hours of video per week, but there was obviously some flexibility. In class, there's no flexibility because usually the next class is trampling over you as the period ends. Not having to think about exact time was very liberating.
I'm grateful to all the students I had in this first offering of the course and I thank them for putting up with my own learning process as I taught it. I'm hoping to offer this course again on Coursera but I'm not sure when that will be. If you missed the Coursera version of Computing for Data Analysis, I will be offering a version of this course through the blog very shortly. Please check back here for details.
Sunday Data/Statistics Link Roundup (11/11/12)
Statisticians have been deconstructed! I feel vaguely insulted, although I have to admit I'm not even sure I know what the article says. This line is a doozy though: "Statistics always pulls back from the claims it makes..." As a statistician blogger, I make tons of claims. I probably regret some of them, but I'd never take them back :-). 
Following our recent detour into political analysis, here is a story about the statisticians who helped Obama win the election by identifying blocks of voters/donors that could help lead the campaign to victory. I think there are some lessons here for individualized health. 
XKCD is hating on frequentists! Wasserman and Gelman respond. This is the same mistake I think a lot of critics of P-values make. When used incorrectly, any statistical method makes silly claims. The key is knowing when to use them, regardless of which kind you prefer. 
Another article in the popular press about the shortage of data scientists, in particular "big data" scientists. I also saw a ton of discussion of whether Nate Silver used "big data" in making his predictions. This is another one of those many, many cases where the size of the data is mostly irrelevant; it is knowing the right data to use that is important. 
Apparently math can be physically painful.  I don't buy it. 
Interview with Tom Louis - New Chief Scientist at the Census Bureau
Tom Louis
Tom Louis is a professor of Biostatistics at Johns Hopkins and will be joining the Census Bureau through an interagency personnel agreement as the new associate director for research and methodology and chief scientist. Tom has an impressive history of accomplishment in developing statistical methods for everything from environmental science to genomics. We talked to Tom about his new role at the Census, how it relates to his impressive research career, and how young statisticians can get involved in the statistical work at the Census. 
SS: How did you end up being invited to lead the research branch of the Census?
TL: Last winter, then-director Robert Groves (now Provost at Georgetown University) asked if I would be interested in the possibility of becoming the next Associate Director of Research and Methodology (R&M) and Chief Scientist, succeeding Rod Little (Professor of Biostatistics at the University of Michigan) in these roles. I expressed interest and, after several discussions with Bob and Rod, decided that if offered, I would accept. It was offered and I did accept. As background, components of my research, especially Bayesian methods, are Census-relevant. Furthermore, during my time as a member of the National Academies Committee on National Statistics I served on the panel that recommended improvements in small area income and poverty estimates, chaired the panel that evaluated methods for allocating federal and state program funds by formula, and chaired a workshop on facilitating innovation in the Federal statistical system. Rod and I noted that it's interesting and possibly not coincidental that with my appointment the first two associate directors are both former chairs of Biostatistics departments. It is the case that R&M's mission is quite similar to that of a Biostatistics department: methods and collaborative research, consultation and education. And there are many statisticians at the Census Bureau who are not in the R&M directorate, a sociology quite similar to that in a School of Public Health or a Medical campus. 
SS: What made you interested in taking on this major new responsibility?
TL: I became energized by the opportunity for national service, and excited by the scientific, administrative, and sociological responsibilities and challenges. I'll be engaged in hiring and staff development, and in increasing the visibility of the bureau's pre- and post-doctoral programs. The position will provide the impetus to take a deep dive into finite-population statistical approaches, and to contribute to the evolving understanding of the strengths and weaknesses of design-based, model-based and hybrid approaches to inference. That I could remain a Hopkins employee by working via an Interagency Personnel Agreement sealed the deal. I will start in January 2013 and serve through 2015, and will continue to participate in some Hopkins-based activities. In addition to activities within the Census Bureau, I'll be increasing connections among statisticians in other federal statistical agencies, and will have a role in relations with researchers funded through the NSF to conduct census-related research.
SS: What are the sorts of research projects the Census is involved in? 
TL: The Census Bureau designs and conducts the decennial Census, the Current Population Survey, the American Community Survey, many, many other surveys for other Federal Statistical Agencies (including the Bureau of Labor Statistics), and a quite extraordinary portfolio of others. Each identifies issues in design and analysis that merit attention; many entail "Big Data" and many require combining information from a variety of sources. I give a few examples, and encourage exploration of www.census.gov/research. You can get a flavor of the types of research from the titles of the six current centers within R&M: the Center for Adaptive Design, the Center for Administrative Records Research and Acquisition, the Center for Disclosure Avoidance Research, the Center for Economic Studies, the Center for Statistical Research and Methodology, and the Center for Survey Measurement. Projects include multi-mode survey approaches, stopping rules for household visits, methods of combining information from surveys and administrative records, provision of focused estimates while preserving identity protection, improved small area estimates of income and of limited English skills (used to trigger provision of election ballots in languages other than English), and continuing investigation of issues related to model-based and design-based inferences.
  SS: Are those projects related to your research?
TL: Some are, some will be, some will never be.  Small area estimation, hierarchical modeling with a Bayesian formalism, some aspects of adaptive design, some of combining evidence from a variety of sources, and general statistical modeling are in my power zone.  I look forward to getting involved in these and contributing to other projects.
SS: How does research performed at the Census help the American Public?
TL: Research innovations enable the bureau to produce more timely and accurate information at lower cost and to improve validity (for example, new approaches have at least maintained respondent participation in surveys), enhancing the reputation of the Census Bureau as a trusted source of information. Estimates developed by Census are used to allocate billions of dollars in school aid, and they provide key planning information for businesses and governments.
SS: How can young statisticians get more involved in government statistical research?
TL: The first step is to become aware of the wide variety of activities and their high impact.  Visiting the Census website and those of other federal and state agencies, and the Committee on National Statistics (http://sites.nationalacademies.org/DBASSE/CNSTAT/) and the National Institute of Statistical Sciences (http://www.niss.org/) is a good start.   Make contact with researchers at the JSM and other meetings and be on the lookout for pre- and post-doctoral positions at Census and other federal agencies.
Some academic thoughts on the poll aggregators
The night of the presidential elections I wrote a post celebrating the victory of data over punditry. I was motivated by the personal attacks made against Nate Silver by pundits that do not understand Statistics. The post generated a little bit of (justified) nerdrage (see comment section). So here I clarify a couple of things not as a member of Nate Silver's fan club (my mancrush started with PECOTA not fivethirtyeight) but as an applied statistician.
The main reason fivethirtyeight predicts election results so well is the idea of averaging polls. This idea was around way before fivethirtyeight started. In fact, it's a version of meta-analysis, which has been around for hundreds of years and is commonly used to improve results of clinical trials. This election cycle several groups, including Sam Wang (Princeton Election Consortium), Simon Jackman (pollster), and Drew Linzer (VOTAMATIC), predicted the election perfectly using this trick. 
While each group adds their own set of bells and whistles, most of the gains come from the aggregation of polls and understanding the concept of a standard error. Note that while each individual poll may be a bit biased, historical data show that these biases average out to 0. So by taking the average you obtain a close to unbiased estimate. Because there are so many pollsters, each one conducting several polls, you can also estimate the standard error of your estimate pretty well (empirically rather than theoretically). I include a plot below that provides evidence that bias is not an issue and that standard errors are well estimated. The dashed line is at +/- 2 standard errors based on the average (across all states) standard error reported by fivethirtyeight. Note that the variability is smaller for the battleground states where more polls were conducted (this is consistent with the state-specific standard errors reported by fivethirtyeight).
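To make the aggregation idea concrete, here is a toy simulation in R with made-up numbers (this is not the fivethirtyeight data or model):

set.seed(1)
true_support <- 0.505                                   # true support for candidate A
n_polls      <- 20
house_effect <- rnorm(n_polls, mean = 0, sd = 0.01)     # pollster "house effects" that average out near 0
polls        <- rbinom(n_polls, size = 800, prob = true_support + house_effect) / 800

mean(polls)                   # aggregated estimate, close to the true 50.5%
sd(polls) / sqrt(n_polls)     # empirical standard error of the poll average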
Finally, there is the issue of the use of the word "probability". Obviously one can correctly state that there is a 90% chance of observing event A and then have it not happen: Romney could have won and the aggregators could still have been "right". Also, frequentists complain when we talk about the probability of something that will only happen once. I actually don't like getting into this philosophical discussion (Gelman has some thoughts worth reading) and I cut people who write for the masses some slack. If the aggregators consistently outperform the pundits in their predictions, I have no problem with them using the word "probability" in their reports. I look forward to some of the post-election analysis of all this.
Nate Silver does it again! Will pundits finally accept defeat?
My favorite statistician did it again! Just like in 2008, he predicted the presidential election results almost perfectly. For those that don't know, Nate Silver is the statistician that runs the fivethirtyeight blog. He combines data from hundreds of polls, uses historical data to weigh them appropriately and then uses a statistical model to run simulations and predict outcomes.
While the pundits were claiming the race was a "dead heat", the day before the election Nate gave Obama a 90% chance of winning. Several pundits attacked Nate (some attacks were personal) for his predictions and demonstrated their ignorance of Statistics. Jeff wrote a nice post on this. The plot below demonstrates how great Nate's prediction was. Note that in each of the 45 states (including DC) for which he predicted a 90% probability or higher of winning for candidate A, candidate A won. For the other 6 states the range of percentages was 48-52%. If Florida goes for Obama he will have predicted every single state correctly.
Update: Congratulations also to Sam Wang (Princeton Election Consortium) and Simon Jackman (pollster), who also called the election perfectly. And thanks to the pollsters that provided the unbiased (on average) data used by all these folks. Data analysts won, "experts" lost.
Update 2: New plot with data from here. Old graph here.
If we truly want to foster collaboration, we need to rethink the "independence" criteria during promotion
When I talk about collaborative work, I don't mean spending a day or two helping compute some p-values and ending up as middle author on a subject-matter paper. I mean spending months working on a project, from start to finish, with experts from other disciplines to accomplish a goal that can only be accomplished with a diverse team. Many papers in genomics are like this (the ENCODE and 1000 Genomes papers, for example). Investigator A dreams up the biology, B develops the technology, C codes up algorithms to deal with massive data, while D analyzes the data and assesses uncertainty, with the results reported in one high profile paper. I illustrate the point with genomics because it's what I know best, but examples abound in other specialties as well. 
Fostering collaborative research seems to be a priority for most higher education institutions. Both funding agencies and universities are creating initiative after initiative to incentivize team science. But at the same time the appointments and promotions process rewards researchers that have demonstrated "independence". If we are not careful it may seem like we are sending mixed signals. I know of young investigators that have been advised to set time aside to demonstrate independence by publishing papers without their regular collaborators. This advice assumes that one can easily balance collaborative and independent research. But here is the problem: truly collaborative work can take just as much time and intellectual energy as independent research, perhaps more. Because time is limited, we might inadvertently be hindering the team science we are supposed to be fostering. Time spent demonstrating independence is time not spent working on the next high impact project.
I understand the argument for striving to hire and promote scholars that can excel no matter the context. But I also think it is unrealistic to compete in team science if we don’t find a better way to promote those that excel in collaborative research as well. It is a mistake to think that scholars that excel in solo research can easily succeed in team science. In fact, I have seen several examples of specializations that are important to the university in which the best work is being produced by a small team. At the same time, "independent" researchers all over the country are also working in these areas and publishing just as many papers. But the influential work is coming almost exclusively from the team. Whom should your university hire and promote in this particular area? To me it seems clear that it is the team. But for them to succeed we can’t get in their way by requiring each individual member to demonstrate “independence” in the traditional sense.
Sunday Data/Statistics Link Roundup (11/4/12)
Brian Caffo headlines the WaPo article about massive online open courses. He is the driving force behind our department's involvement in offering these massive courses. I think this sums it up: “I can’t use another word than unbelievable,” Caffo said. Then he found some more: “Crazy . . . surreal . . . heartwarming.”
A really interesting discussion of why "A Bet is a Tax on B.S.". It nicely describes why intelligent bettors must be disinterested in the outcome, otherwise they will end up losing money. The Nate Silver controversy just doesn't seem to be going away; good news for his readership numbers, I bet. (via Rafa)
An interesting article on how scientists are not claiming global warming is the sole cause of the extreme weather events we are seeing, but that it does contribute to them being more extreme. The key quote: “We can’t say that steroids caused any one home run by Barry Bonds, but steroids sure helped him hit more and hit them farther. Now we have weather on steroids.” --Eric Pooley. (via Roger)
The NIGMS is looking for a Biomedical technology, Bioinformatics, and Computational Biology Director. I hope that it is someone who understands statistics! (via Karl B.)
Here is another article that appears to misunderstand statistical prediction.  This one is about the Italian scientists who were jailed for failing to predict an earthquake. No joke. 
We talk a lot about how much the data revolution will change industries from social media to healthcare. But here is an important reality check. Patients are not showing an interest in accessing their health care data. I wonder if part of the reason is that we haven't come up with the right ways to explain, understand, and utilize what is inherently stochastic and uncertain information. 
The BMJ is now going to require all data from clinical trials published in their journal to be public.  This is a brilliant, forward thinking move. I hope other journals will follow suit. (via Karen B.R.)
An interesting article about the impact of retractions on citation rates, suggesting that papers in fields close to those of the retracted paper may show negative impact on their citation rates. I haven't looked it over carefully, but how they control for confounding seems incredibly important in this case. (via Alex N.). 
Link: Another MOOC article, but this one features Brian Caffo.
Link: "MOOCs have been around for a few years as collaborative techie learning events, but this is the year everyone wants in. Elite universities are partnering with Coursera at a furious pace. It now offers courses from 33 of the biggest names in postsecondary education, including Princeton, Brown, Columbia and Duke. In September, Google unleashed a MOOC-building online tool, and Stanford unveiled Class2Go with two courses."
Link: "Microsoft is incorporating advanced computing technologies into many of its products, allowing users to comb huge amounts of data and get suggestions based on their habits."
On weather forecasts, Nate Silver, and the politicization of statistical illiteracy
As you know, we have a thing for statistical literacy here at Simply Stats. So of course this column over at Politico got our attention (via Chris V. and others). The column is an attack on Nate Silver, who has a blog where he tries to predict the outcome of elections in the U.S., you may have heard of it...
The argument that Dylan Byers makes in the Politico column is that Nate Silver is likely to be embarrassed by the outcome of the election if Romney wins. The reason is that Silver's predictions have recently suggested Obama has a 75% chance of winning the election, and that number has never dropped below 60% or so. 
I don't know much about Dylan Byers, but from reading this column and a quick scan of his twitter feed, it appears he doesn't know much about statistics. Some people have gotten pretty upset at him on Twitter and elsewhere about this fact, but I'd like to take a different approach: education. So Dylan, here is a really simple example that explains how Nate Silver comes up with a number like the 75% chance of victory for Obama. 
Let's pretend, just to make the example really simple, that if Obama gets greater than 50% of the vote, he will win the election. Obviously, Silver doesn't ignore the electoral college and all the other complications, but it makes our example simpler. Then assume that based on averaging a bunch of polls  we estimate that Obama is likely to get about 50.5% of the vote.
Now, we want to know what is the "percent chance" Obama will win, taking into account what we know. So let's run a bunch of "simulated elections" where on average Obama gets 50.5% of the vote, but there is variability because we don't have the exact number. Since we have a bunch of polls and we averaged them, we can get an estimate for how variable the 50.5% number is. The usual measure of this variability is the standard deviation. Say we get a standard deviation of 1% for our estimate. That would be a pretty accurate number, but not totally unreasonable given the amount of polling data out there. 
We can run 1,000 simulated elections like this in R* (a free software programming language; if you don't know R, may I suggest Roger's Computing for Data Analysis class?). Here is the code to do that. The last line of code calculates the percent of times, in our 1,000 simulated elections, that Obama wins. This is the number that Nate would report on his site. When I run the code, I get an Obama win 68% of the time (Obama gets greater than 50% of the vote). But if you run it again that number will vary a little, since we simulated elections. 
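The post originally linked to the code; since that link isn't preserved here, below is a sketch of the kind of simulation just described (the exact code may have differed):

set.seed(2012)                                    # hypothetical seed, just for reproducibility
obama_share <- rnorm(1000, mean = 50.5, sd = 1)   # 1,000 simulated elections: Obama's percent of the vote
mean(obama_share > 50)                            # fraction of simulated elections Obama wins (about 0.68)
hist(obama_share, xlab = "Obama's percent of the vote", main = "")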
The interesting thing is that even though we only estimate that Obama leads by about 0.5%, he wins 68% of the simulated elections. The reason is that we are pretty confident in that number, with our standard deviation being so low (1%). But that doesn't mean that Obama will win 68% of the vote in any of the elections! In fact, here is a histogram of the percent of the vote that Obama wins: 
He never gets more than 54% or so and never less than 47% or so. So it is always a reasonably close election. Silver's calculations are obviously more complicated, but the basic idea of simulating elections is the same. 
Now, this might seem like a goofy way to come up with a "percent chance" with simulated elections and all. But it turns out it is actually a pretty important thing to know and relevant to those of us on the East Coast right now. It turns out weather forecasts (and projected hurricane paths) are based on the same sort of thing - simulated versions of the weather are run and the "percent chance of rain" is the fraction of times it rains in a particular place. 
So Romney may still win and Obama may lose - and Silver may still get a lot of it right. But regardless, the approach taken by Silver is not based on politics, it is based on statistics. Hopefully we can move away from politicizing statistical illiteracy and toward evaluating the models for the real, underlying assumptions they make. 
* In this case, we could calculate the percent of times Obama would win with a formula (called an analytical calculation) since we have simplified so much. In Nate's case it is much more complicated, so you have to simulate. 
Computing for Data Analysis (Simply Statistics Edition)
As the entire East Coast gets soaked by Hurricane Sandy, I can't help but think that this is the perfect time to...take a course online! Well, as long as you have electricity, that is. I live in a heavily tree-lined area and so it's only a matter of time before the lights cut out on me (I'd better type quickly!). 
I just finished teaching my course Computing for Data Analysis through Coursera. This was my first experience teaching a course online and definitely my first experience teaching a course to > 50,000 people. There were definitely some bumps along the road, but the students who participated were fantastic at helping me smooth the way. In particular, the interaction on the discussion forums was very helpful. I couldn't have done it without the students' help. So, if you took my course over the past 4 weeks, thanks for participating!
Here are a couple quick stats on the course participation (as of today) for the curious:
50,899: Number of students enrolled
27,900: Number of users watching lecture videos
459,927: Total number of streaming views (over 4 weeks)
414,359: Total number of video downloads (not all courses allow this)
14,375: Number of users submitting the weekly quizzes (graded)
6,420: Number of users submitting the bi-weekly R programming assignments (graded)
6393+3291: Total number of posts+comments to the discussion forum
314,302: Total number of views in the discussion forum
I've received a number of emails from people who signed up in the middle of the course or after the course finished. Given that it was a 4-week course, signing up in the middle of the course meant you missed quite a bit of material. I will eventually be closing down the Coursera version of the course--at this point it's not clear when it will be offered again on that platform but I would like to do so--and so access to the course material will be restricted. However, I'd like to make that material more widely available even if it isn't in the Coursera format.
So I'm announcing today that next month I'll be offering the Simply Statistics Edition of Computing for Data Analysis. This will be a slightly simplified version of the course that was offered on Coursera since I don't have access to all of the cool platform features that they offer. But all of the original content will be available, including some new material that I hope to add over the coming weeks.
If you are interested in taking this course or know of someone who is, please check back here soon for more details on how to sign up and get the course information.