Statistics in Education: It’s harder than it looks

In 2009, when I started thinking about going back to graduate school, statistics was definitely on my mind. I had the perhaps erroneous perception that statistics was not done as well as it could be in educational research, and that perception ate away at my conscience.

This is why my original degree plan (pre-reqs started in 2014; acceptance in 2015) was to get a Ph.D. in Statistics.

So, it probably comes as absolutely no surprise to anyone that when a recent article that I was immensely interested in had incomplete statistics (IMHO), I said something about it on Twitter.

Some folks on Twitter, including Karen Cangialosi (@karencang) and Laura Gibbs (@OnlineCrsLady) engaged with my original thread. And then Jesse Stommel retweeted my original thread and things became a bit more interesting:

Folks were asking for clarification. I had already gone through the article with a fine-toothed comb and was willing to send the marked-up article via email.

I was not, at that point, willing to post it publicly because I did not, in any way, want to undermine the hard work the authors had put into this paper. But folks kept asking.

And then Colleen Flaherty from Inside Higher Ed asked:

Hi Prof. Sorensen-Unruh,

I hope you’re well.

I’m writing about that study on active learning today, the one you tweeted about recently. Obviously you’re doing a really smart critique but could you dumb it down for me a bit? What would have made this stronger, in your opinion? Do you think the paper is getting too much attention?

Thanks for any thoughts you can share today or tomorrow,

Colleen Flaherty

Faculty Correspondent

I knew Colleen from a previous article on ungrading so I trusted her to handle my candid discussion of the article effectively.

So, I responded…

Hi Colleen,

And good morning (at least MDT)!

Let’s start with the notion that this paper had many important study characteristics that I appreciated. From a science standpoint, having a control and an experimental group populated by students chosen at random, keeping as many variables fixed as possible, and gathering quantitative data on the research subjects (i.e., students) are critical aspects of experimental design.

I also appreciated the authors’ mixed methods study design and their use of member checking in the interviews.

But I had multifold problems with the paper, many of which were stated more eloquently by Jesse and other parties:

  1. Do MCQs really measure learning, or do they simply measure how to best take an MCQ test?
  2. Do the survey questions really measure student feelings or just temporary perception? (IMHO, these survey questions are unclear both in their wording and in what they measure on student evals, and their use in this paper does not improve them.)
  3. Why wasn’t more time spent in the paper on the qualitative aspects of the study, which, in my opinion, were the most interesting and probably yielded the richest data set? The interviews conducted could have been greatly expanded upon; the interviews in conjunction with the surveys *might* have actually given us a better sense of student feelings and perception.

And then, of course, we come to the statistics in the paper, for which I had major issues:

  1. Where is a scatter plot of the data?
  2. How do we know that the descriptive data (i.e., mean and SD) reported completely describe the original data set (i.e., how do we know the mean and the SD are sufficient statistics)? We don’t even know how the data is distributed…Although we DO know what the authors assumed – that the data fit a normal/Gaussian curve (using the Central Limit Theorem (CLT) as the assumed reason) because they quoted z-scores. Human data is weird, and while their student groups were over 30 students (so most folks assume CLT applies), the data should still be checked to make sure it’s normal (hence the reason I wanted to see a scatter plot).
  3. The linear regression analysis description was about as clear as mud. Where are the models they used (and by this I mean actual equations of the form:

y = β0 + β1x1 + β2x2 + β3x3 + … + ε

where β’s are linear regression coefficients and ε is the error (usually analyzed by comparing residuals with the data)). I understand that the authors did multiple regression; I’m unclear as to what the y’s really were. From my analysis, the y for the FOL was PC1 (the first principal component from principal component analysis, which is a multivariate analysis that forms latent variables (i.e. principal components) from linear combinations of the data (the data being the Likert scale values from the survey)) some of the time and the Likert scale value for question 2 some of the time. Was the y for the TOL the overall percentage grade the students received on their MCQ test (or the number correct)? Why did the authors expect the independent variables (x’s) they used (in Table 3) to predict these outcomes (y’s)? Where are the correlations of the predictions with the students’ actual scores? And why weren’t the separate semesters (Fall and Spring) analyzed in a multivariate linear regression (a system of equations with y1 from Fall and y2 from Spring that are analyzed at the same time)? Is it really fair to say students in the fall and the spring are truly similar enough to bin in the same data set?
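To make the model-specification point concrete: since the paper’s data aren’t available, here is a minimal sketch with made-up numbers showing what writing out the regression explicitly looks like. Everything here (the predictors, the coefficients, the sample size) is hypothetical, not taken from the paper; the point is simply that the fitted equation and its β’s should be reported, not left implicit.

```python
import numpy as np

# Hypothetical data: n students, two made-up predictors x1 and x2.
rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)   # e.g., a survey-derived score (invented)
x2 = rng.normal(size=n)   # e.g., a prior-performance measure (invented)

# Simulate y = β0 + β1*x1 + β2*x2 + ε with known true coefficients.
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Ordinary least squares via the design matrix (column of ones = intercept).
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # beta ≈ [β0, β1, β2]

# Residuals ε̂ = y - Xβ̂, which is what the later diagnostic checks examine.
residuals = y - X @ beta
```

Reporting `beta` alongside the fitted equation is the transparency the email above is asking for: a reader can then check each coefficient against the claimed outcome variables.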

  4. Where is the checking of assumptions for this linear regression analysis through the analysis of residuals (i.e., error)? Linear regression requires the data to be normal, independent, and with constant variance. If a model doesn’t fit these criteria, then it is usually not a valid model, and we either transform the data to fit the model more accurately or we reduce the model. Where is the validation that transformation or reduction would not have resulted in a better model for this data?
  5. Why did the authors use linear regression to begin with? ANOVA (or in this case, MANOVA) would have been a better technique for the number of binary factors used in this analysis. At least some kind of dual analysis using linear regression and ANOVA (which are, in many ways, two sides of the same coin) would have helped convince the audience that the data was treated in multiple ways and ended up with the same result.
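The residual checks described above can be sketched quickly even without plotting. This is a hedged illustration on simulated data (all numbers invented), using crude moment-based checks rather than formal tests like Shapiro–Wilk or Breusch–Pagan, just to show that verifying normality and constant variance of the residuals takes only a few lines:

```python
import numpy as np

# Simulate a simple well-behaved regression so the diagnostics should pass.
rng = np.random.default_rng(1)
n = 80
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Crude normality check: standardized residuals should have skewness and
# excess kurtosis near zero (a formal Shapiro-Wilk test would be stronger).
z = (resid - resid.mean()) / resid.std()
skew = (z ** 3).mean()
excess_kurtosis = (z ** 4).mean() - 3.0

# Crude constant-variance check: residual spread should not grow with the
# fitted values; compare the variance in the lower and upper halves.
order = np.argsort(X @ beta)
low_half = resid[order[: n // 2]]
high_half = resid[order[n // 2:]]
variance_ratio = high_half.var() / low_half.var()
```

If `skew`, `excess_kurtosis`, or `variance_ratio` came out far from their well-behaved values (0, 0, and 1, respectively), that would be the signal to transform the data or reduce the model, which is exactly the validation step the paper skips.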

I’m going to stop there as I have already probably gone too far into the statistical analysis. I went after the statistical analysis on Twitter because the anticorrelation (or negative correlation) between the TOL and the FOL is what the authors used as their main finding, and if a statistical measure is your major finding, then your statistics better be done exceedingly well. But this is my two cents.

I actually wanted this paper to be an important seminal work because it verifies something we, as active learning professors teaching science, already know – students resist active learning in the classroom because they feel like they are learning less. I was hoping this paper would be one that could be cited many, many times. But the many flaws of the analysis, particularly the statistical analysis, make that hope (really that dream) impossible.

I hope this is clear. Let me know if you need a further breakdown.

Cheers and hope you are well!

rissa :o)

Let’s reiterate this point for a moment – I only went after this article on Twitter because it was going viral and I felt like they could not hang their major finding on a statistically significant anticorrelation (or negative correlation) and then not do the linear regression justice. So, not only are the authors’ statistics incomplete, but they are incomplete to the point that they cannot justify their major finding.

And here’s the other major point (which I may not have stated before) – the authors did a massive amount of work doing interviews and collecting thick, rich qualitative data. They then spent two paragraphs (more or less) on these findings.

If significant time had been spent on the qualitative analysis, I think it would have helped me believe their anticorrelation finding more. But they basically threw those findings in at the end as if they weren’t perhaps the MOST IMPORTANT data the authors had collected.

Anyway, Colleen did an amazing job with the article for Inside Higher Ed.

And here’s what I really wish folks would know about statistics in human subject research: it is muddy and it is messy. There is nothing clean cut about it. As statisticians who want to do an analysis well, we run as many models as we can on the data set we’ve painstakingly gathered to find the “best” model based on choices and assumptions we have made throughout the analysis. Then we try to be as transparent as we possibly can be about what we did so that others can transfer our work or can try to replicate it if we can make our data set open.

This is NOT introductory statistics. By any stretch of the imagination.

When I defected to the Ph.D. in Learning Sciences in 2016, I kept the MS in Statistics. I wanted to build bridges between the STEM disciplines, particularly in terms of teaching, learning, and educational research, using a language we could all understand. And until educational research recognizes what medical research has already realized – that having statisticians on board from day 1 of the study is essential – I am going to be asking study authors who make their fundamental finding a statistically based one for more information.

Quick acknowledgement: A major thank you and shout out for this blog needs to go to Josh Eyler (@joshua_r_eyler), who helped me think through most of this blog via our discussion through emails and who suggested my thread to Colleen. Thank you Josh!
