Statistics in Education (Part 2): When Everything You Ever Wanted Isn’t Enough

I had the most bizarre experience this past weekend.

The ever awesome Ellen Yezierski (@EllenYezierski) emailed me an article she uses in her graduate-level CER (Chemical Education Research) classes for me to peruse and discuss. The article details the “new” (i.e., 2018) APA (American Psychological Association) reporting standards that should be used for quantitative research. It (and the larger website here) has a flowchart, Figure 1 from the article (shown below), specifying what you should report if your study design used experimental manipulation (vs. not), what to report if you use time series data in your analysis, and so much more.
Figure 1 from Appelbaum, Kline, Nezu, Cooper, Mayo-Wilson & Rao (2018, p. 5) shows which table in the article to use depending on what kind of study was conducted.

Figure 1 points to tables, so what do those look like? Here’s Table 1 (Appelbaum et al., 2018, pp. 6-8):

HOLY SMOKES! Two and a half pages of so many things you should be considering in your study and reporting out. And that’s just Table 1, which is meant for all studies. You then go to the specific table that’s based on your experimental design to learn what else you should be reporting.

There are a total of nine tables in the article, most of them taking at least a page.
And I realized something really important in the midst of reading these tables.

At one point in my life, these tables were everything I ever wanted. But they aren’t anymore.

Five and a half years ago, I knew next to nothing about educational research design or statistics and was fulfilling my pre-reqs for the Statistics Ph.D. program, which included some educational psychology classes. This article is like a summation of what I learned in those Ed Psych classes; it has a LOT of information and is clear about what to use when.
But I have realized many things about myself in the midst of my Ph.D. journey. The two most important things I’ve realized that are relevant to this article discussion? I am a Learning Scientist and a Statistician.
As a Learning Scientist, I fundamentally disagree with the notion that learning can be measured in a lab. I embrace the messiness of the classroom, where variables are abundant and controls are few. I fundamentally disagree with the idea that anything that has to do with humans is generalizable. Chemical reactions are generalizable; if you do the same reactions under the same conditions, you should get the same result. Human interactions are not generalizable; I can’t get my students to behave the same way on two equivalent assignments in the same class timeframe, let alone class to class.
I also believe experimental designs are entirely context dependent and that more time should be spent on designing studies than conducting them.
As I’ve said before, humans are weird and statistical analyses conducted on human subject research data are not easy. The point in all of this is that experimental studies conducted on humans are IMHO at best transferable, not generalizable.
Those are just some of my disagreements with Educational Psychology on learning. But after spending the last four years honing my skills as a statistician, I find this article entirely too prescriptive in terms of statistical methodologies in so many ways and much too vague in others. Statistical analysis is so much more fluid than this article allows. But I completely understand why the APA wrote it; it is at least someplace to start for those who are a bit lost or are not as familiar with statistical methodologies.
You can probably guess what I find too prescriptive – the tables. Their problems are multi-fold – what happens when something (like an unexpected result) doesn’t fit nicely into the table? What happens if someone follows a table exactly only to find that their model’s fit just isn’t good enough (or that they used the wrong table)? What happens when more than one kind of analysis is performed?
Statisticians particularly enjoy doing the last of these. We run analysis after analysis on the same data set, hoping that a “good” model will emerge that we can use to infer future data. Then, after assessing model fit for all possible “good” models, we run different analyses to double-check our results. If we can, we member check our results as well so that we have confirmation from the participants that the analysis speaks to the data they provided.
In my experience as a statistician, if a study is meant to build a model to infer future data (or even if we’re just trying to fit a “good” model), it is NEVER enough to run one analysis or even just one type of analysis. How will you know if another model fits the data better if you only run one analysis?
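To make that concrete, here is a minimal sketch of what running more than one model can look like. Everything in it is my own assumption for illustration: the data are invented, the three candidate models (intercept-only, linear, and linear in log x) are arbitrary picks, and AIC is just one of several criteria you could use to compare them.

```python
import math

# Hypothetical data, invented purely for illustration (not from the article).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.0, 5.4, 6.1, 6.9, 7.3, 7.9, 8.2]


def fit_line(xs, ys):
    """Ordinary least squares for ys = a + b * xs; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rss(ys, preds):
    """Residual sum of squares."""
    return sum((yi - pi) ** 2 for yi, pi in zip(ys, preds))


def aic(rss_value, n, k):
    """Gaussian AIC up to an additive constant; k = number of fitted coefficients."""
    return n * math.log(rss_value / n) + 2 * k


n = len(y)
mean_y = sum(y) / n
a_lin, b_lin = fit_line(x, y)
log_x = [math.log(xi) for xi in x]
a_log, b_log = fit_line(log_x, y)

# Each candidate: (predictions, number of fitted coefficients).
candidates = {
    "intercept-only": ([mean_y] * n, 1),
    "linear": ([a_lin + b_lin * xi for xi in x], 2),
    "log-linear": ([a_log + b_log * lx for lx in log_x], 2),
}

scores = {name: aic(rss(y, preds), n, k) for name, (preds, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} AIC = {score:7.2f}")
print("Lowest AIC:", best)
```

A lower AIC suggests a better trade-off between fit and complexity among the models you actually tried. It says nothing about models you never ran, which is exactly the point.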
I’m not saying that this article prescribes only running one model for a statistical analysis, but there is an entire table for just Structural Equation Modeling.
Admittedly, I’m also a mixed methods educational researcher through and through. I would also double check my quantitative analysis via narratives, focus groups, or interviews to make sure that the quantitative analysis tells the same story about the study participants that my qualitative analysis does. I see quant and qual like reaction energetics – each part (kinetics and thermodynamics) tells a different part of the story about what’s going on in a reaction, but you can’t tell the entire story without both.
So, what was vague about the article then? Sentences like “Assess model fit” or “Summarize the modifications to the original model…” are SUPER vague. Unless you know what those things mean statistically, they don’t really mean anything at all. It would help to clarify those kinds of sentences even further, specifically stating what they mean for at least one major kind of statistical model (like regression analysis).
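As one hedged illustration of what “assess model fit” could mean for a simple linear regression, here is a small sketch. The data (study hours and scores) are hypothetical, and the particular checks shown (R-squared, residual standard error, and a look at the residuals) are my assumptions about a reasonable minimum, not something the APA article specifies.

```python
import math

# Hypothetical data, invented purely for illustration.
hours = [1, 2, 3, 4, 5, 6]
exam_scores = [52, 55, 61, 64, 70, 71]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(exam_scores) / n

# Ordinary least squares for exam_scores = intercept + slope * hours.
sxx = sum((xi - mean_x) ** 2 for xi in hours)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(hours, exam_scores))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * xi for xi in hours]
residuals = [yi - fi for yi, fi in zip(exam_scores, fitted)]

# R-squared: share of the variance in the outcome explained by the model.
ss_total = sum((yi - mean_y) ** 2 for yi in exam_scores)
ss_resid = sum(r ** 2 for r in residuals)
r_squared = 1 - ss_resid / ss_total

# Residual standard error: typical size of a prediction miss,
# using n - 2 degrees of freedom (two fitted coefficients).
resid_std_error = math.sqrt(ss_resid / (n - 2))

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"R-squared = {r_squared:.3f}")
print(f"residual standard error = {resid_std_error:.3f}")

# Residuals should scatter around zero with no obvious pattern;
# a curve or funnel shape here hints that the model is misspecified.
for xi, r in zip(hours, residuals):
    print(f"hours = {xi}: residual = {r:+.2f}")
```

Even for this one kind of model, “fit” is several questions at once: how much variance is explained, how large the typical miss is, and whether the residuals show a pattern the model failed to capture.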
Reading this article was like an experience I’ve had before. When I worked as a contract employee in Human Resources at Intel Corp. in 2000, I helped design a redeployment for current employees. At the time, all I wanted from Intel was to be hired full-time (i.e. to transition my contract status to permanent status). Unfortunately, in the midst of designing the redeployment initiative, I realized that if I had been hired permanently at the time I was hired as a contract employee, I would have been redeployed in the initiative I designed. In other words, I designed as a contract employee the mechanism of my own firing, had I been hired as a permanent employee.
It’s at times like these that I realize I’ve actually grown during my pursuit of my degrees. And I am thankful for the ability to converse with experts in the field about my “takes” on articles like this. So, thank you, Ellen, for this conversation and for our continued future discourse.
Appelbaum, M., Kline, R. B., Nezu, A. M., Cooper, H., Mayo-Wilson, E., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA publications and communications board task force report. American Psychologist, 73(1), 3-25. Retrieved from
