The Dunning-Kruger Effect is Autocorrelation

Have you heard of the ‘Dunning-Kruger effect’? It’s the (apparent) tendency for unskilled people to overestimate their competence. Discovered in 1999 by psychologists Justin Kruger and David Dunning, the effect has since become famous.

And you can see why.

It’s the kind of idea that is too juicy to not be true. Everyone ‘knows’ that idiots tend to be unaware of their own idiocy. Or as John Cleese puts it:

If you’re very very stupid, how can you possibly realize that you’re very very stupid?

Of course, psychologists have been careful to make sure that the evidence replicates. But sure enough, every time you look for it, the Dunning-Kruger effect leaps out of the data. So it would seem that everything’s on sound footing.

Except there’s a problem.

The Dunning-Kruger effect also emerges from data in which it shouldn’t. For instance, if you carefully craft random data so that it does not contain a Dunning-Kruger effect, you will still find the effect. The reason turns out to be embarrassingly simple: the Dunning-Kruger effect has nothing to do with human psychology.1 It is a statistical artifact — a stunning example of autocorrelation.

What is autocorrelation?

Autocorrelation occurs when you correlate a variable with itself. For instance, if I measure the height of 10 people, I’ll find that each person’s height correlates perfectly with itself. If this sounds like circular reasoning, that’s because it is. Autocorrelation is the statistical equivalent of stating that 5 = 5.

When framed this way, the idea of autocorrelation sounds absurd. No competent scientist would correlate a variable with itself. And that’s true for the pure form of autocorrelation. But what if a variable gets mixed into both sides of an equation, where it is forgotten? In that case, autocorrelation is more difficult to spot.

Here’s an example. Suppose I am working with two variables, x and y. I find that these variables are completely uncorrelated, as shown in the left panel of Figure 1. So far so good.

Figure 1: Generating autocorrelation. The left panel plots the random variables x and y, which are uncorrelated. The right panel shows how this non-correlation can be transformed into an autocorrelation. We define a variable called z, which is correlated strongly with x. The problem is that z happens to be the sum x + y. So we are correlating x with itself. The variable y adds statistical noise.

Next, I start to play with the data. After a bit of manipulation, I come up with a quantity that I call z. I save my work and forget about it. Months later, my colleague revisits my dataset and discovers that z strongly correlates with x (Figure 1, right). We’ve discovered something interesting!

Actually, we’ve discovered autocorrelation. You see, unbeknownst to my colleague, I’ve defined the variable z to be the sum of x + y. As a result, when we correlate z with x, we are actually correlating x with itself. (The variable y comes along for the ride, providing statistical noise.) That’s how autocorrelation happens — forgetting that you’ve got the same variable on both sides of a correlation.
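
Below is a minimal Python sketch (using numpy; the seed, sample size, and 0–100 range are arbitrary choices of mine) that reproduces this setup numerically: two independent random variables show no correlation, but the relabeled sum z = x + y correlates strongly with x.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Two independent, uniformly distributed variables (no underlying relationship)
    x = rng.uniform(0, 100, size=1000)
    y = rng.uniform(0, 100, size=1000)

    # 'z' is secretly just x with the noise y mixed in
    z = x + y

    print(np.corrcoef(x, y)[0, 1])  # close to 0: x and y are unrelated
    print(np.corrcoef(z, x)[0, 1])  # close to 0.71: x correlating with itself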

The Dunning-Kruger effect

Now that you understand autocorrelation, let’s talk about the Dunning-Kruger effect. Much like the example in Figure 1, the Dunning-Kruger effect amounts to autocorrelation. But instead of lurking within a relabeled variable, the Dunning-Kruger autocorrelation hides beneath a deceptive chart.2

Let’s have a look.

In 1999, Dunning and Kruger reported the results of a simple experiment. They got a bunch of people to complete a skills test. (Actually, Dunning and Kruger used several tests, but that’s irrelevant for my discussion.) Then they asked each person to assess their own ability. What Dunning and Kruger (thought they) found was that the people who did poorly on the skills test also tended to overestimate their ability. That’s the ‘Dunning-Kruger effect’.

Dunning and Kruger visualized their results as shown in Figure 2. It’s a simple chart that draws the eye to the difference between two curves. On the horizontal axis, Dunning and Kruger have placed people into four groups (quartiles) according to their test scores. In the plot, the two lines show the results within each group. The grey line indicates people’s average results on the skills test. The black line indicates their average ‘perceived ability’. Clearly, people who scored poorly on the skills test are overconfident in their abilities. (Or so it appears.)

Figure 2: The Dunning-Kruger chart. From Dunning and Kruger (1999). This figure shows how Dunning and Kruger reported their original findings. Dunning and Kruger gave a skills test to individuals, and also asked each person to estimate their ability. Dunning and Kruger then placed people into four groups based on their ranked test scores. This figure contrasts the (average) percentile of the ‘actual test score’ within each group (grey line) with the (average) percentile of ‘perceived ability’. The Dunning-Kruger ‘effect’ is the difference between the two curves — the (apparent) fact that unskilled people overestimate their ability.

On its own, the Dunning-Kruger chart seems convincing. Add in the fact that Dunning and Kruger are excellent writers, and you have the recipe for a hit paper. On that note, I recommend that you read their article, because it reminds us that good rhetoric is not the same as good science.

Deconstructing Dunning-Kruger

Now that you’ve seen the Dunning-Kruger chart, let’s show how it hides autocorrelation. To make things clear, I’ll annotate the chart as we go.

We’ll start with the horizontal axis. In the Dunning-Kruger chart, the horizontal axis is ‘categorical’, meaning it shows ‘categories’ rather than numerical values. Of course, there’s nothing wrong with plotting categories. But in this case, the categories are actually numerical. Dunning and Kruger take people’s test scores and place them into 4 ranked groups. (Statisticians call these groups ‘quartiles’.)

What this ranking means is that the horizontal axis effectively plots test score. Let’s call this score x.

Figure 3: Deconstructing the Dunning-Kruger chart. In the Dunning-Kruger chart, the horizontal axis ranks ‘actual test score’, which I’ll call x.

Next, let’s look at the vertical axis, which is marked ‘percentile’. What this means is that instead of plotting actual test scores, Dunning and Kruger plot the score’s ranking on a 100-point scale.3

Now let’s look at the curves. The line labeled ‘actual test score’ plots the average percentile of each quartile’s test score (a mouthful, I know). Things seem fine, until we realize that Dunning and Kruger are essentially plotting test score (x) against itself.4 Noticing this fact, let’s relabel the grey line. It effectively plots x vs. x.

Figure 4: Deconstructing the Dunning-Kruger chart. In the Dunning-Kruger chart, the line marked ‘actual test score’ is plotting test score (x) against itself. In my notation, that’s x vs. x.

Moving on, let’s look at the line labeled ‘perceived ability’. This line measures the average percentile for each group’s self assessment. Let’s call this self-assessment y. Recalling that we’ve labeled ‘actual test score’ as x, we see that the black line plots y vs. x.

Figure 5: Deconstructing the Dunning-Kruger chart. In the Dunning-Kruger chart, the line marked ‘perceived ability’ is plotting ‘perceived ability’ y against actual test score x.

So far, nothing jumps out as obviously wrong. Yes, it’s a bit weird to plot x vs. x. But Dunning and Kruger are not claiming that this line alone is important. What’s important is the difference between the two lines (‘perceived ability’ vs. ‘actual test score’). It’s in this difference that the autocorrelation appears.

In mathematical terms, a ‘difference’ means ‘subtract’. So by showing us two diverging lines, Dunning and Kruger are (implicitly) asking us to subtract one from the other: take ‘perceived ability’ and subtract ‘actual test score’. In my notation, that corresponds to y – x.

Figure 6: Deconstructing the Dunning-Kruger chart. To interpret the Dunning-Kruger chart, we (implicitly) look at the difference between the two curves. That corresponds to taking ‘perceived ability’ and subtracting from it ‘actual test score’. In my notation, that difference is y – x (indicated by the double-headed arrow). When we judge this difference as a function of the horizontal axis, we are implicitly comparing y – x to x. Since x is on both sides of the comparison, the result will be an autocorrelation.

Subtracting y – x seems fine, until we realize that we’re supposed to interpret this difference as a function of the horizontal axis. But the horizontal axis plots test score x. So we are (implicitly) asked to compare y – x to x:

\displaystyle (y - x) \sim x

Do you see the problem? We’re comparing x with the negative version of itself. That is textbook autocorrelation. It means that we can throw random numbers into x and y — numbers which could not possibly contain the Dunning-Kruger effect — and yet out the other end, the effect will still emerge.
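
We can even quantify the artifact. Assuming x and y are independent with equal variance σ² (as with the random numbers used below), the expected correlation works out to:

\displaystyle \text{corr}(y - x,\ x) = \frac{\text{cov}(y, x) - \text{var}(x)}{\sqrt{\text{var}(y - x)\,\text{var}(x)}} = \frac{-\sigma^2}{\sqrt{2\sigma^2}\cdot \sigma} = -\frac{1}{\sqrt{2}} \approx -0.71

In other words, pure noise guarantees a strong negative correlation between ‘skill’ and ‘overconfidence’.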

Replicating Dunning-Kruger

To be honest, I’m not particularly convinced by the analytic arguments above. It’s only by using real data that I can understand the problem with the Dunning-Kruger effect. So let’s have a look at some real numbers.

Suppose we are psychologists who get a big grant to replicate the Dunning-Kruger experiment. We recruit 1000 people, give them each a skills test, and ask them to report a self-assessment. When the results are in, we have a look at the data.

It doesn’t look good.

When we plot individuals’ test score against their self assessment, the data appear completely random. Figure 7 shows the pattern. It seems that people of all abilities are equally terrible at predicting their skill. There is no hint of a Dunning-Kruger effect.

Figure 7: A failed replication. This figure shows the results of a thought experiment in which we try to replicate the Dunning-Kruger effect. We get 1000 people to take a skills test and to estimate their own ability. Here, we plot the raw data. Each point represents an individual’s result, with ‘actual test score’ on the horizontal axis, and ‘self assessment’ on the vertical axis. There is no hint of a Dunning-Kruger effect.
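
Here is a minimal Python sketch of this thought experiment (the sample size comes from the text; the seed, variable names, and 0–100 range are my own). Both ‘measurements’ are pure noise, so the raw data show no relationship:

    import numpy as np

    rng = np.random.default_rng(seed=42)
    n = 1000

    # 'Actual test score' and 'self assessment': both pure noise on a 0-100 scale
    score = rng.uniform(0, 100, size=n)
    self_assessment = rng.uniform(0, 100, size=n)

    # As in Figure 7, the raw data are uncorrelated
    print(np.corrcoef(score, self_assessment)[0, 1])  # roughly 0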

After looking at our raw data, we’re worried that we did something wrong. Many other researchers have replicated the Dunning-Kruger effect. Did we make a mistake in our experiment?

Unfortunately, we can’t collect more data. (We’ve run out of money.) But we can play with the analysis. A colleague suggests that instead of plotting the raw data, we calculate each person’s ‘self-assessment error’. This error is the difference between a person’s self assessment and their test score. Perhaps this assessment error relates to actual test score?

We run the numbers and, to our amazement, find an enormous effect. Figure 8 shows the results. It seems that unskilled people are massively overconfident, while skilled people are overly modest.

(Our lab tech points out that the correlation is surprisingly tight, almost as if the numbers were picked by hand. But we push this observation out of mind and forge ahead.)

Figure 8: Maybe the experiment was successful? Using the raw data from Figure 7, this figure calculates the ‘self-assessment error’ — the difference between an individual’s self assessment and their actual test score. This assessment error (vertical axis) correlates strongly with actual test score (horizontal axis).
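
Here is how the colleague’s manipulation looks in code — a sketch using the same simulated data as above:

    import numpy as np

    rng = np.random.default_rng(seed=42)
    score = rng.uniform(0, 100, size=1000)
    self_assessment = rng.uniform(0, 100, size=1000)

    # 'Self-assessment error' = self assessment minus actual test score
    error = self_assessment - score

    # The error correlates strongly (and negatively) with the test score,
    # because the score now sits on both sides of the comparison
    print(np.corrcoef(error, score)[0, 1])  # close to -0.71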

Buoyed by our success in Figure 8, we decide that the results may not be ‘bad’ after all. So we throw the data into the Dunning-Kruger chart to see what happens. We find that despite our misgivings about the data, the Dunning-Kruger effect was there all along. In fact, as Figure 9 shows, our effect is even bigger than the original (from Figure 2).

Figure 9: Recovering Dunning and Kruger. Despite the apparent lack of effect in our raw data (Figure 7), when we plug this data into the Dunning-Kruger chart, we get a massive effect. People who are unskilled over-estimate their abilities. And people who are skilled are too modest.
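
For completeness, here is a sketch of how the same random numbers generate the two diverging curves of the Dunning-Kruger chart. I convert both variables to percentile ranks, bin people into quartiles of actual score, and average each quartile’s ‘actual’ and ‘perceived’ percentiles (the helper function and binning details are mine):

    import numpy as np

    rng = np.random.default_rng(seed=42)
    n = 1000
    score = rng.uniform(0, 100, size=n)       # actual test score (noise)
    perceived = rng.uniform(0, 100, size=n)   # perceived ability (noise)

    def percentile_rank(v):
        # Rank each value on a 0-100 scale
        return 100 * v.argsort().argsort() / (len(v) - 1)

    score_pct = percentile_rank(score)
    perceived_pct = percentile_rank(perceived)

    # Place people into quartiles according to their actual test score
    quartile = np.digitize(score_pct, bins=[25, 50, 75])  # values 0..3

    for q in range(4):
        mask = quartile == q
        print(f"quartile {q + 1}: "
              f"actual = {score_pct[mask].mean():5.1f}, "
              f"perceived = {perceived_pct[mask].mean():5.1f}")

The ‘actual test score’ line climbs from roughly 12 to 88 by construction, while the ‘perceived ability’ line hovers near 50. Plotted together, they reproduce the classic Dunning-Kruger picture from nothing but noise.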

Things fall apart

Pleased with our successful replication, we start to write up our results. Then things fall apart. Riddled with guilt, our data curator comes clean: he lost the data from our experiment and, in a fit of panic, replaced it with random numbers. Our results, he confides, are based on statistical noise.

Devastated, we return to our data to make sense of what went wrong. If we have been working with random numbers, how could we possibly have replicated the Dunning-Kruger effect? To figure out what happened, we drop the pretense that we’re working with psychological data. We relabel our charts in terms of abstract variables x and y. By doing so, we discover that our apparent ‘effect’ is actually autocorrelation.

Figure 10 breaks it down. Our dataset consists of statistical noise — two random variables, x and y, that are completely unrelated (Figure 10A). When we calculated the ‘self-assessment error’, we took the difference between y and x. Unsurprisingly, we find that this difference correlates with x (Figure 10B). But that’s because x is autocorrelating with itself. Finally, we break down the Dunning-Kruger chart and realize that it too is based on autocorrelation (Figure 10C). It asks us to interpret the difference between y and x as a function of x. It’s the autocorrelation from panel B, wrapped in a more deceptive veneer.

Figure 10: Dropping the psychological pretense. This figure repeats the analysis shown in Figures 7–9, but drops the pretense that we’re dealing with human psychology. We’re working with random variables x and y that are drawn from a uniform distribution. Panel A shows that the variables are completely uncorrelated. Panel B shows that when we plot y – x against x, we get a strong correlation. But that’s because we have correlated x with itself. In panel C, we input these variables into the Dunning-Kruger chart. Again, the apparent effect amounts to autocorrelation — interpreting y – x as a function of x.

The point of this story is to illustrate that the Dunning-Kruger effect has nothing to do with human psychology. It is a statistical artifact — an example of autocorrelation hiding in plain sight.

What’s interesting is how long it took for researchers to realize the flaw in Dunning and Kruger’s analysis. Dunning and Kruger published their results in 1999. But it took until 2016 for the mistake to be fully understood. To my knowledge, Edward Nuhfer and colleagues were the first to exhaustively debunk the Dunning-Kruger effect. (See their joint papers in 2016 and 2017.) In 2020, Gilles Gignac and Marcin Zajenkowski published a similar critique.

Once you read these critiques, it becomes painfully obvious that the Dunning-Kruger effect is a statistical artifact. But to date, very few people know this fact. Collectively, the three critique papers have about 90 times fewer citations than the original Dunning-Kruger article.5 So it appears that most scientists still think that the Dunning-Kruger effect is a robust aspect of human psychology.6

No sign of Dunning-Kruger

The problem with the Dunning-Kruger chart is that it violates a fundamental principle in statistics. If you’re going to correlate two sets of data, they must be measured independently. In the Dunning-Kruger chart, this principle gets violated. The chart mixes test score into both axes, giving rise to autocorrelation.

Realizing this mistake, Edward Nuhfer and colleagues asked an interesting question: what happens to the Dunning-Kruger effect if it is measured in a way that is statistically valid? According to Nuhfer’s evidence, the answer is that the effect disappears.

Figure 11 shows their results. What’s important here is that people’s ‘skill’ is measured independently from their test performance and self assessment. To measure ‘skill’, Nuhfer groups individuals by their education level, shown on the horizontal axis. The vertical axis then plots the error in people’s self assessment. Each point represents an individual.

Figure 11: A statistically valid test of the Dunning-Kruger effect. This figure shows Nuhfer and colleagues’ 2017 test of the Dunning-Kruger effect. Similar to Figure 8, this chart plots people’s skill against their error in self assessment. But unlike Figure 8, here the variables are statistically independent. The horizontal axis measures skill using academic rank. The vertical axis measures self-assessment error as follows. Nuhfer takes a person’s score on the SLCI test (science literacy concept inventory test) and subtracts it from the person’s self assessment, called KSSLCI (knowledge survey of the SLCI test). Each black point indicates the self-assessment error of an individual. Green bubbles indicate means within each group, with the associated confidence interval. The fact that the green bubbles overlap the zero-effect line indicates that within each group, the averages are not statistically different from 0. In other words, there is no evidence for a Dunning-Kruger effect.

If the Dunning-Kruger effect were present, it would show up in Figure 11 as a downward trend in the data (similar to the trend in Figure 8). Such a trend would indicate that unskilled people overestimate their ability, and that this overestimate decreases with skill. Looking at Figure 11, there is no hint of a trend. Instead, the average assessment error (indicated by the green bubbles) hovers around zero. In other words, assessment bias is trivially small.

Although there is no hint of a Dunning-Kruger effect, Figure 11 does show an interesting pattern. Moving from left to right, the spread in self-assessment error tends to decrease with more education. In other words, professors are generally better at assessing their ability than are freshmen. That makes sense. Notice, though, that this increasing accuracy is different from the Dunning-Kruger effect, which is about systematic bias in the average assessment. No such bias exists in Nuhfer’s data.
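
As a rough sketch of what a statistically valid test looks like (this is not Nuhfer’s code or data; I simulate unbiased self-assessments purely for illustration): group people by an independently measured skill proxy, then check whether the mean self-assessment error within each group differs from zero.

    import numpy as np

    rng = np.random.default_rng(seed=0)
    groups = ["freshman", "sophomore", "junior", "senior", "graduate", "professor"]

    for i, group in enumerate(groups):
        n = 200
        # Simulated self-assessment error: mean zero (no bias), with the
        # spread shrinking as education (our independent skill proxy) rises
        error = rng.normal(loc=0, scale=30 - 4 * i, size=n)

        mean = error.mean()
        ci = 1.96 * error.std(ddof=1) / np.sqrt(n)  # rough 95% confidence interval
        print(f"{group:>10}: mean error = {mean:6.2f} +/- {ci:5.2f}")

    # If every confidence interval straddles zero, there is no evidence of
    # systematic over- or under-estimation -- that is, no Dunning-Kruger effect.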

Unskilled and unaware of it

Mistakes happen. So in that sense, we should not fault Dunning and Kruger for having erred. However, there is a delightful irony to the circumstances of their blunder. Here are two Ivy League professors7 arguing that unskilled people have a ‘dual burden’: not only are unskilled people ‘incompetent’ … they are unaware of their own incompetence.

The irony is that the situation is actually reversed. In their seminal paper, Dunning and Kruger are the ones broadcasting their (statistical) incompetence by mistaking autocorrelation for a psychological effect. In this light, the paper’s title may still be appropriate. It’s just that it was the authors (not the test subjects) who were ‘unskilled and unaware of it’.



This work is licensed under a Creative Commons Attribution 4.0 License. You can use/share it any way you want, provided you attribute it to me (Blair Fix) and link to Economics from the Top Down.


Notes

Cover image: Nevit Dilmen, altered.

  1. The Dunning-Kruger effect tells us nothing about the people it purports to measure. But it does tell us about the psychology of social scientists, who apparently struggle with statistics.↩︎

  2. It seems clear that Dunning and Kruger didn’t mean to be deceptive. Instead, it appears that they fooled themselves (and many others). On that note, I’m ashamed to say that I read Dunning and Kruger’s paper a few years ago and didn’t spot anything wrong. It was only after reading Jonathan Jarry’s blog post that I clued in. That’s embarrassing, because a major theme of this blog has been me pointing out how economists appeal to autocorrelation when they test their theories of value. (Examples here, here, here, here, and here.) I take solace in the fact that many scientists were similarly hoodwinked by the Dunning-Kruger chart.↩︎

  3. The conversion to percentiles introduces a second bias (in addition to the problem of autocorrelation). By definition, percentiles have a floor (0) and a ceiling (100), and are uniformly distributed between these bounds. If you are close to the floor, it is impossible for you to underestimate your rank. Therefore, the ‘unskilled’ will appear overconfident. And if you are close to the ceiling, you cannot overestimate your rank. Therefore, the ‘skilled’ will appear too modest. See Nuhfer et al (2016) for more details.↩︎

  4. In technical terms, Dunning and Kruger are plotting two different forms of ranking against each other — test-score ‘percentile’ against test-score ‘quartile’. What is not obvious is that this type of plot is data independent. By definition, each quartile contains 25 percentiles whose average corresponds to the midpoint of the quartile. The consequence of this truism is that the line labeled ‘actual test score’ tells us (paradoxically) nothing about people’s actual test score.↩︎

  5. According to Google scholar, the three critique papers (Nuhfer 2016, 2017 and Gignac and Zajenkowski 2020) have 88 citations collectively. In contrast, Dunning and Kruger (1999) has 7893 citations.↩︎

  6. The slow dissemination of ‘debunkings’ is a common problem in science. Even when the original (flawed) papers are retracted, they often continue to accumulate citations. And then there’s the fact that critique papers are rarely published in the same journal that hosted the original paper. So a flawed article in Nature is likely to be debunked in a more obscure journal. This asymmetry is partially why I’m writing about the Dunning-Kruger effect here. I think the critique raised by Nuhfer et al. (and Gignac and Zajenkowski) deserves to be well known.↩︎

  7. When Dunning and Kruger published their 1999 paper, they both worked at Cornell University.↩︎

Further reading

Gignac, G. E., & Zajenkowski, M. (2020). The Dunning-Kruger effect is (mostly) a statistical artefact: Valid approaches to testing the hypothesis with individual differences data. Intelligence, 80, 101449.

Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6), 1121.

Nuhfer, E., Cogan, C., Fleisher, S., Gaze, E., & Wirth, K. (2016). Random number simulations reveal how random noise affects the measurements and graphical portrayals of self-assessed competency. Numeracy: Advancing Education in Quantitative Literacy, 9(1).

Nuhfer, E., Fleisher, S., Cogan, C., Wirth, K., & Gaze, E. (2017). How random noise and a graphical convention subverted behavioral scientists’ explanations of self-assessment data: Numeracy underlies better alternatives. Numeracy: Advancing Education in Quantitative Literacy, 10(1).

90 comments

  1. Isn’t it precisely the surprising thing that self-assessment contains so little information about task performance? I certainly found it surprising to find out how bad I am at telling apart what I know and what I don’t know. A handcrafted dataset that doesn’t contain the DK effect would be one where people perfectly know how good they are, i.e. the noise y would always be 0. So the implicit assumption that “totally random” means “no DK effect” seems just wrong, and so do all the conclusions based on it. Interesting to learn that the noise decreases with education level, though! 🙂

    • I’m no expert in statistics, though I have a good grasp of math in general (I estimate 😁). I understand the point about coupling/correlating variables, but I was wondering if there is a simple layperson’s way to understand this: if someone scores say 80/100, and then estimates their score randomly, they have a 20% chance of overestimating. If someone only scores 20/100, a random estimate will have a 80% chance of overestimating. Is this the autocorrelation in simple terms? Ie a low score has a much greater “opportunity” to overestimate?

      • That’s definitely part of the issue. By definition, someone with a top score cannot overestimate their skill, and someone with a bottom score cannot underestimate it.

    • This is exactly right. The problem is that any deviation between estimated and true scores will yield this effect. The ability estimation line can only be equal to the ability quartile line (if people estimate with perfect accuracy, at least within quartiles) or flatter than the ability quartile line, if people estimate with error. The problem is that this kind of categorical plotting by quartiles is commonly taught in psych programs and would not be standard in econ programs, where folks would want to see actual scores, or bins thereof.

  2. If you get statistical noise in response to asking people how tall they think they are, then that noise means that tall people tend to underestimate their height while short people tend to overestimate their height.

    DKE is the presence of statistical noise where one would not expect statistical noise.

    Plus the chart doesn’t measure X against X, it measures (for example) perceived height vs actual height. Those are different things.

    And the revised test doesn’t measure skill but something like social status. It isn’t measuring people’s expected height against themselves as short vs tall, but as a function of something separate from their height.

    So for example the revised test just shows that people of different ages, where age is assumed to correlate with height, all similarly misguess their height. It doesn’t answer whether or not short people overestimate or underestimate their height. It answers whether or not old people overestimate or underestimate their height, which isn’t the question.

  3. “Here are two Ivy League professors…”

    While he may have worked at the university (as virtually every graduate student does), I can find no evidence that Kruger was ever a professor at Cornell. This paper was published the same year he graduated, so that would be remarkable if true.

  4. But sir, your article clearly shows that DKE is true. If, for example, people are good at estimating their level – you will get correlation on your figure 7, going as a line from 0 to 100. If you got statistical noise – it means people estimate their level incorrectly. For any test result you can take dots on a vertical axis, and dots ABOVE the diagonal line – are people that overestimate their level, and dots BELOW this line – are people that underestimate themselves. Your figure 7 clearly shows that the higher the result – the less people overestimate themselves, and the more people underestimate themselves. So, if DKE is true – results will be the same as on your figure 7, and if they are NOT true – there will be a correlation. You are extremely dumb, my sir.

  5. Some of this is right, but for the wrong reasons. Agreed, the bothering with quartiles and mixing it with percentiles in the first place is what seems absurd – Evidently the only reason to do it was to get the perfectly straight 45 degree line (I.e. Yes, the lowest scores broken into percents are also the lowest percentiles in a linear way – Thank you..?)

    They could’ve just plotted actual performance by perceived performance. Just ask the people after they take the test how well they believe they did on the test. Easy peasy. You might not get a nice clean trend, but you’ll definitely get something. And I expect there will be less delta at the high score range between perception and reality. Though there probably wouldn’t be much delta at the very low end either. Just off the cuff I suspect the largest deltas would hover somewhere around the middle.

  6. Here is a graphical interpretation of how DK results can be obtained directly from null hypotheses that assume individuals are equally accurate in estimating performance at all levels of actual performance.

    1) Extend the “Perceived Ability” line noting intercepts at about (0,55) and (100, 75).

    2) Null Hypothesis 1: Assume everyone believes that 25% of the answers they actually got right were wrong (a 25% type 1 error rate). That generates a line from (0,0) to (100,75).

    3) Null Hypothesis 2: Assume everyone believes that 55% of the answers they actually got wrong were right (a 55% type 2 error rate). That generates a line from (0,55) to (100,100).

    4) Assume both Hypothesis 1 and 2 are correct. This generates a straight line from (0,55) to (100,75). The slope and intercepts of “Perceived Ability” are a function of the type 1 and type 2 error rates. Any systematic effect of performance on perceived performance must be coded in the residuals, which are small.

  7. Ugh. I keep seeing this in my feeds presented as proof that the DK effect doesn’t exist, when what you’ve done is feed data demonstrating the DK effect into the calculations and seen the DK effect come back out.

    Could you please either take this post down, or at least add a preface explaining that this does not disprove DK? In its current form this article is misleading a lot of people.
