
Charting Data is a Vital Part of Statistical Analysis

Why are charts important to statistical analysis? Why can’t you rely solely on summary stats and other calculated numeric outputs? After all, visualizing data takes a fair amount of work. It’s tempting to save time and treat charts as an afterthought. Few perspectives could be more misguided. Save yourself a headache and always take the time to visualize your data. Every time. No exceptions.

Illustrating Why Charts are Important

If you’ve heard of Anscombe’s Quartet, you know where this is going. It’s the “go-to” case study for people like me trying to get others on the Importance of Data Visualization train.

Francis “Frank” Anscombe (1918–2001) was a British statistician who taught at Cambridge, and later Princeton and Yale. Fun fact: his wife’s sister was the wife of John Tukey! That’s right, he was the brother-in-law of the statistician who co-developed the fast Fourier transform (FFT) algorithm!

Anyway, Frank eventually got into statistical computing in the early 1970s. That work crystallized his belief that charts are an indispensable part of the statistical analysis process. In Volume 27 of The American Statistician, he explained why.


Excerpt from Frank Anscombe’s Article in The American Statistician (1973)

Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions: 

  1. Numerical calculations are exact, but graphs are rough; 
  2. For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;
  3. Performing intricate calculations is virtuous, whereas actually looking at the data is cheating. 

A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.

Graphs can have various purposes, such as (i) to help us perceive and appreciate some broad features of the data, (ii) to let us look behind those broad features and see what else is there. Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct, and if they are wrong, we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. The American Statistician, 27(1), 17–21. doi:10.2307/2682899. JSTOR 2682899.


To really drill Frank’s points home, let’s take a look at the four datasets that have come to be known as Anscombe’s Quartet. (If you’d like to get your hands dirty in R, the snippets below will get you started.)

Anscombe’s Quartet and the Dangers of Not Charting

Take a look at the following four datasets. They’re straightforward, two-dimensional sets of (x, y) pairs. Nothing overtly strange here.

X1     Y1        X2     Y2       X3     Y3       X4     Y4

10     8.04      10     9.14     10     7.46      8     6.58
 8     6.95       8     8.14      8     6.77      8     5.76
13     7.58      13     8.74     13    12.74      8     7.71
 9     8.81       9     8.77      9     7.11      8     8.84
11     8.33      11     9.26     11     7.81      8     8.47
14     9.96      14     8.10     14     8.84      8     7.04
 6     7.24       6     6.13      6     6.08      8     5.25
 4     4.26       4     3.10      4     5.39     19    12.50
12    10.84      12     9.13     12     8.15      8     5.56
 7     4.82       7     7.26      7     6.42      8     7.91
 5     5.68       5     4.74      5     5.73      8     6.89

Generating Descriptive Statistics

Since the data is so clean and ready for analysis, the first thing you might do is run some quick descriptive stats in R.
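As it happens, the quartet ships with base R as the built-in anscombe data frame (columns x1 through x4 and y1 through y4), so a minimal sketch along these lines reproduces the numbers:

# Mean and standard deviation of each x/y pair in the built-in data
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  cat(sprintf("Anscombe Set %d: mean_x = %.1f, mean_y = %.6f, sd_x = %.6f, sd_y = %.6f\n",
              i, mean(x), mean(y), sd(x), sd(y)))
}

Here is the output.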

Set Means            Mean_X     Mean_Y
1 Anscombe Set 1          9   7.500909
2 Anscombe Set 2          9   7.500909
3 Anscombe Set 3          9   7.500000
4 Anscombe Set 4          9   7.500909

Set Std. Devs.         SD_X        SD_Y
1 Anscombe Set 1   3.316625    2.031568
2 Anscombe Set 2   3.316625    2.031657
3 Anscombe Set 3   3.316625    2.030424
4 Anscombe Set 4   3.316625    2.030579

Performing a Correlation Analysis

At this point, you might be feeling good and decide to check the correlation between x and y in each set. You could even perform some regressions.
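A one-liner per set with cor() does the trick (again, a sketch against the built-in data frame):

# Pearson correlation coefficient for each of the four sets
for (i in 1:4) {
  r <- cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  cat(sprintf("Anscombe Set %d: r = %.7f\n", i, r))
}

Here’s what you’d get.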

Correlations               r
 1 Anscombe Set 1   0.8164205
 2 Anscombe Set 2   0.8162365
 3 Anscombe Set 3   0.8162867
 4 Anscombe Set 4   0.8165214

Running Some Linear Regressions

Based on the information above, it appears that these datasets have nearly identical descriptive stats. Going on that information alone, you might assume that they have roughly similar distributions. Let’s take a look.
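Fitting all four models takes just a few lines with lm() (a sketch, same built-in data frame):

# Fit y ~ x separately for each set and print the four model summaries
for (i in 1:4) {
  f <- as.formula(paste0("y", i, " ~ x", i))
  print(summary(lm(f, data = anscombe)))
}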

 ANSCOMBE SET 1

 Residuals:
      Min       1Q   Median       3Q      Max 
 -1.92127 -0.45577 -0.04136  0.70941  1.83882 
 
 Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
 (Intercept)   3.0001     1.1247   2.667  0.02573 * 
 x             0.5001     0.1179   4.241  0.00217 **

 
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Residual standard error: 1.237 on 9 degrees of freedom
 Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
 F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

 ANSCOMBE SET 2

 Residuals:
     Min      1Q  Median      3Q     Max 
 -1.9009 -0.7609  0.1291  0.9491  1.2691 
 
 Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
 (Intercept)    3.001      1.125   2.667  0.02576 * 
 x              0.500      0.118   4.239  0.00218 **
 
 
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
 Residual standard error: 1.237 on 9 degrees of freedom
 Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
 F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

 ANSCOMBE SET 3

 Residuals:
     Min      1Q  Median      3Q     Max 
 -1.1586 -0.6146 -0.2303  0.1540  3.2411 
 
 Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
 (Intercept)   3.0025     1.1245   2.670  0.02562 * 
 x             0.4997     0.1179   4.239  0.00218 **


 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
 Residual standard error: 1.236 on 9 degrees of freedom
 Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
 F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

 ANSCOMBE SET 4

 Residuals:
    Min     1Q Median     3Q    Max 
 -1.751 -0.831  0.000  0.809  1.839 
 
 Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
 (Intercept)   3.0017     1.1239   2.671  0.02559 * 
 x             0.4999     0.1178   4.243  0.00216 **


 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Residual standard error: 1.236 on 9 degrees of freedom
 Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
 F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

Charting the Data and Seeing Why Charts are Important

You’re smart people. You’ve probably realized I’m leading you down a path toward some big “Aha!” moment. Well, here it is. Even though those data sets have near-identical descriptive statistics, they have drastically different distributions.

Anscombe’s Quartet: four data sets with near-identical summary statistics, but drastically different distributions.
© 2020 Xyzology | This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC-BY-SA).
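If you want to reproduce the figure yourself, here’s a minimal base-R sketch (the axis limits and styling are my own choices, not the original chart’s):

# Draw all four sets on a shared 2 x 2 grid with common axis limits
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, pch = 19, xlim = c(3, 20), ylim = c(3, 13),
       main = paste("Anscombe Set", i))
  abline(lm(y ~ x), col = "red")  # near-identical fitted line in every panel
}
par(op)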

Yikes! Look at Those Distributions

It looks like we made some incorrect assumptions. Let’s break these distributions down one at a time:

The first distribution (X1, Y1) shows a straightforward linear relationship between two correlated variables, with scatter consistent with the usual normality assumption.

The second distribution (X2, Y2) shows a clear relationship between the two variables, but it is certainly not linear. The Pearson correlation coefficient we calculated earlier measures only linear association, so it is not relevant here at all. Ouch. Our linear regression is equally inappropriate; a model that can capture the curvature (a quadratic fit, for instance) would have been a better choice.

The third distribution (X3, Y3) has a clean linear relationship, but that single outlier in the upper-right quadrant is dragging our linear regression off course. A robust regression would have dealt with the outlier (see the sketch after this list), but we might never think to use one unless we saw this plot first.

The fourth distribution (X4, Y4) is entirely driven by that one outlier way out in Right Field. With every other x value fixed at 8, it’s hard to say whether x and y are related at all, yet that single point is enough to produce a high correlation coefficient for the entire set!
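For the record, here’s a quick sketch of what that robust fit for Set 3 could look like, using rlm() from the MASS package that ships with R (the choice of estimator is my own, not the article’s):

library(MASS)  # provides rlm(), a robust M-estimation alternative to lm()

ols    <- lm(y3 ~ x3, data = anscombe)
robust <- rlm(y3 ~ x3, data = anscombe)  # Huber M-estimation by default

coef(ols)     # slope ~ 0.50, dragged upward by the single outlier
coef(robust)  # slope pulled back toward the line the other ten points follow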

I think the point is evident here. Descriptive statistics and regressions are great tools, but you should never use them alone. Conversely, you should never rely solely on plots. It’s when you use them together that you really get to leverage their power and perspectives. So take the time and make those charts. It’s hard to ensure accuracy without them.

