Why are charts important to statistical analysis? Why can't you rely solely on summary statistics and other calculated numeric outputs? After all, visualizing data takes a fair amount of work, and it's tempting to save time by treating charts as an afterthought. Few perspectives could be more misguided. Save yourself a headache and always take the time to visualize your data. Every time. No exceptions.
Illustrating Why Charts are Important
If you’ve heard of Anscombe’s Quartet, you know where this is going. It’s the “go-to” case study for people like me trying to get others on the Importance of Data Visualization train.
Frank Anscombe (1918 – 2001) was a British statistician who taught at Cambridge, and later at Princeton and Yale. Fun fact: his wife's sister was married to John Tukey! That's right, he was the brother-in-law of the guy who co-developed the Fast Fourier Transform!
Anyway, Frank eventually got into statistical computing in the early 1970s. That work crystallized his belief that charts are an indispensable part of the statistical analysis process. In Volume 27 of The American Statistician, Frank explained why.
Excerpt from Frank Anscombe’s Article in The American Statistician (1973)
Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:
- Numerical calculations are exact, but graphs are rough;
- For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;
- Performing intricate calculations is virtuous, whereas actually looking at the data is cheating.
A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.
Graphs can have various purposes, such as (i) to help us perceive and appreciate some broad features of the data, (ii) to let us look behind those broad features and see what else is there. Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct, and if they are wrong, we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.
Source: Anscombe, F. J. (1973). "Graphs in Statistical Analysis". The American Statistician, 27(1), 17–21. doi:10.2307/2682899. JSTOR 2682899.
To really drill Frank’s points home, let’s take a look at four datasets that have come to be known as Anscombe’s Quartet. (If you’d like to get your hands dirty in R, you can find a walkthrough here)
Anscombe’s Quartet and the Dangers of Not Charting
Take a look at the following four datasets. They’re straightforward, two-dimensional distributions. Nothing overtly strange here.
 X1     Y1    X2     Y2    X3     Y3    X4     Y4
 10   8.04    10   9.14    10   7.46     8   6.58
  8   6.95     8   8.14     8   6.77     8   5.76
 13   7.58    13   8.74    13  12.74     8   7.71
  9   8.81     9   8.77     9   7.11     8   8.84
 11   8.33    11   9.26    11   7.81     8   8.47
 14   9.96    14   8.10    14   8.84     8   7.04
  6   7.24     6   6.13     6   6.08     8   5.25
  4   4.26     4   3.10     4   5.39    19  12.50
 12  10.84    12   9.13    12   8.15     8   5.56
  7   4.82     7   7.26     7   6.42     8   7.91
  5   5.68     5   4.74     5   5.73     8   6.89
Generating Descriptive Statistics
Since the data is so clean and ready for analysis, the first thing you might do is run some quick descriptive stats in R. Here is the output.
Means
               Set  Mean_X    Mean_Y
1   Anscombe Set 1       9  7.500909
2   Anscombe Set 2       9  7.500909
3   Anscombe Set 3       9  7.500000
4   Anscombe Set 4       9  7.500909
Standard Deviations
               Set      SD_X      SD_Y
1   Anscombe Set 1  3.316625  2.031568
2   Anscombe Set 2  3.316625  2.031657
3   Anscombe Set 3  3.316625  2.030424
4   Anscombe Set 4  3.316625  2.030579
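If you'd rather not take my word for those numbers, they're easy to reproduce. Here's a quick sketch in Python using only the standard library (the variable names are mine; `statistics.stdev` computes the sample standard deviation, matching R's `sd()`):

```python
import statistics

# Anscombe's Quartet: sets 1-3 share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

sets = {1: (x123, y1), 2: (x123, y2), 3: (x123, y3), 4: (x4, y4)}

for name, (x, y) in sets.items():
    print(f"Anscombe Set {name}: "
          f"mean_x={statistics.mean(x):.6f} mean_y={statistics.mean(y):.6f} "
          f"sd_x={statistics.stdev(x):.6f} sd_y={statistics.stdev(y):.6f}")
```

Four datasets, four (nearly) identical sets of means and standard deviations.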
Performing a Correlation Analysis
At this point, you might be feeling good and decide to check correlations between x and y. You could even perform some regressions. Here’s what you’d get.
Correlations
               Set          r
1   Anscombe Set 1  0.8164205
2   Anscombe Set 2  0.8162365
3   Anscombe Set 3  0.8162867
4   Anscombe Set 4  0.8165214
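Those r values can also be checked directly from the definition of the Pearson correlation, covariance divided by the product of the spreads. A small Python sketch (the helper name `pearson_r` is mine):

```python
import statistics

# Anscombe's Quartet: sets 1-3 share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

sets = {1: (x123, y1), 2: (x123, y2), 3: (x123, y3), 4: (x4, y4)}

def pearson_r(x, y):
    # r = Sxy / sqrt(Sxx * Syy), using sums of squared deviations from the means
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

for name, (x, y) in sets.items():
    print(f"Anscombe Set {name}: r = {pearson_r(x, y):.7f}")
```

All four come out at roughly 0.816, which is exactly the trap.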
Running Some Linear Regressions
Based on the information above, it appears that these datasets have nearly identical descriptive stats. Going on that information alone, you might assume that they have roughly similar distributions. Let’s take a look.
ANSCOMBE SET 1

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x             0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6665,    Adjusted R-squared:  0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

ANSCOMBE SET 2

Residuals:
    Min      1Q  Median      3Q     Max
-1.9009 -0.7609  0.1291  0.9491  1.2691

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.001      1.125   2.667  0.02576 *
x              0.500      0.118   4.239  0.00218 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6662,    Adjusted R-squared:  0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

ANSCOMBE SET 3

Residuals:
    Min      1Q  Median      3Q     Max
-1.1586 -0.6146 -0.2303  0.1540  3.2411

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x             0.4997     0.1179   4.239  0.00218 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6663,    Adjusted R-squared:  0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

ANSCOMBE SET 4

Residuals:
   Min     1Q Median     3Q    Max
-1.751 -0.831  0.000  0.809  1.839

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0017     1.1239   2.671  0.02559 *
x             0.4999     0.1178   4.243  0.00216 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6667,    Adjusted R-squared:  0.6297
F-statistic: 18 on 1 and 9 DF,  p-value: 0.002165
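Notice that all four fits land on essentially the same line, y ≈ 3.00 + 0.50x. That's no accident: the least-squares estimates depend only on the sums of squares and cross-products, which Anscombe constructed to match. A quick Python check of the closed-form formulas (the helper name `ols_fit` is mine):

```python
import statistics

# Anscombe's Quartet: sets 1-3 share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

sets = {1: (x123, y1), 2: (x123, y2), 3: (x123, y3), 4: (x4, y4)}

def ols_fit(x, y):
    # Simple least squares: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

for name, (x, y) in sets.items():
    b0, b1 = ols_fit(x, y)
    print(f"Anscombe Set {name}: y = {b0:.4f} + {b1:.4f} * x")
```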
Charting the Data and Seeing Why Charts are Important
You’re smart people. You’ve probably realized I’m leading you down a path toward some big “Aha!” moment. Well, here it is. Even though those data sets have near-identical descriptive statistics, they have drastically different distributions.

[Figure: scatter plots of the four Anscombe datasets. 2020 Xyzology | Licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA).]
Yikes! Look at Those Distributions
It looks like we made some incorrect assumptions. Let’s break these distributions down one at a time:
- The first distribution (X1, Y1) is a straightforward linear relationship between two correlated variables, with residuals that look consistent with the normality assumption.
- The second distribution (X2, Y2) shows a clear relationship between the two variables, but it is certainly not linear. That Pearson correlation coefficient we calculated earlier is not relevant here at all. Ouch. Our linear regression is equally inappropriate, and we should have fit a nonlinear model instead.
- The third distribution (X3, Y3) has a linear relationship, but that outlier in the upper right quadrant is completely distorting our linear regression. A robust regression would have dealt with the outlier, but we might never think to use one unless we saw this plot first.
- The fourth distribution (X4, Y4) is entirely driven by that outlier way out in right field. It's hard to say whether x and y are related at all for the remaining points, yet that single outlier is enough to produce a high correlation coefficient for the entire set!
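Reproducing these four scatter plots takes only a few lines. Here's a minimal Python sketch, assuming matplotlib is available (the output file name is mine; shared axes make the differences in shape stand out):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Anscombe's Quartet: sets 1-3 share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

sets = {1: (x123, y1), 2: (x123, y2), 3: (x123, y3), 4: (x4, y4)}

# One panel per dataset, with shared x and y axes for a fair comparison
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes.flat, sets.items()):
    ax.scatter(x, y)
    ax.set_title(f"Anscombe Set {name}")
fig.tight_layout()
fig.savefig("anscombe_quartet.png")
```

Curved, tilted, and one-legged distributions, all hiding behind the same summary statistics.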
I think the point is evident here. Descriptive statistics and regressions are great tools, but you should never use them alone. Conversely, you should never rely solely on plots. It's when you use them together that you really get to leverage their complementary perspectives. So take the time and make those charts. It's hard to ensure accuracy without them.