Why are charts important to statistical analysis? Why can’t you rely only on summary stats and other calculated numeric outputs? After all, visualizing data takes a fair amount of work, and it’s tempting to save time and treat charts as an afterthought. Few perspectives could be more misguided. Save yourself a headache and take the time to visualize your data, every time. No exceptions.

## Illustrating Why Charts are Important

If you’ve heard of Anscombe’s Quartet, you know where this is going. It’s the “go-to” case study for people like me trying to get others on the *Importance of Data Visualization* train.

Frank Anscombe (1918 – 2001) was a British statistician who taught at Cambridge, and later Princeton and Yale. Fun fact: his wife’s sister was the wife of John Tukey! That’s right, he was the brother-in-law of the guy who co-developed the Fast Fourier Transform algorithm!

Anyway, Frank eventually got into statistical computing in the early 1970s. That work crystallized his belief that charts were an indispensable part of the statistical analytic process. In Vol. 27 of *The American Statistician*, Frank explained why.

*Excerpt from Frank Anscombe’s Article in The American Statistician (1973)*

*Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:*

- *Numerical calculations are exact, but graphs are rough;*
- *For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;*
- *Performing intricate calculations is virtuous, whereas actually looking at the data is cheating.*

*A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.*

*Graphs can have various purposes, such as (i) to help us perceive and appreciate some broad features of the data, (ii) to let us look behind those broad features and see what else is there. Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct, and if they are wrong, we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.*

**Source**: *Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. The American Statistician, Vol. 27, No. 1: 17–21. doi:10.2307/2682899. JSTOR 2682899.*

To really drill Frank’s points home, let’s take a look at four datasets that have come to be known as Anscombe’s Quartet. (If you’d like to get your hands dirty in R, you can find a walkthrough here)

## Anscombe’s Quartet and the Dangers of Not Charting

Take a look at the following four datasets. They’re straightforward, two-dimensional distributions. Nothing overtly strange here.

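| X1 | Y1 | X2 | Y2 | X3 | Y3 | X4 | Y4 |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 14 | 9.96 | 14 | 8.1 | 14 | 8.84 | 8 | 7.04 |
| 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 4 | 4.26 | 4 | 3.1 | 4 | 5.39 | 19 | 12.5 |
| 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
| 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
| 5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |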

### Generating Descriptive Statistics

Since the data is so clean and ready for analysis, the first thing you might do is run some quick descriptive stats in R. Here is the output.

| Set Means | Mean_X | Mean_Y |
|---|---:|---:|
| Anscombe Set 1 | 9 | 7.500909 |
| Anscombe Set 2 | 9 | 7.500909 |
| Anscombe Set 3 | 9 | 7.500000 |
| Anscombe Set 4 | 9 | 7.500909 |

| Set Std. Devs. | SD_X | SD_Y |
|---|---:|---:|
| Anscombe Set 1 | 3.316625 | 2.031568 |
| Anscombe Set 2 | 3.316625 | 2.031657 |
| Anscombe Set 3 | 3.316625 | 2.030424 |
| Anscombe Set 4 | 3.316625 | 2.030579 |
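If you'd rather follow along without R, here is a minimal sketch of the same summary in Python, using only the standard library. The numbers above came from R; the variable names `x123`, `quartet`, and `summary` are my own choices for illustration, and the data is transcribed from the table earlier in this post.

```python
from statistics import mean, stdev

# Anscombe's Quartet, transcribed from the table in this post.
# Sets 1 to 3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Anscombe Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Anscombe Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Anscombe Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Anscombe Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                       [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Sample mean and sample standard deviation (n - 1 denominator, like R's sd()).
summary = {
    name: (mean(xs), stdev(xs), mean(ys), stdev(ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mean_x, sd_x, mean_y, sd_y) in summary.items():
    print(f"{name}: mean_x={mean_x:.6f} sd_x={sd_x:.6f} "
          f"mean_y={mean_y:.6f} sd_y={sd_y:.6f}")
```

The four rows should agree with the tables above to the printed precision.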

### Performing a Correlation Analysis

At this point, you might be feeling good and decide to check correlations between x and y, and maybe even perform some regressions. Here are the correlations you’d get.

| Correlations | r |
|---|---:|
| Anscombe Set 1 | 0.8164205 |
| Anscombe Set 2 | 0.8162365 |
| Anscombe Set 3 | 0.8162867 |
| Anscombe Set 4 | 0.8165214 |
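For the curious, the Pearson correlation coefficient is simple enough to compute by hand. The sketch below is a plain Python version written from the textbook formula, not the R code behind the numbers above; `pearson_r` and `correlations` are names I made up for this example.

```python
from math import sqrt

# Anscombe's Quartet, transcribed from the table in this post.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Anscombe Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Anscombe Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Anscombe Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Anscombe Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                       [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.7f}")
```

Notice that the formula only ever sees sums of products and squares. Nothing in it knows what shape the points make, which is worth keeping in mind for what comes next.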

### Running Some Linear Regressions

Based on the information above, it appears that these datasets have nearly identical descriptive stats. Going on that information alone, you might assume that they have roughly similar distributions. Let’s see whether a linear regression on each set tells the same story.

    ANSCOMBE SET 1
    Residuals:
         Min       1Q   Median       3Q      Max
    -1.92127 -0.45577 -0.04136  0.70941  1.83882
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   3.0001     1.1247   2.667  0.02573 *
    x             0.5001     0.1179   4.241  0.00217 **
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 1.237 on 9 degrees of freedom
    Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
    F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217

    ANSCOMBE SET 2
    Residuals:
        Min      1Q  Median      3Q     Max
    -1.9009 -0.7609  0.1291  0.9491  1.2691
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    3.001      1.125   2.667  0.02576 *
    x              0.500      0.118   4.239  0.00218 **
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 1.237 on 9 degrees of freedom
    Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
    F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179

    ANSCOMBE SET 3
    Residuals:
        Min      1Q  Median      3Q     Max
    -1.1586 -0.6146 -0.2303  0.1540  3.2411
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   3.0025     1.1245   2.670  0.02562 *
    x             0.4997     0.1179   4.239  0.00218 **
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 1.236 on 9 degrees of freedom
    Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
    F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176

    ANSCOMBE SET 4
    Residuals:
       Min     1Q Median     3Q    Max
    -1.751 -0.831  0.000  0.809  1.839
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   3.0017     1.1239   2.671  0.02559 *
    x             0.4999     0.1178   4.243  0.00216 **
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 1.236 on 9 degrees of freedom
    Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
    F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
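The headline numbers in those four summaries (intercept, slope, and R-squared) can be reproduced with a few lines of ordinary least squares. This is a bare-bones Python sketch, not a stand-in for R's full regression summary: it skips the standard errors, t values, and p-values, and `fit_line` and `fits` are hypothetical names used only here.

```python
# Anscombe's Quartet, transcribed from the table in this post.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Anscombe Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Anscombe Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Anscombe Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Anscombe Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                       [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R-squared."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson r for a one-predictor fit
    return intercept, slope, r_squared


fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (intercept, slope, r_squared) in fits.items():
    print(f"{name}: y = {intercept:.4f} + {slope:.4f} * x, R-squared = {r_squared:.4f}")
```

All four sets land on essentially the same line, roughly y = 3 + 0.5x, with about two thirds of the variance explained.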

### Charting the Data and Seeing Why Charts are Important

You’re smart people. You’ve probably realized I’m leading you down a path toward some big “Aha!” moment. Well, here it is. Even though those data sets have near-identical descriptive statistics, they have drastically different distributions.
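You don't need fancy tooling to see it for yourself. The sketch below draws each set as a crude text scatter plot using nothing but the Python standard library. In practice you would reach for a real plotting library (or R, as in the companion walkthrough); `ascii_scatter`, its axis limits, and the grid size are all arbitrary choices I made for this illustration.

```python
# Anscombe's Quartet, transcribed from the table in this post.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Anscombe Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Anscombe Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Anscombe Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Anscombe Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                       [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ascii_scatter(xs, ys, x_max=20, y_max=14, width=41, height=15):
    """Return a text scatter plot with axes running from 0 to x_max and 0 to y_max."""
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round(x / x_max * (width - 1))
        row = round(y / y_max * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so larger y is drawn higher up
    body = ["|" + "".join(cells) for cells in grid]
    return "\n".join(body + ["+" + "-" * width])


# Same axis limits for every set, so the four plots are directly comparable.
plots = {name: ascii_scatter(xs, ys) for name, (xs, ys) in quartet.items()}
for name, plot in plots.items():
    print(name)
    print(plot)
    print()
```

Points that round to the same character cell overlap, so treat this as a rough look at the shapes rather than a precise chart.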

## Yikes! Look at Those Distributions

It looks like we made some incorrect assumptions. Let’s break these distributions down one at a time:

**The first distribution (X1, Y1)** is a straightforward linear relationship: two correlated variables with roughly normal scatter around the line, which is just what a simple linear regression assumes.

**The second distribution (X2, Y2)** shows a clear relationship between the two variables, but it is certainly not linear. The Pearson correlation coefficient we calculated earlier measures linear association, so it tells us very little here. Ouch. Our linear regression is also not appropriate, and we should have tried a different model instead.

**The third distribution (X3, Y3)** has a linear relationship, but a single outlier near the top of the plot (x = 13, y = 12.74) is pulling our fitted line away from the other ten points. A robust regression would have dealt with the outlier, but we might not think of using it unless we saw this plot first.

**The fourth distribution (X4, Y4)** is entirely skewed by that outlier way out in Right Field. Every other point sits at x = 8, so those ten points on their own can’t tell us anything about how y changes with x, yet that one outlier is enough to produce a high correlation coefficient for the entire set!
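Once a plot has told you where to look, a few quick checks can back up what your eyes are seeing. The Python sketch below is one way to do that, under some assumptions of mine: it uses the Theil-Sen estimator (median of pairwise slopes) as the robust regression for set 3, which is only one of several robust methods and not necessarily the one you would pick, and the helper names `ols` and `theil_sen` are invented for this example.

```python
from itertools import combinations
from statistics import median

# Sets 2, 3 and 4 of Anscombe's Quartet, transcribed from the table in this post.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]


def ols(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def theil_sen(xs, ys):
    """Theil-Sen estimator: median of pairwise slopes, then median intercept."""
    points = list(zip(xs, ys))
    slopes = [(yb - ya) / (xb - xa)
              for (xa, ya), (xb, yb) in combinations(points, 2) if xa != xb]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept


# Set 2: signs of the straight-line residuals, ordered by increasing x.
# A run of minus, plus, minus is the signature of a curve being fit with a line.
slope2, intercept2 = ols(x123, y2)
residual_signs = "".join(
    "+" if y - (intercept2 + slope2 * x) > 0 else "-"
    for x, y in sorted(zip(x123, y2))
)
print("Set 2 residual signs:", residual_signs)

# Set 3: least squares versus a robust fit that mostly ignores the one outlier.
ols3 = ols(x123, y3)
robust3 = theil_sen(x123, y3)
print(f"Set 3 least squares slope: {ols3[0]:.3f}")
print(f"Set 3 Theil-Sen slope:     {robust3[0]:.3f}")

# Set 4: drop the x = 19 point and see how many distinct x values remain.
distinct_x4 = {x for x in x4 if x != 19}
print("Set 4 distinct x values without the outlier:", distinct_x4)
```

None of these checks would have occurred to me before looking at the plots, which is rather the point.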

I think the point is evident here. Descriptive statistics and regressions are great tools, but you should never use them alone. Conversely, you should never solely rely on plots. It’s when you use them together that you really get to leverage their power and perspectives. So take the time and make those charts. It’s hard to ensure accuracy without them.
