Thinking Critically About Statistics and Their Sources
In the sciences, we use theory and methods to empirically assess “reality”. While we can often play with data to explore the relationships between our concepts (our variables), it is important to frame what we’re doing with good theory.
An interesting graph has made its rounds through social media lately. It shows a strong relationship between Internet Explorer market share and murders in the U.S.
I first encountered this graph on Facebook when a friend sent it to me so I could use it in my statistics classes. When I post graphs or other forms of data, I like to include the source so that students can assess the veracity of the data and whether or not to trust that it is accurate.
In my search for its source, I thought it was first posted on Twitter (1/21/2013) but then traced it to Reddit and Imgur (1/18/2013) – which then cite each other as the source. Gizmodo picked it up on 1/22/2013 and it reached me in April 2013.
The comments on the Gizmodo site provide other examples of such spurious relationships: telephone poles and rapes, temperature and number of pirates, and other nonsensical pairs that co-exist but are not directly related.
These graphs and other relationships show clearly how correlation does not equal causation. Showing an apparent relationship with statistics may illustrate a correlation, yet such a relationship does not prove that one variable causes a change in the other or that they are causally related.
This chart is a great example of how data may show similar trends – which we may interpret as a relationship – although there is no reason or logic as to why a relationship might exist between them. The relationships could be spurious or simply coincidence that they have the same trend line. Or perhaps this apparent relationship could be due to the relationship of these two variables to other variables that are unmeasured in this analysis.
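To see how easily a shared trend produces a strong correlation, here is a minimal sketch in Python. The data are invented for illustration (they are not the numbers behind the graph), and the names series_a and series_b are hypothetical. Each series declines over time and has its own independent random noise, so by construction nothing links them except the passage of time. The sketch compares the correlation of the raw series with the correlation of their step-to-step changes, which removes the steady trend.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def differences(xs):
    """Step-to-step changes; a steady linear trend becomes a constant."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable
steps = range(200)

# Invented data: both series fall steadily, each with its OWN independent noise.
series_a = [100 - 0.5 * t + rng.gauss(0, 3) for t in steps]
series_b = [50 - 0.2 * t + rng.gauss(0, 2) for t in steps]

r_levels = pearson(series_a, series_b)
r_changes = pearson(differences(series_a), differences(series_b))

print(f"correlation of the raw series:       {r_levels:.2f}")
print(f"correlation of step-to-step changes: {r_changes:.2f}")
```

The raw series should correlate strongly simply because both go down over time, while the changes from one step to the next should show little or no relationship. Differencing is only one way to remove a trend, and it is used here because it is easy to read.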
In any case, I could not find the source of the data used to create the graph – thus this “data” could be fabricated and it is most likely false. I did find a site – geek.com – that posted a graph of “real” data, in which the trend lines for these two variables are not as similar.
They did not, however, include a source for the data in the post. At the end of their post, they included a link to the Twitter post mentioned previously and to a blog. Their graph came from that blog, and the data – cited clearly and with links – came from Wikipedia and w3schools.com. Some of the Wikipedia data is attributed to the Bureau of Justice Statistics, but it is unclear what the source was on the “Crime in the United States” Wikipedia page. w3schools is a web development site that logged its own tally of browsers used. Used how? That isn’t clear.
Is murder rate the same as homicide rate? Is Internet Explorer market share the same as browser usage? These are just some of the problems with how these concepts were defined and measured.
OK, then. What do we know for sure here? Is murder rate closely linked with Internet Explorer usage (or market share) or not?
Is there a good reason to find this out? Will it tell us anything about those two phenomena?
Generally, this question raises the importance of theory to guide our research and statistical analysis. Theory provides the reasons why things in society may be connected and helps us understand how those connections might work.
You may notice that published, high-quality research starts with a review of theory and the literature to assess what previous research has discovered about the phenomena in question. We use that to guide how we investigate “reality”: we might replicate a study to check that its findings can be found again, or we might pursue a new angle.
In any case, there must be some logical reason why we connect two variables – why might they be related and how might that work? To do that, we need theory and clear definitions of what we’re looking at (or for) and valid and reliable data. We might find that something as simple as population density affects both of the original variables.
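The idea that a third variable might drive both of the original ones can be sketched with invented numbers. In the Python example below, a hypothetical density value feeds into two outcomes, outcome_x and outcome_y, that have no direct link to each other. All names and values are assumptions made up for illustration, not real data about population density, browsers, or crime. The sketch computes the ordinary correlation between the two outcomes and then the first-order partial correlation, which asks how related they are once the third variable is held constant.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for a third variable z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(1)  # fixed seed so the run is repeatable

# Invented data: density influences both outcomes; the outcomes never
# influence each other.
density = [rng.uniform(0, 10) for _ in range(500)]
outcome_x = [2 * d + rng.gauss(0, 2) for d in density]
outcome_y = [3 * d + rng.gauss(0, 3) for d in density]

r_xy = pearson(outcome_x, outcome_y)
r_xz = pearson(outcome_x, density)
r_yz = pearson(outcome_y, density)
r_xy_given_z = partial_corr(r_xy, r_xz, r_yz)

print(f"correlation of x and y:                 {r_xy:.2f}")
print(f"correlation of x and y, given density:  {r_xy_given_z:.2f}")
```

The two outcomes should look strongly related until the third variable is taken into account, at which point the relationship should largely disappear. Of course, we can only control for a variable if theory has told us to measure it in the first place.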
If we cannot find the source of the data or details of the research, we should not trust that the illustrated relationship is real and accurate.
Have you heard of other relationships that sound odd or illogical? If so, has research uncovered the reason(s) why they might be related? Was this relationship established by empirical data and is its source reliable?