**Statistics for journalists: what experts think we should learn to work with numbers**

I have been chasing statisticians for the last three months for the MA in Data Journalism I am finishing at Birmingham City University.

They have explained to me measures such as the odds ratio, the caveats of surveys and polls, methods to control for population size, and models for bringing probability into the news (yes, I owe them many coffees).

And we have also talked about statistical literacy in newsrooms.

The bad news is that we have work to do, but the good news is that there is less maths involved than we think.

Here is some advice that they have given me.

**1.** **Know your data**

Professor **Josu Mezo Aranzibia** teaches statistics to data journalists and also runs the blog Malaprensa, where he flags errors in the (Spanish) news. Few of those mistakes have to do with getting the maths wrong; most stem from **flaws in the understanding of the data and the concepts**.

Besides remembering basic statistics from high school, journalists should examine the data and know what it means: how a term is defined, how the data was collected, who produced the number… In short, understanding the information they are working with.

My conversation with Professor **Kevin McConway**, who has been involved with the BBC Radio 4 programme *More or Less*, began along the same lines.

I wouldn’t start with detailed methods, but with the way of thinking. Questions like: how was the dataset produced? Don’t take the number at face value. What is the motivation of the source?… I’ve found journalists more reluctant to ask questions about numbers.

And **Ana Kolar**, a statistical consultant, even suggested reporting the definitions of the measures in the story.

For instance, including that by unemployment we refer to people registered in the system. “Consequently, **the unemployment statistics are typically underestimated**,” she explained to me, because “those unemployed but not willing to follow the procedures are excluded from the register.”

**2.** **Challenge the data**

In the book *Crime Statistics in the News*, **Jairo Lugo-Ocando** warns that journalists tend to **use statistics as an element of “objectivity.”** As a consequence, we don’t challenge the figures or cross-reference them for validation; we trust them just because they are numbers.

But think of the limitations of the data: its quality, its collection methods, what it measures and what it does not, the purpose for gathering it, whether there is a target attached to the measure…

Michael Blastland and Andrew Dilnot wrote in *The Tiger That Isn’t*:

The mechanics of counting are anything but mechanical. To understand numbers in life, start with flesh and blood. It is people who count.

And don’t cherry-pick to confirm your biases. When talking about causal thinking, that is, understanding which causal questions the data can answer, Ana Kolar highlighted:

Data journalists should always seek accuracy in data insights even if the obtained insights do not support their story. By trying to understand the insights rather than bending them, we can discover new knowledge.

**3.** **Report absolute risk**

**Ben Goldacre** says in *Bad Science:*

When reporting on a risk: I want to know who you are talking about (e.g. men in their fifties); I want to know what the baseline risk is (e.g. four men out of a hundred will have a heart attack over ten years); and I want to know what the increase in risk is, as a natural frequency (two extra men out of that hundred will have a heart attack over ten years). I also want to know exactly what’s causing that increase in the risk.
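Goldacre’s checklist boils down to simple arithmetic. A minimal sketch, using his example’s numbers (the function name is mine, not from *Bad Science*):

```python
def natural_frequencies(baseline_risk, relative_increase, per=100):
    """Turn a baseline risk and a relative risk increase into cases per `per` people."""
    baseline_cases = baseline_risk * per                   # e.g. 4 out of 100
    extra_cases = baseline_risk * relative_increase * per  # e.g. 2 extra out of 100
    return baseline_cases, extra_cases

# Goldacre's example: 4 in 100 men in their fifties have a heart attack over
# ten years; a "50% increased risk" means 2 extra men out of that hundred.
base, extra = natural_frequencies(0.04, 0.50)
print(f"{base:.0f} in 100 at baseline, {extra:.0f} extra with the exposure")
```

Reporting “2 extra men in 100” instead of “50% higher risk” is exactly the natural-frequency framing he asks for.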

**4.** **Familiarise yourself with statistical inference**

This is important, especially when reporting on a scientific paper or dealing with surveys and polls.

These are usually more advanced models, but statisticians don’t mean that we should learn them, just “**know what these methods can tell you and what they cannot**,” McConway said.

In other words, understanding when the conclusions of surveys and polls can be generalised to the whole population, and what a random and representative sample is. “Not knowing this is a classic journalists’ mistake,” Josu Mezo told me.

Add to your list: **the margin of error, confidence intervals and the difference between randomised experiments and observational studies**.
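The margin of error is one of the few items on that list you can compute in a line. A sketch using the standard formula for a proportion from a simple random sample (the function name is mine):

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# A poll of 1,000 respondents at 50% support: the often-quoted "plus or minus 3 points"
moe = margin_of_error(0.5, 1000)
print(f"±{moe * 100:.1f} percentage points")
```

Note this is the textbook case; real polls use weighting and quota designs, so their effective margins are usually wider than this formula suggests.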

**5.** **Get comfortable with uncertainty**

That sounds paradoxical, but “statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty,” wrote **Marie Davidian and Thomas Louis**, statisticians and members of the American Statistical Association.

And **Victor Cohn** said in the magazine *Significance*: “scientists deal with uncertainty by invoking probability.”

**6.** **Invest some time in descriptive statistics**

These are the methods data journalists use the most, so it makes sense to go deep into measures of central tendency and dispersion, and the shapes of data distributions.
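These measures fit in a few lines of Python’s standard library. A sketch with invented salary figures, chosen so that the skew shows why the choice of “average” matters:

```python
import statistics

# Made-up salaries (in £1,000s) with one very high earner
salaries = [18, 20, 22, 25, 25, 28, 30, 35, 40, 250]

mean = statistics.mean(salaries)      # pulled upwards by the outlier
median = statistics.median(salaries)  # robust measure of the typical value
spread = statistics.stdev(salaries)   # dispersion around the mean

print(mean, median, spread)  # the mean (49.3) is nearly double the median (26.5)
```

With skewed data like incomes or house prices, reporting the mean alone can mislead; the median plus a measure of spread tells the reader far more.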

In particular, sociology Professor **Julián Cárdenas** recommended learning about regression analysis.

I think journalism is mainly descriptive, not explanatory. Explanations are based on anecdotes and not on statistical analyses, even when the data is available. (…) More advanced statistical models that show relationships would facilitate the watchdog role of journalism.

But, “be aware that regressions are correlation-based tools, and that **correlation does not imply causation**,” Kolar warned.
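Kolar’s warning is easy to demonstrate. A sketch with Pearson’s correlation coefficient implemented from first principles, on invented figures for the classic ice-cream-and-drownings example (both rise with summer weather; neither causes the other):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Invented monthly figures: near-perfect correlation, zero causation
ice_cream_sales = [12, 18, 25, 33, 41, 48]
drownings = [1, 2, 3, 4, 5, 6]
print(round(pearson_r(ice_cream_sales, drownings), 2))
```

The coefficient here is close to 1, yet banning ice cream would save no one: a lurking variable (temperature) drives both series, which is precisely what a regression on its own cannot tell you.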

**7.** **Watch out for small populations**

**Elizabeth McLaren**, a statistician from the Office for National Statistics (ONS), told me:

Small numbers are a problem for us for topics like infant deaths and stillbirths. There is standard GSS notation to indicate when rates are unreliable (u symbol), but I suspect this isn’t understood by journalists. We sometimes aggregate three years of data, but people do like to have figures for individual years even if not robust.

McConway also mentioned that change over time in figures like these is not a “good basis for comparisons; producing confidence intervals would be a good practice.”

From **Ben Goldacre and Paul Barden**, I learned how to use the funnel plot in stories that involve comparisons across populations of different sizes.
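The idea behind the funnel plot is that control limits widen as the population shrinks. A sketch of those limits for a rate, using a normal approximation (real funnel plots may use exact binomial or Poisson limits; the function name and figures are mine):

```python
import math

def funnel_limits(overall_rate, n, z=1.96):
    """Approximate 95% control limits for a rate observed in a population of size n."""
    se = math.sqrt(overall_rate * (1 - overall_rate) / n)
    return overall_rate - z * se, overall_rate + z * se

# The "funnel": limits are wide for small areas and tighten as n grows,
# so an extreme rate in a tiny population may not be remarkable at all.
for n in (50, 500, 5000):
    lo, hi = funnel_limits(0.10, n)
    print(f"n={n:>5}: {lo:.3f} to {hi:.3f}")
```

An area whose rate sits inside its funnel is consistent with chance variation, which is why “the town with the highest cancer rate” is so often simply the smallest town.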

**8.** **Understand basic probability**

As with statistical inference, statisticians don’t expect us to build probabilistic models, “but it could be good if they would know something about that, **so that they can, at least, understand how the person is doing the work and ask sensible questions**,” Kevin McConway advised.

**BONUS**

The problem of **data falsification** came up in some of the conversations. While statisticians usually address it from the point of view of data manipulated in scientific research, we can adapt some of their methods to our needs. In the end, we are all using data.

**Benford’s law** is one method for detecting ‘dodgy’ figures. The Mexican journalist Diego Valle-Jones used it to test homicide data from one official body that diverged from other official sources. And Ben Goldacre wrote about a macroeconomic report on the 27 EU countries that employed it to detect fraud in accounting data, finding that Greece “shows the greatest deviation from Benford’s law among all euro states.”
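Benford’s law says that in many naturally occurring datasets the first significant digit d appears with probability log10(1 + 1/d), so 1 leads about 30% of the time and 9 under 5%. A minimal sketch of the comparison (the function names are mine):

```python
import math
from collections import Counter

def first_digit(x):
    """First significant digit of a positive number, via scientific notation."""
    return int(f"{abs(x):e}"[0])

def benford_deviation(numbers):
    """Largest gap between observed first-digit frequencies and Benford's law."""
    counts = Counter(first_digit(x) for x in numbers)
    n = len(numbers)
    return max(abs(counts.get(d, 0) / n - math.log10(1 + 1 / d))
               for d in range(1, 10))
```

A large deviation on a large dataset is a lead, not proof: Benford’s law only applies to figures spanning several orders of magnitude, so a poor fit can also mean the data simply isn’t Benford-shaped.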

In this paper, researchers applied **terminal digit analyses, variance tests and multivariate associations** to detect data fabrication. And in this other, the authors determine “**the probability that an anomalous pattern in question may have occurred** by chance in the dataset” to spot false data.

A post about the ‘strange’ number of deaths in car accidents by Carlos Gil Bellosta also made me wonder whether the **Poisson distribution** can detect manipulated data, and he gave me a ‘statistical answer’:

Maybe.

“My post referred to the low variance in the figures,” which is called **‘infradispersion’** (or underdispersion), he explained in an email conversation. The typical situation in which it appears is when there is a target.

For instance, if police officers earn a bonus for every 100 fines per week, then you may see many of them reporting 100 or 101 fines and few reporting 98 or 99. So, when you see infradispersion in official data like that, you might think ‘what if there is a target?’, ‘what if someone wants to record less than last year?’, ‘what if…?’.

“It’s difficult to prove or reject, but it gives you food for thought…,” he added; or the starting point of a journalistic investigation.
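Gil Bellosta’s infradispersion check can be sketched with the variance-to-mean ratio, which is roughly 1 for Poisson-like counts and falls well below 1 when figures cluster around a target (the weekly fines are invented for illustration):

```python
import statistics

def dispersion_index(counts):
    """Variance-to-mean ratio: near 1 for Poisson-like counts,
    well below 1 (infradispersion) when figures cluster around a target."""
    return statistics.pvariance(counts) / statistics.mean(counts)

# Invented weekly fines: everything hovers just at or above a 100-fine target
weekly_fines = [100, 101, 100, 102, 100, 101, 100, 100, 101, 100]
print(round(dispersion_index(weekly_fines), 3))  # far below 1: ask why
```

As he says, a low ratio proves nothing on its own, but it is a cheap first test to run before picking up the phone.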