**Statistics for journalists: A list of descriptive methods and a brief outline of inferential and probability**

**Warning**: This is not an attempt at listing all the statistical methods, not even all of those useful for journalists. It is just my list, which was hidden in a ‘dirty’ and ‘messy’ Word Document.

I have been building this catalogue since June, when I started the challenging (and sometimes frustrating) adventure of exploring statistical analysis as part of my MA in Data Journalism at Birmingham City University.

This won’t be my final ‘personal’ list, because I am planning to work as a data journalist for some more years. At least, until winning the lottery (I know what statisticians say about it, but what if…).

Hopefully, you will find it useful and can help to expand it.

**Which statistics?**

First things first. The structure.

My brain needs to visualise the primary structure of the subject before storing the knowledge, but I jumped into statistics as if into a swimming pool full of balls.

To reduce my brain’s anxiety, I looked for the chapters into which statistics is divided, to be able to organise the balls.

The standard categories are descriptive and inferential, and they have a ‘surname’ depending on the number of variables involved.

**Descriptive Statistics — Univariate**

These are the most used in newsrooms. They help us to summarise our data, to describe them.

**1.** **Which average?**

Mean, median, and mode… Each of them has its caveats. The median is better than the mean if there are outliers, but taking the middle value and ignoring the extremes sounds like missing information (if this reminds you of an adolescent nightmare, check this and this).

In some cases, a better middle measure is the **geometric mean**, which is also helpful for comparing ratios on different scales: for instance, in a story about the best books for the summer according to several rankings. I also used it in this descriptive analysis of the West Midlands’ funding, in which neither the mean nor the median convinced me as a fair measure of central tendency.

Here and here are two good explanations about it, and they also include the **harmonic mean**, an interesting one if your idea involves travel times.

The last one that I learned was the **trimmed mean** (also called the truncated mean), which is useful for dealing with extremely skewed distributions.
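To see how these averages behave around an outlier, here is a minimal Python sketch. The figures are made up, and `trimmed_mean` is a hypothetical helper written for this example (Python's `statistics` module has no built-in one); it simply drops a fixed share of values from each end before averaging.

```python
from statistics import geometric_mean, harmonic_mean, mean, median

# Made-up figures: nine ordinary values and one extreme outlier.
values = [10, 12, 12, 13, 14, 15, 15, 16, 18, 200]


def trimmed_mean(data, proportion=0.1):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(data)
    k = int(len(ordered) * proportion)  # how many values to drop per side
    return mean(ordered[k:len(ordered) - k])


print("mean:", mean(values))                # pulled up by the outlier
print("median:", median(values))            # ignores how extreme the outlier is
print("geometric mean:", round(geometric_mean(values), 2))
print("trimmed mean (10%):", trimmed_mean(values))

# Harmonic mean for travel: 40 km/h out and 60 km/h back over the same road.
print("average speed:", harmonic_mean([40, 60]))
```

With these numbers the geometric mean lands between the median and the mean, which is the sort of compromise described above. It only works with positive values.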

**2.** **Standard deviation, for what?**

It took me a while to understand why a journalist would ever want to calculate the standard deviation (SD). Although this measure is not usually included in the story, it gives you a sense of the variability. **How spread out your data is**. (See chapter three and this website).

However, just as the median is more robust than the mean, a more robust way to see the spread is **the interquartile range (IQR).** That is the third quartile minus the first quartile, so it is barely affected by outliers.

The SD is also essential for understanding other concepts and methods, such as the empirical rule and confidence intervals, so it is worth not skipping.
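A small sketch, with made-up figures, of how one extreme value changes the SD but hardly moves the IQR. It assumes the sample standard deviation (`statistics.stdev`) and the "inclusive" quartile method; other tools use slightly different quartile rules, so their IQR may differ a little.

```python
from statistics import quantiles, stdev


def iqr(data):
    """Third quartile minus first quartile."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1


# Made-up figures, with and without one extreme value.
typical = [10, 12, 12, 13, 14, 15, 15, 16, 18]
with_outlier = typical + [200]

for label, data in (("typical", typical), ("with outlier", with_outlier)):
    print(label, "SD:", round(stdev(data), 1), "IQR:", iqr(data))
```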

**3.** **Percentiles and quantiles**

They are also useful for describing distributions. For instance, if your data is about wages in your country, the 25th and 75th percentiles may mark the lower and upper boundaries of the middle class.
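As a rough illustration of the wages example, the sketch below finds the 25th and 75th percentiles of a made-up list of salaries and keeps the values in between. The salaries and the "middle class" cut-offs are assumptions for the example, not an official definition.

```python
from statistics import quantiles

# Made-up annual wages, in thousands.
wages = [14, 16, 18, 20, 22, 24, 26, 28, 30, 34, 38, 45, 60, 90, 250]

# n=100 returns the 99 cut points between percentiles 1 and 99.
cut_points = quantiles(wages, n=100, method="inclusive")
p25, p75 = cut_points[24], cut_points[74]

middle = [wage for wage in wages if p25 <= wage <= p75]
print("25th percentile:", p25)
print("75th percentile:", p75)
print("wages in the middle half:", middle)
```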

**4.** **Percentages and proportions**

Old friends. They help us to put figures into context and to compare sizes.

However, there are some warnings with percentages.

One:

Second:

When reporting risk as a percentage, **transform the relative risk into absolute risk and report it as a natural frequency**.

As an example:

WRONG WAY: Eating PANTUFLAS increases the risk of breast cancer by 50%.

RIGHT WAY: Eating PANTUFLAS increases the risk of breast cancer from 12 women in 100 to 18 women in 100.

REASON: 50% is the relative increase in risk (a relative risk of 1.5). The absolute risk of having breast cancer for women is 12%. So, the risk of breast cancer if eating PANTUFLAS is 0.12 × 1.5 = 0.18, or 18 in 100.
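The same arithmetic as a short Python sketch. Both function names are hypothetical helpers for this example; the inputs are the 12% baseline and the 50% relative increase from the PANTUFLAS case.

```python
def absolute_risk(baseline_risk, relative_increase):
    """Risk after applying a relative increase (0.50 means '50% higher')."""
    return baseline_risk * (1 + relative_increase)


def as_natural_frequency(risk, out_of=100):
    """Express a risk as 'X in 100' rather than as a percentage."""
    return f"{round(risk * out_of)} in {out_of}"


baseline = 0.12                          # 12 women in 100
exposed = absolute_risk(baseline, 0.50)  # 0.12 x 1.5

print("without PANTUFLAS:", as_natural_frequency(baseline))
print("with PANTUFLAS:", as_natural_frequency(exposed))
```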

Third:

Percentages are evil, and **they can exaggerate**. For instance, an increase from 2 to 5 is 150%, while one from 50 to 72 is only 44%.

**Elizabeth McLaren**, a statistician from the Office for National Statistics (ONS), told me:
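A quick sketch of that comparison, printing the absolute change next to the percentage so the small-number effect is visible. `percent_change` is a hypothetical helper for this example.

```python
def percent_change(old, new):
    """Relative change from old to new, as a percentage of old."""
    return (new - old) / old * 100


for old, new in ((2, 5), (50, 72)):
    print(f"{old} to {new}: +{new - old} in absolute terms, "
          f"{percent_change(old, new):.0f}% in relative terms")
```

The jump of 3 looks far more dramatic than the jump of 22 once both are turned into percentages.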

Small numbers are a problem for us for topics like infant deaths and stillbirths.

There is a standard notation to indicate when rates are unreliable (u symbol in the ONS and n symbol in Eurostat) but I suspect this isn’t understood by journalists.

The BBC Trust Impartiality Report (2016) also says:

Sometimes the short-term change in the numbers is not as statistically important as we assume it to be and an emphasis on longer-term trends might be more useful in helping audiences to interpret the figures and understand their implications. It is important to be clear when things have in fact not changed significantly.

The ONS usually **aggregates several years of data**, “but people do like to have figures for individual years, even if not robust,” added McLaren.

I also found **funnel plots** (Ben Goldacre, 2011) to be a way of controlling for the random variation (add this concept to your vocabulary) of small populations. Here is the story where I used one, and I published my code here.
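The idea behind a funnel plot can be sketched without drawing anything: for each area, work out the range of rates that chance alone could plausibly produce given its population, and see whether the observed rate falls outside it. The sketch below is not the code linked above; it assumes a simple normal approximation around the overall rate (which is crude for very small counts), and the areas and figures are made up.

```python
from math import sqrt


def funnel_limits(overall_rate, population, z=1.96):
    """Range of rates that chance alone could plausibly produce (about 95%)."""
    se = sqrt(overall_rate * (1 - overall_rate) / population)
    return max(0.0, overall_rate - z * se), min(1.0, overall_rate + z * se)


def is_unusual(events, population, overall_rate):
    """True when the observed rate falls outside the funnel limits."""
    low, high = funnel_limits(overall_rate, population)
    return not low <= events / population <= high


# Made-up areas: (events, population).
areas = {"Small town": (3, 200), "Mid town": (150, 10_000), "Big city": (700, 100_000)}
overall = sum(e for e, _ in areas.values()) / sum(p for _, p in areas.values())

for name, (events, population) in areas.items():
    low, high = funnel_limits(overall, population)
    flag = "outside" if is_unusual(events, population, overall) else "inside"
    print(f"{name}: rate {events / population:.4f}, "
          f"limits {low:.4f} to {high:.4f} ({flag})")
```

Small town and Mid town have the same rate (1.5%), but only Mid town falls outside its limits: with 200 people, three events are well within what random variation can produce.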

**5.** **Ratios, rates, and per capita**

As with percentages, we often use them to normalise figures and to compare values among groups of different sizes.

But I once came across a dataset where I needed to calculate a ‘**relative per capita**.’ Let me go into details because that was something new and useful.

The data was about each EU country’s share of gas emissions. So, logically, Germany had a higher percentage than Latvia.

Hence, I divided the percentage by the population of each country, ordered the results from smallest to largest, and divided each one by the smallest. The result gave me how many times as much each country polluted compared with the one that pollutes the least.

As an example, I got that Romania emitted 1.04 times as much as Latvia (or 4% more than Latvia), and Luxembourg emitted 4.95 times as much as Latvia (or 3.95 times more than Latvia, or 395% more than Latvia).

*Need help with ‘times more’ and ‘times as’? Look at this chapter.
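The steps of the ‘relative per capita’ can be sketched as follows. The countries, shares and populations are made up (they are not the EU figures from the story), and `percent_more` is a hypothetical helper that turns ‘times as much’ into ‘% more’.

```python
# Made-up data: country -> (share of total emissions in %, population in millions).
countries = {"A": (20.0, 80.0), "B": (0.3, 2.0), "C": (0.5, 0.6)}

# Step 1: share divided by population.
per_capita = {name: share / pop for name, (share, pop) in countries.items()}

# Step 2: divide every value by the smallest one.
smallest = min(per_capita.values())
relative = {name: value / smallest for name, value in per_capita.items()}


def percent_more(times_as_much):
    """1.04 times as much is 4% more; 4.95 times as much is 395% more."""
    return (times_as_much - 1) * 100


# Step 3: order from smallest to largest and report both wordings.
for name, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f} times as much as the lowest "
          f"({percent_more(ratio):.0f}% more)")
```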

**6.** **Regression to the mean**

These slides from Maarten Lambrechts are worth reviewing. They are written in journalistic language, like this example about regression to the mean.

**Descriptive statistics bivariate and multivariate**

These methods include two or several variables. They are used to explain the relationships between the variables of our data.

**7.** **Correlation**

I did this correlation analysis for a BBC story, and one of the keys was knowing **when we can talk about a strong correlation**.

Some authors consider a correlation strong when it is beyond ±.6 and very strong beyond ±.8. However, in this case, strong means beyond ±.7, while in this other it is beyond ±.5.

The second key was that “**correlation does not mean causation**.” And the third one was the **variation of small samples**. For another project I ran cor.test in R with only seven observations and, although the correlation coefficient was around .5, the confidence interval went from -.3 to .9.
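To get a feel for how much a correlation can wobble with few observations, this sketch computes an approximate 95% confidence interval from the coefficient and the sample size alone, using the Fisher z-transformation (the approach R's `cor.test` documents for Pearson correlations). The function name is a hypothetical helper, and the inputs are rounded illustrations rather than the data from the project mentioned above.

```python
from math import atanh, sqrt, tanh


def correlation_interval(r, n, z=1.96):
    """Approximate 95% confidence interval for a Pearson correlation (n > 3)."""
    centre = atanh(r)             # Fisher z-transform of r
    half_width = z / sqrt(n - 3)  # its standard error is 1 / sqrt(n - 3)
    return tanh(centre - half_width), tanh(centre + half_width)


for n in (7, 70):
    low, high = correlation_interval(0.5, n)
    print(f"r = 0.5 with {n} observations: from {low:.2f} to {high:.2f}")
```

With seven observations the interval runs from roughly -.4 to .9, so it includes zero and the data cannot rule out that there is no relationship at all. With seventy observations the same coefficient gives a much narrower interval that stays above zero.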

**8.** **Regression analysis**

Charles Wheelan explained in Naked Statistics (2012):

Regression analysis allows us to quantify the relationship between a particular variable and an outcome that we care about while controlling for other factors.

But it also works for predictions, a common method in machine learning.

I carried out my first regression analysis a month or so ago, and I published my “step by step” and the explanation to interpret the outcome here.

I did a linear regression and a multiple linear regression, which uses several variables to explain the dependent one (in my case, the age at which young people leave their parents’ home).

But among the basic regressions, we also find the logistic and the exponential ones, and you have to decide which model best fits your data.
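For the simplest case, one explanatory variable and a straight line, the fit can be written out by hand. This is only a sketch of ordinary least squares with made-up numbers (a hypothetical youth unemployment rate against a hypothetical average age of leaving home); it is not the analysis linked above, and it does not cover multiple, logistic or exponential regression.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up data: youth unemployment (%) and average age of leaving home.
unemployment = [10, 15, 20, 25, 30]
age_leaving = [22.0, 23.5, 24.5, 26.5, 27.5]

slope, intercept = fit_line(unemployment, age_leaving)
print(f"each extra point of unemployment adds about {slope:.2f} years")
print(f"predicted age at 40% unemployment: {intercept + slope * 40:.1f}")
```

The last line also shows the prediction use: it extrapolates beyond the observed range, which is exactly where a straight line deserves the least trust.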

**Inferential statistics**

They are not as common as the descriptive methods in the newsroom, but we tend to report their outcome when we deal with surveys, polls and many scientific papers.

Inferential statistics are used to **make predictions and/or inferences from a sample whose results can be generalised to the population**.

I covered this chapter of the statistics in two other posts:

Although I have calculated only a few of these methods myself, I found it necessary to **understand them and the jargon to interpret their results in our stories, to consider some of their caveats before reporting them, and to be able to communicate with the statistics community when looking for help** in forums, reports, websites or in person.

Some of the recurrent methods I have come across are: hypothesis testing, confidence intervals, p-values (also applied to regressions), Type I and II errors (or false positives and false negatives), Bonferroni corrections, the central limit theorem, statistical significance, odds ratios, margin of error, simple distributions and sampling distributions, and bootstrapping.
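Of the methods in this list, the margin of error is probably the one that appears most often in poll stories. As a minimal sketch, assuming a simple random sample and the usual normal approximation (real polls apply weighting that this ignores), it can be computed like this. The poll figures are made up.

```python
from math import sqrt


def margin_of_error(proportion, sample_size, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * sqrt(proportion * (1 - proportion) / sample_size)


# Made-up poll: 1,000 respondents, 50% support.
support, respondents = 0.50, 1000
moe = margin_of_error(support, respondents)

print(f"margin of error: +/- {moe * 100:.1f} points")
print(f"confidence interval: {(support - moe) * 100:.1f}% to {(support + moe) * 100:.1f}%")
```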

**Probability**

Probability is not part of the statistics discipline, but we use it in our data stories, for instance, when **communicating polls or death rates**.

An introductory course in probability would refresh the basic calculations, the law of large numbers, Bayes’ theorem, and a bit of Bayesian inference (prior and posterior probability), also used in machine learning.

It inevitably includes a chapter about uncertainty, one of our main problems. As **Nate Silver**, editor of FiveThirtyEight, wrote in this article about the media probability problem, **“a probability forecast is an expression of uncertainty.”**

People associate numbers with precision, so using numbers to express uncertainty in the form of probabilities might not be intuitive. (…) Also, both probabilities and polls are usually listed as percentages, so people can confuse one for the other — they might mistake a forecast showing Clinton with a 70 percent chance of winning as meaning she has a 70–30 polling lead over Trump, which would put her on her way to a historic, 40-point blowout.

Moving a step forward, you may end up in probability distributions. The most common ones that I have seen are the normal, binomial, Bernoulli and Poisson.
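As an example of one of these distributions, the Poisson gives the probability of seeing a certain number of events when you only know the average rate. The sketch below uses a made-up scoring rate for a football team; it is an illustration of the formula, not any outlet's forecasting model.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k events when the average is `rate`."""
    return exp(-rate) * rate ** k / factorial(k)


goals_per_match = 1.4  # made-up average for a hypothetical team

for goals in range(5):
    print(f"P({goals} goals) = {poisson_pmf(goals, goals_per_match):.3f}")

p_scores = 1 - poisson_pmf(0, goals_per_match)
print(f"P(at least one goal) = {p_scores:.3f}")
```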

But only a **few media outlets use probabilities to “produce” stories**.

FiveThirtyEight used Poisson for its World Cup predictions, as did El País for its forecast. The Pudding explained the Birthday Paradox with an interactive piece, and they addressed the likelihood of automation by career. The Upshot calculated the likelihood of a personal record when switching to different shoes. And La Nación estimated how much it would cost to complete a football sticker album, a recurrent topic on other websites.
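Two of these stories rest on calculations short enough to sketch. The first is the Birthday Paradox; the second is the sticker album, treated here as the classic "coupon collector" problem. Both functions are simplified illustrations, not the methods of the outlets named above: the birthday one assumes 365 equally likely birthdays, and the album one assumes stickers are bought one at a time, all equally likely, with no swapping. The album size is a made-up figure.

```python
def shared_birthday_probability(people, days=365):
    """Chance that at least two people in the group share a birthday."""
    p_all_different = 1.0
    for k in range(people):
        p_all_different *= (days - k) / days
    return 1 - p_all_different


def expected_stickers(album_size):
    """Average number of random stickers needed to complete the album."""
    return sum(album_size / remaining for remaining in range(1, album_size + 1))


print(f"23 people: {shared_birthday_probability(23):.1%} chance of a shared birthday")

album = 600  # made-up album size
print(f"{album}-sticker album: about {expected_stickers(album):.0f} stickers on average")
```

Multiplying the expected number of stickers by the price per sticker gives a first estimate of the cost; swapping duplicates with friends would lower it considerably.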

And there is a common element in all of them: **the presence of a non-journalist professional or a journalist with statistics background involved in those stories**.

Given our basic statistical literacy, moving freely in the world of probability might be a long-term goal. **But having an elementary knowledge will facilitate the communication with people who can help you in the meantime**.

That was my “modus operandi” for my last piece, in which we used methods from the descriptive, inferential and probability chapters. Hence, don’t consider them as sealed boxes.

*Any mistakes? Please let me know. Comments are welcome, as well as new methods to expand the list.*