R and automation to deal with a pandemic as a solo data journalist
The first coronavirus map I did for the digital site was back in early February, and it showed how the virus was spreading across the Chinese provinces.
There were no European Centre for Disease Prevention and Control (ECDC), World Health Organisation or Johns Hopkins databases yet, and we gathered the data manually, using the news wires and translating Chinese official websites, Chinese news sites and science blogs.
Five months later, not only have I lost count of the number of maps, charts and analyses I’ve done, but I have also developed a more automated way of doing my job.
This system has allowed me to produce more pieces than before and to handle more efficiently a story that has moved very quickly: a political, health and science story that has pushed us to understand complex terminology, to learn new statistical methods and to explore new ways of communicating it. A data story told while the data itself is still being gathered, cleaned, standardised and quality-checked.
The two key elements of this process are the R programming language and the automation of repetitive processes.
Paul Bradshaw, who runs the MA in Data Journalism and the MA in Multiplatform and Mobile Journalism at Birmingham City University (and who is my professor), said here that this was a good moment for journalists to learn to code. Knowing a programming language has certainly helped me to be more efficient and to understand the basics of automation.
I did some projects in R before the pandemic, especially those involving large datasets or complex analyses. But I am now using R daily for almost every story, no matter how simple it is. The main reasons are repetition and preparation.
If it is very likely that I will have to run an analysis several times (e.g. calculating excess mortality, or pulling out the numbers since the 50th/100th/1,000th case or death), that is a sign that the time is better spent writing a script. The first analysis might take longer as a result, but it will save time later.
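As a sketch of that kind of reusable script, counting the days since each country passed a threshold might look like the function below. The data frame and column names (`country`, `date`, `cumulative_cases`) are illustrative assumptions, not the schema I actually used.

```r
# Hypothetical input: a data frame `cases` with one row per country
# per day and columns `country`, `date` and `cumulative_cases`.
library(dplyr)

days_since_threshold <- function(cases, threshold = 100) {
  cases %>%
    group_by(country) %>%
    filter(cumulative_cases >= threshold) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(day = row_number() - 1) %>%  # day 0 = first day at or above the threshold
    ungroup()
}
```

Once written, re-running it against tomorrow’s file is a single call, which is where the initial time investment pays off.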
It is worth including in the script not only the analysis and the visualisations but also the cleaning process, even though it might sometimes seem easier to open the file and rename some columns, change the date format, remove empty rows or fix a cell with a strange character. When a file is updated regularly, it will probably keep the same structure every time (that is what common sense says… then surprises happen).
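For illustration, a cleaning step like the one described above can be scripted once and rerun on every update. The file layout and column names below (`Area name`, `Daily cases`, a dd/mm/yyyy `Date` column) are assumptions, not the real schema of any of the files I worked with:

```r
# Hypothetical raw file: untidy column names, dd/mm/yyyy dates, empty rows.
library(dplyr)

clean_daily_file <- function(raw) {
  raw %>%
    rename(area = `Area name`, cases = `Daily cases`) %>%  # rename untidy columns
    mutate(date = as.Date(Date, format = "%d/%m/%Y")) %>%  # fix the date format
    select(-Date) %>%
    filter(!is.na(area), !is.na(cases))                    # drop empty rows
}
```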
Sometimes repetition means no more than changing the country names in a vector and pressing a button to generate a dataset. On many other occasions, that repetition involves more work.
An analysis of the situation in Spain has some parts in common with another one of Italy, Germany or the US, and, therefore, there are lines of code that can be copied and pasted. But after that, a closer look at the data, a different comparison or new sources add interesting new lines in each story.
There is also an element of preparation and speed. If you know that new data will be published on Tuesday at 9:30 am and you already know the structure of the data, why not write the script beforehand, so you can run the analysis faster and turn it into a story quickly?
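A sketch of that kind of prepared script, written and tested in advance against an older file with the same structure. The column names (`region`, `cases`) and the path are hypothetical:

```r
# The analysis is written in advance; on publication day only the
# path (or URL) of the fresh file changes. `region` and `cases` are
# assumed column names, not a real schema.
run_release_day_analysis <- function(path) {
  new_data <- read.csv(path)
  aggregate(cases ~ region, data = new_data, FUN = sum)
}

# On Tuesday at 9:30 am, point it at the new release, e.g.:
# run_release_day_analysis("https://example.org/data/latest.csv")
```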
Apart from saving time and making my job faster, knowing programming languages has opened the door to some scientific work and to experts’ help.
I was able to reproduce a couple of models, back when we were still talking about flattening the curve, because I understood a Python script. And (mental note) I owe a couple of beers to a statistician friend who helped me with some calculations on a Saturday afternoon: I shared my script on GitHub, and he replied with the answer.
Running analyses in R also reduces the risk of human error and makes the revision process clearer. As the data cannot be “directly touched” (you cannot edit a cell without writing the instruction), there is less risk of changing a number by mistake. And everything is documented, step by step.
Performing analyses by writing scripts helps the revision process. I usually write an explanation of what I am doing and why; so, once I finish, I can go back, review every step and repeat some calculations with a different method to check that I get the same results.
It also helps to hear myself go through these steps out loud and judge the decisions, so I try to explain the process to the reporter I am working with (or to my new colleague in my new office).
And, finally, coding has helped me to understand more complex automation tasks (or at least to know what is possible and to look for people with the expertise).
After creating that first Chinese map, I was asked to update it regularly. In the beginning, it was just a map, then another one, then a line chart… and so on.
I was spending a lot of time updating visualisations, and we were still in the pre-pandemic era, when we had more work than just the coronavirus.
I must also admit that I am not a big fan of monotony, nor of doing the same task over and over again. So quite soon I started to think that if a task is repetitive, there should be a way of automating it.
I broke the problem down and identified potential solutions for each task, based on previous experience with data. But I got stuck on how to remove the human element connecting the visualisations in Flourish with the data.
I didn’t (and still don’t) have the knowledge, but I knew it was possible, and I knew who could help me: a developer who understands data, the newsroom (yes, we are not easy people) and what “time” means for journalists, especially in a story that has moved so fast.
Automating visualisations has saved me dozens of hours that I was able to invest in different analyses and specific stories (all COVID-19 related, as you may guess). But we’ve gone beyond the visualisations.
There are nine official sources for coronavirus data in the UK. Some of them publish it daily; some, weekly. Some, in the morning; some, in the afternoon.
Most of them have now created dashboards, and the data can be retrieved via files, feeds or more complex techniques. Many have changed the way they report, several times: from publishing a figure as plain text in a paragraph, to creating a table, and then a dashboard. And those changes have also affected the structure of the data.
We didn’t expect so many sources or so many changes when we started. And as if that weren’t enough, we were about to add another task to our mega-busy weeks: understanding the geographical-administrative system of the UK.
I naively thought it couldn’t be so complicated. I thought I had seen everything after a local and a general election…
After very long days and weekends, we managed to add a field to our database with what we called “the metadata”: ONS codes, a lookup from local authorities to regions, health boards, NHS codes, latitudes and longitudes, populations… Everything needed for mapping and calculations was now in a single place.
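As a simplified sketch of how such a metadata table gets used (the column names below, such as `ons_code` and `population`, stand in for the real schema), joining it to the daily figures makes per-capita calculations a one-liner:

```r
# Hypothetical columns: `ons_code` links the daily figures to the
# metadata lookup; `population` enables rates per 100,000 people.
library(dplyr)

enrich_with_metadata <- function(daily, metadata) {
  daily %>%
    left_join(metadata, by = "ons_code") %>%
    mutate(rate_per_100k = cases / population * 1e5)
}
```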
When we started with the automation, the priority was feeding the visualisations. But having everything clean and structured in a single JSON file has been incredibly useful for performing analyses, and it has been used by other departments in the newsroom.
The story isn’t over, and it is very likely that the database will grow and change again. We still have work to do, problems to fix, things to learn from mistakes, and more good and bad decisions to make.
Decisions sometimes taken without being fully aware of their impact and long-term repercussions, like the idea of automating a Chinese map.