Inside the OpenLux investigation

Conversations with Data: #81

Welcome to the latest Conversations with Data newsletter brought to you by our sponsor, a web hosting company that was established in Iceland to provide safe harbour for freedom of speech, free press and whistle-blower projects. The sponsor offers members a special package with free shared hosting webspace and a free domain name. To access the 5% discount on all their products, use the promo code OpenLux.

Shrouded in secrecy, Luxembourg is a global financial hub known as the crossroads of Europe. After the LuxLeaks investigation in 2014 revealed its close ties to offshoring and tax evasion, journalists at Le Monde sought to find out if that was still the case.

This led to the OpenLux investigation, a cross-border collaboration between Le Monde, OCCRP and numerous other news organisations around the world. Instead of relying on leaked documents, the investigation began by using Python to scrape an open database provided by the Luxembourg government.

To find out more, we spoke with OCCRP editor Antonio Baquero and Le Monde journalist Maxime Vaudano about the hidden side of the Luxembourg offshore industry.

They explained the data collection and analysis involved along with the benefits of using OCCRP's Aleph, an investigative data platform that helps reporters follow the money.

You can listen to the entire podcast on Spotify, SoundCloud, Apple Podcasts or Google Podcasts. Alternatively, read the edited Q&A with Antonio Baquero and Maxime Vaudano below.

What we asked

Talk to us about OpenLux and how the investigation began.

Maxime: Luxembourg is not completely new for journalists interested in tax avoidance. There was a big scandal in 2014, LuxLeaks, based on leaked documents on tax rulings from the accounting firm PricewaterhouseCoopers (PwC). We wanted to know whether this scandal and the regulation that came afterwards had any impact on Luxembourg. Was it still a tax haven or had it become something else?

Unlike LuxLeaks, this investigation was not based on a leak. Instead, it relied on open source data. This came about because a few years ago the European Union voted for a new regulation that asked all EU members, including Luxembourg, to publish an online register saying who really owns the companies in each country. We call this person the ultimate beneficial owner. It's a great way to bring transparency to the corporate world. The register was published online in 2019 and we simply scraped the data from it.

We did some data analysis, gathered a dozen partners from different countries and tried to figure out what we could do with the data. What did all those names mean? What did this mean for the Luxembourg economy? Could we prove that Luxembourg was still a tax haven or that it had moved on to something else? And that was pretty much the starting point.

OpenLux was a cross-border investigation. How did the different newsrooms work together?

Antonio: From the beginning, the data belonged to Le Monde. At OCCRP, we were focused on criminality and we found many interesting profiles in the data. We shared all of them with Le Monde and with the other media partners. We invited any partner who saw an interesting topic or name to create a group and start a project together. It was amazing because even in cases where Le Monde, or other media, were not interested in an aspect of a story, they helped us.

There were some cases where OCCRP wouldn't publish a story, but we still helped the other media partners with it. At the end of the project, personally, I was exhausted because it took a year and it wasn't an easy project. But I was really satisfied not only by the final result but by the path we took to deliver that result. We built a project based on friendship and true cooperation among journalists from all over the world.

Talk to us about the data scraping process.

Maxime: We had a developer at Le Monde who did most of the scraping process. He used Python to scrape the register. It's quite easy to scrape compared to other websites or registers because there are no real technical difficulties. Each company in Luxembourg has a different number. So you just type the number "B123" or "B124" and you can obtain information on that company. What we did was automate this with code, so it went to the register website and gathered the information for each number.
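As a rough illustration of the idea Maxime describes, the sketch below walks through sequential register numbers and keeps the ones that return a record. It is not Le Monde's scraper: the function names, the `fetch` callable and the sample records are all hypothetical, and an in-memory dictionary stands in for the real website so the example runs offline.

```python
from typing import Callable, Dict, Iterator, Optional


def registration_numbers(start: int, stop: int, prefix: str = "B") -> Iterator[str]:
    """Yield sequential register numbers such as "B123", "B124"."""
    for n in range(start, stop):
        yield f"{prefix}{n}"


def scrape_range(
    start: int,
    stop: int,
    fetch: Callable[[str], Optional[dict]],
) -> Dict[str, dict]:
    """Look up every number in [start, stop) and keep those that return a record."""
    records = {}
    for number in registration_numbers(start, stop):
        # `fetch` is a hypothetical lookup that returns None for unknown numbers.
        record = fetch(number)
        if record is not None:
            records[number] = record
    return records


# Invented sample data standing in for the real register (assumption).
FAKE_REGISTER = {
    "B123": {"name": "Example Holding S.A."},
    "B125": {"name": "Sample Invest S.a r.l."},
}

found = scrape_range(123, 127, FAKE_REGISTER.get)
print(sorted(found))
```

In a real scraper, `fetch` would make an HTTP request for each number, ideally with pauses between requests and with respect for the site's terms of use.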

Then the challenge was to keep it updated because the register is updated every day. We had to scrape it very regularly to gather new names and to determine when a name was wiped from the register. This was because if someone was no longer a beneficial owner, their name disappeared from the register forever. Being able to scrape it every two or four days made it easy for us to have a history for our investigation. And then there was the big challenge of what to do with all this data.
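One way to build such a history is to keep each scrape as a snapshot and compare consecutive snapshots. The sketch below assumes a snapshot is a simple mapping from company number to a set of owner names; that structure, the function name and the sample names are illustrative assumptions rather than the team's actual setup.

```python
from typing import Dict, List, Set, Tuple

# Assumed snapshot shape: company number -> names of beneficial owners.
Snapshot = Dict[str, Set[str]]
Change = Tuple[str, str]  # (company number, owner name)


def diff_snapshots(previous: Snapshot, current: Snapshot) -> Tuple[List[Change], List[Change]]:
    """Return (added, removed) owner entries between two scrapes."""
    added: List[Change] = []
    removed: List[Change] = []
    for company in sorted(set(previous) | set(current)):
        before = previous.get(company, set())
        after = current.get(company, set())
        added += [(company, name) for name in sorted(after - before)]
        removed += [(company, name) for name in sorted(before - after)]
    return added, removed


# Invented sample data for two scrapes a few days apart.
first_scrape = {"B123": {"Alice Example", "Bob Sample"}, "B124": {"Carol Test"}}
second_scrape = {"B123": {"Alice Example"}, "B124": {"Carol Test", "Dan Placeholder"}}

added, removed = diff_snapshots(first_scrape, second_scrape)
print("added:", added)
print("removed:", removed)
```

Storing the `removed` entries with the date of the scrape preserves names that would otherwise vanish from the live register.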

How did you sort through this massive amount of data?

Antonio: That was the big contribution of OCCRP because we have a tool called Aleph, where you can put in every kind of data that you want. It makes it very easy to organise this data and to cross-check it with other data sets. We put the data in Aleph and with a few manipulations, we were able to cross-check it with other registers in other countries or with names of politically exposed persons. So it makes this process of selecting interesting names very easy compared to going through every name individually. That was the first part of the data storytelling process.
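To give a sense of what cross-checking names involves, here is a minimal sketch that normalises names and compares a list of owners against a watchlist. It does not use or reproduce Aleph, which handles matching in its own way; the functions and the sample names are hypothetical, and real matching would need to handle transliteration, partial names and false positives.

```python
import unicodedata
from typing import Iterable, List


def normalise(name: str) -> str:
    """Lower-case a name, strip accents and punctuation, and sort its words."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.casefold())
    return " ".join(sorted(cleaned.split()))


def cross_check(owners: Iterable[str], watchlist: Iterable[str]) -> List[str]:
    """Return the owner names whose normalised form also appears in the watchlist."""
    wanted = {normalise(name) for name in watchlist}
    return [owner for owner in owners if normalise(owner) in wanted]


# Invented names for illustration only.
register_owners = ["Dupont, Émilie", "John Placeholder", "Ana Ejemplo"]
pep_list = ["Emilie DUPONT", "Someone Else"]

matches = cross_check(register_owners, pep_list)
print(matches)
```

Exact matching on normalised names is deliberately strict. Any hit is only a lead that still needs to be verified by a reporter.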

Tell us how Le Monde handled its approach to finding relevant stories from this data.

Maxime: As a data journalist, I'm very used to working with a quantitative approach. At Le Monde, we found out that there were many very rich families from France in the data set. Instead of focussing on one family or two families, we invested a lot of time trying to map all the assets of the richest families. We were able to prove that 37 of the 50 wealthiest families in France were in the data set and owned assets in Luxembourg.

That was more striking for us to say that most rich French families have their assets in Luxembourg than focussing on one and doing a name and shame approach. We always had to give names in order for the public, for the reader to be able to understand what it's about. But it's more striking to have big numbers and to be able to determine whether there is a trend. This data was so rich that we could do this.

How did data visualisation come into play for this investigation?

Maxime: We didn't have a big focus on data visualisation for Le Monde's OpenLux publication, but during the process of working and investigating the stories, it was important. For example, we used a mind mapping tool to be able to reconstruct the structures of the company because it's usually very complicated. The structures were very complex with subsidiaries in numerous countries. We were able to rebuild the structure from zero by looking at the documents. Using this data visualisation tool helped us understand what it's about. Because it's so complicated to digest for the public, it was not worth it to publish the raw visualisation. But it's still very useful for us to be able to have a clear mind about what we are looking at.

Antonio: For most of the stories, we wrote articles. But there's one story where we decided to use a visualisation. The story aimed to explain how heads of state from all over the world used Luxembourg to hold real estate across Europe. We created an interactive map so the reader can see a map of Europe, the properties and who owns each one.

What did you learn from this investigation?

Antonio: For me, it was the first big investigative project that I worked on as a coordinator at OCCRP. I learned how important it is to be fair and to be frank when you try to cooperate with others. When you work with others on a project like this, you need to commit to sharing everything. I also learned how important it is to ask for help. This is not a competition amongst journalists to see who is the most intelligent. This is a collaboration and if you don't know something, ask for help.

I also learned that it is important not only to have journalists in other countries but also to have journalists with different skills, especially relating to data, financial documents or company records. It is really essential to realise you need others to make your work the best journalistic work in the world.

Latest from

Our latest long read article features case studies examining how a team of researchers, journalists and students used digital recipes to delve into COVID-19 conspiracy content sold on Amazon, the world's largest online retailer. Written by Jonathan Gray, Marc Tuters, Liliana Bounegru and Thais Lobo, this walk-through piece serves as a useful guide for researchers and journalists seeking to replicate a similar collaborative investigation using digital methods. Read the full article here.

Drones aren't just for photojournalists. Data journalists can also take advantage of them for their stories. Monika Sengul-Jones explores how to boost your storytelling with this technology, as well as the potential pitfalls of using it. She also provides a guide for journalists getting started. Read the full article here.

Data journalism training opportunity

Are you a freelance journalist reporting on development issues? Do you want to gain data journalism skills? Then the data bootcamp for freelancers is for you! Organised by the Freelance Journalism Assembly, this interactive 20-hour, two-week virtual training will teach you how to find, clean and analyse data. You'll also learn how to create data storytelling formats. Apply for one of the 25 scholarships. Deadline: 10 September, 24:00 CEST.

Our next conversation

Our next conversation will feature data journalist Clayton Aldern from Grist, a nonprofit, independent media organisation dedicated to telling stories of climate solutions and a just future. Clayton is a writer and data scientist currently working at the intersection of climate change, environmental degradation, neuroscience, and mental health. We will discuss how to best cover environmental issues and climate change through data journalism.

As always, don't forget to let us know who you would like us to feature in our future editions. You can also read all of our past editions here or subscribe to the newsletter here.


Tara from the EJC data team,

bringing you, supported by Google News Initiative.

PS. Are you interested in supporting this newsletter or podcast? Get in touch to discuss sponsorship opportunities.
