The quest, challenges and exciting possibilities of unlocking data.
Over the course of the last six months, my work has revolved around EITI data. I am what you could call an ‘EITI geek’. I have assisted the EITI Secretariat in the long journey towards open data, and this is a short summary of my experience.
Open data …
First, I want to clarify what I mean by open data. As I have come to realise, there are many different interpretations of this concept.
In short, I define open data as information that is both publicly available and accessible.
That information is publicly available is relatively straightforward, though it is worth acknowledging that data can be publicly available even if it is not available on the internet. So data can be considered public, even if you need to contact someone to get it.
To make public data open, you need to provide easy access to it. The steps needed to acquire the data should be as few as possible. Of course the publisher is often interested in finding out who is accessing his or her dataset, but it should be up to the users to decide how much information about themselves they want to leave. Accessibility further means that the data is in file formats that are easily manipulated and compatible with the most common tools for analysis.
Recently the United Nations (UN) published a report identifying data as the “lifeblood of decision-making and the raw material for accountability.” The World Bank Group’s ‘Databank’ is one of many already long-standing extensive databases. Many other international institutions are catching up, making growing amounts of data available.
The Natural Resource Governance Institute (NRGI) recently published a dataset containing information collected from 223 EITI Reports. They used publicly available EITI data and made it accessible by removing the barrier of PDF files. NRGI have already begun using the information to explore what insights can be gained and how to visualise the underlying data.
… at the EITI Secretariat
At the Secretariat a number of different efforts are under way to make more data open. A key improvement is the Summary Data Template. This template is used to collect a wide range of fiscal, legal and contextual data related to the extractive industries per country, including disaggregated numbers for revenues from the extractive companies, classified by cross-country standards.
The objective is to create a dataset that contains the most common and important information, so that users do not have to read through hundreds of pages of reports.
Compiling this dataset is the key function of the ‘Open Data’ team of the International Secretariat.
From my perspective, the role of the national secretariats is to find, analyse, and open up information in a national context and to translate this information into an understandable format for their population. They should further ensure that this information is comparable over time.
Building on the national work, the International Secretariat can introduce an even broader perspective, making the information comparable across countries.
While the use of these summary templates will be a significant step forward, there are pitfalls with such summary information. Country-specific and contextual information may not be captured. National peculiarities might be lost and sometimes also distorted. But by definition, summary data is not meant to capture every aspect and will consequently leave out some information. What matters is making sure that the data included is the most relevant, so that it serves the broadest possible range of uses.
Scraping underneath the surface
Different users have different needs. The needs of an academic are different from those of a journalist or parliamentarian. It is not easy to accommodate all users in a single approach. We must therefore recognise that the International Secretariat, national secretariats and others will use ‘raw’ EITI data in many different ways. It is important to note that while datasets present a great opportunity to share information, they are not substitutes for EITI reports themselves. During my time at the Secretariat, I have scrutinised reports covering just shy of 70 reporting years, for more than ten countries. These different country contexts cannot be thoroughly addressed through summary data alone.
Scraping data from 70+ reports for specific pieces of data might not sound appealing to most people, but I have enjoyed the experience (mostly). For example, when scanning a report for its definition of oil lifting in order to provide accurate production and export volumes, you cannot help but pick up on other interesting information, not to mention differences in approaches and government systems. I must confess that my resource economics course at university did not prepare me for the difficulties I encountered in EITI data compilation during my internship. It did not prepare me for the difficulties in handling revenues collected by and transferred from different government agencies, nor for the numerous epiphanies about just how complicated some fiscal regimes are.
Another thing that data scraping made me realise is how much EITI reports have improved and how both the quality and quantity of data have increased over the years. They have gone from merely focusing on revenue disclosures and resolving discrepancies to including more and more relevant information regarding the sectors, licenses, contracts and ownership. Interestingly enough, the latest reports (covering 2012 & 2013) include information rivalling most of the commercially available market analyses I have seen. You may say that this is precisely what EITI Reports are for, but prior to the adoption of the EITI Standard, the amount and quality of information in most reports did not resemble the comprehensiveness of the newer ones.
Knock-knock – what’s there?
Of course there are challenges in gathering this data. Certain PDF documents are still ‘locked’, so one cannot even copy text or numbers from them. Some contain endless tables of information that cannot be replicated without meticulously punching in each separate number. For such reports it is even more important to overcome the barrier of inaccessibility and create files in which the calculations are available. This way, you can also verify the summary figures in these tables.
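To illustrate what verifying a summary figure could look like once the numbers are out of the PDF, here is a minimal sketch in Python. The function name `check_total`, the tolerance and all the payment figures are my own hypothetical placeholders, not values from any EITI Report.

```python
from decimal import Decimal


def check_total(line_items, reported_total, tolerance=Decimal("0.01")):
    """Recompute a table total from its line items and compare it with
    the total printed in the report (hypothetical helper)."""
    computed = sum(line_items, Decimal("0"))
    difference = computed - reported_total
    return computed, difference, abs(difference) <= tolerance


# Hypothetical company payments, typed in from a report table.
company_payments = [Decimal("1250.40"), Decimal("980.10"), Decimal("310.00")]

# Hypothetical total as printed at the bottom of the same table.
computed, difference, matches = check_total(company_payments, Decimal("2540.50"))
print(computed, difference, matches)
```

Decimal arithmetic is used here rather than floating point so that rounding noise is not mistaken for a discrepancy in the report.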
Another challenge is that some specific information is missing altogether. Numbers such as Gross Domestic Product (GDP), and values for the extractive sector’s share of GDP, are crucial to understanding how important the sector is for a country. Yet, in several instances these are not even mentioned in the reports.
There are also more technical challenges, such as in-kind revenues or payments not being valued correctly because price statistics are not included in the EITI reports. Summary data is supposed to reflect the EITI Report, and most reports do manage to provide average prices of standard benchmarks such as Brent or WTI. But the actual prices used for these transactions, especially for different oil grades, are usually lacking. When valuations of these revenues are not available, EITI’s numbers for government revenues will distort the real picture. At the same time, estimating the value of in-kind revenues by resorting to benchmark prices would distort the picture as well, by using wrong approximations. As my colleague Alex Gordy explained during a discussion on the topic: “We are in the business of resolving discrepancies, not creating them.” The best way to resolve this dilemma would be to have the responsible agency or company report the actual prices used in the transaction (not only the volume).
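A small worked example shows how large the distortion from a benchmark approximation can be. This is only a sketch: the volume and both prices below are invented for illustration, and `in_kind_value` is a hypothetical helper, not part of any EITI tool.

```python
def in_kind_value(volume_barrels, price_per_barrel):
    """Value an in-kind revenue as volume times price (hypothetical helper)."""
    return volume_barrels * price_per_barrel


# All figures are invented for illustration only.
volume = 1_000_000        # barrels received in kind by the government
benchmark_price = 100.0   # average benchmark price, USD per barrel
actual_price = 94.0       # price actually realised for this oil grade

benchmark_value = in_kind_value(volume, benchmark_price)
actual_value = in_kind_value(volume, actual_price)

# Overstatement introduced by valuing at the benchmark price.
gap = benchmark_value - actual_value
gap_share = gap / actual_value

print(f"Benchmark valuation: {benchmark_value:,.0f} USD")
print(f"Actual valuation:    {actual_value:,.0f} USD")
print(f"Overstatement:       {gap:,.0f} USD ({gap_share:.1%})")
```

With these assumed numbers, a price difference of six dollars per barrel overstates the revenue by six million dollars, which is exactly the kind of discrepancy that reporting the actual transaction price would avoid.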
Next: hand over the control pad
Finally, for me, open data is not only about providing information, but about giving users of data as much control over the data as possible.
I am of a computerised generation, and if there is one thing that Wikipedia, Linux and other collectively produced innovations have taught us, it is that people in general wish to improve on information, not dismantle it. Having data open will improve it. This not only holds true for datasets, but also for our presentation (read: visualisation) of them. Instead of locking our data solely behind the bars of graphs, the graphs themselves should also be open to manipulation.
This is what I mean by truly open data.
Christoffer Claussen was an intern at the International Secretariat from February to July 2015. He will be continuing his Master’s in environmental resource and development economics at the University of Oslo.
 IEAG (2015). A World That Counts: Mobilising the data revolution for sustainable development. United Nations Secretary-General’s Independent Expert Advisory Group on a Data Revolution for Sustainable Development (IEAG), http://www.undatarevolution.org/.
 NRGI: Dataset – Unlocking EITI Data for Meaningful Reform, http://www.resourcegovernance.org/publications/dataset-unlocking-eiti-data-meaningful-reform
 EITI: Summary Data Template, https://eiti.org/document/eiti-summary-data-template
 WTI – West Texas Intermediate