Summary: In this post, Giles outlines a simple approach that can be used to identify potential ‘data gaps’ using R. Data gaps are assumed to denote the difference between the data required to achieve an outcome and the currently data available. To demonstrate how the methodology might be applied, the availability of data for Tanzania is compared against a set of geographic and economic peers.
Packages: wbstats, tidyverse and kableExtra.
Data: Data used in this post have been sourced using the wbstats R package.
Introduction
I first came across the idea of ‘data gap mapping’ when a client engaged our team to develop a composite index to improve how they monitored national economic, political and social risks.
In most cases a ‘data gap mapping’ tries to compare what data is needed with what’s available. This can to reveal not only where additional data is needed, but also the shortcomings of the data available to us – whether in quality, quantity, frequency, or granularity. Aside from this helping to highlight potential blind spots that might emerge in our analysis, it can also be helpful for highlighting where extra effort might be needed to fill any data gaps that are identified.
Setting aside the controversies surrounding the use (and abuse)1 of composite indices, for us the gaps analysis required comparing the characteristics of data available to us2 with what’s needed to build the client’s ‘ideal’ index. Unfortunately, because we needed to complete the analysis before knowing what the index would look like, we needed to find a reasonable substitute.
An Outline of the Approach
Because we had a sense of the themes the index should cover, we could determine the data available to us by reviewing national and international sources relevant to each theme. For instance, knowing that measures of political and democratic stability were likely to be important, we reviewed every national and international data source for indicators that might be suitable for the index. We then repeated this across each theme to produce a categorized database of indicators and corresponding metadata.
To proxy what data we needed to build the index, we decided to compare the availability of data for the country we were building the index on with its peers. This approach had the advantage of highlighting themes where data may be particularly scarce when designing and building the index, even when taking account of the statistical capacity of the focus country.
If the availability of data for our focus country is lower than its peers, it might indicate there are limited options available for covering the theme in an index or that more time will be needed to data sources to fill these gaps. Knowing where data is scarce would also be helpful during stakeholder consultations so requests for input can be focused on themes where data might be hard to come by.
To illustrate the approach, we’ll use Tanzania as our focus country.3
Data We Have
To determine the data available to us, we manually searched for databases focusing on indicators relevant to measuring and/or proxying political, economic and social risks. For instance, the International Monetary Fund’s International Financial Statistics database might be used as a source for economic and financial statistics. Whereas UNdata might be used as a source for social statistics.
For each data source, we then reviewed the availability of indicators for the focus country and its peers. To get a sense of the quality and quantity of each indicator we also collected metadata, such as the name of the indicator, its publication frequency, the period over which data is available and its primary source.
Although our full assessment covered 250+ databases, we’ll keep the example simple by focusing on data that can be accessed via the World Banks’s API (via the wbstats package). Because the wb_data() function can sometimes fail when an indicator isn’t available we will limit analysis to the World Development Indicators (source_id=2):
Loading packages and selecting sample data
#load relevant packages library(wbstats) library(tidyverse) #get indicator metadata #(We've focused on the World Development Indicators as they're more reliably available) ref_indicators<-wb_indicators() |> filter(source_id==2) #create a list of indicator IDs by topic #(Note: Some indicators are assigned to multiple topics) ref_dta_topics<-ref_indicators |> select(indicator_id, topics) |> unnest(topics) |> rename(topic=value) #select a random sample of indicators for each topic #(set seed for reproducibility) set.seed(321) #select sample of indicators across topics ref_selected_indicators<-ref_dta_topics |> group_by(topic) |> slice_sample(n=3)
Data We Need
As we didn’t know what the index should look like, we decided to proxy this based on the availability of data for a group of peers. Although our selection of peers was quite involved in the full project, we’ll keep it simple here and naively assume the statistical peers for Tanzania are countries in the same regional and income groupings:
Specifying peers and downloading data
#specify focus country iso3c code ref_fcs_iso<-"TZA" #select countries ref_peer_isos<-wb_countries() |> filter(admin_region_iso3c=="SSA", income_level_iso3c== "LMC") #download data for selected indicators dta_wb<-wb_data(country = ref_peer_isos$iso3c, indicator=ref_selected_indicators$indicator_id, return_wide=FALSE) #view peers kableExtra::kable(ref_peer_isos |> select(country))
Selected Peers:
country |
---|
Angola |
Benin |
Cote d’Ivoire |
Cameroon |
Congo, Rep. |
Comoros |
Cabo Verde |
Ghana |
Guinea |
Kenya |
Lesotho |
Mauritania |
Nigeria |
Senegal |
Sao Tome and Principe |
Eswatini |
Tanzania |
Zambia |
Zimbabwe |
Putting aside whether the countries listed can be considered ‘peers’ for Tanzania4, the next question is how to sensibly compare the availability of data across countries. For instance:
- Should we compare Tanzania against the best performing peer, the median or is the average good enough?
- Should the frequency of publication be considered?
- Should more recent data be prioritized?
- etc…
As you might have guessed, I’m not going to give a simple answer here because there isn’t one. But, because our assessment covered 14 thousand indicators across 250+ databases, we took a relatively simple approach by dropping all indicators without values for recent periods and comparing the availability of data for our focus country with the peer average.
The code below illustrates the basic idea, by comparing the percentage of missing values for Tanzania compared with its peers for each indicator, ignoring the recency of observations available for an indicator:
Estimating the comparative proportion of missing values
#label peers vs focus country tranform date to integer dta_wb<-dta_wb |> mutate(date=as.integer(date), iso_group= if_else(iso3c==ref_fcs_iso, "Focus","Peer")) #Calculate the % of missing values #Add topic (noting some indicators are assigned multiple topics) sum_na_bechmarking<-left_join(dta_wb,ref_dta_topics) |> group_by(iso_group,topic) |> summarize(pct_na=(sum(is.na(value))/n()) )|> pivot_wider(values_from = pct_na, names_from = iso_group) |> mutate(na_data_gap=round(Focus-Peer,2)) #present largest data gaps using the missing value approach sum_na_bechmarking |> select(topic, na_data_gap) |> arrange(na_data_gap) |> head(5) |> kableExtra::kable()
Potential Data Gaps (missing values approach):
topic | na_data_gap |
---|---|
External Debt | -0.11 |
Climate Change | -0.10 |
Infrastructure | -0.10 |
Urban Development | -0.09 |
Energy & Mining | -0.06 |
Data We’re Missing
From the looks of it, the average availability of data on the topics external debt, climate change and infrastructure are lower for Tanzania for its peers. Suggesting it might be relatively harder to find data on these topics.
Of course, because the primary source for many of the indicators are the statistical agencies of each country, a country level comparison makes more sense. It’s also likely to be more informative to compare the availability of data at the indicator level, before summarizing data gaps by topic to avoid our results being skewed by the number of indicators available for each country.
Another potential issue is that in many cases missing values exist before data for an indicator was collected by Tanzania or its peers. A good example of this is the median lead time to import in days (ID: LP.IMP.DURS.MD), which for Tanzania is available only after 2007 and on an irregular basis. Making a data gap based on the percentage of missing values potentially misleading; as newer data series will have missing values in periods before data was first collected for an indicator.
To address this, the code below calculates the proportion of values available after a value is first available for an indicator. This follows a similar logic to using the percentage of missing values, except that it only penalizes a country’s score once there is evidence data has been collected for an indicator. It also provides a better indicator of how reliably data has been collected after the first observation:
Estimating Data Gaps at the Indicator Level:
#create a set of summary stats for indicators sum_dta_benchmarking_by_indic<-dta_wb |> filter(!is.na(value)) |> group_by(indicator_id,indicator,iso_group,iso3c,country) |> summarize(n_obs=n(), start_year=min(date), end_year=max(date)+1) |> mutate(max_n_obs = end_year-start_year, pct_obs= (n_obs)/max_n_obs) |> #aggregate to iso_group level group_by(indicator_id,indicator,iso_group) |> summarize(avg_pct_obs=mean(pct_obs,na.rm=TRUE)) |> #reshape to compare peers pivot_wider(names_from=iso_group, values_from=avg_pct_obs, values_fill= 0) |> mutate(data_gap=round(Focus-Peer,2)) # present indicators with a 'data gap' above 10 percentage points sum_dta_benchmarking_by_indic |> filter(data_gap <=-0.1) |> arrange(data_gap) |> ungroup() |> select(indicator,data_gap) |> head(5) |> kableExtra::kable()
Potential Data Gaps (% Values Since First Year at Indicator Level):
indicator | data_gap |
---|---|
Industrial design applications, resident, by count | -0.82 |
Agricultural irrigated land (% of total agricultural land) | -0.71 |
Market capitalization of listed domestic companies (current US$) | -0.54 |
Patent applications, residents | -0.53 |
Children in employment, study and work, female (% of female children in employment, ages 7-14) | -0.38 |
At the indicator level, around 1/5 of indicators have a ‘data gap’ above ten percentage points. If this seems like an arbitrary cutoff, it is. The main point is really to highlight the most ‘unusual’ data gaps that are likely to say something about the availability of data. For example, in our full assessment we defined a ‘data gap’ to be two standard deviations lower than the peer average.
It’s also important to keep in mind that there are a range of reasons a ‘data gap’ might emerge in this assessment. For instance, some indicators might be less relevant for Tanzania, so they aren’t collected . Or maybe their collection and reporting methodology doesn’t suit the requirements to be included in the database being surveyed (in this case the World Bank). The assessment also says nothing about the accuracy, timeliness, sensitivity or granularity of the data available to us.
Reflecting this, we chose to to report the data gaps at the topic level. Making it clearer that any ‘data gaps’ are meant to be indicative of where data may be difficult to come by when building the index. The code below demonstrates the basic approach using an indicator’s topic:
Estimating Data Gaps at the Topic Level:
#estimate the average data gap by topic sum_dta_benchmarking_by_topic <- sum_dta_benchmarking_by_indic |> left_join(ref_dta_topics) |> group_by(topic) |> summarize(avg_data_gap=mean(data_gap) |> round(2)) #merge with data gaps estimated via missing values sum_dta_benchmarking_by_topic<-sum_dta_benchmarking_by_topic |> left_join(sum_na_bechmarking |> select(topic,na_data_gap)) #view top ten 'data gaps' by topic sum_dta_benchmarking_by_topic |> arrange(avg_data_gap) |> head(5) |> kableExtra::kable()
Potential Data Gaps (% Values Since First Year at Topic Level):
topic | avg_data_gap | na_data_gap |
---|---|---|
Infrastructure | -0.25 | -0.10 |
Science & Technology | -0.24 | 0.01 |
Agriculture & Rural Development | -0.17 | 0.00 |
Financial Sector | -0.16 | -0.05 |
Social Development | -0.12 | -0.04 |
Based on the topic-level assessment by country and indicator (avg_data_gap), it appears that data gaps are likely to exist for infrastructure, science and technology, and agriculture and rural development.5 Suggesting that extra effort may be needed to source data in these areas or find suitable proxy variables to avoid this becoming a blind spot of the final index.
In our case, this is exactly how we interpreted the results: leading to us devoting significantly more time identifying additional data sources to help fill these gaps and requesting direct input from stakeholders on where to look. But, the results were also useful beyond this: with the identified gaps being a useful input into data advocacy discussions held with policymakers. Due to the analysis indicating to the focus country where effort should be focused to catch up with their peers.
Which is more or less why this blog exists: As identifying data gaps, and lobbying for them to be filled, has formed the backdrop of almost every analytics project that I’ve been involved in. So next time you find yourself staring at a project that would benefit from more data, consider taking our back-of-the-envelope methodology out for a spin (and letting us know how you improved it along the way!).
A note how AI was used: The majority of this post and code were produced by the author. AI tools were used to critique the methodology and suggest alternative approaches for communicating some ideas presented in the post.
Related