To exemplify our approach, we present a use case based on a real project, which will serve as a running example, where data are used in the fight against Neglected Tropical Diseases (NTDs) at the World Health Organization (WHO). NTDs form a group of 21 diseases with different, sometimes very complex characteristics, all having in common that they affect populations typically from economically challenged, rural areas of the world. Depending on the disease (e.g., Chagas disease, Leishmaniasis), transmission routes can vary (e.g., congenital transmission from mother to child, blood transfusion, organ transplantation, insect bites), as can the causes of its worldwide spread (e.g., high presence and reproduction of insects in endemic areas, travel to endemic areas, migration flows, etc.). All this makes the control and eventual elimination and/or eradication of NTDs very challenging. This is a paradigmatic example that requires crossing internal and external data to meet the informational needs of an organization, and one for which our approach fits naturally.
Recently, WHO started building a data infrastructure that enables the collection and comprehensive analysis of NTD-related data (i.e., WISCENTDFootnote 8). In particular, two subsystems have been created: (1) the WHO Integrated Data Platform (WIDP),Footnote 9 powered by District Health Information System 2 (DHIS2),Footnote 10 which supports routine surveillance of case diagnosis and treatment, standardizes NTD data collection in countries, and handles countries' reporting to WHO; and (2) the WHO Integrated Medical Supply System (WIMEDS),Footnote 11 powered by BonitaSoft,Footnote 12 which facilitates the distribution of medicines and diagnostic kits for NTDs to countries, and enriches data collection with epidemiological information related to detected cases and transmission routes. However, given that NTDs have typically been overlooked by national health information systems in the past, relevant historical data are currently either non-existent or must be sought elsewhere (e.g., non-governmental organizations, research repositories, United Nations, etc.).
To create a more comprehensive epidemiological picture of the status of NTDs in a country or globally (e.g., discovering epidemiological silence, studying comorbidity of diseases), WHO needs an effective infrastructure to capture both the data collected through their in-house systems (i.e., WIDP and WIMEDS) and data from existing external data sources. For instance, they require part of the United Nations' dataFootnote 13 on countries' population and immigration, which the UN reports yearly. Notice that this is important for monitoring the main indicators about a disease, which are typically computed over the territory's population or the total number of potentially affected people. Lastly, in order to enable NTD data analysis for a specific geographical area (i.e., a country with its lower administrative levels) or for a specific health facility, WHO needs to include additional master data in the form of a geographical hierarchy. Such master data are stored separately within the WIDP system, where they are updated after being reported by a member country (e.g., changes in the administrative level organization, new health facilities requiring data collection at a lower administrative level).
For the sake of simplicity, we focus on the three most relevant data sources for analyzing NTDs: WIDP, with diagnosis and treatment data as well as geographical master data; WIMEDS, with medicine request and shipment data; and the United Nations' population and migration data. Hereinafter, we will use this use case to exemplify our technical contributions. In the following, we discuss the challenges raised during the development of WISCENTD, and the need to address them via operationalization and automation.
Data management challenges in WISCENTD
The data management challenges that arise in the WISCENTD use case, and which are addressed in our approach, are depicted in Fig. 2. Each challenge is represented by a shaded ellipse of a different color, where the incoming arrows indicate the data source(s) it arises from, and the outgoing arrows indicate the target (ready-to-be-consumed) data that its proper handling would yield. Details and examples for each challenge follow next.
Heterogeneity of data formats and models
WISCENTD data sources consist of both structured and semi-structured data. Precisely, both WIDP and WIMEDS data are available in JSON format (see, for example, an excerpt of diagnosis and treatment data from WIDP in Listing 1), while UN data are provided as CSV/Excel files.
Semi-structured models like JSON provide high flexibility for structuring the data in different ways (i.e., replication and redundancy, nesting, cross-referencing, etc.); however, they introduce a certain complexity when it comes to data representation and its automatic processing. Taking for instance the diagnosis and treatment data snippet extracted from WIDP (see Listing 1), we can see that the metadata used to create a data collection form (data_element) are mixed with data values. This is due to the data model provided by the underlying Web API of the DHIS2 platform, on which WIDP is built. Unlike WIDP, in the case of WIMEDS, the JSON format that stores the data coming from the underlying Bonitasoft process management platform is free of such noise, allowing its automatic processing (see an excerpt of such data in Listing 2).
Handling the variability of different data models is a challenging task for data governance, while additional noise in the form of application-specific metadata can easily hinder otherwise automatable data processing. Nevertheless, note that data sources typically expose the data in the form that good practice promotes and, in cases where they do not, converting it to such a form only requires simple source-specific pre-processing. For example, in Listing 3, we show a JSON snippet that results from converting the data in Listing 1; it contains the same relevant information, but is freed from additional unnecessary content (i.e., the data_element and value keys).
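As an illustration of such source-specific pre-processing, the following sketch flattens WIDP-like {data_element, value} pairs into plain key-value entries, as in the conversion from Listing 1 to Listing 3. The payload layout and field names (data_values, event_date) are hypothetical simplifications of the actual DHIS2 response; only CH_PR_16_COVID19 is taken from the use case.

# Minimal sketch of source-specific pre-processing for WIDP-like payloads.
# Assumes each record carries its values as a list of {"data_element": ..., "value": ...}
# pairs under a hypothetical "data_values" key; field names are illustrative only.
import json

def flatten_widp_record(record: dict) -> dict:
    """Turn {"data_element": k, "value": v} pairs into plain key-value entries."""
    flat = {k: v for k, v in record.items() if k != "data_values"}
    for pair in record.get("data_values", []):
        flat[pair["data_element"]] = pair["value"]
    return flat

if __name__ == "__main__":
    raw = {"event_date": "2021-03-15",
           "data_values": [{"data_element": "CH_PR_16_COVID19", "value": "true"}]}
    print(json.dumps(flatten_widp_record(raw), indent=2))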
Moreover, having the additional metadata in the form of a schema (e.g., JSON Schema) would allow automatic parsing of a specific source format, thus making its data easily available for the rest of the data pipeline.
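For instance, a minimal JSON Schema registered alongside the source could be used to validate incoming records before they are handed over to the rest of the pipeline. The sketch below assumes the third-party jsonschema package; the schema content and field names are illustrative, not the actual WIDP schema.

# Sketch: validating a cleaned WIDP-like record against a registered JSON Schema.
# Requires the "jsonschema" package; schema and field names are illustrative.
from jsonschema import validate, ValidationError

WIDP_DIAGNOSIS_SCHEMA = {
    "type": "object",
    "properties": {
        "event_date": {"type": "string", "format": "date"},
        "CH_PR_16_COVID19": {"type": "string"},
    },
    "required": ["event_date"],
    "additionalProperties": True,  # tolerate fields added later by schema evolution
}

def parse_record(record: dict) -> dict:
    try:
        validate(instance=record, schema=WIDP_DIAGNOSIS_SCHEMA)
    except ValidationError as err:
        raise ValueError(f"Record does not match the registered schema: {err.message}")
    return record  # safe to hand over to the rest of the data pipeline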
Data accessibility for external data sources
While the internal data sources can be ingested by dumping them periodically into the data lake, accessing external data sources can imply more complex processes, typically retrieving data via a publicly available API. Both WIDP and WIMEDS are accessed via Web APIs. The former is accessed by supplying at least a programID, which refers to the Web form from which the data should be obtained, and a lastUpdatedDate, which specifies the period to which the ingested data refer. However, some APIs require more complex querying. For instance, calls to the WIMEDS API first require specifying the business data type of interest (e.g., requests, shipments, medicines), and then, depending on the complexity of the information requested, a default or pre-defined custom business data query needs to be specified inside the API call (e.g., requests by disease, shipments by country, medicines by IDs).
The UN data source (external to the WISCENTD system) is accessible via a publicly available API,Footnote 14 by providing an artifact specifying the data category (in our case, population and migration data) and an artifactID specifying the variables to be retrieved (e.g., population totals, population by age range, population by gender, international migrant stock in absolute numbers or as a percentage of the population). Additional parameters can be provided to filter the results by specific countries or specific years.
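To illustrate these access patterns, the sketch below issues parameterized calls for the three sources. The base URLs and endpoint paths are placeholders (the real DHIS2, BonitaSoft and UN endpoints differ), while the parameter names (programID, lastUpdatedDate, business data type, artifact, artifactID) follow the ones discussed above.

# Sketch of parameterized access to the three sources; URLs and paths are placeholders,
# parameter names follow the text.
import requests

def fetch_widp(base_url: str, program_id: str, last_updated: str) -> list:
    # WIDP (DHIS2-based): at least the Web form (programID) and the period (lastUpdatedDate)
    resp = requests.get(f"{base_url}/api/events",
                        params={"programID": program_id, "lastUpdatedDate": last_updated},
                        timeout=30)
    resp.raise_for_status()
    return resp.json()

def fetch_wimeds(base_url: str, business_type: str, query: str, **filters) -> list:
    # WIMEDS (BonitaSoft-based): business data type plus a default or custom business data query
    resp = requests.get(f"{base_url}/bdm/{business_type}/{query}", params=filters, timeout=30)
    resp.raise_for_status()
    return resp.json()

def fetch_un(base_url: str, artifact: str, artifact_id: str, **filters) -> list:
    # UN data API: data category (artifact) and variables (artifactID), optional country/year filters
    resp = requests.get(f"{base_url}/{artifact}/{artifact_id}", params=filters, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Example call (hypothetical identifiers):
# fetch_wimeds("https://wimeds.example.org", "requests", "findByDisease", disease="Chagas")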
From these examples, we can see that accessing external data sources may sometimes require complex querying schemes that vary with the complexity of the exposed API. While the processing of specific data models can be generalized (e.g., reading keys and values is common to virtually any JSON document), the access patterns to obtain the data from a source require source-specific programs that can be parametrized to obtain the required portion of the input data (e.g., from specific endpoints or for specific time periods).
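As a sketch of the part that can be generalized, a single traversal routine suffices to read the keys and values of any JSON document once a source-specific access program has retrieved it:

# Sketch: generic, source-independent traversal of keys and values in a parsed JSON document.
from typing import Any, Iterator, Tuple

def iter_key_values(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) pairs for every leaf of a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_key_values(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from iter_key_values(value, f"{path}[{i}]")
    else:
        yield path, node

# Works uniformly on WIDP, WIMEDS or UN payloads, e.g.:
# for path, value in iter_key_values(fetch_un(...)): ...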
Co-existing schema versions and schema evolution
Moreover, being based on web forms, the schema of WIDP data is easily extensible to include new fields to be collected in countries. For instance, during the COVID-19 pandemic, an important additional parameter for treating Chagas disease patients could be whether they were previously infected by COVID-19, so that the potentially toxic and strong medicines for treating Chagas disease are administered with care. In such a case, a new variable (CH_PR_16_COVID19) is added to capture this information in the WIDP data (see the snippet in red in Listing 4).
Although data formats like JSON make it easy to extend the schema to dynamically model the changing reality, handling data with different schema versions is a challenging data governance task. To support the evolution of source schemata and allow the co-existence of different schema versions in the system, it is essential to maintain the metadata of the different schemata and, more importantly, to guarantee the correct relationship between the data and their corresponding schema version.
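By way of illustration, a minimal schema registry could keep every version of a source schema and record which version each ingested dataset conforms to. This is a sketch only; the structures below are illustrative and do not correspond to the actual metadata artifacts of our framework.

# Sketch of a schema-version registry: every schema version is kept, and each ingested
# dataset is linked to the version it conforms to. Structures are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SchemaRegistry:
    versions: Dict[str, List[dict]] = field(default_factory=dict)   # source -> ordered schema versions
    dataset_version: Dict[str, int] = field(default_factory=dict)   # dataset id -> schema version index

    def register_schema(self, source: str, schema: dict) -> int:
        self.versions.setdefault(source, []).append(schema)
        return len(self.versions[source]) - 1  # new version number

    def tag_dataset(self, dataset_id: str, source: str, version: int) -> None:
        assert 0 <= version < len(self.versions.get(source, []))
        self.dataset_version[dataset_id] = version

    def schema_of(self, dataset_id: str, source: str) -> dict:
        return self.versions[source][self.dataset_version[dataset_id]]

# E.g., a new WIDP schema version is registered after CH_PR_16_COVID19 is added,
# while older datasets remain linked to the previous version.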
Data sources with complementary information
Given that WIMEDS collects data about medicine requests for treating Chagas patients (e.g., Lampit), it serves as a complementary data source to WIDP for treatment data. Such sources are important in the case of NTDs in order to create a more complete picture of the current status of a disease in a territory and to detect potential (intentional or unintentional) misreporting by a country.
This, however, introduces an additional challenge in managing NTD data, given that it requires integrating such data and properly handling possible duplicates. Importantly, proper data governance requires sufficient metadata to facilitate the integration of such data into a common dataset (e.g., a conceptual model of the target domain and mappings of the different source schemata to that model) and to further enable other (instance-level) data integration steps, such as entity resolution.
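For illustration, such metadata can be as simple as attribute-level mappings from each source schema to a shared domain concept, which then drive both the projection onto a common dataset and a first duplicate check. All attribute names below are hypothetical and serve only to sketch the idea.

# Sketch: attribute-level mappings from two complementary sources (WIDP and WIMEDS)
# to a common "Treatment" concept; all attribute names are hypothetical.
MAPPINGS = {
    "WIDP":   {"patient_uid": "patientId", "treatment_drug": "medicine", "event_date": "date"},
    "WIMEDS": {"case_reference": "patientId", "requested_medicine": "medicine", "request_date": "date"},
}

def to_common(source: str, record: dict) -> dict:
    """Project a source record onto the common schema using the registered mappings."""
    return {target: record[src] for src, target in MAPPINGS[source].items() if src in record}

def merge_without_duplicates(records: list) -> dict:
    """Naive entity resolution: keep one record per (patientId, medicine, date) key."""
    merged = {}
    for rec in records:
        merged[(rec.get("patientId"), rec.get("medicine"), rec.get("date"))] = rec
    return merged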
Deltas during data ingestion
The lastUpdatedDate parameter of the WIDP Web API allows incremental ingestion of diagnosis and treatment data into the data lake, i.e., ingesting only the data reported after the previous run of the ingestion process. In WIMEDS, there is no system-level parameter that enables incremental ingestion, but there are custom-defined parameters that allow incremental ingestion per business data type (i.e., get requests by the date of request, get shipments by the date of shipment). Using the UN data API, we can at any time obtain a subset of the data for specific time periods (i.e., years). However, the geographical master data from WIDP, given their hierarchical structure, do not allow incremental ingestion (e.g., a modification of intermediate administrative levels can obscure previously existing hierarchies, and the update would require complex reconstructions). Thus, the ingestion of such data into the data lake is always complete.
Both incremental and complete data ingestion must be effectively supported in data governance systems in order to easily maintain the required freshness of the input data, while preventing unnecessary duplicates or inconsistencies in the data.
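The sketch below contrasts the two ingestion modes under simplified assumptions: a delta load driven by the timestamp of the previous run (as with lastUpdatedDate in WIDP), and a complete load that replaces the previous copy (as required for the geographical master data). The landing-zone layout and the fetch callables are placeholders for the source-specific access programs discussed earlier.

# Sketch of incremental vs. complete ingestion into a landing zone; paths and the
# fetch callables are placeholders for the source-specific access programs.
import json, datetime, pathlib

LANDING = pathlib.Path("landing")

def ingest_delta(source: str, fetch, last_run: str) -> None:
    """Append only the data reported after the previous run (e.g., WIDP, WIMEDS)."""
    data = fetch(last_updated=last_run)
    out = LANDING / source / f"{datetime.date.today().isoformat()}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(data))

def ingest_full(source: str, fetch) -> None:
    """Replace the previous copy entirely (e.g., WIDP geographical master data)."""
    data = fetch()
    out = LANDING / source / "current.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(data))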
On the need to operationalize and automate data governance
Following from the specific challenges detected in the WISCENTD use case (but common to most Big Data projects), in this paper we propose a general framework to operationalize data governance through a novel architectural proposal of a zoned Data Lake (inspired by [6]), as well as to automate complex data management tasks that drive the data through the complete data lifecycle, by means of concisely defined metadata artifacts.
Overall, the proposed framework allows an easy incorporation of internal and external data sources into data analysis pipelines, while effectively handling both structured and semi-structured data formats. New data sources are expected to be accompanied by the minimum metadata necessary to automatically access, query, and parse their data. Such metadata can either be provided by the data source itself or be automatically extracted using available bootstrapping techniques (e.g., [32]). The framework supports both incremental and historical (complete) data ingestion, thus accommodating a wide variety of source access schemes. Moreover, the framework provides effective mechanisms to deal with evolving data sources, enabling the propagation of source schema changes to the rest of the data lifecycle. Lastly, by maintaining the correspondences between the data sources' variables and a common (domain-specific) schema, the framework facilitates the integration of complementary or akin data, providing a more complete and consistent view of the domain. In the following sections, we describe the proposed framework that operationalizes data governance, as well as the automation of such a framework under clearly defined assumptions, each denoted as Ax.