DaLiF: a data lifecycle framework for data-driven governments

The public sector, private firms, the business community, and civil society are generating data that is high in volume, veracity, and velocity and comes from a diversity of sources. This kind of data is known as big data. Public Administrations (PAs) pursue big data as the “new oil” and implement data-centric policies to transform data into knowledge and to promote good governance, transparency, innovative digital services, and citizens’ engagement in public policy. From these efforts, the Government Big Data Ecosystem (GBDE) emerges. Despite the vast interest in this ecosystem, managing big data appropriately throughout its lifecycle remains a challenge for governmental organizations. This study intends to fill this gap by proposing a data lifecycle framework for data-driven governments (DaLiF). Through a Systematic Literature Review, we identified and analysed 76 data lifecycle models, on which the proposed framework builds. In this way, we contribute to the ongoing discussion around big data management, which attracts the interest of researchers and practitioners.

In the public sector, the implementation of big data tools, technologies, and tactics offers opportunities including effective public service delivery, evidence-based decision making for policymakers, growth in the digital economy, the creation of new professional jobs, and the encouragement of civic participation in the definition and enhancement of public policies [6][7][8][9].
A Big Data Ecosystem (BDE) is a complex set of various interconnected components related to big data, models, organisational structures, and roles covering the whole data lifecycle [10]. It consists of diverse components, including data infrastructure, data models, data analytics, as well as organisational and cultural components [10][11][12].
Leveraging data to provide the most public value rests on the PAs' capability to cultivate and sustain a useful and effective data management environment. While achieving this requires various functions, roles, responsibilities, abilities, and skills for both technology and people, it is critical to pay special attention to the data lifecycle [13]. The data lifecycle represents all the phases of data during its life, from planning and collection to distribution, use, reuse, and destruction [14]. It provides a high-level overview of the phases involved in the successful management of big data for its use and reuse [6,15]. The data lifecycle helps to identify dataflows and work processes for stakeholders in the GBDE. Moreover, identifying the data lifecycle that fits an organization's data usage is a critical task, and the way the data lifecycle is managed is also important to transform data into knowledge [16] and to extract value from big data to improve the organization's data operations [17].
The remainder of the paper is structured as follows. Section 2 presents the background of our research work. Section 3 explains our research methodology. Section 4 presents the results of the literature review and the research implications. In the last section, we present the conclusion, discuss the limitations of this study, and propose future work.

Background and scope of this work
This section illustrates the background of this research, BDE fundamental theories, identified research gaps that have attracted our attention, and our contribution.

Overview of the Government Big Data Ecosystem and data lifecycle fields
The word "data" originates from the Latin word 'datum' [5]. Data is a discrete, limitless entity that has an unstructured and unprocessed shape. Organizations further process such data, as per their needs, to represent relevant objects, events, concepts, or facts [5].
Big data is a concept characterized by data that have high volume, veracity, velocity, and variety. Big data needs economical, advanced ways of information processing to be used for generating insight and supporting decision making [21][22][23]. In data-driven organizations, big data is regarded as a critical strategic asset [24]. The capability to manage big data generates opportunities for organizations to attain viable benefits in the present digitized marketplace [25,26]. This capability requires cost-effective and distinctive novel big data tools to be put in place [22,26,27]. The exploitation of big data signifies a paradigm shift in the approaches used to comprehend and study the world [6]. Private and [...] and related aspects. Such theories are often influenced by socio-technical, ecosystems, platform, actor-network, business process management, and value chain theories. These mixed theories are usually used in the literature to cover the theoretical gap, as the big data field is in its early stages [11,76]. Numerous business, research, and industry communities study the big data field [76][77][78]. For example, some definitions of BDE are specific to particular domains, such as humanitarian [79][80][81] and personal data ecosystems [67]. Such studies have a narrow perspective, focus on a specific notion with partial details, and describe BDE definitions and related terminologies that vary considerably [81][82][83]. We observe similar issues in data lifecycle studies as well. This is because the BDE area is in its infancy, and different research and business communities have been investigating the area separately [76]. Therefore, the existing BDE theory does not currently offer a full conceptual foundation for further studies in the research field. To extend insights into the current state of the BDE, the data lifecycle, and other related aspects, we conduct this holistic SLR as theoretical groundwork on BDE.

Research gaps
We identify the following research gaps while studying the data lifecycle for GBDE.
We observed that the proposed phases of the data lifecycles vary [56], and different terms are often used for the same phase. For example, in the case of data collection, some studies use the term "generation" [95,96], while other data lifecycles call it "receive" [17,90], "acquire" [55,63,73,97], or "capture" [84]. This creates confusion for researchers and practitioners who need to understand such phases, use them, and associate them with management practices in their organizations. So far, we could not find an all-inclusive data lifecycle model for GBDE.
In the literature, we did not find many studies that attempt to align their data lifecycles with industry standards for data management such as DM-BOK.

Our contribution
In this research article, we mainly concentrate on addressing the above-mentioned research gaps by suggesting an all-inclusive big data lifecycle for data-driven governments. We call this DaLiF. We found and analysed 76 data lifecycle models published during the last 25 years. We provide our research approach, a detailed description of the literature on data lifecycles, and DaLiF in the forthcoming sections of the study. What our proposed data lifecycle for GBDE adds is as follows:
(i) It can be applied in various public sector areas, such as health, agriculture, and education, while taking into account the business processes, requirements, and environment of the public organizations in these areas.
(ii) This data lifecycle is evolving, i.e., government big data does not need to go through the whole lifecycle before a fresh iteration can begin. Moreover, DaLiF phases can be traversed in several different orders and, in principle, an unlimited number of times.
(iii) DaLiF contains phases that can contribute to creating Open Government Data (OGD) values, namely political, economic, and social values, as proposed in [98] as future research work.
(iv) It is the first big data lifecycle that consists of fourteen phases, which include phases reported as missing in [17,61,98], and it corresponds to the public administration vision of data management in GBDE.
(v) Specific functions at each phase address the big data 4Vs challenges, i.e., Volume, Variety, Velocity, and Veracity, which demonstrates the completeness of the proposed lifecycle.
(vi) We consider data protection at each phase of the data lifecycle to tackle major big data concerns such as privacy, integrity, availability, confidentiality, and data security in GBDE.
(vii) Mining valuable data from a large influx of information is a critical issue in big data. Therefore, we include a data enrichment phase in DaLiF to enrich data and to qualify and validate the related aspects in GBDE.
(viii) DaLiF contains a data quality phase with key functions, such as conformance to data quality business rules, to ensure high data quality.
(ix) We consider data quality, protection, storage, and archiving at each phase of the data lifecycle for appropriate data management in GBDE.
(x) We categorically focus on the data governance phase, which is the overall process of managing and controlling government big data to maximize data usage and contribute to value creation in GBDE.
(xi) Finally, we mapped DaLiF to DM-BOK, which is an industry standard for data management.

Research method
In the earlier section, we identified research gaps while reviewing the GBDE, including data lifecycles. This study aims to mitigate these research gaps by identifying relevant existing data lifecycle models, analysing these models, and proposing DaLiF based on them.
To accomplish the above-mentioned research goal, we performed qualitative research about the government big data ecosystem and data lifecycles by conducting a Systematic Literature Review (SLR). Fink described the literature review as a systematic, specific, and reproducible method to find, evaluate, and synthesize the existing research work produced by researchers, scholars, and practitioners [99]. We applied this research methodology following guidelines from the literature [18][19][20]. We describe our approach in five steps. The first step of the SLR process, or research review protocol, focuses on devising the research questions. The second step concentrates on three sub-activities: selecting digital research libraries, creating search strings, and searching the literature. The third step focuses on identifying the relevant research articles and applying a quality assessment to them, supported by specifying and applying inclusion and exclusion criteria. The fourth step examines the research articles, extracts relevant information, verifies the results, and connects the findings to the research gaps. The last step describes the research outcomes and organizes them in our proposed data lifecycle for the GBDE.
Fig. 1 summarises the steps. We briefly present them in the subsequent sections.

Goal and research questions
Our research aims to identify and analyse the existing data lifecycle models, find common and complementary phases with their distinct labels of data lifecycles, and suggest an extensive data lifecycle framework for data-driven governments. For this, we outline the Research Questions.
To produce the research questions, we conducted a preliminary review of the related literature about GBDE. We identified the above-mentioned research gaps related to the data lifecycles and used these gaps to focus our research questions. We applied a combination of two basic gap-spotting modes, confusion spotting and application spotting, to construct our research questions [100,101]. We partially applied the PEO framework [101][102][103] to give our research questions an appropriate structure and ensure clarity. Additionally, we evaluated the robustness of our qualitative research questions using the FINER criteria [103][104][105] to ensure that they are feasible, interesting, novel, ethical, and relevant.
As a result of the research review protocol's step 1, "formulating the research questions", we devised the following research questions.
(RQ1) What are the existing data lifecycle models described in the literature, and what phases do they introduce?
(RQ2) What is an all-inclusive big data lifecycle framework for data-driven governments, and what are its phases?
RQ1 aims to find and organise the literature on the data ecosystem lifecycle. RQ2 intends to propose a new big data lifecycle framework for data-driven governments and its phases, which we call DaLiF and which considers and builds on top of the current state-of-the-art data lifecycle models.
Fig. 1 Steps in the SLR process
We incorporate a comprehensive analysis of the data lifecycle models in the literature and explain DaLiF and its phases. We showcase a graphical view of DaLiF and discuss our research findings.

Finding relevant research work
An SLR uses a comprehensive method to review the findings presented in previously published research. It demands extensive coverage of current research on the topic of interest within the stipulated period. In a research survey, most researchers who had conducted a systematic review reported that they did not stop their literature search until they believed they had attained their target [19]. This section explains the second step of the SLR process, which covers our selected digital research libraries, the procedure for formulating search strings, and the searching process.
Selection of digital research libraries
In an SLR, the selection of digital research libraries is the phase in which researchers decide where and how to search for studies that contain information relevant to the research questions [106].
As a result of this sub-step, we selected and examined the following four digital research libraries for the literature search: ACM, Science Direct, IEEE Xplore, and Springer Link.
Formulation of search string
We devised search strings to discover research articles in the selected digital research libraries based on the following actions:
1. We formulated search strings related to the above-mentioned RQs.
2. For critical aspects, like data actor, we found alternate words and synonyms for the keywords.
3. We used Boolean operators like 'OR' and 'AND' to extend the search by adding other words and synonyms.
4. In some cases, we also adjusted the search string.
5. Each search string includes one or more keywords like "data lifecycle" or "data model".
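The way synonyms and Boolean operators combine into a search string can be sketched in a few lines of Python. This is only an illustration: the function `build_query` and the two concept groups are assumptions made for the example, and the strings actually used in the study are those listed in Additional file 1.

```python
def build_query(concept_groups):
    """Join the synonyms of each concept with OR, then join the concepts with AND."""
    parts = []
    for synonyms in concept_groups:
        quoted = [f'"{term}"' for term in synonyms]
        parts.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(parts)


# Hypothetical concept groups, chosen only to illustrate the construction.
concept_groups = [
    ["data lifecycle", "data life cycle"],
    ["government", "public administration"],
]

query = build_query(concept_groups)
print(query)
```

Adding a synonym widens the OR group of its concept, while adding a new concept group narrows the search through an additional AND.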
The above-mentioned measures were applied to formulate search strings for our research questions. As a result of this sub-step, we formulated and used different search strings, which are listed in Additional file 1 to meet the journal's page limits.
Literature search
We began the literature search to obtain relevant research papers in February 2019 and carried out this procedure until May 2021. All results from searches are primarily based on titles, keywords, and abstracts. To get relevant literature for DaLiF, we completed the literature search in the following two stages.
Stage-I: We searched the aforementioned four digital libraries using search strings and their variants, which are based on the following keywords: "DATA LIFECYCLE", "DATA LIFE CYCLE", "DATA ECOSYSTEM", "BIG DATA LIFECYCLE", "BIG DATA LIFE CYCLE", "GOVERNMENT DATA ECOSYSTEM", "DATA LIFECYCLE FOR GOVERNMENT", "DATA-DRIVEN GOVERNMENT", "DATA LIFECYCLE FOR DATA-DRIVEN GOVERNMENT", "DATA LIFECYCLE FOR PUBLIC ADMINISTRATION", along with the options "exact phrase" and "matches all". We examined the outcomes of this first stage and matched the results with our crucial research sub-topics regarding the big data lifecycle of GBDE. We observed that we required additional relevant research papers, so we decided to perform Stage-II.
Stage-II: In this stage, we expanded the search queries performed in Stage-I by using the option "matches any" instead of the options "exact phrase" and "matches all".
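The difference between the three search options can be illustrated with a small sketch. The semantics below are an assumption about how such options commonly behave; the actual behaviour of each digital library may differ, and the titles are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(text, phrase, mode):
    """Assumed semantics of the options 'exact phrase', 'matches all', 'matches any'."""
    tokens = tokenize(text)
    words = tokenize(phrase)
    if mode == "exact phrase":
        # The words must appear next to each other and in the same order.
        n = len(words)
        return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))
    if mode == "matches all":
        # Every word must appear somewhere, in any order.
        return all(word in tokens for word in words)
    if mode == "matches any":
        # At least one word must appear.
        return any(word in tokens for word in words)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical titles, used only to show how the options differ.
for title in ["A lifecycle model for open data", "Managing research data in universities"]:
    for mode in ("exact phrase", "matches all", "matches any"):
        print(title, "|", mode, "|", matches(title, "data lifecycle", mode))
```

Under these assumed semantics, "matches any" accepts every record that the other two options accept, which is why switching to it in Stage-II can only enlarge the result set.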
Result of SLR process step 2
As a result of the research review protocol's step 2, "finding relevant research work", we collected 1217 research articles in total. We kept the literature search outcomes in a spreadsheet where every row corresponds to a research paper. We recorded various attributes and metadata per paper, like paper ID, authors, title, source, keywords, abstract, year of publication, unique viewer tag, searching date, associated search term, and study goal.

Identifying preliminary studies and article quality assessment
The procedure to identify preliminary studies, apply inclusion and exclusion criteria, and perform a quality assessment is based on the following measures. We rigorously applied our inclusion and exclusion criteria to evaluate the relevance of the studies to our research objectives. We completed all the following steps manually. The scrutiny process consisted of three phases. We first describe our inclusion and exclusion criteria and then explain the three phases of our process.
Inclusion criteria
To select only the relevant research articles, we included resources that fulfill one or more of the following criteria:
Scrutiny process
In this section, we describe the key phases used to scrutinize the research articles, along with the quality assessment of the studies.
Remove duplication
We merged the research articles found in the preceding literature search phase into a single common folder and removed the duplicates. Of the 1217 research articles gathered in the literature search, 1198 remained after this stage.
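The study carried out this step manually. Purely as an illustration, duplicate removal over spreadsheet rows could be sketched as below, assuming that two records are duplicates when their titles are identical after normalization. The matching rule, the function names, and the sample records are all hypothetical.

```python
import re


def normalize_title(title):
    """Lowercase the title and drop punctuation so trivial variants compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def remove_duplicates(records):
    """Keep the first record seen for each normalized title."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical rows; the same paper is returned by two libraries.
records = [
    {"title": "A Data Lifecycle Model", "source": "IEEE Xplore"},
    {"title": "A data lifecycle model.", "source": "Springer Link"},
    {"title": "Open Government Data Ecosystems", "source": "ACM"},
]

unique = remove_duplicates(records)
print(len(records), "records before,", len(unique), "after")
```

Title-based matching can miss duplicates whose titles differ between libraries, so a manual check remains advisable.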
Initial scrutiny based on abstract and title
Initially, we investigated the research articles based on abstract and title. If a research paper could not be judged for inclusion or exclusion based on these traits, it was carried forward to the next step of examination. Titles and abstracts were independently assessed by two researchers. Each researcher noted the research articles for which the inclusion or exclusion decision was uncertain. Both researchers mostly reached similar outcomes, and there was negligible disagreement between them about the inclusion and exclusion of papers. Nevertheless, they held meetings to settle differences and to discuss disputed and marginal research articles. As a result of this initial scrutiny, we reduced the papers from 1198 to 578 research articles.
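The comparison of two independent assessments can be sketched as follows. This is an illustrative sketch only: the paper IDs, the decision labels "include", "exclude", and "unsure", and the function `split_decisions` are assumptions, not part of the study's protocol.

```python
def split_decisions(reviewer_a, reviewer_b):
    """Separate papers with a shared, certain decision from papers that need discussion."""
    agreed = {}
    to_discuss = []
    for paper_id, decision_a in reviewer_a.items():
        decision_b = reviewer_b[paper_id]
        if decision_a == decision_b and decision_a != "unsure":
            agreed[paper_id] = decision_a
        else:
            # Either the reviewers disagree or at least one of them is uncertain.
            to_discuss.append(paper_id)
    return agreed, to_discuss


# Hypothetical decisions of the two researchers.
reviewer_a = {"P001": "include", "P002": "exclude", "P003": "unsure", "P004": "include"}
reviewer_b = {"P001": "include", "P002": "include", "P003": "unsure", "P004": "include"}

agreed, to_discuss = split_decisions(reviewer_a, reviewer_b)
print("agreed:", agreed)
print("to discuss in a meeting:", to_discuss)
```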
Scrutiny based on the full text
In this phase, the research team examined the full text of the articles retained in the initial scrutiny phase. The same two researchers comprehensively studied and analyzed the full text of the studies, while a third researcher validated and verified the outcomes. The researchers assessed the quality of the research articles based on the inclusion and exclusion criteria. The quality assessment outcomes were satisfactory, owing to a method that includes, but is not limited to, strict implementation of the inclusion and exclusion criteria, internal meetings to resolve minor variances between the researchers, and validation of the results. Moreover, we paid attention to vital factors related to threats to validity, like selection and assessment bias. This phase reduced the papers from 578 to 232 studies. Thus, our preliminary studies include 232 articles.
Result of SLR process step 3
As a result of the research review protocol's step 3, "identifying preliminary studies and applying inclusion and exclusion criteria", our preliminary studies include 232 articles.
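The counts reported for the scrutiny process can be cross-checked with simple arithmetic. The sketch below uses only the four totals stated in the text; the stage labels are shorthand chosen for the example.

```python
# Number of articles remaining after each stage, as reported in the text.
stages = [
    ("literature search", 1217),
    ("duplicate removal", 1198),
    ("title and abstract scrutiny", 578),
    ("full-text scrutiny", 232),
]

# Articles excluded at each transition between consecutive stages.
excluded = [before - after for (_, before), (_, after) in zip(stages, stages[1:])]

for (name, remaining), dropped in zip(stages[1:], excluded):
    print(f"{name}: {dropped} excluded, {remaining} remaining")

retained_share = stages[-1][1] / stages[0][1]
print(f"retained overall: {retained_share:.1%}")
```

The three steps exclude 19, 620, and 346 articles respectively, so roughly 19% of the articles initially collected remain in the preliminary studies.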
We present the results of our literature search strategy in Fig. 2. We added the research articles of our preliminary studies to a reference manager tool to access, use, and manage references in our research work.

Analysis
In this section, we describe the fourth step of our methodology, i.e., the required data extraction from the research articles and presentation and organization of findings that propose DaLiF to "fill-in" the above-mentioned research gaps.
We comprehensively examined the research articles from our preliminary studies. We mined relevant information that includes existing data lifecycles, phases of data lifecycles, objectives, domains, and nature of these data lifecycles. Subsequently, we gathered and classified the mined information and relevant research articles to answer the research questions RQ1-2. We utilized a spreadsheet program as a data extraction template to capture and record the studies' information.
We applied the following steps to extract the outcomes. First, we acquired the general information, like authors, publication year, title, and publication type. Second, studies were examined according to the above-mentioned inclusion and exclusion criteria. In the third step, we placed mined data in the datasheet based on the critical aspects of our above-mentioned research questions.
To perform a detailed analysis, two researchers independently examined and analyzed the full text of the research articles. Both researchers compared their results and found minor disagreements. Later, they held meetings to review and settle their disagreements about text extraction, while a third researcher performed data extraction on a random sample and then verified and agreed with the results. The reporting of the analysis work was based on synthesized outcomes. We applied a descriptive synthesis method to explain the results in a manner consistent with our research questions.
Result of SLR process step 4
As a result of the research review protocol's step 4, "Analysis including verification", we describe our analysed information in the forthcoming sections to answer the above-mentioned research questions.
We report notable descriptive statistics from our SLR process. These figures confirm the growing interest in the area of this study. The big data lifecycle is one of the vital study areas among stakeholders. We noticed that digital research libraries, particularly Springer and IEEE, are prominent venues for data lifecycle research. The distribution of the research articles of our primary studies by publication period is given in Fig. 3.

Results
The last step of the above-mentioned SLR process describes the research outcomes and presents a foundation for the study. To this end, we thoroughly reviewed 232 research articles to extract relevant information. In this section, we map our investigation to our research questions. As RQ1 aims to find and organise the literature for our proposed data lifecycle for GBDE, we present existing related models in this section. As RQ2 intends to offer a new data lifecycle framework for data-driven governments and its phases, we also explain our approach in this section, which considers and builds on top of the current state-of-the-art data lifecycle models, i.e., the outcome of addressing RQ1.

RQ1: Existing data lifecycle models and their phases
We analysed the data lifecycle models mentioned in the research articles of our preliminary studies. We selected 76 data lifecycle models using the following logical rules: the researchers concerned carried out formal research work and offer a sound basis for proposing a model, and the model maps to some extent to industry standards for data management like DM-BOK, relates to areas of government, like public scientific research and open government, and consists of distinctive phases that can inform DaLiF. An overview of the data lifecycles and their phases follows.
Overview of literature data lifecycles
We present a few data lifecycle models and their related phases to provide a more detailed understanding. Additionally, we showcase a number of figures summarising findings about the data lifecycles.
A lifecycle for an organization's security data
In 2020, A. Shameli-Sendi presented a lifecycle for an organization's data protection [47]. This data lifecycle consists of the following phases: data creation, edit, display, process, transfer, and store.
IBM data lifecycle model
IBM defined a lifecycle model in 2013. Its phases include data creation, data use, analysis, data sharing, data update, data protection, archive, store/retain, and dispose [51]. IBM introduced the following additional layers in its data lifecycle model: data masking and archiving [51].
Data lifecycle for OGD
In 2019, H. ELMEKKI et al. proposed a data lifecycle that concentrates on Open Government Data (OGD) [98]. This data lifecycle consists of the following phases: data collection, data publication, data transformation, data quality, use, data interoperability, share, user feedback, and data archive.
Research data lifecycle
In 2020, R. Raszewski et al. described a data lifecycle that is based on the UK Data Service Research Data Lifecycle. They aim to find out the extent of data management education in nursing doctoral programs [107]. This data lifecycle consists of the following phases: plan research, data collection, process and analysis of data, publish data, share data, store data, and use and reuse data.
Abstract data lifecycle model (ADLM)
Knud Möller defined the Abstract Data Lifecycle Model (ADLM) in 2012. This lifecycle is specifically intended for the semantic web [108]. It consists of the following phases: planning, creation, enrichment, access, store, archive, feedback, and termination.
USA (NIST)-big data lifecycle model
This data lifecycle model was introduced by the National Institute of Standards and Technology (NIST), U.S. Department of Commerce, in 2015. The lifecycle consists of the following phases: collection, preparation, analysis, and action (analytics, visualization, access) [91]. This model concentrates more on data analytics than on data planning, data filtering, and data enrichment.
Research data lifecycle
In 2019, M. Jetten et al. described a research data lifecycle [109]. This data lifecycle consists of the following phases: plan, create, process & analysis, use, preserve, and access research data.
OGD lifecycle
In 2015, J. Attard et al. described an Open Government Data lifecycle [110]. This data lifecycle consists of the following phases: data creation, data selection, data analysis, data curation, data publishing, data discovery, data exploration, data storage, and data exploitation.
Data lifecycle for industrial and healthcare applications
In 2020, Kumar Rahul et al. described a data lifecycle for industrial and healthcare applications that focuses on managing big data analytics [111]. This data lifecycle consists of the following phases: create, store, analysis, use, share, archive, and destroy.
Web content management lifecycle
In 2003, S. McKeever, Dublin Institute of Technology, Ireland, defined the Web Content Management (WCM) lifecycle [112]. This lifecycle consists of the following phases: creation, deployment, share, data control and administration of the contents, storage, archive, use, and workflow.
Research360 data lifecycle
The Research360 data lifecycle was defined by A. Ball, University of Bath, UK. This lifecycle focuses on scientific research data, and it is an outcome of the Research360 project [84]. This data lifecycle consists of the following phases: design, collect & capture, interpret & analyse, manage & preserve, release & publish, and discover & reuse.
Smart data lifecycle
El Arass et al. defined a data lifecycle, identified as the smart data lifecycle, in 2018. This lifecycle focuses on how to transform raw and worthless data into Smart Data [60]. This data lifecycle consists of the following phases: planning, collection, integration, filtering, enrichment, analysis, visualization, access, storage, destruction, archiving, quality, and security.
Hindawi lifecycle
Nawsher Khan et al. introduced the Hindawi data lifecycle in 2014. This data lifecycle mainly concentrates on applying the tools, technologies, and terminologies of big data [61]. This lifecycle consists of the following phases: collection, filtering & classification, data analysis, storing, sharing & publishing, and data retrieval & discovery.
Data lifecycle for multistage privacy protection in IoT environment
In 2019, Soltani Panah et al. introduced a data lifecycle specifically centered on the IoT environment [97]. The data lifecycle consists of the following phases: data acquisition, data processing, data storage, use, and data dissemination.
Data lifecycle for information science
Gislaine P. F. et al. introduced this data lifecycle in 2021. It consists of the following phases: data collection, storage, visualization, and data disposal [113]. This lifecycle focuses on mapping its phases to GDPR principles and on how Blockchain technology manages big data in each phase of the lifecycle.
In Fig. 4, we present some statistics related to our findings. The first chart indicates the distribution of the 76 analyzed data lifecycle models by the authors' affiliations: industry, academia, and joint authorship. Although most of the data lifecycle model articles were written by academic researchers, approximately one quarter of the model studies were written by industry or jointly by academia and industry.
The second chart identifies the source area from which the research on data lifecycle models stems. Almost half of the models come from business areas (e.g., health, manufacturing, etc.). One quarter is technology driven (IoT, smart cities, cloud computing, etc.), and a bit less than a quarter comes from scientific research. Five models stem from the areas of open government and the semantic web.
We also investigated the geographic distribution of the data lifecycle models research. Using the ITU regional classifications [151], the third figure indicates that European researchers contribute more to this body of work (37%), followed by North America and Asia & Pacific regions (21% each).

RQ2: create a big data lifecycle framework for data-driven governments
In this part, we introduce our proposed data lifecycle framework for GBDE. We thoroughly investigated data lifecycles to learn from existing work. Specifically, we wanted to identify commonalities and all the different characteristics introduced by the models to take them into consideration. We explain how we discover and define the phases of DaLiF.
We adopted a comprehensive approach to propose DaLiF. Our approach consists of five steps. In step 1, we explain the exclusion of data lifecycles based on logical rules. In the second step, we identify and list phases from the 76 selected data lifecycles, whose selection criteria we described in the preceding section. After a detailed analysis based on certain logical rules, we ended up with fourteen distinct phases. In step 3, we group the fourteen phases into mandatory and optional. In the fourth step, we apply our analysis to validate the categorization of phases, and we conclude with six mandatory phases. In the last step, we map the phases to DM-BOK functions. We describe these steps in detail below.

Approach to DaLiF
Our approach to defining the DaLiF consists of the following steps.
Step 1: In the first step, we perform a thorough analysis of the data lifecycles mentioned in the research articles of our preliminary studies. We excluded 35 data lifecycles based on the following logical rules:
• The data lifecycle does not propose any distinctive phases.
• There is only a description of the data lifecycle and its phases without providing detailed work.
Step 2: Listing phases for our proposed data lifecycle: In this step, we identify and list more than 500 phases, including duplicate phase titles, coming from the 76 analysed data lifecycles. We examined these phases using the following logical rules:
• Grouping phases whose titles have a similar meaning (i.e., synonym detection) or use related terms across the various models, to propose a phase title that covers the same concept, e.g., for delete, destroy, dispose of, destruction, end of life, we include the phase "end of life".
• Consideration of a single phase where it is described in the literature with identical titles, e.g., for visualization [62], visualization [60], visualization [17], we included the phase "visualization".
• Removal of phases that are either too generic, confusing, or introduced only for a specific research topic by some researchers (e.g., data value, ontology, transformation).
• Removal of phases that were incorrectly labeled, e.g., design, big data.
• Combining phases with similar aims and activities to present a holistic phase with a common objective, e.g., sharing and publishing are combined into a single phase, "sharing/publishing".
The detailed analysis work based on the above led us to fourteen distinct groups/concepts, i.e., phases that we present in Table 1 together with the "source" references from where we find the various phases. The pictorial representation of the phases matrix with the terms that have been grouped to a more general term appears in Fig. 5. We detail these phases along with their key functions in the next section.
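As an illustration only, the title-grouping rules of Step 2 can be sketched in Python. The synonym table and the removal set contain just the examples named in the rules; the function name `group_phase_titles`, the data layout, and the numeric source identifiers are assumptions made for this sketch and do not reproduce the procedure actually used to build Table 1.

```python
from collections import defaultdict

# Hypothetical synonym table. Only the "end of life", "visualization", and
# "sharing/publishing" groupings come from the rules in the text.
SYNONYMS = {
    "delete": "end of life",
    "destroy": "end of life",
    "dispose of": "end of life",
    "destruction": "end of life",
    "end of life": "end of life",
    "sharing": "sharing/publishing",
    "publishing": "sharing/publishing",
    "visualization": "visualization",
}

# Titles that the rules remove as too generic, topic-specific, or mislabeled.
REMOVED = {"data value", "ontology", "transformation", "design", "big data"}


def group_phase_titles(raw_titles):
    """Map (title, source) pairs to canonical phases and their sorted sources."""
    grouped = defaultdict(set)
    for title, source in raw_titles:
        key = title.strip().lower()
        if key in REMOVED:
            continue  # removal rules
        canonical = SYNONYMS.get(key, key)  # synonym and identical-title rules
        grouped[canonical].add(source)
    return {phase: sorted(sources) for phase, sources in grouped.items()}
```

Calling `group_phase_titles` with pairs such as `("Destroy", 1)` and `("delete", 2)` places both sources under the single phase "end of life", which mirrors how each row of a phase/source table could be assembled.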
Step 3: Proposing mandatory and optional phases for the data lifecycle: After arriving at the above-mentioned fourteen distinct phases in step 2, we further investigate these phases to categorize them into mandatory and optional phases for DaLiF. We adopted the concept of mandatory vs. optional phases of the data lifecycle from the literature. Data related to different realms do not all follow the same phases of a lifecycle. Therefore, some phases can be considered mandatory, while other phases are treated as optional [17,142]. We classify phases as mandatory for the proposed data lifecycle based on the following criterion.
• A phase that appears in most data lifecycles which are found in the preliminary studies.
We classify a phase as optional if it does not comply with the above-mentioned criterion. After a thorough analysis, the categorization of the phases based on this criterion is described in Table 2. We assign the "mandatory" category to a phase that appears in more than 70% of the data lifecycles; in all other cases, we assign the "optional" category, as shown in Table 2.
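The frequency criterion of Step 3 can be expressed as a short Python sketch. Only the 70% threshold is taken from the text; the function name `classify_phases` and the representation of a lifecycle as a list of phase titles are assumptions made for illustration.

```python
def classify_phases(lifecycles, threshold=0.70):
    """Label a phase 'mandatory' if it appears in more than `threshold`
    of the lifecycles, and 'optional' otherwise."""
    total = len(lifecycles)
    if total == 0:
        raise ValueError("at least one lifecycle is required")
    counts = {}
    for phases in lifecycles:
        for phase in set(phases):  # count each phase once per lifecycle
            counts[phase] = counts.get(phase, 0) + 1
    return {
        phase: "mandatory" if count / total > threshold else "optional"
        for phase, count in counts.items()
    }
```

With four hypothetical lifecycles, a phase found in three of them (75%) would be labeled mandatory, while a phase found in only one (25%) would be labeled optional.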
Step 4: Validation of the categorization of phases of the data lifecycle: In this step, we apply our own analysis to validate the categorization of phases. In this analysis, we adopted the following criterion.
• A phase that remains relevant in the data lifecycles of government-related areas.
To implement the above-mentioned criterion, we first identify the main groups of government-related data lifecycles. To this end, we analyzed the 76 data lifecycles. As an outcome of this analysis, we found five main groups of government-related data lifecycles, as shown in Table 3.
We apply the above-mentioned criterion to validate the categorization of the phases of DaLiF. A phase that remains relevant in the data lifecycles of these five government-related areas is considered mandatory; in all other cases, the phase is assigned the "optional" category. The outcome of this validation approach is presented in Table 4.
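One possible reading of the Step 4 validation is sketched below in Python. The text does not define "remains relevant" operationally, so the sketch assumes that a phase is relevant to a group when at least one lifecycle of that group contains it; the group names and the function name `validate_phases` are placeholders, not the groups listed in Table 3.

```python
def validate_phases(phases, lifecycles_by_group):
    """Return {phase: 'mandatory' | 'optional'}.

    Assumption: a phase "remains relevant" to a group when at least one
    lifecycle in that group contains it; it is mandatory only if this
    holds for every group.
    """
    result = {}
    for phase in phases:
        relevant_everywhere = all(
            any(phase in lifecycle for lifecycle in group_lifecycles)
            for group_lifecycles in lifecycles_by_group.values()
        )
        result[phase] = "mandatory" if relevant_everywhere else "optional"
    return result
```

Under this assumption, a phase that is missing from all lifecycles of even one group would be downgraded to optional, regardless of how frequent it is overall.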
Step 5: Mapping of phases to the DM-BOK framework: In this step, we map our proposed phases to the above-mentioned DM-BOK functions [70], as described in Table 5.

Proposed data Lifecycle Framework for data-driven governments
The DaLiF is shown in Fig. 6. Figure 6 presents the mandatory phases in green, whereas phases in blue are optional. Data lifecycle phases in gray are horizontal phases performed throughout the lifecycle. The storage phase is both mandatory and horizontal. Hence, DaLiF consists of the following fourteen phases: planning, collection, preparation, analysis, visualization, storage, access, share/publish, use, re(use) & feedback, archive, and end of life. In the following paragraphs, we describe these phases of DaLiF comprehensively.

Description of DaLiF phases and their critical functions
In this part, we present the description of the fourteen phases along with the functions of DaLiF. A function indicates an essential activity related to data to be performed by an entity in a phase. We also incorporate the principles of the DM-BOK concept in the respective phases to align our work with the said data management standard.
Planning phase
The planning phase covers the activities to be performed in the medium and long term during the data lifecycle [6,53]. This phase consists of formulating a project (e.g., a research or business project) to achieve the PAs' desired goals. Through this phase, it is possible to know the overall objective of the data management; what policies and procedures will be required to treat the data, such as procedures to collect or generate government data; and what data types, sources, and methods are needed to analyze the government data. It also establishes how and where government data is to be stored, when such data will be archived or destroyed, and how it will be safely accessed by authorized users [46,48,49,55,149,158]. The planning phase can help PAs, including public sector scientific researchers, to save time, boost effective governance, and meet their data management planning needs in the GBDE [6,53]. The output of this phase is a holistic data management plan [48].
Planning phase key functions
The planning phase of DaLiF includes the following key functions:
• Plan for all required resources, including finance and personnel, metadata contents and formats, data storage, data security, and expected outcomes for each phase of the big data lifecycle [15,53,55].
• Identify the required individuals, describe the skills that each individual needs to acquire, define roles, and assign roles and responsibilities to the individuals and other public sector stakeholders [53].
• Define a data management plan that is a living document and covers numerous public data aspects such as data lifetime, approaches for data quality, data security, and data archiving [6,15,53].
• Provide a detailed description of the data that will be compiled, by whom, and how the data will be managed, made accessible, shared, and reused throughout the lifecycle [6,15,53,62].
• Develop an appropriate plan to select modernized and extensible tools for the data phases, including public data collection [15,58].
• Plan to prioritize public data that has greater potential for use and for publication on the web [93].
• Fully engage experts in the handling of data, records, and content in planning [70,71].
• Plan activities that apply quality management techniques to measure, assess, improve, and ensure the fitness of data for use [70].
Collection phase
The collection phase consists of a set of activities through which data is gathered from different internal and external sources and in different formats, i.e., structured, unstructured, and semi-structured forms [17,48,53,55,159,160]. Big data is worthless if it cannot be collected into consistent information [48]. Big data is created or collected from various data sources such as social networks, the Internet of Things, surveys, censuses, voting, historical maps, seismology motion sensor outputs, biological records, satellite observations, and commerce statistics [55,61,84,161]. The collection phase defines the moment when new data or metadata is created in the system [49,57,159]. In the common Extract, Transform, Load (ETL) procedure, "Extract" is close to "collect", and this procedure plays a vital role in data collection [162]. In the public sector, during the data collection phase, PAs should consider the once-only principle: collect data from citizens and businesses once and reuse it instead of recollecting it [58].
Collection phase key functions
The data collection phase of DaLiF includes the following essential functions:
• Collect raw data from any source in order to handle the big data 'Variety' challenge, and ensure endpoint input validation to avoid data security issues [23, 42-44, 57, 58, 91, 132].
• Implement a strategic plan to select modernized, extensible tools for public data collection platforms [58].
• Introduce a data protection awareness program while collecting data [23,70,132].
• Collect metadata about information, based on metadata standards, to ensure interoperability across an organization and to support other future courses of action [53,57,70].
• Adhere to the once-only principle in public sector organizations, collecting data only once from citizens and the business community [59].
• Manage the range of valid and trusted data sources for data collection [42-44,54].
• Manage the massive amount of data in any format to handle the big data 'Volume' challenge, and search for and discover new sources for data collection [42-44,54].
• Consider specific resources to manage the big data 'Velocity' challenge, which refers to the rate at which data streams are generated and the consequent ability to process them well [42-44].
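The once-only principle named in the collection functions can be illustrated with a minimal Python sketch. The class `OnceOnlyRegistry`, its in-memory dictionary, and the `collect` callback are hypothetical stand-ins for a shared government data store; they are not part of any framework cited above.

```python
class OnceOnlyRegistry:
    """In-memory stand-in for a shared store of already collected records."""

    def __init__(self):
        self._records = {}
        self.requests_to_subject = 0  # how often a citizen/business was asked

    def get_or_collect(self, subject_id, attribute, collect):
        """Return the stored value, asking the subject only on first use."""
        key = (subject_id, attribute)
        if key not in self._records:
            self._records[key] = collect()  # the only contact with the subject
            self.requests_to_subject += 1
        return self._records[key]
```

A second administration that calls `get_or_collect` for the same subject and attribute receives the stored value, so the citizen or business is not contacted again.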
Preparation
The preparation phase refers to data integration, filtering, and enrichment [91,127,135]. Integration consolidates data silos into a single place with a coherent and homogeneous structure [26,46,155]. Filtering focuses on purifying data by removing noisy and erroneous data [17,61]. Data enrichment refers to the process that appends to or otherwise enhances collected raw data with relevant context obtained from additional sources [10,63,143]. Integration enables users to run their queries easily and obtain responses from a single data source [28,155]. Integration can be considered a database subarea that provides uniform access to various data sources [163]. Data Lakes (DLs), conceptualized as big data repositories, store raw data and provide functionality for on-demand data integration with the help of metadata descriptions as and when required [162]. However, in DLs, data integration does not take place after the data collection. In the common Extract, Transform, Load (ETL) procedure, "Transform" is close to "integrate", and this procedure plays a critical role in data integration. Integration is based on a set of rules and policies [17]. Data integration is an interim step towards achieving a single source of data [17,28,164]. In the literature, we noticed that data integration comes at a considerable cost. This substantial cost is due to data integration across multiple domains, various formats, different vocabularies, metadata of varying quality, and political boundaries [28].
Filtering also allows data to be classified into different formats, such as structured and unstructured formats [61]. The filtered data is further processed in the succeeding phases. After due process, policymakers use filtered data to make better decisions within a limited time and with fewer resources [58,152]. The software development team implements data filters to extract the required data from the vast amount of collected data [62]. The output of data filtering is a set of categorized, purified, anonymized, less noisy, and less error-prone datasets [17,42,61,121].
In enrichment, the normalized, enriched, and simplified data is used in data analysis and mining to generate new information [130,135]. The enrichment activities are performed on the massive amount of integrated data to limit the selection of data according to certain criteria [28,61,62,90]. The outcome of data enrichment is a set of datasets that are more refined and mature than the original raw data and that can be used either for further analysis or for archiving for future inquiries over historical data [42].
Preparation phase key functions
The preparation phase of DaLiF includes the following key functions:
• Create a homogeneous set of data by consolidating data gathered from numerous data sources [6,60].
• Implement a plan to select modernized and extensible tools for public data integration platforms [17,58] and tools for data filtering [58,62].
• Ensure scalability for big data that is high in volume, veracity, velocity and comes from a diversity of sources [155].
• Process big data with the addition of extra measures to achieve a public administration's short-term integration goals [164].
• Perform activities such as forming relations among variables of different data sources, adapting units, translating, and building a single database with all the acquired data, so that government data is traceable and easier to access for future use [46].
• Consider data privacy protection constraints to avoid revealing private information, such as citizens' personal information and classified government information, in the integrated data [46].
• Identify noise and errors in the collected data and process this public data to remove such issues [61].
• Involve reliable and authorized personnel in the phase to avoid the leakage of sensitive government data [23,121].
• Define filtering criteria to be used by PAs and researchers to filter public data, including research data, according to their needs [61,62].
• Verify the reliability of the GBD sources as well as of the organization's own data, to manage any data inconsistencies [40,42].
• Carry out certain fundamental data transformations to optimize the volume of data flowing from the data collection to the quality phases [42].
• Prepare data from additional internal and/or external sources to be merged with existing public data for data valuation [10,70,143].
• Extend existing information by completing missing or incomplete data [61,63,130].
• Carefully process data to eliminate unnecessary, misleading, unreliable, and duplicate information [42,130].
• Establish an effective data integration architecture that controls the replication and flow of data to ensure data quality and consistency, especially for reference and master data [70].
• Describe source-to-target mappings and data transformation designs for extract-transform-load (ETL) programs and other technology for continuous data cleansing and integration [70].
• Implement methods for integrating data from multiple sources, together with suitable metadata, to ensure meaningful integration of the data [70].
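The three preparation activities (integration, filtering, and enrichment) can be sketched as a small Python pipeline. The function names, the record layout as dictionaries, and the lookup table used for enrichment are assumptions for illustration; real integration platforms involve far more than this.

```python
def integrate(*sources):
    """Integration: consolidate records from several sources into one list."""
    return [dict(record) for source in sources for record in source]


def filter_records(records, required_fields):
    """Filtering: drop records with missing required fields and exact duplicates."""
    seen, kept = set(), []
    for record in records:
        if any(record.get(name) in (None, "") for name in required_fields):
            continue  # erroneous or incomplete record
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            continue  # duplicate record
        seen.add(fingerprint)
        kept.append(record)
    return kept


def enrich(records, lookup, key, new_field):
    """Enrichment: append context from an additional source, keyed on `key`."""
    return [{**record, new_field: lookup.get(record[key])} for record in records]
```

Chaining the calls as `enrich(filter_records(integrate(a, b), fields), lookup, key, new_field)` follows the order in which the phase description introduces the three activities.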
Analysis
The analysis phase is the most common phase across all data lifecycle models. The analysis phase enables an organization to handle ample information that can affect the business [61]. This phase is responsible for developing all data analysis and data analytics to extract knowledge and discover new insights [34,42,165]. This phase is like a human brain, i.e., it processes information to determine the next appropriate actions [143,155]. In this phase, different big data analytical tools are used to analyze data. The data analysis tools help policymakers analyze large and complex data to understand what is happening (descriptive analytics), to understand the reasons why something is happening (causality), to run what-if scenarios, and to produce forecasts. Policymakers use forecasts in their decision making [143,155]. Modern technologies, state-of-the-art data infrastructure, and highly skilled people are critical to extracting relevant insights from big data in the analysis phase. Examples of modern technologies include machine learning, deep learning, artificial intelligence, and natural language processing. Relevant skills include data analytics, data mining, computing, statistics, etc. [152]. The analysis phase includes the analysis of unstructured data [91]. The output of the data analysis phase includes knowledge, the discovery of new insights, new data, interpretations, and/or new datasets [42,54,61,165].
Analysis phase key functions
The data analysis phase of DaLiF includes the following key functions:
• Select data sources, including the identification of descriptions, data source locations, file types, and data provenance [58,91].
• Analyse data to extract knowledge and discover new insights, which decision-makers then use intelligently to generate value for public organizations [17, 42-44, 54, 61].
• Consider innovative data analysis strategies, such as schema on read, to manage public data through modern tools [91].
• Select appropriate data analysis tools and techniques, such as data mining algorithms, cluster analysis, correlation analysis, statistical analysis, and regression analysis, to analyse public organizational data [58,61].
• Set up a group of data scientists (actors) with sound expertise in data analytics to analyse various types of public data, particularly unstructured data, and describe a set of actions to be completed by the group [61,62,91,152].
• Discover business processes that can be enhanced through big data technology, analyse the existing issues in each business process, and re-engineer each business process using big data technology [62].
• Define the required types of big data analysis, which include descriptive, predictive, prescriptive, and diagnostic analysis [34,62,165].
• Prepare and publish the outputs of the analysis phase in machine-readable formats [53,55].
• Extract value from big data through its extensive use and offer a natural interface to the data users [35,42-44,54].
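To make the distinction between descriptive and predictive analysis concrete, the following Python sketch summarizes a series and produces a simple forecast with an ordinary least-squares line. It relies only on the standard library; the function names and the choice of linear regression as the forecasting technique are assumptions for illustration, since the text names regression analysis only as one option among several.

```python
from statistics import mean


def describe(values):
    """Descriptive analytics: summarize what the data looks like."""
    return {"count": len(values), "mean": mean(values),
            "min": min(values), "max": max(values)}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def forecast(xs, ys, x_new):
    """Predictive analytics: extrapolate the fitted line to a new x value."""
    slope, intercept = fit_line(xs, ys)
    return slope * x_new + intercept
```

A what-if scenario in this setting amounts to calling `forecast` with different hypothetical values of `x_new` and comparing the results.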
Visualization
The visualization phase deals with the presentation and visualization of the outcomes, as well as the explanation of the meaning of the discovered information [62,91].
The visualization phase has the highest value for the data consumer in the information value chain, and it also boosts the interaction between data analytics and organizations [91,155]. We also noted the following categorization of data visualizations: exploratory, explicatory, and explanatory visualization. The first category emphasizes better data understanding, particularly for large amounts of repurposed data, because the volume of the datasets calls for new methods. Examples of exploratory data visualization include browsing, boundary conditions, and outlier detection. The second category focuses on analytical results. Examples of explicatory visualization include confirmation, interpreting analytical results, and near real-time presentation of analytics. The last category is about 'telling the story' and presenting results in a simple way that a layperson can easily digest. Examples of explanatory visualization include business intelligence, reports, and summarization [91]. The results of this phase can be offered in various forms, such as dashboards, oral presentations, user interactions, alerts, and reports [62].
Visualization phase key functions
The visualization phase of DaLiF includes the following key functions:
• Visualize public data so that less tech-savvy decision-makers can understand and use the results for effective decision making [64,91].
• Implement a plan to select modernized and extensible tools for data visualization, such as pipes in the 'R' programming language and geoms (Cleveland dot plots, box plots, and jittered graphs) [58,64].
• Encrypt the resulting information and knowledge and adopt an access control strategy to avoid privacy threats [23,132].
• Adopt appropriate mechanisms for reporting and analysing the data, including online and web-based reporting, BI scorecards, ad-hoc querying, OLAP, and portals [70].
Storage
The objective of the storage phase is to save data securely throughout the life cycle. Data storage is an essential process of big data analytics in real-world applications [65,166]. We noticed in the literature that the storage phase is considered in all data lifecycle models. There is a demand for stable and usually web-accessible storage [90,144,147]. Data Lakes ingest raw data in its original format from various data sources, fulfil their role as storage repositories, and allow users to query and explore them to extract knowledge [167]. In the standard Extract, Transform, Load (ETL) procedure, "Load" is close to "store", and this procedure plays a critical role in data storage [162]. The activities of other data lifecycle phases, such as data access, publishing, data sharing, data use, and re(use), can be executed only once the data is stored [85]. However, big data storage is also a complex, costly, and challenging data lifecycle phase. In the public sector, government entities usually set up base registries to store GBD of particular importance, i.e., master data. A base registry is a reliable and authentic source of information about people, health, vehicles, crime, and businesses [58]. Base registries help PAs eliminate data silos and maximize the re-use of data across public sector entities easily and inexpensively [15,58]. In the storage phase, different modern tools and technologies are required to store big data, such as NoSQL, NewSQL, Big Data Query Platforms, the Hadoop Distributed File System (HDFS), and cloud storage technologies [50,95,96,168,169]. Moreover, several NoSQL technologies are available, such as HBase, MongoDB, Cassandra, CouchDB, DynamoDB, Riak, Redis, and Neo4J. These technologies store data streams in real time in a NoSQL database [127].
Storage phase key functions
The storage phase of DaLiF includes the following essential functions:
• Identify the public data to be stored and specify a data repository or data center where the shared data will be stored [15,58].
• Develop and implement an appropriate short- and long-term storage plan to store data in the GBDE [53].
• If relevant, ask citizens and businesses for permission to store data that belongs to them [66].
• Store data in an appropriate location (in-house data center or private cloud environment) in a secure, scalable, accessible, and reliable manner [65,90,147].
• Comply with industry standards a) to store GBD with improved data structures, appropriate cloud data security, and support for fault tolerance; and b) to improve the performance of data storage systems in terms of capacity and speed [65,75,142,166].
• Implement a plan for the selection of modernized and extensible tools for data storage, along with a balanced approach to data availability and scalability [58,64,95].
• Establish base registers to store public data at national and cross-border levels [58].
• Work continuously on data storage with improved data structures and support for fault tolerance [65,85].
• Adopt approaches based on encryption techniques to ensure privacy protection in the data storage phase [95,96].
• Implement a document and content management system that offers storage of electronic documents and electronic images of paper documents, versioning, security, metadata management, content indexing, and retrieval capabilities [70,71].
Access
The data access phase focuses on the ways of communication between the data provider and the data consumer in the big data ecosystem [60]. Through this phase, we decide and document which user [60,147] or re-user [90,147] is accessing which data and through what mechanisms [58]. Public sector organizations offer multiple channels for data access [94].
Access phase key functions
The access phase of DaLiF includes the following essential functions:
• Ensure access to public data for users and reusers on a day-to-day basis in line with the agreed and signed agreement [60,90,147,149].
• Define data access controls and data authentication methods [58,90,117].
• Establish data access models such as cloud, intranet, and virtual desktop models that help determine the hosts' identity and authority, clarify the operation authority, identify and authenticate remote users, and ensure secure communication, respectively [117].
• Ensure that data that is openly accessible to all users does not contain any classified or privacy-sensitive information, to avoid personal data privacy threats [109].
• Ensure that limitations on access are communicated and respected [17,90,147].
• Allow government data exchange platforms, such as the Belgian platform 'MAGDA', to further facilitate data access and the exchange of data among public bodies [58].
• Store mission-critical data that analytical tasks need to access frequently in a way that offers fast retrieval and updates, while less urgently accessed data can be stored in a database, on disk, or in data files [120].
• Implement dynamic and scalable access control, such as authenticator-based data integrity verification techniques [23,132].
• Enable effective and efficient access to and use of data and information in unstructured formats [70].
• Allow access to documents/records in accordance with related policies, standards, and legal requirements [70].
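Two of the access functions, defining access controls and documenting which user accesses which data, can be illustrated with a minimal role-based sketch in Python. The classes `Dataset` and `AccessController`, the role names, and the log format are hypothetical; the cited works may rely on quite different mechanisms.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    classified: bool = False
    allowed_roles: frozenset = frozenset()  # roles that may read classified data


@dataclass
class AccessController:
    log: list = field(default_factory=list)

    def request(self, user, role, dataset):
        """Decide on one access request and document the decision in the log."""
        # Open data is readable by anyone; classified data needs a listed role.
        granted = (not dataset.classified) or role in dataset.allowed_roles
        self.log.append((user, dataset.name, "granted" if granted else "denied"))
        return granted
```

Keeping every decision in `log`, including denials, gives a simple record of who tried to access which dataset, which is one way to make access limitations visible and auditable.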
Use, re(use) and feedback
In this phase, we combine two key concepts: use and re(use), and feedback. The 'use & re(use)' concept is about the use and re(use) of data by data consumers [118,142,161,170] and focuses on the discovery of new and valuable information from existing public datasets by different stakeholders [58]. Under the second concept, feedback, data users exploit open government data and provide their feedback [49,98]; such feedback takes the form of user reactions, comments, and suggestions that usually identify improvements and corrections to the published data or metadata [49,52,98]. Moreover, re(use) is a process, not a single action, and it includes different activities such as acquiring datasets from various public or private data sources to compare with recently collected data, returning to one's own data for later comparisons, surveying available datasets as background research for a new project, or conducting a reanalysis of one or more datasets to address new research questions [171]. Examples of data consumers include citizens, individuals [67,118], businesses, researchers, and employees of other government agencies [168]. In the use, re(use), and feedback phase, the PA is not the main actor but a client, since the PA can still use and re(use) the public data. App developers create new and valuable information by pulling non-classified government datasets together and mashing them up with other private data to build high-value apps [28,142]. Governments are also working to open data without personal attributes so that businesses and the community can use and re(use) such data for innovation, accomplish their day-to-day tasks, and gain commercial benefits from it [58,170]. There is a variety of open datasets that are used for several objectives by various users. Data publishers usually ensure that their data, particularly private data, is accessible for designated data use and re(use) [90,147].
Data users and reusers have different motivations, such as community welfare, business growth, and earning money [11,28,172]. The European Commission advised the European Member States to formulate a holistic big data strategy, including publishing open data and promoting the use and re(use) of such data. Moreover, the Commission offers special proposals to them to achieve better data use and re(use) within a State and across borders [66,142]. Other government entities may use and re(use) GBD as a tool to improve and optimize the internal processes of the public administration and to support evidence-based decision-making that improves their public services [58,66]. The output of this phase is a set of manipulated data values [48,147]. Data feedback is a way to obtain a consensus among stakeholders, including the community. Data providers examine the user feedback about the data and republish the modified data after incorporating that feedback [98,173]. PAs can gather a vast amount of stakeholder viewpoints, as evidence-based information, on public data [58,120].
Use, re(use), and feedback phase key functions
The use, re(use), and feedback phase of DaLiF includes the following key functions:
• Provide data to data consumers to use, re(use), and offer feedback on, along with an appropriate mechanism that enables individuals to manage and control their digital record of information [67,142,171].
• Ask citizens and businesses, i.e., the owners of private data, for permission to use and re(use) their data consistently with the objectives of information collection [66,118].
• Reach out to all stakeholders so that everyone has an equal chance to provide feedback [49,66,153].
• Allocate enough time to stakeholders to provide their feedback, and actively listen to them [58,98].
• Implement a data usage policy and the relevant national and international regulations on data use and re(use), and raise awareness of this policy among data consumers to avoid the misuse of individuals' data [58,66,67].
• Adopt consistent and uniform approaches and shared (interoperability) platforms to support the safe, transparent, and controlled use and re(use) of data across public organizations. These approaches and platforms also help to discover what data is available and facilitate its use and re(use), preventing duplication of effort across public organizations [58,142].
• Interact in a more civilized and less bureaucratic manner with stakeholders to obtain fruitful and sufficient feedback from them [52,66,153].
• Implement base registries, i.e., single authoritative sources of data, to enable data use and re(use) and reduce the need for citizens and businesses to give the same information to public organizations again and again [58,118,152].
• Implement a plan for the selection of modern tools and technologies, including API-based technologies, to promote data use and re(use) with data harmonization and consistency [58,66,161].
• Develop IT systems, connectivity infrastructures, and platforms to move towards a country that functions as a unit and to increase the use and re(use) of GBD for decision making [58,66,119].
• Ensure the use of technological solutions and social media so that data providers, such as PAs, can create informal and efficient ways of communicating with data users, including citizens [49,52,98,120].
• Establish collaboration with citizens so that they can express their interest and offer feedback about data published by the government [98,133].
• Facilitate easy and inexpensive reuse of data across organisations, preventing, wherever possible, redundant and inconsistent data [70].
Share/publish
In this phase, we combine the publishing and sharing concepts of traditional peer-reviewed publication with the distribution of data and information through (government) web portals, social media, data catalogs, eGovernment information systems, and other venues [55,61,128]. Data and its resources are collected, prepared, and analysed for sharing and publishing to benefit the stakeholders. Examples of such stakeholders include governments, businesses, citizens, researchers, scientific partners, and federal agencies [9,58,61]. The data provider shares data with the above-mentioned stakeholders in accordance with defined ethical and legal specifications [58,67,128]. In the government sector, organizations hold data related to tax revenue, health, education, economics, transport, etc. Government organizations share data with other government entities. Data sharing helps achieve greater efficiency in the use and re(use) of data by the government [35,142,152]. It is a key to transparency and economic growth [174]. The fundamental idea of linked data is to use the World Wide Web's global architecture to share structured data worldwide [26]. In this phase, the data publishing concept emphasizes what data can and should be made public and how data needs to be published with appropriate security measures and integrity [58,92]. PAs determine which data is to be issued to other government departments and which information is to be disseminated openly to the public [58]. However, PAs do not publish certain datasets because of particular data traits, such as data containing personal or sensitive information [175]. PAs intend to publish government data for all to promote transparency, accountability, and value creation, i.e., better governance, and to enhance the quality of life of citizens [67,79,175]. The data publishing phase is essential for the open government domain. The output of this phase is published non-classified data [92,93].
Share/publish phase functions The data share/publish phase of DaLiF includes the following key functions: • Implement a plan for the selection of modern tools and technologies, including API-based technologies, to promote data sharing/publishing with stakeholders safely and effectively [58,67,128]. • Identify non-classified public data to be shared or published [58,92]. • Sign data sharing agreements between governments and other stakeholders that state the legitimate basis and rationale for sharing public data [58,66,115]. • Take appropriate measures that enable individuals to control with whom their data is shared and how much of it the owner is willing to share [67,97,114,142]. • Data providers should maintain a balance between data availability and data redundancy when publishing data in various formats [79,93]. • Consider data sharing granularity and data transmission, in addition to authorization, when sharing private data. Sharing granularity refers to conformity to the sharing policy, and data transmission refers to the isolation of sensitive information from the original data, so that the shared data cannot be related to the data owners [118]. • Follow open data publishing guidelines and principles as mentioned in [176] and [177] to publish open data [93,175].
• PAs should balance the allocation of power among different groups of stakeholders (Government bodies, NGOs, Regulators, Data Brokers versus data subjects, entrepreneurs, archivists, data, data collectors) in driving the design, framing, and implementation of data sharing policies and practices [174]. • Implement web standards in data formats, like HTML, XML, RDF, and CSV, and web protocols, like HTTP, FTP, and SOAP, to publish data on the web [70,92,175].
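The identification of non-classified data and its publication in standard web formats can be illustrated with a minimal Python sketch. This is not part of DaLiF itself: the record fields, the `classified` flag, and the function names are assumptions made for the example, and only the Python standard library is used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records; the "classified" flag is an assumed field,
# not something prescribed by DaLiF.
RECORDS = [
    {"id": "1", "sector": "health", "value": "120", "classified": False},
    {"id": "2", "sector": "tax", "value": "340", "classified": True},
    {"id": "3", "sector": "transport", "value": "75", "classified": False},
]
FIELDS = ["id", "sector", "value"]


def select_publishable(records):
    """Keep only non-classified records and drop the internal flag."""
    return [
        {k: v for k, v in r.items() if k != "classified"}
        for r in records
        if not r["classified"]
    ]


def to_csv(records, fieldnames):
    """Serialise records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_xml(records):
    """Serialise records as a simple XML document."""
    root = ET.Element("dataset")
    for r in records:
        item = ET.SubElement(root, "record")
        for k, v in r.items():
            ET.SubElement(item, k).text = v
    return ET.tostring(root, encoding="unicode")


public = select_publishable(RECORDS)
csv_text = to_csv(public, FIELDS)
xml_text = to_xml(public)
```

In practice, the decision about what counts as classified would come from the PA's data strategy and legal framework rather than from a single boolean field.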
Archiving phase Archiving is a process that anchors a chunk of data within a system through cataloging, indexing, or a related action [49]. Archiving is for obsolete data that is kept for the record in case access is needed, at a low storage cost, whereas data storage is for active information that is available for day-to-day activities, at a high storage cost [61]. Data archiving is one of the prime phases of a big data lifecycle [10]. Effective data lifecycle management includes the intelligence not only to archive data but to archive it based on specific parameters or business rules. Examples of such parameters include the data's age or the date of its last use [51]. In a cloud computing environment, archiving is a technique to shift less frequently used data to another place in the cloud for an extended period [88,142]. Data archiving can also help storage administrators develop a tiered and automated storage strategy to archive static data in a warehouse. Through this strategy, data warehousing specialists can improve overall data warehouse performance [51]. Some researchers describe the data life cycle by the data access frequency [142]. As time goes on, the data access rate gradually declines, and ultimately such data moves to an archived state. Additionally, this phase requires three main operations: encryption techniques, long-distance storage, and a data retrieval mechanism. These operations permit the least used data to be shifted to separate storage devices for long-term storage. The archiving and storage devices are thus separated [61,88]. Some countries have special national archival legislation to archive government records/data for reference and future use by PAs.
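A rule-based archiving decision of the kind described above, driven by the data's age or the date of its last use, could be sketched as follows. The thresholds, the `DataItem` type, and the function names are illustrative assumptions, not values taken from the reviewed literature.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataItem:
    name: str
    created: date
    last_accessed: date


def should_archive(item, today, max_age_days=5 * 365, max_idle_days=365):
    """Archive when the item is older than max_age_days or has not been
    accessed for more than max_idle_days (both thresholds are assumed)."""
    too_old = (today - item.created).days > max_age_days
    idle = (today - item.last_accessed).days > max_idle_days
    return too_old or idle


def partition(items, today, **rules):
    """Split items into (active, archive) lists according to the rule."""
    active, archive = [], []
    for item in items:
        target = archive if should_archive(item, today, **rules) else active
        target.append(item)
    return active, archive


items = [
    DataItem("recent_survey", date(2023, 6, 1), date(2023, 12, 1)),
    DataItem("old_census", date(2015, 1, 1), date(2023, 12, 15)),
    DataItem("idle_report", date(2022, 1, 1), date(2022, 6, 1)),
]
active, archive = partition(items, today=date(2024, 1, 1))
```

Keeping the thresholds as parameters reflects the point that archiving rules are business rules and may differ between organizations.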
Archiving phase key functions The archiving phase of DaLiF includes the following essential functions: • Data, including personal data, should be archived with strict security measures to avoid data leakage [58,142]. • Implement a formally agreed plan to archive data to ensure data availability and data reuse [55]. • Use appropriate archival standards like the General International Standard Archival Description (ISAD(G)) for various purposes, including hierarchical data description [90,147]. • Implement a plan to select modernized and extensible data archiving tools [49,66].
• Adopt an appropriate archive method to ensure that such data is accessible to data scientists for data analytics as and when required [51]. • Data resources, data infrastructure, and data management should be planned to deliver continuity and archive data for as long as required [66]. • Use appropriate anonymization techniques like generalization and suppression to protect data privacy during this phase [95,96].
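The last function above names generalization and suppression. A minimal sketch of both techniques is given below, assuming a hypothetical citizen record: direct identifiers are suppressed (removed), while quasi-identifiers such as age and postcode are generalized into coarser values. The field names and bucket sizes are assumptions for illustration only.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range, e.g. 37 becomes '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_postcode(code, keep=2):
    """Keep the first characters of a postcode and mask the rest."""
    return code[:keep] + "*" * (len(code) - keep)


def anonymize(record, suppressed=("name", "tax_id")):
    """Suppress direct identifiers and generalize quasi-identifiers."""
    out = {k: v for k, v in record.items() if k not in suppressed}
    if "age" in out:
        out["age"] = generalize_age(out["age"])
    if "postcode" in out:
        out["postcode"] = generalize_postcode(out["postcode"])
    return out


# Hypothetical record used only to demonstrate the transformation.
record = {
    "name": "A. Rossi",
    "tax_id": "X1",
    "age": 37,
    "postcode": "00185",
    "service": "library",
}
anonymized = anonymize(record)
```

Such transformations alone do not guarantee anonymity; whether the result is sufficiently protected depends on the remaining attributes and on the applicable legal requirements.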

End of life phase:
In this phase, duplicated data, data that is no longer required, and useless data are removed from the system [50,58,88,111]. Data must be considered in terms of the end of its usefulness, or end of life [58,116]. In the cloud environment, the storage location of data is often moved to maximize resource usage. As data is moved, the copy at the original location is also destroyed [117,142]. Titles used for this phase include delete, terminate, destroy, and dispose. Data-driven Public Administrations make decisions regarding the end of life of data based on their data strategy [58]. This phase's output is a set of destroyed data values [48,88].
End of life phase key functions The end of life phase of DaLiF includes the following key functions: • Useless data, inactive data, and data that has reached the end of its lifespan may be destroyed as per rules/regulations [58,88,116]. • Implement a plan to adopt appropriate methods for the data end of life [49,66]. • Data centers, including government data centers, should offer suitable data end-of-life functions like disk replication and demagnetization to their clients to avoid leakage of sensitive public data [88]. • Ensure that unnecessary data is permanently removed and cannot be restored from the storage medium to avoid inadvertently disclosing sensitive information [118]. • Ensure that data in the cloud is removed, through appropriate means and in accordance with the owner's wishes, to guarantee that the information cannot be disclosed or recovered [119,142]. • Ensure wiping of unwanted data on partitions and hard disks [118,142].
Data quality phase The quality phase focuses on maintaining data quality during the whole data lifecycle, i.e., the data collection, data integration, data analysis, data publishing, and data sharing phases [42,62]. A primary data quality management principle is to manage data as a core organisational asset [70]. Data quality is one of the prime issues related to the value of data for the business [54,178]. DMBOK highlights the following dimensions of data quality: accuracy, completeness, consistency, currency, precision, privacy, reasonableness, referential integrity, timeliness, uniqueness, and validity [36,70,71]. Data-driven public administrations can offer better services and policies by improving data quality [43,66,152,179]. When quality requirements are well defined, it is more feasible to implement controls that measure whether data quality satisfies them. Examples of such quality requirements include margins of error and the requisite level of precision [17,60]. The quality of big data is essential for its consumption. Data should be precise, timely, and in accordance with reality [66,180]. Base registries (public sector master data) are needed for valuable and highly reusable data [44,58]. The United Nations has also described a set of actions for computer scientists to ensure quality during data input and in output results, limiting risks related to factors like complexity, speed, accuracy, validity, and clarity [62].
Quality phase key functions The data quality phase of DaLiF includes the following essential functions: • Certify that public data, information, and metadata are of high quality by engaging data quality and metadata experts [60,66,70]. • Establish quality criteria and quality processes that consider generation, storage, and processing [62,66]. • Implement data quality management policies, international standards, procedures, and guidelines to cross-check the data quality level, discard low-quality data, improve data quality, etc. Such implementation ensures high quality, consistency, and integrity of public data, and helps to handle the 'Veracity' challenge [17,42-44,54,66,152]. • Monitor the data quality flows and, in case of failures, proceed as per the data quality management policies [42-44,54]. • Apply conformance checks against data quality business rules, like attribute domain constraints, format constraints, and standardisation constraints, at each phase of the data lifecycle to avoid low-quality data, such as missing attribute values and schema and data format differences [43,181]. • Create and promote data quality awareness within the organisation [70]. • Make data quality attributes like accuracy, integrity, completeness, and timeliness explicit to help policymakers determine whether data is reliable for the decision-making process [13,180,181]. • As business process owners, PAs should agree to and abide by the data quality SLAs [70,73].
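The conformance checks listed above (attribute domain constraints, format constraints, and detection of missing values) can be sketched in a few lines of Python. The rule set, the field names, and the allowed values are hypothetical; a real deployment would derive them from the organisation's data quality business rules.

```python
import re

# Assumed business rules, one predicate per attribute.
RULES = {
    # Attribute domain constraint: value must come from a fixed set.
    "region": lambda v: v in {"north", "centre", "south"},
    # Format constraint: ISO-like date string YYYY-MM-DD.
    "report_date": lambda v: isinstance(v, str)
    and re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
    # Range constraint: non-negative integer.
    "population": lambda v: isinstance(v, int) and v >= 0,
}


def check_record(record, rules=RULES):
    """Return a list of (field, problem) pairs; empty means conformant."""
    violations = []
    for field, rule in rules.items():
        if field not in record or record[field] is None:
            violations.append((field, "missing"))
        elif not rule(record[field]):
            violations.append((field, "invalid"))
    return violations


good = {"region": "north", "report_date": "2023-05-01", "population": 10}
bad = {"region": "east", "report_date": "01/05/2023"}

good_result = check_record(good)
bad_result = check_record(bad)
```

Running the same check at each lifecycle phase, and logging the returned violations, is one simple way to make quality attributes explicit for decision makers.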
Protection phase This phase focuses on data protection in terms of data integrity, security, access control, and privacy [17,61,182]. The phase is considered throughout the data lifecycle, i.e., from the planning and collection phases to the archive and destruction phases, to maintain data security and privacy protection against any accidental or malicious compromises to the GBDE [58,69,97,144,183]. Due to the quantity, variety, and sensitivity of big data and its management through heterogeneous technological solutions, data security and privacy protection become crucial. A holistic methodological approach based on data protection standards and common practices is required to deal with these issues [69,118]. Adequate data security and privacy protection management establishes governance mechanisms that are easy enough to abide by on a daily operational basis [70]. Classified data must be secured and protected from unauthorized access through various data masking techniques [17,51]. There is an imbalance between privacy and the risk of malicious data exploitation [23,58,61,183]. The processing of personal data in Europe is subject to the General Data Protection Regulation (GDPR) and the Data Protection Act of 2018. Such legislation ensures the privacy of citizens and the secrecy of data and information given by businesses [58,66,118]. This phase's output is secured and protected data [61,144]. Protection phase key functions The data protection phase of DaLiF includes the following essential functions.
• Government organizations should process data in a way that certifies the protection of personal data against unauthorized or unlawful data handling [58,66,96,118].
• Implement privacy standards introduced by ITU, CSA, ISO, etc., together with privacy policies, techniques, and security solutions, to protect data, including personal data, from threats. Such solutions and methods will be based on various security patterns like encryption, authentication, anonymization, and role-based access control [23,69,70,118,119,132]. • Use unique identifiers to manage users' digital identities, their relationship to a real-world identity, and their access to systems, data, and information, as these are essential for data protection [58,118]. • Data and information must be protected as prescribed by both regional (like EU) and national (like Italy) legal codes and data protection policies with suitable levels of data protection, security, confidentiality, privacy, integrity, and availability [183]. • PAs should also allocate sufficient funding, create awareness amongst the people, provide requisite training for staff, and engage technical experts to protect GBD [96,118]. • Minimize the risk of privacy violation during data collection/generation by appropriate means like restricting access or falsifying data [95,96]. • Ensure privacy protection in the cloud environment by strictly separating sensitive data from non-sensitive data [118]. • PAs should adopt security and data protection processes to identify and protect citizen and business data; for example, privacy-by-default and privacy-by-design should be adopted [58,66]. • Ensure a double encryption data system using appropriate encryption algorithms, like AES and RSA, to avoid data mining-based security attacks [23,132]. • PAs should form arrangements to identify and apply security requirements applicable to the receipt, processing, physical storage, and output of data and classified messages [70]. • Execute effective data security policies and procedures to assure that the right people can use and update data in the right way [70,73].
• Collaborate with stakeholders (e.g., IT security administrators, data stewards, internal and external audit teams, and legal experts) to define data security requirements and data protection policy [70]. • Adopt data protection tactics at the data consumer, system, and data provider levels to ensure data protection from unauthorized entities, systems, and untrusted data providers, respectively. Examples of such tactics include personal data stores, software/hardware-based virtualization, and data encryption [182]. • PAs should promote the concept of decentralization and private-by-design IoT through Blockchain technology in IoT-based Information Management Systems to ensure data security and privacy protection [184].
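Among the security patterns named in the functions above, role-based access control is the simplest to illustrate. The sketch below assumes three hypothetical roles and a policy in which non-classified data is readable by anyone while every other action requires an explicit permission; neither the roles nor the policy are prescribed by DaLiF.

```python
# Assumed role-to-permission mapping for illustration only.
ROLE_PERMISSIONS = {
    "data_steward": {"read", "update"},
    "analyst": {"read"},
    "public": set(),
}


def is_allowed(role, action, classified):
    """Decide whether a role may perform an action on a data set.

    Non-classified data is open for reading by any role; all other
    requests need the action to be granted to the role explicitly.
    Unknown roles receive no permissions.
    """
    if not classified and action == "read":
        return True
    return action in ROLE_PERMISSIONS.get(role, set())


decisions = {
    "analyst reads classified": is_allowed("analyst", "read", True),
    "analyst updates classified": is_allowed("analyst", "update", True),
    "public reads classified": is_allowed("public", "read", True),
    "public reads open data": is_allowed("public", "read", False),
}
```

This covers the requirement that the right people can use and update data in the right way only at the authorization level; authentication, encryption, and auditing would be handled by separate mechanisms.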
Governance phase The data governance phase refers to a plan to guarantee that high-quality and protected data exists and is exploited throughout the complete data lifecycle [58,185]. It determines the policies and procedures that safeguard preemptive and effective management of data assets [70]. Data governance interacts with and influences each of the surrounding phases and guides how activities in other phases are performed [70-72]. Data governance helps PAs manage data in public sector organizations, as it also implies the allocation of decision-making rights and associated functions in such management [68,186]. The data governance phase helps PAs protect public sector organizations' data assets to assure generally understandable, accurate, complete, reliable, protected, anonymous, and discoverable government big data. It also assists in systemizing these organizations by linking business processes with data in the GBDE [185,187]. The data governance phase includes consistent management and helps public administrations set data rules/policies, provide insights, wisdom, and judgment, and promote accountability [61,152]. Estimating the quality of data is recognized as crucial for data governance. Data governance is one of the central pillars of the data-driven government. Through excellent data governance, public administrations can guarantee that their data is precise, reliable, comprehensive, available, and secure [58,66,186].
Governance phase key functions The data governance phase of DaLiF includes the following essential functions.
• Utilize standards, guidelines, tools, policies, laws, procedures, roles, and responsibilities for public data governance to ensure data utility by the data consumers [58,68]. • Establish a formal system of accountability for effective data governance [58,152,187]. • Apply machine learning and AI algorithms to improve data governance [66,187].
• Create a collaborative environment among the stakeholders, including users, so that public administrations receive proposals from them on improving the data life cycle, particularly in case of agility in work scope [62,186]. • Focus on data quality, data security, and privacy protection to tackle data governance-related issues in the cloud computing environment for better visibility, data quality, and protection control [186,187]. • Promote the use of machine learning and AI to reframe data governance to address related business requirements in a way that motivates data producers and consumers to work together [68,185]. • Constitute a Governance Board or Committee in the organization to oversee and drive data governance across the public services [58,66,186]. • PAs must take an organizational perspective to ensure the quality, security protection, and effective use of government data [70,72].
As an outcome of step 5, "results", of the research review protocol, we presented our comprehensive research results and proposed DaLiF in the preceding sub-sections of this section.

Research implication
Given the above-mentioned outcomes, we offer the following research implications for scholars and practitioners:

Benefits to the research community
• This research work provides a holistic view of 76 data lifecycles and their phases to the research community.
• We proposed DaLiF based on various literature data lifecycles to help the research community in studies related to data management in GBDE.

Benefit(s) to entrepreneurs
• This study may offer insights to entrepreneurs to gauge new business ideas and innovation in developing government big data management solutions and services.

Benefits to the practitioners
• DaLiF could help practitioners to plan and handle the complexity of data management, identify essential activities and maintain appropriate data quality throughout the data lifecycle. • The detailed overview of DaLiF, including our proposed functions for each phase of the lifecycle, as an information tool, could be workable for the public sector organizations to develop or modify their strategic measures to manage GBD efficiently.

Benefits for the Public Administrations-PAs
• DaLiF could support PAs to gather, classify, refine and analyze data to extract knowledge, find new insights, generate value for the public sector organizations and promote data-driven decision making.

Conclusion and future work
In this work, we propose DaLiF as a big data lifecycle model. The key characteristics of DaLiF include: • The DaLiF is based on the analysis of 76 data lifecycle models presented during the last 25 years. • It is not mandatory to follow all the proposed phases as the optional ones can be selected based on the specific needs. • The model remains relevant and applicable in various government fields such as health, agriculture, education, manufacturing while it covers different types of data, including open government data, business data, scientific and research data, citizen (i.e., personal) data, etc.

Limitation and future work
We summarise here two limitations of our work. Due to limited existing research, we could not find a good number of research articles explicitly on the data lifecycle frameworks for the data-driven government. Therefore, we borrowed concepts from existing literature in "neighboring" areas, i.e., scientific research, Semantic web & web contents management, open government, cloud computing, IoT, etc. Moreover, we examined IEEE, ACM, ScienceDirect, and Springer digital research libraries as we considered them more relevant to our research; nevertheless, other digital libraries may also contain some relevant research articles.
We intend to continue our work in this area. First, this research article primarily contributes to the theoretical realm and needs practical validation to bridge the gap between academic rigor and industrial applicability. Secondly, in our previously published study on GBDE [81], we presented a classification model for data actors. We are interested in establishing a linkage between data actors and the DaLiF phases proposed here. Lastly, we intend to conduct a detailed survey of technological tools for each phase of DaLiF.