A canonical model for seasonal climate prediction using Big Data

This article addresses the elaboration of a canonical model, involving methods, techniques, metrics, tools, and Big Data, applied to the domain of seasonal climate prediction, aiming at greater dynamics, speed, conciseness, and scalability. The proposed model was hosted in an environment capable of integrating different types of meteorological data and centralizing data stores. The seasonal climate prediction method called M-PRECLIS was designed and developed for practical application. The usability and efficiency of the proposed model were tested through a case study that made use of operational data generated by an atmospheric numerical climate model found in the supercomputing environment of the Center for Weather Forecasting and Climate Studies linked to the Brazilian Institute for Space Research. The seasonal climate prediction relies on the ensemble members method, and the main Big Data technologies used for data processing were the Python language, Apache Hadoop, Apache Hive, and the Optimized Row Columnar (ORC) file format. The main contributions of this research are the canonical model, its modules and internal components, the proposed method M-PRECLIS, and its use in a case study. After applying the model to a practical, real experiment, it was possible to analyze the results obtained and verify the consistency of the model through the output images, the code complexity, and the performance, and also to compare it with related works. Thus, it was found that the proposed canonical model, based on Big Data best practices, is a viable alternative that can guide new paths to be followed.


The Big Data challenge
With the advent of the Internet of Things (IoT), access to Information Technology (IT) services increased, along with the agile development of tools and applications and rapid technological and market changes. At the same time, the need arose for more dynamic and flexible solutions for computing environments, aiming at quality, reliability, safety, security, economy, intelligence, and management simplicity.
Fig. 1 The timeline of Big Data definitions [1]
Some of the barriers encountered have been the difficulty of monitoring and managing various computerized segments, storage systems, network equipment, and services, as well as the difficulty of handling various tools configured in a decentralized and disintegrated way.

The seasonal climate prediction
Weather is an ephemeral state of the atmosphere, whereas climate comprises a more comprehensive and stable period. Therefore, by studying the averages of certain weather variables at given locations, the difference between weather and climate is found on the time scale [7].
Although it is not yet possible to produce accurate weather forecasts beyond 1 week, it is already possible to predict probable future conditions based on averaging over the long term, usually a period between 1 month and 1 year. This probabilistic process is known as seasonal forecasting, generated by sets of predictions of climate models [8].
In an atmospheric numerical model, the atmosphere is projected onto a grid-like plane with several vertical levels. Its mathematical equations make it possible to compute the value at each grid point, as well as the interactions between layers, lateral quadrants, and the surface [9][10][11][12][13].
In the first stage of model execution, the use of meteorological data assimilation techniques is extremely important for correcting inaccuracies in the data that make up the initial and boundary conditions of weather and climate forecast models. The work of Huang et al. is an example of applied research in this area [14]. During this initial stage, data observed from equipment are interpolated with variables from previous forecasts, producing the so-called "first guess". This first stage also includes the computation of boundary conditions, such as sea surface temperature, sea ice, soil moisture, snow cover, greenhouse gases, and aerosol concentration [9][10][11][12][13].
Fig. 2 The five dimensions of Big Data [6]
One method used to sample the uncertainties associated with forecasts is known as forecasting by Ensemble members (sets). This method consists of a set of executions of a given model with small variations in the initial conditions of each execution, and it allows probabilities to be determined for certain scenarios. According to Lorenz, this methodology is based on the variability by which the mathematical equations can produce small atmospheric variations, assuming that the model is perfect and only the analysis is disturbed [15][16][17].
A set of executions of an atmospheric numerical model is used in the process of seasonal climate prediction and in the forecasting of climatic anomalies. The first stage of this process consists of obtaining the model's climatology, that is, the monthly average of the model's forecasts for the processed period, with the model executed over a long data period of 30 years or more. Then, the model is run on current input data to obtain the seasonal climate prediction for a future monthly or even annual period. Finally, the forecast of climatic anomalies is obtained as the difference between the model climatology and the seasonal climate prediction, which results in a probabilistic forecast of the variables and locations likely to experience climate variations compared to a historic period [9]. Figure 3 illustrates the climate forecasting process using the Ensemble members method, where the atmospheric numerical model is executed "n" times with the necessary disturbances in its analysis file to obtain the model climatology and the seasonal climate prediction, also presenting the prediction of anomalies for a given period and location.
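For illustration purposes, the anomaly calculation described above can be sketched in a few lines of Python. This is a minimal, hypothetical example for a single grid point and variable, not the operational implementation: the anomaly is the difference between the ensemble-mean seasonal prediction and the model climatology.

```python
# Minimal sketch (hypothetical values): the anomaly at a grid point is the
# ensemble-mean seasonal prediction minus the model climatology for the
# same month and location.

def ensemble_mean(members):
    """Average a variable over the ensemble members."""
    return sum(members) / len(members)

def anomaly(forecast_members, climatology_mean):
    """Forecast anomaly = ensemble-mean prediction - model climatology."""
    return ensemble_mean(forecast_members) - climatology_mean

# Hypothetical precipitation forecasts (mm) from 5 ensemble members for one
# grid point, compared against a 30-year climatological mean of 120 mm.
members = [150.0, 140.0, 160.0, 130.0, 145.0]
print(anomaly(members, 120.0))  # 25.0 -> wetter than the historic mean
```

A positive anomaly indicates a wetter-than-normal signal for that location; a negative one, a drier-than-normal signal.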

The Apache Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, and distributed computing. It is a framework that allows the distributed processing and storage of data across a cluster of computers. Its architecture is based on Big Data concepts, where a main node controls the other nodes, which offer processing and local storage [18,19].
The distributed data processing is performed by the Hadoop Yet Another Resource Negotiator (YARN), responsible for resource management and task scheduling in the cluster. Its main components are: ResourceManager, NodeManagers, ApplicationMaster, Containers, and JobTracker [18,19].
The ResourceManager runs on the main node and is responsible for resource management and application scheduling. The NodeManager component runs on all nodes and is responsible for the resources of the nodes and containers, for example, processors, memory, disk space, network, and logs [18,19].
The ApplicationMaster is started only on slave nodes, with one instance per application, responsible for managing the requested Containers and processing the MapReduce application. Finally, the JobTracker has a single instance on the main node and controls the progress of all applications [18,19]. Figure 4 illustrates the architecture used by Hadoop to perform distributed data processing: the client requests resources to run an application from the ResourceManager, which in turn requests from the NodeManagers a sufficient number of nodes to create instances of the ApplicationMaster and Containers, which communicate with the ResourceManager after starting [18,19].
The Hadoop Distributed File System (HDFS) is the Hadoop distributed data storage system, and its operation is based on a client/server architecture. The HDFS has three components: the NameNode and the SecondaryNameNode, which make up the server side; and the DataNode, present on all nodes in the cluster [18,19]. The NameNode is responsible for storing the metadata, relating the blocks of a file to the corresponding DataNodes. The SecondaryNameNode performs housekeeping for the NameNode, periodically checkpointing its metadata to support recovery. The DataNode represents the place where the blocks of a file are actually recorded [18,19].
When writing a file to the HDFS, Hadoop partitions it into one or more blocks and distributes them among the DataNodes, as shown in Fig. 5. These blocks can be replicated on different DataNodes to provide fault tolerance [18,19]. Figure 6 shows the MapReduce data flow, a model proposed by Google that introduced a new programming paradigm for working with Big Data. This model allows Big Data to be manipulated in a parallel and distributed way, in addition to providing fault tolerance, Input/Output (I/O) scaling, and monitoring [18][19][20][21][22][23].
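The block partitioning and replication just described can be illustrated with a small Python sketch. Block size, node names, and the round-robin placement policy below are simplifying assumptions for illustration only; the real HDFS placement policy is rack-aware.

```python
# Sketch of HDFS-style block partitioning: a file is cut into fixed-size
# blocks and each block is placed on `replication` distinct DataNodes.
# Block size and node names are hypothetical.

def split_into_blocks(data: bytes, block_size: int):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication=3):
    """Round-robin placement of each block on `replication` distinct nodes."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

data = b"x" * 1000                     # a 1000-byte "file"
blocks = split_into_blocks(data, 256)  # 256-byte blocks -> 4 blocks
print(len(blocks))                                            # 4
print(place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])[0])  # ['dn1', 'dn2', 'dn3']
```

With three replicas per block, the loss of any single DataNode leaves at least two copies of every block available, which is the fault-tolerance property mentioned above.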
Map and Reduce are the two main operations of the MapReduce process, in which at least six steps can be highlighted: (1) Input: the text is stored in blocks in the HDFS; (2) Splitting: each block is divided into smaller parts, for example, text broken down into lines; (3) Mapping: each part (line) is mapped to the key/value format; (4) Sort/Shuffle: the Sort and Shuffle operations sort and group the data according to the "key"; (5) Reducing: the values contained in each grouping are computed; and, finally, (6) Output: the result is written to the HDFS. Figure 7 shows the MapReduce flow applied to counting words in a text [18][19][20][21][22][23].
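The word-count flow of Fig. 7 can be sketched locally in Python, with one function per phase. This is a single-process illustration of the paradigm, not distributed execution on Hadoop.

```python
from collections import defaultdict

# Local sketch of the six MapReduce word-count steps; on Hadoop each phase
# runs distributed over the cluster.

def mapping(line):                       # (3) Mapping: emit (word, 1) pairs
    return [(word, 1) for word in line.split()]

def shuffle(pairs):                      # (4) Sort/Shuffle: group by key
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reducing(groups):                    # (5) Reducing: sum values per key
    return {key: sum(values) for key, values in groups.items()}

text = ["big data big", "data hadoop"]   # (1)/(2) Input already split into lines
pairs = [pair for line in text for pair in mapping(line)]
print(reducing(shuffle(pairs)))          # (6) {'big': 2, 'data': 2, 'hadoop': 1}
```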

The Apache Hive
The Apache Hadoop has several benefits and technological innovations, but it also brings some difficulties regarding its use. One of them is the fact that developers are often not familiar with the Map and Reduce technique, which requires considerable time to understand, and there are not many qualified professionals in the job market.
Another point is the complexity of performing operations and extracting information from files stored in the HDFS: even simple queries require the execution of a considerable sequence of instructions. The Apache Hive is therefore used to abstract the difficulties of using the Apache Hadoop [18-21, 24, 25].
The Apache Hive is an open-source, collaborative project managed by the Apache Software Foundation. It started as a subproject of the Apache Hadoop but, due to its importance, now has its own scope. It is a Data Warehouse software that facilitates the management, manipulation, and extraction of information from large data sets [24,25].
The Apache Hive and Apache Hadoop can be coupled to work together and provide users with distributed processing managed by the YARN, by using the MapReduce method and the HDFS distributed storage, without sacrificing ease and usability.
To abstract this complexity, the Apache Hive offers a query language named Hive Query Language (HiveQL), simple and similar to the Structured Query Language (SQL), whose statements are converted into MapReduce jobs and run on the Apache Hadoop cluster [24,25].
Although Apache Hive is similar to Database Management Systems (DBMSs) such as MySQL, PostgreSQL, and Oracle, it shows high latency even when working with small data sets; this interval, however, becomes relatively small when large data sets are involved [24,25]. Figure 8 illustrates the Apache Hive architecture working in conjunction with Apache Hadoop. The main components of the Apache Hive are: (1) User Interface (UI): the interface through which users send requests to the system; (2) Driver: the component that receives the queries, implements the notion of session identifiers, and provides execute and fetch APIs modeled on Java Database Connectivity (JDBC)/Open Database Connectivity (ODBC) interfaces; (3) Compiler: the component that parses the HiveQL statement, obtains the necessary metadata from the Metastore and, eventually, generates an execution plan; (4) Metastore: the component that stores the metadata, that is, all the structural information of the various tables and partitions in the Warehouse; and (5) Execution Engine: the component that executes the execution plan created by the Compiler [24,25].
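To give a concrete sense of how close HiveQL is to standard SQL, the snippet below runs a HiveQL-style aggregation on SQLite purely for illustration. Table and column names (`climate`, `lon`, `lat`, `fmonth`, `prec`) are hypothetical; on Hive, the same kind of statement would be compiled into MapReduce jobs over data in the HDFS.

```python
import sqlite3

# Hypothetical HiveQL-style aggregation: monthly-average precipitation per
# grid point for a target trimester. Because HiveQL is close to SQL, the
# statement also runs on SQLite, used here only to illustrate the syntax.
query = """
SELECT lon, lat, AVG(prec) AS mean_prec
FROM climate
WHERE fmonth IN (6, 7, 8)
GROUP BY lon, lat
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE climate (lon REAL, lat REAL, fmonth INT, prec REAL)")
con.executemany("INSERT INTO climate VALUES (?, ?, ?, ?)",
                [(0.0, 0.0, 6, 10.0), (0.0, 0.0, 7, 20.0), (0.0, 0.0, 1, 99.0)])
print(con.execute(query).fetchall())  # [(0.0, 0.0, 15.0)]
```

On a Hive cluster, such a statement could be submitted non-interactively, for example with `hive -S -e "<query>"`.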

The Optimized Row Columnar (ORC)
The ORC file format implements optimized columnar data storage and brings some benefits to the Apache Hive and the Apache Hadoop. It uses data compression algorithms, reduces the storage space consumed, and brings performance advantages: its columnar structure and division into data blocks allow a HiveQL query to access only the data blocks it needs, in addition to parallelizing data reading across multiple nodes and increasing speed during query processing [25].
The ORC divides the data into blocks called stripes. These blocks are linear and, by default, have a size of 250 MB, considered good practice for obtaining efficiency in the HDFS. Each stripe contains three fields: (1) index: indexes the data of each block; (2) data: stores each block of data; and (3) information: provides quick access to some statistics on the data of each column, such as lowest value, highest value, average, and sum. These three elements are structured in columns, each representing a field in the database, which allows query engines to read only the columns used in a query, optimizing access to the HDFS [25]. Figure 9 illustrates the architecture of the ORC and its various linear blocks, in addition to the internal structure of each block and the storage of the index, data, and information in column format.
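The stripe-skipping benefit described above can be sketched in Python. The statistics kept per stripe (min, max, sum) follow the description in the text; the stripe contents and the predicate are hypothetical.

```python
# Sketch of the per-column statistics an ORC stripe index keeps (lowest
# value, highest value, sum): with these, a query engine can skip whole
# stripes whose min/max range cannot satisfy a filter predicate.

def stripe_stats(column_values):
    return {"min": min(column_values),
            "max": max(column_values),
            "sum": sum(column_values)}

stripes = [[1.0, 3.0, 2.0], [10.0, 12.0, 11.0]]  # two hypothetical stripes
index = [stripe_stats(s) for s in stripes]

# A filter like "value > 5" only needs to read the second stripe:
needed = [i for i, st in enumerate(index) if st["max"] > 5]
print(needed)  # [1]
```

This is why, as the text notes, a HiveQL query over ORC data touches only the blocks (and columns) it actually needs.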

Related work
This section presents the main existing works related to the scope of this research. It provides an overview of systems, architectures, tools used, types of data used, and types of information generated, as well as their main limitations.
The existing systems were classified into three groups: (1) The use of the observed data; (2) The visualization of the weather/climate forecast; and (3) The storage and performance in the use of meteorological data.
Fathi et al. presented a systematic literature review of Big Data analytics in weather forecasting. Initially, 185 articles published between 2014 and August 2020 were identified; then, 35 articles were chosen. Hadoop, Hive, Spark, Kafka, and Python were among the main technologies used to work with Big Data. Within this literature review, the main terms related to weather/climate forecasting were: temperature with 29.7%, wind with 15.4%, humidity with 12.1%, precipitation with 11%, and pressure with 8.8% [26].

The use of the observed data
Hart et al. implemented a system called the Regional Climate Model Evaluation System (RCMES), which provides dynamic assessment of regional climate models, comparing the seasonal climate forecasts of different models with observed data. Figure 10 illustrates the RCMES architecture, composed of two main modules: (1) the Regional Climate Model Evaluation Database (RCMED), which performs data ingestion and processing; and (2) the Regional Climate Model Evaluation Toolkit (RCMET), which integrates several tools for analyzing and visualizing information [27].
Shao et al. proposed a data mining environment in a meteorological data warehouse on a Hadoop cluster platform, shown in detail in Fig. 11. The structure of the meteorological data warehouse is composed of a Hadoop cluster, a metadata storage system, a web server, and a log server. This heterogeneous database is connected to other tools for data analysis, data mining, and data prediction [28].
Almgren et al. enabled climate calculations on data observed over the last 63 years [29]. Waga and Rabah implemented a system based on Big Data and observed data to support agriculture [30]. Based on historical weather data, Wang et al. proposed the creation of a tool to support decision making in the face of major climatic variations and natural disasters related to the energy system [31].
Chen et al. implemented the BigSmog system, which allowed the analysis of catastrophic conditions of severe pollution, using Big Data and historical records from data collection stations [32]. Yerva et al. relied on large volumes of observed meteorological data and social data to generate a concept about a target destination, in an application called Mood Space [33].
Mao and Zhu developed a Big Data application to support agriculture and microcredit systems, based on calculations over a large volume of historical weather data, as well as a benchmark between a common computer and a Hadoop cluster [34].
Manogaran and Lopez stored a large volume of observed weather data in the HDFS and used a MapReduce algorithm to make seasonal weather forecasts. Afterwards, they used a climate change detection algorithm based on spatial autocorrelation to monitor climate change [35].

The visualization of the weather/climate forecast
Han and Yan created a cloud-based solution that allowed the use of mobile devices to view weather forecasts for cities, quickly and accurately [36]. Rutledge et al. explored a web interface and Big Data concepts to access meteorological data [37]. Finally, Xuelin et al. worked with a massive meteorological database, but with a focus on data management, storage, reading, and writing. Figure 12 illustrates the structure of the Private Cloud Storage Platform that has five modules: the Data Acquisition module, the Results Show module, the Query and analysis module, the Memory module, and the Data Migration module. The relational database was used as an intermediary to obtain greater performance in data insertion [38].

Fig. 12 The Private Cloud Storage Platform [38]

The storage and performance in the use of meteorological data
Bauer et al. presented a platform for storing, processing, and managing corporate data in a private cloud, where various types of data can be stored, such as: sales, marketing, weather and climate conditions, news and social media, among others [39].
Xie et al. addressed the Big Data challenge of storing and backing up images in JPEG format, where traditional techniques did not provide efficient data deduplication, and proposed the use of the PXDedup technique. With this technique, it was possible to decompress the JPEG file, divide it into fragments, eliminate visually identical fragments, and recompress the remaining parts before storage, thereby considerably reducing the size of the stored file without noticeable loss of quality [40].
With a focus on performance in the use of meteorological data, Fang et al. implemented a MapReduce version of the K-means algorithm called MK-means, thus overcoming the bottleneck in working with large sets of meteorological data [41]. Xue et al. performed a benchmark on a volume of meteorological data containing small files and then grouped into larger blocks [42]. Li et al. proposed an image processing technology to generate the weather forecast based on nephogram recognition, using a neural network method based on the k-means algorithm and a Hadoop cluster for performance processing [43].
According to the literature review that surrounds the themes of Big Data, Apache Hadoop, and Weather and Climate Forecast, the related works have explored the use of these subjects, but with emphasis on observed data, visualization of weather and climate forecast, and also on storage and performance in the use of meteorological data.
While preserving certain similarities with these previous researches, yet differing from them, the proposed model was named "A canonical model for seasonal climate prediction using Big Data"; it performs seasonal climate predictions on the output data of the numerical climate forecast model through its methods, processes, techniques, and Big Data tools.

Describing and applying the canonical model architecture in a case study
Initially, this section describes the challenge of this research when carrying out an operational seasonal climate prediction at CPTEC/INPE. To face this challenge, the seasonal climate prediction was divided into two stages: the Execution of the Global-CPTEC Model (parallel) and the Post-processing (sequential).
The first stage, the Execution of the Global-CPTEC Model, is carried out in parallel, using resources from the Cray XE-6 supercomputer. The second stage, the Post-processing, is sequential and involves the data generated by the Global-CPTEC model. It was considered the focus of this research, mainly because it could be improved and optimized by using technologies associated with Big Data.
It is in this Post-processing stage that this research provides distributed data processing and storage, managed with simplicity and usability, using Big Data technologies such as Apache Hadoop, Hive, YARN, MapReduce, and the HDFS, among others. This section describes: (1) the research challenge; (2) the development of the proposed canonical conceptual model, its architecture in four modules, and its internal components; and (3) the development of the proposed method for predicting seasonal climate from Big Data (M4PSC-BD) with its three internal processes. At the end of this section, the application of some metrics used for processing seasonal climate predictions is also described.

Describing the research challenge
The CPTEC/INPE uses the Global-CPTEC model to generate climate forecast products for Brazil. This model has a T062 resolution, with a grid spacing of approximately 200 km, and employs the ensemble members prediction method (Ensemble Method).
As described in "The seasonal climate prediction" section, it is necessary to subtract the model's climatology from the seasonal climate prediction to obtain the forecast of climatic anomalies, which allows, for example, inferring whether a given month or trimester will be more or less rainy than normal.
The model's climatology refers to its ability to make predictions in certain regions or to indicate the climate behavior of these locations. It is obtained by running the model on a large set of historical data, which is why it is called the long execution.
The current climatology of the Global-CPTEC model was generated from 1979 to 2010. This process takes 8 months and is performed only once per model. Processing is similar to that of the seasonal climate prediction; however, it takes into account more variations of the initial conditions and three variations of the model: the Persistent Sea Surface Temperature (in Portuguese: Temperatura de Superfície do Mar, TSM) with Kuo convection; the Relaxed Arakawa-Schubert (RAS); and the Grell.
The Persistent TSM consists of a procedure used in global climate modeling, where the atmospheric model receives, as one of its input parameters, the values of TSM anomalies observed from the month prior to the beginning of the forecast date and these values are maintained (persisted) throughout the model execution.
The Predicted TSM (TSM Prevista) means that, instead of persisting the observed TSM anomalies, the TSM values for each forecast month are predicted by another auxiliary climate model. Therefore, in the first case, these input variables (TSM) are set at the beginning of the climate model execution and maintained throughout the processing; in the second case, an auxiliary model generates these variables for every day of the model execution.
The CPTEC/INPE uses the TSM forecasts of the Coupled Forecast System model version two (CFSv2) from the National Centers for Environmental Prediction (NCEP) as input to the global climate model. Its execution occurs in conjunction with the TSM conditions predicted for the period for which atmospheric conditions are to be forecast. Finally, the terms Kuo, RAS, and Grell convection refer to three distinct ways of representing the process of cloud formation by convection (upward vertical movements) [11,[44][45][46][47].
In the seasonal climate prediction, the following six versions of the model are executed: the persisted TSM with Kuo, RAS, and Grell convection; and the predicted TSM with Kuo, RAS, and Grell convection. The model is applied for a period of 10 months, the first 3 months being retroactive and necessary for it to stabilize. The 4th month is the current month; the next 3 months are considered for the seasonal climate prediction, also called the target trimester; and the remaining interval is used for scientific research. Therefore, assuming that the current month is May, the execution will run from February to November and the target trimester will be June, July, and August.
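The 10-month execution window can be sketched in Python to make the month arithmetic explicit. The function below is an illustration of the scheme described above (3 retroactive months, the current month, the 3-month target trimester, and the remaining research months), not operational code.

```python
# Sketch of the 10-month execution window: 3 retroactive months, the
# current month, the 3-month target trimester, and 3 further months kept
# for research. The example follows the text (current month = May).
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def execution_window(current_month_index):
    """Return (window, target trimester) for a 0-based current month index."""
    window = [MONTHS[(current_month_index - 3 + i) % 12] for i in range(10)]
    target = window[4:7]  # the 3 months right after the current month
    return window, target

window, target = execution_window(4)  # current month: May (index 4)
print(window[0], window[-1])          # Feb Nov
print(target)                         # ['Jun', 'Jul', 'Aug']
```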
Through the Ensemble forecasting technique (sets), the Global-CPTEC model is executed 15 times (members) with initial conditions from 15 different days. Therefore, for the climatology of the model, data are produced referring to: 3 variations of the model × 15 members × 90 days × 12 months × 30 years; and for the seasonal climate prediction, data are produced referring to: 6 variations of the model × 15 members × 90 days.
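Making the products above explicit helps convey the scale of the data involved; the arithmetic below simply evaluates the two expressions given in the text.

```python
# Counts of daily fields implied by the text: the model climatology covers
# 3 model variations x 15 members x 90 days x 12 months x 30 years, while
# one seasonal prediction covers 6 variations x 15 members x 90 days.
climatology_fields = 3 * 15 * 90 * 12 * 30
prediction_fields = 6 * 15 * 90
print(climatology_fields)  # 1458000
print(prediction_fields)   # 8100
```

The roughly 180:1 ratio between the two totals illustrates why the climatology is computed only once (the long execution) while the seasonal prediction is produced operationally.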
After the model execution, there is a post-processing stage to obtain the seasonal climate prediction. This stage consists of opening the binary files generated by the model and calculating, for example, the average or accumulation of variables over the 15 daily members, the months, and the target trimester. In this way, a single binary output file is generated with the result. The methodology used by the CPTEC/INPE makes use of FORTRAN and GrADS tools. The target trimester has approximately 8190 files and, in this case, the total volume is around 24,570 MB.
The generated files are in binary format, encoded, and have several dimensions and levels. Each file can be represented as an XY matrix over the analyzed territory, where some variables have a value only at ground level, while other variables have different values according to the height in the atmosphere (Z coordinate); all data vary with time.
The operational seasonal climate prediction at CPTEC/INPE is divided into two stages: the Execution of the Global-CPTEC model (parallel); and the Post-processing (sequential).
The first stage is performed on the Cray XE-6 supercomputer, which has a maximum processing capacity of 258 Teraflops as measured by the Linpack benchmark. Its hardware configuration consists of 1304 nodes with 2 processors of 12 cores each, totaling 31,296 cores, approximately 40.75 TB of RAM, 866 TB of primary storage, 3.84 PB of secondary storage, and 6.0 PB of tertiary storage.
In the second stage of seasonal climate prediction, the data generated by the Global-CPTEC model is Post-processed. This stage is sequential and can be optimized. The use of the Apache Hadoop in conjunction with the Apache Hive has provided: the distributed processing managed by YARN; the use of the MapReduce method; and the HDFS distributed storage, in addition to ease and usability.

Describing the development of the proposed canonical conceptual model
The challenges related to the five dimensions that define the term Big Data (Volume, Variety, Velocity, Value, and Veracity) motivated the conception and development of a canonical model that integrates some Big Data tools and minimizes the main difficulties related to handling large numbers of files and large data sets. Figure 13 illustrates the proposed conceptual model architecture with its four modules:

The Data Sources Module
The Data Sources Module was designed to represent several types of storage systems and the Data Generator Component. In this module, raw data can be extracted directly from the database or inserted into the Big Data environment through the Data Generator Component.

The Data Extraction Module
The Data Extraction Module was conceived to allow the insertion of data into the Big Data environment. This module consists of two components: (1) the Bridge2Hive-ORC Component; and (2) the Data Preparation Component. The Bridge2Hive-ORC Component has a server subcomponent, which receives HiveQL commands through messages and executes them directly in the database offered by the Apache Hive component, and a client subcomponent, which forwards the received messages to the server.
The Data Preparation Component connects to the raw database of the Data Sources module and performs copying, processing, and standardization of data on local disk. After this step, this component also connects to the Apache Hive and the Apache Hadoop components of the Mass Data Repository module to transfer data to the Big Data environment.

The Big Data Module
The Big Data Module was designed to centralize a large volume of structured and unstructured data and uses interconnected Big Data tools. It consists of four components: (1) the Apache Hadoop Component; (2) the Apache Hive Component; (3) the Apache Spark Component; and (4) the Apache HBase Component.
The Apache Hadoop Component ensures increased processing efficiency and scalability of all resources involved, which provides greater speed in data capture, discovery, and analysis. It has the HDFS and MapReduce subcomponents.
The Apache Hive Component promotes: greater efficiency in using the resources of the Apache Hadoop component; less complexity through the use of HiveQL queries; and greater dynamics and flexibility. Finally, the Apache HBase Component offers a distributed and scalable database, which supports the storage of structured data from large tables, allowing real-time and random access (read/write), as well as ease of integration with Hadoop.

The Applications Module
The Applications Module consists of a set of applications that make use of the Big Data Module. These applications have similar needs such as: extracting useful information from large data sets in a timely manner; subsequent MapReduce processing with writing to disk and memory; real-time and random access (read/write); among others.

Describing the development of the proposed method for predicting seasonal climate from Big Data (M4PSC-BD)
The canonical model proposed in this research was applied in the post-processing of the seasonal climate prediction of the supercomputing environment of the Center for Weather Forecasting and Climate Studies linked to the Brazilian Institute for Space Research.
In order to perform more appropriate processing based on Big Data sets, the Method for seasonal climate prediction was conceived and developed, named M-PRECLIS in Portuguese and Method for Predicting Seasonal Climate from Big Data (M4PSC-BD) in English. The BRIDGE to HIVE-ORC component (Bridge2Hive-ORC) was developed to simplify the data preparation process. The development of this component allowed data to be inserted directly into the Hive database in the Optimized Row Columnar (ORC) format. In this case, the Client and Server subcomponents were applied, performing a messaging service between an external application and the Big Data environment.
Thus, the previously described component was used directly in the numerical climate forecast model (coded in the FORTRAN language). After being stored in a matrix in RAM, the data could then be inserted directly into the Hive database, already in ORC format, using the DML INSERT command of the HiveQL language.
Although this component is more optimized, changing a complex application is not always trivial. For this reason, the proposed model offers these two forms of data extraction.

Developing the process P-EXTRACTOR
The P-EXTRACTOR process comprises seven steps: (1) Generate the list of files; (2) Transfer the files; (3) Convert the binary files to text; (4) Create the Apache Hive database table in text format; (5) Transfer the converted text data into the Hive database; (6) Create the Apache Hive database table using the ORC format; and (7) Convert the Apache Hive Database into the ORC format.
Step 1-Generate the list of files: The first step of P-EXTRACTOR consisted of locating and filtering the binary output files of the Global-CPTEC numerical model present in the supercomputing environment, referring to the target trimester, and eliminating unnecessary data.
Step 2-Transfer files: This step of P-EXTRACTOR aimed to transfer the binary files of the target trimester, from the supercomputing environment, to the Big Data processing cluster.
Step 3-Convert binary to text: In this step of P-EXTRACTOR, the binary files were converted to text format, ideal for greater efficiency in Hadoop, as it involves processing and distributed data storage.
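A conversion of this kind can be sketched with Python's standard `struct` module. The byte order and the flat 32-bit float layout below are assumptions for illustration; the actual layout of the model's binary output may differ.

```python
import struct

# Hedged sketch of Step 3: unpacking a flat binary record of 32-bit floats
# (byte order and layout are assumptions) and producing one text line per
# value, a splittable, Hadoop-friendly representation.

def binary_to_text_lines(raw: bytes, big_endian: bool = True):
    fmt = (">" if big_endian else "<") + f"{len(raw) // 4}f"
    return [f"{v:.6f}" for v in struct.unpack(fmt, raw)]

raw = struct.pack(">3f", 1.5, 2.25, -0.5)  # three sample values
print(binary_to_text_lines(raw))           # ['1.500000', '2.250000', '-0.500000']
```

Plain text costs more disk space than binary (as quantified in Step 5 below), but it lets HDFS split the files at line boundaries and lets MapReduce tasks parse them independently.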
These binary files had four dimensions: X (longitude), Y (latitude), Z (altitude), and T (time). The X dimension had 192 points of longitude, the Y dimension had 96 points of latitude, the Z dimension had up to six vertical pressure levels (1,000, 925, 850, 500, 250 and 200 mbars), and the T dimension varied daily in the period of reference.
Each XYZT point had 13 types of variables separated into two groups. The first group, containing seven variables, had value only at one level: TOPO, LSMK, T02M, TSZW, TSMW, SPMT, and PREC. The second group consisted of six variables at six levels: UVMT, VVMT, GHMT, TMAT, UEMT, and OMMT.
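Given the grid dimensions above, a 192 × 96 horizontal grid yields exactly 18,432 points, which matches the index range mentioned later in this article. As a sketch (assuming row-major ordering, which the article does not specify), each XY point can be mapped to a 1-based linear index as follows:

```python
NLON, NLAT = 192, 96  # X (longitude) and Y (latitude) dimensions of the grid

def grid_index(x, y):
    """Map a 0-based (longitude, latitude) grid point to a 1-based index.

    Row-major ordering is an assumption for illustration; the actual
    layout of the model's binary output files may differ.
    """
    return y * NLON + x + 1

print(grid_index(0, 0))     # first grid point -> 1
print(grid_index(191, 95))  # last grid point -> 18432
```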
Step 4-Create the Apache Hive database table: The command in Data Definition Language (DDL) used to define this data structure is shown in Fig. 14.
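Fig. 14 is not reproduced in this excerpt. Purely as an illustration of what such a DDL command could look like (the table name, column names, delimiter, and abbreviated column list are all assumptions), a text-format Hive table for this structure might be declared as:

```sql
-- Hypothetical sketch only; the actual DDL appears in Fig. 14.
CREATE EXTERNAL TABLE climate (
  idx  INT,    -- grid-point index (1..18432)
  prec FLOAT,  -- one of the 13 variables, e.g. precipitation
  t02m FLOAT   -- ... remaining variables omitted for brevity
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/MODELDATA';
```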
Step 5-Transfer the converted text format data into the Hive Database format: This fifth step of P-EXTRACTOR dealt with the transfer of the converted text data into the HDFS file system. First, it was necessary to create the directory specified in the "LOCATION" parameter of the DDL command seen in the previous step, using the "hdfs dfs -mkdir /MODELDATA" command. Then, the data was transferred to HDFS using the "hdfs dfs -put <FILE> /MODELDATA" command. As for disk volume, the files converted to text format used 59.4 GB of space, 2.48 times the original size of 24 GB.
Step 6-Create the Apache Hive database table using the ORC format: In this sixth step of the P-EXTRACTOR, a table similar to the table from step 4 was created, changing the format of the Database stored in the Hive to the Optimized Row Columnar (ORC) format.
The ORC format offers Hive greater efficiency than the purely textual format. It improves performance by restructuring the data into linear groups named "stripes". In addition, it provides data compression, significantly reducing disk usage.
The DDL command used to define this data structure is shown in Fig. 15, where the "STORED AS ORC" setting indicates the storage format.
Step 7-Convert the Apache Hive Database into the ORC format: In this seventh and final step of P-EXTRACTOR, a Data Manipulation Language (DML) command was used to complete the conversion from the text format to the ORC format. Initially, it was necessary to create the MODELDATA-ORC directory, using the "hdfs dfs -mkdir /MODELDATA-ORC" command. Then, the conversion was carried out using the command hive -S -e "INSERT INTO TABLE climate_orc SELECT * FROM climate".
After converting to the ORC format, the disk space used was 19.7 GB, a 1.22-fold reduction compared to the binary format (24 GB) and a 3.02-fold reduction compared to the text format (59.4 GB).
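The storage ratios reported above can be checked with simple arithmetic on the sizes given in the text:

```python
# Sizes reported in the article, in GB
binary_gb, text_gb, orc_gb = 24.0, 59.4, 19.7

text_expansion = text_gb / binary_gb  # text vs. original binary
orc_vs_binary = binary_gb / orc_gb    # reduction of ORC vs. binary
orc_vs_text = text_gb / orc_gb        # reduction of ORC vs. text

print(f"{text_expansion:.3f}")  # ~2.48, as reported
print(f"{orc_vs_binary:.3f}")   # ~1.22, as reported
print(f"{orc_vs_text:.3f}")     # ~3.02, as reported
```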
At the end of the seven steps of the proposed P-EXTRACTOR process, the gain from the significant reduction in storage space was evident.

Developing the process P-PRECLIS
The seasonal climate prediction process, named in Portuguese by the acronym P-PRECLIS, involved the relationship between the Applications Module and the Big Data Module.
The processing of the seasonal climate prediction was obtained by executing a HiveQL query, which encapsulated the calculations performed by Hadoop (MapReduce and HDFS) and Hive in the ORC format. Equation 1 presents an example of these metrics applied to the seasonal climate prediction of average precipitation, based on three successive average calculations: first, the average of the 15 Ensemble members generated per day; then, the monthly average; and finally, the trimester average.
In the first part of the calculation, x_i represented the value of each Ensemble member and p the number of Ensemble members (15). This calculation corresponded to approximately 151,142,400 tuples (the number of tuples in each file multiplied by the number of files). These tuples were grouped by the value of the index, which ranged from 1 to 18,432. As each tuple contained 48 variables, an approximate total of 7.25 billion elements was analyzed (total tuples multiplied by the number of fields).
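Equation 1 is not reproduced in this excerpt. Based on the description above, the three-stage averaging can plausibly be written as follows (the symbols for days and months are chosen here for illustration and may differ from the article's notation):

```latex
% Stage 1: daily ensemble mean over the p = 15 members
\bar{x}_d = \frac{1}{p} \sum_{i=1}^{p} x_{i,d}

% Stage 2: monthly mean over the D_m days of month m
\bar{x}_m = \frac{1}{D_m} \sum_{d=1}^{D_m} \bar{x}_d

% Stage 3: trimester mean over the three months
\bar{X} = \frac{1}{3} \sum_{m=1}^{3} \bar{x}_m
```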
A Python code fragment is illustrated in Fig. 16, which presents as its result the index and the average precipitation of the seasonal climate prediction, with grouping and ordering applied to the index field. This showed that other types of calculations could also be performed while maintaining the same calculation methodology.
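Fig. 16 is not reproduced in this excerpt. As a rough approximation of the approach it describes, the sketch below builds a HiveQL query of the assumed shape (the table and column names are hypothetical) and emulates its grouping and averaging in pure Python on a tiny synthetic sample:

```python
from collections import defaultdict
from statistics import mean

# Assumed query shape: average precipitation per grid-point index
query = (
    "SELECT idx, AVG(prec) FROM climate_orc "
    "GROUP BY idx ORDER BY idx"
)

def average_by_index(rows):
    """Emulate the GROUP BY / AVG / ORDER BY of the HiveQL query above."""
    groups = defaultdict(list)
    for idx, prec in rows:
        groups[idx].append(prec)
    return [(idx, mean(vals)) for idx, vals in sorted(groups.items())]

# Tiny synthetic sample: two indices, two tuples each
sample = [(1, 2.0), (2, 4.0), (1, 6.0), (2, 0.0)]
print(average_by_index(sample))  # [(1, 4.0), (2, 2.0)]
```

In the real experiment, Hive distributes this grouping and averaging across the cluster via MapReduce, rather than computing it in a single Python process.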

The main results and discussions
The experiment of this research was executed on a Hadoop cluster comprising a front-end node and 32 slave nodes. The experiment used distributed storage from the HDFS storage system and parallel processing through the MapReduce functionalities.
The front-end node was configured with 4 processing cores, 16 GB of RAM, 4 disks of 73 GB, and an additional disk of 147 GB of local storage. Each slave node was configured with 4 cores, 8 GB of RAM, and 1 disk of 250 GB of local storage. On this disk, a partition of 184 GB was reserved for HDFS; grouped across the nodes, these partitions totaled 5.7 TB of HDFS storage.
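The reported total of 5.7 TB can be checked from the per-node figures (the binary GB-to-TB convention is an assumption here):

```python
nodes = 32
hdfs_partition_gb = 184  # HDFS partition reserved on each slave node

total_gb = nodes * hdfs_partition_gb
total_tb = total_gb / 1024  # binary convention assumed

print(total_gb)            # 5888
print(round(total_tb, 2))  # 5.75, consistent with the ~5.7 TB reported
```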
The first stage of verification of the proposed canonical model showed precision in the results produced. The proposed seasonal climate prediction method M-PRECLIS was applied in the processing of average precipitation, resulting in six different forecast versions: the persisted TSM with the Kuo, RAS, and Grell convection schemes; and the predicted TSM with the Kuo, RAS, and Grell convection schemes.
These files were compared with the operational files and were found to be identical. Figure 17 shows an example using output files plotted in the GrADS tool.
In this example, it was found that the implemented code had low cyclomatic complexity, as it had few nested instructions and a small number of operators and operands. Thus, the Halstead and Cyclomatic Complexity metrics would not yield significant values for analysis; nevertheless, this confirmed that the code was testable, of low complexity, and easy to maintain. Therefore, the metrics of source code lines and computational performance were chosen, as they are more closely related to the characteristics of the implemented code.
As for the source code line metrics, only the third step of the Data Preparation Component used Python code, with 37 useful lines; the remaining steps were performed by commands typed in the terminal. The Bridge2HiveORC Component employed the DML INSERT command of the HiveQL language. The proposed seasonal climate prediction process P-PRECLIS was based on a Python script with only 23 useful lines.
The execution of the steps of the Data Preparation Component took 30 min and 16 s, while data extraction through the Bridge2HiveORC Component was performed in parallel with the generation of the binary files; therefore, its time overhead was effectively eliminated.
As for performance, the use of M-PRECLIS, as a proposed method to carry out the processing of the seasonal climate prediction, consumed only 4 min and 8 s, with a maximum consumption of 60% of all cores, 60 GB of RAM, and 100 GB of cache.
According to [6], in a relevant observation related to mono-processed architecture, when it comes to Big Data it is not possible to base calculations on the optimal parallelization equation, in which the parallel cost has the same order as sequential processing, p · T_p = Θ(T_s), because the hard drive becomes a bottleneck.
According to the bibliographic research carried out, related works emphasize only observed data, the visualization of weather and climate forecasts, and performance in the use of meteorological data.
Preserving certain similarities, but unlike these works and the traditional models and methods previously used, the M-PRECLIS proposed in this research provided a more effective realization of seasonal climate prediction on the output data of the Global-CPTEC numerical model, using: the Apache Hadoop framework, MapReduce distributed processing, HDFS distributed storage, the Apache Hive module, and the ORC file format. The proposed method M-PRECLIS was implemented in the experiment of a case study on seasonal climate prediction, using the main methods, techniques, tools, and metrics of free software available in the market and applying Big Data technologies.
At the end of the experiment, it was found that the proposed model is consistent and produces correct results, in addition to having low code complexity and good performance. According to the literature review, it was identified that the proposed model is innovative when performing seasonal climate prediction on the output data of the atmospheric numerical model of climate area. Thus, the use of the model and the proposed method can provide greater dynamics, speed, and conciseness in Seasonal Climate Prediction.
In addition, from this experiment, it was also found that the proposed method has provided an environment capable of: integrating Big Data Sets and various types of meteorological data; centralizing data storage; avoiding raw data transfers and file replications; and at the same time, providing better scalability.
The authors of this article believe that the main contribution of this research work was the elaboration of the canonical model for seasonal climate prediction, based on Big Data.
The first additional contribution of this research carried out from the Canonical Model was the design and development of the seasonal climate prediction method named M-PRECLIS.
The second additional contribution of this research work was the use of: Python programming language; Apache Hadoop; Apache Hive; and the format of ORC files in the knowledge domain of Seasonal Climate Prediction.
The third additional contribution of this research was the analysis of the Canonical model application, the M-PRECLIS method and its three internal processes (P-INSERTER, P-EXTRACTOR, and P-PRECLIS) in a practical and real experiment, involving operational data from the seasonal climate prediction.
The authors of this article also consider as a complementary contribution of this research the satisfactory use of the proposed canonical model, mainly for providing faster calculations over Big Data Sets of meteorological data coming from a very robust numerical climate forecasting model.
In this context, in addition to providing a computational environment capable of integrating various types of meteorological data, centralizing data storage, and increasing scalability, the proposed canonical model avoided raw data transfers and unnecessary file replications, saving several of the resources involved.
Therefore, based on the experiment carried out and the data presented in this article, it was found that the proposed canonical model constitutes a viable alternative that can guide new paths to be taken, paths that will certainly demand new research and future improvements to support the development of increasingly robust systems with additional features.
The authors believe that the future of this important area of research will increasingly involve Large Meteorological Data Sets focusing on Big Data as a solution and will naturally evolve towards the creation and improvement of new models, methods, techniques, metrics, and tools for emerging computer systems.
For future work, the use of data from other numerical climate forecasting models is suggested, composing multi-model databases that allow even more accurate calculations of probabilistic climate predictions.
It is also suggested for future work to use observed historical meteorological data and historical series of model executions, with the purpose of producing statistics, supporting reliable analysis of models, and identifying evolutions or failures.
Finally, it is suggested the continuation of this research, addressing new case studies and even more complete and larger experiments involving Big Data Sets, new models, methods, techniques, metrics, and tools for the development of even more agile and robust computer systems.
Among the main tools for deepening new research and its applications in the knowledge domain of Seasonal Climate Prediction, the use of the following Big Data technologies is suggested: Pig, as a scripting language for MapReduce; HBase, as a Hadoop database; Flume, as a log export system; Sqoop, as a system for exporting DBMS data to Hadoop; Apache Cassandra, to provide linear scalability in NoSQL databases; and/or Apache Spark, as a framework capable of providing data analytics and real-time machine learning in distributed computing environments.