Big data actionable intelligence architecture

The amount of data produced by sensors, social and digital media, and the Internet of Things (IoT) is rapidly increasing each day. Decision makers often need to sift through a sea of Big Data and draw on information from a variety of sources in order to determine a course of action. This can be a very difficult and time-consuming task. For each data source encountered, the information can be redundant, conflicting, and/or incomplete. For near-real-time applications, there is insufficient time for a human to interpret all the information from different sources. In this project, we have developed a near-real-time, data-agnostic software architecture that is capable of using several disparate sources to autonomously generate Actionable Intelligence with a human in the loop. We demonstrated our solution through a traffic prediction exemplar problem.

Executive Order 12906 concepts by providing "the technology, policies, standards, and human resources necessary to acquire, process, store, distribute, and improve utilization of geospatial data" [9]. This paper is organized as follows. The "Exemplar problem" section discusses the data sources and the exemplar we used to demonstrate our architecture. The "Related works" section reviews related work in the current open literature. The "Methods" section discusses our approach to the BDAI problem. The "Results and discussion" section discusses the results of our project. The "Conclusion" section summarizes our research.

Exemplar description
To demonstrate our capability of transforming Big Geospatial Data into Actionable Intelligence in near-real-time, we focused on an exemplar problem: generating Actionable Intelligence about traffic congestion in the city of Chicago. The traffic prediction problem is extremely complex. It is hard to accurately predict traffic conditions from off-line data (patterns, trends, road networks, etc.) or from crowdsourcing applications such as Waze [10], because the real-time environment changes dynamically (e.g. accidents, sporting events, weather changes, etc.). This exemplar highlights the importance of Actionable Intelligence. For example, first responders need to safely and expeditiously transport a victim to the hospital. Rapidly identifying the fastest route to a medical facility increases the survivability of the victim. Actionable Intelligence provides timely information, such as reports of heavy traffic, which allows first responders to make important time-saving transportation decisions. Table 1 lists the data sources used to test the BDAI framework. Figure 1 provides a high-level pictorial illustration of each data type. The data sources were extremely diverse in terms of data type and data frequency. Most of the data interfaces provided ways to geospatially constrain the results to the Chicago city limits. One of the data sources was a 3-h ground truth dash camera video experiment, used to validate the actionable intelligence created by our framework.

System requirement
A summary of requirements and metrics that we used to evaluate our system is depicted in Table 2.

Assumption about data
We made the following general assumptions in regard to data:
1. Data can be referenced by time and geospatial extent.
2. Each data type may not follow a standardized format. Hence, the architecture needs to be flexible enough to onboard new formats.
3. Input data can come in a variety of forms (structured, semi-structured, or unstructured).
4. Data might not be immediately available for retrieval due to site restrictions.

Related works
Traffic prediction analysis is typically done through crowdsourcing, where location information from GPS apps is shared among users to help predict the fastest route [17].
Recently, improvements in traffic prediction accuracy using social media data have been demonstrated [18]. Despite extensive research on traffic prediction [19], most existing work relies on only a few data sources. To our knowledge, no existing work combines data sources such as Twitter, web camera imagery, satellite imagery, dash camera video, Mapquest, and GDELT to support near-real-time traffic prediction. Our work uses seven disparate data sources as described in Table 1. Each data source can be streamed from multiple locations. The web camera data, in particular, involves live streaming from more than a hundred camera locations around the City of Chicago. The traffic reports are received from hundreds of stations. Existing software architectures [20] typically focus on the acquisition, storage, and retrieval of Big Data, whereas our architecture focuses on Actionable Intelligence generation. Several data architectures have been proposed for network traffic monitoring applications [21][22][23], but our data architecture supports multiple disparate data sources. A general five-layer Big Data Processing and Analytics (BDPA) architecture involves a collection layer, a storage layer, a processing layer, an analytic layer, and an application layer [24]. However, this architecture does not address actionable intelligence generation. In 2019, Zhu et al. stated: "Currently, there are no widely accepted BDPA solution, especially a general-purpose solution fit for both traditional and internet industries" [24]. Liu et al. [25] proposed a general multi-source framework to map disparate data sources to a common unified data format for Big Data fusion. Their paper suggested the benefits of combining heterogeneous sources to provide a better solution, but it did not show how this framework can be integrated with Big Data streaming sources.
Hence, our work is motivated by the use of Big Geospatial Data to answer key customer geospatial and temporal questions. Big Geospatial Data is Big Data with geospatially tagged features and error estimates. As stated by the NIST Big Data Public Working Group (NBD-PWG), "Big Data consists of extensive datasets, primarily in the characteristics of volume, variety, velocity, and/or variability, that require a scalable architecture for efficient storage, manipulation, and analysis" [26]. While most Big Data information fusion solutions focus on social media data sources [27], our architecture accommodates a variety of geospatially tagged data sources at various velocities and veracities. Our traffic prediction exemplar allows us to test and validate key BDAI capabilities: handling heterogeneous data sources, hosting data pipelines on distributed processing platforms, and running machine learning algorithms in near-real-time. The exemplar is not meant to compete with crowdsourcing GPS apps, but rather to serve as a generic exemplar that can be extended to other Big Data Actionable Intelligence problems.

System setup
Our BDAI software was initially deployed to a bare metal system named "Ray". We deployed, configured, and tested the Hortonworks Data Platform. Most of our custom data processing code is implemented in Java [31], with some processing implemented in Python [32].

BDAI architecture contributions
A high-level view of our BDAI architecture is depicted in Fig. 2. While similar architectures have been proposed in the open literature [20,24], these architectures focus on the acquisition, storage, and retrieval of Big Data, and on the use of specific data types [22,23]. The key question we want to answer in this paper is: Can we create a near-real-time, data-agnostic software architecture that can process many disparate sources while autonomously generating Actionable Intelligence? In order to combine and fuse disparate streaming data sources to produce actionable intelligence, we believe Big Data should be curated as it arrives at the system. Our main contributions to the Big Data architecture field are as follows:
1. Provide a general framework to map data from disparate data sources into a common frame of reference indexed by time and geospatial extent. This enables our architecture to stay data agnostic, which makes it possible to quickly onboard new data sources and respond with agility to completely new and orthogonal scenarios. This method also provides the ability to ask questions generically over many disparate data sources, which minimizes the learning curve to perform meaningful fusion and analysis.
2. Provide a high-level description of our implementation, in which our architecture uses a modern Big Data technology stack (depicted in Fig. 3). This software stack is natively distributed and built for high-throughput streaming, which allows us to tackle problems of mission-level magnitude.
3. Demonstrate that our architecture and technology stack are capable of supporting the streaming of disparate data sources to produce actionable intelligence.

BDAI Architecture-algorithm workflow
Our architecture contains four levels of processing: Data Source, Data Pipeline, Data Analytic, and Data Reporting. First, we set up a streaming interface connection for each data source. We used Apache Storm topologies [33] and Apache Kafka's [34] interprocess communication mechanism to implement our Data Pipelines because they are known to provide a high level of scalability, low latency, fault tolerance, and guaranteed data delivery [35,36]. A general workflow of our data pipeline is depicted in Fig. 4. We created a separate Storm topology [33] for each data type. Each topology follows a similar workflow of acquiring, normalizing, processing, and publishing the data (see Fig. 3). Apache Kafka is used as a central message broker, connecting each step of the processing. For example, when incoming data arrives, it is first placed in Kafka, and the "Getter" is informed to obtain the data. The "Getter" is responsible for acquiring the data from an individual data source. The "Normalizer" is responsible for transforming the data by mapping both raw data and metadata into a common event schema. The event schema is depicted in Fig. 5. The ontology mapping of each individual data source into a common event description is depicted in Fig. 6. Mapping each individual data source into a common data schema is necessary to establish a common frame of reference for events that occur at a given time and place. This design makes it easy to search for events at a specific time or location. All the data sources are "normalized" to the same common event schema, in which they are all "linked" by time and location. Tagging the data in this manner ensures that it can be discovered by geospatial analytic processing in later steps.
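As an illustration of the "Normalizer" step, the following minimal Python sketch maps two differently shaped source records onto one common event record keyed by time and location. The `Event` fields and the raw record keys (`captured_epoch`, `camera_lat`, `created_at`, and so on) are assumptions made for this example; the actual schema is the one depicted in Fig. 5, and the example URL is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class Event:
    """Hypothetical common event record (not the schema of Fig. 5)."""
    source: str        # originating data source, e.g. "web_camera"
    event_type: str    # kind of observation, e.g. "camera_image"
    time: datetime     # when the event occurred (timezone-aware)
    lat: float         # latitude in decimal degrees
    lon: float         # longitude in decimal degrees
    payload: Dict[str, Any] = field(default_factory=dict)  # source-specific data


def normalize_camera(raw: Dict[str, Any]) -> Event:
    """Map an assumed web camera record onto the common schema."""
    return Event(
        source="web_camera",
        event_type="camera_image",
        time=datetime.fromtimestamp(raw["captured_epoch"], tz=timezone.utc),
        lat=raw["camera_lat"],
        lon=raw["camera_lon"],
        payload={"image_url": raw["url"]},
    )


def normalize_tweet(raw: Dict[str, Any]) -> Event:
    """Map an assumed tweet record onto the common schema."""
    return Event(
        source="twitter",
        event_type="tweet",
        time=datetime.fromisoformat(raw["created_at"]),
        lat=raw["geo"]["lat"],
        lon=raw["geo"]["lon"],
        payload={"text": raw["text"]},
    )


# Illustrative inputs only; values are made up for the example.
camera = normalize_camera({
    "captured_epoch": 1553263500,
    "camera_lat": 41.75,
    "camera_lon": -87.63,
    "url": "http://example.invalid/cam.jpg",
})
tweet = normalize_tweet({
    "created_at": "2019-03-22T14:10:00+00:00",
    "geo": {"lat": 41.755, "lon": -87.63},
    "text": "Road closed due to police activity",
})
```

Because both records end up with the same `time`, `lat`, and `lon` fields, later stages can query them without knowing which source produced them.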
The "Processor" is responsible for extracting events from raw sensor data and then populating its results in the event schema. The "Publisher" is responsible for "indexing" the data to enable search and discovery at the "Data Analytic" level. Apache Solr [37], an enterprise search engine, is used for both indexing and querying the geospatial and temporal data.
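To make the time-and-location lookup concrete, the sketch below builds the request parameters for a Solr query that combines a date range filter with the `geofilt` spatial filter. The field names `event_time` and `location` are hypothetical, since the paper does not list the index fields, and the sketch only constructs the parameters without contacting a Solr server.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Union


def solr_time(t: datetime) -> str:
    """Format a timezone-aware datetime as a Solr date string (UTC, 'Z' suffix)."""
    return t.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_event_query(
    lat: float,
    lon: float,
    radius_km: float,
    center: datetime,
    window: timedelta,
) -> Dict[str, Union[str, List[str]]]:
    """Build Solr parameters for events near (lat, lon) within center +/- window.

    'event_time' and 'location' are assumed field names for this example.
    """
    start, end = center - window, center + window
    return {
        "q": "*:*",
        "fq": [
            # Temporal constraint as a Solr range query on a date field.
            f"event_time:[{solr_time(start)} TO {solr_time(end)}]",
            # Spatial constraint: circle of radius d (km) around point pt.
            f"{{!geofilt sfield=location pt={lat},{lon} d={radius_km}}}",
        ],
        "sort": "event_time asc",
    }


params = build_event_query(
    41.75, -87.63, 1.0,
    datetime(2019, 3, 22, 14, 5, tzinfo=timezone.utc),
    timedelta(minutes=15),
)
```

The same parameter set works for every event type, which is the practical benefit of indexing all sources against one schema.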
In our design, we developed a custom topology for each data type. The custom design provides flexibility to support different data types. An illustration of a web camera topology insertion is depicted in Fig. 7. In this example, the "Processor" was built based on an object detection algorithm called You Only Look Once (YOLO) [38]. As depicted in Fig. 8, the pre-trained YOLO processor did not yield good results. Hence, we labeled and re-trained YOLO using the web camera images from Travel Mid-West. Results of the re-trained YOLO processing are also depicted in Fig. 8 as a comparison. The output of YOLO is used to determine the number of cars in each camera image. The event (i.e. number of cars at a location) generated from the YOLO topology is indexed by image time (when the image is captured) and image location (i.e. latitude and longitude of where the event occurred).
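The step from detector output to an indexed event can be sketched as follows. The detection format (a list of dictionaries with `label` and `confidence`) and the 0.5 confidence threshold are assumptions for illustration; they are not taken from the YOLO processor described here.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def count_cars(detections: List[Dict[str, Any]], min_confidence: float = 0.5) -> int:
    """Count detections labeled 'car' at or above the confidence threshold."""
    return sum(
        1
        for d in detections
        if d["label"] == "car" and d["confidence"] >= min_confidence
    )


def camera_event(
    detections: List[Dict[str, Any]],
    captured_at: datetime,
    lat: float,
    lon: float,
) -> Dict[str, Any]:
    """Build a vehicle-count event indexed by image time and image location."""
    return {
        "event_type": "vehicle_count",
        "time": captured_at,
        "lat": lat,
        "lon": lon,
        "car_count": count_cars(detections),
    }


# Made-up detector output for one image.
detections = [
    {"label": "car", "confidence": 0.91},
    {"label": "car", "confidence": 0.62},
    {"label": "car", "confidence": 0.30},    # below threshold, ignored
    {"label": "truck", "confidence": 0.88},  # not a car, ignored
]
event = camera_event(
    detections,
    datetime(2019, 3, 22, 14, 5, tzinfo=timezone.utc),
    41.75,
    -87.63,
)
```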
For the Twitter Topology, we implemented a separate machine-learning "Processor" to process live tweets and generate traffic sentiment. Similarly, we indexed tweet events by time and location.

Multi-source data fusion
Information Fusion (IF) is the process of combining data or information to develop improved estimates or predictions of entity states [39]. Information obtained from a single source can be unreliable or insufficient to make an accurate determination. For example, in one traffic scenario on the Dan Ryan Expressway Inbound between 87th St and 71st St on March 22, 2019, our YOLO topology reported light traffic conditions because very few cars were detected (see Fig. 9). However, information received from our Tweet Processor indicated that the road was closed due to police activity (see Fig. 10). Since the tweet information had already been indexed by time and location, we could easily perform a geospatial query to retrieve the tweet closest to the image in time and location. Hence, the use of multiple data sources is necessary in order to improve the reliability and quality of the information provided to decision makers.
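The matching step in this example can be sketched in a few lines of Python: given a camera event, find the tweet event closest in time among those that fall within a distance and time tolerance. The 2 km radius, the 30 min window, the dictionary layout, and the coordinates are all assumptions chosen for illustration; in the deployed system the lookup is a Solr geospatial query, not an in-memory scan.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def closest_event(
    anchor: Dict[str, Any],
    candidates: List[Dict[str, Any]],
    max_km: float = 2.0,
    max_gap: timedelta = timedelta(minutes=30),
) -> Optional[Dict[str, Any]]:
    """Return the candidate nearest in time to the anchor, within both tolerances."""
    nearby = [
        c
        for c in candidates
        if haversine_km(anchor["lat"], anchor["lon"], c["lat"], c["lon"]) <= max_km
        and abs(c["time"] - anchor["time"]) <= max_gap
    ]
    return min(nearby, key=lambda c: abs(c["time"] - anchor["time"]), default=None)


t0 = datetime(2019, 3, 22, 14, 5, tzinfo=timezone.utc)
camera_event = {"lat": 41.75, "lon": -87.63, "time": t0, "car_count": 2}
tweets = [
    {"lat": 41.755, "lon": -87.63, "time": t0 + timedelta(minutes=5),
     "text": "Road closed due to police activity"},
    {"lat": 41.90, "lon": -87.63, "time": t0 + timedelta(minutes=1),
     "text": "Unrelated report far from the camera"},
]
match = closest_event(camera_event, tweets)
```

Here the second tweet is closer in time but too far away, so the road closure report is the one paired with the low car count.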

BDAI architecture analytical fusion algorithm
Our BDAI analytic seeks to combine event data from disparate sources to predict traffic congestion, improving the outcome beyond what could be achieved with a single source of information. At the data analytic level, we first query the normalized and curated data from all data sources by time and location. Then, we run the analytic on events occurring at similar times and locations. To demonstrate how a machine-learning algorithm can be integrated into our architecture, we designed a Merged Neural Network (as depicted in Fig. 11) to perform the traffic congestion classification. The algorithm takes as input all the normalized event data (related by time and location) and produces a traffic congestion probability. The output is a real-valued number between 0 and 1 that represents the level of traffic, where 0 is negligible traffic and 1 is a severe, complete-standstill traffic jam.
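The general idea of a merged network, one branch per source whose hidden outputs are concatenated and passed to a sigmoid output, can be sketched with a tiny forward pass in plain Python. Everything specific here is an assumption for illustration: the two branches, the feature definitions, the layer sizes, and the hand-picked weights are not the trained network of Fig. 11.

```python
import math
from typing import Callable, List, Sequence


def relu(v: float) -> float:
    return max(0.0, v)


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def dense(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    activation: Callable[[float], float],
) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


# Placeholder weights chosen by hand; a real network would learn these.
CAM_W, CAM_B = [[1.0], [0.5]], [0.0, 0.0]
TWEET_W, TWEET_B = [[2.0, 1.0], [0.5, 1.5]], [0.0, 0.0]
OUT_W, OUT_B = [[1.0, 0.5, 1.5, 1.0]], [-2.0]


def congestion_score(camera_features: Sequence[float],
                     tweet_features: Sequence[float]) -> float:
    """Merge per-source branches and return a score strictly between 0 and 1."""
    cam_hidden = dense(camera_features, CAM_W, CAM_B, relu)
    tweet_hidden = dense(tweet_features, TWEET_W, TWEET_B, relu)
    merged = cam_hidden + tweet_hidden  # list concatenation = merge layer
    return dense(merged, OUT_W, OUT_B, sigmoid)[0]


# Assumed features: camera = [normalized car count];
# tweets = [closure flag, negative-sentiment score].
quiet = congestion_score([0.1], [0.0, 0.0])
closed = congestion_score([0.1], [1.0, 0.8])
```

With these placeholder weights, the same low car count yields a higher score once the tweet branch reports a closure, which mirrors the Dan Ryan example qualitatively.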

Chicago traffic analytic-multi-source analytical fusion demonstration
A web camera image that captured the traffic condition on the Dan Ryan Expressway is depicted in Fig. 12. For the corresponding time frame, our BDAI system was able to locate a tweet from the Total Traffic Chicago data source indicating that the road was closed due to an accident in the area (see Fig. 13). For a similar time frame, the BDAI system confirmed slow traffic through a traffic report from Mapquest (see Fig. 14). However, Mapquest reported that the West Dan Ryan Expressway had light traffic (see Fig. 15). This information was also confirmed by the small number of cars detected (Fig. 16) by our web camera topology. Taking all of the sources into account, BDAI was able to distinguish the traffic congestion level on both sides of the West Dan Ryan Expressway.

Traffic classifier performance
Overall, the BDAI Merged Neural Network classifier performed extremely well on intersections on which the network was trained. We also tested the classifier on intersections on which it was not trained. As expected, performance on these unseen intersections was poor. A summary of the performance of our classifier is depicted in Fig. 17.

BDAI dashboard
The output of the BDAI system is visualized using a Banana Dashboard [40], as depicted in Fig. 18. The BDAI Dashboard is backed by an Apache Solr cluster, which contains all event data. The map in the lower left shows the event records that were ingested by one of our data pipelines. The icons mark the actual geospatial locations of the events. The event metadata is shown in the table to the right of the map.

System performance
Our BDAI software was deployed to a bare metal system named "Ray" ("System setup" section). A summary of the system performance on "Ray" for all event types is given in Table 3. We are not aware of any similar systems published in the open literature against which to draw a direct comparison. The missing entries in the table are due to insufficient information for that particular event type to derive statistics. Each event type can have multiple sources, as there may be multiple camera locations or traffic report stations active at a given time. Our data architecture supports concurrent streaming from each data source. Each event type restricts how frequently we can "poll" the data. Hence, "polling" is not done the instant an event becomes available, but rather at a fixed time interval, as permitted by the external source. This is not a limitation of our architecture, but rather a limitation imposed by the external data source. Latency in Table 3 is measured from the time an event happens to the time that the event is curated and indexed into Solr. This does not account for any additional latency required by downstream analytic processing. Once the data is indexed into Solr, it is immediately available for analysis. Some events, such as the web camera imagery, require additional processing (i.e. using the YOLO processor). The time for data processing depends heavily on the specific type of algorithm implemented. Our Merged Neural Network ("Multi-source data fusion" section), used for actionable intelligence generation, polls Solr every 15 min. All information retrieved over the time interval is used to create actionable intelligence. The execution time of the Merged Neural Network is negligible (within a millisecond). The "polling" period is not a limitation of the architecture; it is an adjustable parameter that depends on the arrival time of each individual data source.
The polling rates for each topology are listed in Table 4. The overall turnaround time for actionable intelligence generation is mainly driven by the availability of the data sources and the frequency at which we poll the data, since the actual data processing time is negligible.
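The latency definition used above (time of indexing minus time of the event) can be computed per event type with a short Python sketch. The record layout and the sample timestamps are made up for illustration and are not values from Table 3.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean
from typing import Dict, List, Tuple


def latency_seconds(event_time: datetime, indexed_time: datetime) -> float:
    """Latency from the moment an event happens until it is indexed."""
    return (indexed_time - event_time).total_seconds()


def summarize(records: List[Tuple[datetime, datetime]]) -> Dict[str, float]:
    """Summarize latency over (event_time, indexed_time) pairs."""
    latencies = [latency_seconds(e, i) for e, i in records]
    return {
        "count": len(latencies),
        "mean_s": mean(latencies),
        "max_s": max(latencies),
    }


# Made-up sample: three events indexed 60 s, 120 s, and 180 s after they occurred.
t0 = datetime(2019, 3, 22, 14, 0, tzinfo=timezone.utc)
sample = [
    (t0, t0 + timedelta(seconds=60)),
    (t0 + timedelta(minutes=5), t0 + timedelta(minutes=5, seconds=120)),
    (t0 + timedelta(minutes=10), t0 + timedelta(minutes=10, seconds=180)),
]
stats = summarize(sample)
```

Comparing `mean_s` against the polling interval of a topology gives the kind of check discussed in the requirement analysis.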

Performance vs requirement discussion
With regard to the original requirements listed in Table 2, our system has achieved the scalability and flexibility needed for Big Data processing. We have demonstrated that our system is horizontally scalable to hundreds of locations. For example, the data we ingested included traffic segments received from 818 stations, vehicle detection system reports received from 818 stations, images received from 150 camera locations, and dynamic message signs reported from 150 stations. We ingested an average of 132,000 tweets a day, 14,000 camera images a day, and 10,000 posts from GDELT. A breakdown of the statistics for requirement analysis is given in Table 5. The majority of the data met our requirement specification. The only exception is the dynamic message sign report topology. Its larger latency was associated with an inconsistent update interval on the server rather than with actual latency in our system. As shown in Table 5, the overall latency of each data type is largely driven by external site restrictions on how frequently we are allowed to query the data. Despite this restriction, most data sources had an average latency less than the "polling" time. It is possible that the latency can be further reduced if the data can be pushed to the

Conclusion
In conclusion, our big data architecture provides a framework for machine-learning algorithms to learn from and analyze streaming data (e.g. near-real-time analytics) from heterogeneous data sources (texts, signal waveforms, images, videos) and turn them into actionable information for decision makers. Our data-agnostic solution is accomplished by mapping different data types into a common frame of reference that requires both temporal and geospatial metadata. We have demonstrated through a traffic prediction exemplar that our architecture can support actionable intelligence generation in near-real-time using disparate data sources. Our traffic prediction exemplar allowed us to test and validate key BDAI capabilities: handling heterogeneous data sources, hosting data pipelines on distributed processing platforms, and running machine learning algorithms in near-real-time. Our BDAI platform was designed with flexibility in mind, allowing us to quickly onboard new data sources and apply machine learning algorithms. Our data platform's agility and common frame of reference allow us to rapidly provide Actionable Intelligence for our customers' mission-relevant problems. The framework architecture is a generalized architecture that can enable solutions for other BDAI problems with similar data diversity and data volume. The BDAI architecture has been fully implemented in a software system that has been running at Sandia National Laboratories for over a year. Our work has been featured in the local news media [1]. The current BDAI system can produce first-order data analytics (i.e. combining data from multiple sources to assess what is happening at the current time). In the future, we plan to further develop statistical techniques such as minimum variance to optimize the resultant estimate.
In addition, we plan to extend BDAI's capability to include a second order of analytics by providing the decision maker with a list of suggested actions, based on the assessment of the current situation using multiple data sources.