Skip to main content

Evaluation of big data frameworks for analysis of smart grids


With the rapid development of smart grids and increasing data collected in these networks, analyzing this massive data for applications such as marketing, cyber-security, and performance analysis, has gained popularity. This paper focuses on analysis and performance evaluation of big data frameworks that are proposed for handling smart grid data. Since obtaining large amounts of smart grid data is difficult due to privacy concerns, we propose and implement a large scale smart grid data generator to produce massive data under conditions similar to those in real smart grids. We use four open source big data frameworks namely Hadoop-Hbase, Cassandra, Elasticsearch, and MongoDB, in our implementation. Finally, we evaluate the performance of different frameworks on smart grid big data and present a performance benchmark that includes common data analysis techniques on smart grid data.


The smart grid combines the traditional power system with information technology leading to a one of most important modern cyber-physical systems. Smart grid is envisioned to fully integrate high-speed and two-way communication technologies into millions of power equipment to establish a dynamic and interactive infrastructure with new energy management capabilities, such as advanced metering infrastructure (AMI) and demand response. Smart grid heavily relies on information and communication technology to achieve efficient and reliable operation [1]. At the same time, smart grid big data has provided new opportunities for electric load forecasting, anomaly detection (e.g power theft), and demand side-management. However, the high-dimensional and massive smart grid big data creates new challenges in data transmission, data storage, and data analysis. This paper addresses the problem of creating a benchmark for big data frameworks used in smart grid big data analysis. We also develop a realistic smart grid data generator for performance analysis in real conditions.


The rapid growth of smart grid, deployment of modern information and communication technologies and millions of newly deployed smart meters, will generate large amount of smart grid. Smart grid big data analysis is considered to be the key to solving significant problems of this industry. With the exponential growth of data, how to efficiently utilize this data becomes a critical issue. Most of recent work have focused on smart grid analysis and big data technology separately, without considering a benchmark for different big data frameworks. But, it is important to understand different big data frameworks’ performance in analyzing millions of smart meters’ data.


The main contributions of this paper are as follows:

  1. 1.

    A smart grid data generator is designed based on big data platforms, taking into account the practical concerns of realistic smart grid.

  2. 2.

    The data generator is developed and implemented using Spark and HDFS file-systems.

  3. 3.

    The four most efficient open source big data frameworks are selected and used to analyze smart grid big data.

  4. 4.

    We propose a benchmark for evaluating smart grid big data frameworks in term of different criteria such as run time, throughput, latency and etc.

  5. 5.

    Based on the proposed benchmark, we compare four big data frameworks and select the best one for smart grid big data analysis

Paper structure

The rest of the paper is organized as follows. “Related work” section discusses related work. In “Methods” section, we explain the design of our scalable smart meter data generator. We also provide the required background on big data concepts and the four big data frameworks for smart grid big data analysis. In “Implementation and benchmarking of big data technology in smart grid” section we propose and deploy our benchmark for comparing smart grid big data frameworks. In “Results and discussion” section, we evaluate different frameworks for smart grid data analysis and present the benchmark results. Finally, “Conclusion” section provides some concluding remarks and outlines directions of future research.

Related work

There is a wide range of studies that are related to the big data challenges in smart grids. In this section, we try to briefly cover researches that their results are mostly related to our work.

Big data concepts in smart grid

Ye et al. in [2] illustrate big data analytics and cloud computing in a smart grid with focus on relevant issues in demand response and wide-area monitoring. To fulfill the potential of energy big data and obtain insights to achieve smart energy management, a comprehensive study of big data-driven smart energy management presented in [3]. Vasilakos et al. in [4] present a survey and discuss about new findings and developments in the smart grid big data analytics and security. A big data framework that can be a start for innovative research in smart grids and implementation of the framework on a secure cloud-based platform is presented in [5]. In [6], an overview on the state-of-the art of big data technology in smart grid is presented. A comprehensive survey on the life cycle of data and data trading, such as studying a variety of data pricing models, categorizing them into different groups, and comparison of these models is presented in [7]. Kang et al. in [8] introduce some recent progress in the application of big data analytics in China’s electric power industry. A study on the compression techniques for smart meter big data is presented in [9]. A smart grid big data ecosystem based on the state-of-the-art Lambda architecture is proposed in [10] that is capable of performing parallel batch and real-time operations on distributed data. Jiang et al. in [11] proposed a high performance and privacy-preserving query scheme over encrypted multidimensional big metering data to address how to efficiently query encrypted multidimensional metering data stored in an untrusted heterogeneous distributed system environment. In [12], a lightweight communication approach for big data traffic is presented that can prevent or quickly respond to the occurrence of network congestion in situations that computing power and bandwidth are limited. A comprehensive tutorial and survey to highlight research challenges on the vulnerabilities in smart grid big data is presented in [13]. Hou et al. in [14] propose a novel temporal, functional, and spatial big data computing framework for large-scale smart grid. The potential applications of big data analytics in electric grids is discussed in [15]. A study on 0.1 billion data point, collected by various smart Internet of Things (IoT) devices in power system of China is done to analyze consumption characteristics of power users in [16].

Big data analysis in smart grid

Here, we briefly survey the papers that focused on using big data analysis technology in different domains such as anomaly detection, marketing and electricity price forecasting, fault tolerance etc. in smart grid. Liu et al. in [17] proposed a lambda system for detecting anomalous consumption patterns, aiming at assisting decision makings for smart energy management. A real-time anomaly detection framework, which can be built based upon smart meter data collected in smart grid big data is proposed in [18]. Vimalkumar et al. in [19] designed a big data framework using various machine learning techniques to detect intrusions. An unsupervised progressive incremental data mining mechanism applied to smart meters big energy consumption data through frequent pattern mining is presented in [20]. Wu et al in [21] proposed a security situational awareness mechanism based on the analysis of big data in the smart grid. A novel electricity price forecasting model that handle with huge price data in the smart grid big data is introduced in [22]. Ye et al. in [23] proposed a secure data learning scheme for big data applications in the information and communication technology infrastructure of the smart grid. In [24], a new model for the microgrids optimal scheduling and load curtailment based on big data concept is proposed. A data-driven method of association rule mining for transformer state parameters based on big data is proposed by combining the a-priori algorithm and probabilistic graphical model in [25]. From performance point of view only in [26], smart meter analytic from a software performance perspective is examined. Also a smart meter data generator using Spark that can generate semi-realistic data sets is presented in [27].

Fig. 1
figure 1

a Overall architecture of our smart grid big data framework, b smart grid big data analysis system model


Figure 1 presents the overall architecture of our smart grid big data framework and data analysis system model based on this architecture. The system can be divided in to three blocks: data generator, database, and data analysis. Data initialization module is a smart meter dataset that has been provided by the Irish Social Science Data Archive [28]. This real data is collected during 2009 and 2010 with over 5000 Irish homes and businesses participating. The data set includes over 150 million records of 5000 customers for 18 months. A smart meter measures consumption in a given interval every 30 min. Figure 2 illustrates the typical power consumption of customer from ISSDA dataset. We can see a fixed pattern in workdays and small differences in weekends. In the following we describe each part of the smart grid big data system model in details.

Fig. 2
figure 2

A week smart meter reading of customer

Smart grid data generator model

In order to investigate the performance and benchmarking of big data technologies in smart grids, we need to process petabytes of data for smart grid data management and analytics. Indeed using realistic data for performance evaluation is much better, but obtaining datasets with records of millions customers is challenging because of privacy issues and the lack of truthful data sources. Therefore, in this section, we first describe a comprehensive smart meter data generator based on big data frameworks that can generate semi-realistic datasets with large number of customers. Our proposed data generator use a supervised machine learning algorithm that can generate data based on small real datasets. As shown in Fig 1, the data generator has different blocks: pre-process module, Spark module and HDFS file system module. The pre-process module, first checks the format and values in each field of the seed data, then clustering the customers in order to offer personalized energy-efficiency services [29]. It also prepares time series of data to send to the Spark module. The engine of the data generator includes two modules: training module and data generation module which are implemented using memory-based distributed computing framework Spark. In the first stage, Resilient Distributed Dataset (RDD) is built in RAM to store temporary data. The seed data is deseasonalized using a moving average time series model in order to flatten the periodic variations and create an adjusted time series. The moving average by smoothing the peak periods reduces the periodic variations from the actual time series and it uses auto-regressive (AR) model trained with the flattened time series to predict meter reading. In the end, the periodic variations are added back. Actually, the training process will take a clustered seed data as the input to create the models and writes the output in HDFS. Since, generated data has a stream-oriented nature, we propose using Apache Kafka [30] as a message queue in big data framework in our model. The output is inserted in NoSQL database through Kafka queue. The generated data has four fields including meter ID,time, power consumption and environment temperature. The meter ID is consider as 1000 to 5001000. The time is stored as a five-digit numeric field, the first three numbers representing the day n and the next two digits representing the clock that it’s value set between 1 to 48 which time of day start 0 correspond to 12 PM . The power consumption field is considered as numerical data. We see that, all data is structured column-based and range of changes is fully specified.

NoSQL database

NoSQL provides the new data management technologies designed to meet the features of big data i.e. volume, velocity, and variety of data. It can store and retrieve data that is modeled in means other than the tabular relations used in relational databases [31]. In comparison to relational databases, NoSQL databases are more scalable and provide superior performance when we are dealing with big data. Based on our experiences and domain experts’ recommendations, we select four open source NoSQL database: Hbase, Cassandra, MongoDB and Elastic search. In the following we describe each technology.


Hbase is a column store based on Apache Hadoop and concepts of BigTable. It is a NoSQL key/value store which runs on top of HDFS [32]. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns or column versions from the table. Although Hbase includes tables, a schema is only required for tables and column families, but not for columns, and it includes increment/counter functionality. Hbase has centralized architecture where the Master server is responsible for monitoring all RegionServer (responsible for serving and managing regions) instances in the cluster, and is the interface for all metadata changes. Hbase is optimized for reads, supported by single-write master, and resulting strict consistency model, as well as use of Ordered Partitioning which supports row-scans. It provides CP (Consistency, Partition tolerance) form the CAP theorem. The most important feature of Hbase is strong consistency and fast read and write with high scalability.


Cassandra is also a column store based distributed database management system based on ideas of BigTable and DynamoDB. Unlike Hbase, Cassandra has decentralized architecture and each node can perform any operation that it provides AP (Availability, Partition-Tolerance) from the CAP theorem [33]. Cassandra has excellent single-row read performance as long as eventual consistency semantics are sufficient for the use-case. Cassandra quorum reads, which are required for strict consistency will naturally be slower than Hbase reads. Cassandra is well suited for supporting single-row queries or selecting multiple rows based on a Column-Value index [34].


MongoDB is a document oriented database. All data in MongoDB is treated in JSON/BSON format. It is a schema less database which goes over tera-bytes of data in database. It also supports master slave replication methods for making multiple copies of data over servers making the integration of data in certain types of applications easier and faster [31]. MongoDB combines the best of relational databases with the innovations of NoSQL technologies, enabling engineers to build modern applications. MongoDB provides the data model flexibility, elastic scalability, and high performance of NoSQL databases. As a result, engineers can continuously enhance applications, and deliver them at almost unlimited scale on commodity hardware [35].


Elasticsearch is an open-source search engine built on top of Apache \(\hbox {Lucene}^{\mathrm{TM}}\), a fulltext search engine library. Elasticsearch is a real-time distributed search and analytics engine. It is used for full text search, structured search, analytics, and all three in combination [36]. It has a very flexible query API, that supports filtering, sort, pagination, and aggregations in the same query. Elasticsearch handles unstructured data automatically, meaning you can index JSON documents without predefining the schema. It will try to figure out the field mappings, and adding/removing the new/existing fields automatically. Also task such as clustering, replication of data, and automatic fail-over are supported out of the box and are completely transparent to the user.

Implementation and benchmarking of big data technology in smart grid

In this section the architecture of the platform is briefly introduced and principles for building real-time processing systems are presented. Furthermore, the advantages of using the architecture in smart grids are summarized.

Cluster specification

To implement our ecosystem, we use a cluster of 6 nodes. Each node is a server with 20 physical CPU core and 40 in hyper-thread mode, 128GB RAM, and 24TB HDD. We assign a node to Spark, a node to HDFS file system, and a node to Kafka queue. Each of the database technologies (i.e. Hbase, Cassandra, MongoDB, and Elasticsearch) is implemented on a cluster with three nodes.

Smart grid data generator implementation

The smart meter data generator is implemented on two nodes. The implementation is consisted of two functions: data-preparator and data generator. The initial value for training algorithm and specifications of data generated are summarized in Table 1.

Table 1 Initial and generated data description

The generated data is stored in HDFS and the format of each data file is as follows: 3 columns corresponding to Meter ID, a five digit code composed of day code: digits 1–3 (day 001 = 1st of January 2009), and time code: digits 4–5 (1–48 for each 30 min with 01= 00:00:00–00:29:59), and electricity power consumed during each 30 min interval (in kWh). By means of this implementation we generate a big data-set with 5,000,000 home and commercial customers for 1.5 year that might be used to evaluate big data platforms performance for smart grid data.

Big data backend implementation


In Hbase, read and write processes have two stages: data transfer to RAM memory data-node that is called memstore and then copying on hfile. Figure 3 illustrate the process of Hbase read and write.

Fig. 3
figure 3

Hbase data model


We implement our cluster with three nodes as a ring. Figure 4 shows the mechanism of data processing in Cassanadra. In data insertion, each write command changes store in memtable memory on RAM and then on SSTable. Also in order to protect data against fault and Cassanadra nodes down, replica of data is stored temporarily in an optimized disc called commitlog for each change caused by write command.

Fig. 4
figure 4

Cassandra data model


In our MongoDB implementation we consider a process depicted in Fig. 5. The implementation has a mongos function that acts as a router and determines the path of data store. The data directly is written on tables, after the data store path is determined by mongos.

Fig. 5
figure 5

MongoDB data model


The data model of Elasticsearch that we have implemented is shown in Fig. 6.

Fig. 6
figure 6

Elasticsearch data model

Results and discussion

In this section, we evaluate the performance of each big data technology with default configurations in terms of effectiveness and scalability. The effectiveness will be evaluated by comparing WRITE and READ parameters. The scalability will be assessed by measuring the execution performance of the full mechanism. Table 2 summarizes the comparison of the four technologies performance. We compare big data technologies based on the most famous criteria including: Run time, Throughput, number of operations, and latency.

Table 2 The summarized performance criteria results

Figure 7 shows the run time of each big data database technology. We see that Cassanadra in write and read stages has less run time and consequently better performance.

Fig. 7
figure 7

Run time of different technologies

Comparison of throughput’s in ops/sec is presented in Fig. 8. It is shown that Cassandra has the best throughput and MongoDB performs the worst.

Fig. 8
figure 8

Throughput of different technologies

Figure 9 shows the percentile latency of different technologies. If A% latency of a technology is B, it means that A% of read or write latencies of this technology is less than or equal to B. For example in Fig. 9, we see that the 95% latency of MongoDB is 2 ms, thus 95% of read/write latencies of MongoDB are less than or equal to 2 ms. We see that Cassanadra has the smallest average latency in both read and write processes. Based on these results we observe that Cassandra has the best overall performance among the four open source big data database technologies that are used for analysis of smart grid big data in real scale. Generally as outlined in "Methods" section, because of smart grid meter data is structural column based data, that in our implementation the smart meter ID and time fields are considered as key partition for storing these data. Since the Cassandra is one of the best column base databases, writing on this database is better than other technologies. Also, Cassandra read operation is faster than other technologies due to key structure and sorting data according to time.

Fig. 9
figure 9

Percentile-latency of Elasticsearch, MongoDB, Hbase and Cassandra comparison


In this paper we proposed a framework to analyze smart grid big data. At first a comprehensive architecture for evaluation of smart grid big data is proposed. Then we introduce a scalable data generator to overcome to lack of access to real smart grid big data. Since there are different big data technologies at this time, we selected four of the best open source big data NoSQL databases and integrated them in the architecture of our large scale smart grid data generator. Based on our implementation, we evaluated the performance of Elasticsearch, MongDB, Hbase and Cassandra in terms of different criteria. Our results show that among these database technologies, Cassandra has the best overall performance and satisfies all requirements for smart grid bid data analysis. In the future, we plan to perform cyber security analysis using smart grid big data based on Cassandra. Also we are going to extend our data generator to generate data related to electricity theft and crypto-currency mining.

Availability of data and materials

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.


  1. Ansari MH, Vakili VT, Bahrak B, Tavassoli P. Graph theoretical defense mechanisms against false data injection attacks in smart grids. J Mod Power Syst Clean Energy. 2018;6(5):860–71.

    Article  Google Scholar 

  2. Ye F, Qian Y, Hu RQ. Big data analytics and cloud computing in the smart grid. Smart Grid Commun Infrastruct. 2018;171–185.

  3. Zhou K, Fu C, Yang S. Big data driven smart energy management: from big data to big insights. Renew Sustain Energy Rev. 2016;56:215–25.

    Article  Google Scholar 

  4. Vasilakos A, Hu J. Energy big data analytics and security: challenges and opportunities. IEEE Trans Smart Grid. 2016;PP(99):1.

    Article  Google Scholar 

  5. Munshi AA, Mohamed YA. Big data framework for analytics in smart grids. Electric Power Syst Res. 2017;151:369–80.

    Article  Google Scholar 

  6. Tu C, He X, Shuai Z, Jiang F. Big data issues in smart grid—a review. Renew Sustain Energy Rev. 2017;79(May):1099–107.

    Article  Google Scholar 

  7. Liang F, Yu W, An D, Yang Q, Fu X, Zhao W. A survey on big data market: pricing trading and protection. IEEE Access. 2018;6:15132–54.

    Article  Google Scholar 

  8. Kang C, Wang Y, Xue Y, Mu G, Liao R. Big data analytics in china’s electric power industry: modern information, communication technologies, and millions of smart meters. IEEE Power Energy Mag. 2018;16(3):54–65.

    Article  Google Scholar 

  9. Wen L, Zhou K, Yang S, Li L. Compression of smart meter big data: a survey. Renew Sustain Energy Rev. 2018;91(April):59–69.

    Article  Google Scholar 

  10. Munshi AA, Mohamed YARI. Data lake lambda architecture for smart grids big data analytics. IEEE Access. 2018;6:40463–71.

    Article  Google Scholar 

  11. Jiang R, Lu R, Choo KKR. Achieving high performance and privacy-preserving query over encrypted multidimensional big metering data. Future Gener Comput Syst. 2018;78:392–401.

    Article  Google Scholar 

  12. Chang YC, Huang TC. An interactive smart grid communication approach for big data traffic. Comput Electr Eng. 2018;67:170–81.

    Article  Google Scholar 

  13. Chin Wl, Li W, Chen Hh. Energy Big Data Security Threats in IoT-Based Smart Grid Communications (October), 2017;70–75.

  14. Hou W, Ning Z, Guo L, Zhang X. Temporal, functional and spatial big data computing framework for large-scale smart grid. IEEE Trans Emerg Topics Comput. 2017;6750(1):1–1.

    Article  Google Scholar 

  15. Junaidi N. Big Data Applications in Electric Energy Systems. In: 2018 international conference on computational approach in smart systems design and applications (ICASSDA). 2017; p. 1–5.

  16. Shuai C, Yang H, Ouyang X, He M, Gong Z, Shu W. Analysis and identification of power blackout-sensitive users by using big data in the energy system. IEEE Access. 2018; PP(c):1–1 .

    Article  Google Scholar 

  17. Liu X, Nielsen PS. Scalable prediction-based online anomaly detection for smart meter data. Inf Syst. 2018;77:34–47.

    Article  Google Scholar 

  18. Moghaddass R, Wang J. A hierarchical framework for smart grid anomaly detection using large-scale smart meter data. IEEE Trans Smart Grid. 2018;9(6):5820–30.

    Article  Google Scholar 

  19. Vimalkumar K, Radhika N. A big data framework for intrusion detection in smart grids using Apache Spark. 2017 International conference on advances in computing, communications and informatics (ICACCI). IEEE, p. 198–204, 2017.

  20. Singh S, Yassine A. Mining energy consumption behavior patterns for households in smart grid. IEEE Trans Emerg Topics Comput. 2017;6750(c):1–1 .

    Article  Google Scholar 

  21. Wu J, Ota K, Dong M, Li J, Wang H. Big data analysis based security situational awareness for smart grid. IEEE Trans Big Data 2016;7790(c):1–1 .

    Article  Google Scholar 

  22. Wang K, Xu C, Zhang Y, Guo S, Zomaya A. Robust big data analytics for electricity price forecasting in the smart grid. IEEE Trans Big Data 2017;7790(c):1–1 .

    Article  Google Scholar 

  23. Ye F, Qian Y, Hu RQ. A secure data learning scheme for big data applications in the smart grid. 2017.

  24. Nguyen HK, Khodaei A, Han Z. A big data scale algorithm for optimal scheduling of integrated microgrids. IEEE Trans Smart Grid. 2018;9(1):274–82.

    Article  Google Scholar 

  25. Sheng G, Hou H, Jiang X, Chen Y. A novel association rule mining method of big data for power transformers state parameters based on probabilistic graph model. IEEE Trans Smart Grid. 2018;9(2):695–702.

    Article  Google Scholar 

  26. Liu X, Golab L, Golab W, Ilyas IF, Jin S. Smart meter data analytics. ACM Trans Database Syst. 2016;42(1):1–39.

    Article  Google Scholar 

  27. Iftikhar N, Liu X, Danalachi S, Nordbjerg FE, Vollesen JH. A scalable smart meter data generator using spark. In: OTM Confederated International Conferences “On the move to meaningful internet systems”. Cham: Springer; 2017. p. 21–36.

    Chapter  Google Scholar 

  28. Electricity customer behaviour trial.

  29. Iftikhar N, Liu X, Nordbjerg FE, Danalachi S. A Prediction-Based Smart Meter Data Generator. In: NBiS 2016—19th international conference on network-based information systems. 2016; p. 173–80.

  30. Kafka, Walkenbach J. The definitive guide. 2010;.

  31. Harrison G. Next generation databases. 2015;.

    Book  Google Scholar 

  32. Vaish G. Getting Started with NoSQL. Packt Publishing Ltd, 2013.

  33. Hewitt E. Cassandra: the definitive guide, first. edit ed. Sebastopol: O’Reilly; 2011.

    Google Scholar 

  34. Kan C. Cassandra data modeling and analysis. Birmingham: Packt Publishing Ltd; 2014.

    Google Scholar 

  35. Kristina Chodorow and Michael Dirolf: MongoDB: The Definitive Guide (2010).

  36. Gormley C, Tong Z. Elasticsearch: the definitive guide: a distributed real-time search and analytics engine. Sebastopol: O’Reilly Media, Inc.; 2015.

    Google Scholar 

Download references


Not applicable.


Not applicable.

Author information

Authors and Affiliations



MHA performed the study, developed the models and implemented it, performed the analysis and drafted the manuscript. VT is a leader of research, conducted the research and approved the results. BB worked with MHA and VT to develop the articles framework and interpreted results. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Vahid Tabatab Vakili.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ansari, M.H., Tabatab Vakili, V. & Bahrak, B. Evaluation of big data frameworks for analysis of smart grids. J Big Data 6, 109 (2019).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: