Quality management architecture for social media data
© The Author(s) 2017
Received: 13 December 2016
Accepted: 15 March 2017
Published: 23 March 2017
Social media data has provided various insights into the behaviour of consumers and businesses. However, extracted data may be erroneous, or could have originated from a malicious source. Thus, the quality of social media data should be managed. It should also be understood how data quality can be managed across a big data pipeline, which may consist of several processing and analysis phases. The contribution of this paper is the evaluation of a data quality management architecture for social media data. Theoretical concepts from previous work have been implemented for data quality evaluation of Twitter-based data sets. Particularly, a reference architecture for quality management of social media data has been extended and evaluated based on the implementation architecture. Experiments indicate that 150–800 tweets/s can be evaluated with two cloud nodes, depending on the configuration.
Keywords: Quality attribute, Quality metric, Quality policy, Spark, Cassandra, Word2Vec
Social media data can be analysed to gain insights into the behaviour of consumers [1, 2] or businesses. However, the data to be analysed may have originated from a spam campaign or contain false information. Therefore, filtering may be needed before meaningful insights can be created. Machine learning methods have been utilised for analysing and predicting the credibility of social media data [6–8]. Also, frameworks have been developed for data quality management in social media [9, 10]. However, reference architecture (RA) design for data quality management of social media data has only been addressed recently. The RA aims to facilitate the design of quality management aspects in the implementation architectures of big data systems.
This research extends earlier work, which focused primarily on social media data quality evaluation for decision making, and on initial validation of a metadata management architecture for big data systems. This work focuses on validation and extension of the earlier RA. Particularly, the proposed architecture has been implemented in a business context, where a company utilised tweet sentiment analysis in product development. A tool has been developed for data quality management, which enables evaluation, filtering, and querying of tweet-related quality information based on user-defined rules. Performance results indicated that 150–800 tweets/s can be evaluated on two cloud nodes, depending on the configuration.
The document is structured as follows. First, related work is reviewed. Then, the research question and research method are presented. Next, the design of the implemented architecture is illustrated with unified modelling language (UML) views. The RA for data quality management in big data systems has been enhanced based on the implementation architecture. The RA is evaluated based on technology selections and performance. Finally, the main lessons learnt and future work are discussed. The “Appendix” includes a detailed data view of quality rules and metadata.
Additionally, important theoretical concepts for data quality management have been defined in earlier work. The quality of data can be understood through associated quality attributes. A quality attribute represents a single aspect of quality (e.g. timeliness) [11, 14]. Quality metrics are used for measuring the properties of a quality attribute. An example of a quality metric is a model, which evaluates timeliness based on the timestamp of a data item. Quality evaluation refers to evaluating the quality of information (data with meaning), where the context and intended use of the information are taken into account. Quality evaluation can be performed by utilizing quality policies. The purpose of quality policies is to enable an organisation to define contexts for different situations where social media data is utilised for decision making. An organisational quality policy defines acceptable data sources and the context of the task. A filtering policy is used for evaluating only the quality dimensions that are relevant to satisfying the requirements of a data consumer.
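The relationships between these concepts can be illustrated with a minimal sketch. The class and field names below are illustrative assumptions for exposition, not the actual schema of the implementation:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class QualityAttribute:
    """A single aspect of quality, e.g. timeliness."""
    name: str
    metric_ref: str  # reference to the metric that measures the attribute

@dataclass
class FilteringPolicy:
    """Acceptable value range per quality attribute; data falling
    outside any range is discarded from decision making."""
    ranges: Dict[str, Tuple[float, float]]  # attribute name -> (min, max)

    def accepts(self, scores: Dict[str, float]) -> bool:
        # every configured attribute must be present and within range
        return all(lo <= scores.get(attr, float("-inf")) <= hi
                   for attr, (lo, hi) in self.ranges.items())

@dataclass
class OrganisationalQualityPolicy:
    """Defines acceptable data sources and the context of the task."""
    data_sources: List[str]
    context: str
    filtering: FilteringPolicy
```

For example, a policy with `ranges={"timeliness": (0.5, 1.0)}` accepts a tweet scored 0.8 for timeliness and rejects one scored 0.2.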
Also, other solutions have been developed for quality management of social media data. Social-QAS is a tailorable quality management service for social media content. The tool enables quality assessment based on metadata, content, classification of messages, and scientific methods. End users can search for information with different quality parameters in a Facebook application, which focuses on emergency services. A dynamic quality evaluation concept has been implemented for supporting emergency situations. Specifically, searches of social media data can be filtered by weighting different quality parameters (e.g. timeliness, author reputation). A quality framework has been developed for supporting data quality profile selection and adaptation. Particularly, the framework has been implemented for the pre-processing of medical data. A hybrid approach has been proposed for quality evaluation across the big data value chain. The quality of medical data sets has been evaluated for the detection of cardiovascular risks and sleep-disordered breathing.
Social media data analysis has facilitated decision making within enterprises. For example, the role of Twitter-based analytics for supply chain management (SCM) has been studied. Particularly, a framework was proposed for extracting intelligence from Twitter based on descriptive, content, and network analytics. Tweets have been analysed for gaining competitive intelligence in retail. Particularly, sentiment, mentions, and topics were analysed in product categories. Social media channels (Facebook, Twitter) have also been integrated with the processing of customer orders in an enterprise resource planning (ERP) system. Several tools exist for the validation of data from enterprise systems. The tools typically evaluate the uniqueness, conformance, completeness, accuracy, and consistency of data. The impact of data quality metadata for decision making purposes has been studied. The results suggested that the use of data quality metadata may be enhanced with related training.
Methods and algorithms for social media data analysis can be categorized into network analytics, community detection algorithms, text analytics, information diffusion models and methods, and information fusion. Text analytics covers natural language processing (NLP), information extraction, data mining, and machine learning approaches. Information extraction refers to techniques for extracting entities and their relationships from text. Information extraction has been utilised for entity extraction [24–26], and for making tweet hashtag recommendations. Neural network-based language models have been developed for the estimation of word representations. Particularly, high-quality word similarity vectors can be learned from billions of words based on the continuous bag-of-words or skip-gram model. Google has also published pre-trained models as part of their Word2Vec implementation. Hashtag recommendation has been proposed by learning word representations with Word2Vec’s skip-gram model and a trained neural network. Another application based on Word2Vec is restaurant recommendation based on similarities between tweets and specified keywords related to foodborne disease symptoms.
Machine learning approaches can be categorized into supervised and unsupervised approaches. Unsupervised learning approaches create a predictive model based on input data with a clustering method. Unsupervised approaches have been used for topical clustering of tweets [32, 33], and for tweet hashtag recommendations. Supervised methods utilize training data (input/output about the phenomenon) for classification or regression types of algorithms. A supervised machine learning approach has been used for sentiment analysis of tweets. Sentiment analysis has also been used as a factor in a regression model for predicting box-office revenues for movies.
More importantly, quality aspects of social media data have been analysed mainly with supervised machine learning. Automatic methods have been developed for assessing the credibility of Twitter-based data sets. Classifiers trained with a similar approach have been developed for detecting the newsworthiness and credibility of tweets. TweetCred is a real-time system for assigning a credibility score to a Twitter user’s timeline. The usefulness of comments in Flickr and YouTube has been studied. The results indicated that a few straightforward features can be used for detecting usefulness. A model has been trained on user and tweet characteristics using supervised learning for identifying misinformation in Twitter (accuracy ~77%). CREDBANK is a corpus of tweets, topics, events, and associated credibility scores comprising more than 1 billion tweets. Annotations for the corpus were crowdsourced with Amazon Mechanical Turk, focusing on event detection and credibility assessment of the events (each a set of tweets).
The review indicates that many applications exist where social media data analysis has facilitated decision making within an enterprise [3, 19, 20]. While several methods and algorithms can be utilised for social media data analysis, typically supervised learning-based methods have been used for evaluating the quality aspects of social media data [6, 36, 38]. In order to understand the quality aspects of social media data, different approaches have been proposed for managing the quality of social media data [9–11, 17, 18]. However, only some of the approaches [10, 11] focus on the development of an architecture for managing data quality in a big data system. The contribution of this paper is the validation of an RA design for quality management of social media data. Particularly, a proof-of-concept implementation of the RA has been created, which has been validated in a product development context focusing on sentiment analysis of tweets.
Research question and method
RQ1: How is a quality management architecture constructed for evaluating the quality of social media data?
The research method follows the framework proposed for analysis and design of empirically grounded RAs [39, 40]. The approach consists of deciding a type for the RA (step 1), selection of a design strategy (step 2), empirical acquisition of data (step 3), construction of the RA (step 4), enabling of variability (step 5), and evaluation (step 6). In previous work, an RA was developed for big data systems. A facilitation type of RA was selected (step 1), which should provide guidelines for the design of similar big data systems. The design strategy was selected as practice-driven (step 2). The RA was created based on realised implementation architectures gathered from publications and blogs (steps 3–4). In other earlier work, the developed RA was initially implemented and evaluated (step 6) for data quality management in social media sources, and a new metadata management layer was designed as part of the RA.
This research aims to extend the earlier RA. First, the existing architecture implementation has been extended for the acquisition of new empirical data (step 3). The implementation has been deployed into a Eucalyptus cloud computing environment. Particularly, a tool has been implemented, which enables the management of quality aspects in Twitter-based data sets. Then, the existing RA has been extended (step 4) and evaluated (step 6) based on new empirical data gathered from the implemented architecture (step 3). The variability aspect (step 5) was not a focus of this study.
This section presents the design of the architecture implementation. The design is presented from the point of view of utilizing the developed tool for quality management of Twitter-based data sets. First, a use case view illustrates how a company (Invenco) can manage the quality of social media data. Then, a high-level data view of the system is presented. A deployment view illustrates how the different components of the tool are executed in the target development environment. Finally, component and sequence views demonstrate how rules are created for data quality management, and how data quality is evaluated and queried in the developed system.
Use case view
Next, it should be identified what level of quality is acceptable for an organisation. For this purpose, a filtering policy was defined, which specified an acceptable level for each quality attribute. The role of the filtering policy is to discard data with unacceptable quality from decision making. Also, quality metrics should be defined by a person who understands the basics of data science (e.g. a data scientist). Quality metrics specify how quality attributes are evaluated based on the social media data under study.
For the utilization of data in decision making, a search filtering policy was defined. The policy specified the acceptable quality level for tweets to be returned in queries.
High level data view
The quality rules store contains data structures from the data view (Figs. 3, 14). Metadata is stored in the metadata store. The data store contains social media data, which has been evaluated and validated successfully against a filtering policy.
The stores have been implemented with Cassandra, which is deployed in the Front-end node. The Front-end node also provides a representational state transfer (REST) API for interacting with end users (provided by MetadataQualityManagement). The Back-end node performs data quality evaluation of tweets. QualityEvaluator provides a REST API for the execution of quality evaluation operations based on requests received from the Front-end node. The TwitterAnalysis component is executed in the Spark Streaming environment. Particularly, QualityEvaluator deploys the TwitterAnalysis process to a Spark cluster. The Relevancy node performs relevancy evaluation of tweets. The Word2Vec service provides a TCP socket interface for relevancy evaluation, which is utilized by the TwitterAnalysis component.
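The TCP socket interface of the Word2Vec service can be sketched as a line-oriented exchange: the client sends the words of a tweet on one line and reads back a relevancy value. The sketch below uses a toy scoring stub in place of the actual Word2Vec model, and the exact wire protocol is an assumption:

```python
import socket
import threading

def relevancy_stub(conn):
    """Toy stand-in for the Word2Vec service: reads one line of tweet
    words per request and writes back a fake relevancy score."""
    with conn, conn.makefile("rw") as f:
        for line in f:
            words = line.split()
            # placeholder scoring; the real service would compare word vectors
            score = sum(1 for w in words if w in {"food", "illness"}) / max(len(words), 1)
            f.write(f"{score}\n")
            f.flush()

def serve_once(host="127.0.0.1"):
    """Start the stub service on an ephemeral port; serve one connection."""
    srv = socket.create_server((host, 0))
    port = srv.getsockname()[1]
    def run():
        conn, _ = srv.accept()
        relevancy_stub(conn)
        srv.close()
    threading.Thread(target=run, daemon=True).start()
    return port

def query_relevancy(port, words):
    """Client side, as the TwitterAnalysis component would use it."""
    with socket.create_connection(("127.0.0.1", port)) as s, s.makefile("rw") as f:
        f.write(" ".join(words) + "\n")
        f.flush()
        return float(f.readline())
```

A plain socket keeps the per-tweet overhead low compared to an HTTP round trip, which matches the paper's choice of a TCP interface for this high-volume path.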
Detailed component view: Front-end node
Responsibilities of components for metadata quality management
Provides REST API for searching of metadata based on end user queries
Provides REST API for creation of metadata
Adapters for interaction between REST API and lower SW components
Manages quality aspects of metadata for data sets
Manages quality evaluation for quality attributes
Manages quality policies for quality evaluation
Validates metadata received from MetadataCollectionInterface
Components provide REST APIs for collection of quality policies, profiles, quality metrics, and quality attributes
Components provide REST APIs for searching of quality policies, profiles, quality metrics, and quality attributes
Internal interfaces for saving of quality policies, profiles, quality metrics, and quality attributes to a database
Internal interfaces for searching for quality policies, profiles, quality metrics, and quality attributes from a database
Sequence view: creation of quality rules
- Steps 1–2: Product manager creates a new profile for decision making, and defines a decision point (‘Sentiment analysis’)
- Steps 3–4: The product manager identifies timeliness as an important quality attribute, which is communicated to the data scientist. A new quality attribute is created for timeliness. The data scientist includes a reference to a data processing tool, which is capable of evaluating the quality attribute (timeliness)
- Steps 5–6: A quality metric is created for timeliness, which is specified as JRuleEngine rules. The rules specify how tweet metadata is used for calculating a value for timeliness
- Steps 7–8: The product manager creates a filtering policy, and specifies an acceptable quality level for tweets regarding timeliness
- Steps 9–10: An organisational quality policy is created by the product manager. The policy defines the applicability of timeliness for the data source (Twitter). Also, an associated filtering policy is referred to. Optionally, an associated identifier of metadata may be included in the policy
Sequence view: searching of data quality information for supporting decision making
- Steps 1–3: The end user/application creates a search filtering policy. Quality attributes and associated ranges are provided. It is also indicated whether social media data should be included in the response. Optionally, a metadata identifier and time range may be provided. The policy is saved into the quality rules store
- Step 4: The end user sends an HTTP GET to MetadataSearchEngine, and provides the identifier of the search filtering policy
- Steps 5–7: Metadata is searched from the metadata store based on the input parameters
- Steps 8–10: If matching metadata is found, related metadata for data items (Fig. 15) is read from the metadata store
- Steps 11–13: Social media data is read from the data store, if inclusion of data has been specified in the search filtering policy (Fig. 14)
- Steps 14–16: Metadata and related data are returned to the end user/application for supporting decision making
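The search sequence above can be sketched in a few lines; the store layout and field names are simplified assumptions rather than the actual Cassandra schema:

```python
def search(metadata_store, data_store, policy):
    """Return metadata (and optionally data) for items whose quality
    scores fall within the ranges of a search filtering policy."""
    hits = []
    for item_id, meta in metadata_store.items():
        scores = meta["quality"]
        # every attribute named in the policy must be within its range
        ok = all(lo <= scores.get(attr, float("-inf")) <= hi
                 for attr, (lo, hi) in policy["ranges"].items())
        if not ok:
            continue
        hit = {"metadata": meta}
        if policy.get("include_data"):  # inclusion of data is optional
            hit["data"] = data_store.get(item_id)
        hits.append(hit)
    return hits
```

With a policy requiring timeliness in [0.5, 1.0], only items meeting that range are returned, with the social media data attached when `include_data` is set.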
Component view: Back-end node
Responsibilities of sub-components in TwitterAnalysis
Coordination of data quality evaluation process for tweets
Rules in JRuleEngine-format for evaluation of timeliness and popularity (quality attributes)
Filtering of data based on a filtering policy
Analysis of tweet metadata for evaluation of timeliness and popularity
Analysis of tweet content for evaluation of relevancy
Extraction of tweets from an external data source
Metrics for evaluation of relevancy
Sequence view: data quality evaluation (sequence 1)
- Steps 1–3: The end user (of Invenco’s product) searches for data from Twitter APIs, which is extracted, and saved into a temporary data store in the big data system
- Step 4: The data extractor (of Invenco’s product) transmits metadata about the stored data set to MetadataCollectionEngine by using the REST API (Fig. 5)
- Steps 5–6: The metadata is received and validated
- Steps 7–9: The metadata is compared against matching profiles and organisational quality policies. The data source type and identifier of metadata are matched against organisational quality policies
- Steps 10–11: If an applicable policy is found, the metadata is accepted for further evaluation, and stored into the metadata store
- Steps 12–13: The identifier of metadata, quality attributes, and filtering policy are transmitted to NetworkService (in an HTTP POST), which is executed in the Back-end node (Fig. 8)
- Steps 14–17: The HTTP POST is received. Processing configurations (Fig. 14) are read from the quality rules store. The information contains the data quality evaluation task to be started, and the configuration/processing parameters to be used for starting the analysis process
- Step 18: The TwitterAnalyser process is started in the Spark Streaming context (Fig. 8). The identifiers of the transaction and the Java process are saved
- Steps 19–21: An HTTP 200 OK response is transmitted to MetadataQualityEvaluator (at the Front-end node). The HTTP 200 OK is transmitted onward to the data extractor in the big data system
- Steps 22–25: The transaction identifier is matched against the Java process identifier. When tweet processing has been completed, the Spark Streaming process (based on the Java process identifier) is stopped
Data quality evaluation (sequence 2)
- Step 1: TwitterAnalyser is started in the Back-end node
- Steps 2–5: Metadata and quality attributes are read from the metadata/quality rules store based on input received from the Front-end node. The processing configuration is retrieved based on the reference in the quality attribute
- Step 6: The data quality analysis processing to be executed is filtered based on the supported quality processing information (Fig. 14). Quality attributes are matched with supported processing based on the data source type. Also, the target of a quality attribute must match the output_data_type parameter of the supported processing (Fig. 14)
- Steps 7–9: Quality metrics are read from the quality rules store. The metrics are decoded with appropriate decoders. The format field of a metric (Fig. 14) indicates how the metric should be decoded
- Steps 10–11: A filter is created for tweets. The filtering policy is read from the quality rules store
- Steps 12–14: The tweet extractor, metadata analyser, and relevancy analyser are created
- Steps 15–16: The metadata and relevancy analysers are registered to the tweet extractor for processing of tweets
- Steps 17–18: The tweet extractor downloads tweets based on the DataSetLocation parameter of the metadata (Fig. 15). Each retrieved tweet is forwarded for further processing
- Steps 19–21: The metadata analyser parses tweet metadata from the tweet. Timeliness and popularity are evaluated based on the associated metric and the tweet metadata. The values of the quality attributes are saved into the metadata store
- Steps 22–23: Analysed tweets and quality attributes are filtered. If all quality attributes associated with a tweet satisfy the filtering policy, the tweet is saved into the data store. Otherwise, the tweet is discarded
- Steps 24–26: When all tweets have been streamed, the tweet extractor informs the upper layer (QualityEvaluationService in the Back-end node) about completion (by providing a transaction identifier). The indication is utilized for stopping the data evaluation process (step 24 in Fig. 10)
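The registration of analysers to the tweet extractor and the per-tweet filtering flow can be sketched as follows. The class and method names are illustrative assumptions; they mirror the sequence rather than reproduce the Java implementation:

```python
class TweetExtractor:
    """Streams tweets and forwards each one to registered analysers,
    then keeps only tweets whose scores satisfy the filtering policy."""
    def __init__(self):
        self.analysers = []

    def register(self, analyser):
        """Analysers (e.g. metadata or relevancy) receive each tweet."""
        self.analysers.append(analyser)

    def stream(self, tweets, filtering_policy):
        accepted = []
        for tweet in tweets:
            scores = {}
            for analyser in self.analysers:
                scores.update(analyser(tweet))  # e.g. timeliness, relevancy
            # a tweet is kept only if every attribute is within range
            if all(lo <= scores.get(attr, float("-inf")) <= hi
                   for attr, (lo, hi) in filtering_policy.items()):
                accepted.append((tweet, scores))
        return accepted
```

The observer-style registration keeps the extractor independent of which quality attributes are evaluated, so new analysers can be attached without changing the streaming logic.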
Metadata management architecture
The quality rules store includes quality attributes, quality policies, quality metrics, and profiles (Fig. 14), which are utilised for managing the quality of data sets. Metadata contains information about data sets in different dimensions, including quality aspects (Fig. 15). The data store contains social media data (tweets in the prototype system), which has been evaluated.
Metadata management refers to creation of metadata, and providing access to it. MetadataCollectionEngine and MetadataSearchEngine components of the Front-end node (Fig. 5) encapsulated functionality of metadata management. Quality management refers to managing quality aspects of data sets with user-defined quality rules. In the prototype system MetadataQualityManagement, MetadataQualityEvaluator, and MetadataQualityPolicyManager of the Front-end node implemented quality management functionality (Fig. 5). Quality evaluation refers to analysing quality of social media data sets based on quality metrics, which have been selected to a context based on quality rules. In the prototype system quality evaluation was comprised of QualityEvaluator and TwitterAnalysis components of the Back-end node (Fig. 8), and Word2Vec-service of the relevancy-node (Fig. 4).
In the big data system, social media data was extracted from Twitter (data source), and stored. The raw data was transferred to the data quality management tool for quality evaluation. Sentiment analysis (deep analytics) was performed, and the results were saved (analysis results). The sentiment analysis results will then be filtered (transformation) based on the data quality information. Finally, the filtered information will be visualised to the end users in an application.
In the following, the technology selections for the implementation are evaluated in terms of the data quality management architecture:
Cassandra was used for storing metadata, quality rules, and data (see Additional file 1). However, each of the conceptual stores could also have been implemented with a different database technology. Data was modelled based on the expected queries to be served, which is one of the principles of data modelling for Cassandra. From the end user’s point of view, quality policies, quality attributes, and quality metrics have to be defined for data quality management (Fig. 2). This information was stored into the database (see Additional file 1). The results of tweet quality evaluation were stored into the metadata store (metadata_dataitems in Additional file 1), and the social media data into the data store (data_store in Additional file 1). An example of data quality management over the REST API is provided in Additional file 3. The performance of the data management is evaluated in the following section.
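The query-driven modelling principle can be illustrated with a plain dictionary standing in for a table. The table name mirrors the paper's metadata_dataitems store, but the key layout is an assumption used only to demonstrate the idea:

```python
# One "table" per expected query, keyed by the query parameters,
# analogous to choosing Cassandra partition/clustering keys so that
# each expected read becomes a direct key lookup.
metadata_dataitems = {}  # (dataset_id, tweet_id) -> quality scores

def save_quality(dataset_id, tweet_id, scores):
    metadata_dataitems[(dataset_id, tweet_id)] = scores

def quality_for_dataset(dataset_id):
    """Expected query: all per-tweet quality rows for one data set."""
    return {tid: s for (did, tid), s in metadata_dataitems.items()
            if did == dataset_id}
```

In an actual Cassandra schema, dataset_id would be the partition key and tweet_id a clustering column, so that this query is served as a single-partition read rather than the scan shown above.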
External and internal APIs
The external APIs were implemented based on the REST communication paradigm. Interface messages were defined in extensible markup language (XML) format, and implemented with Jersey. The REST API of the QualityEvaluator in the Back-end node (Fig. 8) was implemented as an embedded Jetty server (Jersey could also have been used). The Word2Vec service’s API (Fig. 8) in the Relevancy node was implemented as a TCP server, which accepted the words of tweets as input, and returned the calculated relevancy value as a response.
Social media data extraction for quality evaluation
Quality metrics for quality evaluation
Different metrics were needed for evaluation of the quality attributes (see Additional file 2). Timeliness was evaluated based on the timestamp (created_at) of a tweet. The retweet count (retweet_count), number of friends (friends_count), and number of followers (followers_count) were utilised for evaluating the popularity of a tweet (see Additional file 3 for an example). The metric for evaluating quality attributes based on tweet metadata was implemented as a file, which contained evaluation rules. The Java Rule Engine API was utilised for encoding/decoding the evaluation rules in XML format (steps 8–9 in Fig. 11). Particularly, the XML file contained weighted parameters of tweet metadata for quality evaluation.
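A weighted metric of this kind can be sketched as follows. The weights, caps, and decay horizon below are hypothetical stand-ins for the JRuleEngine XML rules, whose actual parameter values are not given here:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical weights and normalisation caps; the real values lived
# in the XML rule files of the implementation.
POPULARITY_WEIGHTS = {"retweet_count": 0.5, "friends_count": 0.2,
                      "followers_count": 0.3}
POPULARITY_CAPS = {"retweet_count": 100, "friends_count": 1_000,
                   "followers_count": 10_000}

def timeliness(created_at, now, horizon=timedelta(days=1)):
    """1.0 for a brand-new tweet, decaying linearly to 0.0 at the horizon."""
    age = (now - created_at) / horizon  # timedelta / timedelta -> float
    return max(0.0, 1.0 - age)

def popularity(meta):
    """Weighted sum of capped, normalised tweet metadata counters."""
    return sum(w * min(meta[f], POPULARITY_CAPS[f]) / POPULARITY_CAPS[f]
               for f, w in POPULARITY_WEIGHTS.items())
```

Keeping the weights in an external rule file, as the implementation did, allows the metric to be tuned without recompiling the evaluation code.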
Relevancy evaluation relied on Google’s Word2Vec implementation. The DeepLearning4J library provided a Java API to Google’s pre-trained word vector model, which is based on the Google News data set. The implementation was executed on a separate Relevancy node (Fig. 4), due to the large memory consumption of the model (zipped model file ~1.6 GB). The Word2Vec algorithm was utilised for calculating the word cosine distance between the words of a tweet and context words. A metric file (see Additional file 2) was utilized for adjusting a threshold on the word cosine distance, indicating when a tweet word is relevant. The metric file also contained the context words, which were compared to tweet words. Another metric (in JRuleEngine format) was used for calculating the final relevancy score based on the number of words found to be relevant in a tweet.
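The threshold-based relevancy check can be sketched with toy two-dimensional vectors in place of the pre-trained Word2Vec model. The scoring rule (fraction of tweet words near a context word) and the threshold value are simplifying assumptions:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm

def relevancy(tweet_words, context_words, vectors, threshold=0.7):
    """Fraction of tweet words that lie within the cosine-similarity
    threshold of at least one context word (toy vectors, not Word2Vec)."""
    relevant = sum(
        1 for w in tweet_words
        if w in vectors and any(
            cosine_similarity(vectors[w], vectors[c]) >= threshold
            for c in context_words if c in vectors))
    return relevant / max(len(tweet_words), 1)
```

With word vectors from a real model, the same thresholding separates tweet words close to the context (e.g. food-related terms) from unrelated ones.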
The performance of data quality evaluation was studied. In the experiments, 176,478 tweets (~798 MB) were transmitted for quality evaluation. The tweets were extracted from the public Twitter API. The average size of a tweet was ~4.6 KB. The experiments were executed within the Eucalyptus cloud computing environment (Fig. 4), where the Front-end node had two vCPUs and 4 GB RAM. The Back-end and Relevancy nodes had six vCPUs and 40 GB RAM. Two vCPUs and 4 GB RAM were allocated for the TwitterAnalysis component (Fig. 8) within the Spark Streaming cluster. An additional node (six vCPUs and 40 GB RAM) was used for simulating a Twitter data source, which utilized Netcat for streaming tweets over a TCP connection. TCP was used instead of HTTP to minimise protocol overhead in the experiments. The tests were performed five times, and the average processing rate of tweets is reported.
Main lessons learnt
Quality data was stored separately for each tweet (metadata_dataitems in Additional file 1). Therefore, multiple updates/writes to Cassandra were batched, which is known to increase performance when applied to data residing in the same cluster. Batching increased the performance of data quality evaluation, as expected (Fig. 13). Alternatively, the quality attributes could have been modelled differently. For example, the values of quality attributes could have been stored inside a Cassandra collection, which supports storing up to 2 billion entries.
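The batching itself is a simple grouping of single-row writes; the batch size below is an arbitrary illustrative value, not the one used in the experiments:

```python
def batches(writes, batch_size=50):
    """Group individual write statements into batches so each round
    trip to the store carries batch_size statements instead of one."""
    for i in range(0, len(writes), batch_size):
        yield writes[i:i + batch_size]
```

In Cassandra, such batches yield the largest benefit when the batched rows share a partition, since the batch can then be applied by a single replica set.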
Relevancy evaluation was based on the word cosine distance measure, which is used by the Word2Vec implementation. The Word2Vec model had to be loaded initially into memory, which takes ~4.5 min on a cloud node. Due to the slowness of loading and the large memory consumption, relevancy evaluation was performed in a separate Eucalyptus instance. The experiments indicated that relevancy evaluation was slower when compared to the evaluation of timeliness/popularity (Fig. 13). The Word2Vec service in the Relevancy node (Fig. 4) read the words of a tweet one line at a time from a single TCP socket. The performance of relevancy evaluation might be improved if multiple sockets were utilized for the communication.
The quality metrics were associated with the underlying implementation of quality evaluation. For example, metrics were used for calculating a value for popularity based on specified tweet metadata (friends count, followers count, etc.). However, if a different set of tweet metadata needed to be utilized for quality evaluation, the quality metrics and the quality evaluation implementation would need to be modified accordingly. An alternative would be an ontology-based approach to the definition of quality metrics and evaluation, where an updated representation of the concepts in the metrics would automatically propagate into the system.
Comparison to literature
Comparison to approaches in the literature (columns: domain/data set, quality level specification, quality metric specification, quality policy specification):

- Social media (Twitter, Facebook, Maps): quality level as a weighted score (link, credibility, up-to-datedness, dissemination, quality of coordinates); quality metrics in OpenSocial format, MSSQL database; quality policy as weighting of quality attributes
- Social media (Facebook): quality level based on metadata, content, message classification, and scientific methods; quality metrics via Stanford NER, Classifier4J, Open Thesaurus, Gisgraphy Geocoder; quality policy as weighting of parameters
- Medical (EEG data): quality level as an XML file (targeted data quality); quality metric as an XML file (data cleansing algorithm); quality policy as a data quality profile
- Social media (Twitter): quality level as completeness, consistency, accuracy; quality metrics via Talend and Trifacta Wrangler, ML Vagrant
- Social media (Twitter; this work): quality attributes timeliness, relevancy, popularity; quality metrics implemented with Spark, Cassandra, Word2Vec; quality level as ranges specified in SearchFilteringPolicy; quality policy as dynamic quality policies
Immonen presented mainly quality evaluation of social media data based on quality policies. Also, an initial validation of the RA for quality management of social media data was presented. The RA (Fig. 1) has been extended in this work with an implementation of quality evaluation for tweets, evaluation of new quality attributes based on the developed metrics, and adaptable quality rules stored in a database. The main contribution of this work is the presentation of the metadata management layer for big data systems (Fig. 12), which has been validated with an implementation. Additionally, the performance of the implementation has been evaluated from the data quality evaluation point of view.
Extension of the RA based on other implementation approaches
Social media data may need to be cleaned or filtered before analysing its quality (Serhani). We did not focus on the pre-processing phase of the big data pipeline (Fig. 12). When social media data is cleaned/filtered, the quality of the data may improve, which should be saved into the associated metadata. Social media data may also be processed with offline data processing tools (Serhani). We aimed at automating the process of data quality evaluation instead of manual processing. If social media data sets were processed externally by the end user, the metadata/quality management part of the architecture would need to be extended.
Validation of the RA with new business cases
The revised RA was originally developed based on seven published implementation architectures of big data systems. In this work, the RA has been extended, and validated based on a realised architecture in a business case. In order to validate the suitability of the RA for various business cases in the context of social media data, the RA should be experimented with in several implementation architectures. This work can be considered a step towards that goal.
Performance of quality attributes and development of quality evaluation algorithms based on machine learning
Quality of social media data was evaluated by utilising either simple models (timeliness, popularity) or trained models (relevancy). In order to evaluate the performance (e.g. precision or recall) of the quality attributes, a ground truth based on human observations would be needed for comparison. The ground truth could be established by crowd-sourcing annotations of the evaluated social media data (e.g. with Amazon Mechanical Turk) [6, 7]. Also, if the credibility of tweets needed to be evaluated, a model might have to be developed based on supervised machine learning, where annotated data would be utilised for training the model (e.g. [6, 7, 36]).
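Simple attribute models of this kind can be illustrated with small scoring functions. The Python sketch below assumes an exponential-decay timeliness score and a saturating popularity score; the half-life, weights, and cap are hypothetical parameters, not the metrics used in the prototype.

```python
def timeliness(age_seconds: float, half_life: float = 86400.0) -> float:
    """Exponential-decay timeliness score in (0, 1]; with the assumed
    one-day half-life, a day-old tweet scores 0.5."""
    return 0.5 ** (age_seconds / half_life)

def popularity(retweets: int, followers: int, cap: int = 1000) -> float:
    """Popularity score in [0, 1] combining retweet and follower counts,
    saturating at `cap` (illustrative weighting only)."""
    return min(1.0, (retweets + 0.01 * followers) / cap)
```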
Performance optimisation in quality evaluation
Spark streaming was used in the implementation to study the feasibility of the technology for quality evaluation. Previously, sentiment analysis of tweets has been simulated in a similar environment. In this study, Spark streaming processed a lower number of tweets per second with one node (~150–800 tweets/s vs. ~1000 tweets/s). Performance may be improved with optimisations related to the storage of quality data and quality attribute evaluation (relevancy). Currently, tweets are analysed sequentially in the Spark streaming cluster. Parallel processing of tweets may need to be implemented in order to achieve better performance.
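The suggested optimisation amounts to evaluating a micro-batch of tweets concurrently instead of one by one. The Python sketch below uses a thread pool as a single-node stand-in for distributing work across Spark partitions; `evaluate_quality` is a placeholder, and threads mainly help when evaluation is I/O-bound (e.g. database writes).

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_quality(tweet: dict) -> dict:
    """Placeholder for per-tweet quality evaluation (illustrative score)."""
    return {**tweet, "quality": min(1.0, tweet.get("retweets", 0) / 100)}

def evaluate_batch(tweets, workers: int = 4):
    """Evaluate a micro-batch of tweets concurrently rather than
    sequentially; pool.map preserves the input order of the batch."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_quality, tweets))
```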
Streaming support for quality evaluation
In the prototype, no feedback is provided to the end user regarding the processing of tweets. As an alternative to the REST-based communication pattern, a streaming interface could be developed for providing processing indications to support the utilisation of quality information in real time.
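Such an interface could, for example, push per-tweet processing indications onto a stream that an endpoint (e.g. server-sent events) drains. A minimal queue-based Python sketch; the event format is an assumption.

```python
import queue

def process_with_progress(tweets, progress: "queue.Queue") -> None:
    """Emit a processing indication per tweet onto a queue that a
    streaming endpoint could forward to the end user in real time."""
    total = len(tweets)
    for i, tweet in enumerate(tweets, 1):
        # ... quality evaluation of `tweet` would happen here ...
        progress.put({"processed": i, "total": total})
    progress.put(None)  # sentinel: the stream has finished
```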
Updating of data quality information
Filtering and visualisation of tweets for facilitating decision making
Currently, quality data has been created based on Twitter data sets and filtered to support sentiment analysis of tweets. Subsequently, the quality information of individual tweets will be used for filtering in sentiment analysis. In particular, it should be determined how quality information will be visualised for end users in order to support decision making.
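Filtering on quality information can be sketched as a threshold over the evaluated attributes. In the Python sketch below, the aggregation (mean over available attributes) and the threshold are illustrative assumptions, not the rules used in the prototype.

```python
def filter_by_quality(tweets, min_score: float = 0.6,
                      attrs=("timeliness", "popularity", "relevancy")):
    """Keep tweets whose mean score over the given quality attributes
    meets the threshold; tweets without scores are dropped."""
    def mean_score(tweet):
        scores = [tweet["quality"][a] for a in attrs if a in tweet["quality"]]
        return sum(scores) / len(scores) if scores else 0.0
    return [t for t in tweets if mean_score(t) >= min_score]
```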
Multiple social media data sources and decision points
When an organisational actor has a need for managing the quality of social media data, quality rules are specified. In the prototype, a single organisational policy and profile were used for quality management of tweets. Multiple organisational policies would be needed when different social media data sources are utilised in a business case. Similarly, multiple policies would be needed for different decision points. In particular, it should be considered how data from one or more social media data sources can serve decision making in different business contexts.
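One possible extension is a registry that maps each (data source, decision point) pair to its own quality policy. A hypothetical Python sketch, where the policy content is an arbitrary rule dictionary:

```python
class PolicyRegistry:
    """Map (data source, decision point) pairs to quality policies, so
    each source/decision combination can apply its own rules."""

    def __init__(self):
        self._policies = {}

    def register(self, source: str, decision_point: str, policy: dict) -> None:
        self._policies[(source, decision_point)] = policy

    def lookup(self, source: str, decision_point: str) -> dict:
        # An empty policy means no source-specific rules are defined.
        return self._policies.get((source, decision_point), {})
```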
Evaluation of new quality attributes
New functionality has to be developed for the evaluation of additional quality attributes. The added functionality should be updated in the processing configurations (Fig. 14) and communicated to the end users of the tool.
This research was conducted based on the needs of a company for managing quality in social media data. The contribution of the paper is an extended RA design of the metadata management layer for big data systems, which focuses on data quality management. The RA was implemented to ensure the empirical validity of the design. The feasibility of the developed data quality management tool was validated in a business context, in which the company utilised data quality information for sentiment analysis of tweets. The quality of tweets was evaluated in terms of timeliness, popularity, and relevancy. The quality information was used for filtering tweets to facilitate the creation of higher-quality insights with the company's product.
The research question was: “How is a quality management architecture constructed for evaluating the quality of social media data?” The data quality management architecture comprises a metadata management layer in the RA for big data systems. The metadata management layer consists of quality rules, metadata, and data (data stores). Quality rules provide a means for the company’s organisation to manage the quality of social media data sets for decision-making purposes. The main functional elements of the metadata management layer are metadata management, quality management, and quality evaluation. In the prototype system, metadata management enabled the creation of and access to metadata related to tweets. Quality management was responsible for managing the quality aspects of the metadata based on user-defined quality rules. Quality evaluation of tweets was performed in a Spark streaming cluster, which indicated that 150–800 tweets/s can be processed with two cloud nodes depending on the configuration.
ERP: enterprise resource planning
NLP: natural language processing
REST: representational state transfer
SCM: supply chain management
UML: unified modelling language
URL: uniform resource locator
XML: extensible markup language
PP wrote the article. He also designed and implemented the presented architecture, and executed the experiments. JJ contributed to the design of the use case view (in “Architecture design” chapter). Both authors read and approved the final manuscript.
The authors acknowledge Esa Ronkainen and Juha Jokitulppo (from Invenco) for providing the business context and related feedback for this research.
The authors declare that they have no competing interests.
Availability of data and materials
The dataset supporting the conclusions of this article is included within the article (and its additional files).
This research has been carried out in Digile Need for Speed program, and it has been partially funded by Tekes (the Finnish Funding Agency for Technology and Innovation).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Cui W. How to use the social media data in assisting restaurant recommendation. LNCS. 2016;9645:134–41.
- Asur S, Huberman BA. Predicting the future with social media. In: Paper presented at the IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology, Toronto, Canada, 31 August to 3 September, 2010.
- He W. Gaining competitive intelligence from social media data. Ind Manag Data. 2015;115(9):1622–36.
- Chen C. 6 million spam tweets: a large ground truth for timely twitter spam detection. In: Paper presented at the communication and information systems security symposium, London, United Kingdom, 8–12 June, 2015.
- Reuter C, Spielhofer T. Towards social resilience: a quantitative and qualitative survey on citizens’ perception of social media in emergencies in Europe. Technol Forecast Soc Change. 2016. doi:10.1016/j.techfore.2016.07.038.
- Castillo C, Mendoza M, Poblete B. Information credibility on twitter. In: Paper presented at the 20th international world wide web conference, Hyderabad, India, 28 March to 1 April, 2011.
- Castillo C, Mendoza M, Poblete B. Predicting information credibility in time-sensitive social media. Internet Res. 2013;23(5):560–88.
- Momeni E, Haslhofer B, Tao K, Houben G. Sifting useful comments from Flickr Commons and YouTube. Int J Digit Libr. 2015;16(2):161–79.
- Taleb I, Dssouli R, Serhani MA. Big data pre-processing: a quality framework. In: Paper presented at the IEEE international congress on big data, New York, USA, 27 June to 2 July, 2015.
- Serhani MA, El Kassabi HT, Taleb I, Nujum A. An hybrid approach to quality evaluation across big data value chain. In: Paper presented at the IEEE international congress on big data, San Francisco, USA, 27 June to 2 July, 2016.
- Immonen A, Pääkkönen P, Ovaska E. Evaluating the quality of social media data in big data architecture. IEEE Access. 2015;3:2028–43.
- Pääkkönen P, Pakkala D. Reference architecture and classification of technologies, products, and services for big data systems. Big Data Res. 2016;2(4):166–86.
- National Information Standards Organization. Understanding metadata. 2004. http://www.niso.org/publications/press/UnderstandingMetadata.pdf. Accessed 30 Jan 2017.
- Wang RY, Strong DM. Beyond accuracy: what data quality means to data consumers. J Manag Inform Syst. 1996;12(4):5–33.
- W3C. Web services policy 1.5—framework (W3C recommendation). 2007. https://www.w3.org/TR/ws-policy/#Policy_Model. Accessed 05 Dec 2016.
- Reuter C, Ludwig T, Ritzkatis M, Pipek V. Social-QAS: tailorable quality assessment service for social media content. In: Paper presented at the 5th international symposium on end user development, Madrid, Spain, 26–29 May, 2015.
- Reuter C, Ludwig T, Kaufhold M, Pipek V. XHELP: Design of a cross-platform social-media application to support volunteer moderators in disasters. In: Paper presented at the CHI crossings, Seoul, Korea, 18–23 April, 2015.
- Ludwig T, Reuter C, Pipek V. Social haystack: dynamic quality assessment of citizen-generated content during emergencies. ACM Trans Comput Hum Interact. 2015;22(4):17.
- Chae B. Insights from hashtag #supplychain and Twitter analytics: considering Twitter and Twitter data for supply chain practice and research. Int J Prod Econ. 2015;165:247–59.
- Shankararaman V, Lum EK. Integration of social media technologies with ERP: a prototype implementation. In: Paper presented at the 19th Americas conference on information systems, Chicago, Illinois, USA, 15–17 August, 2013.
- Gao J, Xie C, Tao C. Big data validation and quality assurance—issues, challenges, and needs. In: Paper presented at the IEEE 2016 symposium on service-oriented system engineering, Oxford, United Kingdom, 29 March to 2 April, 2016.
- Moges H, Vlasselaer VV, Lemahieu W, Baesens B. Determining the use of data quality metadata (DQM) for decision making purposes and its impact for decision outcomes—an exploratory study. Decis Support Syst. 2016;83:32–46.
- Bello-Orgaz G, Jung JJ, Camacho D. Social big data: recent achievements and new challenges. Inf Fusion. 2016;28:45–59.
- Bontcheva K. TwitIE: an open-source information extraction pipeline for microblog text. In: Paper presented at the recent advances in natural language processing, Hissar, Bulgaria, 7–13 September, 2013.
- Derczynski L. Analysis of named entity recognition and linking for tweets. Inf Process Manag. 2015;51:32–49.
- Ritter A, Clark S, Mausam, Etzioni O. Named entity recognition in tweets: an experimental study. In: Paper presented at the conference on empirical methods in natural language processing, Edinburgh, Scotland, UK, 27–31 July, 2011.
- Zangerle E, Gassler W, Specht G. Recommending #-tags in Twitter. In: Paper presented at the workshop on semantic adaptive social web, Girona, Spain, 15 July, 2011.
- Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In: Paper presented at the international conference on learning representations, Scottsdale, Arizona, USA, 2–4 May, 2013.
- Google Code Archive. Word2Vec. https://code.google.com/archive/p/word2vec/. Accessed 08 Nov 2016.
- Tomar A. Towards twitter hashtag recommendations using distributed word representations and a deep feed forward neural network. In: Paper presented at the international conference on advances in computing, communications, and informatics, Delhi, India, 24–27 September, 2014.
- Batrinca B, Treleaven PC. Social media analytics: a survey of techniques, tools and platforms. AI Soc. 2015;30(1):89–116.
- Rosa KD. Topical clustering of tweets. In: Paper presented at the social web search and mining, Beijing, China, 28 July, 2010.
- Ferrara E. Clustering memes in social media. In: Paper presented at the IEEE/ACM international conference on advances in social networks analysis and mining, Niagara, Ontario, Canada, 25–29 August, 2013.
- Godin F. Using topic models for twitter hashtag recommendation. In: Paper presented at the 22nd international conference on world wide web, Rio de Janeiro, Brazil, 13–17 May, 2013.
- Le B, Nguyen H. Twitter sentiment analysis using machine learning techniques. Adv Intell Syst Comput. 2015;358:279–89.
- Gupta A, Kumaraguru P, Castillo C, Meier P. TweetCred: real-time credibility assessment of content on twitter. In: Paper presented at the 6th international conference on social informatics, Barcelona, Spain, 11–13 November, 2014.
- Antoniadis S, Litou I, Kalogeraki V. A model for identifying misinformation in online social networks. LNCS. 2015;9415:473–82.
- Mitra T, Gilbert E. CREDBANK: a large-scale social media corpus with associated credibility annotations. In: Paper presented at the 9th international AAAI conference on web and social media, Oxford, UK, 26–29 May, 2015.
- Angelov S, Grefen P, Greefhorst D. A framework for analysis and design of software reference architectures. Inf Softw Technol. 2011;54(4):417–31.
- Galster M, Avgeriou P. Empirically-grounded reference architectures: a proposal. In: Paper presented at the joint ACM SIGSOFT conference on quality of software architectures and ACM SIGSOFT symposium on architecting critical systems, Boulder, Colorado, USA, 20–24 June, 2011.
- Pääkkönen P, Pakkala D. The implications of disk-based RAID and virtualisation for write-intensive services. In: Paper presented at the 30th annual ACM symposium on applied computing, Salamanca, Spain, 13–17 April, 2015.
- Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I. Discretized streams: fault-tolerant streaming computation at scale. In: Paper presented at the 24th ACM symposium on operating systems principles, Farmington, Pennsylvania, USA, 3–6 November, 2013.
- SourceForge. JRulesEngine. 2016. http://jruleengine.sourceforge.net. Accessed 08 Nov 2016.
- Hobbs T. Basic rules of Cassandra data modeling. 2015. http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling. Accessed 08 Nov 2016.
- Jersey. 2016. https://jersey.java.net/. Accessed 08 Nov 2016.
- Eclipse. Jetty. 2016. http://www.eclipse.org/jetty/. Accessed 08 Nov 2016.
- The Apache Software Foundation. Apache HttpComponents. 2016. https://hc.apache.org/. Accessed 08 Nov 2016.
- Google Code Archive. JSON simple. 2016. https://code.google.com/archive/p/json-simple/. Accessed 08 Nov 2016.
- Twitter developer documentation. https://dev.twitter.com/overview/api/. Accessed 08 Nov 2016.
- DeepLearning4J. https://deeplearning4j.org/. Accessed 08 Nov 2016.
- Datastax. CQL for Apache Cassandra. 2016. http://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html. Accessed 08 Nov 2016.
- Pantsar-Syväniemi S. Situation-based and self-adaptive applications for the smart environment. J Ambient Intell Smart Environ. 2012;4(6):491–516.
- Pääkkönen P. Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing. J Big Data. 2016;3:6.