A big data methodology for categorising technical support requests using Hadoop and Mahout
© Duque Barrachina and O' Driscoll; licensee Springer. 2014
Received: 14 June 2013
Accepted: 20 February 2014
Published: 24 June 2014
Technical Support call centres frequently receive several thousand customer queries on a daily basis. Traditionally, such organisations discard data related to customer enquiries within a relatively short period of time due to limited storage capacity. However, in recent years, the value of retaining and analysing this information has become clear, enabling call centres to identify customer patterns, improve first call resolution and maximise daily closure rates. This paper proposes a Proof of Concept (PoC) end to end solution that utilises the Hadoop programming model, extended ecosystem and the Mahout Big Data Analytics library for categorising similar support calls for large technical support data sets. The proposed solution is evaluated on a VMware technical support dataset.
In recent years, there has been an unprecedented increase in the quantity and variety of data generated worldwide. According to the IDC’s Digital Universe study, the world’s information is doubling every two years and is predicted to reach 40 ZB by 2020 (Digital Universe Study (on behalf of EMC Corporation)). This increase in data, often referred to as a “data tsunami”, is driven by the proliferation of social media along with an increase in mobile and networked devices (the Internet of Things), finance and online retail as well as advances in the physical and life sciences sectors. As evidence of this, the online microblogging service Twitter processes approximately 12 TB of data per day, while Facebook receives more than five hundred million likes per day (McKinsey Global Institute). In addition, the Cisco Internet Business Solutions Group (IBSG) predicts that there will be 25 billion devices connected to the Internet by 2015 and 50 billion by 2020 (Cisco Internet Business Solutions Group (IBSG)). Such vast datasets are commonly referred to as “Big Data”. Big Data is characterised not only by its volume, but also by a rich mix of data types and formats (variety) and its time-sensitive nature, which marks a deviation from traditional batch processing (velocity) (Karmasphere). These characteristics are commonly referred to as the 3 Vs.
Traditional distributed systems and databases exhibit limited scalability and are no longer suitable for effectively capturing, storing, managing and analysing this data. Furthermore, relational databases support structured data by imposing a strict schema; however, data growth is currently driven by unstructured data by a factor of 20:1 (Karmasphere). Finally, data warehouses are no longer able to process whole datasets due to their massive size; hence the information stored in these solutions is no longer statistically representative of the data from which it was extracted, making the data analytics performed on it less reliable. Big Data requires new architectures designed for scalability, resilience and efficient parallel processing.
As a result, big data processing and management (capturing and storing big data), exemplified by the MapReduce paradigm, has gained significant attention in recent years. However, it is now recognised that it is necessary to further develop platforms that can harness these technologies in order to gain meaningful insight and to make more informed business decisions. The implementation of data analytics on such datasets is commonly referred to as big data analytics. Data mining and machine learning techniques are currently used across a wide range of industries to aid organisations in optimising their business, reducing risks and increasing profitability. Sectors employing these techniques include retailers, banking institutions and insurance companies as well as health related fields (Huang et al.; Lavrac et al.). In the current marketplace, big data analytics has become a business requirement for many organisations looking to gain a competitive advantage, as evidenced by IBM’s 2011 Global CIO study, which places business intelligence and analytics as the main focus for CIOs over the next five years, ahead of virtualisation and cloud computing (IBM).
Importantly, this has further been recognised in the technical support space of leading technology multinationals, as call centres have begun to explore the application of data analytics as a way to streamline the business and gain insight regarding customers’ expectations, a necessity in an industry challenged by economic pressures and increased competition (Aberdeen Group). Thus this paper’s contributions are two-fold:
Firstly, an end to end proof of concept solution based entirely on open source components is described that can be used to process and analyse large technical support datasets to categorise similar technical support calls and identify likely resolutions. The proposed solution utilises the Hadoop distributed data processing platform, extended ecosystem and parallelised clustering techniques using the Mahout library. It is envisaged that if such a solution were deployed in commercial environments with large technical support datasets, updated on a daily basis, it would expedite case resolution and improve accuracy, maximising daily closure rates by providing similar case resolutions to staff when a new technical support case is received. If achieved, the reduction of resolution time would also ultimately help technical support teams to increase customer satisfaction and prevent churn. Furthermore, this solution could also be used to identify the most problematic product features and highlight staff knowledge gaps, leading to more directed staff training programmes.
Secondly, an evaluation of the performance and accuracy of parallelised clustering algorithms for analysing a distributed data set is conducted using a real-world technical support dataset.
The rest of this paper is organised as follows: Section II describes the algorithms and technologies underpinning the proposed architecture along with related work in this area. Section III outlines the architecture and implementation of the proposed technical support analytics platform. Section V details the performance evaluation and analysis of the proposed solution, with Section VI outlining final conclusions.
Background & literature review
A brief overview of the constituent technologies is now provided. The need for efficient, scale out solutions to support partial component failures and provide data consistency motivated the development of the Google File System (GFS) (Ghemawat et al.) and the MapReduce (Dean & Ghemawat) paradigm in the early 2000s. The premise behind the Google File System and MapReduce is to distribute data across commodity servers such that computation of data is performed where the data is stored. This approach eliminates the need to transfer the data over the network to be processed. Furthermore, methods for ensuring the resilience of the cluster and load balancing of processing were specified.
GFS and MapReduce form the basis for the Apache Hadoop project, comprising two main architectural components: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce (Apache Hadoop). HDFS (Shvachko et al.) is the distributed storage component of Hadoop with participating nodes following a master/slave architecture. All files stored in HDFS are split into blocks which are replicated and distributed across different slave nodes on the cluster known as data nodes, with a master node, called the name node, maintaining metadata, e.g. the blocks comprising a file, where in the cluster these blocks are located and so on.
MapReduce (Bhandarkar) is the distributed compute component of Hadoop. MapReduce jobs are controlled by a software daemon known as the JobTracker. A job is a full MapReduce program, including a complete execution of Map and Reduce tasks over a dataset. The MapReduce paradigm also relies on a master/slave architecture. The JobTracker runs on the master node and assigns Map and Reduce tasks to the slave nodes in the cluster. The slave nodes run another software daemon called the TaskTracker that is responsible for actually instantiating the Map or Reduce tasks and reporting the progress back to the JobTracker.
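The Map and Reduce phases described above can be illustrated with a small in-memory sketch. This is not Hadoop code: it runs in a single process, does not model the JobTracker, TaskTrackers or HDFS, and the function names (`run_mapreduce`, `word_count_mapper`, `word_count_reducer`) are hypothetical names chosen for illustration only.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory (single process, no cluster)."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Shuffle: values are grouped by key, as happens between
            # the Map and Reduce phases on a real cluster.
            grouped[key].append(value)
    # Reduce phase: each key is reduced independently of the others.
    return {key: reducer(key, values) for key, values in sorted(grouped.items())}


def word_count_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    """Sum the counts emitted for one word."""
    return sum(counts)


if __name__ == "__main__":
    calls = ["disk error on host", "disk full on datastore"]
    print(run_mapreduce(calls, word_count_mapper, word_count_reducer))
```

On a real cluster the map calls would run on the nodes holding the relevant HDFS blocks, which is what avoids moving the input data across the network.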
The extended Hadoop ecosystem includes a growing list of solutions that integrate or expand Hadoop’s capabilities. Mahout is an open source machine learning library built on top of Hadoop to provide distributed analytics capabilities (Apache Mahout). Mahout incorporates a wide range of data mining techniques including collaborative filtering, classification and clustering algorithms. Of relevance to this paper, Mahout supports a wide variety of clustering algorithms including: k-means, canopy clustering, fuzzy k-means, Dirichlet Clustering and Latent Dirichlet Allocation. HBase is a distributed column-oriented database that resides on top of Hadoop’s Distributed File System, providing real-time random read/write access to very large datasets (Apache Hbase). Additionally, Hive defines a simple SQL-like query language, called HiveQL, to abstract the complexity of writing MapReduce jobs from users. Hive transforms HiveQL queries into MapReduce jobs for execution on a cluster (Apache Hive).
While research associated with machine learning algorithms is well established, research on big data analytics and large scale distributed machine learning is very much in its infancy with libraries such as Mahout still undergoing considerable development. However some initial experimentation has been undertaken in this area.
Esteves et al. studied the use of Mahout k-means clustering on a 1.1 GB data set of TCP dumps from an air force LAN, while examining clustering scalability and quality (Esteves et al.). The authors subsequently evaluated the applications of k-means and fuzzy c-means clustering on an 11 GB Wikipedia dataset with respect to clustering algorithm and system performance (Esteves & Rong). However, in contrast to this work, the authors of this paper consider a 32 GB dataset and the impact of five clustering algorithms and, most importantly, provide a complete end to end solution including data pre-processing, daily upload of new data, real-time access and a user interface by using the extended Hadoop ecosystem. Ericson et al. use the 1987 Reuters dataset, one of the most widely used text categorisation data sets, containing approximately 21,578 documents, to evaluate clustering algorithms over Hadoop but also over their own runtime environment, Granules (Ericson & Pallickara). While the performance of four clustering algorithms was evaluated, the primary evaluation concentrated on the underlying runtime environment and, unlike the proposed architecture, a complete end to end solution was not provided as a clean data set was used.
In contrast, the end to end framework presented in this paper successfully integrates Hadoop and Mahout to provide a fully functional end to end solution that addresses a real world problem, facilitating the streamlining of call centre operations and evaluating multiple clustering algorithms in terms of timeliness and accuracy.
Research design and methodology
Data pre-processing
Automated periodic refresh of new technical support data
As technical support data sets will be updated daily as new solutions are derived, a mechanism is required to automatically refresh the technical support data from which the clustered results are derived, ensuring that the output remains accurate. Given the nature of the data, it is sufficient to perform such a procedure on a daily basis. To avoid uploading the entire technical support data set on a daily basis, a cron job can be scheduled to automatically back up the changes that occurred over the preceding 24 hours. It is proposed that a separate Hadoop job is responsible for uploading the CSV files containing the details of the support requests received that day into HDFS.
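A minimal sketch of the daily incremental selection is given below, assuming the support requests are exported as CSV files into a local directory and that file modification time is an acceptable proxy for "changed in the preceding 24 hours". The paper proposes a separate Hadoop job for the upload; as a stand-in, this sketch only builds a `hadoop fs -put` command line and does not execute it. The function names and the HDFS path `/support/incoming` are hypothetical.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 24 * 60 * 60


def changed_within(entries, now, window=SECONDS_PER_DAY):
    """Return the names whose modification time falls inside the window.

    `entries` is an iterable of (name, mtime) pairs, with mtime in
    seconds since the epoch.
    """
    cutoff = now - window
    return sorted(name for name, mtime in entries if mtime >= cutoff)


def scan_csv_files(directory):
    """List (path, mtime) pairs for the CSV files in a local directory."""
    return [(str(p), p.stat().st_mtime)
            for p in Path(directory).glob("*.csv") if p.is_file()]


def build_upload_command(local_files, hdfs_dir):
    """Build (but do not run) a `hadoop fs -put` command for the files."""
    if not local_files:
        raise ValueError("no files to upload")
    return ["hadoop", "fs", "-put", *local_files, hdfs_dir]


def daily_upload_command(directory, hdfs_dir, now=None):
    """Return the upload command for the last 24 hours, or None if idle."""
    now = time.time() if now is None else now
    recent = changed_within(scan_csv_files(directory), now)
    return build_upload_command(recent, hdfs_dir) if recent else None
```

A cron entry could invoke a script that calls `daily_upload_command` once per day and passes the returned list to `subprocess.run`.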
In order for data to be processed by the Mahout clustering algorithms, it must first be converted to vector format. Technical support data stored in HDFS is converted into vectors using Mahout’s existing command line tools for subsequent clustering analysis. Importantly, the challenge of identifying related support calls based on their problem description is resolved by using Mahout’s distributed clustering algorithms to analyse the data set, thereby identifying support calls with a similar description. Multiple clustering algorithms are evaluated in the proposed system with respect to system performance and accuracy. The specific details of the clustering evaluation and results are discussed further in Section V.
It is beyond the scope of this paper to provide a detailed description of each clustering algorithm as a parallelised MapReduce job. However, as an indication, the k-means algorithm is outlined from the perspective of a MapReduce job. Each map task receives the current centroids together with a subset of the input data and is responsible for assigning each input data point, i.e. text vector, to its nearest cluster, i.e. centroid. For each data point, the mapper generates a key/value pair, where the key is the cluster identifier and the value corresponds to the coordinates of that point. The algorithm uses a combiner to reduce the amount of data to be transferred from the mapper to the reducer. The combiner receives all key/value pairs from the mapper and produces partial sums of the input vectors for each cluster. All values associated with the same cluster are sent to the same reducer, thus each reducer receives the partial sums of all points of one or more clusters and computes the new cluster centroids. The driver program iterates over the points and clusters until all clusters have converged or until the maximum number of iterations has been reached.
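The mapper, combiner, reducer and driver roles described above can be sketched in a single process as follows. This is an illustrative simulation and not Mahout's implementation: each inner list of `splits` stands in for the input of one map task, Euclidean distance is assumed as the distance measure, and all function names are hypothetical.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans_map(points, centroids):
    """Mapper: emit (cluster id, point) for every point in one input split."""
    for point in points:
        yield nearest(point, centroids), point


def kmeans_combine(pairs):
    """Combiner: collapse one mapper's output into per-cluster partial sums."""
    partial = {}
    for cid, point in pairs:
        sums, count = partial.get(cid, ([0.0] * len(point), 0))
        partial[cid] = ([s + x for s, x in zip(sums, point)], count + 1)
    yield from partial.items()


def kmeans_reduce(partials, centroids):
    """Reducer: merge the partial sums and compute the new centroids."""
    totals = {}
    for cid, (sums, count) in partials:
        acc, n = totals.get(cid, ([0.0] * len(sums), 0))
        totals[cid] = ([a + s for a, s in zip(acc, sums)], n + count)
    new_centroids = list(centroids)  # clusters with no points keep their centroid
    for cid, (sums, count) in totals.items():
        new_centroids[cid] = tuple(s / count for s in sums)
    return new_centroids


def kmeans(splits, centroids, max_iterations=10, tolerance=1e-6):
    """Driver: repeat map/combine/reduce until convergence or the iteration cap."""
    for _ in range(max_iterations):
        partials = [pair for split in splits
                    for pair in kmeans_combine(kmeans_map(split, centroids))]
        updated = kmeans_reduce(partials, centroids)
        shift = max(math.dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= tolerance:
            break
    return centroids
```

Because the combiner forwards only one (sum, count) pair per cluster per split, the data crossing from map to reduce grows with the number of clusters rather than with the number of points.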
Real-time platform access
User query of results
Finally, it should be noted that directly querying HBase to obtain similar support calls is not user friendly and is technically complex requiring the development of a MapReduce job. To overcome this, HBase has been integrated with Hive. Such an approach allows technical support engineers to obtain similar support calls from a web-based interface that generates HiveQL to query the clustering results derived from Mahout. It can be envisaged that this proof of concept web interface could be easily enhanced for integration with existing BI tools and to provide advanced technical support dashboards.
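As a rough illustration of how a web interface might generate HiveQL for such a lookup, the sketch below builds a query string from a cluster identifier. The table name `clustered_calls` and the columns `call_id`, `description`, `resolution` and `cluster_id` are hypothetical; the paper does not specify the schema of the Hive table mapped onto HBase.

```python
def build_similar_calls_query(cluster_id, limit=20, table="clustered_calls"):
    """Build a HiveQL query listing the support calls in one cluster.

    The table and column names are assumptions made for illustration.
    """
    # Coerce to int so that arbitrary user input cannot be injected.
    cluster_id = int(cluster_id)
    limit = int(limit)
    if cluster_id < 0 or limit <= 0:
        raise ValueError("cluster_id must be >= 0 and limit must be > 0")
    # Allow only simple identifiers as table names.
    if not table.replace("_", "").isalnum():
        raise ValueError("invalid table name")
    return (
        f"SELECT call_id, description, resolution FROM {table} "
        f"WHERE cluster_id = {cluster_id} LIMIT {limit}"
    )


if __name__ == "__main__":
    print(build_similar_calls_query(7, limit=5))
```

The resulting string would be submitted to Hive by the web tier, and Hive would translate it into MapReduce jobs as described in the background section.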
This section has outlined an end to end, open source proof of concept solution to transform, process, analyse and identify resolutions to similar technical support calls given an open case. A VMware technical support dataset is now analysed using five of Mahout’s distributed clustering algorithms. The output of each algorithm is discussed, covering both the impact of its particular characteristics and its performance.
Results and discussion
The implemented solution outlined in the previous Section is now evaluated using a real-world VMware technical support data set of approximately 0.032 TB. This is based on an average call size of 60 KB from 4 main call centres, with an estimated 1900 calls per week. The four node Hadoop cluster is based on commodity hardware with each node following the same configuration: 8 GB RAM; 2 TB hard drive; 4 MB Cache; 1 CPU Intel Core i7; 2 cores at 2.20 GHz; Onboard 100 Mbps LAN; Linux CentOS with Hadoop v0.20 and Mahout v0.5. The evaluation is discussed with respect to the processing performance of the specific parallelised clustering algorithms and also with respect to the clustering accuracy.
Parallelised clustering performance
[Table: parameter rows T1 (Canopy clustering), T2 (Canopy clustering), Fuzziness Factor (Fuzzy-kmeans), Max Dirichlet Iterations, Words in Corpus, Max LDA Iterations; values not recoverable]
[Table: Mahout algorithms and their execution times; column Execution Time (ms); values not recoverable]
This paper presents a complete open source solution for the processing and categorisation of similar service calls within large technical support data sets, allowing similar calls to be identified with the potential for faster resolution. The solution was evaluated using a subset of VMware technical support data, with the output and accuracy of five commonly employed clustering algorithms examined. Although this paper discusses the analysis of VMware support data in particular, the described techniques and procedures are generally applicable to other organisations providing similar services, thereby providing a proof of concept industry framework. Future work will examine alternative text vectorisation methods to TF and TF-IDF to further improve the quality of the clustering results and to consider word collocation. Additionally, orchestration tools such as Oozie could be considered to automate the steps required to identify related support calls.
- Digital Universe Study (on behalf of EMC Corporation): Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. 2012. http://idcdocserv.com/1414
- McKinsey Global Institute: Big data: The next frontier for innovation, competition, and productivity. 2011. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
- Cisco Internet Business Solutions Group (IBSG): The Internet of Things: How the Next Evolution of the Internet is Changing Everything. 2011. http://www.cisco.com/web/about/ac79/docs/innov/IoT_IBSG_0411FINAL.pdf
- Karmasphere: Deriving Intelligence from Big Data in Hadoop: A Big Data Analytics Primer. 2011.
- Karmasphere: Understanding the Elements of Big Data: More than a Hadoop Distribution. 2011.
- Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nat Protoc 2009, 4(1): 44–57.
- Lavrac N, Keravnou E, Zupan B: Intelligent data analysis in medicine. Encyclopaedia of Computer Science and Technology 2000, 42: 113–157.
- IBM: The Essential CIO. 2011. http://www-935.ibm.com/services/uk/cio/pdf/CIE03073-GBEN-01.pdf
- Aberdeen Group: Unlocking Business Intelligence in the Contact Center. 2010.
- Ghemawat S, Gobioff H, Leung S: The Google File System. SOSP ’03 Proceedings of the 19th ACM symposium on Operating Systems Principles 2003. vol 6. ACM, pp 10–10. http://dl.acm.org/citation.cfm?id=945450
- Dean J, Ghemawat S: MapReduce: Simplified Data Processing on Large Clusters. San Francisco, CA: OSDI ’04 Proceedings of the 6th symposium on Operating System Design and Implementation; 2004. Vol 6. pp 10–10. http://dl.acm.org/citation.cfm?id=1251264
- Apache Hadoop 2012. http://hadoop.apache.org/
- Shvachko K, Kuang H, Radia S, Chansler R: The Hadoop Distributed File System. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST ’10) 2010. ACM pp 1–10. http://dl.acm.org/citation.cfm?id=1914427
- Bhandarkar M: MapReduce programming with apache Hadoop. IEEE 24th International Symposium on Parallel & Distributed Processing (IPDPS ’10) 2010.
- Apache Mahout 2012. https://mahout.apache.org/
- Apache Hbase 2012. http://hbase.apache.org/
- Apache Hive 2012. http://hive.apache.org/
- Esteves RM, Pais R, Rong C: K-means Clustering in the Cloud – A Mahout Test. IEEE International Conference on Advanced Information Networking and Applications (WAINA ’11) 2011. IEEE pp 514–519. http://www.ieeeexplore.us/xpl/articleDetails.jsp?tp=&arnumber=6133195&queryText%3Desteves+clustering+wikipedia
- Esteves RM, Rong C: Using Mahout for Clustering Wikipedia’s Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud. IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom ’11) 2011. IEEE pp 565–569. http://www.ieeeexplore.us/xpl/articleDetails.jsp?tp=&arnumber=5763553&queryText%3Desteves+K-means+Clustering+in+the+Cloud+%E2%80%93+A+Mahout+Test
- Ericson K, Pallickara S: On the Performance of High Dimensional Data Clustering and Classification Algorithms. Elsevier Future Generation Computer Systems; 2012. Available from: http://dl.acm.org/citation.cfm?id=2435540
- Owen S, Anil R, Dunning T, Friedman E, Manning Publications: Mahout in action. Chapter 8. Shelter Island, N.Y; 2012: 130–144.
- Soucy P, Mineau GW: Beyond TFIDF Weighting for Text Categorization in the Vector Space Model. Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI ’05) 2005. ACM pp 1130–1135. http://dl.acm.org/citation.cfm?id=1642474
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.