Predictive analytics using big data for increased customer loyalty: Syriatel Telecom Company case study

Wassouf, Wissam Nazeer; Alkhatib, Ramez; Salloum, Kamal; Balloul, Shadi

doi:10.1186/s40537-020-00290-0

Research
Open access
Published: 23 April 2020

Wissam Nazeer Wassouf ORCID: orcid.org/0000-0001-8301-6320¹,
Ramez Alkhatib²,
Kamal Salloum¹ &
…
Shadi Balloul³

Journal of Big Data volume 7, Article number: 29 (2020)

Abstract

Given the growing importance of customer behavior in today's business market, telecom operators focus not only on customer profitability to increase market share but also on highly loyal customers and on customers who are likely to churn. The emergence of big data concepts introduced a new wave of Customer Relationship Management (CRM) strategies. Big data analysis helps organizations describe customers' behavior, understand their habits, develop appropriate marketing plans, identify sales transactions, and build long-term loyalty relationships. This paper provides a methodology for telecom companies to target customers of different value with appropriate offers and services. The methodology was implemented and tested using a dataset of about 127 million records for training and testing, supplied by Syriatel corporation. Firstly, customers were segmented using the new Time-Frequency-Monetary (TFM) approach, where Time (T) is the total duration of calls and Internet sessions in a certain period, Frequency (F) is how often services are used within a certain period, and Monetary (M) is the money spent during a certain period; the level of loyalty was then defined for each segment. Secondly, the loyalty level descriptors were taken as categories, and the best behavioral features of customers were chosen together with their demographic information, such as age, gender, and the services they share. Thirdly, several classification algorithms were applied, based on the descriptors and the chosen features, to build different predictive models that were used to classify new users by loyalty. Finally, those models were evaluated against several criteria, and the rules of loyalty prediction were derived. By analyzing these rules, the reasons for loyalty at each level were discovered, so that each level can be targeted with the most appropriate offers and services.

Introduction

The telecom sector is witnessing a massive increase in data, and by analyzing this data, telecom operators can manage and retain customers. It is also important for companies to be able to predict the income they may receive from their active customers. For this purpose, they need models that can determine customer loyalty. The cost of acquiring a customer is usually higher than the cost of retaining one [1]. Loyalty prediction can identify both highly loyal customers, so that they can be retained, and customers who intend to switch to competitors. This capability is necessary, especially for modern telecommunications operators. Companies today face more complexity and competition in their business and need to develop innovative activities to capture and improve customer satisfaction and retention [2]. Growing profitability is the goal of most companies; to reach this goal, companies must analyze customer relationship management (CRM) and provide appropriate marketing strategies [3]. Some studies provided a new model of transactions based on both services and customer satisfaction and showed that price is not the only measure affecting customer buying decisions: it is also important that the customer and the company agree on product value and good customer service. Therefore, organizations should not merely seek to develop a product to satisfy their customers; they must also follow customer purchasing behavior and offer distinct products for each segment. In other words, segmenting customers based on purchasing behavior is necessary to develop successful marketing strategies, which in turn create and maintain competitive advantage. Current methods of customer value analysis, which are based on past customer behavior patterns or demographic variables, are limited in their ability to predict future customer behavior. So, better patterns were exch

Research objectives

This research pursued the following objectives:

Customer value was analyzed by segmenting customers according to the new TFM approach and then determining the level of loyalty for each segment in a telecom big data environment.
A set of features was derived from the telecom data.
The best behavioral features of customers were chosen together with their demographic information. Based on these features and the level of loyalty for each segment, classification models were built with the following algorithms: Random Forest classifier, Decision Tree classifier, Gradient-boosted tree classifier, and Multilayer Perceptron classifier (MLPC).
These models were evaluated against several criteria, and the most accurate model was selected.
The loyalty rules were derived from this model. These rules showed the characteristics of each level of loyalty, so the reasons for loyalty were identified in each segment in order to target it appropriately. A further advantage of applying classification algorithms was a model that classifies new users by loyalty.

Related works

Various efforts have been made to build effective prediction models for retaining customers using different techniques, and many studies have built their own predictive models. Oladapo et al. [4] designed a logistic regression model on customer data that predicts customer retention in a telecommunications company with 95.5% accuracy. This model predicts customer retention based on billing, value-added services, and SMS service issues.

Aluri et al. [5] focused on using machine learning to determine the value of customers in hospitality businesses, such as restaurants and hotels, by dynamically engaging customers with a loyalty program brand. Their results show that machine learning processes excel at identifying customers with greater value in specific promotions. They deepened the practical and theoretical understanding of machine learning in the value chain of customer loyalty, within a framework that uses a dynamic model of customer engagement.

Wiaya and Gersang [6] predicted customer loyalty at the National Multimedia Company of Indonesia using three data mining algorithms to form a customer loyalty classification model: C4.5, Naive Bayes, and Nearest Neighbor. These algorithms were applied to a dataset of 2269 records with 9 attributes. Comparing the models, the C4.5 algorithm with an 80% training and 20% test split achieved the highest accuracy, 81.02%, compared with the other algorithms and data splits. In the attribute analysis, the disconnection attribute (interpreted as the reason why customers stopped using the service) had the greatest influence on the accuracy of the loyalty prediction. The article does not discuss feature selection algorithms, methods of obtaining important features, or their impact on model accuracy.
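The 80%/20% evaluation protocol mentioned above can be sketched as follows. This is only an illustration in plain Python, not the code used in [6]; the helper names `train_test_split` and `accuracy`, the fixed random seed, and the stand-in record list are assumptions made for the example.

```python
import random


def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical stand-in for a dataset of 2269 records (only the size matters here).
train, test = train_test_split(range(2269))
example_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

With 2269 records, this split leaves 1815 records for training and 454 for testing; the reported accuracy is then the share of test records whose predicted loyalty class matches the true one.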

Wong and Wei [7] developed a tool to analyze customer behavior and predict upcoming purchases for an air travel company. The tool integrates data mining of competitors' pricing, customer segmentation, and predictive analysis. In the customer segmentation analysis, 110,840 customers were identified and segmented based on their purchasing behavior. Customer profiles were split using a weighted RFM model, and customer purchasing behavior was analyzed in response to competitor price changes. The next destinations of the identified high-value customers were predicted using association rules, and custom packages were promoted to the targeted customer segments.
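To make the idea of a weighted RFM model concrete, the following minimal sketch scores customers by a weighted sum of normalized Recency, Frequency, and Monetary values. The weights `(0.2, 0.3, 0.5)`, the min-max scaling, and the sample customers are assumptions for illustration only; they are not taken from [7].

```python
def min_max_scale(values):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_rfm_scores(customers, weights=(0.2, 0.3, 0.5)):
    """customers maps an id to (recency_days, frequency, monetary)."""
    ids = list(customers)
    # Recency is negated so that a recent purchase (few days ago) scores high.
    recency = min_max_scale([-customers[c][0] for c in ids])
    frequency = min_max_scale([customers[c][1] for c in ids])
    monetary = min_max_scale([customers[c][2] for c in ids])
    w_r, w_f, w_m = weights
    return {
        c: w_r * r + w_f * f + w_m * m
        for c, r, f, m in zip(ids, recency, frequency, monetary)
    }


# Hypothetical customers: (days since last purchase, purchases, money spent).
sample_customers = {
    "a": (5, 10, 500.0),
    "b": (30, 2, 100.0),
    "c": (15, 6, 300.0),
}
rfm_scores = weighted_rfm_scores(sample_customers)
```

Customers can then be ranked or bucketed by score; a different choice of weights shifts the emphasis among the three dimensions.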

Moedjionom et al. [8] predicted customer loyalty in a multimedia services company that offers many services to win the market. The contribution of this research is to segment potential customers based on the RFM model and then apply classification, reporting the accuracy of the customer loyalty classification. Although the C4.5 algorithm with k-means segmentation gives a better result, some important steps could be added to the research, such as using an optimization algorithm to select features or adjusting the label values to obtain a more accurate model.

Kaya et al. [9] built a predictive model based on spatial, temporal, and choice behavioral features using individual transaction logs. Their results show that the proposed dynamic behavioral models predict churn decisions much better than demographics-based features, and that this effect remains constant across multiple datasets and different definitions of customer churn. They also examined the relative importance of different behavioral features in predicting churn, and how predictive power differed across population groups.

Cheng and Sun [10] presented another application of the RFM model, named TFM, to identify high-value customers in the communications industry. They used three main features to describe users: those who have accumulated a large amount of service time (T), who often purchase 3G services (F), and who generate large monthly invoices (M).

This study proposes a comprehensive CRM strategy framework that includes customer segmentation and behavior analysis, using a dataset that contains about 500 million records (the full dataset at Syriatel). Al Janabi and Razaq [11] used intelligent big data analysis to design smart predictors for customer churn in the telecommunication industry. The goal of their research was to retain customers and improve revenue. The proposed system consists of three basic phases. First phase: understanding the company's data. This phase focuses on the initial processing of data that is fragmented and imbalanced; they addressed the imbalance problem by building the DSMOTE algorithm. Second phase: constructing a GBM-based predictor and, once it was developed, replacing its decision-making part, a decision tree (DT), with a genetic algorithm (GA). This replacement overcomes the problems of DT and reduces execution time. Third phase: verifying the accuracy of the predictor's results using the confusion matrix. The traditional preprocessing method, SMOTE, was compared with DSMOTE in terms of error rate and accuracy. The GBM-GA method achieved higher accuracy than GBM.
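For readers unfamiliar with oversampling, the sketch below illustrates the interpolation step of the standard SMOTE technique, which creates synthetic minority-class samples on the line segment between a minority sample and one of its nearest minority neighbours. It is not the DSMOTE algorithm of [11], whose details are not given here; the function name `smote`, the parameter `k`, and the sample points are assumptions for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating between neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the chosen sample (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between the sample and its neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-dimensional minority-class samples (for example, churners).
minority_samples = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_samples = smote(minority_samples, n_synthetic=6)
```

Because every synthetic point lies between two existing minority samples, the minority class grows without simply duplicating records.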

One of the biggest challenges of the current big data landscape is our inability to process vast amounts of information in a reasonable time. Reyes-Ortiz et al. [12] explored and compared two distributed computing frameworks implemented on commodity cluster architectures: MPI/OpenMP on Beowulf, which is high-performance oriented and exploits multi-machine/multi-core infrastructures, and Apache Spark on Hadoop, which targets iterative algorithms through in-memory computing. The Google Cloud Platform service was used to create virtual machine clusters, run the frameworks, and evaluate two supervised machine learning algorithms: KNN and Pegasos SVM. Results obtained from experiments with a particle physics dataset show that MPI/OpenMP outperforms Spark by more than one order of magnitude in processing speed and provides more consistent performance. However, Spark offers a better data management infrastructure and can deal with other aspects such as node failure and data replication.

Several studies in the telecom field deal with predicting the age and gender of customers on a big data platform by analyzing their personal data, including the study presented by Zaubi [13]. He designed a model using a reliable dataset of 18,000 users provided by SyriaTel Telecom Company for training and testing. The model was built using big data technology and achieved 85.6% accuracy in user gender prediction and 65.5% in user age prediction. The main contributions of this work are the improved accuracy of gender and age prediction based on mobile phone data, and an end-to-end solution that approaches customer data from multiple aspects in the telecom domain.

Other studies have dealt with predicting customer churn in telecom using machine learning on a big data platform, including the study presented by Ahmad [14]. The main contribution of his work is a churn prediction model that helps telecom operators predict which customers are most likely to churn. The model uses machine learning techniques on big data platforms and introduces a new approach to feature engineering and selection. To measure the performance of the model, the standard Area Under Curve (AUC) measure is adopted, and the AUC value obtained is 93.3%. Another main contribution is the use of customer social networks in the prediction model by extracting Social Network Analysis (SNA) features. The use of SNA improved the AUC of the model from 84% to 93.3%.
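As a reminder of what the AUC measure expresses, the following sketch computes it directly from its pairwise definition: the probability that a randomly chosen positive example (a churner) receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the sample labels and scores are hypothetical and not taken from [14].

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical churn labels (1 = churned) and model scores.
example_labels = [1, 1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.2]
example_auc = auc(example_labels, example_scores)
```

This quadratic-time version is meant for clarity only; on large datasets a rank-based computation or a library routine would be the practical choice.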

The studies above approached customer value analysis, retention, and loyalty in different ways. The study in [4] was not applied to big data: it examined all customers according to some features and used a machine learning method (a logistic regression model) to show the role of machine learning in retaining customers and increasing their loyalty. In the study [5], machine learning was implemented in a major hospitality location and compared with traditional methods for determining customer value in a loyalty program. The study [6] predicted customer loyalty at the National Multimedia Company of Indonesia using three data mining algorithms, applied to a dataset of 2269 records with 9 attributes. Comparing the models, the C4.5 algorithm with its own data split achieved the highest accuracy, 81.02%, compared with the other algorithms and data splits. In this study, a model is built to improve customer loyalty prediction based on the new TFM methodology and machine learning. Our experiments demonstrated that TFM is more appropriate for the telecom sector than RFM. The concept of TFM is adjusted so that T is the sum of the duration of calls and the duration of Internet sessions during a certain period. The dataset contains 127 million records and 220 features. Both binary and multi-class classification are applied. After comparing the classifiers, the Gradient-boosted tree classifier was found to be the best for binary classification and the Random Forest classifier the best for multi-class classification.
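The adjusted TFM definition can be illustrated with a small aggregation over usage records. This is a minimal plain-Python sketch, not the Spark implementation used in this work; the record layout `(customer_id, service, duration_seconds, amount)`, the service names, and the choice to count every service use toward F are assumptions made for the example.

```python
from collections import defaultdict


def compute_tfm(usage_records):
    """Aggregate per-customer Time, Frequency, and Monetary values for one period."""
    totals = defaultdict(lambda: {"T": 0.0, "F": 0, "M": 0.0})
    for customer_id, service, duration_seconds, amount in usage_records:
        entry = totals[customer_id]
        # T: total duration of calls and Internet sessions.
        if service in ("call", "internet"):
            entry["T"] += duration_seconds
        # F: number of service uses (assumed here to include every record).
        entry["F"] += 1
        # M: money spent during the period.
        entry["M"] += amount
    return dict(totals)


# Hypothetical usage records for a single period.
sample_records = [
    ("c1", "call", 120, 0.5),
    ("c1", "internet", 600, 1.0),
    ("c1", "sms", 0, 0.1),
    ("c2", "call", 30, 0.2),
]
tfm = compute_tfm(sample_records)
```

The resulting T, F, and M values per customer are the inputs on which segmentation into loyalty levels would operate.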

Research tools

Hortonworks data platform (HDP)

HDP is an open-source framework for distributed storage and processing of large, multi-source datasets [15]. It enables flexible application deployment, machine learning and deep learning workloads, real-time data storage, security, and governance. It is a key element of the modern data architecture (Fig. 1).

The HDP framework was custom-installed to obtain only the tools and systems required for all stages of this work. These tools and systems were: the Hadoop distributed file system (HDFS)^{Footnote 1} [16] for data storage, the Spark^{Footnote 2} execution engine for data processing [17], Yarn for resource management, Zeppelin^{Footnote 3} as a development user interface, Ambari for system monitoring, Ranger for system security, and Flume^{Footnote 4} and Sqoop^{Footnote 5} for data acquisition from Syriatel company data sources into HDFS in our dedicated framework.

Hive^{Footnote 6} is an ETL and data warehouse tool on top of the Hadoop ecosystem, used for processing structured and semi-structured data. Hive acts as a database in the Hadoop ecosystem that performs DDL and DML operations, and it provides a flexible query language, HQL, for better querying. Hive was used in MapReduce mode because the data was distributed across multiple data nodes, so that queries execute in parallel with better performance. The hardware resources included 12 nodes, each with 32 GB of RAM, 10 TB of storage capacity, and a 16-core processor. The Spark engine [17] was used in most phases of model building, such as data processing, feature engineering, training, and model testing, because it can keep its data in the compute engine's memory (RAM) and process the data stored there, thus eliminating the need for continuous Input/Output (I/O) of writing and reading data from disk. Another advantage is that this engine contains a variety of libraries that implement all stages of the machine learning life cycle.

Syriatel data sources

Call detail records (CDRs)

Each time a call is made, a message is sent, the Internet is used, or an operation is performed on the network, the descriptive information is stored as a call detail record (CDR). Table 1 illustrates some types of call logs, messages, and Internet details available in Syriatel that were used in this research to predict customer loyalty:

Table 1 CDR sample fields in Syriatel company
