Improving prediction with enhanced Distributed Memory-based Resilient Dataset Filter

Launching new products in the consumer electronics market is challenging. Developing and marketing them in a limited time affects the sustainability of the companies involved. This research work introduces a model that can predict the success of a product. A Feature Information Gain (FIG) measure is used to identify significant features, and a Distributed Memory-based Resilient Dataset Filter (DMRDF) is used to eliminate duplicate reviews, which in turn improves the reliability of the product reviews. The pre-processed dataset is used for pre-launch prediction of a product's success in the market using classifiers such as Logistic Regression and Support Vector Machine. The DMRDF method is fault-tolerant because of its resilience property, and it also reduces dataset redundancy; hence, it increases the prediction accuracy of the model. The proposed model works in a distributed environment to handle a massive volume of data and is therefore scalable. The output of this feature modelling and prediction allows manufacturers to optimize the design of their new products.

information from shop owners. On the other hand, online shopping sites provide product reviews and feedback from previous customers at no extra cost or effort for the customers [7][8][9][10].
Investing in poor quality products potentially affects an industry's brand loyalty, and eCommerce firms should change this strategy [5,11]. Consumer product success depends on different criteria, such as the quality of the product and the marketing strategies. Users should provide valuable and accurate reviews of the products [12]. Customers take the trouble to review products whether they liked them or not. Once users provide reviews, however, other retailers can create duplicated reviews [13,14]. In online marketing, the volume and value of product reviews are examined [15,16]. The number of product reviews on shopping sites, blogs and forums has increased awareness among users. This large volume of reviews creates the need for substantial data processing methods [17,18]. The value is the rating given to the products. The ratio of positive to negative reviews about a product indicates the quality of the product [19,20].
Feature selection is a crucial phase in data pre-processing [21]. Selecting features from a massive volume of unstructured data reduces model complexity and improves prediction accuracy. The existing feature selection methods are the filter, wrapper and embedded methods. The wrapper method evaluates the usefulness of a feature, and it depends on the performance of the classifier [22]. The filter method calculates the relevance of the features and analyzes data in a univariate manner. The embedded method is similar to the wrapper method. Embedded and wrapper methods are more expensive than the filter method. State-of-the-art methods in customer review analysis generally focus on categorizing positive and negative reviews using different natural language processing techniques and on recognizing spam reviews [23]. Feature selection on customer reviews increases prediction accuracy and thereby improves model performance.
An enhanced method that combines the filter and wrapper methods is proposed in this work. It focuses on product pre-launch prediction with an enhanced distributed feature selection method. Since many redundant reviews are available on the web in large volumes, a big data processing model has been implemented to filter out duplicated and unreliable data from customer reviews in order to increase prediction accuracy. A scalable big data processing model has been applied to predict the success or failure of a new product. The model has been realized using the Distributed Memory-based Resilient Dataset Filter with prediction classifiers. This paper is organized as follows. The "Related work" section discusses related work. The "Methodology" section contains the proposed methodology, including system design, Resilient Distributed Dataset and prediction using classifiers. The "Results and discussions" section summarizes the results and discussion. The conclusion of the paper is given in the "Conclusion and future work" section.

Related work
Makridakis et al. [24] show that machine learning methods are alternatives to statistical methods in multiple forecasting fields. The authors claim that statistical methods are more accurate than machine learning methods [25]. They attribute the lower accuracy to unknown values in the data, i.e., insufficient knowledge and pre-processing of the data. Different works have been implemented using the Matrix Factorization (MF) [14] method with collaborative filtering [26]. Hao et al. [15] focused on factorizing the user rating matrix into two low-dimensional vectors, i.e., user latent and item latent. The sum of squared distances can be minimized by training a model that finds a solution using Stochastic Gradient Descent [27] or least squares [28]. Salakhutdinov et al. [29] proposed a probabilistic matrix factorization method that scales linearly on large datasets and compared it with the singular value decomposition method. This matrix factorization outperforms other probabilistic factorization methods such as Bayesian-based probabilistic analysis [29] and standard probability-based matrix factorization methods. A conventional approach such as the traditional collaborative filtering method [13,30] depends on customers and items. The user-item matrix factorization technique has been used for implementation. Recommender systems are limited by the sparsity problem and the cold-start problem. In addition to the user-item matrix factorization method, various analyses and approaches have been implemented to solve these recommendation issues.
Wietsma et al. [31] proposed a recommender system that gives information about the mobile decision aid and filtering function. It was implemented with a study of 29 features of student user behavior. The results show the correlation between user reviews and product reviews from different websites. Jianguo Chen et al. [32] proposed a recommendation system for the treatment and diagnosis of diseases. A density-peaked method is adopted for cluster analysis of disease symptoms, and a rule-based Apriori algorithm is used for the diagnosis and treatment of disease. Asha et al. [33] proposed the Gini-index feature method using a movie review dataset. Sentiment analysis of the reviews is performed and opinions are extracted from the sentences. The Gini-index impurity measure improves the accuracy of polarity prediction by sentiment analysis using a Support Vector Machine [34,35]. The term frequency is calculated from the frequency of occurrence of a word in the document, and opinion words are extracted using the Gini-index method. Words with a high term frequency are not included in this method, as they decrease the precision. The disadvantage of this method is that the prediction accuracy decreases for huge volumes of data.
Luo et al. [36] proposed a method based on historical data to analyze the quality of service for automatic service selection. Liu et al. [37] proposed a system for movie rating and review summarization in a mobile environment. The authors used a Latent Semantic Analysis (LSA)-based method for product feature identification and feature-based summarization. Statistical methods [38] have been used for identifying opinion words. The disadvantage of the LSA-based method is that it cannot be represented efficiently; hence, it is difficult to index based on individual dimensions. This reduces the prediction accuracy on large datasets.
The lack of appropriate computing models for handling the huge volume and redundancy of customer review datasets is a major challenge. Another major challenge addressed in the proposed work is predicting whether a pre-launch product will survive in the industry, based on its product features and on customer feedback in the form of reviews and ratings of existing products. This prediction helps to optimize the design of the product and improve its quality with the required product features. Many relational database management systems handle structured data and do not scale to big data, which involves a large volume of unstructured data. The proposed model solves the problem of redundancy in a huge dataset for better prediction accuracy.

Methodology
Pre-launch product prediction using different classifiers has been analysed on a large customer review and rating dataset. The prediction proceeds through the following phases: data collection, feature selection and duplicate data removal, building the prediction classifier, training and testing. Figure 1 describes the stages in the system design of the model. The input dataset consists of multivariate data, which includes categorical, real and text data. The input dataset is fed to data pre-processing, which consists of feature selection, redundancy elimination and data integration and is done using the Feature Information Gain and Distributed Memory-based Resilient Dataset Filter approach. The classification algorithms are trained on the cleaned dataset. The classifiers considered for training are Support Vector Machine (SVM) and Logistic Regression (LR). The dataset is then tested for pre-launch prediction using LR and SVM.

Data collection phase
This methodology can be applied to different products. Several datasets, such as Amazon and Flipkart customer reviews, are available as public datasets [39][40][41]. The dataset of customer reviews and ratings of seven brands of mobile phones over a period of 24 months is considered in this work. Mobile phone product reviews are chosen for two reasons. First, new mobile phones, now an indispensable item in everyone's life, are launched into the market day by day. Second, market sustainability for mobile phones is very low. Table 1 shows a sample set of product reviews, in which the input dataset consists of user features and product features. User features consist of Author, ReviewID and Title, depending on the user. Product features consist of Product categories, Overall ratings and Review Content. Since the mobile phone is taken as the product, the categorization is done according to features such as Battery life, price, camera, RAM,

Dataset pre-processing
In data pre-processing, feature selection plays a major role. The product review dataset of a mobile phone contains a large number of features. Identifying features from customer reviews is important for this model to improve the prediction accuracy. An enhanced Feature Information Gain measure has been implemented to identify significant features.
Features are identified based on the content of the product reviews, ratings of the product reviews and opinion identification of the reviews. Ratings of the product reviews can be further categorized based on a rating scale of 5 (1-Bad, 2-Average, 3-Good, 4-very good, 5-Excellent). For opinion identification of the product, the polarity of extracted opinions for each review is classified using Senti-WordNet [42].
Feature Information Gain measures the amount of information about a feature retrieved from a particular review. Impurity, which is the measure of reliability of features in the input dataset, should be reduced to obtain significant features. To measure feature impurity, the best information of a feature obtained from each review is calculated as follows:
• Let $P_i$ be the probability of any feature instance $f$ of the $k$-feature set $F = \{f_1, f_2, \ldots, f_k\}$ belonging to the $i$-th customer review $R_i$, where $i$ varies from 1 to $N$.
• Let $N$ denote the total number of customer reviews.
• Let $O_R$ denote the polarity of the extracted opinions of the review.
• Let $S_R$ denote the product rating scale of the review $R$.

Table 1 Sample set of Product Reviews
The information of a feature with respect to review rating and opinion is denoted by $I_f$, and the expected information gain of the feature is denoted by $E_f$. The Review Feature Impurity is denoted by $R(I)$. The Feature Information Gain $G$, which is used to find significant features, is then calculated (Eq. 4). Features are selected based on the $G$ value: those with an information gain greater than 0.5 are selected as significant features. Table 2 shows the significant features from customer reviews and ratings.
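The selection rule can be illustrated with a minimal sketch. It uses the textbook Shannon information gain of a feature mention with respect to a review's opinion class, which is an assumption standing in for the paper's FIG measure (the latter also involves the rating scale $S_R$ and the opinion polarity $O_R$). The toy reviews, the feature names and the helper functions `entropy` and `information_gain` are hypothetical; only the 0.5 threshold comes from the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(mentions, labels):
    """Gain in class information from knowing whether a review mentions a feature.

    mentions[i] is True if review i mentions the feature,
    labels[i] is the opinion class of review i.
    """
    total = len(labels)
    gain = entropy(labels)
    for flag in (True, False):
        subset = [lab for m, lab in zip(mentions, labels) if m == flag]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Hypothetical toy data: six reviews with an opinion class each
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
feature_mentions = {
    "battery": [True, True, True, False, False, False],  # tracks the class exactly
    "colour":  [True, False, True, False, True, False],  # nearly uninformative
}

THRESHOLD = 0.5  # threshold quoted in the text
gains = {name: information_gain(m, labels) for name, m in feature_mentions.items()}
significant = [name for name, g in gains.items() if g > THRESHOLD]
print(gains, significant)
```

In this toy case only the feature whose mentions separate the two opinion classes passes the threshold; a feature mentioned about equally often in both classes carries little information and is dropped.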
The next step is to eliminate the redundant reviews and to replace the null values of an active customer in the customer review dataset using an enhanced big data processing approach. Reviews with significant features obtained from feature identification are considered for further processing.

Resilient Distributed Dataset
Resilient Distributed Dataset (RDD) [43] is a big data processing approach that allows chunks of data to be cached in memory and persisted as required. RDD thus supports in-memory data caching. Handling a variety of jobs at the same time is another challenge that RDD addresses. The method deals with chunks of data during processing and analysis. RDD can also be used for machine learning systems as well as for big data processing and analysis, which is an almost pervasive requirement in the industry.
In the proposed method the main actions of RDD are:
• reduce(β): Combines all the elements of the dataset using the function β.
• first(): Returns the first element of the dataset.
• takeOrdered(n): Returns the first n elements of the dataset in sorted order.
• saveAsSequenceFile(path): Writes the elements of the dataset as a sequence file to the given path in the local file system or another supported file system.
The main transformations of RDD are:
• map(β): Returns a new dataset formed by passing each element of the input through the function β.
• filter(β): Returns a new dataset containing the elements for which the function β returns true.
• groupByKey(): When called on a dataset of (key, value) pairs, returns a dataset of (key, list of values) pairs.
• reduceByKey(β): Returns a dataset of (key, value) pairs in which the values for each key are combined using the given reduce function β.
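The semantics of these operations can be sketched on an ordinary in-memory list, without a Spark cluster. The snippet below is only a plain-Python emulation, not the Spark API: the helpers `group_by_key` and `reduce_by_key` and the toy `ratings` data are hypothetical, and real RDD transformations are lazy and distributed, which this sketch does not model.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (product, rating) pairs standing in for a (key, value) dataset
ratings = [("phoneA", 5), ("phoneB", 2), ("phoneA", 4), ("phoneB", 1), ("phoneA", 3)]

def group_by_key(pairs):
    """Emulates groupByKey(): key -> list of all values seen for that key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_by_key(pairs, func):
    """Emulates reduceByKey(func): key -> values combined pairwise with func."""
    return {key: reduce(func, values) for key, values in group_by_key(pairs).items()}

# Transformations: each one produces a new dataset
good = list(filter(lambda kv: kv[1] >= 3, ratings))    # filter(β)
as_counts = list(map(lambda kv: (kv[0], 1), ratings))  # map(β)
grouped = group_by_key(ratings)                        # groupByKey()
totals = reduce_by_key(ratings, lambda a, b: a + b)    # reduceByKey(β)
counts = reduce_by_key(as_counts, lambda a, b: a + b)

# Actions: each one returns a value to the caller
total_rating = reduce(lambda a, b: a + b, [v for _, v in ratings])  # reduce(β)
first_item = ratings[0]                                             # first()
lowest_two = sorted(ratings, key=lambda kv: kv[1])[:2]              # takeOrdered(2) with a key

print(grouped, totals, counts, total_rating, first_item, lowest_two)
```

The split between the two groups mirrors the list above: transformations describe a new dataset, while actions bring a result back to the driver program.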
In the proposed work, an enhanced Distributed Memory-based Resilient Dataset Filter (DMRDF) is applied. Datasets in the DMRDF method have a long lineage and can be recomputed from prior information, which is how the method achieves fault tolerance. DMRDF has been implemented to remove the redundancy in the dataset for product pre-launch prediction. This enhanced method is simple and fast.
• Let the list of $n$ customers be represented as $C = \{c_1, c_2, c_3, \ldots, c_n\}$.
• Let the list of $N$ reviews be represented as $R = \{r_1, r_2, r_3, \ldots, r_N\}$.
• Let the $x$ significant features identified from the feature set $F$ be represented as $F_x \subset F$.
• An active customer is one whose reviews contain significant features, with the Information Gain value denoted by $G$.
In the DMRDF method, a product is chosen and its customer reviews are retrieved. Customers with similar reviews on the selected product, as well as reviews with insignificant features, are eliminated. The memory-based Resilient Dataset Filter score is then calculated between each of the customer reviews with significant features.
Consider a set $C$ of $n$ customers, a set $R$ of $N$ reviews and a set $F_x$ of significant features. The corresponding vectors are represented as $K_C$, $K_R$ and $K_{F_x}$. $K_{R_i}$ is represented as a row vector and $K_{F_j}$ as a column vector. Each entry $K_{C_m}$ denotes the number of times the $m$-th review occurs among the customers. The similarity for the $i$-th review of the $m$-th customer is found using the $L_1$ norm of $K_{R_i}$ and $K_{C_m}$. The Distributed Memory-based resilient filter score δ is calculated using Eq. (5).
The δ score is calculated for each customer review and lies in the range [0,1]. The significant features are found using Eq. (4). For customer reviews without significant features, the G value is zero. Reviews with a δ score of 0 are considered insignificant, as they contain no significant feature or opinion, and hence they are eliminated and not considered for further processing. If more than one Distributed Memory-based resilient filter score value is identified for a review, the second occurrence of the review is considered a duplicate.
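The two elimination rules described above (drop zero-score reviews, keep only the first occurrence of a repeated review) can be sketched as follows. This is an illustration under stated assumptions rather than the DMRDF implementation: the `score` field is a precomputed placeholder for δ, duplicates are detected by comparing whitespace- and case-normalised text instead of the L1-based similarity of Eq. (5), and the function names and sample reviews are hypothetical.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())

def filter_reviews(reviews):
    """Keep reviews with a non-zero score, dropping second and later occurrences.

    Each review is a dict with 'customer', 'text' and 'score' keys, where
    'score' stands in for the filter score delta in [0, 1].
    """
    seen = set()
    kept = []
    for review in reviews:
        if review["score"] == 0:
            continue                      # no significant feature or opinion
        key = normalise(review["text"])
        if key in seen:
            continue                      # second occurrence: treated as duplicate
        seen.add(key)
        kept.append(review)
    return kept

# Hypothetical sample: one duplicate and one zero-score review
reviews = [
    {"customer": "c1", "text": "Great battery life", "score": 0.8},
    {"customer": "c2", "text": "great  battery LIFE", "score": 0.8},
    {"customer": "c3", "text": "ok", "score": 0.0},
    {"customer": "c4", "text": "Camera is sharp, price is fair", "score": 0.6},
]

cleaned = filter_reviews(reviews)
print([r["customer"] for r in cleaned])
```

In a Spark setting the same logic would map naturally onto a `filter` followed by a keyed reduction that keeps one review per key, but that mapping is a design suggestion and is not described in the text.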

Prediction classifiers
Logistic regression and Support Vector Machine classifiers are the supervised machine learning approaches used in the proposed work for product pre-launch prediction.

Logistic regression (LR)
We have implemented the proposed model using logistic regression analysis for prediction. The model predicts the failure or success of a new product in the market by analysing selected product features from customer reviews. A case study has been conducted using the dataset of customer reviews of mobile phones. Success or failure is the predicted (target) variable used for training and testing. 75% of the dataset is used for training the model and the remaining 25% for testing it.
• Let $p$ be the prediction variable value, assigned 0 for failure and 1 for success.
• $p_0$ is the constant value.
• $b$ is the logarithmic base value.
The logit function is defined first, and the Logistic regression value γ is then given by Eq. (7). The probability value γ lies in the range [0,1]. In this work, if this value is greater than 0.5, the pre-launch prediction for the product is considered a success; for values less than 0.5, it is considered a failure.
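The decision rule above can be made concrete with a small, self-contained sketch. It is a plain gradient-descent logistic regression with natural base e (the text allows a general base b), not the Spark-based implementation used in the paper. The toy feature vectors, the learning rate, the number of epochs and the function names are assumptions made for illustration; only the 75%/25% split and the 0.5 threshold come from the text.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b, threshold=0.5):
    """Return (class, gamma): class 1 (success) if gamma > threshold, else 0 (failure)."""
    gamma = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return (1 if gamma > threshold else 0), gamma

# Hypothetical products: [share of positive reviews, mean rating / 5] and outcome
data = [
    ([0.90, 0.90], 1), ([0.20, 0.30], 0),
    ([0.80, 0.85], 1), ([0.10, 0.20], 0),
    ([0.85, 0.80], 1), ([0.30, 0.25], 0),
    ([0.75, 0.90], 1), ([0.15, 0.35], 0),
]

split = int(0.75 * len(data))            # 75% training, 25% testing
train, test = data[:split], data[split:]
w, b = train_logistic([x for x, _ in train], [y for _, y in train])
predictions = [predict(x, w, b)[0] for x, _ in test]
print(predictions, [y for _, y in test])
```

The split here is positional for reproducibility; on real review data a shuffled or stratified split would usually be preferable.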

Support Vector Machine (SVM)
SVM is a supervised machine learning method that learns from a set of data to acquire new knowledge. This classification method learns the relationship between the data features ($z_i$) and their class $y_i$, which can then be applied to predict whether the product belongs to the success class or the failure class.
• For a set $T$ of $t$ training feature vectors, $z_i \in \mathbb{R}^D$, where $i = 1$ to $t$.
• Let $y_i \in \{+1, -1\}$, where +1 denotes the product success class and −1 denotes the product failure class.
• The data separation occurs in the real numbers, denoted as $X$, in the $D$-dimensional input space.
• Let $w$ be the normal vector of the hyperplane, where $w \in X^D$.
The hyperplane is placed so that the distance between it and the nearest vectors of the two classes is maximal. The decision hyperplane is calculated on this basis, and the conditions for the training dataset, with $d \in X$, are given by Eqs. (9) and (10). To maximize the margin, the norm $\lVert w \rVert$ should be minimized. The products in the positive class (+1) [from Eq. (9)] are considered successful products, and those in the negative class (−1) [from Eq. (10)] are in the failure class.
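For reference, the standard hard-margin SVM formulation can be written with the symbols defined above. This is the textbook form, given here under the assumption that $d$ denotes the offset (bias) of the hyperplane; it may differ in notation from the paper's own equations.

```latex
\begin{aligned}
&\text{Decision hyperplane:} && w \cdot z + d = 0 \\
&\text{Success class:}       && w \cdot z_i + d \ge +1 \quad \text{if } y_i = +1 \\
&\text{Failure class:}       && w \cdot z_i + d \le -1 \quad \text{if } y_i = -1 \\
&\text{Margin:}              && \frac{2}{\lVert w \rVert} \\
&\text{Training problem:}    && \min_{w,\,d} \ \tfrac{1}{2}\lVert w \rVert^{2}
   \quad \text{subject to} \quad y_i\,(w \cdot z_i + d) \ge 1, \quad i = 1, \ldots, t
\end{aligned}
```

Because the margin equals $2 / \lVert w \rVert$, maximizing the margin is equivalent to minimizing $\lVert w \rVert$, which is why the two class conditions can be merged into the single constraint $y_i (w \cdot z_i + d) \ge 1$.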

Experimental setup
The proposed system was implemented using the Apache Spark 2.2.1 framework. PySpark version 2.1.2, the Spark Python API, was used for application development. An Ubuntu server running the Apache web server with the Web Server Gateway Interface is used. Amazon Web Services is used to run some components of the software system on large servers (nodes), each having two Intel Xeon E5-2699V4 2.2 GHz processors (vCPUs) with 4 cores and 16 GB of RAM, on different Spark cluster configurations. The software components can be configured according to the scalability requirements and can run on separate servers.

Results and discussions
To evaluate our prediction system, several case studies have been conducted. Support Vector Machine and Logistic Regression classifiers are employed to perform the prediction. The most significant customer review features are used to analyse the system performance. Prediction accuracy is taken as one of the system design factors; system response time is another major concern for a big data processing system. For customer review feature identification, we propose the Feature Information Gain and DMRDF approach to identify significant features and to eliminate redundant customer reviews from the input dataset.
Figure 2 illustrates the significant features required for mobile phone sustainability. Customer reviews and ratings of 7 brands of mobile phones are identified and evaluated with DMRDF using SVM and LR. The graph shows the significant features identified by the model against the percentage of customers whose reviews are analysed. Internal storage was identified as a significant feature by 88% of the customers, and product price by 79%. With this evaluation, customer requirements for a product can be analysed more effectively, so that the design of the product can be optimized for better product quality and for product sustainability in the industry.
Figure 3 compares the processing time taken by the proposed model for different dataset sizes against that of the state-of-the-art techniques. The DMRDF method takes less time to complete the application than the Gini-index and latent semantic analysis methods. Hence the proposed model is fast and scalable, and it provides high-speed processing performance with large datasets. This shows the applicability of DMRDF in big data analytics, whereas the processing time of the Gini-index and LSA-based methods is larger for large volumes of data. From Fig. 3 it can be seen that with a 9 GB dataset, the time taken for prediction using the LSA-based model, the Gini-index model and the DMRDF model is 342 s, 495 s and 156 s respectively. With an 18 GB dataset, the corresponding times are 740 s, 910 s and 256 s. For the Gini-index and LSA-based methods, the time taken for the 18 GB dataset is roughly twice that for the 9 GB dataset (about 1.8 and 2.2 times respectively). For the DMRDF model, the time taken for the 18 GB dataset is only 1.6 times that for the 9 GB dataset, and it is more than 3 times lower than that of the Gini-index method. The DMRDF model therefore has an advantage over the other state-of-the-art techniques in application execution and performance.
The reliability of the methods considered for pre-launch prediction depends on precision [44], recall and prediction accuracy. Table 5 shows a comparison of the precision, recall and accuracy measures of the DMRDF, Gini-index and LSA-based methods with Support Vector Machine and Logistic Regression classifiers using the customer reviews dataset over a period of 24 months. The results are shown in Table 3. Using DMRDF with the SVM and LR classifiers, the variations in prediction accuracy are smaller than with the LSA-based and Gini-index methods. Hence DMRDF outperforms the other two methods for customer review feature prediction.
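The scaling factors quoted for Fig. 3 can be checked directly from the reported processing times. The short calculation below uses only the six timings given in the text; the variable names are arbitrary.

```python
# Processing times in seconds reported for Fig. 3: (9 GB dataset, 18 GB dataset)
timings = {
    "LSA-based": (342, 740),
    "Gini-index": (495, 910),
    "DMRDF": (156, 256),
}

# Growth in processing time when the dataset size doubles from 9 GB to 18 GB
growth = {name: round(t18 / t9, 2) for name, (t9, t18) in timings.items()}

# How many times faster DMRDF is than the Gini-index method at each size
speedup_vs_gini = {
    "9 GB": round(timings["Gini-index"][0] / timings["DMRDF"][0], 2),
    "18 GB": round(timings["Gini-index"][1] / timings["DMRDF"][1], 2),
}

print(growth)
print(speedup_vs_gini)
```

A growth factor below 2 means that processing time rises more slowly than the data volume; by this measure DMRDF shows the smallest growth of the three methods.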
Furthermore, Fig. 4 shows the DMRDF, LSA-based and Gini-index approaches as applied to the customer review and rating datasets for 3, 6, 12, 18 and 24 months. In DMRDF, many features may appear in different customer review aspects; hence the performance evaluation does not consider duplicate customer reviews. In the Gini-index approach, features are extracted based on the polarity of the reviews, and P@R and R@R are lower for large datasets. The results show that the DMRDF method outperforms the other two methods in big data analysis. The Gini-index approach does not perform well in customer review feature prediction.

Conclusion and future work
Technological development in this era brings new challenges in artificial intelligence, such as prediction, which is the next frontier for innovation and productivity. This work proposes a scalable and reliable big data processing model that identifies significant features and eliminates redundant data using the Feature Information Gain and Distributed Memory-based Resilient Dataset Filter method with Logistic Regression and Support Vector Machine prediction classifiers. The analysis has been compared with state-of-the-art techniques such as the Gini-index and LSA-based approaches. The DMRDF method outperforms the other methods in prediction accuracy, precision and recall. The results show that the prediction accuracy of the proposed method increases by 10% compared to state-of-the-art techniques, owing to significant feature identification and elimination of redundancy from the dataset. Large feature dimensionality reduces the prediction accuracy of the LSA-based method, whereas the number of significant features plays an important role in prediction modelling. The results show that the proposed DMRDF model is scalable: with a huge volume of data, the model performs well and the time taken to process the application is less than that of state-of-the-art techniques. Because of its resilience property, DMRDF has a long lineage and can therefore achieve fault tolerance. The DMRDF model is fast because of its in-memory computation. The proposed design can be extended to other big data processing domains that involve product feature identification. As future work, the model may be developed to make real-time streaming predictions through a unified API that concurrently searches customer comments, ratings and surveys from different reliable online websites to obtain a synthesis of sentiments with an information fusion approach. Since the statistical properties of customer reviews and ratings vary over time, the performance of machine learning algorithms can also degrade. To cope with these limitations, deep learning matrix factorization integrated with DMRDF can be adopted.