Sentiment analysis for cruises in Saudi Arabia on social media platforms using machine learning algorithms

Al sari, Bador; Alkhaldi, Rawan; Alsaffar, Dalia; Alkhaldi, Tahani; Almaymuni, Hanan; Alnaim, Norah; Alghamdi, Najwa; Olatunji, Sunday O.

doi:10.1186/s40537-022-00568-5

Research
Open access
Published: 18 February 2022

Bador Al sari ORCID: orcid.org/0000-0003-0421-4981³,
Rawan Alkhaldi¹,
Dalia Alsaffar¹,
Tahani Alkhaldi¹,
Hanan Almaymuni¹,
Norah Alnaim¹,
Najwa Alghamdi² &
…
Sunday O. Olatunji³

Journal of Big Data volume 9, Article number: 21 (2022)

Abstract

Social media has great importance in the community for discussing many events and sharing them with others. The primary goal of this research is to study the quality of the sentiment analysis (SA) of impressions about Saudi cruises, as a first event, by creating datasets from three selected social media platforms (Instagram, Snapchat, and Twitter). The outcome of this study will help in understanding the opinions of passengers and viewers about their first Saudi cruise experiences by analyzing their feelings from social media posts. After cleaning, the dataset contains 1200 samples. The data was classified into positive or negative classes using a selection of machine learning algorithms: multilayer perceptron (MLP), naïve Bayes (NB), random forest (RF), support vector machine (SVM), and voting. The results show the highest classification accuracy for the RF algorithm, which achieved 100% accuracy with over-sampled data from Snapchat using both test options. The algorithms were compared among the three different datasets. All algorithms achieved a high level of accuracy. Overall, the results show that 80% of the sentiments were positive while 20% were negative.

Introduction

Social media has undergone significant development in recent years; thus, a huge amount of information is in circulation. Various websites have been developed through which users can express their opinions and share their content. This is especially the case with the expansion of social networks (blogs, forums, and social media) in which the content is usually subjective and loaded with opinions and ratings. This kind of information can be very useful for recommending products or brands [1]. First, there is Twitter, which is a microblogging service that allows small blog posts called Tweets to be sent and received [2]. Second, Snapchat is a mobile messaging app for sharing temporary photos and videos called Snaps that disappear after viewing [3]. One of Snapchat’s most important features is the Snap Map, which displays a real-time location for anyone who submits a snap to the map. The third platform is Instagram, which is commonly used to post photos and videos in order to share them with followers who can comment on or ‘like’ these posts. Social media provides an enormous amount of data. As a result, there is a need for data mining, which enables analysis of social media data and user sentiments by seeking their opinions on specific topics. Saudi Arabia presented plans to change the course of its tourism sector through the development of Vision 2030, which was announced on 25 April 2016 by Crown Prince Mohammad bin Salman [4]. One of these plans is to invest in tourism by launching various events to attract visitors. Some of these events are unprecedented for the Saudi population, as they have been offered for the first time. By offering tourist visas, Saudi Arabia also opened its doors to visitors from many countries for the first time. Saudi Arabia received 24,000 international visitors during the first ten days of applications for immediate tourist visas [5].
This study aims to analyze passengers’ and viewers’ opinions about cruise entertainment, which is the first of its kind in the Kingdom of Saudi Arabia and was launched while the pandemic was affecting the economy [6, 7]. The sentiment analysis (SA) process is the systematic identification, extraction and quantification of affective states and subjective information using natural language processing [8]. The process began with the collection of opinions as textual data from several social media platforms. The platforms used are Instagram, Snapchat and Twitter, because of their popularity in Saudi Arabia [9].

The opinions of this Red Sea Saudi cruise were analyzed and classified into negative and positive classes. To the best of our knowledge, this research is one of the few studies that classifies emotions by applying machine learning (ML) algorithms to Arabic datasets. This is because of the difficulty of finding logical results and the need for longer pre-processing steps. Furthermore, this study was launched during the Covid-19 pandemic. We study the quality of the sentiment analysis by various ML algorithms for the three selected social media platforms. Five of the most popular ML algorithms were applied: multilayer perceptron (MLP), Naïve Bayes (NB), random forest (RF), support vector machine (SVM), and the voting ensemble algorithm. These algorithms were used to classify opinions about the cruise. Each algorithm relies on a unique method for making predictions. Likewise, ML algorithms were chosen due to the size of the dataset. Finally, a comparison is made to evaluate the efficiency of these models in classifying textual data in the Arabic language.

The remainder of this paper is divided into five sections. The “Literature review” section covers related work on SA in tourism. Next, the proposed techniques in this paper are presented, followed by the empirical and experimental studies, after which the results are discussed. Finally, the conclusion is presented along with ideas for potential future work.

Literature review

In this section, a literature review of the relevant research is provided. The research is summarized and classified based on the platform type used.

Instagram

In [10], the authors studied the criteria for expressing feelings on social media, specifically on Facebook, Twitter, Instagram and WhatsApp, and compared the platforms’ efficacy for expressing six separate feelings. Through the analysis of the samples and the procedures, the results for expressing negative feelings show WhatsApp to be most suitable, followed by Facebook, Twitter, and Instagram. For expressing positive feelings, perceived suitability was highest for WhatsApp, followed by Instagram, Facebook and Twitter. The study only provides a comparative analysis among these four platforms. In another study [11], the authors addressed the problem of predicting the success of music albums by investigating various data sources from social media to mainstream American newspapers. The principal technique applied was the RF approach, which predicted results with an accuracy of 94%. There are limitations regarding the shortness of the data collection period, which is only one month. In [12], the authors explored the use of Instagram to promote tourism destinations in Indonesia. By exploring users’ perceptions using in-depth conversations and interviews with visual styles and image-induction techniques, they tried to describe the potential value of Instagram for promoting tourism sites in Indonesia. They found everyone tried to promote their own cities in their own ways, with Instagram providing complete communication facilities from tourism brands to allow user-generated photographic content. The research was limited to the Instagram platform’s contribution to the development of tourist destinations.

Snapchat

In recent research [13], the authors investigated data posted to our story on the Snap Map. They collected photos and videos, and applied statistical and deep learning techniques to SA. The data were gathered during three events in Riyadh Tourist Season. Their results indicated the capacity for SA through Snapchat. The authors of [14] analyzed combined data from a questionnaire, Snapchat, and Google Maps. They looked into lexicon-based and ML approaches. The research results revealed that celebrities on Snapchat impact people’s choices of restaurants. In [3], the authors researched how US media uses Snapchat to reach young audiences. The chief technique applied was interviews and content analysis. The principal result showed that publishers on Snapchat Discover are embracing the capabilities of Snapchat, and adapting media types and story themes using visuals. Results also showed that the media retains its own character in judging the news. The system only dealt with the use of Snapchat Discover. Piwek and Joinson [15] ran an online survey using the memory sampling method to inquire into details of a recent photo sent by every Snapchat participant. Results showed that they already share ‘avatars’ and ‘creative logo graphics’, and often use them at home mainly as an easier and more fun way of reaching friends. In [8], the authors performed SA on social media textual data as a rich source of opinions. These textual views were classified into four categories based on their level of extremeness: low, high, moderate and neutral. To classify the data, multinomial NB and linear SVM classifier algorithms were used. The results showed that the SVM algorithm was the most accurate classifier with an accuracy of 82%.

Twitter

In [16], the authors analyzed tweets collected in the Arabic language and compared different algorithms using SA with different n-grams as a method for feature extraction. The performance of the algorithms was evaluated by measuring accuracy, precision, recall, and f-measure. The result showed a 99.96% accuracy with unigram.

Also, Heikal et al. [17] explored a deep learning model for application to Arabic data in order to improve the accuracy of Arabic SA. The fundamental techniques were CNN and long short-term memory (LSTM) models. The major result of this study was that the model achieved an F1 score of 64.46%, which outperformed the modern deep learning model’s F1 score of 53.6% for the Arabic sentiment dataset. The system was limited to analyzing sentiments from Twitter data only. In [18], the authors conducted an SA of social media. They applied the NB method and Google Prediction API. The accuracy achieved and the macro-F-measure were 90.21% and 89.98%, respectively. The main finding evaluated the classification performance by comparing it with predictions of the winner of the 2016 US election. However, only Twitter data were used. Furthermore, the authors of [19] discussed social media analysis using Twitter data relating to cruises, representing it in three categories of user group: commercial, news/blogs, and private. Block analysis was the key method used after applying three distinct techniques: word repetition, content analysis, and network analysis. Results showed tourists are less influential than celebrities, and celebrity influence is one of the marketing strategies that is relied upon nowadays. The data collection period was short, so the analysis remains mainly exploratory. In [20], the authors proposed hybrid algorithms to discover people’s opinions from their Twitter posts. The primary technique, the polarity classification algorithm, contained three stages for classifying 2,116 tweets into positive, negative, or neutral groups. The central finding was that this achieved a greater accuracy than other algorithms for the same dataset. The paper evaluated the algorithm by using different metrics, although the authors did not indicate the keywords or the data collection period.
In [21], the authors conducted SA of tweets to understand the effect of the COVID-19 pandemic on the cruise industry, and mined semantic time-series data from social media. They computed the adjusted sentiment score for each tweet posted between 1 February and 18 June 2020. The main finding was that there are two groups: the first suffered from quarantine and limits on travel because of COVID-19, which made them even more eager to travel and explore, and the second was interested in cruise tourism, possibly shifting from mass cruises to niche cruises.

Other social networks

In a recent paper [22], the authors analyzed reviews on the TripAdvisor website. They applied multi-classification to get high performance of the SVM algorithm, NB over-sampling, Word2vec, and Knowledge Graph. The best result achieved was a recall of 0.901. As for places, the Tower of London was the best. Banati et al. [23] analyzed the emotions expressed by users about their experiences while traveling. Opinion mining was applied to reviews from the TripAdvisor website, which were extracted using a web crawler in Python. The extracted reviews were classified as positive or negative at different levels: document level, sentence level, and feature/entity level. Classification for multiple entities at the document level could not be linked under the same category. In addition, they evaluated the performance of seven ML algorithms, such as RF, RT, NB, and OneR. The best accuracy achieved was for RF at 88.25%, while OneR provided the lowest result, with an accuracy of 68.1%. In addition, the authors of [24] considered the problem of the glut of information on the Internet discovered while mining reviews from travel blogs. They applied NB and SVM, with the main finding being that the SVM model with N-gram achieved excellent results. However, the system only dealt with the use of sentiment classification for reviews.

Brida et al. [25] considered the experiences of passengers on cruises and their features. The main technique applied was a decision tree (DT), with the authors analyzing data from 1361 responses collected through a questionnaire over three months in 2009. The main finding for the applied DT was an accuracy of 67.6%. However, the authors observed that the lower the characteristics were, the more accurate was the prediction. The paper does not consider different types of evaluation nor comparison of algorithms. In study [26], the authors focused on the SA of multilingual textual data from social media to discover the intensity of the sentiments for extremism. They proposed a manual method that effectively found extreme sentiment from multilingual data by creating a new multilingual lexicon or dictionary. Experiments were performed for supervised and unsupervised algorithms. The greatest accuracy achieved for SVM supervised was 82%, while for KNN unsupervised, the best accuracy was 26%.

In [27], the authors presented a data-driven approach to analyze data about trips from location-based social networks (LBSN). The study aimed to discover the mobility pattern for how tourists would travel the world. Moreover, they presented two applications to use the data from each trip. First, travelers were clustered in terms of the Twitter and Foursquare datasets, which obtained three clusters for Twitter and six for Foursquare. The second application area was the spatial clustering of destinations throughout the world. They identified 942 regions as destinations that can be directly used in a regional model for a destination recommender system. However, the results might have been affected by travelers’ continuously location-sharing their LBSNs, resulting in out-of-date datasets from Foursquare and Flickr. Table 1 shows a summary of the relevant studies. It is clear from the literature review that there are many studies in the SA field that have reported useful results. Nevertheless, the literature lacks comparative studies that use different social media platforms to analyze tourist impressions of new tourism events. In this work, the performance of several popular ML algorithms (MLP, SVM, RF, NB, and Voting) is compared based upon their accuracy rates. The experiment is tested using two test options: 10-fold cross-validation and a 70% split.

Table 1 Summary of previous studies

Description of proposed techniques

This section describes the implemented algorithms: MLP, SVM, RF, NB and Voting.

Multilayer perceptron

The MLP builds on the perceptron model that M. Minsky and S. Papert analyzed in 1969. This algorithm consists of a neural network that contains multiple layers of nodes. The layers are subdivided into three categories: input layer, hidden layers, and output layer. The algorithm processes data by passing it from the input layer through the hidden layers to the output layer to obtain the classification results [28]. Figure 1 shows the network configuration of the algorithm, explaining the connections and nodes between the layers.

The input data is fed into the input layer and the extracted data is delivered to the output layer. The hidden layers are layers of nodes between the input and output layers, and there may be one or more of these layers, which perform nonlinear transformations on the inputs entered into the network. They are layers of mathematical functions, each designed to produce an output specific to an intended result. The connections between the layers are called weights (W), which are normally defined between 0 and 1. The output value of each neuron is calculated in two subsequent stages as follows. In the first stage, the weighted summation of the input values is calculated using Eq. (1):

$$\forall l \in \left\{ {1,2, \ldots ,j} \right\},h_{l} = \sum\limits_{i = 1}^{m} {W_{il}^{H} I_{i} } + \beta_{l}^{H}$$

(1)

where $I_{i}$ is input variable i, $W_{il}^{H}$ is the connection weight between input neuron i and hidden neuron l, m is the total number of inputs, and $\beta_{l}^{H}$ is the bias of the lth hidden neuron. In the second stage, the output value of each neuron in the hidden layer is calculated based on the weighted summation using an activation function, as in Eq. (2):

$$\forall l \in \left\{ {1,2, \ldots ,j} \right\},H_{l} = sigmoid\left( {h_{l} } \right) = \frac{1}{{1 + e^{{ - h_{l} }} }}$$

(2)

The final output is calculated as in Eqs. (3) and (4) [29]:

$$\forall k \in \left\{ {1,2, \ldots ,n} \right\},o_{k} = \sum\limits_{l = 1}^{j} {W_{lk}^{o} H_{l} } + \beta_{k}^{o}$$

(3)

$$\forall k \in \left\{ {1,2, \ldots ,n} \right\},O_{k} = sigmoid\left( {o_{k} } \right) = \frac{1}{{1 + e^{{ - o_{k} }} }}$$

(4)
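
The two-stage computation in Eqs. (1) to (4) can be sketched in a few lines of Python. This is only an illustration of the forward pass for a single hidden layer, not the WEKA implementation used in the experiments; the function names and the numeric weights are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic activation used in Eqs. (2) and (4)."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through an MLP with a single hidden layer.

    w_hidden[i][l] connects input i to hidden neuron l;
    w_out[l][k] connects hidden neuron l to output neuron k.
    """
    # Eqs. (1)-(2): weighted sum plus bias for each hidden neuron, then sigmoid.
    hidden = []
    for l in range(len(b_hidden)):
        h_l = sum(w_hidden[i][l] * inputs[i] for i in range(len(inputs))) + b_hidden[l]
        hidden.append(sigmoid(h_l))
    # Eqs. (3)-(4): the same two stages for each output neuron.
    outputs = []
    for k in range(len(b_out)):
        o_k = sum(w_out[l][k] * hidden[l] for l in range(len(hidden))) + b_out[k]
        outputs.append(sigmoid(o_k))
    return hidden, outputs

# Illustrative values only: 2 inputs, 2 hidden neurons, 1 output neuron.
hidden, outputs = mlp_forward(
    inputs=[1.0, 0.0],
    w_hidden=[[0.5, 0.2], [0.1, 0.9]],
    b_hidden=[0.0, 0.0],
    w_out=[[0.3], [0.7]],
    b_out=[0.0],
)
```

Because the output passes through a sigmoid, each output value lies strictly between 0 and 1 and can be thresholded to obtain a class label.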

Naive Bayes

NB is a method that uses knowledge of statistics and probabilities and depends on the implementation of Bayes’ theorem. Figure 2 shows how the probabilistic model provides the probability distribution of an instance over a set of classes. In addition, C is the instance where X1…Xn are the classes, and each probability should be calculated with all classes. This model is the opposite of the deterministic model, which only outputs whether an instance belongs to the positive or the negative class [30].

The mathematical expression for Bayes’ theorem [30] is as follows in Eq. (5):

$$P\left( {A\left| B \right.} \right) = \frac{{P\left( {\left. B \right|A} \right) \cdot P\left( A \right)}}{P\left( B \right)}$$

(5)

In the NB classifier, all attributes are assumed to be independent of one another given the value of the class variable, as in Eq. (6):

$$P\left( {F\left| c \right.} \right) = P\left( {f_{1} ,f_{2} , \ldots ,f_{n} \left| c \right.} \right) = \prod\limits_{i = 1}^{n} {P\left( {f_{i} \left| c \right.} \right)}$$

(6)

This algorithm is the easiest and fastest of the Bayesian models [30]. When coupled with kernel density estimation, it can attain higher levels of accuracy. It works by assuming that all the attributes are independent and affect the results separately [31]. In addition, this classifier is highly scalable, requiring a number of parameters that is linear in the number of variables.
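
As a rough illustration of how Eqs. (5) and (6) turn into a text classifier, the sketch below counts word frequencies per class and picks the class with the highest posterior score. The toy documents, the function names, and the add-one (Laplace) smoothing are assumptions made for this example; they are not taken from the experiments in this paper.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(words, class_counts, word_counts, vocab):
    """Return the class c maximizing log P(c) + sum_i log P(f_i | c)."""
    total = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        n_c = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total)  # prior P(c)
        for w in words:
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            score += math.log((word_counts[c][w] + 1) / (n_c + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Toy data: 'y' marks positive and 'x' marks negative sentiment.
docs = [["great", "trip"], ["lovely", "cruise"], ["bad", "prices"], ["great", "cruise"]]
labels = ["y", "y", "x", "y"]
model = train_nb(docs, labels)
prediction = predict_nb(["bad", "prices"], *model)
```

Working in log space replaces the product in Eq. (6) with a sum, which avoids numerical underflow when a document contains many words.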

Random forest

RF is an ensemble classification method. It is designed as a series of classifiers that take a vote on their forecasts in order to classify the data [32]. These classifiers are tree-structured and split each node using the best split found among a random subset of the predictors [33]. In addition, the trees grow using a random set of features. Figure 3 shows the structure of a RF. The trees run in parallel with no interaction between them. During training, the algorithm first picks k random data points from the training set and then builds a decision tree on those points. After that, these two steps are repeated until the chosen number of trees, N, has been built. Ultimately, each of the N trees predicts the value of the output, y, for a new data point.

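The process is repeated for each new data point, and the predictions of the individual trees are aggregated (by majority vote for classification, or by averaging for regression) and assigned as the predicted value, y.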
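
The training and voting steps described above can be sketched as follows. To keep the example short, each "tree" is a one-split decision stump, which is a simplification and not what WEKA's RF builds; the function names, the toy data, and the default parameter values are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature_ids):
    """Fit a one-split tree using only the given feature indices."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # the sample offered no usable split: predict its majority label
        return (None, None, majority(y), majority(y))
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    if f is None:
        return l_lab
    return l_lab if row[f] <= t else r_lab

def fit_forest(X, y, n_trees=25, k=None, n_features=1, seed=0):
    """Build n_trees stumps, each on k bootstrap points and a random feature subset."""
    rng = random.Random(seed)
    k = k or len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(k)]   # k random data points
        feats = rng.sample(range(len(X[0])), n_features)  # random set of features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Classify by majority vote over the individual trees."""
    return majority([predict_stump(s, row) for s in forest])

# Toy, clearly separable data: 'x' = negative, 'y' = positive.
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = ["x", "x", "x", "y", "y", "y"]
forest = fit_forest(X, y)
```

Each tree sees a different random sample and a different feature subset, so individual trees may err, but the vote over many trees is more stable than any single one.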

Support vector machines

SVM is a supervised learning algorithm that is mathematically well-founded [32] and is similar to logistic regression. Figure 4 shows how the algorithm works by dividing the sample into two classes with a separating hyperplane. The few samples that lie closest to the hyperplane are called support vectors, and the distance between the hyperplane and these nearest training points is called the margin. SVM is recommended for linear model problems. However, kernel functions allow SVM to solve nonlinear problems as well. Linear, polynomial, and radial basis function kernels are given in Eqs. (7), (8) and (9) respectively. Linear kernel:

$$k\left( {x_{i} ,x_{j} } \right) = x_{i}^{T} x_{j}$$

(7)

Polynomial kernel:

$$k\left( {x_{i} ,x_{j} } \right) = \left( {1 + x_{i}^{T} x_{j} } \right)^{p}$$

(8)

Radial Basis Function kernel:

$$k\left( {x_{i} ,x_{j} } \right) = e^{{ - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{{2\delta^{2} }}}}$$

(9)
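
For concreteness, the three kernels in Eqs. (7) to (9) can be written directly as Python functions. This is a minimal sketch over plain lists; the function names and the default values of `p` and `delta` are chosen here for illustration only.

```python
import math

def linear_kernel(xi, xj):
    """Eq. (7): inner product of the two vectors."""
    return sum(a * b for a, b in zip(xi, xj))

def polynomial_kernel(xi, xj, p=2):
    """Eq. (8): (1 + xi . xj) raised to the degree p."""
    return (1 + linear_kernel(xi, xj)) ** p

def rbf_kernel(xi, xj, delta=1.0):
    """Eq. (9): exp(-||xi - xj||^2 / (2 * delta^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2 * delta ** 2))

k_lin = linear_kernel([1, 2], [3, 4])       # 1*3 + 2*4 = 11
k_poly = polynomial_kernel([1, 2], [3, 4])  # (1 + 11)^2 = 144
k_rbf = rbf_kernel([0, 0], [1, 1])          # exp(-2 / 2) = exp(-1)
```

The RBF kernel equals 1 when the two points coincide and decays toward 0 as they move apart, with `delta` controlling how quickly.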

Voting

Ensembling is a method that uses multiple algorithms together to classify and predict classes. This method is used to improve on the performance obtained from each learning algorithm separately [34]. Furthermore, there are many types of ensemble learning, such as bagging, bootstrapping, stacking and voting [35]. Ensemble voting is used by meta-classifiers to combine ML algorithms by summing the predictions or averaging the predictions made by regression models [36]. Moreover, this classifier is used to aggregate the classes of weak algorithms [37, 38]. Figure 5 shows how this technique sums each classifier with its predicted probabilities to be combined with other classifiers, taking the average for better results.

Finally, the Voting algorithm follows the principle given in Eq. (10) [34],

$$\hat{y} = \arg \mathop {\max }\limits_{i} \sum\limits_{j = 1}^{m} {w_{j} p_{ij} }$$

(10)

where $w_{j}$ is the weight assigned to the jth classifier and $p_{ij}$ is the probability that classifier j predicts for class i. For a binary classification task, the class label is $i \in \{0, 1\}$.
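
A small worked example may help to show how Eq. (10) behaves. The sketch below computes the weighted sum of predicted probabilities per class and returns the class with the largest score; the function name, the probabilities, and the weights are made up for illustration.

```python
def soft_vote(probas, weights=None):
    """Weighted soft voting as in Eq. (10).

    probas[j][i] is the probability that classifier j assigns to class i.
    Returns the winning class index and the per-class weighted scores.
    """
    m = len(probas)
    weights = weights or [1.0] * m
    n_classes = len(probas[0])
    scores = [
        sum(weights[j] * probas[j][i] for j in range(m))
        for i in range(n_classes)
    ]
    winner = max(range(n_classes), key=scores.__getitem__)
    return winner, scores

# Three hypothetical classifiers on a binary task (class 0 vs class 1).
probas = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]

equal_winner, equal_scores = soft_vote(probas)                  # scores 1.75 vs 1.25
weighted_winner, weighted_scores = soft_vote(probas, [1, 3, 3])  # scores 3.45 vs 3.55
```

With equal weights the single confident classifier carries the vote for class 0, whereas giving the other two classifiers three times the weight tips the decision to class 1. This shows why the choice of weights matters.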

Methodology

This section presents the methods and tools used for data collection and mining from the social networks Instagram, Snapchat, and Twitter. Figure 6 shows the framework for the data mining process, beginning with collecting data from the three platforms, then extracting it, and finishing with the classifying and predicting process.

Description of dataset

After collecting the data, several features were selected to create a database: gender, text, and class. Table 2 shows the type and utility of each property. The features were chosen according to the content available on each social media platform, in order to compare them.

Table 2 Description of each feature

Experimental setup

In this study, the performance of the implemented ML algorithms is assessed experimentally, and a comparison is made between five ML algorithms: SVM, RF, NB, MLP, and Voting. The algorithms are applied to the extracted textual data, which is written in Arabic. In addition, the algorithms are tested using the Waikato Environment for Knowledge Analysis (WEKA), applying 10-fold cross-validation and a 70% split as test options on the imbalanced, over-sampled and under-sampled data [39].

Cross-validation is a method for evaluating predictive models that divides the original sample into a training set and a test set for training and evaluating the model. Figure 7 depicts the data partitioning in ten folds, which implies that the entire data was randomly partitioned into ten parts, nine of which were used to train the model and one used for testing. After that, the process was repeated ten times, with the error being determined each time. The mean of the errors obtained in each iteration is the model’s total error [39]. The other test option splits the dataset directly: this research used 70% of the data as the training dataset and the rest for testing.
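
The two test options can be sketched as index-splitting helpers. This is not WEKA's implementation (for instance, no stratification by class is performed here); the function names, the seed, and the sample count of 100 are assumptions for illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k roughly equal parts
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

def percentage_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle once, then use the first 70% for training and the rest for testing."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = round(n_samples * train_fraction)
    return idx[:cut], idx[cut:]

folds = list(k_fold_indices(100, k=10))
train_idx, test_idx = percentage_split(100)
```

In cross-validation, the model would be trained and evaluated once per `(train, test)` pair and the ten error values averaged; the percentage split needs only one training run but gives a single, noisier estimate.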

Data collection

We collected data from three platforms, Instagram, Snapchat and Twitter, from the start of September to the end of October. These platforms were chosen because of their diversity. Through Snapchat, we tracked the status of tourists in real time and analyzed their feelings from the snaps they shared on Snap Map. Twitter and Instagram were chosen to analyze the comments of tourists and of non-tourists who were watching the event, and also to compare these different platforms in sentiment analysis using ML algorithms. In addition, these three applications are the most used in Saudi Arabia, according to Global Media Insight [9].

We collected snaps from Snap Map API. The data collection was a real-time process during each trip. The process started with downloading snaps, both pictures and videos, using Python source code. After that, we separated each snap into three layers: textual data, visual content and audio files. The audio files were converted into text using the speech-to-text Python library. The extracted data were recorded in a database file containing snap data from specified map locations in order to build the dataset. On the Twitter platform, we collected the relevant tweets using the Rapid Miner tool to pull out data using keywords. On the Instagram platform, we collected all posts that were either pictures or videos and their comments using hashtags and place tags with the Instaloader tool. Textual data was then manually extracted from these pictures and videos. The keywords used to gather data from Twitter and Instagram were: ‘cruises’, ‘Red Sea’, ‘trip’, ‘prices’, and ‘tourism’.

Pre-processing

After collecting the data, pre-processing was applied to clean the data of noise. This is the most important factor that can make a difference between a good ML model and a poor one. It attempts to fill in missing values and to smooth out the noise in data. Table 3 shows the number of samples in each platform’s dataset before and after cleaning.

Table 3 Number of datasets before and after pre-processing stage

One of the most interesting findings was that interaction on the platforms was highest on Instagram, followed by Twitter, then Snapchat, as presented in Fig. 8.

Missing data

Ignore the attribute: attributes such as nationality were ignored because they contain several rows with null values.
Fill in a missing value manually: for a categorical feature column such as gender, we treat missing data as a category in itself by replacing the missing values with ‘NA’, ‘Unknown’, or some other relevant term.
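
The two strategies for missing data might look like this in Python. The record layout, the function name, and the placeholder value are hypothetical; the actual cleaning in this study was not necessarily done this way.

```python
def handle_missing(records, drop=("nationality",), fill=("gender",), placeholder="Unknown"):
    """Drop sparse attributes and give missing categorical values their own category."""
    cleaned = []
    for rec in records:
        # Strategy 1: ignore attributes that are mostly null.
        row = {key: value for key, value in rec.items() if key not in drop}
        # Strategy 2: replace missing categorical values with a placeholder category.
        for col in fill:
            if row.get(col) in (None, ""):
                row[col] = placeholder
        cleaned.append(row)
    return cleaned

# Hypothetical records with the features named in the dataset description.
records = [
    {"gender": "F", "text": "sample post", "class": "y", "nationality": None},
    {"gender": None, "text": "another post", "class": "x", "nationality": None},
]
cleaned = handle_missing(records)
```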

Noise

Noise consists of slightly erroneous observations that do not comply with the trend or distribution of the rest of the data. Though each error may be small, noisy data collectively results in a poor ML model. Noise in data can be minimized or smoothed out by removing the items listed below:

Arabic diacritics.
Repeated letters such as “Noooo”.
Any irrelevant data.
Numbers such as “123”.
Elongation.
Punctuation marks such as ". ! ? - _ * [ ] : ; / ( )".
Text in any language other than Arabic, so that only Arabic data remain.
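
The removal steps listed above can be approximated with a few regular expressions. This is a minimal sketch: the Unicode ranges chosen for Arabic letters and diacritics, the rule that collapses three or more repeated characters into one, and the function name are assumptions, not the exact rules applied in this study.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, short vowels, shadda, sukun
TATWEEL = re.compile(r"\u0640")                      # elongation character
REPEATS = re.compile(r"(.)\1{2,}")                   # same character three or more times
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")       # anything but Arabic letters and spaces

def clean_text(text):
    """Apply the noise-removal steps in sequence and normalize whitespace."""
    text = DIACRITICS.sub("", text)
    text = TATWEEL.sub("", text)
    text = REPEATS.sub(r"\1", text)   # e.g. a letter typed four times becomes one
    text = NON_ARABIC.sub(" ", text)  # drops digits, punctuation and other languages
    return " ".join(text.split())

# Input: an Arabic word with a diacritic and elongation, followed by digits and Latin text.
cleaned = clean_text("\u0631\u0650\u062D\u0640\u0640\u0644\u0629 123 ok!!")
```

The order matters: elongation marks are removed before repeated letters are collapsed, and the non-Arabic filter runs last so that it also catches any leftover punctuation.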

Labeling

This section describes data labeling, also called annotation or tagging. This is the process of preparing labeled datasets for ML. Data samples were detected and tagged to establish a foundation for reliable learning patterns. ML systems often require massive amounts of data based on data features that help the model organize the data into patterns that provide an answer. We conducted SA on the sample and labeled it manually as ‘y’, referring to positive sentiment, and ‘x’, referring to negative sentiment. After collecting data from the three platforms, Instagram, Snapchat and Twitter, we configured a separate dataset for each. A total of 10,922 instances were obtained from all platforms and reduced to 1200 after cleaning. In addition, data analysis results show that most of the sample opinions studied were positive about the Saudi Cruise experience. The numbers of positive opinions totaled 342 out of 514 for Instagram users, 256 out of 284 for Snapchat users and 260 out of 462 for Twitter users. This represents 858 positive opinions from the total for all platforms. Figure 9 illustrates the variation of opinions on the three platforms. The results from the data analysis show that the majority of passengers’ opinions were positive about their cruise experience.

Feature extraction

The n-grams applied using WEKA refer to contiguous sequences of n words in a text string. Single words are known as unigrams (1-grams), and n-grams of higher order correspond to all possible contiguous substrings of n words that can be constructed from a string. Because of their inherent simplicity, n-grams are a desirable option. An n-gram model can capture more context simply by increasing n.

We hypothesized that the addition of n-gram characteristics would allow a classifier to learn richer representations of the underlying text data, and contribute to a concomitant improvement in the output of classification and useful analysis of sentiment.
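
To make the definition concrete, the sketch below extracts word n-grams from a whitespace-tokenized string. It illustrates the idea only and is not WEKA's tokenizer; the function name and the example sentence are hypothetical.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all contiguous word n-grams for n in [n_min, n_max]."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams("the cruise was great")
# unigrams: the, cruise, was, great
# bigrams:  the cruise, cruise was, was great
```

A sentence of four words yields four unigrams and three bigrams, so the feature space grows quickly as n increases.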

Data transformation to address imbalance in datasets

Imbalances in data are one of the common problems in classification. This phenomenon is increasing in importance since it is faced in natural data domains when the number of samples is unequally distributed between classes by a large ratio. In order to solve such imbalances, a dataset needs to be re-sampled using under-sampling or over-sampling. Under-sampling focuses on the majority class by removing samples in order to balance with another class. Conversely, adding samples to the minority class is called over-sampling [40]. Table 4 shows the number of positive (y) and negative (x) opinions in the dataset in terms of imbalance, under-sampling, and over-sampling for all platforms.

Table 4 Number of positive and negative sentiments after data re-sampling
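
The two re-sampling strategies can be sketched with random duplication and random removal. This is an assumption about the simplest possible variant; the re-sampling filter actually used to produce the counts in Table 4 may work differently, and the function name and toy class counts are hypothetical.

```python
import random
from collections import Counter

def resample(samples, labels, mode="over", seed=0):
    """Balance classes by random over-sampling or random under-sampling."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            # Under-sampling: keep a random subset without replacement.
            chosen = rng.sample(items, target)
        else:
            # Over-sampling: add random duplicates of minority samples.
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels

# Toy imbalance: 8 positive ('y') and 2 negative ('x') samples.
samples = list(range(10))
labels = ["y"] * 8 + ["x"] * 2
_, over_labels = resample(samples, labels, mode="over")
_, under_labels = resample(samples, labels, mode="under")
```

Over-sampling keeps every original sample but repeats minority samples, which can inflate accuracy if duplicates end up in both the training and test sets; under-sampling avoids duplication at the cost of discarding data.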