On the development of an information system for monitoring user opinion and its role for the public

Karyukin, Vladislav; Mutanov, Galimkair; Mamykova, Zhanl; Nassimova, Gulnar; Torekul, Saule; Sundetova, Zhanerke; Negri, Matteo

doi:10.1186/s40537-022-00660-w

Research
Open access
Published: 21 November 2022

Vladislav Karyukin¹,
Galimkair Mutanov¹,
Zhanl Mamykova¹,
Gulnar Nassimova¹,
Saule Torekul¹,
Zhanerke Sundetova¹ &
…
Matteo Negri²

Journal of Big Data volume 9, Article number: 110 (2022)

Abstract

Social media services and analytics platforms are growing rapidly. Many events happen almost every day, and the role of social media monitoring tools is increasing accordingly. Social networks are widely used for managing and promoting brands and different services. Thus, most popular social analytics platforms target business purposes, while the monitoring of social, economic, and political problems remains underrepresented and is not covered by thorough research. Moreover, most of them focus on resource-rich languages such as English, whereas social media texts and comments in low-resource languages, such as Russian and Kazakh, are not represented well enough. This work is therefore devoted to developing and applying an information system called the OMSystem for analyzing users’ opinions on news portals, blogs, and social networks in Kazakhstan. The system uses sentiment dictionaries of the Russian and Kazakh languages and machine learning algorithms to determine the sentiment of social media texts. The whole structure and functionalities of the system are also presented. The experimental part is devoted to building machine learning models for sentiment analysis on the Russian and Kazakh datasets. The performance of the models is then evaluated with accuracy, precision, recall, and F1-score metrics, and the models with the highest scores are selected for implementation in the OMSystem. Next, the OMSystem’s social analytics module is used to thoroughly analyze the healthcare, political, and social aspects of the most relevant topics connected with vaccination against the coronavirus disease. The analysis allowed us to discover the public social mood in the cities of Almaty and Nur-Sultan and other large regional cities of Kazakhstan. The study covered two extensive periods: 10-01-2021 to 30-05-2021 and 01-07-2021 to 12-08-2021. In the obtained results, people’s moods and attitudes to the Government’s policies and actions were studied with such social network indicators as the level of topic discussion activity in society, the level of interest in the topic in society, and the mood level of society. These indicators, calculated by the OMSystem, allowed careful identification of alarming factors among the public (negative attitude to the government regulations, vaccination policies, trust in vaccination, etc.) and assessment of the social mood.

Introduction

The rapid development of the Internet, social networks, online services, and other web resources has sparked great interest in the use of information from social networks and has driven high online activity among users. Research on social media platforms has shown a significant increase in the number of users over the last decade [1]. Older social media platforms like Facebook, YouTube, Reddit, Twitter, etc., retain their popularity and continue to gain users. Meanwhile, newer platforms, such as Instagram, Tumblr, TikTok, Pinterest, and others, are strengthening their positions in the media space every year [2]. These platforms have been developing not only in the entertainment direction but also in other spheres of life, as new events occur almost daily and their relevance is constantly changing.

In many cases, social networks are used to solve a wide range of business tasks: managing and promoting brands [3], advertising goods and services, creating distribution channels for goods, etc. In addition to business tasks [4], there is a great need for monitoring social networks [5] and content analysis in other areas. Critical topics in politics [6], economics [7], healthcare, medicine, culture, and other areas are gaining great popularity in the media space [8]. Public opinion on various social and political topics can be gathered from discussions on social networks. In this regard, the technologies of “monitoring social networks” (social listening) and content analysis are becoming widespread. The number of analytics platforms has significantly increased in the last few years. Lists of the most popular platforms, with descriptions of their features and characteristics, can be easily found online. Sproutsocial, Hubspot, Buzzsumo, Hootsuite, Brandmention, IQBuzz, and Snaplytics are good examples of such analytics applications. The features and characteristics of these platforms are described in the “Analytics platforms” section of this research. Despite their large number, these platforms resemble each other in that they focus heavily on business purposes, leaving significant social, economic, and political problems uncovered. Moreover, none of them is open access, and all require a regular paid subscription for full service. The majority of papers published in reputable journals are devoted to sentiment analysis (SA) of user comments from the Twitter social network [9,10,11]. Many papers also cover the presidential elections in the USA [9, 10] and other countries [11, 12]. At the same time, works studying and describing complex social analytics platforms, such as [13], remain underrepresented.

Moreover, most of them focus on resource-rich languages such as English, German, French, Italian, Spanish, and Portuguese, whereas texts and comments in low-resource languages such as Russian and Kazakh are underrepresented. The web crawlers of the platforms are also not configured to extract texts from the social media space of Kazakhstan. This problem is significant for Kazakhstan, where social media content is mostly written in Russian and Kazakh. In addition, it is essential to receive information about current topics in the country from the most popular news portals and discussion platforms on social networks. Even though the news portals tend to publish their content in both languages, manual analysis of the parsed texts showed that user comments in Russian prevail over comments in Kazakh, which makes obtaining data even more valuable for understanding the sentiments of the Kazakh-speaking population of the country.

Thereby, a new opinion monitoring information system, the OMSystem, which pays much attention to the political, economic, healthcare, education, culture, ecology, and civil society topics, has been developed. This multifunctional platform monitors the media space of Kazakhstan and supports the Kazakh and Russian languages, which allows analyzing the media space efficiently. The OMSystem supports Kazakhstan’s leading news portals and important popular social networks like Facebook, VKontakte, Instagram, Twitter, and YouTube. The core part of the system is the evaluation of the public’s mood and “social well-being” with the use of the SA tool and the social mood indicators such as the level of topic discussion activity in society, the level of interest in the topic in society, and the level of social mood. The SA tool determines the sentiment [14] of the public mood, the range of interests, and information dissemination. It also identifies current problematic issues in society and tracks the dynamics of user involvement in a certain topic. This tool uses the SA methods generally presented by three main approaches: lexicon-based, machine learning-based, and deep learning-based.

This paper describes the architecture, main modules, and functionalities of the OMSystem, focusing on the SA tools and the module for defining the social mood of society. The use of sentiment dictionaries as a lexicon-based approach and of machine learning (ML) algorithms in the OMSystem is also explained. The first part of the experimental section presents the steps to train ML models and select the most efficient ones for use in the OMSystem. The second part evaluates public opinion on the topic of vaccination against coronavirus infection with the following social mood measures: the level of topic discussion activity in society, the level of interest in the topic in society, and the level of social mood. Many scientific articles review topics related to the Covid-19 pandemic, and research in this field is in particular demand today. Nevertheless, most of these papers analyzed labeled sentiment texts, posts, and tweets from social media platforms to evaluate the ML metrics of the trained models. They did not aggregate the texts and apply other social measures to capture people’s general attitudes towards the different aspects of this critical topic [15, 16]. Thara and Poornachandran [17] focus on building SA models with ML algorithms and estimating social mood with the abovementioned measures. The developed ML models have been evaluated by accuracy, precision, recall, and F1-score measures to find the most effective algorithms to be used in the OMSystem. The social mood part has also provided interesting findings about the public’s attitude to the vaccination campaign, vaccination policies, and the Government’s activities and methods of combating the pandemic. The reasons for people’s negative moods on this topic have also been extensively analyzed.

The rest of the paper is organized in the following way: “Related works” section provides an overview of the related works to this paper. “Analytics platforms” section describes the features of popular social analytics platforms for brand monitoring, highlighting the essential missing tools implemented in the OMSystem. “The OMSystem information system design methodology” and “The linguistic module” sections describe the structure, functionalities, and module for SA and social mood evaluation. “Machine learning methods” section describes and discusses the experiments on the development of ML algorithms and the public’s attitude towards the vaccination against coronavirus infection. Finally, in “Data collection and data processing” section, we summarize all the previously described sections, analyze the obtained results, and outline directions for future research.

Related works

In recent years, the active development of web technologies has made it possible to analyze users’ moods on various topics. At the same time, marketing companies interested in learning users’ opinions and developing strategies for increasing the flow of customers and profits play a significant role in data analytics. The manual search and filtering of users’ views on websites remain challenging because of their vast number. Therefore, special tools have been developed to automatically track, summarize, and visualize information from social content. In [18], SA of a popular smartphone brand was presented. Data was collected from Twitter using a web crawler that searches through particular hashtags. Benedetto and Tedeschi [19] demonstrate an open framework for monitoring, analyzing, and receiving media content. This framework allows collecting, indexing, and retrieving data using the Representational state transfer application programming interface (REST API) from the following sources: Twitter, Facebook, YouTube, Google+, and Flickr. Schinas et al. [20] present an analysis of the statements of many political leaders, diplomats, journalists, and other media figures on the Twitter platform, the most active social network covering these issues. Radicioni et al. [21] show an architecture that combines SA and community discovery to understand trends, approaches, business, and policy views on topics such as shopping, politics, Covid-19, and electric vehicles. At the same time, many works are devoted to describing analytics platforms, social networks, and text processing for SA. Bhatnagar and Choubey [22] describe the steps of preprocessing, vectorization, and classification of textual data using ML algorithms. Nandwani and Verma [23] pay great attention to studying the critical approaches of the most efficient ML algorithms for SA. That work showed that the Support vector machine (SVM) and naïve Bayes (NB) classifier are more effective than other algorithms. The classification of Twitter posts is also performed in [24], where the primary role is assigned to the K-nearest neighbors (k-NN) and SVM. Huq et al. [25] provide a detailed SA of user opinions from the Twitter and Facebook social networks using convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory (LSTM) neural networks, and hybrid approaches. In [26], comments on controversial political discussions in German on YouTube were analyzed. SA was performed with various word embeddings, ML algorithms, and RNN. The classification efficiency was then assessed using the following metrics: Precision, Recall, and F1-score. A new and more advanced approach to text classification using one CNN and two LSTM layers was described in [27].

All these works were mainly devoted to the analysis of texts in the English language. However, most texts and user comments are written in the Russian and Kazakh languages in the Kazakh media space. Thus, it became necessary to analyze the works dealing with these languages specifically. The sentiment classification of Russian tweets using logistic regression (LR), XGBoost, and CNN was carried out in [28, 29]. Unfortunately, the works devoted to the SA of Kazakh texts are greatly underrepresented. The Kazakh language is an agglutinative language with complex morphological and syntactic structures [30]. The sentiment classification tasks require the preprocessing stage, where stemmers or lemmatizers are applied to words to extract their stems or indefinite forms. The existing language packages of NLP tools do not contain the stemmers and lemmatizers for the Kazakh language as for other widely represented languages, especially European. Tukeyev et al., Yergesh et al. and Bekmanova et al. [31,32,33] implemented only a dictionary approach formalizing rules for defining the sentiment of phrases in texts. ML and NN approaches had a limited reflection in these works. In addition, they neither described any open-source analytics platforms nor provided functionalities for evaluating society’s SA and social mood in the Russian and Kazakh media spaces. Thus, various foreign and Kazakh analytics platforms were thoroughly investigated in the next section.

Analytics platforms

The widespread development of Internet technologies, social networks [34], and data analytics has led to numerous tools and analytics platforms for promoting the brand, monitoring public opinions, and assessing social well-being as one of the main tools for determining the socio-economic system in the context of sustainable advancement.

Currently, the foreign market offers many tools for monitoring social networks [35], content analysis, and brand promotion. Among the platforms that marketers regard as the most popular and advanced, Sproutsocial, Hubspot, Buzzsumo, Hootsuite, Brandmention, IQBuzz, and Snaplytics take an essential place.

Sproutsocial [36] is a multifunctional analytics tool that allows comparing results in several networks efficiently. This tool monitors and gathers all messages from Facebook, Twitter, Instagram, and other social networks in one unified place. It also benchmarks customer satisfaction by gaining analytics data through an automated Twitter DM survey. Sproutsocial is powered by ML algorithms that allow suggesting replies to users’ frequently asked questions. Generally, Sproutsocial is very useful when it is required to count links on Twitter [37], measure the growth of Instagram followers, evaluate participation on LinkedIn, and much more. This tool then provides an opportunity to evaluate results using understandable visualized reports. Sproutsocial includes marketing, social media management, and analytics of various leading brands and agencies, including Chipotle, Subaru, Zendesk, etc.

Hubspot [38] is a tool that allows marketers to obtain comparative information about the level of engagement on social networks and reflect on past efforts made to support high customer interest in their products. HubSpot provides a detailed overview of how social media affects profit margins and enables quick and efficient reporting on collected data. At the same time, it makes it possible to compare different platforms, track and view brands on social networks, and understand how the target audience watches business content. This tool has many features, such as website activity tracking, task management, insight, KPI dashboard, sales automation, etc. Website tracking records how users interact with websites: visited pages, time spent on each page, the location of the visitor, and so on. This HubSpot feature allows businesses to track how a lead interacts with their website. The task management tool creates to-do lists and sets tasks’ priorities, statuses, and deadlines. The insight feature automatically fills in information about a company that has been added to the application, including the size of the company, its description, contact information, etc. The KPI dashboard sets the company’s sales goals and tracks the performance of marketing plans. The sales automation feature automates various stages of sales and deals. Another essential feature of the HubSpot analytics tool is the ability to analyze indicators specific to social networks and the entire path of the client. This tool also provides information about the marketing tactics that are most effective for businesses and their impact on social media campaigns, and it includes dozens of other features for business.

BuzzSumo [39] is an excellent resource for analyzing the social interaction of any particular content. The tool allows searching for information based on requests on the Internet, taking into account various factors, including likes and reposts. The advanced search engine of BuzzSumo finds the most relevant content by topic, author, and domain. The service suggests which content directions resonate with the initially selected audience. To help choose the most accurate direction for content creation, it gathers valuable information about responses on social networks. In addition, this tool allows collecting statistics on the number of reposts of a certain blog message on such social networks as Facebook [40], Twitter, and Pinterest. The main features of this platform are Content discovery (browsing topics, trends, and forums), Content research (crawling websites to get the most up-to-date content), Monitoring (finding competitors, brand mentions, and updates, and alerting about the most important events), and APIs (connecting, integrating, and developing with different sources of data). An essential feature of the tool is the ability to track the effectiveness of competitors as part of a content marketing campaign. BuzzSumo also easily determines competitors’ activity in social networks and identifies key people in a particular area. Such an analysis can help to see which posts receive the most engagement and to use this data to adjust the content strategy.

Hootsuite [41] is one of the most popular multifunctional services for working on social networks. The emphasis in this service is on working with Twitter, and, first of all, Hootsuite will be useful for those who maintain several accounts at once. Hootsuite also works successfully with Facebook, LinkedIn, MySpace, and Foursquare accounts and blogs on WordPress. HootSuite offers a wide range of analytical capabilities, such as connecting Google Analytics on the site and viewing graphs for comparing the number of tweets and the popularity of links. The key features of this platform are Post Scheduling, Streams, Analytics, and Assignments. Post Scheduling allows setting the dates and times to create a new post. Streams monitor active social media channels online. Analytics provide opportunities to see the performance of posts, their sentiments, page content clicks, total clicks on posts, and much more. Finally, assignments provide an ability to assign items to different team members. Hootsuite additionally allows you to post on all social networks on a specific schedule. The tool also allows you to track recent social trends and brand mentions.

Brandmention [42] is one of the most powerful platforms for free search and analysis of social networks. The system also offers SA, related keywords, popular sources, etc. Brandmention searches over 100 social networks [43], including social bookmarks, blogs, forums, social services, and more. In addition, data can be exported or configured for e-mail. Brandmention allows configuring the keywords for social monitoring and finding the social handles of the company and its competitors. Some keywords can also be excluded from the search result.

IQBuzz [44] is a professional tool for analyzing and managing reputation on the Internet and a social network monitoring service [45]. IQBuzz tracks many sources and platforms such as Twitter, Yandex, LiveInternet, LiveJournal, various blogs, video hosting services such as RuTube and YouTube, various news, entertainment, and specialized services, and thematic and regional portals. One of the key advantages of the service is the ability to connect new sources and Internet resources for monitoring.

Snaplytics [46] is a cloud-based platform that analyzes Snapchat and Instagram stories. Today, millions of active Snapchat and Instagram users present stories as an excellent method of promotion on Instagram. This application also allows you to see peaks and slumps of views. The most important features of Snaplytics are automatic publishing, post scheduling, monitoring, and analytics. Platform users can track comments and replies, post stories from various sources, and view rates. Snaplytics also allows generating reports and exporting them to CSV files and other formats.

In Kazakhstan, social analytics is significantly underrepresented. Only a few works devoted to SA of the Kazakh language could be found in the Scopus database. Their research is mostly restricted to SA with the use of dictionary and ML approaches [30, 32]. Generally, there are only a few brand and social analytics platforms. Among the most advanced applications are the iMAS [47] and the Alem Media Monitoring [48], which work with the Russian and Kazakh languages. The iMAS platform provides SA on specified topics for a given period. The Alem Media Monitoring is software designed to analyze public opinion in the Internet space. This system allows collecting information on certain topics from news portals and social networks [49], determining the sentiment of texts using ML algorithms, visualizing all the performed analyses, and compiling and uploading reports. Unfortunately, these platforms are not open-source. The information provided on their official websites shows only that they classify texts and comments into three sentiment classes (positive, negative, and neutral), report the sources (news portals, social networks, and blogs) and periods of monitoring, visualize the results with different graphics, and produce reports in Word, Excel, and PDF formats. Nevertheless, there is no description of how these systems estimate the public’s social mood. Moreover, research papers devoted to the iMAS and the Alem Media Monitoring platforms have not been found online. The proposed OMSystem was first described in [50]. It is designed to provide complex social analytics, including the web crawler, SA with sentiment dictionaries and ML algorithms, and evaluation of the “social well-being.” The following sections demonstrate the structure, functionalities, and module for evaluating the social mood of society.

The OMSystem information system design methodology

The OMSystem is the first automatic tool for analyzing the opinions of Kazakhstani users expressed through news portals, blogs, and social networks. It was developed to provide a complex analysis of the public’s social mood and to cover the parts skipped by other analytics platforms in Kazakhstan. The OMSystem monitors web resources and social networks, with subsystems for modeling “social well-being” [51], and supports sentiment dictionaries of the Russian and Kazakh languages and ML algorithms for determining the sentiment of texts and user comments. The OMSystem supports Kazakhstan’s leading news portals and popular social networks like Facebook, VKontakte, Instagram, Twitter, and YouTube. The platform’s main task is the operational monitoring of the information space and social networks on the most important topics in society. This monitoring makes it possible to determine the scale of a problem and the public opinion on it, to interpret them quickly, to analyze the dynamics of commercial brands, events, and references to activities, and, in turn, to assess the degree of “social well-being.”

This system allows working with texts in the Kazakh and Russian languages. It also has built-in modules for connecting to the application programming interfaces (APIs) of social networks: Vkontakte [52], Facebook [53, 54], Twitter [22, 55], Instagram [56, 57], YouTube [58], Telegram [59], and Odnoklassniki [60]. The OMSystem automatically determines the language of the text (Russian, Kazakh) and the sentiment of the topic, as negative, positive, or neutral, using a sentiment dictionary and ML algorithms. Furthermore, there is a possibility to record the time range in the system when monitoring social networks (for a year, for 6 months, for 3 months, for a month, for a week, for a day, etc.). The OMSystem also allows building visual reports on the monitoring results in various graphs and charts (pie, histogram, chart, graph, and others). At the same time, the platform provides ways to identify the profile of a social network participant by reading profile data and counting the activity of a participant in a topic by the number of comments, likes, and reposts.
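The automatic language detection mentioned above can be illustrated with a simple character-based heuristic. The sketch below is not the OMSystem's actual method, which the paper does not describe; it only assumes that Kazakh Cyrillic text can be recognized by letters that are absent from the Russian alphabet. The function name `detect_language`, the language codes, and the `threshold` value are hypothetical.

```python
# Letters used in Kazakh Cyrillic that do not occur in the Russian alphabet.
KAZAKH_ONLY = set("әғқңөұүһі")


def detect_language(text: str, threshold: float = 0.01) -> str:
    """Return 'kk' or 'ru' for Cyrillic text and 'unknown' otherwise.

    Assumption: a text is treated as Kazakh when the share of
    Kazakh-specific letters among its Cyrillic letters reaches `threshold`.
    """
    cyrillic = [ch for ch in text.lower() if "\u0400" <= ch <= "\u04ff"]
    if not cyrillic:
        return "unknown"
    kazakh_share = sum(ch in KAZAKH_ONLY for ch in cyrillic) / len(cyrillic)
    return "kk" if kazakh_share >= threshold else "ru"


print(detect_language("Вакцинация началась в феврале"))
print(detect_language("Қазақстанда вакцинация басталды"))
```

A production system would more likely rely on a trained language identifier, since short Kazakh comments may contain none of the distinguishing letters.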

The development of the OMSystem included the most important stages to achieve all the required goals. First, a module for using API to connect to social networks and a storage system for keeping the parsed data and processed analytical results were created. Then the sentiment dictionaries in the Russian and Kazakh languages were designed to evaluate the sentiment on the analyzed topics. The SA module was further extended with ML modules trained on the texts, labeled by human annotators and sentiment dictionaries. As an analytical application, the convenient quantitative and qualitative graphical visualization of the monitoring results was a significant step in the system’s design. The advanced role policy was the next important step. Finally, the system’s interface and design were improved to match the modern trends and requirements of the development of web applications.

The OMSystem was developed on the Django framework, which uses the Python programming language. Django has integrated authorization and authentication modules and libraries for web forms with input data validation. The administrative and parsed textual data are kept in a PostgreSQL relational database that is easily connected to the Django application. The SA modules with sentiment dictionaries and ML algorithms are shown in detail in the following sections. The OMSystem has several roles: “Superuser,” “Administrator,” “User,” and “Expert.” The “Superuser” has the right to log in to the System, navigate through the site, set up research and analysis reports, set up a rule profile for the search topic, change settings for uploading data from the System, invite experts, and view, edit, and delete personal data. The “Administrator” has the right to log in to the System, view and edit system settings, assign roles to other users, change settings for connecting subsystems and modules, get technical reports (the number of results, the volume of data, search time, etc.), and configure settings for uploading reports. The “User” has the right to log in to the System, navigate through the system, set up new topics and parameters for monitoring, and view the monitoring reports. The “Expert” has the right to log in to the System, view the analysis page and details, switch to the sources of results, and view the system’s functionality. JavaScript libraries and CSS styles were utilized to improve the interface of the application and the graphical analytical reports.
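The role policy described above can be summarized as a mapping from roles to permitted actions. The following sketch uses plain Python rather than Django's permission framework, and the action names (for example `assign_roles`) are hypothetical labels invented here to paraphrase the rights listed in the text, not identifiers from the OMSystem.

```python
# Hypothetical action names that paraphrase the rights listed for each role.
ROLE_PERMISSIONS = {
    "Superuser": {
        "log_in", "navigate", "configure_reports", "configure_topic_rules",
        "configure_data_upload", "invite_experts", "manage_personal_data",
    },
    "Administrator": {
        "log_in", "edit_system_settings", "assign_roles",
        "configure_modules", "view_technical_reports",
        "configure_report_upload",
    },
    "User": {"log_in", "navigate", "create_topics", "view_reports"},
    "Expert": {
        "log_in", "view_analysis", "open_result_sources",
        "view_functionality",
    },
}


def is_allowed(role: str, action: str) -> bool:
    """Check whether a role may perform an action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("User", "create_topics"))
print(is_allowed("Expert", "assign_roles"))
```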

The OMSystem’s interface and architecture are schematically shown in Figs. 1 and 2.

The English language is yet to be added to the interface of the platform. Its architecture was also described in [50], where experiments characterized the building of ML models for the OMSystem. The designed system’s functionality is implemented in the components:

Data sources: They are represented by news portals, blogs, and social networks.
Connector module: It is used for the connection to data sources.
The linguistic constructor module: It is used for creating sentiment dictionaries that include words belonging to any of the three classes: positive, negative, and neutral.
Data analysis and processing module: It uses sentiment dictionaries and ML algorithms for SA. In addition, this module creates social analytics defining social mood.
Results module: It contains a formed relational database of texts and comments, analytical reports, graphics, and tables.

The SA tool, labeling texts and user comments in three sentiment classes (positive, neutral, and negative), is the core part of the OMSystem. The sentiment classes are assigned with the use of the hybrid approach: the lexicon-based (sentiment dictionaries) and the ML-based. The lexicon-based approach assigns a label by the largest number of words of one of three sentiment classes. The ML-based approach uses the trained ML models with the highest effectiveness in terms of accuracy, precision, recall, and F1-score, such as NB, LR, SVM, k-NN, Decision tree (DT), Random Forest (RF), and XGBoost.
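For reference, the four metrics used for model selection can be computed for a three-class problem as in the sketch below. Macro averaging over the classes is an assumption made here, since the text does not state how per-class scores are combined, and the function name `evaluate` is hypothetical.

```python
from collections import Counter

LABELS = ("positive", "neutral", "negative")


def evaluate(y_true, y_pred, labels=LABELS):
    """Return accuracy and macro-averaged precision, recall, and F1-score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def ratio(num, den):
        # Convention assumed here: an undefined ratio counts as 0.
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = ratio(tp[label], tp[label] + fp[label])
        r = ratio(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(ratio(2 * p * r, p + r))

    n = len(labels)
    return {
        "accuracy": ratio(sum(tp.values()), len(y_true)),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


truth = ["positive", "negative", "neutral", "negative"]
predicted = ["positive", "negative", "negative", "negative"]
print(evaluate(truth, predicted))
```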

The linguistic module

A sentiment dictionary is generally represented as a list of words, each of which is assigned a “weight” that describes its emotional coloring. Sentiment dictionaries include hundreds or thousands of such words and are used to determine the sentiment of sentences, paragraphs, or whole texts based on the average weight of the sentiment words they contain. The sentiment dictionaries in the OMSystem are aimed at analyzing social, political, and economic content, so they need to include the corresponding words for such texts.

In the OMSystem, the sentiment dictionaries were developed in the following steps:

1. Forming a sentiment vocabulary labeled on the basis of feelings and emotions. The sentiment dictionary consists of words, phrases, misspelled words, and slang forms, each of which has its own emotional coloring.
2. Creating misspelled variants of Russian and Kazakh words, so that more matches are found when searching real texts. The misspelled variants are formed by replacing, inserting, and deleting characters.
3. Filling the dictionary. The dictionary is based on a sentiment dictionary of English words from open sources, categorized by their sentiment (https://public.tableau.com/views/NRC-Emotion-Lexicon-viz1/NRCEmotionLexicon-viz1?:embed=y&:toolbar=yes&:loadOrderID=0&:display_count=yes&:showTabs=y&:tabs=no&:showVizHome=no). It is stated that this dictionary is suitable for any language, so the words from this dictionary were translated into Russian and Kazakh.
4. Involving expert linguists in labeling the sentiment of words from newly parsed news topics and social media comments, which increases the size of the sentiment dictionaries and fills them with new important words.
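Step 2 can be illustrated with a minimal sketch that generates all variants of a word that are one character edit away (one deletion, replacement, or insertion). The function name `misspelled_variants` and the idea of passing the alphabet as a parameter are assumptions made for this illustration; the source does not specify how the OMSystem generates its misspelled forms.

```python
def misspelled_variants(word, alphabet):
    """Return all strings one edit away from `word` (delete, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Delete one character.
    deletes = {left + right[1:] for left, right in splits if right}
    # Replace one character with any letter of the alphabet.
    replaces = {left + ch + right[1:]
                for left, right in splits if right
                for ch in alphabet}
    # Insert one letter of the alphabet at any position.
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    variants = deletes | replaces | inserts
    variants.discard(word)  # replacing a letter with itself gives the word back
    return variants


# Hypothetical usage with a tiny alphabet; a real run would pass the full
# Russian or Kazakh alphabet.
variants = misspelled_variants("ab", "abc")
```

In practice the generated set grows quickly with word length and alphabet size, so it would probably need filtering before being added to a dictionary.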

Currently, the Russian sentiment dictionary includes 44,381 words, and the Kazakh sentiment dictionary includes 29,654 words.

The linguistic module determines the sentiment of texts using the sentiment dictionaries described above. It uses a function that assigns the sentiment class with the largest count among the positive, negative, and neutral words found in the text. The effectiveness of this approach greatly depends on the quality of the sentiment dictionary [61]. Although the approach is very effective, creating a high-quality sentiment dictionary requires much effort. After an initial sentiment dictionary is created manually, it is expanded with synonyms and antonyms from larger dictionaries that exist for many languages.

In the OMSystem, large sentiment dictionaries for the Russian and Kazakh languages are developed. The following formula finds the sentiment of the text:

$$S_{t} = \left\langle {Max(w_{pos} ,w_{neut} ,w_{neg} ),D} \right\rangle ,$$

(1)

where $S_{t}$ is a sentiment of the text; $w_{pos}$ is the number of positive words; $w_{neut}$ is the number of neutral words; $w_{neg}$ is the number of negative words; $D$ is a sentiment dictionary.
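Formula (1) can be read as a simple counting rule. The sketch below is one possible reading, assuming the text is already tokenized and the dictionary maps each word to one of the three class names. The function name, the dictionary layout, and the handling of ties and of texts without dictionary words (both resolved to "neutral") are assumptions that the source does not specify.

```python
from collections import Counter


def lexicon_sentiment(tokens, dictionary):
    """Label a tokenized text by its most frequent sentiment class."""
    # Count w_pos, w_neut, w_neg over the tokens found in the dictionary D.
    counts = Counter(dictionary[t] for t in tokens if t in dictionary)
    if not counts:
        return "neutral"  # assumption: no dictionary words means neutral
    best = max(counts.values())
    winners = [label for label, n in counts.items() if n == best]
    # Assumption: a tie between classes is resolved as neutral.
    return winners[0] if len(winners) == 1 else "neutral"


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"good": "positive", "great": "positive",
                  "bad": "negative", "table": "neutral"}
label = lexicon_sentiment(["good", "great", "bad", "table"], toy_dictionary)
```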

The sentiment dictionaries of both languages used in the OMSystem are presented in Figs. 3 and 4.

Machine learning methods

In addition to sentiment dictionaries, ML algorithms are also used in the OMSystem to label the text data. The following algorithms are implemented in the system: NB, LR, SVM, k-NN, DT, RF, and XGBoost. Sentiment assignment with an ML model is expressed by the formula:

$$S_{t} = \left\langle {M,T} \right\rangle ,$$

(2)

where $S_{t}$ is the sentiment of a text; $M$ is an ML model; $T$ is a text document.

An NB classifier [62] is one of the simplest and most commonly used ML algorithms for text classification. It uses a probabilistic approach based on Bayes' theorem with a strong assumption of independence between features. Each feature contributes to the class probability regardless of the presence or absence of any other feature. In text classification, NB is trained on the documents of each class, and the conditional probability that document $d$ belongs to class $c$ is computed as:

$$P(c|d) = \frac{P(c) \times P(d|c)}{{P(d)}},$$

(3)

where $d = \{ x_{1} ,x_{2} , \ldots ,x_{n} \}$, $x_{i}$ is a weight of the $i{\text{th}}$ word in a document $d$, and $c$ is a class of the document.
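As an illustration of Eq. (3), the following sketch implements a small multinomial NB classifier in plain Python. It is not the OMSystem implementation. Add-one (Laplace) smoothing and the use of log-probabilities are assumptions introduced here to keep the arithmetic stable, and $P(d)$ is omitted because it is the same for every class and does not affect which class wins.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate log P(c) and log P(x|c) from tokenized documents."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Add-one smoothing: every vocabulary word gets one extra count.
        total = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Return the class maximizing log P(c) + sum of log P(x_i|c)."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy training set.
docs = [["good", "service"], ["great", "good"],
        ["bad", "service"], ["terrible", "bad"]]
labels = ["positive", "positive", "negative", "negative"]
prior, likelihood = train_nb(docs, labels)
```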

SVM [63] is another popular ML algorithm. This algorithm works with a feature space separated by hyperplanes. A good separation is achieved by the hyperplane that has the greatest distance to the nearest training points of the two classes (the so-called margin), since the larger the margin, the lower the classifier error. The formula of SVM is given below:

$$y_{i} (\overrightarrow {w} \cdot \overrightarrow {x} + b) \ge 0,$$

(4)

where $\overrightarrow {x} = (x_{1} , x_{2} , \ldots , x_{n} )$ is a feature vector; $\overrightarrow {w} = (w_{1} ,w_{2} , \ldots ,w_{n} )$ is a weight vector; $y_{i}$ are the output values (class labels); $b$ is a bias.

If the value of the decision function is greater than or equal to zero, the sample is assigned to the positive class. Otherwise, it is assigned to the negative class.

A splitting hyperplane of SVM mainly works with two-class classifiers. However, it can easily be adapted to multiclass classification, using a set of “One-vs-All” classifiers. A hyperplane of SVM is shown in Fig. 5.
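The decision rule and the “One-vs-All” extension can be sketched as follows. The sketch covers prediction only: finding the maximum-margin weights (training) is omitted, and the weight vectors and biases in the usage example are hypothetical values chosen by hand, not trained parameters.

```python
def decision_value(w, x, b):
    """Compute the linear decision function w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def binary_predict(w, x, b):
    """Return +1 if the decision value is >= 0, otherwise -1."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def one_vs_all_predict(models, x):
    """Pick the class whose binary classifier gives the largest value.

    `models` maps a class name to a (weights, bias) pair.
    """
    return max(
        models,
        key=lambda label: decision_value(models[label][0], x, models[label][1]),
    )


# Hypothetical hand-picked classifiers for three sentiment classes.
models = {
    "positive": ([1.0, -1.0], 0.0),
    "neutral": ([0.0, 0.0], 0.5),
    "negative": ([-1.0, 1.0], 0.0),
}
predicted = one_vs_all_predict(models, [2.0, 0.0])
```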

An LR classifier [64] predicts the probability of a dependent variable (the class of a document) as a value in the interval [0, 1] using a logistic function:

$$p(x) = \frac{1}{{1 + e^{ - f(x)} }},$$

(5)

where $f(x) = w_{0} + w_{1} x_{1} + \cdots + w_{n} x_{n}$ is a linear classification function; $\overrightarrow {x} = (x_{1} ,x_{2} , \ldots ,x_{n} )$ is a feature vector; $\overrightarrow {w} = (w_{1} ,w_{2} , \ldots ,w_{n} )$ is a weight vector, and $w_{0}$ is a bias term. The logistic function $p(x)$ is a sigmoid whose values lie between 0 and 1 and are interpreted as probabilities. Document $d$ is assigned to class 1 if the value of $p(x)$ approaches 0. Otherwise, it is assigned to class 2. In the case of multiclass classification, the “One-vs-All” and “One-vs-One” approaches are used to identify a specific class. A logistic function is shown in Fig. 6.
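Eq. (5) can be evaluated directly, as in the sketch below. The weights in the usage example are hypothetical, and the 0.5 threshold in `assign_class` is an assumption, since the text only says that values near 0 indicate class 1.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probability(weights, bias, x):
    """Evaluate p(x) with f(x) = w_0 + w_1 x_1 + ... + w_n x_n."""
    f = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(f)


def assign_class(p, threshold=0.5):
    """Follow the text's numbering: class 1 when p(x) is near 0, else class 2."""
    return 1 if p < threshold else 2


# Hypothetical weights: f(x) = 0.5 + 1.0 * 2.0 - 2.0 * 1.0 = 0.5
p = probability([1.0, -2.0], 0.5, [2.0, 1.0])
```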

A k-NN algorithm [65] is one of the simplest data classification algorithms. It calculates distances between vectors and assigns each point to the class that is most common among its $k$ nearest neighbors. Documents are usually compared with the most widely used distance measure, the Euclidean distance, which is defined as:

$$d(x,y) = \sqrt {\sum\nolimits_{i = 1}^{N} {(a_{ix} - a_{iy} )^{2} } } ,$$

(6)

where $d(x,y)$ is the distance between two documents; $a_{ix}$ and $a_{iy}$ are the weights of the $i{\text{th}}$ term in documents $x$ and $y$, respectively; $N$ is the number of unique words in the set of documents. During the training stage, this algorithm simply memorizes all feature vectors and their corresponding class labels. When working with real data whose class labels are unknown, the distance between the new observation vector and each of the previously stored ones is calculated. Then the $k$ nearest vectors are selected, and the new object is assigned to the class to which most of them belong.
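The procedure described above, together with Eq. (6), fits in a few lines. In this sketch the training vectors and labels are hypothetical two-dimensional toy data, and the default $k = 3$ is an arbitrary choice.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equally long term-weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Assign `query` to the majority class among its k nearest neighbors."""
    # "Training" is just storing the vectors; all work happens at query time.
    neighbors = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data.
vectors = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]]
classes = ["negative", "negative", "negative",
           "positive", "positive", "positive"]
result = knn_predict(vectors, classes, [0.2, 0.1], k=3)
```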

DT [66] is a supervised learning method that uses a set of rules to make decisions the same way a person makes decisions. This method divides a data set by features and answers specific questions until all data points belong to a particular class. Thus, a tree structure is formed by adding a node for each question. The first node is the root node. At the first classification step, a word is selected, and all documents containing it are placed on one side, and documents that do not contain it are put on the other side. As a result, two sets of data are obtained. Then a new word is selected in these sets, and all previous steps are repeated. The same procedure continues until the entire dataset is partitioned and assigned to leaf nodes. If all data points in a leaf node uniquely correspond to the same class, then the class of the node is well-defined. In the case of mixed nodes, the algorithm assigns the given node the class with the largest number of related data points. DT is shown in Fig. 7.
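The word-by-word splitting described above can be sketched as a small recursive procedure. The source does not say how the splitting word is selected, so choosing the word with the lowest weighted Gini impurity is an assumption made here; the function names and the toy data are likewise hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(docs, labels, words):
    """Build a tree from documents given as sets of words.

    Returns a class label (leaf) or a dict with keys "word", "yes", "no".
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not words:
        return majority  # pure node, or no words left: majority class

    def split_impurity(word):
        with_w = [l for d, l in zip(docs, labels) if word in d]
        without = [l for d, l in zip(docs, labels) if word not in d]
        return (len(with_w) * gini(with_w)
                + len(without) * gini(without)) / len(labels)

    best = min(sorted(words), key=split_impurity)
    yes = [(d, l) for d, l in zip(docs, labels) if best in d]
    no = [(d, l) for d, l in zip(docs, labels) if best not in d]
    if not yes or not no:
        return majority  # the best word does not separate anything
    rest = words - {best}
    return {
        "word": best,
        "yes": build_tree([d for d, _ in yes], [l for _, l in yes], rest),
        "no": build_tree([d for d, _ in no], [l for _, l in no], rest),
    }


def tree_predict(tree, doc):
    """Follow the questions from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["word"] in doc else tree["no"]
    return tree


# Hypothetical toy data.
docs = [{"good", "film"}, {"good", "day"}, {"bad", "film"}, {"bad", "day"}]
labels = ["positive", "positive", "negative", "negative"]
tree = build_tree(docs, labels, {"good", "bad", "film", "day"})
```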

RF [67] is another popular ML algorithm based on the concept of ensemble learning. This concept involves combining multiple classifiers to improve model performance. The algorithm builds not a single DT but many of them. In classification problems, each document is classified by all trees independently. At the output, the class of the document is determined by the largest number of votes among all trees. RF is shown in Fig. 8.
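The voting step can be sketched as below. Bootstrap sampling of the training set for each tree is part of the standard RF procedure rather than something stated in the text, and the three "trees" in the usage example are hypothetical hand-written rules standing in for trained DTs.

```python
import random
from collections import Counter


def bootstrap_sample(docs, labels, rng):
    """Draw len(docs) training examples with replacement for one tree."""
    indices = [rng.randrange(len(docs)) for _ in docs]
    return [docs[i] for i in indices], [labels[i] for i in indices]


def forest_predict(trees, doc):
    """Classify `doc` with every tree and return the most voted class."""
    votes = Counter(tree(doc) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps a set of words to a class.
trees = [
    lambda d: "positive" if "good" in d else "negative",
    lambda d: "negative" if "bad" in d else "positive",
    lambda d: "neutral",
]
vote = forest_predict(trees, {"good", "film"})
sample_docs, sample_labels = bootstrap_sample(
    ["a", "b", "c"], [0, 1, 0], random.Random(0)
)
```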

XGBoost [68] is considered one of the most advanced ML algorithms and is based on the principle of boosting. Like RF, it is an ensemble method built from decision trees. At each iteration, the deviations of the current ensemble's predictions are computed on the training set. Optimization is then performed by adding the predictions of a new tree to the ensemble, which reduces the mean deviation of the model. In addition, XGBoost allows tuning many different hyperparameters to increase the model’s performance.
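The boosting loop can be illustrated with a deliberately simplified sketch: plain gradient boosting for squared error with one-split trees (stumps) on a single feature. This is not XGBoost itself, which adds regularization, second-order gradient information, and many engineering optimizations. The function names, the learning rate, and the number of rounds are assumptions chosen for the example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals of a 1-D feature."""
    best = None
    # The largest value is excluded so the right side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all feature values equal: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Return a predictor built by repeatedly fitting stumps to residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(rounds):
        # Deviations of the current ensemble on the training set.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Hypothetical toy data: the target switches from 0 to 1 between x=2 and x=3.
model = boost([1, 2, 3, 4], [0.0, 0.0, 1.0, 1.0])
```

Each round shrinks the remaining deviation, which is the behavior the paragraph above describes in words.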

Data collection and data processing

The web-crawler of the OMSystem parses texts and user comments from different sources, such as Kazakhstan’s news portals, social networks, and blogs. The parsed texts are aggregated in the designated PostgreSQL database. The scheme of the OMSystem’s functioning is presented in Fig. 9.

After the texts are gathered in the database, it is required to apply the following steps before training ML models:

Text preprocessing
Stemming
Vectorization
Class resampling

These steps are described in the following sub-sections.

Text preprocessing and stemming

At the preprocessing stage, all words are converted to lower case, and extra words, symbols, punctuation marks, and links are removed. It is also necessary to remove the stop words, which are words that do not carry much semantic content. Examples of such words are prepositions, conjunctions, and pronouns, such as “на” (“on”), “в” (“in”), “бәрі” (“all”), “және” (“and”), and “бірақ” (“but”). Another important step is reducing the number of distinct word forms that share the same meaning. The methods used for this are called stemming and lemmatization. In stemming, affixes and endings of words are removed to obtain their stems. In lemmatization, words are reduced to their dictionary forms. Stemming is easier to implement, since it only requires an algorithm for removing parts of words. Lemmatization, on the contrary, requires significant effort to develop rules for reducing words to their dictionary forms. The NLTK Python library includes stemmers for the Russian and English languages. However, it does not yet contain a comparably well-developed stemmer for the Kazakh language. Thus, a new stemmer called “KazakhStemmer” has been developed for obtaining the stems of Kazakh words.
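The preprocessing steps can be sketched with the standard library alone. The stop-word set below contains only the five examples quoted in the text, and `strip_suffix` is a toy suffix stripper with a hypothetical suffix list. It is not the authors' KazakhStemmer, whose rules are not given in this section, and it is not the NLTK stemmer.

```python
import re

# Only the stop words quoted in the text; a real list would be much longer.
STOP_WORDS = {"на", "в", "бәрі", "және", "бірақ"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, drop links and punctuation, and remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation and symbols
    return [token for token in text.split() if token not in stop_words]


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the longest matching suffix, keeping at least `min_stem` letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


tokens = preprocess("Вакцина в Алматы! https://example.com/news")
# Hypothetical suffix list containing the Kazakh plural endings "лар"/"лер".
stem = strip_suffix("қалалар", ["лар", "лер"])
```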

Vectorization

After text preprocessing, the vectorization stage is performed, where the Bag of words (BOW) and Term frequency-inverse document frequency (TF-IDF) [69] techniques are widely used. The BOW model is quite simple, and it is easy to use for feature extraction. The model’s simplicity lies in the fact that it takes into account neither the order nor the structure of the words in a document. The model only considers whether a known word occurs in the document. The dictionary of words comprises all the words found in all documents. For example, consider the following documents and their corresponding vector representations:

I am writing—[1, 1, 1, 0, 0, 0, 0, 0]
I am writing a poem—[1, 1, 1, 1, 1, 0, 0, 0]
I am writing a poem in the library—[1, 1, 1, 1, 1, 1, 1, 1]
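The three vectors above can be reproduced with a minimal BOW sketch. The vocabulary is built in order of first appearance, which is an assumption that happens to match the column order used in the example; the function names are hypothetical.

```python
def build_vocabulary(documents):
    """Collect unique lowercase words in order of first appearance."""
    vocabulary = []
    for document in documents:
        for word in document.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def bow_vector(document, vocabulary):
    """Count how many times each vocabulary word occurs in the document."""
    words = document.lower().split()
    return [words.count(term) for term in vocabulary]


documents = [
    "I am writing",
    "I am writing a poem",
    "I am writing a poem in the library",
]
vocabulary = build_vocabulary(documents)
vectors = [bow_vector(document, vocabulary) for document in documents]
```

Word order is discarded: any permutation of a document's words yields the same vector.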

Vectorization involves counting the occurrences of each vocabulary word in each document, as shown in Table 1.

Table 1 Vectorization with BOW
