Traffic and road conditions monitoring system using extracted information from Twitter

Congested roads and daily traffic jams cause traffic disturbances. A traffic monitoring system using closed-circuit television (CCTV) has been implemented, but the information gathered is still limited for public use. This research focuses on utilizing Twitter data to monitor traffic and road conditions. Traffic-related information is extracted from social media using a text mining approach. The methods include Tweet classification for filtering relevant data, location information extraction, and geocoding to convert text-based locations into coordinates that can be deployed in a Geographic Information System. We test several supervised classification algorithms in this study, i.e., Naïve Bayes, Random Forest, Logistic Regression, and Support Vector Machine. We experiment with Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) as the feature representations. The location information is extracted using Named Entity Recognition (NER) and a Part-of-Speech (POS) tagger. The geocoding is implemented using the ArcPy library. The best model for Tweet relevance classification is the Logistic Regression classifier with the feature combination of unigram and char n-gram, achieving an F1-score of 93%. The NER-based location extractor obtains an F1-score of 54% with a precision of 96%. The geocoding success rate for the extracted location information is 68%. In addition, a web-based visualization is implemented to display traffic information through a spatial interface.

The government has made various attempts to reduce traffic disruption in Jakarta. One of these is the development of the Jakarta Smart City information system. The Jakarta Smart City information system uses closed-circuit television (CCTV) data from various sources, including the Transportation Agency (DisHub), Bali Tower, the Public Works Service (PU), and Transjakarta, among others. There are approximately 6000 CCTVs scattered throughout the Jakarta area. The data is sent and displayed in real time on the Jakarta Smart City system portal. However, the current smart traffic management system (STMS) still has several shortcomings. Because Jakarta Smart City relies on CCTV data, the system is highly dependent on the availability of data sources. Despite the existence of thousands of CCTV cameras, some CCTV data cannot be accessed or displayed. As a result, road conditions cannot be monitored optimally. Additionally, the coverage of CCTV data is still limited, and several public areas have not been captured by STMS. Furthermore, although Jakarta Smart City allows users to access video data from CCTVs, it has not provided thorough analyses regarding traffic situations. Consequently, users may still have difficulty determining what is happening at a specific road location.
Social media platforms are internet-based systems that grew out of Web 2.0 and allow users to exchange information. Social media data can be analyzed using an information extraction approach. This study focuses on the use of Twitter data. There are several reasons why Twitter is used as a source of data in this study. Indonesia ranks 5th in the world for most Twitter users. In January 2022, there were 18.45 million Twitter users in Indonesia. A survey finds that around 6000 tweets are sent per second. The abundance of data on Twitter is a valuable resource for social media analytics. Moreover, Twitter provides an API that enables researchers to collect the data.
The objective of this study was to apply machine learning models to extract information about traffic conditions in Jakarta. The model was built using social media data and can be utilized to monitor traffic situations and road conditions such as accidents and congestion in Jakarta. We hope that the application proposed in this paper can be used as supplementary information for public policy makers to plan and manage traffic in Jakarta.
The study by D'Andrea [3] utilized social media data to obtain information about traffic-related events. This research focused on the Italian region and specifically analyzed traffic jams and traffic accidents. They built an intelligent system using a text-mining approach and machine learning algorithms to detect real-time traffic events by analyzing Twitter activities. The best model in this study used the Support Vector Machine (SVM) algorithm, with accuracy scores of 95.75% and 88.89% in two-class and three-class classification, respectively.
The study by Gu [5] explored the use of social media data to monitor congestion, including recurring and non-recurring congestion. Recurring congestion is congestion that occurs frequently in a particular location, while non-recurring congestion is caused by external factors such as accidents, construction, climate, and other special events. They built a classification model for categorizing tweets as relating to either traffic incidents (TI) or non-traffic incidents (NTI) using the Naïve Bayes algorithm, and a second classification model for categorizing TI tweets into five categories using the sLDA algorithm. The first model achieved an accuracy of 90.5%, while the second model produced true positive results 51% of the time, which indicates that geocodable TI tweets could be correctly classified by the sLDA classification algorithm. Another finding is that 5% of all the tweets fall into the useful (TI) tweet category and have location information. Of these, 60-70% come from "important users" (IU) such as online media, communities, and drivers, while the rest are from individual users. The results of the temporal analysis also show that the majority of sharing activities on Twitter occur on weekdays and during the day, especially at around noon.
Zhang's study [6] also investigated the detection of traffic accidents using data from Twitter. This study focused on detecting non-recurring congestion caused by external factors, including accidents due to collisions, vehicular malfunctions, and small fires. The case study in Zhang's work is Northern Virginia, a metropolitan area in the United States. More than 500K Tweets were collected over the period from January 2014 to December 2014. Time, date, and location information (in the form of latitude and longitude) were retained for each tweet. They experimented with four classifiers, namely SVM, artificial neural network (ANN), deep neural network (DNN), and long short-term memory (LSTM), for building the accident-classification model. DNN performed the best, with an accuracy above 90%. Further validation shows that the accident data reported on Twitter is reliable and consistent with the accident reports submitted to the existing system.
The study by Gutierrez [4] aimed to monitor traffic events in the UK using Twitter data. The methodology of this study involved several steps, including Tweet classification, event type classification, named entity recognition (NER), and temporal and location data extraction. The classification models reach an accuracy between 89% and 95%.
Herwanto's study [7] extracted information related to traffic conditions in Indonesia using Twitter data. This study has three major stages: data classification, entity extraction, and relation extraction. They collected the Tweets using several keywords. In addition, they acquired Tweets from official Twitter accounts of the Jakarta police, e.g., @TMCPoldaMetro. The SVM classifier is used to categorize the Tweets into relevant and irrelevant classes. Entity extraction was performed using the NER method. The entities being recognized were location, time, and status. Then, the relation information was identified from the Tweet, i.e., condition, direction, and detail. The results show that the accuracy of the classification model is over 90% and the NER technique produces an F1-score of 70%.

Model design
We design a pipeline model in order to monitor traffic situations using Twitter data. The pipeline model is described in Fig. 1. It consists of four standalone models, i.e., a classifier to filter out irrelevant Tweets from crawled data, a classifier to categorize relevant Tweets, a model to extract the location information from the Tweets, and a geocoding model used to convert text locations into geographic data points in the form of latitude and longitude.
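The data flow through the four stages can be sketched as follows. This is only an illustrative skeleton: the `TrafficEvent` record, the `run_pipeline` function, and the four callables passed to it are hypothetical names introduced here, standing in for the trained models described later in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class TrafficEvent:
    """One relevant Tweet after it has passed through all four stages."""
    text: str
    event_type: str
    location: Optional[str]
    coordinates: Optional[Tuple[float, float]]  # (latitude, longitude)


def run_pipeline(
    tweets: Iterable[str],
    is_relevant: Callable[[str], bool],                          # stage 1: relevance filter
    classify_event: Callable[[str], str],                        # stage 2: event type
    extract_location: Callable[[str], Optional[str]],            # stage 3: location text
    geocode: Callable[[str], Optional[Tuple[float, float]]],     # stage 4: coordinates
) -> List[TrafficEvent]:
    events = []
    for text in tweets:
        if not is_relevant(text):
            continue  # irrelevant Tweets never reach the later stages
        event_type = classify_event(text)
        location = extract_location(text)
        # Geocoding is attempted only when a location string was found.
        coordinates = geocode(location) if location else None
        events.append(TrafficEvent(text, event_type, location, coordinates))
    return events
```

Keeping the stages as independent callables mirrors the "standalone models" wording above: each stage can be trained, evaluated, and replaced separately.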

Methodology
This research is conducted using the quantitative method. Figure 2 illustrates the research methodology that is used. First, we identify the problem by observing the current system and related documents. Second, we study the literature related to text mining using Twitter data. After that, we collect and preprocess the data. Then, we build the models and evaluate them.

Data collection
We employ data crawling to collect the Tweets. This method requires access to an application programming interface (API) obtained from the Twitter developer website. The data is collected in the period from February to March 2020. The queries used for data collection are "jalan berlubang", "jalan rusak", "kecelakaan", "kerusakan jalan", "konstruksi", "lalu lintas", "lubang jalan", "macet", "mogok", "terbakar", "pembangunan", "perbaikan jalan", "situasi lalu lintas" and "tabrak". The important accounts used are @TMCPoldaMetro, @NTMCLantasPolri, @LewatMana, @infoll, @RadioElshinta, and @PTJASAMARGA. We acquire 280,412 tweets from the crawling process. The data collected include the unique tweet number, the full text of the tweet, the coordinates of the device when used to post the tweet, the location listed in the user's profile, and the time at which the profile was created.

Data annotation
The annotation process is carried out to produce the labeled data needed for training the supervised models. We sample 10,000 Tweets to be annotated. Two annotators perform data annotation in two steps. First, a Tweet is labeled as relevant if its content is related to a traffic situation. Otherwise, it is labeled as irrelevant. Second, each relevant Tweet is categorized into one of five event categories. Table 1 defines the labels used for traffic event type annotation. After the annotation is complete, the level of agreement between the two annotators is measured. We use Cohen's kappa coefficient to determine the agreement [9]. We obtain Cohen's kappa values of 93% for the relevance classification and 95% for the traffic event classification. The agreement between the two annotators in both annotation steps can be interpreted as "almost perfect agreement". The remaining disagreements occur due to annotator inaccuracy when assigning labels. These differences are then discussed by the two annotators to determine which label is correct. In addition, location labeling is also carried out. This annotation is done by a single annotator, who is the first author of this paper. For the location labels, 400 tweets are annotated.
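For readers unfamiliar with the agreement measure, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance given each annotator's label frequencies. The following is a minimal sketch of that computation; the function name `cohen_kappa` and the toy labels are illustrative and are not taken from the study's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: the annotators disagree on one of four Tweets.
annotator_1 = ["relevant", "relevant", "irrelevant", "irrelevant"]
annotator_2 = ["relevant", "irrelevant", "irrelevant", "irrelevant"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In the toy example the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5, which is far below the values of 0.93 and 0.95 reported for the two annotation steps.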

Data preprocessing
The data preprocessing in this study includes several tasks, i.e., case folding, punctuation and stopword removal, stemming, and text normalization. We utilize the Python Sastrawi library for stemming and stopword removal.
1. Case folding: convert the text to lowercase.
2. Punctuation removal.
3. Stemming: strip the affixes from a word and return the word stem.
4. Stopword removal: remove frequently occurring or common words in the text corpus.
5. Text normalization: transform non-standard words (e.g., slang words) into a standardized form. In this study, we apply dictionary-based normalization using the Salsabila dictionary [10].
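The five steps above can be sketched as a single function. This is a simplified stand-in rather than the study's implementation: the tiny `STOPWORDS` set and `SLANG` dictionary are hypothetical placeholders for the Sastrawi stopword list and the dictionary from [10], and stemming is left as an optional callable (`stem`) so that the sketch runs without third-party packages. The step order follows the numbered list and is an assumption.

```python
import string

# Hypothetical miniature resources, for illustration only.
STOPWORDS = {"di", "yang", "dan", "ke"}
SLANG = {"jln": "jalan", "bgt": "banget", "gk": "tidak"}

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, stem=None):
    text = text.lower()                                  # 1. case folding
    text = text.translate(_PUNCT_TABLE)                  # 2. punctuation removal
    tokens = text.split()
    if stem is not None:                                 # 3. stemming (optional here)
        tokens = [stem(token) for token in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. stopword removal
    tokens = [SLANG.get(t, t) for t in tokens]           # 5. dictionary normalization
    return " ".join(tokens)


cleaned = preprocess("Macet bgt di Jln. Sudirman!")
```

With the placeholder resources, the example Tweet becomes `macet banget jalan sudirman`. A stemmer object's stemming method can be passed as `stem` to enable step 3.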

Feature extraction
We apply word vectorization to extract the features from the Tweets. TF-IDF vectorization and count vectorization are the two types of feature extraction explored in this study. The vectorization is conducted using the Python sklearn library.
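As an illustration of the two representations, the sketch below vectorizes three made-up Tweets with scikit-learn's `CountVectorizer` (bag of words) and `TfidfVectorizer`. Default parameters are used here as an assumption; the settings used in the study are not stated in this section.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up example Tweets (already lowercased).
tweets = [
    "macet di jalan sudirman",
    "kecelakaan di jalan tol",
    "jalan rusak dan berlubang",
]

# Bag of words: each column holds the raw count of one vocabulary term.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(tweets)

# TF-IDF: counts are reweighted so that terms occurring in many Tweets
# (such as "jalan", which appears in all three) contribute less.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(tweets)

jalan_col = tfidf.vocabulary_["jalan"]
macet_col = tfidf.vocabulary_["macet"]
weights_first_tweet = tfidf_matrix[0].toarray()[0]
```

In the first Tweet, "jalan" and "macet" both occur once, so their bag-of-words counts are equal, but the TF-IDF weight of "jalan" is lower because it occurs in every document.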

Text classification
The classification algorithms used in this study are Naïve Bayes, Random Forest, Support Vector Machine, and Logistic Regression.
1. Naïve Bayes is a classifier that uses statistical and probabilistic techniques.
2. Random Forest is an ensemble classifier built from decision trees. A decision tree has a flowchart-like structure in which each internal node represents a condition on a feature and each leaf indicates an output class label.
A POS tagger trained on the Bahasa CSUI data [14] using a Conditional Random Field classifier tags the tokens in Tweets with Part-of-Speech information. We treat the tokens tagged with the Proper Noun class as location candidates.
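The proper-noun heuristic can be sketched as below. The sketch assumes the tagger output is a list of (token, tag) pairs and that proper nouns carry the tag `NNP`; both the tag name and the choice to merge consecutive proper nouns into one candidate are assumptions made for illustration.

```python
def location_candidates(tagged_tokens, proper_noun_tag="NNP"):
    """Collect runs of consecutive proper-noun tokens as location candidates."""
    candidates, current = [], []
    for token, tag in tagged_tokens:
        if tag == proper_noun_tag:
            current.append(token)
        elif current:
            # A non-proper-noun token closes the current run.
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


# Hand-tagged example; a real tagger would produce these pairs.
tagged = [
    ("macet", "NN"), ("di", "IN"), ("Jalan", "NNP"), ("Sudirman", "NNP"),
    ("arah", "NN"), ("Semanggi", "NNP"), ("kata", "NN"), ("Budi", "NNP"),
]
found = location_candidates(tagged)
```

The example also shows the weakness of the heuristic: the person name "Budi" is returned as a candidate alongside the two real locations, because any proper noun is accepted.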

Geocoding
Geocoding is the process by which ambiguous physical locations are represented with numerical coordinates [15]. The geocoding stage requires two different data sets, namely the data set of addresses to be identified with the location of coordinates and the address database to be used as a reference. Two address databases can be used as references in the ArcGIS software, namely databases from ESRI and spatial databases from other sources, such as government agencies authorized to issue spatial data.
The geocoding in this study is implemented in Python using the ArcPy library. The ArcPy Python library can perform analysis, conversion, management, and automation of geographic data. The geocoding process is carried out in two major stages, namely creating a locator from a spatial map and geocoding the dataset using the locator that has been created. The geocoding locator created in this study uses a dataset of road names in Jakarta, sourced from OpenStreetMap, which lists more than 64,000 unique addresses.
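The two-stage idea (build a locator, then match location strings against it) can be illustrated without ArcPy by a plain dictionary lookup. The sketch below is not the ArcPy workflow used in the study: `LOCATOR`, `ABBREVIATIONS`, `normalize_location`, and `geocode` are hypothetical names, and the coordinates are rough placeholders rather than surveyed values.

```python
import re

# Stage 1 stand-in: a miniature "locator" mapping street names to
# (latitude, longitude). Coordinates are illustrative placeholders.
LOCATOR = {
    "jalan jenderal sudirman": (-6.21, 106.82),
    "jalan gatot subroto": (-6.23, 106.82),
}

# Hypothetical abbreviation table used to standardize street names.
ABBREVIATIONS = {"jl": "jalan", "jln": "jalan", "jend": "jenderal"}


def normalize_location(name):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def geocode(name, locator=LOCATOR):
    """Stage 2 stand-in: return coordinates for an exact match, else None."""
    return locator.get(normalize_location(name))


hit = geocode("Jl. Jend. Sudirman")
miss = geocode("Jalan Malioboro")  # not in the Jakarta-only locator
```

Exact matching after normalization is the simplest possible strategy; it makes visible why non-standard spellings and locations outside the reference map cannot be geocoded.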

Evaluation
The experiment to determine the best model uses the k-fold cross-validation setting (k = 10). The evaluation metrics used are accuracy, precision, recall, and F1-Score.
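The evaluation setup can be sketched in plain Python as follows. The function names are illustrative, the metrics are shown for a single positive class, and the fold split is unshuffled for simplicity; whether the study shuffled or stratified the folds is not stated here.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


scores = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1)
folds = list(k_fold_indices(25, k=10))
```

In each of the k rounds, the model is trained on the `train` indices and scored on the held-out `test` indices; the reported figure is then typically the mean of the per-fold scores.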

Data visualization
To facilitate data visualization, we develop a web-based dashboard. The classification model, location information, and geocoding provide input at this stage. The model is run automatically against operational data and the predicted results of the model are visualized spatially and tabularly. The data visualization process is a series of programs that can retrieve data from previous modeling results and visualize that data in various forms.
The dashboard is built using PHP (Hypertext Preprocessor), CSS (Cascading Style Sheets), and JavaScript (JS). The first step is to create a design or mockup of the web interface. This is based on the monitoring system that was built by the Jakarta government to address the traffic situation of the Jakarta Smart City and on other monitoring systems that have been built by other government agencies. There is no interaction between the user and the system. The interface is divided into two displays. On one side is data whose location information has been successfully extracted, with latitude and longitude values shown in the spatial display of the Geographic Information System. On the other side is data whose location information could not be extracted because the Tweet did not contain it.

Experimental result
We conduct three experiments for the Tweet classification task. First, we explore the effect of preprocessing on model performance. Then, we investigate the feature extraction methods. Finally, we compare the performance of the classifiers. In addition, we evaluate the models for location information extraction and geocoding.

Data preprocessing evaluation
Previous works have shown that data preprocessing can affect classifier accuracy on text classification tasks [16][17][18][19][20]. To investigate the effect of preprocessing on the traffic Tweet corpus, we conduct an experiment with several variations of preprocessed data. We run the SVM classifier using the count vectorizer on data with four distinct preprocessing settings, i.e., (1) original data without preprocessing, (2) data after case folding, (3) data with all preprocessing steps (case folding, punctuation removal, stemming, stopword removal, and normalization), and (4) data with all preprocessing steps except stemming and stopword removal. Table 2 presents the results of the preprocessing experiment. The results in Table 2 imply that, unlike in previous works, most preprocessing steps do not help to improve the classification accuracy. Only case folding increases the accuracy slightly.

Classification model evaluation
We experiment with different text representations to obtain the best classification features. For the count vectorizer, we select the unigram representation, while for TF-IDF, we compare unigram, bigram, and trigram representations. In addition, we explore the character n-gram representation. In the feature extraction experiment, we apply only case folding as the preprocessing step. Table 3 presents the results of the model evaluation.
Based on the results presented in Table 3, the Logistic Regression classifier using the combination of a unigram count vector and a character n-gram TF-IDF vector performs the best among all models tested, achieving accuracies of 90.72% and 94.14% for the Tweet relevance prediction and traffic event type prediction tasks, respectively. In contrast, using only TF-IDF trigram features yields lower accuracy than the other features. Combining bigram and trigram features shows a positive trend in accuracy.
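A feature combination of this kind can be expressed in scikit-learn with `FeatureUnion`, which concatenates the columns produced by several vectorizers. The sketch below is one plausible configuration, not the study's exact setup: the character n-gram range `(2, 5)`, `max_iter=1000`, and the four toy Tweets are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_relevance_classifier():
    features = FeatureUnion([
        # Word unigram counts; lowercase=True performs the case folding.
        ("word_counts", CountVectorizer(analyzer="word", ngram_range=(1, 1),
                                        lowercase=True)),
        # Character n-gram TF-IDF; the n-gram range is an assumed value.
        ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 5),
                                       lowercase=True)),
    ])
    return Pipeline([
        ("features", features),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])


# Toy training data, for illustration only.
texts = [
    "macet parah di jalan sudirman",
    "kecelakaan di tol dalam kota",
    "selamat pagi semuanya",
    "makan siang enak sekali",
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = build_relevance_classifier().fit(texts, labels)
prediction = model.predict(["macet di tol"])
```

Character n-grams can match fragments of misspelled or abbreviated words that a word-level vocabulary would miss, which may be one reason the combination helps on informal Tweets. With a realistically sized labeled set, the same pipeline object could be passed to `sklearn.model_selection.cross_val_score` with `cv=10` to reproduce the evaluation protocol.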

Location extraction evaluation
We compare the performance of the two location extraction methods in Table 4. Based on the evaluation results, NER can extract locations from Tweets with a precision above 95%. However, it has low recall (≤ 40%). We suspect that the model cannot generalize location names from its training data, in which the sentences are predominantly written in standard Indonesian, whereas our traffic corpus, sourced from Twitter, is informally written and may contain misspellings [21]. Conversely, using the POS tagger for location extraction achieves much higher recall (76%) but very low precision (33%). Locations are tagged as proper nouns, but proper nouns can also be other entities, e.g., persons and organizations.
Furthermore, we conduct geocoding experiments using the Jakarta road map sourced from OpenStreetMap. The map contains geographic information for 64,930 street names. This map was used to build the locator, which was then employed to convert the 419 true-positive locations extracted by NER and the 840 true-positive locations extracted as proper noun labels by the POS tagger.
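The reported figures can be cross-checked with the definition of the F1-score. Taking the rounded values given in this paper (precision 96% and F1-score 54% for NER; precision 33% and recall 76% for the POS tagger), so that the results are approximate:

```latex
F_1 = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
R = \frac{F_1 \, P}{2P - F_1}

% NER: recall implied by P = 0.96 and F_1 = 0.54
R_{\text{NER}} = \frac{0.54 \times 0.96}{2(0.96) - 0.54}
               = \frac{0.5184}{1.38} \approx 0.38

% POS tagger: F_1 implied by P = 0.33 and R = 0.76
F_{1,\text{POS}} = \frac{2 \times 0.33 \times 0.76}{0.33 + 0.76}
                 = \frac{0.5016}{1.09} \approx 0.46
```

The implied NER recall of about 38% is consistent with the stated bound of at most 40%, and the implied POS-tagger F1-score of about 46% shows that the higher recall does not compensate for the low precision.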
When converting the locations extracted from Tweets into latitude and longitude pairs, 68% of the data is successfully matched with geographic information, i.e., 286 of the 419 Tweets with locations extracted by the NER method and 569 of the 840 Tweets with locations extracted by the POS tagger method. The location information in the remaining 32% of the data fails to be recognized for several reasons. First, the locations are not written in a standard manner, so a further normalization process is needed to produce the exact location name. Second, the predicted location labels are incorrect or the locator does not cover the entire location name. Third, the Tweet does not contain any location information. Last, the location is outside Jakarta, so it does not match the locator built from the Jakarta map.
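The 68% figure follows directly from the counts given above, as this short calculation shows (the function name is illustrative):

```python
def success_rate(geocoded, total):
    """Share of extracted locations that received coordinates."""
    return geocoded / total


ner_rate = success_rate(286, 419)                     # NER-extracted locations
pos_rate = success_rate(569, 840)                     # POS-tagger locations
overall_rate = success_rate(286 + 569, 419 + 840)     # both methods pooled
```

Both methods, and the pooled total of 855 out of 1,259, round to 68%, so the geocoding success rate does not appear to depend on which extractor produced the location string.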

Data visualization
We describe the results of developing a web-based dashboard that serves to visualize the data. The main data source of this system is the modeling output, which contains information about disruptive traffic situations. The visualization is displayed spatially on an OpenStreetMap base map using the Leaflet JavaScript library, and in tabular form. The dashboard is designed for all traffic actors, both road users and traffic organizers. Figure 3 presents information related to current traffic flows. Users can see ongoing traffic disruptions on the monitoring map, where each traffic activity is displayed as a marker. The symbol of the marker represents the type of traffic event that is currently occurring. Users can access more detailed information by clicking on one of the markers on the map. This brings up the tweets related to the traffic event that have been shared by users on Twitter, as well as the date, time, location, and category of the incident. Only Twitter data that has been linked to geographic information through geocoding is displayed spatially. Tweets that have not been linked to geographic information are displayed in rows and sorted by the time at which they were posted. Table 5 shows examples of Tweets and extracted traffic information.
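One way to hand geocoded results to a Leaflet map is to serialize them as GeoJSON, which Leaflet can render as a marker layer. The paper does not state which exchange format the dashboard uses, so the sketch below is an assumption; the `to_geojson` function, the event dictionary keys, and the sample values are all hypothetical.

```python
import json


def to_geojson(events):
    """Build a GeoJSON FeatureCollection from geocoded traffic events."""
    features = []
    for event in events:
        if event.get("lat") is None or event.get("lon") is None:
            continue  # Tweets without coordinates go to the tabular view instead
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point",
                         "coordinates": [event["lon"], event["lat"]]},
            "properties": {"text": event["text"],
                           "category": event["category"],
                           "posted_at": event["posted_at"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample records with placeholder coordinates.
sample_events = [
    {"text": "macet di jalan sudirman", "category": "traffic jam",
     "posted_at": "2020-02-10T07:30:00", "lat": -6.21, "lon": 106.82},
    {"text": "ada kecelakaan barusan", "category": "accident",
     "posted_at": "2020-02-10T07:45:00", "lat": None, "lon": None},
]
collection = to_geojson(sample_events)
payload = json.dumps(collection)
```

The `category` property could then drive the choice of marker symbol on the map, while records skipped here would be listed in the time-sorted table.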

Conclusion
An average of 400-600 tweets related to traffic and road conditions are posted each day. Of that data, 39% is shared by users who frequently share information related to traffic situations and road conditions, while the remaining 61% is shared by general users. Therefore, it can be concluded that data related to traffic activity and disruptive road conditions is available on Twitter and can be used for traffic monitoring. The text mining models produced in this study are the classification model to filter out irrelevant data, the classification model to identify the event type from relevant data, and the location information extraction model. The first classifier filters the data related to traffic situations and road conditions. The second classifier categorizes the relevant data into five categories of events, namely traffic jams, accidents, problems, repairs, and damage.
Based on the empirical evaluation, the best classifier in our experiment is Logistic Regression using a feature union of count and TF-IDF vectors, with case folding as the only preprocessing step. The classifier achieves accuracies of 91% and 94% for the Tweet relevance prediction and traffic event type categorization tasks, respectively. On the other hand, the location extraction models achieve F1-scores in the range of 45-54%. Only 68% of the locations extracted from Tweets can be converted into geolocation information through the geocoding process. This research delivers a dashboard to visualize the analyzed data spatially using geographic information from the results of geocoding and tweet location data.

Future work
For future work, we identify several directions. The low accuracy of the location extraction model is due to the lack of a publicly available NER data set for the Indonesian Twitter domain. We suggest developing such a resource for the Indonesian language. In addition, most location mentions in Tweets are not written in standard language (e.g., they are abbreviated or misspelled), so location normalization is a crucial task. The named-entity linking task for the Twitter domain [22,23] is underexplored for languages other than English.
The currently developed visualization dashboard only displays tweet data related to traffic activities spatially. Future studies may design and implement a dashboard integrated with existing activity monitoring systems developed by the government and the private sector. Other data sources, such as other social media or online media, can also be used to enrich the information presented by the system. The more data related to the traffic situation the system receives, the better the quality of information users can expect.
The information presented by the system developed in this study relates to problems that occur on the road. Further information, such as alternative routes or preferred modes of transportation, can be provided in future research. Some geospatial analyses such as [24] can be applied to produce more information for the user. Data veracity should be assessed to prevent the extracted information from containing fake news and misinformation [25,26]. Spatial analysis of the area around the location where traffic problems occur, using methods such as [27], could provide additional information regarding traffic predictions. Multistage classification can also provide additional information, such as whether congestion is recurring.