The methodology of this paper comprised three steps: (1) data mining, (2) data cleaning and preparation, and (3) machine learning methods. The entire research framework is depicted in Fig. 1.
Data mining
Recently, web scraping has become a common technique for data collection. Simply put, it extracts structured data from websites in an automated way and has been used by authors such as Borde et al. [10], Pérez-Rave et al. [36] and Berawi et al. [7]. In this paper, an algorithm was written in the Python programming language, using the BeautifulSoup and Selenium packages, to collect the desired variables from apartment listings (both sell and rent operations) in the capital city of Vilnius. The data were collected monthly from May to August 2020, a total of four months, and the dataset for each month was saved independently. This period covers two important aspects: the beginning of the coronavirus pandemic, including the quarantine period, and the quarantine release period. With quarantine restrictions tightening and loosening, it is interesting to test whether the variables would have different impacts on the forecasting model.
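As a rough illustration of this step, the sketch below shows how such a scraper might be structured with requests and BeautifulSoup. The portal URL, CSS selectors, column names and file name are placeholders rather than the actual markup or code used in this study, and the Selenium-driven navigation of dynamic pages is omitted.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://example-listings.lt/vilnius"  # hypothetical portal URL


def scrape_page(page_number: int) -> list:
    """Download one results page and pull a few listing attributes."""
    response = requests.get(BASE_URL, params={"page": page_number}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for card in soup.select("div.listing-card"):  # placeholder CSS selector
        rows.append({
            "address": card.select_one(".address").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
            "rooms": card.select_one(".rooms").get_text(strip=True),
        })
    return rows


# One CSV saved per collection month, e.g. May 2020 sell listings
listings = [row for page in range(1, 11) for row in scrape_page(page)]
pd.DataFrame(listings).to_csv("vilnius_sell_2020_05.csv", index=False)
```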
Data cleaning and processing
After the extensive data collection and cleaning procedures, a total of 18,992 apartment listings were gathered over the four-month period, each with up to 16 features: zone (the city zone in which the apartment is located), address, listing price, number of rooms, apartment size, floor, number of floors, change in the list price, year built, distance to the shop, distance to the kindergarten, distance to school, built type (whether the apartment is made of bricks, etc.), heating type, vacancy and price change date. Some features, such as heating type, had more than 40 levels but were reorganised into 13 levels. It is worth mentioning that the size of the collected dataset was very close to the population size, as the retrieved data represented the majority of all existing apartment listings in Vilnius.
Afterwards, the price drops of the property listings in Vilnius were analysed and compared with previous authors’ work on pandemics. Additionally, since many authors have found that the TOM (time on market) variable significantly predicts price drops, a heatmap of TOM values across the Vilnius city boroughs was created for all four months and for both sell and rent operations. From the heatmap, one can also observe whether vacancies were more prominent in the city centre than in other zones: darker colours indicate higher vacancy values and brighter colours indicate lower vacancies. Additional variable distribution visualisations of the rent and sell operations are depicted in Appendices 1 and 2.
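A minimal plotting sketch of such a heatmap is given below, assuming a prepared data frame with hypothetical borough, month and TOM columns; it is illustrative only and does not reproduce the figures in this paper.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical frame: one row per listing with its borough, collection month
# and time on market (TOM) in days; all column names are illustrative.
listings = pd.read_csv("vilnius_listings.csv")

tom_by_borough = listings.pivot_table(index="borough", columns="month",
                                      values="tom_days", aggfunc="mean")

sns.heatmap(tom_by_borough, cmap="viridis", annot=True, fmt=".0f")
plt.title("Average TOM (days) by Vilnius borough, May-August 2020")
plt.tight_layout()
plt.show()
```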
Before applying supervised learning, data preparation and feature selection were carried out. First, the target variable (indicating whether a price change occurred or not) was encoded as a dummy variable for each month, as follows:
$$I\left(y\right)= \left\{\begin{array}{ll}1, \quad y \in A\\ 0, \quad y \notin A\end{array}\right.,$$
(1)
where I is an indicator function over the set A that maps y to 1 if a price change occurred and to 0 if it did not. Similarly, the location variable was converted into a dummy variable: apartments located in the city centre were assigned a value of 1 and those outside the city centre a value of 0. Furthermore, to avoid noise and the curse of dimensionality, this study employed target encoding for the heating and built type variables. The target encoding formula has the following form:
$$\varphi^{(j)} = \frac{1}{{N}^{(j)}} \sum_{i=1}^{N}{y}_{i}\, I\left\{{x}_{i}={x}^{\left(j\right)}\right\},$$
(2)
where N denotes the number of collected data points (\({x}_{i}\), \({y}_{i}\)), \({N}^{(j)}\) the number of points belonging to level j, x the input variable, y the target variable, j the index of the levels and I the indicator function that maps each level of x into a feature \(\varphi\). Additionally, particular variables, such as the number of rooms, the number of floors in the building and the floor on which the apartment is located, were encoded ordinally to preserve their rank order.
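The following sketch illustrates these encoding steps with pandas under assumed column names; in particular, deriving the price-change indicator from a non-missing price-change date and the choice of the city-centre zone label are illustrative rules, not the exact ones used in this study.

```python
import pandas as pd

df = pd.read_csv("vilnius_listings.csv")  # hypothetical file and column names

# Eq. (1): dummy target, 1 if the list price changed during the month, else 0
# (deriving it from a non-missing price-change date is an illustrative rule)
df["price_change"] = df["price_change_date"].notna().astype(int)

# Location dummy: 1 if the apartment is in the city-centre zone, 0 otherwise
df["centre"] = (df["zone"] == "City centre").astype(int)

# Eq. (2): target encoding = mean of the target within each categorical level
for col in ["heating_type", "built_type"]:
    level_means = df.groupby(col)["price_change"].mean()
    df[col + "_enc"] = df[col].map(level_means)

# Ordinal encoding of count-like variables preserves their rank order
for col in ["rooms", "floor", "floors_total"]:
    df[col] = df[col].astype(int)
```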
Machine learning methods
The ML process had two distinct stages, as shown in Fig. 1. In the first stage, the dataset was split into 70% training and 30% test sets, and the most consistent ML algorithm (MCMLA) across the months was sought on the training set to ensure a consistent interpretation when using SHAP values, as different algorithms might exhibit different variable effects. Thus, for all four months, the following 15 algorithms were applied: CatBoost Classifier, Extreme Gradient Boosting (XGB), Light Gradient Boosting Machine, Random Forest Classifier, Extra Trees Classifier, Gradient Boosting Classifier, Linear Discriminant Analysis, Logistic Regression, Ridge Classifier, Naive Bayes, Ada Boost Classifier, K-Neighbors Classifier, Decision Tree Classifier, Quadratic Discriminant Analysis and SVM—Linear Kernel (due to the abundance of algorithms, their formulae are not shown; however, they are standard in Python libraries). Furthermore, for each algorithm, the SMOTE synthetic minority oversampling algorithm was deployed on the training folds during the stratified cross-validation; as described by Chawla et al. [14], SMOTE takes a minority sample, finds its five nearest minority neighbours according to the Euclidean distance metric and generates new samples by interpolating between them. This was done for each month separately and addressed the class imbalance problem.
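A compressed sketch of this screening stage is shown below using scikit-learn, imbalanced-learn and xgboost; only three of the fifteen candidate algorithms are included, and the file name, column names and five-fold setting are assumptions rather than details reported in the text.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("vilnius_prepared.csv")                 # hypothetical prepared month
X, y = df.drop(columns=["price_change"]), df["price_change"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)   # 70/30 split

candidates = {                                           # 3 of the 15 algorithms
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGB": XGBClassifier(eval_metric="logloss", random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in candidates.items():
    # SMOTE sits inside the pipeline, so it is fitted on each training fold only
    pipe = Pipeline([("smote", SMOTE(k_neighbors=5, random_state=42)),
                     ("clf", model)])
    scores = cross_validate(pipe, X_train, y_train, cv=cv,
                            scoring=["accuracy", "precision", "f1"])
    print(name, {m: round(scores[f"test_{m}"].mean(), 3)
                 for m in ["accuracy", "precision", "f1"]})
```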
Subsequently, the results of the 15 models for the four months and both sell and rent operations were reported according to seven different criteria: accuracy, area under the curve (AUC), recall, precision, F1 score, Kappa and the Matthews correlation coefficient (MCC). As described by Brownlee [8], using these criteria, one can objectively choose the best models for the task at hand. In this paper, the most attention was paid to the accuracy, F1 score and precision ratios, since this study dealt with an imbalanced dataset with many negatives. In all cases, the higher the ratios, the better. The formula for accuracy was as follows:
$$\mathrm{Accuracy}=\frac{\mathrm{True\,Positives }+\mathrm{True\,Negatives}}{\mathrm{All\,Samples}},$$
(3)
which gives the overall model accuracy, as all samples appear in the denominator. Meanwhile, the precision formula used only true positives and false positives in the denominator and had the following form:
$$\mathrm{Precision}= \frac{\mathrm{True\,positives}}{\mathrm{True\,Positives }+\mathrm{ False\,Positives}}.$$
(4)
As discussed by Buckland and Gey [12] and Chawla [15], there is usually a trade-off between precision and recall: as one goes up, the other goes down; thus, depending on the goal, one or the other metric can be maximised. Additionally, another measure combines the trade-off between precision and recall and yields a single metric for a classifier in the presence of rare cases. It is called the F1 metric:
$$F1= \frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}.$$
(5)
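As a brief numerical illustration of why accuracy alone can be misleading on an imbalanced dataset, the counts below are made up and are not results from this study:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up confusion counts: 80 TN, 10 TP, 5 FP, 5 FN (100 listings, 15 positives)
y_true = [0] * 80 + [1] * 10 + [0] * 5 + [1] * 5
y_pred = [0] * 80 + [1] * 10 + [1] * 5 + [0] * 5

print(accuracy_score(y_true, y_pred))   # (10 + 80) / 100 = 0.90
print(precision_score(y_true, y_pred))  # 10 / (10 + 5) ≈ 0.667
print(recall_score(y_true, y_pred))     # 10 / (10 + 5) ≈ 0.667
print(f1_score(y_true, y_pred))         # 2 * 0.667 * 0.667 / (0.667 + 0.667) ≈ 0.667
```

Accuracy looks high because of the many negatives, while precision and F1 reveal the weaker performance on the rare positive class, which is why those two metrics received the most attention here.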
In conclusion, the accuracy, precision and F1 metrics were the most important when deciding on the MCMLA. Furthermore, since this paper analysed sell and rent operations independently for each month, the metric scores of all models were combined and averaged. One thing to consider is that machine learning processes are stochastic, meaning that in different iterations the models may change accuracy positions [8, 38]. This is especially true when SMOTE oversampling or stratified cross-validation, which splits the data into different sets, is used. To ensure the replicability of this paper, a fixed random seed was set.
In the second ML stage, the tuning and application of the MCMLA began. The XGB algorithm yielded the most consistent scores and was therefore chosen as the MCMLA. In the tuning process, stratified cross-validation with the SMOTE algorithm was used again, and, to achieve better precision scores, the hyperparameters of the XGB algorithm were tuned using a grid search. For the sell operations, the tuned XGB algorithm used a maximum depth of 8 and a learning rate of 0.491; for the rent operations, a maximum depth of 8 and a learning rate of 0.41. Furthermore, to highlight the functional form of the variable effects when analysing SHAP values, the SMOTE oversampling method was applied to the whole dataset, and the tuned XGB model was fitted once more, independently for each month, on this oversampled dataset.
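A sketch of this tuning step is given below; the search grid, fold count and file name are illustrative assumptions, while the selected values noted in the comments (max depth 8, learning rates 0.491 and 0.41) are those reported above.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("vilnius_prepared.csv")          # hypothetical sell-operation data
X, y = df.drop(columns=["price_change"]), df["price_change"]
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.30,
                                          stratify=y, random_state=42)

pipe = Pipeline([("smote", SMOTE(k_neighbors=5, random_state=42)),
                 ("xgb", XGBClassifier(eval_metric="logloss", random_state=42))])

# Illustrative grid around the values reported in the text
# (max_depth = 8, learning_rate = 0.491 for sell and 0.41 for rent)
param_grid = {"xgb__max_depth": [4, 6, 8],
              "xgb__learning_rate": [0.1, 0.3, 0.41, 0.491]}

search = GridSearchCV(pipe, param_grid, scoring="precision",
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```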
Last, the recent adoption of SHAP values in supervised learning has opened new possibilities for explainable artificial intelligence. Lundberg and Lee [31] and Christoph [13] described the principle of SHAP values as the average marginal impact of a feature value across all possible coalitions. SHAP values are computed with the following formula, which originates from game theory:
$${\upphi }_{i}(v) = \sum_{S \subseteq N\setminus \left\{i\right\}}\frac{\left|S\right|!\left(\left|N\right|-\left|S\right|-1\right)!}{\left|N\right|!}\left(v\left(S\cup \left\{i\right\}\right)-v(S)\right),$$
(6)
where v represents a characteristic function, S represents a coalition, N represents the set of all features, i represents the feature whose contribution is assessed and \({\upphi }_{i}\) represents the feature contribution. In this study, positive SHAP values pushed the prediction towards a price change occurring, and negative values reduced the predicted likelihood of a price change. Furthermore, to understand the general predictive power of a variable, the SHAP values for each feature were averaged in absolute terms; this number shows the average predictive power that the variable achieved relative to all other variables. The higher the mean absolute SHAP value, the higher the predictive power. Thus, this paper presents both individual SHAP values and the average SHAP values.
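The sketch below shows how such SHAP values might be computed for the tuned XGB model with the shap package, continuing from the grid-search sketch above; using the training matrix rather than the oversampled whole dataset, and the simple ranking step, are illustrative choices.

```python
import numpy as np
import shap

# Take the fitted XGB step out of the tuned pipeline from the previous sketch
model = search.best_estimator_.named_steps["xgb"]
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)        # one value per feature per listing

# Mean absolute SHAP value = average predictive power of each feature
mean_abs = np.abs(shap_values).mean(axis=0)
ranking = sorted(zip(X_train.columns, mean_abs), key=lambda t: t[1], reverse=True)
print(ranking[:5])

shap.summary_plot(shap_values, X_train)             # individual SHAP values per listing
```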