Impact of rail transit station proximity to commercial property prices: utilizing big data in urban real estate

Introduction
Investment in rail transit not only serves the public interest, for example by increasing physical activity [1], relieving traffic congestion [2, 3], reducing emissions [4], and expanding consumer amenities [5], but also enables owners of property near transit stations to capitalize on their assets, because improved accessibility raises land values. Rising land values indicate the positive impact of rail transit on the local economy, on people's quality of life, and on the overall property market. Capitalization in the real estate market reflects a combination of structural characteristics, surrounding environment, physical attributes, and accessibility, which together determine the market price of an area [6, 7].

The relationship between rail transit and the property market has been extensively investigated in past decades, and studies have taken different types of rail transit into account, including mass rapid transit (MRT), light rail transit (LRT), and heavy rail. Debrezion, Pels, & Rietveld [8] and Mohammad, Graham, Melo, & Anderson [9] contributed extensive literature reviews covering approximately 200 publications on rail transit and property values, while recent studies have discussed the topic in locations worldwide such as Missouri, USA [10], Hamburg, Germany [11], Beijing, China [12], and Queensland, Australia [13]. However, most of them focused on residential properties, and only a few studies addressed commercial or office property values, such as those in Arizona, USA [14], Australia [15], Seoul, Korea [16], and Dubai, UAE [17]. Moreover, none of them presents evidence on how transit proximity affects commercial property prices in developing countries such as Indonesia or other Southeast Asian countries, even though commercial properties are regarded as an important means of attracting business activities that can accelerate the economic growth of a city.
Property sales are currently shifting from newspapers and offline brokers to online advertisements on local and international portals that offer services such as selling and renting many types of buildings and land. Large volumes of property listings, with useful information and various listing types, are advertised daily on these websites. Social scientists have used this massive data source, collected with web scraping techniques, to evaluate urban vacation rentals and rental housing markets and to understand land use patterns and market activities [18,19]. However, there is limited evidence on applying this technique to the transport-property market issue using online advertisements or marketplace listings. Although primary and secondary data collection still plays a significant role in understanding real data patterns, data mining and other big data techniques for handling and analyzing data with many dimensions may serve as an alternative approach that reduces exhaustive primary surveys and speeds up the investigation of large-scale online data sources.
The research objectives of this study are twofold. First, the paper explores under-utilized website data by applying web scraping to property listings gathered from online real estate marketplaces. Second, it evaluates the impact of rail transit proximity on the commercial property market, taking the pre-operation phase of the LRT project in Jakarta, Indonesia, as the case study. Understanding the potential value uplift that rail transit brings to property will support local governments in establishing appropriate policy and regulation, and will also help domestic and international decision-makers deal with land value mechanisms and rail transit investment. The findings are also expected to contribute to the body of knowledge in urban planning and regional development by presenting arguments on the benefits of adopting digital technologies such as big data to process and analyze large urban datasets for academics and researchers.

Literature review
Rail transit has a significant impact on land values, which benefits owners of property near transit stations. Stations are regarded as the core of the transport network, offering better accessibility and mobility by connecting destinations and commuters within walking distance [20]. Numerous studies propose radial buffer areas to measure the travel time users need to reach their point of interest [21,22]. The appropriate measure of proximity between properties and the transit station is debated among researchers and academics. Some measure acceptable proximity by walking distance, such as 300, 600, or 800 m, or anything less than 1 km, depending on the building typology (residential, office, or commercial); others use walking time, mostly less than a 5-min walk. In general, there is no conclusive evidence on the best measure of proximity; it relies heavily on the characteristics of cities and of citizens accessing their destinations. A compact city may require only 300-600 m to reach a building from the station, while other cities may require more than that because of different local conditions.
The impact of rail transit proximity on commercial property prices has been investigated, although the evidence remains very limited. Cervero and Duncan [22] found significant capitalization for commercial land located within walking distance of light rail transit (LRT) and commuter rail stations in Santa Clara County, California, of about 23% and 120%, respectively. Although commercial land gained from transit proximity, the result may differ from one city to another [16]. A study in Dubai, United Arab Emirates, showed that commercial properties gained the highest effect within 700-900 m of a metro station [17]. However, a similar study in Wuhan, China, showed that commercial property closer to the rail station, at approximately 300 m, gained the highest value uplift [23]. This finding is supported by a recent publication from Phoenix, Arizona, which reported that the property price might be up to 1.8 times higher when the commercial property is located within 300 m of an LRT station [14]. A few studies reported a negative impact of rail stations on property [24,25], but overall, accessibility correlates positively with property value. Table 1 presents a summary of selected empirical studies on the impact of rail transit on commercial properties.
Several challenges may explain why research on the effect of rail accessibility on the commercial market in developing countries remains limited, including the lack of investment from both state and private investors and the scattered, unreliable data available for analysis [26]. Although research on rail transit and property has emerged in recent years in developing countries such as Malaysia [27], Turkey [28], and Thailand [26], none of these studies discussed the impact of rail proximity on the nearby commercial market. Therefore, research on this issue in the context of developing countries may help fill the knowledge gap and offer lessons for countries currently developing their railway sector.
In contrast to residential properties, whose values are easier to obtain through either sales prices or rental rates, non-residential properties still pose difficulties in obtaining data to measure property values [8,29]. One challenge is that data on non-residential properties are poorer than data on residential properties in both quality and quantity. Unlike residential properties, which have a high turnover of listings, sales of non-residential properties are rather limited. Real estate associations are believed to hold this kind of information, as they gather the sales of each member in order to report on the annual property landscape, particularly in developing countries such as Indonesia. However, these data are not easy to obtain because of information confidentiality, and when they are available, the records are released in a form too incomplete for analysis.
The difficulty of measuring property values has led academics and researchers to turn to alternative sources to address data scarcity, notably the advertised or asking price [29,30]. Such prices, although limited in number, are advertised in various media, such as magazines, that potential buyers or tenants in real estate marketplaces can easily access. This measurement strategy may introduce a price gap, since the asking price does not express the contractual price and is considered less accurate in showing the equilibrium price in the real estate market. However, some researchers have shown that the asking price can also be used for property analysis and market prediction [31,32], and it has therefore been employed in many property-related studies.
Asking prices from real estate marketplaces are categorized as part of volunteered geographic information (VGI), uploaded by users on a voluntary basis and tagged with geolocation [18]. Asking prices obtained this way are argued to be one source of big data, characterized by large samples, a variety of content in both structured and unstructured form, and the speed of data processing [33]. These types of datasets may promote better urban and transportation planning by removing the subjectivity that personal interests introduce into data collection. Big data is argued to have huge potential for reshaping social science research and directing it towards digitalization [34]. Research that deals with large datasets, such as property market research and development, is predicted to be among the first to feel the impact of fast-growing big data sources. As rail transit in developing countries is undergoing massive development, with more urban rail lines currently under construction, as the commercial property market continues to grow in urban areas, and as internet-based resources are rising tremendously, this research attempts to provide empirical evidence and body of knowledge in transport

Methodology
This research followed a three-stage approach of data mining, data cleaning and filtering, and data modeling, using web scraping, a geographical information system (GIS), and hedonic price modeling (HPM) to investigate the property market around transit stations. The research framework is shown in Fig. 1. The researchers collected more than 900 property listings from two online real estate marketplaces in Indonesia (Lamudi.co.id and Rumah.com) from September 2018 to February 2019. It is crucial to note that the prices listed in both marketplaces are advertised prices, which may differ from transaction prices [35]. The advertised price is nonetheless considered closer to the price in legal contracts than a price obtained from a door-to-door survey of the targeted properties or from the sale value of the tax object (Nilai Jual Objek Pajak/NJOP) issued by the government. The latter has been used to determine property prices in developed countries, whereas in Indonesia the property price can be up to three times the NJOP, depending on the location, property condition, and many other considerations. Nonetheless, real estate marketplaces offer invaluable data sources for research in the property and housing sectors.

Data mining
Web scraping has overcome the difficulty of collecting very large datasets for research in sectors such as transportation, housing, and tourism. Many existing tools can be used to mine data from web pages, and some offer customized solutions for novice users. These tools enable users to extract information, transform databases, structure the data generated from web pages automatically, and store it in either cloud or local databases, depending on user interest [36]. Some tools were created using a programming language such as R or Python for statistical computing and graphics [37]. In this study, Parsehub, a web scraping tool, was used to crawl the websites and extract property market data. The tool is accessible to non-technical users, flexible enough to handle complicated websites, and offers an easy-to-use interface for data retrieval. First, the targeted website is selected and entered into Parsehub. The tool is then used to select the attributes of the property, including property price, building size, and the number of bedrooms. Parsehub also enables users to obtain the location of the property by extracting latitude and longitude points, which requires activating coordinate inspection on the website page prior to submission in Parsehub. Finally, the web scraper saves the structured data in CSV or JSON format, depending on the needs of the research and further analysis. In summary, the web scraping method extracts four items from the property website: (1) property price, (2) building size, (3) number of rooms, and (4) geolocation (X, Y coordinates). The property listings and attributes generated by data mining were then supplemented with accessibility and surrounding attributes (see Table 2), including the distance of buildings (central business district/CBD, hospital, school, and university) to the transit station, measured from coordinate points taken from Google Maps.
Each building's geolocation (latitude and longitude) was retrieved to measure the building's distance to the nearest transit station. Four building typologies were selected because of their capacity to attract a significant number of users who travel by transit rather than by automobile [23,38,39]. All of these data were then plotted in a geographic information system (GIS), a system that allows users to extract, manage, and analyze spatial data and that assists decision making based on it. This research used ArcGIS software to evaluate geographic information and interpret the factors that determine property prices. Only property data located within approximately 1 km of a transit station were included, a distance that provides walkable access for citizens reaching their destinations from a transit station [38,40]. Seven stations of the upcoming light rail transit in Jakarta (see Fig. 2), namely Dukuh Atas, Rasuna Said, Karet Kuningan, Kuningan, Cawang, Cikoko, and Ciliwung, were selected for analysis. By taking the transit stations into account, GIS enabled the researchers to retain only properties close to transit.
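As a rough illustration of what happens after the scraper exports its results, the following Python sketch parses a CSV export into typed records. It is not the authors' workflow: the column names, the sample rows, and the `load_listings` helper are all hypothetical, and a real Parsehub export may be laid out differently.

```python
import csv
import io

# Hypothetical CSV export; column names and values are illustrative only.
SAMPLE_EXPORT = """name,price_idr,building_size_m2,rooms,latitude,longitude
Office A,25000000000,500,10,-6.2200,106.8300
Office B,,300,6,-6.2300,106.8400
"""

def load_listings(text):
    """Parse scraped rows into typed dicts, skipping rows with missing or non-numeric fields."""
    listings = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            listings.append({
                "name": row["name"].strip(),
                "price_idr": float(row["price_idr"]),
                "building_size_m2": float(row["building_size_m2"]),
                "rooms": int(row["rooms"]),
                "lat": float(row["latitude"]),
                "lon": float(row["longitude"]),
            })
        except (KeyError, ValueError, TypeError):
            # Incomplete or malformed listing: leave it out, as the cleaning stage would.
            continue
    return listings

listings = load_listings(SAMPLE_EXPORT)
```

Here the second sample row has no price, so only the first row survives parsing.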
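The distance-to-nearest-station step and the 1 km filter can be sketched in Python as follows. This is only an approximation of the GIS workflow: it uses straight-line (haversine) distance rather than any ArcGIS function or road network distance, and the station names and coordinates are placeholders, not surveyed positions of the Jakarta LRT stations.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def nearest_station(lat, lon, stations):
    """Return (station_name, distance_m) for the station closest to the point."""
    return min(
        ((name, haversine_m(lat, lon, s_lat, s_lon))
         for name, (s_lat, s_lon) in stations.items()),
        key=lambda pair: pair[1],
    )

def within_buffer(listings, stations, radius_m=1000.0):
    """Keep listings whose nearest station lies within radius_m, annotating each one."""
    kept = []
    for item in listings:
        name, dist = nearest_station(item["lat"], item["lon"], stations)
        if dist <= radius_m:
            kept.append({**item, "station": name, "dist_m": dist})
    return kept

# Placeholder coordinates for illustration only.
stations = {"Station A": (-6.2000, 106.8200), "Station B": (-6.2300, 106.8400)}
listings = [
    {"name": "Office 1", "lat": -6.2050, "lon": 106.8200},  # roughly 556 m from Station A
    {"name": "Office 2", "lat": -6.3000, "lon": 106.9000},  # several km from both stations
]
retained = within_buffer(listings, stations)
```

Straight-line distance understates walking distance along streets, so a network-based measure would retain fewer listings for the same radius.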

Data cleaning and filtering
Raw data mined from property websites are commonly disorderly. Initially, more than 900 properties were extracted for potential analysis. Although the websites provide structured text entry for users creating listings, the listings are sometimes incomplete, implausible, duplicated, or unstructured. Therefore, data cleaning and filtering were required to standardize the original dataset for subsequent analysis.
In the first stage, duplicate listings from the initial data mining process were identified. The real estate marketplaces do not prohibit users with the same ID from resubmitting their listings multiple times to push them back to the top of the search results; therefore, a listing was deleted whenever its property name appeared more than once. Afterward, research team members checked that the remaining listings had unique values for property price, building size, and the number of rooms. The marketplaces allow property agents to advertise the same property at the same or different prices, depending on their agreement with the owner. Therefore, it is crucial to examine the similarity of these values and exclude duplicated entries from the analysis.
From the initial data mining of more than 900 properties, the research team retained 114 listings. Unclear listings, including those with implausible values, spam, and incomplete submissions, were dropped using data filtering in Excel. Lastly, only listings with explicit longitude and latitude for the property were retained. Listings with coordinates are argued to offer higher credibility when the given location is accurate [41,42].
Prior to the statistical analysis used to form a regression formula, the 114 listings were evaluated with a normality test. This cleansing step aims to exclude abnormal data and handle outliers for better analysis. This research used the Kolmogorov-Smirnov test because of its practicality and ease of interpretation. The test was applied to the residuals from the regression of the dependent variable on the independent variables. Property price is the dependent variable, while the independent variables consist of distance to station, number of rooms, building size, neighborhood attributes, and location attributes. The first four variables are expressed in natural-log form. The sample is considered normally distributed (see Table 3) when Sig. (2-tailed) is higher than 0.05, and non-normal otherwise. After this step, 95 listings remained for the final analysis.
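The two duplicate checks described above can be sketched in Python as follows. The field names and sample listings are hypothetical, and keeping the first occurrence of each duplicate is an assumption of this sketch; the text does not say which copy was kept.

```python
def drop_duplicates(listings):
    """Drop listings that repeat a property name or an exact (price, size, rooms) combination.

    The first occurrence is kept (an assumption of this sketch).
    """
    seen_names = set()
    seen_attrs = set()
    kept = []
    for item in listings:
        name_key = item["name"].strip().lower()
        attr_key = (item["price"], item["size_m2"], item["rooms"])
        if name_key in seen_names or attr_key in seen_attrs:
            continue  # re-posted listing, or same property advertised by another agent
        seen_names.add(name_key)
        seen_attrs.add(attr_key)
        kept.append(item)
    return kept

# Hypothetical listings for illustration only.
raw = [
    {"name": "Menara X", "price": 100, "size_m2": 50, "rooms": 2},
    {"name": "menara x ", "price": 100, "size_m2": 50, "rooms": 2},       # re-posted under the same name
    {"name": "Menara X Tower", "price": 100, "size_m2": 50, "rooms": 2},  # new name, identical attributes
    {"name": "Gedung Y", "price": 200, "size_m2": 80, "rooms": 4},
]
clean = drop_duplicates(raw)
```

Matching on exact attribute values is deliberately strict; listings for the same property at different prices would still need the manual review the text describes.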
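For readers who want to see the mechanics of the normality check, the following Python sketch computes a one-sample Kolmogorov-Smirnov statistic for regression residuals against a normal distribution. It is not the SPSS procedure used in the study: the mean and standard deviation are estimated from the sample, the p-value uses the plain asymptotic Kolmogorov series (which tends to be conservative in that case), and the residual values are made up.

```python
import math
import statistics
from statistics import NormalDist

def ks_normality(residuals):
    """Return (D, p) for a one-sample KS test of residuals against a fitted normal.

    D is the largest gap between the empirical CDF and the normal CDF;
    p is the asymptotic Kolmogorov p-value, clamped to [0, 1].
    """
    n = len(residuals)
    mu = statistics.fmean(residuals)
    sd = statistics.stdev(residuals)
    zs = sorted((r - mu) / sd for r in residuals)
    std_normal = NormalDist()
    d = 0.0
    for i, z in enumerate(zs):
        cdf = std_normal.cdf(z)
        # Compare the normal CDF with the empirical CDF just after and just before z.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    lam = math.sqrt(n) * d
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Made-up residuals for illustration only.
residuals = [-0.21, 0.05, 0.13, -0.08, 0.02, 0.30, -0.15, 0.07, -0.03, -0.10]
d_stat, p_value = ks_normality(residuals)
looks_normal = p_value > 0.05  # same decision rule as the Sig. (2-tailed) threshold in the text
```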

Data and modelling
This research adopts the hedonic price modeling (HPM) method to measure how property values vary with proximity to the rail station. The method has been used extensively by academics and researchers worldwide and in diverse sectors. HPM measures the effect of individual attributes on the overall transaction price. Multiple regression analysis allows property values to be explained by structural characteristics, neighborhood characteristics, accessibility, and land-use types. HPM can take different functional forms, including linear, log-log, and semi-log. The linear form assumes a constant effect of distance to the rail station on property price. Unlike the semi-log form, which transforms only the dependent variable, the log-log form converts all variables into logarithms and thus implies a constant elasticity of property price with respect to distance to the rail station [23]. Theory does not prescribe the functional form of HPM. Some studies have attempted to shed light on the benefits and limitations of each form [43]; however, there is no evidence that one form dominates the others [44].
In general, the specification of the model depends on the relationship between the dependent and independent variables. The functional form should be selected according to data availability, the nature of the analysis, and the case study in order to generate optimum results [45]. This research used the semi-log model because of its practicality in showing the price elasticity of the associated characteristics and their contribution to property prices. The models are estimated with the ordinary least squares (OLS) method and are used for the subsequent regression analysis. The model denotes the following formula.
where Y = dependent variable (property prices), β = regression coefficient, x = independent variables, and e = error value (5%).
This research uses two data sources to generate values for each identified variable in the HPM models. The first was generated from real estate marketplaces operating in Indonesia through the data mining process using Parsehub. It consists of property price (US$), building size (square meters), and the number of rooms (units). The second, covering distance, buildings, and location, was obtained from a combination of Google Maps and GIS. Both have been discussed in the previous section.
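The three functional forms can be written side by side to make the difference in interpretation explicit. The notation below is an assumption made for illustration (P for property price, D for distance to the station, X_k for the remaining attributes); it is not the paper's own notation.

```latex
% Symbols assumed for illustration: P = property price, D = distance to station,
% X_k = other attributes, \varepsilon = error term.
\begin{align}
\text{Linear:}   \quad & P = \beta_0 + \beta_1 D + \sum_k \gamma_k X_k + \varepsilon,
  & \frac{\partial P}{\partial D} &= \beta_1 \\
\text{Semi-log:} \quad & \ln P = \beta_0 + \beta_1 D + \sum_k \gamma_k X_k + \varepsilon,
  & \frac{\partial \ln P}{\partial D} &= \beta_1 \\
\text{Log-log:}  \quad & \ln P = \beta_0 + \beta_1 \ln D + \sum_k \gamma_k \ln X_k + \varepsilon,
  & \frac{\partial \ln P}{\partial \ln D} &= \beta_1
\end{align}
```

In the linear form the coefficient is a constant price change per metre, in the semi-log form it is approximately a constant proportional change per metre, and in the log-log form it is a constant elasticity.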
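The OLS estimation step can be illustrated with a deliberately simplified Python sketch that fits a single log-transformed predictor in closed form. The study itself estimated a multi-variable model with stepwise regression in SPSS; the `fit_log_log` helper and the toy data here are assumptions for illustration only.

```python
import math
import statistics

def fit_log_log(sizes, prices):
    """Fit ln(price) = intercept + slope * ln(size) by ordinary least squares.

    Returns (intercept, slope, r_squared). With one predictor the OLS solution
    has a closed form, so no matrix algebra is needed.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(p) for p in prices]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot

# Toy data generated from price = 3 * size^0.5, so the fit should recover slope 0.5.
sizes = [100, 200, 400, 800]
prices = [3 * s ** 0.5 for s in sizes]
intercept, slope, r_squared = fit_log_log(sizes, prices)
```

The slope is the estimated elasticity of price with respect to size, and R-squared is the share of variation in ln(price) that the fitted line explains.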

Geographical distribution of property
Urban planners and practitioners require local-scale data that represent population and geographical diversity. Compared with manual questionnaire surveys, extracting data from real estate marketplaces is believed to yield similar results with less effort during data collection. The data provide useful indicators, depending on which components the researchers wish to evaluate. However, the mapping shows only property locations and does not capture the pattern of property prices. It also cannot directly specify the location of a property by street or transit proximity, owing to the limitations of data crawling and of what the real estate marketplaces make available. The map in Fig. 3 shows listings in Jakarta Selatan, an administrative city within the capital of Indonesia, where the seven stations investigated in this research are located.
The map shows not only property locations but also other building typologies such as CBD, high school, and hospital, represented by different symbols. The 114 retained listings located within a 1 km radius of a transit station are shown in Fig. 4.
Based on the filtered dataset, descriptive statistics for the dependent and independent variables are shown in Table 4, including the minimum, maximum, mean, and standard deviation. Property prices extracted through data mining are expressed in Rupiah and were converted to US$. The mean property price is US$ 3666.86 per square meter, and the mean distance from property to rail station is 604.85 m. The highest property price is US$ 9578.54, for a property located 800 m from the transit station, while the lowest is US$ 1179.67, for a property located 787 m from the station. Based on these findings, it is argued that proximity to the station is not the only factor contributing to property price elasticity near the transit station. Other factors that may affect the price include the number of rooms, building size, and location of the property.

Effect of proximity to transit station on property values
A correlation test was first performed to determine the degree of correlation between the independent variables and the dependent variable. Pearson correlation was used because the data are parametric (nominal and scale); the Kendall or Spearman test would be appropriate for non-parametric data. The correlation coefficient ranges from −1 to +1. A value close to −1 or +1 indicates a strong relationship between the two variables, and a value close to 0 indicates a weak relationship. The sign indicates the direction of the relationship: a positive (+) value means that the two variables move in the same direction, and a negative (−) value means that they move in opposite directions. A variable is significant when its Sig. (2-tailed) value is lower than 0.05. Among the three variables (distance, number of rooms, and building size) shown in Table 5, building size has a correlation coefficient close to 1 and a Sig. (2-tailed) value lower than 0.05. The positive sign means that a larger building size is associated with a higher property price. In terms of location, three of the stations, namely Rasuna Said, Dukuh Atas, and Cawang, are highly correlated with property price. The results also suggest excluding CBD and university, as neither contributes significantly to the property price.
The HPM equation was estimated in SPSS using the stepwise regression method. This method evaluated the inputs and produced four candidate equation models for the case study. The best equation was selected as the one with the highest R-squared, that is, the value closest to 1, which indicates how much of the variation in the dependent variable the independent variables explain. Based on the results in Table 6, model 4 is the best model, with the highest R-squared (0.919) and the lowest error value (7.80%) among the models.
Following the correlation test, further tests were performed, including multicollinearity and heteroscedasticity tests. The multicollinearity test determines whether the independent variables in a regression model are correlated with one another. A model proceeds to further analysis only when no such correlation is found. The test uses the tolerance and VIF scores, which should be higher than 0.10 and lower than 10, respectively. The results show that the independent variables are not correlated with one another in any of the four models (Table 6), and that the fourth model shows the highest score among them. Based on these findings, the formula for the effect of proximity to the transit station on property values in the case study is presented as follows.
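The Pearson coefficient described above can be computed directly from its definition, as in the following Python sketch. The `pearson_r` helper and the sample values are illustrative assumptions; the study obtained its coefficients and significance values from SPSS.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for illustration only: building size (m2) and price per listing.
building_size = [120, 250, 300, 480, 600]
price = [200, 390, 500, 760, 910]
r_size_price = pearson_r(building_size, price)
```

A coefficient near +1 for these toy values would correspond to the pattern the text reports for building size; the sketch does not compute the significance value.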
$$\ln(\text{Price}) = 14.288 + 0.524\,\ln(\text{BSize}) - 0.068\,\text{Hospital} - 0.077\,\text{RasunaSaid} + 0.032\,\ln(\text{NRooms}) + \%\text{error} \tag{2}$$
where ln(Price) = natural log of property price (US$), ln(BSize) = natural log of building size (square meters), ln(NRooms) = natural log of number of rooms (units), Hospital = location of a hospital within 1 km (dummy), and RasunaSaid = property location within 1 km of Rasuna Said (dummy).
Overall, two variables, building size and the number of rooms, contribute significantly to the property price around the transit station. A larger building is suitable for commercial and office use, which requires large spaces for daily activities. The findings showed that each additional square meter of building size increases the property price by US$ 6.45. Two other variables, hospital and Rasuna Said (a transit station location), carry negative signs and thus an inverse relationship with price. The presence of a hospital within a 1 km radius of the transit station is estimated to decrease the property price by about 6.8%, equal to US$ 7.29, while property located within 1 km of Rasuna Said station is US$ 8.21 cheaper than property beyond that radius.

Discussion
Previous studies provide insightful evidence that forms a basis of comparison for the present study. Nevertheless, the property market in each country has different characteristics and conditions, which makes direct comparison rather difficult. Table 1 illustrates selected studies on the impact of rail stations on commercial properties worldwide. Unlike previous studies that showed a premium for accessibility to rail transit, this research found that accessibility to the transit station has an insignificant impact on the property price. In contrast, structural and neighborhood attributes play a critical role in raising the price of property near rail transit. There are several reasons why the distance-to-transit variable dropped out of the equation. First, data mining was conducted from September 2018 to March 2019, when the transit stations were yet to operate. At this stage, owners believed transit stations offered little added value, since users still required additional modes of transportation (bus or motorbike ride-hailing) to reach their property. Accessibility within walking distance is also argued to remain unattractive for various reasons. Citizens of Jakarta tend to have negative experiences when walking to destinations, since pedestrian paths outside the arterial roads are narrow (usually only 1 m wide) and lack amenities and lighting. Poor walkway pavement, street vendors occupying the sidewalks, and excessive exposure to vehicle noise and pollution are further reasons. Therefore, it is understandable that property owners do not take better accessibility and mobility into account, as they believe these do not correlate significantly with property prices.
Second, a location with strong branding plays a significant role in determining property prices even when it is far from the rail transit station. Rasuna Said, which is included in the formula, is located in the Kuningan area of South Jakarta, which offers a balanced lifestyle between work and leisure. The area comprises a variety of office buildings, hotels, embassies, and attractions, including shopping malls, art galleries, and public open spaces, that serve as places to relax after a day of work. However, the Rasuna Said station in this project is argued to be located in a poor environment: there is a slum area nearby, a bad odor from the polluted river, and a higher risk of crime because street amenities are incomplete. Therefore, it is not surprising that property in Rasuna Said commands a higher price when it is outside the 1 km walkability radius than when it is close to the rail station.
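To make the magnitudes in the estimated model, Eq. (2), more concrete, the following Python sketch evaluates the fitted equation and converts its coefficients into percentage effects under the usual reading of a log-price regression. The coefficients are those reported in Eq. (2); the function names and the example building are hypothetical, and this reading is offered as an aid to interpretation rather than as the authors' own calculation.

```python
import math

# Coefficients as reported in Eq. (2).
COEF = {
    "intercept": 14.288,
    "ln_bsize": 0.524,
    "hospital": -0.068,
    "rasuna_said": -0.077,
    "ln_nrooms": 0.032,
}

def predict_ln_price(bsize_m2, n_rooms, hospital, rasuna_said):
    """Predicted ln(price) from Eq. (2); hospital and rasuna_said are 0/1 dummies."""
    return (COEF["intercept"]
            + COEF["ln_bsize"] * math.log(bsize_m2)
            + COEF["hospital"] * hospital
            + COEF["rasuna_said"] * rasuna_said
            + COEF["ln_nrooms"] * math.log(n_rooms))

def dummy_effect_pct(coef):
    """Exact percentage change in price when a 0/1 dummy switches from 0 to 1."""
    return (math.exp(coef) - 1) * 100

def scale_effect_pct(elasticity, factor):
    """Percentage change in price when a logged regressor is multiplied by factor."""
    return (factor ** elasticity - 1) * 100

hospital_pct = dummy_effect_pct(COEF["hospital"])        # roughly -6.6
rasuna_pct = dummy_effect_pct(COEF["rasuna_said"])       # roughly -7.4
double_size_pct = scale_effect_pct(COEF["ln_bsize"], 2)  # roughly +44
```

Read this way, a coefficient of 0.524 on ln(BSize) is an elasticity: a 1% larger building goes with a price about 0.52% higher. The dummy coefficients multiplied by 100 (6.8% and 7.7%) are close approximations of the exact percentage effects computed here.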

Conclusion
There is huge potential to use a value capture approach to finance rail transit projects, provided the positive impact on property values of the accessibility improvement brought by transit stations is fully understood. Although the capitalization of the property market around transit stations has been extensively investigated in developed regions such as Europe, the United States, and East Asia, research on the topic in developing countries such as Indonesia remains limited. To fill this gap, this research examines properties listed in several real estate marketplaces and considers seven LRT stations in Jakarta, Indonesia, that are planned to be fully operational in 2021.
Most studies combine a geographical information system (GIS) with hedonic price modeling (HPM) and use the contractual price from the property market. For instance, Barbara et al. [13] examined property sales data for Australia and New Zealand obtained from a property business company, while Mohammad et al. [17] used repeated cross-sectional survey data to evaluate the effect of a metro on commercial property values. The combination of web scraping, GIS, and HPM is rarely found in the literature. Therefore, this approach is argued to contribute to the body of knowledge on the transit and property sector from an operational research perspective.
The paper identifies the effect of proximity to rail transit on commercial property values. The area investigated was within a 1 km radius, in road network distance, of an LRT station. This distance is comparable to the literature, which has suggested distances between 500 and 1500 m [17,23]. The results contrast with previous studies, which suggested that accessibility contributes strongly to property prices near rail transit [8,22,29,38]. The findings suggest that proximity to rail transit does not have a significant impact on commercial property prices compared with other variables such as building size, number of rooms, location, and hospitals. In the estimated model, building size and number of rooms have positive coefficients (0.524 and 0.032), while the Rasuna Said location and hospital proximity have negative coefficients (−0.077 and −0.068). The positive effect of building size and number of rooms means that a larger building with more rooms is highly preferable. On the other hand, property located outside the radius of rail transit (e.g., Rasuna Said station) may command a higher price than property located closer to the station (see the previous section for more discussion). Hospitals also have a negative effect on commercial property prices for several reasons, including pollution, noise, and accessibility. This negative result is consistent with the previous study by Xu, Zhang & Aditjandra [38], who attributed the negative effect to pollution and cultural factors. The findings provide a basis for further examining value capture schemes from transit operation. It is crucial that policy-makers commit to seeking the best design for capturing land value uplift across commercial properties at different distances, in order to gain maximum benefit for the public interest.
Some limitations should be addressed in future research. This study was conducted on a transit project that was still under construction, a stage that may already improve the attractiveness of commercial property along the transit line for owners wishing to sell. It is suggested that follow-up research consider property listings along operating lines such as MRT, BRT, or commuter lines in Indonesia to gain more comparative insight into real estate market conditions. The same research can also be repeated once the LRT is in operation, providing a longitudinal study with comparable findings. This research is also limited by the scarcity of contractual price data from the real estate industry; the asking price was therefore used as an alternative way to derive property values for analysis. Future research is also encouraged to incorporate a variety of measurements, different spatial models, and other techniques in order to recommend a framework for transport-real estate market issues. Furthermore, academics and researchers are encouraged to adopt urban big data and digitalization methods for data extraction and analysis to overcome limited data availability. More advanced techniques such as machine learning, predictive modelling, and natural language processing can be adopted in future work to generate more comprehensive results. Finally, the results of this paper are expected to be informative for similar case studies and for domains beyond the transportation sector.