Tracing outliers in the dataset of Drosophila suzukii records with the Isolation Forest method

The analysis of big data is a fundamental challenge for the current and future stream of data coming from many different sources. Geospatial data is one of the sources currently less investigated. A typical example of always increasing data set is that produced by the distribution data of invasive species on the concerned territories. The dataset of Drosophila suzuki invasion sites in Europe up to 2011 was used to test a possible method to pinpoint its outliers (anomalies). Our aim was to find a method of analysis that would be able to treat large amount of data in order to produce easily readable outputs to summarize and predict the status and, possibly, the future development of a biological invasion. To do that, we aimed to identify the so called anomalies of the dataset, identified with a Python script based on the machine learning algorithm “Isolation Forest”. We used also the K-Means clustering method to partition the dataset. In our test, based on a real dataset, the Silhouette method yielded a number of clusters of 10 as the best result. The clusters were drawn on the map with a Voronoi tessellation, showing that 8 clusters were centered on industrial harbours, while the last two were in the hinterland. This fact led us to guess that: (1) the main entrance mechanisms in Europe may be the wares import fluxes through ports, occurring apparently several times; (2) the spreading into the inland may be due to road transportation of wares; (3) the outliers (anomalies) found with the isolation forest method would identify individuals or populations that tend to detach from their original cluster and hence represent indications about the lines of further spreading of the invasion. This type of analysis aims hence to identify the future direction of an invasion, rather than the center of origin as in the case of geographic profiling. Isolation Forest provides therefore complimentary results with respect to PGP. The recent records of the invasive species, mainly localized close to the outliers position, are an indication that the isolation forest method can be considered predictive and proved to be a useful method to treat large datasets of geospatial data.

and management of these species of concern. The amount of data about IAS is always increasing, producing huge data sets whose analysis is becoming always more demanding. The so produced geospatial data is one of the main source of Big Data and one of the less investigated sources. "Big Data" is a definition referring to data sets that are not only simply large, but also of difficult management with traditional analysis methods [40]. The analysis of big data is, in general, a fundamental challenge for the current and future stream of data coming from many different sources [13].
Drosophila suzukii (Matsumura, 1931) (Diptera: Drosophilidae) (also known as vinegar fly and as Spotted Wing Drosophila, hereafter DS) is an invasive insect species that recently colonized many countries outside its native range [9]. DS attacks mainly thin exocarp fruits such as soft and stone fruits, particularly belonging to family Rosaceae, such as cherries and apricots [9]. DS can lay eggs in fruit and the developing larvae feed on the fruit flesh [33]. DS can reach 15 generations per year causing possible further damage by secondary infections by bacteria and fungi, with a potential damage to crops up to 30-100% of the total yield [44]. The estimated economic loss amounted to more than 500 million dollars every year in the USA after [4]. As for comparison, even in the small traditional fruit production area of Trento Province (Italy), about 500,000 EUR were lost in 2010 and 3 million in 2011 [11].
DS is endemic of South East Asia [9] and was first recorded in California, Spain and Italy in year 2000 [8], later colonizing large part of USA, Canada and Europe [9,44]. South America and Central America [15,20]. Despite more than ten years passed since DS first report in Europe, the spatial spreading of the DS invasion is still to be cleared. Cini et al. [8] investigated the spread pattern of DS in Europe by means of a spatial analysis technique (Probabilistic geographic profiling, hereafter PGP) to understand the possible spreading centre/s in European countries. This approach provided fruitful insights, depicting the South of France as a possible centre of spread [8] and also strongly suggesting that the most likely pathway of introduction of DS e is the trade of fruit, with eggs or larvae being transported unnoticed in fruits sea-traded from South East Asia [33].
A PGP is an analysis method aiming to identify the origin of linked events on a map. It was firstly proposed for crimes done by a serial killer in criminology and later for the spreading populations of invasive species and epidemiology [5,6,25,28,35]. PGP uses coordinates of events on a map to calculate a probability surface called geoprofile [14,32]. The geoprofiles does not provide exactly the center of origin of the event, but rather provide areas at different priority on the map with a variable probability density [31,32]. After its first use in criminoloy, PGP was applied to biological problems such as the targeting of an infectious disease [27], the prediction of nest locations of bumble bees [39], animal foraging [21,30], sharks hunting patterns [24] and even the distribution pattern of V2 bombing [3]. More recently, PGP was used to guess the source of an invasion by alien organisms using the positions of their current populations [8,28,38]. This analysis is useful, since it can suggest control methods and give an idea about the gateway of the invasion [8]. Recently, further refinements of the method allowed to improve the power and reliability of PGP, in particular by allowing (a) to clarify if a distribution pattern is caused by more waves of invasion, rather than from a single starting point [8], (b) to evaluate the robustness of the results with a jackknife procedure [26], (c) to give different weights to data on a quantitative basis (on the basis of the population dimension) or new methods of data partitioning [10,36]. The PGP was applied on the DS distribution in Europe, since it does not require any a priori knowledge about the invasion routes, while the biological justifications of the used parameters can be evaluated also by testing several parameters values.
The Isolation Forest method is a machine learning algorithm belonging to the algorithms family based on the "Random Forest" and decision trees used to identify anomalies in big data data sets. The term "isolation" means the separation of one of the items of the data set from the rest of the other items on the basis of one or more given rules [22,23]. The items (called "instances" by Liu et al. [23]) that remain separated from the rest of the data are the outliers or "anomalies". The main usage of the Isolation Forest method was for data mining in general [7] and particularly for monitoring of networks and genomic analysis [41]).
While geographic profiling tries to identify the source of a series of events and hence "something" that occurred in the past, it does not provide an idea about the future development of the series of events. On the contrary, the Isolation Forest method attempts to find the points of the geospatial distribution of events apparently less related to the main bulk of data and so, in our interpretation, to give an idea about the future directions of the spreading itself.
Here we compare the results obtained with PGP on DS distribution pattern with the results obtained with the Isolation Forest (IF) method, with the aim to identify, with the latter method, observations that are in particular positions among the general distribution (outliers), in order to evaluate them as possible beginning secondary invasions/ spreading points. The considered dataset of observations is referred to central-westsouth Europe and was the same used in Cini et al. [8], since it is a well checked dataset. While the results about the future development of Drosophila suzukii in Europe is of interest per se, we intend to use this dataset as a test for the application of the Isolation Forest to biological invasions data analysis.
The rest of the article is structured as follows: first, we explain the method of the Isolation Forest in detail, then we pass to the other techniques of data partitioning employed in the article. In the results section we show also graphically the type of data produced as output by the Isolation Forest analysis, giving an idea of the robustness of the data so obtained. Then we show the results obtained with data partitioning. In the discussion section, we show the high correspondence between the Isolation Forest and the classical methods of data partitioning here employed, concluding with a discussion about the meaning of the "anomalies" found by the Isolation Forest and how this data can be used to treat problems such as biological invasion or other similar problems.

Isolation Forest
The IF needs a preliminary "training" phase to build the model of the system and the decision tree to be used in the following phases of classification, or "test" [23]. Usually, if a data set previously classified (known data) to train the system is not available, a subset of the original data set is selected and used for the training. In our case, the training subset was 25% of the complete data set. The so built model is then used for the analysis on the test data set.

Replicates randomization
In order to eliminate the influence due to the random choice of the test data set, we resampled the data set randomly. This technique can be considered an analogous to the Jackknife or the Bootstrap [26] resampling methods and to the "Random Forest" [37]. For each resampling we execute a new "training" and a new "test". For each replicate the anomalies are recorded. The aim was to increase the feasibility of the anomalies inferred with a single test, considering as more probable the anomalies that are recorded in more than 50% of the replicates. The other anomalies were considered as possible random fluctuations of the training data set.
In our case we have 91 observations (consisting in geographical coordinates on a map).
Since we used a training set of about 25% of the total data set, we used 22 samples for this purpose. The test data set consisted of 69 samples, while we performed 100 replicates.

K-Means and silhouette
The K-Means method consists in the partitioning of a data set without supervision [18,19,35]. It is necessary to provide a priori the number of clusters in which the data set should be divided. If it is not possible, it is necessary to find other criteria to guess the right number of clusters in which the data should be divided. We used as criterion the maximization of the function Silhouette [34] evaluating a number of clusters varying from 2 to 15. When the value of the Silhouette function was optimized for a given number of clusters, we took that number of clusters as the number of clusters for the k-Means analysis.
In a data set of N items, the minimal number of partitions is 2, while the highest number would be N (only one item for each partition). Apart this case, the best number of clusters should avoid a single clusters having less than a given value and the presence of isolated points, so that each partition should have locally homogeneous features.
After the partitioning of the dataset in subclusters, we divided the space (map) accordingly with the Voronoi algorithm [2] as implemented by Santosuosso and Papini [35].

The Isolation Forest method
The Isolation Forest algorithm was described by Liu et al. [23]. Replicate randomization to evaluate robustness of data was introduced by Gnerucci et al. [16]. Kmeans clustering and the Silhouette method for assessing the best number of clusters were applied to the analysis of a spreading pattern of an alien species by Papini et al. [26,28]. The algorithms were applied with procedures written in Python 2.7.14 (default, Sep 23 2017, 22:06:14, www.pytho n.org) programming language. We used the Sk-Learn 0.19.0 library (Pedregosa et al. [29] and MathPlotLib 2.2.2 library [17] (https ://matpl otlib .org/). The Python procedure Plt_IF_DATA_0.1.1.py (https ://bitbu cket.org/ugosn t/al_and_ugo/), was written by the authors.

Anomalies found with the training and the test data sets
In the training set we found n = 24 anomalies with, in total, 300 replicates. In the test set we found n = 46 anomalies with, in total, 1000 replicates. Figure 1 shows the type of data produced by the Isolation Forest analysis during one of the replicates. The anomalies Fig. 1 Some examples of the results of replicates of the execution of IF analysis. The anomalies are reported: those obtained from the training set as yellow squares and in red those obtained from the test set were those observations (points in the map) that were less coherent with respect to the other observations. The anomalies were reported on the maps in Fig. 1 separating those obtained with the training set from those obtained from the test set in the shown replicates. The average number of occurrences in the training phase was = 24 [12.5 ± 1165 (mean ± standard deviation)]. The average number of occurrences in the test phase was 46 [occurrences = 2774 ± 2306 (mean ± standard deviation)].
The total number of anomalies observed was 49. It corresponds to the average number of occurrences (train + test) = 49.
The set composed by "train + test" is the union (in the set theory sense) of the items of the two sets and not by the mathematical sum, since some items may appear in both sets. There is not a significant difference in frequency of anomalies between the training and the test sets (the f-ratio value was 3.38299 and p-value 0.070238. Hence, the result was not significant at p < 0.05). Even between the test set and the whole set of the anomalies the f-ratio value was 1.05112 and the p-value was 0.307966. However, a difference was found between the training set and the total set (the f-ratio value is 5.06563 and p-value 0.027595). Hence, the result was significant at p < 0.05. In conclusion, the training and the test data sets do not present significant differences from the point of view of the anomalies. The robustness of the results is represented in Fig. 2 as proportional to the dimension of the blue triangles. The triangles correspond to the cases in which the single anomaly appears in more than 24 replicates (50%). Of the 24 items resulting from the IF-training analysis phase, only 4 appeared in the replicates more than 24 times (value higher than the mean of the sample + standard deviation = 3.57). Of the 46 items resulted from the IF analysis-Test phase, 6 appeared more than 45 times.
The histogram in Fig. 3 shows the number of times of occurrence of each specific observations among the anomalies during the replicates. The height of the column is proportional to the number of occurrence.
Of the 29 results of the IF test, 6 "anomalies" appeared in the replicates more than 45 times (value higher than the mean of the sample + standard deviation (i.e. 3.57). Considering the union of the two subsets of data ("train + test"), only 7 points presented an occurrence frequency higher than the mean of the sample + standard deviation (i.e. 3.57).

K-Means and Silhouette analysis
The Silhouette values for the varying number of possible clusters are drawn in Fig. 4.
In our results, best number of partitions N = 10 (Fig. 5), we had 2 clusters with relatively few items, but the IF algorithm had already indicated such points as anomalies. This reason, together with the best value of the Silhouette function (Fig. 4), induced us to accept this value as the correct number of clusters. We represented the distribution of the clusters on the map by partitioning space with Voronoi tessellation in Fig. 5, where on the left we indicate both the numerical consistency of the cluster and the Silhouette value for each cluster. Of the 10 clusters, 8 resulted centered on industrial ports on the coast, while the last two are located in the inland.

Discussion
The meaning of the anomalies The anomalies were interpreted as the observations of populations (points in the map) that are tending to become independent with respect to the rest of the populations.
The main populations clusters were identified by K-Means. In the case of an invasion, the k-Means identifies the main clusters of individuals or populations observed on the map [35]. The anomalies would then be individuals or parts of populations that are tending to detach from the cluster to which they belong. The anomalies would then also represent an indication of the future lines of further invasion in a territory.
Since the data set on which we based our analysis was of 2011, we compared the anomalies position of the 2011 database with the more recent observations of D. suzukii. Where available, the new observations where often corresponding to areas close to anomalies identified by the Isolation Forest analysis. See, for instance the new records in the region south-east of Bordeaux 2011-2014 in France by Delbac et al. [12]. Also the expansion from Croatia towards Hungary, Bosnia and Serbia [1] corresponded to the position of the outliers in Croatia on the basis of the 2011 data set, while quite unexpected was the record of D. suzukii in Poland [1], possibily as a result of a new independent introduction.
Almost all the clusters identified with K-Means (interpreted as main invasion sites) contained anomalies, possibly indicating that the invasion of D. suzukii is still in development in Europe.
The method may hence be considered predictive of the lines of future development of the invasion and this knowledge may provide useful suggestion about where an intervention to limit and contain the invasion could be more effective.
Also the DBSCAN [10,36] clustering method is able to identify outliers but requires a priori knowledge about the maximum distance for which points on a map are considered adjacent and the minimum number of adjacent points that are to be considered for a cluster. In both cases, it is necessary to find a biological justification for these assumptions. The isolation forest is a disruptive method based on a probabilistic approach composed of 2 phases: unsupervised learning on a random sub-sample of the data distribution and subsequent extension to the whole data set of the one learned during the test phase. This method considers the whole data set as a single cluster affected by "anomalies" and tries to recognize and distinguish these anomalies from the rest of the points in order to rule out them from a second phase from the planned processing. In both cases it is necessary to find a biological justification for these assumptions.
Anomalies can be considered in our analysis either all the cases highlighted by the various replicates or, more restrictively, the outliers found in all replicates. We can then apply an analysis method (the PGP in our case) for NON-anomalous cases, while on the anomalous cases a further analysis is made to understand why they are "anomalous" and which information they contain.

Conclusions
The Isolation Forest algorithm was used to identify the anomalies within the dataset containing the localities in Europe where the invasive species D. suzukii was recorded. By adding these data to cluster analysis with Kmeans and Silhouette algorithms and by partitioning accordingly the space with Voronoi tessellation we could identify the most probable areas where the invasion of DS is more probable to occur.
The results show that the Isolation Forest can be considered a useful tool to identify the anomalies in a data set consisting in records of observation of an invasive species in a territory or, possibly, other datasets referred to positions in the space. The anomalies can be interpreted as the individuals or populations that are tending to become independent with respect to the rest of the distribution of populations. Such anomalies represent hence an indication of the spreading lines of the invasion. This is a type of analysis not addressed by other methods of spread analysis such as the GPG, that aims rather to the identification of the center of origin.