Defining user spectra to classify Ethereum users based on their behavior

Bonifazi, Gianluca; Corradini, Enrico; Ursino, Domenico; Virgili, Luca

doi:10.1186/s40537-022-00586-3

Research
Open access
Published: 04 April 2022

Gianluca Bonifazi¹,
Enrico Corradini¹,
Domenico Ursino ORCID: orcid.org/0000-0003-1360-8499¹ &
…
Luca Virgili¹

Journal of Big Data volume 9, Article number: 37 (2022)

Abstract

Purpose

In this paper, we define the concept of user spectrum and adopt it to classify Ethereum users based on their behavior.

Design/methodology/approach

Given a time period, our approach associates each user with a spectrum showing the trend of some behavioral features obtained from a social network-based representation of Ethereum. Each class of users has its own spectrum, obtained by averaging the spectra of its users. In order to evaluate the similarity between the spectrum of a class and the one of a user, we propose a tailored similarity measure obtained by adapting to this context some general measures provided in the past. Finally, we test our approach on a dataset of Ethereum transactions.

Findings

We define a social network-based model to represent Ethereum. We also define a spectrum for a user and a class of users (i.e., token contract, exchange, bancor and uniswap), consisting of suitable multivariate time series. Furthermore, we propose an approach to classify new users. The core of this approach is a metric capable of measuring the similarity degree between the spectrum of a user and the one of a class of users. This metric is obtained by adapting the Eros distance (i.e., Extended Frobenius Norm) to this scenario.

Originality/value

This paper introduces the concept of spectrum of a user and a class of users, which is new for blockchains. Unlike past models, which represented user behavior by means of univariate time series, the user spectrum proposed here exploits multivariate time series. Moreover, this paper shows that the original Eros distance does not return satisfactory results when applied to user and class spectra, and proposes a modified version of it, tailored to the reference scenario, which reaches a very high accuracy. Finally, it adopts spectra and the modified Eros distance to classify Ethereum users based on their past behavior. Currently, no multi-class automatic classification approach tailored to Ethereum exists, although some single-class ones have recently been proposed. Therefore, the only way to classify users in Ethereum is through online services (e.g., Etherscan), where users are classified at their own request. However, the fraction of users thus classified is low. To address this issue, we present an automatic approach for a multi-class classification of Ethereum users based on their past behavior.

Introduction

In recent years, we have witnessed an impressive development of the blockchain technology [1]. It was initially introduced by Satoshi Nakamoto [2] to support the development of the cryptocurrency Bitcoin [3]. Later, smart contracts were introduced in Ethereum^{Footnote 1} and this technology has spread to a variety of applications in the financial sector. Finally, it is now starting to be adopted in an increasing number of sectors.

Anyone can participate in a blockchain network; therefore, different actors can be identified in this ecosystem [4]. For example, if we consider a blockchain like Ethereum, some actors (called miners) maintain the blockchain network, while others allow users to trade different cryptocurrencies and/or make banking transactions. Some others deal with auctions, others offer games or services, and so on. In some cases, there are online systems that provide a classification of the users of a blockchain network, even if the fraction of classified users is very small. The best known of such systems is Etherscan^{Footnote 2}, which provides this service for Ethereum. Through Etherscan, the developer of a smart contract can publish the corresponding code and request verification. Etherscan performs this task and, if the outcome is positive, also provides a categorization of the corresponding user^{Footnote 3}.

Knowing a user’s category can be extremely relevant in the context of blockchain networks [5, 6]. For example, such knowledge allows us to find a set of competitors of a user performing a certain activity (exchange, bancor, etc.). In addition, through appropriate analyses, it is possible to identify whether, within a category, there are backbones of users connected to each other to avoid competing with one another or to gain dominant positions over others. Again, thanks to even more complex analyses, it is possible to understand the different strategies carried out by users of the same category and which of them is the winning one.

Despite the importance of this knowledge, in the past literature, there exist very few approaches that, given a user of a blockchain network, can automatically derive her category [6,7,8,9,10,11]. Furthermore, the few categorization approaches currently existing are usually tailored to the Bitcoin blockchain, while general ones have been tested on small specific blockchains. As for Ethereum, several approaches to identify users belonging to a certain category of interest have been proposed in the past. Instead, to the best of our knowledge, no tailored classification approach, like the ones presented for Bitcoin, has been proposed for Ethereum. As a consequence, the only current way to classify users in this blockchain is based on the activity of providers of this service, like Etherscan. However, they can classify at most those users who submit their smart contracts to them for verification. Unfortunately, such users are only a small minority of those present in Ethereum.

In this paper, we aim at filling this gap by proposing an automatic approach for classifying users in Ethereum. The starting point of our approach is that each Ethereum user has an address in order to carry out her activities. It is an alphanumeric code allowing a user to be identified in the blockchain network and to carry out transactions with other users^{Footnote 4}. All the transactions made by a user in a certain time period allow us to reconstruct, at least partially, her behavior in that period.

More specifically, in order to define user behaviors in a certain time interval, our approach first builds a social network representing the users involved in Ethereum and their transactions. Then, starting from this social network, it defines and computes a set of features for each user. They are the number of incoming and outgoing arcs of the node corresponding to the user, the number of incoming and outgoing transactions, the amount of incoming and outgoing money (expressed in Ether, the Ethereum cryptocurrency), the clustering coefficient and the PageRank. The values of these features can change over time. Given a time period T and a user $u_j$, we call the spectrum of $u_j$ in T the set of time series expressing the values of the features for $u_j$ in T. The spectrum of $u_j$ provides a concise, but accurate, picture of the behavior of $u_j$ during T.

Having a spectrum for each user might lead one to think that categorizing users is a simple task. In fact, in principle, one could build a spectrum for each class starting from the spectra of the users belonging to it, identified from the training data. At this point, given a new user, whose spectrum is known, she could be assigned to the class with the spectrum most similar to her own. Although this procedure seems simple at an abstract level, it is much more complex in reality. In fact, we have seen that the spectrum of a user (and, consequently, the spectrum of a class) consists of a set of time series, one for each feature. As a consequence, it is necessary to define a similarity measure between two sets of time series. Furthermore, the various features are not totally independent of each other. In fact, as we will see later, a correlation study on them showed us that some features are totally or partially correlated. Therefore, the spectrum of a user must be managed as a multivariate time series.

As a consequence, we must face a classification problem in which each element to classify and each available class are represented by multivariate time series. To the best of our knowledge, there is no out-of-the-box classification algorithm with these characteristics. Thus, it is necessary to define a new one. The core of such an algorithm consists of a metric capable of measuring the similarity degree between two multivariate time series (which, in our case, are the spectrum of the user to be classified and the spectrum of each class). Several metrics proposed for this purpose exist in the literature. Among them, we mention the Dynamic Time Warping [12], the Weighted Sum SVD [13], and the Eros distance, also known as Extended Frobenius Norm [14]. The latter has been shown to outperform the other more traditional metrics [14]. Hence, it would represent the natural choice in our case. Unfortunately, as we will see in "Evaluation" section, the results obtained by applying the Eros distance to our reference scenario were not satisfactory. However, we managed to define a variant of it. Even if more expensive in terms of computation time (albeit, as we shall see, these costs are largely acceptable), this variant achieves a very high classification accuracy. It represents the core of our classification approach and will be described in detail in this paper, from both a theoretical and an experimental point of view.

A more formal definition of the proposed approach involves structuring it into the following steps:

Construction of the support social network starting from the training set.
Construction of the spectrum of each user from the data about her behavior stored in the dataset and the social network built in the previous step.
Selection of the classes of interest.
Construction of the spectrum of each class from the spectra of the corresponding users.
Definition of a new version of the Eros distance tailored to our scenario.
For each new user:
- Computation of the Eros distance between her spectrum and the one of each class.
- Assignment of the user to the nearest class (or to no class, if her spectrum is very far from the ones of all classes) based on the values of the Eros distance computed in the previous step.

We highlight that our approach, albeit designed to classify Ethereum users, can be easily extended to other blockchains. In fact, as we will see below, the social network-based representation of a blockchain, the definition of the spectrum of a user or a class, and the classification algorithm are characterized by a high abstraction level. Therefore, they can be easily applied to many different blockchains.

In summary, the main contributions of this paper are as follows: (i) we define a social-network based model to represent Ethereum; (ii) we introduce the concept of spectrum of a user or a class of users; (iii) we propose a multivariate representation, instead of a univariate one, of a user’s behavior; (iv) we introduce a modified version of the Eros distance to measure the distances between spectra; (v) we propose an automatic multi-class algorithm (instead of the single-class existing ones) for classifying Ethereum users based on their past behavior.

The outline of this paper is as follows: In "Related literature" section, we present the related literature. Then, in "Proposed method" section, we provide the description of the proposed approach. In "Experiments" section, we present some experiments aimed at performing an Exploratory Data Analysis on our dataset and tuning our approach. In "Evaluation" section, we present the tasks we performed to evaluate our approach and the results obtained. In "Discussion" section, we highlight the strengths of our approach, compared to the current state of the art described in detail in "Related literature" section. Finally, in "Conclusion" section, we draw our conclusions and highlight some possible future developments of our research.

Related literature

After the introduction of Bitcoin in 2008 [2], many cryptocurrencies have been created and have spread [15]. This prompted researchers to investigate both the development of this phenomenon and the issues related to it [16,17,18,19,20,21]. Indeed, while the growth of cryptocurrencies has opened new opportunities, it also led to new challenges to face and several problems to overcome.

As a matter of fact, malicious users have found in cryptocurrencies new opportunities for profit by deceiving newcomers [22], thanks also to the fact that blockchains guarantee a certain degree of anonymity [23, 24]. Many researchers have proposed approaches to detect frauds, scams and, generally, illegal transactions on several cryptocurrencies, such as Bitcoin and Ethereum [25,26,27,28]. Other ones have focused on tracking accounts and people, or groups of people, who performed these illegal acts [29,30,31]. This last challenging issue has paved the way to the more general problem of classifying and characterizing accounts, addresses and smart contracts in a blockchain [32, 33].

As for this topic, the authors of [7] propose to characterize an entity in the Bitcoin blockchain by analyzing information revealed by the patterns of the transactions made by its neighbors. Here, the term “entity” is used to denote the set of the addresses of a single user. This way of proceeding is motivated by the fact that a user can have several addresses associated with her in the Bitcoin blockchain. The approach of [7] models the Bitcoin blockchain as a directed weighted bipartite graph. Using the WalletExplorer website, the authors of [7] obtain a final labeled dataset with 30,331,700 addresses, associated with 272 entities. These are divided into five categories, namely Exchange, Service, Gambling, Mining Pool and DarkNet Marketplace. Classification is performed using 315 features belonging to five different categories, namely address, entity, time, centrality and motif. This approach achieves an overall accuracy of 0.85 with the Logistic Regression classifier and 0.92 with the LightGBM one.

The authors of [6, 8] propose a multi-class service identification of Bitcoin addresses based on a summarization of transaction history. Specifically, the authors of [6] consider eight parameters to perform this task. Using WalletExplorer and Blockchain.info, they identify seven types of Bitcoin-enabled services, along with a set of more than 26,000 addresses associated with them. Starting from this training set, they achieve a classification accuracy of 0.70 (resp., 0.72) in the address-based (resp., owner-based) scheme using a random forest classifier. The authors of [8] start from the approach proposed in [6] but add two more parameters to support classification. Using a dataset of 13 million transactions, they evaluate the new set of features with eight different classifiers. Proceeding in this way, they manage to improve the results obtained in [6]. In fact, they achieve a Micro F1 score equal to 0.87 and a Macro F1 score equal to 0.86 with the LightGBM classifier.

The authors of [9] present a new approach to decrease the anonymity of Bitcoin through entity characterization based on a cascade of machine learning models. This approach uses data on entities, addresses and motifs as classification features. The simultaneous usage of several machine learning models, each inserted several times in a cascade, allows the authors to reach a very high global accuracy, equal to 0.9968. This result is obtained after an appropriate training of all the models involved that, de-facto, tailors the overall model on the training data. The testing campaign was performed using the WalletExplorer dataset and the Gradient Boosting model.

The authors of [34] propose an approach focused on the detection of entities belonging to a single class, i.e., Exchange. First, they model the Bitcoin blockchain as a directed hypergraph. Then, they use this hypergraph to build classification models capable of detecting a set of discriminating features. Finally, they employ these features to decide whether an entity belongs to the Exchange class. The accuracy achieved by this approach is equal to 0.80, which is lower than that of other ones. However, this approach has the important advantage of exploiting only purely structural features of the hypergraph.

Finally, in [10, 11, 35], the authors propose two different methods that perform classification and clustering of addresses in a blockchain starting from the behavior of the corresponding users. In particular, the authors of [11] propose a deep learning based classification method called PeerClassifier. Instead, those of [10] propose a clustering method that uses the Dynamic Time Warping similarity measure applied to two sequences represented as two univariate time series. In both cases, the experimental campaign is conducted on a real blockchain operating on stock trading.

In the past literature, there are some Etherscan-based approaches to classify Ethereum users in a multi-class hierarchy. Furthermore, there are a few automatic approaches that aim at identifying addresses belonging to a specific category of Ethereum users [5, 36,37,38,39,40,41]. All of them are single-class, i.e., they were conceived to identify users belonging to a certain class. For example, the authors of [36] (similarly to [42,43,44]) propose an approach to store Ethereum data in a graph database in order to carry out analyses on it. They derive the data of interest from Etherscan and create a hierarchy representing addresses and their transactions. The authors of [37] propose a methodology for labeling the addresses of cryptocurrencies. First of all, they classify cryptocurrencies in two groups. The former includes those with an Unspent Transaction Output, such as Bitcoin and Litecoin^{Footnote 5}. The latter comprises account-based cryptocurrencies, such as Ethereum and EOS^{Footnote 6}. After this, based on their experience in the field (they work at Binance.com), they propose an approach to label addresses of the first and second cryptocurrency type and verify it on Bitcoin and Ethereum. The authors of [5, 39, 45] propose an approach to detect phishing accounts in Ethereum. It first collects the data of interest from Etherscan and uses this data to build an Ethereum transaction network. Then, it applies a network embedding method to extract latent features of the accounts performing phishing activities. Finally, it uses these features to train a one-class Support Vector Machine. In [38], the author proposes an approach to cluster Ethereum addresses in order to identify entities controlling multiple addresses. The clustering task is done considering the following features: deposit address, multiple participation in airdrops and token authorization mechanisms. The author shows that his approach can cluster 17.9% of all active externally owned account addresses. He also finds that there are more than 340,000 entities that likely control multiple addresses. The authors of [40] propose an approach to detect Ponzi schemes implemented as smart contracts in Ethereum (also called “smart Ponzi schemes”). First, they manually identify 200 smart Ponzi schemes in Ethereum. Then, starting from the analysis of these schemes, they extract features to recognize smart Ponzi schemes. Finally, they use the extracted features to identify new smart Ponzi schemes. They show that their approach achieves a very good accuracy and estimate that there are at least 500 smart Ponzi schemes running on Ethereum.

Proposed method

In this section, we present our approach. As mentioned in the Introduction, it consists of several steps, each introducing innovations with respect to the corresponding tasks proposed in the past. More specifically, the outline of our approach is as follows:

1. Construction of a social network supporting the representation of a training set concerning Ethereum users and their behavior.
2. Construction of the spectrum of users of the training set from their data stored in the dataset and some metrics computed on the social network built at Step 1.
3. Selection of the classes of interest. These are presumably the ones most prevalent in the dataset and, thus, in Ethereum. However, if we want to focus on one or more uncommon classes (e.g., for studying an outlier class), we can do it.
4. Construction of the spectrum of each class selected at Step 3 starting from the spectra of the users of the training set associated with that class.
5. Definition of a new version of the Eros distance tailored to our scenario and computation of the corresponding weights starting from the dataset.
6. For each user to be classified (whether she belongs to the test set or is a new user of whom nothing is known):
   (a) Construction of the corresponding spectrum.
   (b) Computation of the Eros distance between the spectrum built at Step 6(a) and the class spectra built at Step 4.
   (c) Assignment of the user to the nearest class according to the values of the Eros distance computed at Step 6(b). Otherwise, assignment of the user to no class if the Eros distance between her spectrum and that of all available classes is higher than a certain threshold.

In the next subsections, we describe the various steps of our approach in detail.
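As an illustration of Steps 4 and 6(c), the following minimal Python sketch averages the spectra of the users of a class and assigns a new spectrum to the nearest class, or to no class when every class is farther than a threshold. The names `class_spectrum`, `classify` and `threshold` are ours, and `toy_distance` is only a placeholder that stands in for the Eros distance discussed later; it is not the metric proposed in this paper.

```python
from typing import Callable, Dict, List, Optional

# A spectrum is assumed to be a matrix: one row per day, one column per feature.
Spectrum = List[List[float]]


def class_spectrum(user_spectra: List[Spectrum]) -> Spectrum:
    """Step 4: average, cell by cell, the spectra of the users of one class."""
    n_users = len(user_spectra)
    n_rows, n_cols = len(user_spectra[0]), len(user_spectra[0][0])
    return [
        [sum(s[r][c] for s in user_spectra) / n_users for c in range(n_cols)]
        for r in range(n_rows)
    ]


def classify(
    spectrum: Spectrum,
    class_spectra: Dict[str, Spectrum],
    distance: Callable[[Spectrum, Spectrum], float],
    threshold: float,
) -> Optional[str]:
    """Step 6(c): return the nearest class, or None if all classes are too far."""
    distances = {c: distance(spectrum, s) for c, s in class_spectra.items()}
    best = min(distances, key=distances.get)
    return best if distances[best] <= threshold else None


def toy_distance(a: Spectrum, b: Spectrum) -> float:
    """Placeholder metric (NOT the Eros distance): mean absolute cell difference."""
    cells = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(cells) / len(cells)
```

Passing the distance as a parameter keeps the assignment rule independent of the metric, so the placeholder can be swapped for any spectrum distance without touching `classify`.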

Modeling a blockchain as a social network

A blockchain can be modeled through a social network in a very direct way. In fact, the social network nodes can represent the blockchain addresses, while its arcs can denote the transactions between the addresses corresponding to the involved nodes. The capability of building such a model for a blockchain leads to the possibility of extracting knowledge about the behavior of blockchain actors by employing the Social Network Analysis based techniques proposed in the past [46,47,48]. In the following, we show this property taking Ethereum as the reference blockchain because it is the blockchain of interest for this paper. However, we point out again that our approach to build and characterize a social network from a blockchain (and, consequently, the next classification approach representing the core of this paper) can be applied to most blockchains. Indeed, the features used to model Ethereum as a social network (such as the sender, receiver and timestamp of a transaction, and the amount of transferred money) are also present in many other blockchains, like Bitcoin, Litecoin, and so on.

After this necessary preliminary remark, we can now see how a social network ${{{\mathcal {G}}}}$, representing the Ethereum blockchain, can be built. Specifically:

$${{{\mathcal {G}}}} = \langle N, A \rangle$$

Here, N is the set of nodes of ${{{\mathcal {G}}}}$. A node $n \in N$ corresponds to an Ethereum address that has made at least one transaction. Since there is a one-to-one correspondence between the nodes of ${{{\mathcal {G}}}}$ and Ethereum addresses, in the following we will use these two terms interchangeably. Each node n has an associated label $l_n$, indicating the class which it belongs to (see below); $l_n$ is set to null if no class has been assigned to n yet.

A represents the set of arcs of ${{{\mathcal {G}}}}$. There is an arc $a = (n_i, n_j, TrS_{ij}) \in A$ if there was at least one transaction from $n_i$ to $n_j$. $TrS_{ij}$ consists of a set of triplets $(tr_{ij_k}, \tau _{ij_k}, v_{ij_k})$, where $tr_{ij_k}$ represents the $k^{th}$ transaction from $n_i$ to $n_j$, $\tau _{ij_k}$ indicates the corresponding timestamp and $v_{ij_k}$ denotes the amount of Wei^{Footnote 7} transferred from $n_i$ to $n_j$ through $tr_{ij_k}$.
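The definition of ${{{\mathcal {G}}}}$ can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: raw transactions are given as tuples (hash, sender, receiver, timestamp, value in Wei), and the function name `build_network` and the dictionary-based representation are hypothetical choices, not part of the original approach.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# Assumed raw record: (tx_hash, sender, receiver, timestamp, value_in_wei)
Transaction = Tuple[str, str, str, int, int]
# Arc (n_i, n_j) -> TrS_ij, i.e. the list of (tx_hash, timestamp, value) triplets
Arcs = Dict[Tuple[str, str], List[Tuple[str, int, int]]]


def build_network(
    transactions: List[Transaction],
) -> Tuple[Set[str], Arcs, Dict[str, Optional[str]]]:
    """Return (N, A, labels) for the given transactions."""
    arcs: Arcs = defaultdict(list)
    nodes: Set[str] = set()
    for tx_hash, sender, receiver, timestamp, value in transactions:
        # Every address involved in at least one transaction becomes a node.
        nodes.update((sender, receiver))
        # All transactions from sender to receiver share a single arc.
        arcs[(sender, receiver)].append((tx_hash, timestamp, value))
    # l_n is null (None) until a class is assigned to the node.
    labels: Dict[str, Optional[str]] = {n: None for n in nodes}
    return nodes, dict(arcs), labels
```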

Modeling Ethereum as a social network allows us to use various Social Network Analysis measures to characterize each Ethereum address. In particular, we chose a set F of features that can help distinguish one class from another. They are:

In-degree: it represents the number of arcs incoming to $n_i$ and, therefore, the number of nodes of ${{{\mathcal {G}}}}$ pointing to $n_i$. It can be determined by computing the cardinality of the set:
$$IN_i = \{ n_j | (n_j, n_i, TrS_{ji}) \in A \}$$
Out-degree: it denotes the number of arcs outgoing from $n_i$ and, therefore, the number of nodes of ${{{\mathcal {G}}}}$ which $n_i$ points to. It can be determined by computing the cardinality of the set:
$$OUT_i = \{ n_j | (n_i, n_j, TrS_{ij}) \in A \}$$
In-transaction: it indicates the number of transactions towards $n_i$ made by the nodes of ${{{\mathcal {G}}}}$. It can be computed as:
$$\sum _{n_j \in IN_i} |TrS_{ji}|$$
where $|TrS_{ji}|$ denotes the cardinality of the set $TrS_{ji}$.
Out-transaction: it represents the number of transactions towards the nodes of ${{{\mathcal {G}}}}$ made by $n_i$. It can be computed as:
$$\sum _{n_j \in OUT_i} |TrS_{ij}|$$
In-value: it denotes the total amount of Wei received by $n_i$. It can be computed as:
$$\sum _{n_j \in IN_i} \sum _{k = 1 .. |TrS_{ji}|} v_{ji_k}$$
Out-value: it indicates the total amount of Wei sent by $n_i$. It can be computed as:
$$\sum _{n_j \in OUT_i} \sum _{k = 1 .. |TrS_{ij}|} v_{ij_k}$$
Clustering-coefficient: it represents the clustering coefficient of $n_i$. Recall that, in Social Network Analysis, this parameter is an indicator of the tendency of $n_i$ and its neighbors to form a cluster.
PageRank: it denotes the PageRank of $n_i$. This parameter is an indicator of the number of links received by $n_i$, the centrality of the neighbors of $n_i$ and their propensity to link to each other [49].
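The first six features follow directly from the arc set A. The sketch below computes them for a node $n_i$, assuming (our choice) that A is stored as a dictionary mapping each pair $(n_i, n_j)$ to the list $TrS_{ij}$ of (transaction, timestamp, value) triplets; the function name `basic_features` is hypothetical. Clustering coefficient and PageRank are omitted here, since they are usually delegated to a graph library.

```python
from typing import Dict, List, Tuple

# Arc (n_i, n_j) -> TrS_ij, the list of (tx_hash, timestamp, value) triplets
Arcs = Dict[Tuple[str, str], List[Tuple[str, int, int]]]


def basic_features(arcs: Arcs, n_i: str) -> Dict[str, int]:
    """Compute the six degree, transaction and value features of node n_i."""
    # IN_i and OUT_i, each neighbor mapped to the transactions on its arc.
    incoming = {src: trs for (src, dst), trs in arcs.items() if dst == n_i}
    outgoing = {dst: trs for (src, dst), trs in arcs.items() if src == n_i}
    return {
        "in_degree": len(incoming),
        "out_degree": len(outgoing),
        "in_transaction": sum(len(trs) for trs in incoming.values()),
        "out_transaction": sum(len(trs) for trs in outgoing.values()),
        "in_value": sum(v for trs in incoming.values() for _, _, v in trs),
        "out_value": sum(v for trs in outgoing.values() for _, _, v in trs),
    }
```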

In our reference scenario, the time factor plays a key role. As a consequence, our model should take time into account. In fact, users continuously make transactions on Ethereum, which leads to continuous changes in the structure of the corresponding social network and the labels of its arcs.

In order to take time into consideration, given a time instant t, we denote with ${{{\mathcal {G}}}}(t)$ the social network associated with Ethereum that considers the transactions made on that blockchain from its appearance until t and, therefore, the transactions whose timestamp is less than or equal to t.

Similarly, given two time instants $t_\alpha$ and $t_\beta$, we can build a social network ${{{\mathcal {G}}}}(t_\alpha ,t_\beta )$ representing Ethereum, and the transactions made on it, in the time interval $(t_\alpha ,t_\beta ]$. More formally, ${{{\mathcal {G}}}}(t_\alpha ,t_\beta )$ considers only the transactions on Ethereum such that the corresponding timestamp is higher than $t_\alpha$ and less than or equal to $t_\beta$.
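Both time-dependent networks amount to filtering transactions by timestamp. A minimal sketch, assuming the same dictionary-of-arcs representation used for illustration purposes (the function name `restrict` is ours): an arc survives only if at least one of its transactions falls in $(t_\alpha ,t_\beta ]$, and ${{{\mathcal {G}}}}(t)$ is the special case with no lower bound.

```python
from typing import Dict, List, Tuple

# Arc (n_i, n_j) -> TrS_ij, the list of (tx_hash, timestamp, value) triplets
Arcs = Dict[Tuple[str, str], List[Tuple[str, int, int]]]


def restrict(arcs: Arcs, t_alpha: float, t_beta: float) -> Arcs:
    """Arcs of G(t_alpha, t_beta): keep transactions with t_alpha < ts <= t_beta."""
    window: Arcs = {}
    for pair, trs in arcs.items():
        kept = [t for t in trs if t_alpha < t[1] <= t_beta]
        if kept:  # an arc exists only if at least one transaction remains
            window[pair] = kept
    return window


def up_to(arcs: Arcs, t: float) -> Arcs:
    """Arcs of G(t): all transactions whose timestamp is at most t."""
    return restrict(arcs, float("-inf"), t)
```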

Defining the spectrum of a user or a class of users

We have introduced the eight features able to characterize an Ethereum address and we have presented the social network $\mathcal{G}(t_\alpha ,t_\beta )$, modeling Ethereum in the time interval $(t_\alpha ,t_\beta ]$. We are now able to define the concept of spectrum of an Ethereum address in $(t_\alpha ,t_\beta ]$.

Let F be the set of features introduced in the previous section and let $T=(t_\alpha ,t_\beta ]$ be a time interval. We assume that T consists of a certain number of days. Let $d_h$ be the $h^{th}$ day of T. T can be represented as a succession $T = \{ d_{\alpha +1} = d_1, d_2, \cdots , d_h, \cdots , d_q = d_\beta \}$ of q days. Let $f_p$ be a feature of F. It can be associated with a time series $\Phi _p = \{\phi _{p_1}, \phi _{p_2}, \cdots , \phi _{p_h}, \cdots , \phi _{p_q} \}$, where $\phi _{p_h}$ is the value assumed by $f_p$ at a constant and default time of $d_h$ (for instance, at 12:00 am).

We define the spectrum ${{{\mathcal {S}}}}_i^T$ of a node $n_i$ in the time interval T as the set ${{{\mathcal {S}}}}_i^T = \{ \phi _{p_i} \mid f_p \in F \text{ and } \phi _{p_i} \text{ is the succession of the values assumed by } f_p \text{ in } n_i \text{ during } T \}$. In other words, the spectrum of $n_i$ in T is given by a set of successions, one for each feature of F. Each succession is made of the values assumed by the corresponding feature for the Ethereum address associated with $n_i$ for the days belonging to T.

The spectrum ${{{\mathcal {S}}}}_i^T$ can be represented by a matrix that has q rows (one for each day of T) and nine columns. The first column is used to indicate the date, while the other eight ones correspond to the features of F. In particular, the semantics of the columns is as follows:

1. Day: its $h^{th}$ element indicates the date corresponding to $d_h$.
2. In-degree: its $h^{th}$ element denotes the number of addresses from which $n_i$ received transactions during the time interval $\tau _h$ between 12:00 am of $d_{h-1}$ and 12:00 am of $d_h$.
3. Out-degree: its $h^{th}$ element indicates the number of addresses to which $n_i$ has made transactions during $\tau _h$.
4. In-transaction: its $h^{th}$ element denotes the number of transactions received by $n_i$ during $\tau _h$.
5. Out-transaction: its $h^{th}$ element indicates the number of transactions made by $n_i$ during $\tau _h$.
6. In-value: its $h^{th}$ element denotes the amount of Wei received by $n_i$ during $\tau _h$.
7. Out-value: its $h^{th}$ element indicates the amount of Wei sent by $n_i$ during $\tau _h$.
8. Clustering-coefficient: its $h^{th}$ element denotes the clustering coefficient of $n_i$ in the social network ${{{\mathcal {G}}}}(d_{h-1}, d_h)$.
9. PageRank: its $h^{th}$ element indicates the PageRank of $n_i$ in ${{{\mathcal {G}}}}(d_{h-1}, d_h)$.
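To make the matrix layout concrete, the sketch below builds the first seven columns of the spectrum of a node, one row per day. It is an illustration under our own assumptions: transactions are tuples (sender, receiver, timestamp, value in Wei), the days are described by a list of boundary timestamps so that $\tau _h$ is the half-open interval between two consecutive boundaries, and the function name `spectrum` is hypothetical. The clustering coefficient and PageRank columns are left out because they require the daily network ${{{\mathcal {G}}}}(d_{h-1}, d_h)$.

```python
from typing import List, Tuple

# Assumed raw record: (sender, receiver, timestamp, value_in_wei)
Transaction = Tuple[str, str, int, int]


def spectrum(
    transactions: List[Transaction], n_i: str, boundaries: List[int]
) -> List[List[int]]:
    """One row per day: [day, in-degree, out-degree, in-tx, out-tx, in-value, out-value]."""
    rows = []
    for h in range(1, len(boundaries)):
        lo, hi = boundaries[h - 1], boundaries[h]
        # Transactions falling in tau_h = (d_{h-1}, d_h]
        day = [t for t in transactions if lo < t[2] <= hi]
        received = [t for t in day if t[1] == n_i]
        sent = [t for t in day if t[0] == n_i]
        rows.append([
            hi,                              # Day, identified by its closing boundary
            len({t[0] for t in received}),   # In-degree: distinct senders
            len({t[1] for t in sent}),       # Out-degree: distinct receivers
            len(received),                   # In-transaction
            len(sent),                       # Out-transaction
            sum(t[3] for t in received),     # In-value
            sum(t[3] for t in sent),         # Out-value
        ])
    return rows
```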

Defining the new version of the Eros Distance

The algorithm for the Eros distance computation applies Principal Component Analysis [50] to two multivariate time series, each represented by means of a matrix. First it generates the principal components and their corresponding eigenvalues and eigenvectors. In our case, the eigenvectors are associated with the eight spectrum features. More specifically, each eigenvector corresponds to a feature and the associated eigenvalue represents the importance of that feature for the characterization of the address or the class which the spectrum refers to. Then, the algorithm uses principal components and their associated eigenvectors to compute the similarity of the two matrices associated with the multivariate time series under consideration. It is easy and fast to implement; at the same time, as stated in [14], the Eros distance outperforms other traditional similarity measures for multivariate time series, such as the Dynamic Time Warping [12], the Weighted Sum SVD [13], and so forth.
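For reference, the Eros similarity and the corresponding distance can be written as follows, as we recall them from [14]. The symbols are introduced here for illustration only: A and B are the two matrices being compared, n is the number of features (8 in our case), $a_p$ and $b_p$ are the $p^{th}$ unit-norm eigenvectors obtained from A and B, respectively, and $w_p$ is the weight of the $p^{th}$ component.

$$Eros(A, B, w) = \sum _{p=1}^{n} w_p \, |\langle a_p, b_p \rangle |, \qquad \sum _{p=1}^{n} w_p = 1$$

$$D_{Eros}(A, B, w) = \sqrt{2 - 2 \, Eros(A, B, w)}$$

Since each $|\langle a_p, b_p \rangle |$ lies in [0, 1] and the weights sum to 1, the similarity lies in [0, 1] and the distance in $[0, \sqrt{2}]$.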

We selected the Eros distance as the reference metric for computing spectra similarities in our classification algorithm. The latter computes the distance between a blockchain address to be classified and each possible class and assigns the address to the closest class. In this context, the Eros distance allows us to measure the similarity degree between two multivariate time series representing the spectrum of the address to classify and the one of a class.

The way our algorithm proceeds and the adoption of the Eros distance allow us to perform the address classification in a way that minimizes the distances between the spectra of the addresses of the same class and maximizes the distances between the spectra of the addresses of different classes.

The algorithm for the Eros distance computation uses some weights, one for each time series considered and, therefore, one for each feature. Each weight denotes the relative importance of the corresponding time series (and, therefore, of the corresponding feature) with respect to all the other ones.

The original version of the Eros distance described in [14] obtains these weights from the eigenvalues associated with the eigenvectors representing the time series being considered. Initially, we applied this version but, as we will see in "Evaluation" section, the results of the classification obtained in this way were not particularly satisfactory.
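The eigenvalue-based weighting can be sketched as follows. This reflects our reading of [14], in which the eigenvalues of all the matrices in the dataset are aggregated component by component (for example with the mean, the minimum or the maximum) and then normalized so that they sum to 1. The function name `eros_weights` and the input layout are assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, List, Sequence


def eros_weights(
    eigenvalues: List[Sequence[float]],
    aggregate: Callable[[List[float]], float] = mean,
) -> List[float]:
    """Weights from eigenvalues: aggregate per component, then normalize to sum 1.

    eigenvalues[k][p] is the p-th eigenvalue of the k-th matrix in the dataset.
    """
    n = len(eigenvalues[0])
    raw = [aggregate([item[p] for item in eigenvalues]) for p in range(n)]
    total = sum(raw)
    return [r / total for r in raw]
```

With this scheme the weights depend only on how much variance each component carries, not on how well it separates the classes, which is one plausible reason for tailoring them to the scenario at hand.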

Nevertheless, we considered the possibility, offered by the Eros distance, of associating a single value with the distance between two sets of multivariate time series to be a key feature for our context. Therefore, we decided to define a new version of the Eros distance in which the weights are computed in a way tailored to our reference scenario. In this regard, we recall that, in our case, whenever the Eros distance measures the similarity degree of two spectra, it has to consider two sets, each consisting of 8 time series. Each time series has an associated weight, and the overall sum of the weights must be equal to 1. Therefore, in principle, we should consider 2 sets of 8 weights that can vary in any way between 0 and 1, with the only constraint that their overall sum must be equal to 1. It is reasonable to assume that the weights are decimal numbers with two digits after the decimal point. Even with this assumption, the problem is still NP-hard, because it would be necessary to exhaustively examine all the possible valid combinations of weights. As a consequence, although at the moment there are only 4 classes and 8 features, we judged it appropriate to preserve the scalability of our approach and to define a heuristics for solving this problem from the outset. This heuristics is reported in Algorithm 1.
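
To give an idea of the size of the search space, the following back-of-the-envelope count (not taken from our experiments) computes how many admissible sets of 8 two-digit weights summing to 1 exist. It uses the standard "stars and bars" argument: a weight set corresponds to a way of distributing 100 hundredths among 8 features. The function name and its parameters are assumptions of this example.

```python
from math import comb


def count_weight_sets(n_features: int = 8, resolution: int = 100) -> int:
    """Number of ways to split `resolution` units among `n_features`
    non-negative integer weights (stars and bars)."""
    return comb(resolution + n_features - 1, n_features - 1)


print(count_weight_sets())  # 26075972546
```

About $2.6 \cdot 10^{10}$ weight sets would therefore have to be examined for a single class, and this number grows quickly as the number of features increases.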

Our heuristics receives in input:

The set Cl of the classes of interest; in our case, this set consists of the classes “Token Contract”, “Exchange”, “Bancor” and “Uniswap”.
The set ${{{\mathcal {S}}}}_{Cl}$ of the spectra of the classes of Cl; as for our dataset, these are the spectra shown in Figures 4, 6, 8 and 10.
The set ${{{\mathcal {S}}}}_{train}$ of the spectra of the training addresses; the element ${{{\mathcal {S}}}}_{train}^i$ represents the set of spectra of the training addresses assigned to the class $Cl_i$.
The parameter step, which is a decimal number in the range [0, 1]. As we will see below, it allows the management of a tradeoff between the accuracy and the computation time of our heuristics. In fact, the smaller the step, the more accurate the output of our heuristics, but the longer its computation time.

Our heuristics returns a set ${{{\mathcal {W}}}}_{best}$ of weight sets, one for each class. ${{{\mathcal {W}}}}_{best}$ is computed in such a way as to minimize the Eros distance between the spectra of the addresses of the same class and maximize the Eros distance between the spectra of the addresses of different classes. The heuristics also uses a function Eros that receives two spectra $S_x$ and $S_y$ and a set w of weights and computes the Eros distance between $S_x$ and $S_y$ using the weights specified in w.

For each class $Cl_i$ belonging to Cl, our heuristics builds the set $w_t$ of weights as a random combination of two-digit decimal numbers such that $\sum _{k=1}^8 w_t^k = 1$. This last condition is required by the Eros distance and must be verified by any admissible set of weights.
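
A possible way of drawing such a random seed is sketched below. To avoid rounding problems, the weights are expressed as integers in hundredths (so that they sum to 100 instead of 1). The function name, the cut-point technique and the integer representation are assumptions of this example; the text only requires a random admissible combination.

```python
import random


def random_seed_weights(n_features=8, resolution=100, rng=None):
    """Return `n_features` non-negative integers (hundredths) summing to
    `resolution`, obtained by cutting the interval [0, resolution] at
    n_features - 1 random points."""
    rng = rng or random.Random()
    cuts = sorted(rng.randint(0, resolution) for _ in range(n_features - 1))
    bounds = [0] + cuts + [resolution]
    return [bounds[k + 1] - bounds[k] for k in range(n_features)]


seed_weights = random_seed_weights(rng=random.Random(42))
print([w / 100 for w in seed_weights])  # two-digit decimal weights
```

Dividing each integer by 100 gives the two-digit decimal weights used by the Eros function.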

Starting with $w_t$ as seed, our heuristics builds a set $\mathcal{W}_{temp}$ by increasing one of the weights of $w_t$ by a value equal to step and decreasing another one by the same value. It repeats this procedure for every pair of weights of $w_t$. In doing so, it may happen that some of the new combinations obtained are not admissible because one or both of the modified weights do not fall within the range [0, 1]. These combinations are discarded.

Once the construction of this initial version of ${{{\mathcal {W}}}}_{temp}$ is finished, our heuristics proceeds with its enrichment. For this purpose, it repeats the same procedure by increasing a weight of $w_t$ by a value equal to $2 \cdot step$ and decreasing another one by the same value. After this second iteration has finished, it repeats the same procedure by increasing and decreasing the weights of $w_t$ by a value equal to $3 \cdot step$, $4 \cdot step$, and so on. The enrichment of ${{{\mathcal {W}}}}_{temp}$ terminates when, during one iteration of this procedure, no new admissible combination is obtained.

From this description, we can see how step acts as a regulator between accuracy and computation time. In fact, the lower its value, the higher the number of weight sets present in ${{{\mathcal {W}}}}_{temp}$ and, consequently, the higher the accuracy of our heuristics, but the longer its computation time. On the contrary, the higher the value of step, the lower the accuracy of our heuristics but the smaller its computation time.
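
The construction of ${{{\mathcal {W}}}}_{temp}$ can be sketched as follows. This is a minimal illustration under some assumptions: weights and step are integers expressed in hundredths, the seed itself is not added to the candidates, and the function name and the sample seed are hypothetical.

```python
from itertools import permutations


def candidate_weight_sets(seed, step, total=100):
    """Build the candidate weight sets obtained from `seed` by moving
    step, 2*step, 3*step, ... from one weight to another.
    `seed` and `step` are integers in hundredths; `seed` sums to `total`."""
    if step <= 0:
        raise ValueError("step must be positive")
    candidates = []
    multiple = 1
    while True:
        delta = multiple * step
        found_admissible = False
        for up, down in permutations(range(len(seed)), 2):
            if seed[up] + delta > total or seed[down] - delta < 0:
                continue  # weight outside [0, 1]: combination discarded
            candidate = list(seed)
            candidate[up] += delta
            candidate[down] -= delta
            candidates.append(tuple(candidate))
            found_admissible = True
        if not found_admissible:
            break  # no admissible combination at this multiple: stop
        multiple += 1
    return candidates


seed = (20, 10, 15, 5, 10, 20, 10, 10)  # hypothetical seed, sums to 100
print(len(candidate_weight_sets(seed, step=5)),
      len(candidate_weight_sets(seed, step=10)))
```

Every candidate still sums to 100 hundredths, because the same amount is added to one weight and subtracted from another. With the same seed, a smaller step never yields fewer candidates than a larger one, which is the tradeoff discussed above.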

At this point, ${{{\mathcal {W}}}}_{temp}$ has been completely constructed. Now, for each set $w_q \in {{{\mathcal {W}}}}_{temp}$, our heuristics applies the Eros function, with the set $w_q$ of weights, for computing the minimum distance $min_q$ between the spectrum $S_i$ of $Cl_i$ and the spectrum $S_j$ of any address assigned to $Cl_i$. Then, it applies Eros, with the same set of weights, for computing the maximum distance $max_q$ between $S_i$ and the spectrum $S_j$ of any address assigned to a class different from $Cl_i$.

If the minimum current distance $min_d$ concerning $Cl_i$ is greater than $min_q$ and the maximum current distance $max_d$ concerning $Cl_i$ is less than $max_q$, then $max_d$ is set to $max_q$, $min_d$ is set to $min_q$, $w_q$ becomes the new best current set of weights for $Cl_i$ and is assigned to ${{{\mathcal {W}}}}_{best}^i$.

After all the sets of weights of ${{{\mathcal {W}}}}_{temp}$ have been examined, the current value of ${{{\mathcal {W}}}}_{best}^i$ becomes final. At this point, a new class of Cl is selected and the whole procedure described above is repeated. After all the classes of Cl have been examined, our heuristics terminates and returns ${{{\mathcal {W}}}}_{best}$.
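
The selection step just described can be sketched as follows. The sketch takes the distance function as a parameter; for brevity it uses the same list of candidate weight sets for all classes (whereas the heuristics builds one per class) and a toy weighted distance on plain vectors in place of the Eros function. The initial values of $min_d$ and $max_d$, the function names and the toy data are assumptions of this example.

```python
import math


def select_best_weights(class_spectra, train_spectra, candidates, distance):
    """For each class, keep the candidate weight set that lowers the minimum
    distance to same-class addresses and raises the maximum distance to
    addresses of the other classes."""
    best = {}
    for label, class_spectrum in class_spectra.items():
        min_d, max_d = math.inf, -math.inf  # assumed initial values
        for weights in candidates:
            min_q = min(distance(class_spectrum, s, weights)
                        for s in train_spectra[label])
            max_q = max(distance(class_spectrum, s, weights)
                        for other, spectra in train_spectra.items()
                        if other != label
                        for s in spectra)
            if min_d > min_q and max_d < max_q:
                min_d, max_d = min_q, max_q
                best[label] = weights
    return best


def toy_distance(x, y, weights):
    """Stand-in for the Eros function: weighted absolute difference."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


class_spectra = {"A": (0, 0), "B": (10, 10)}
train_spectra = {"A": [(0, 5)], "B": [(10, 5)]}
candidates = [(0.0, 1.0), (1.0, 0.0)]
print(select_best_weights(class_spectra, train_spectra, candidates, toy_distance))
```

In this toy setting the second feature does not separate the two classes, so the weight set (1.0, 0.0), which ignores it, is selected for both.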

We end this description of the heuristics with some considerations regarding its accuracy and computation time. As mentioned above, our heuristics has one parameter, namely step, which acts as a regulator. Its presence guarantees that our heuristics terminates (in fact, it would be enough to choose a high value of step). Clearly, this is not enough to say that our heuristics is adequate for the problem for which it was designed. In fact, it is necessary: (i) to show that the accuracy of the results is acceptable; (ii) to verify that the computation time is acceptable and, in any case, much less than the time taken by an exhaustive approach for defining weights; (iii) if possible, to find a default value for step that can guarantee in most cases an excellent tradeoff between accuracy and computation time. We address these issues later in the paper. For now, we anticipate that: (i) we found that setting step to 0.05 guarantees an excellent tradeoff between accuracy and computation time; (ii) the accuracy of the results obtained by our heuristics proved to be comparable with that of the exhaustive approach; (iii) the computation time of our heuristics is several orders of magnitude less than that of the exhaustive approach. In light of these results, we can say that our heuristics is adequate for the problem it aims to address.

Classifying users based on their spectra

In this section, we define a classification algorithm that, given a time interval T and an address $a_j$ whose spectrum in T is known, assuming that the spectra of the four classes of interest in T are known, is able to classify $a_j$. In particular, the algorithm may assign $a_j$ to one of the four classes or may conclude that $a_j$ does not belong to any of them.

We observe that the classification problem we are considering is complex because it involves comparing spectra and calculating a similarity degree between them. In particular, each spectrum consists of a set of time series. As we saw in "Defining class spectra" section, these are not independent of each other but are correlated. Even if, given two features with a correlation degree equal to 1, we remove one of them and keep the other, we would not have solved the problem because the remaining features would still be partially correlated to each other. As a consequence, we must handle multivariate time series.

Recall that, as stated in the Introduction, the past literature provides some approaches to classify multivariate time series [51,52,53]. We have also specified that, to the best of our knowledge, there is no out-of-the-box classification approach that can be easily implemented in our case. Therefore, we preferred to define a new technique tailored to the characteristics of the problem we want to address. This technique involves the modeling of the blockchain as a social network and the subsequent derivation of the appropriate features from it.

The core of such an algorithm consists of a metric able to compute a similarity degree between multivariate time series. In order to perform this task, we rely on the Eros distance, also known as Extended Frobenius Norm [14].

Once the weights of ${{{\mathcal {W}}}}_{best}$ have been computed, the definition of the classification algorithm is straightforward. In fact, given an address $a_j$ to be classified, it is sufficient to compute the Eros distance between the spectrum $S_j$ of $a_j$ and the spectrum of each available class. $a_j$ will be assigned to the class with the minimum distance. We report the corresponding pseudo-code in Algorithm 2.
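
The assignment rule can be sketched as follows. This is only an illustration: the distance function is passed as a parameter, a toy weighted distance on plain vectors stands in for the Eros function, and the function names and data are hypothetical. The sketch covers only the assignment to the closest class; it does not model the case in which an address is judged not to belong to any class.

```python
def classify(address_spectrum, class_spectra, best_weights, distance):
    """Assign the address to the class whose spectrum is closest, using the
    weight set selected for that class."""
    return min(
        class_spectra,
        key=lambda label: distance(
            address_spectrum, class_spectra[label], best_weights[label]
        ),
    )


def toy_distance(x, y, weights):
    """Stand-in for the Eros function: weighted absolute difference."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


class_spectra = {"A": (0, 0), "B": (10, 10)}
best_weights = {"A": (1.0, 0.0), "B": (1.0, 0.0)}
print(classify((1, 4), class_spectra, best_weights, toy_distance))  # A
print(classify((9, 0), class_spectra, best_weights, toy_distance))  # B
```

Since each class has its own weight set, the distances compared by `min` are computed with different weights; this follows the description above, where ${{{\mathcal {W}}}}_{best}$ contains one weight set per class.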

Experiments

In this section, we present several experiments that helped us to define the details of our approach. In particular, in "Dataset" section, we present the dataset we used for training and testing it. In "An example of user spectrum" section, we describe an example of user spectrum. In "Defining the classes of interest" section, we present the process that led us to define the classes of interest. In "Defining class spectra" section, we illustrate the spectra of the selected classes. Finally, in "Weights of the Eros distance" section, we present the application, to the dataset of interest, of the method for computing the weights of the Eros distance.

In order to carry out our experiments, we used a server equipped with 16 Intel Xeon E5520 CPUs and 96 GB RAM with the Ubuntu 18.04.3 operating system. We adopted Python 3.6 as programming language, its library Pandas to perform ETL operations on data, and its library NetworkX to carry out operations on networks.

Dataset

In order to carry out our analyses, we derived a dataset from Ethereum. In particular, we downloaded the corresponding data from Google BigQuery^{Footnote 8}. The data we selected covers a period from September $1^{st}$, 2019 to October $31^{st}$, 2019. We chose it because we wanted to test our approach in a “normal” period for Ethereum, i.e., a period when there were no particular speculative bubbles. In fact, such bubbles can heavily modify user behaviors and deserve a separate study [4]. We selected all the transactions made on Ethereum in that period. The total number of transactions considered in the dataset is 41,420,435, whereas the total number of addresses is 5,553,645. We computed some statistics on the dataset; they are reported in Table 1.

Table 1 Some preliminary statistics performed on our dataset
