Illicit Bitcoin transaction detection makes a good use case for Blockchain Machine Learning and Web3 Data Science
“Future of the World Wide Web.”
“A decentralized world.”
“Fair and transparent.”
“Just another buzzword.”
These are some of the things people use to refer to the Web3 (or Web 3.0). Simply put, Web3 is just the internet and World Wide Web that is based on blockchain technology. Bitcoin, which works on the principle of blockchain, is a nice example to understand the decentralized aspect of Web3.
Bitcoin is a decentralized digital currency meaning that it is not controlled by a central bank or an authority. All transactions made on the Bitcoin network are available to the public where the account name is anonymized. You can retrieve the transaction amount, but the account names appear as 26–35 alphanumeric character strings.
This one aspect of Bitcoin (ie, anonymity) enables the participation of entities doing illegal transactions which include money laundering, exchange of illegal goods and services over dark marketplaces, ransomware attacks, etc. And seeing the growth of the Bitcoin network over the decade, with more than 700 million transactions, we can conclude that there’s a problem emerging from all this.
One way to tackle this problem is by constructing a Bitcoin Transaction Graph.
A transaction graph is simply a graph dataset where the nodes are represented by Bitcoin addresses (or public keys) and the edges are represented by the transactions. The edge weight can be used to store the transaction value. This graph would be directed in nature as we want to capture the flow of funds from a sender to a receiver.
The Bitcoin transaction graph is a Directed Acyclic Graph (DAG) which is in the form of
G(V, E) where
V is the vertex/ node and
E is the edge/ link between the two vertices. Here, the edges are an ordered pair of nodes as we are dealing with a directed graph.
We can add a time component to these transaction graphs and make it a dynamic graph (or dynamic DAG). We define a dynamic graph as a series of snapshots, ie,
G = (G₁, G₂, …., Gₜ) where
Gₜ = (Vₜ, Eₜ) and
T is the number of snapshots.
What purpose does this transaction graph serve, you may ask?
For starters, we can find a possible solution in identifying the wrong (or illicit) transactions in the Bitcoin network. This would contribute to an effective fight against the dark marketplaces and it is one of the several use cases of the transaction graph, ie, fraud detection.
Data structures, like a graph, can capture complex real-life scenarios in a way that is easily interpreted by humans, and with some help can be interpreted by machines too (Graph Machine Learning).
A fraudulent or illicit transaction detection can be formulated as a supervised or an unsupervised problem and then modeled using a predictive machine learning algorithm.
A supervised problem needs a labeled dataset. If we want to use the Bitcoin transaction graph with a supervised machine learning algorithm, we need to find a way to label the nodes. In other words, we need to find a way to de-anonymize the Bitcoin addresses.
This paper talks about a few techniques that can be used to annotate the transaction graph. The approach is as follows:
- Web Scraping— Bitcoin users sometimes openly share their public key on the crypto forums. It can be for several reasons like hoping to receive a donation, asking a question, etc. This makes our job easy, and we can use this information to match the public key with the user ID of the forum.
- Transaction Fingerprinting — Here, we use transactional metadata to pinpoint the user with the highest accuracy possible. For instance, over the forums we find some chat discussing the transfer of X number of Bitcoins. We then move to our transactional database and cross-verify the approximate time and date of the transfer. The public key which matches with the transaction is picked this way.
Et voila, we managed to de-anonymize some fraction of the Bitcoin transactions. Finding additional details about the user, we can put a label on every de-anonymized Bitcoin address. Say, we call the labels “Licit” or “Illicit” transactions which now make it a problem of binary classification.
De-anonymizing is a laborious task and does not guarantee the identification of all Bitcoin addresses. This paper talks about the application of unsupervised machine learning for fraud detection on the Bitcoin network. In an unsupervised setting, we do not need to label the public keys anymore.
We go with the assumption that there are a maximum of 1% illicit transactions on the Bitcoin network (as frauds are generally sparse). Feature engineering is done, and it includes the following:
- Features relating to currency — amount sent, received.
- Features relating to network — In-degree, Out-degree, clustering coefficient.
A total of 14 such features were extracted and used as input to K-Means clustering. We can find the optimal number of clusters using the elbow curve method. Basically, we are categorizing all the Bitcoin transactions into X clusters. Some of these clusters would make up more than 99% of the transactions. Our focus would be on the smallest cluster as it is most likely to be the cluster with illicit transactions (according to our assumption).
I stumbled upon this Elliptic dataset which is a Bitcoin transactional sub-graph with about 200,000 nodes with around 23% of nodes labeled as Licit or Illicit. The rest of the nodes are labeled as “unknown.” Every node has a 166-dimensional feature and also includes temporal information with 49 timesteps.
This is an example of a partially de-anonymized transactional graph and we can apply the supervised machine learning algorithms to formulate the fraud detection problem.
The Elliptic dataset has been annotated to a small extent which can be seen in the TSNE visualization where a majority of the Bitcoin addresses are unknown. In order to train the fraud detection model, we pick the addresses with annotated labels only.
I trained a Random Forest classifier to perform the fraud detection and I used the node features (with 166 dimensions as input). The code is shown below:
The results are as follows:
***** RF MODEL *****
ACC: Train: 1.0 Test: 0.987
ROC: Train: 1.0 Test: 0.935
F1: Train: 1.0 Test: 0.993
The test scores are pretty good, and we have successfully trained a fraud detection machine learning model.
Since we are dealing with a graph datatype, I also built another model using a Graph Neural Network. Here’s the code:
Here’s the results:
***** GCN MODEL *****
ACC: Train: 0.98 Test: 0.975
ROC: Train: 0.9 Test: 0.894
F1: Train: 0.99 Test: 0.986
Although the results were not as good as the random forest model, we still have a nice fraud detection model that is based on graph neural networks.
I used the best model to predict the labels of unknown Bitcoin accounts and this is what I obtained:
From the visualization above, we can see that the majority of the unknown transactions actually turned out to be licit. For the unknown labels, the model predicted 95% of the accounts to be licit and 5% to be illicit. This is more than the threshold of our assumption (1%), but still, it is significantly less than the licit transactions — which is good!
With data science, we have the possibility to make the decentralized web a bit safer by tracking down fraudulent transactions. Other areas where machine learning is applied include crypto trading bots based on reinforcement learning or time-series forecasting, optimizing the mining process, and the development of several use cases involving safe trading of sensitive information.
There exists a wide range of possibilities that actively involve data science in building safe and secure Web3.