# Machine learning approach for Bitcoin network inference

## December 19, 2019

In an earlier post we looked at previous attempts of finding the Bitcoin P2P network topology. As a quick refresh: we have a system where computers talk directly to around 8-12 other computers each on a network about 10,000 computers big. Finding the topology means finding exactly which computers are connected to which others.

The central approach to this problem has been in measuring the transactions that are sent around the network. Transactions originate at one computer and then each tells all its peers until the transaction has covered the whole network to be added to ledgers. This is known as a “gossip protocol” because we can think of it like rumours spreading around a town. This mechanism should be viewed as a stochastic process, with the transaction passing along the edges of the network in a probabilistic manner. Randomness is included in a few places.

- The latency between two peers on the network;
- The processing time on a computer to receive and verify the transactions. This may take longer if there are other processes (or even other transactions) going on simultaneously;
- Random delay times in sending transactions onto peers that are purposefully built into the Bitcoin protocol. This is known as “trickling” and is discussed in this paper.

Some methods developed to find the topology have attempted to model all of these factors, which requires strong assumptions about the network. The problem we have here is that we have a structural problem where all inferences are highly correlated, and so any assumptions we introduce should be heavily scrutinised. What I found annoying during my research of this problem is that I hadn’t encountered any machine learning approaches. I would have thought that some ML based nonparametric method would be ideal here because it’s quite complex and hard to model. In fact, networks have been studied heavily in the ML literature, but usually in the context of predicting links where others already exist, rather than finding a whole network.

If we want to develop an approach, we need some notion of a testing set and validation set. Ideally we’d have a large set of different implementations of the Bitcoin network, but unfortunately we only have the one. So what we’re going to do is inject new nodes into the system, and try to learn some properties of the nodes that would lead to links being formed.

## Delay correlation

Here’s the idea. If two nodes are peers (there’s a link between them), then they should receive a transaction at a similar time. That’s because one will send it to the other, or they’ll receive it at similar times from other peers with not enough time to send it to each other. If two nodes aren’t peers, then they might for whatever reason still receive a transaction at similar times, but it’s quite unlikely that this will happen many times over multiple transactions. So the idea here is that we should formalise some notion of reception time, then for any two nodes, if those reception times are highly correlated then it’s likely that they’re peers.

So let’s imagine we have some transaction $k$ that’s spreading over the network, and we have a monitor node that we connect to every node on the network to receive the transaction. One node is going to be the first that will send us the transaction. We’ll label that time as $tr_{k0}$. Then for every node we label the time we receive the transaction from them, i.e. we receive the transaction from node $i$ at time $tr_{ki}$. Note that this isn’t the time that the node receives the transaction, but a fuzzy measurement of that because we have to wait a random time for the node to send it to the monitor. We then “normalise” these times by subtracting the first time we saw. For node $i$, our delay measurement is $t_{ki} = tr_{ki} - tr_{k0}$. So if we watch $n$ transactions, at a node $i$ we have the $m$ measurements $t_{i1},\dots,t_{im}$. We will treat these measurements as independent outcomes of a random variable $X_i$ that models these delays. This is a nonparametric method so we don’t put any assumptions on $X_i$.

Now all we have to do is measure the correlation of all these random variables. We can measure the empirical correlation $r_{ij}$ of $X_i$ and $X_j$ using our delay measurements for $i$ and $j$. This will give us the delay correlation of those two nodes, essentially a numerical value between -1 and 1 for the link between them. From there, we can classify those links as either existing or not existing. We then use accuracy metrics – specifically recall (true positives/actual positives), precision (true positives/predicted positives) and F1 score (harmonic mean of recall and precision) – to evaluate our methods.

We’ll fire up the Bitcoin simulator we developed to generate those delay measurements, then see if we can reconstruct the original network using some algorithms. Let’s consider 500 nodes and 10,000 transactions that we’re going to measure. Given about 3.5 new transactions per second on the real Bitcoin network, this roughly corresponds to 45 minutes of measurement.

## Edge recovery algorithms

Here’s the three algorithms we’re going to develop and try using. The code can be found here.

### MaxCorr

So this one is pretty simple. Let’s consider some node $i$. If we have $n$ nodes, then we’ll have $n-1$ possible edges with node $i$. Let’s take the highest value empirical correlation, say $r_{ij}$, and infer that $i$ and $j$ are peers. Repeat this for each node. This will produce at most $n$ edge predictions.

### ThreshCorr

If we have $n$ nodes then we have ${n\choose 2}$ possible edges to choose from. What we’re going to do here is use some learning. We set up 5 testing nodes and 5 validation nodes. We use F1 score as our metric that we want to optimise. What we want to find is the threshold for the minimum correlation that we will accept if an edge is going to be included in our model. We adjust this threshold to optimise for F1 score on the testing set, then we can evaluate this model as the F1 score on the validation set.

### EdgeCorr

Similarly to before, we are looking for an optimal threshold value on the training set. However this time we assign an individual threshold to each node for an edge to be included, then optimise all $n$ of those values. This will give an enormous search space as there are $n^n$ possible combination of values to consider. Optimisation heuristics will need to be used here; I went with simulated annealing.

## Results

In one trial, MaxCorr got recall of 11.9% with a precision of 100%. It doesn’t capture many edges but for the ones that it does predict, it gets them right.

ThreshCorr got recall of 23.8% with precision of 76.9%, corresponding to an F1 score of 0.36.

EdgeCorr had a recall of 54.8% with a precision of 85.7%, making an F1 score of 0.6, better than some of the results in the literature.

I am fairly happy with this first rudimentary attempt. Further work can be done on improving these algorithms and exploring metrics other than just correlation.