Provider rating

The core of DRPC’s load balancing algorithm is provider rating. This rating defines how well a given provider matches a given request.

Rating dimension

The first thing to talk about is rating dimensions. Each provider can host many different networks and perform differently on different methods, so we actually need many ratings per provider.

So our rating is multidimensional: each provider has many different ratings, and each request uses one of those “rating dimensions” for load balancing.

A rating dimension can be described as a Go structure.

type RatingDimension struct {
	Kind         DimensionKind
	Cluster      WorkloadCluster
	Chain        dshackle.ChainRef
	SourceRegion reducerstates.Region
}

Let’s describe what each field means.

  • First is Cluster. We could use the method name here instead, but there are a lot of methods, and many of them have very similar performance characteristics and behaviour, so that would be unreasonable. On the other hand, some methods, like eth_getLogs, depend heavily on their parameters, so we may want to split them into several clusters. Cluster lets us abstract away from specific methods and compare providers’ performance on approximately the same workload.
  • Chain is the network id. Obviously, we need separate ratings for different chains.
  • SourceRegion helps us compare providers’ performance for requests coming from the same region. For example, providers from Europe will almost always have a better rating for requests from Europe than providers from the US.
  • Kind is more of a way to filter some providers out of the rating table defined by the other parameters. We could, of course, filter providers before sampling, but maintaining a separate table for each kind is more efficient (a sketch of these kinds follows the list).
    1. Public — keeps only public providers that are free
    2. Best latency — keeps only top performers
    3. All — does not filter anyone out
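
To make this concrete, here is a minimal sketch of how the dimension kinds and the mapping from a request to a dimension might look in Go. The constant names, the Request type and the clusterFor helper are illustrative assumptions, not the actual DRPC code.

// Sketch only: these names are illustrative, not the actual DRPC implementation.
type DimensionKind int

const (
	// Public keeps only free public providers.
	DimensionKindPublic DimensionKind = iota
	// BestLatency keeps only top-performing providers.
	DimensionKindBestLatency
	// All applies no filtering.
	DimensionKindAll
)

// dimensionForRequest maps an incoming request onto the rating dimension
// used to balance it. Request and clusterFor are hypothetical.
func dimensionForRequest(req Request, kind DimensionKind) RatingDimension {
	return RatingDimension{
		Kind:         kind,
		Cluster:      clusterFor(req.Method, req.Params), // e.g. eth_getLogs may get its own cluster
		Chain:        req.Chain,
		SourceRegion: req.Region,
	}
}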

Rating calculation

Figure: DRPC rating generation algorithm

We update the rating every second on each dproxy independently, which means the rating can vary a bit between instances, but not significantly.

The picture above shows the basics of rating calculation. The rating consists of several components.

  1. Base rating — calculated from the data we observed for the provider in the last minute.
  2. Previous rating — the rating value from the previous calculation for this dimension and this provider.
  3. Rating modifiers — provider characteristics we do not want to include in the base rating.
  4. EMA — combines the previous rating and the current base rating so that the rating grows slowly when the provider’s base rating grows, but drops instantly when it drops.

The resulting final rating is put into the rating registry for further use.
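
The whole per-dimension update can be summarised in a short sketch. The registry, provider type and helper names below are assumptions, not the real implementation.

// Sketch of the per-dimension rating update, run roughly once per second
// on each dproxy. Names and signatures are illustrative assumptions.
func updateRating(dim RatingDimension, p Provider, reg *RatingRegistry) {
	base := baseRating(dim, p)                // from the last minute of observations
	prev := reg.Previous(dim, p)              // rating from the previous calculation
	smoothed := modifiedEMA(base, prev)       // grows slowly, drops instantly (see below)
	final := applyModifiers(dim, p, smoothed) // lagging, public, region mismatch, ...
	reg.Store(dim, p, final)
}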

Exponential moving average

EMA is a term from signal processing. It can be represented as follows

y_t = \alpha x_t + (1 - \alpha) y_{t-1}

EMA smooths a very noisy signal while retaining the overall trend. There are many patterns of bad provider behaviour: one of them is, of course, failing in 100% of cases, but even more often a badly behaving provider shows a “fence” pattern, with regular bursts of errors. EMA helps smooth the provider rating and keep it low while the provider has problems.

Figure: Common rating patterns before EMA

We actually use a slightly modified version of EMA

y_t = \begin{cases} 0.001\, x_t + 0.999\, y_{t-1}, & \text{if } x_t > y_{t-1} \\ x_t, & \text{otherwise} \end{cases}

So, when the provider’s rating grows, it grows very slowly (roughly 30 minutes to recover from the minimum to the maximum), because the previous rating heavily influences it. When it drops, we just take the new rating, so it drops instantly.
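
In code, the modified EMA is just a few lines; the coefficients come straight from the formula above, only the function name is ours.

// modifiedEMA smooths the rating: improvements are absorbed very slowly,
// while degradations take effect immediately.
func modifiedEMA(current, previous float64) float64 {
	if current > previous {
		return 0.001*current + 0.999*previous
	}
	return current
}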

Base rating

The base rating is calculated from the provider’s performance over the last 60 seconds. Remember that we operate within a specific rating dimension and use data only from that dimension (chain, region, etc.).

We won’t list the exact formulas here, as they can change often, but we will describe the key features. The following logic will also change over time, so we describe its state as of writing this doc.

The main idea is that there is a maximum rating of 100,000 and we apply penalties based on the following features.

Average latency

We use the average latency of fulfilling a request for each provider. However, as it may vary greatly between clusters, we need to determine what latency is “normal”. For this purpose we calculate the median of all providers’ average latencies for the current dimension and use it as the expected latency.

So, if a provider’s average latency is much higher than expected, its rating will be reduced greatly. If it is lower than expected, the penalty will be negligible. So there is actually not much benefit in being significantly faster than the expected latency, though it does give the provider a slight edge.
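
As a purely illustrative example (the real formula is not published and changes often), a penalty of the following shape has the properties described above: it is negligible when the provider is faster than expected and grows quickly once the provider is several times slower.

// latencyPenalty is an illustrative penalty shape, not the actual DRPC formula.
// expectedMs is the median of all providers' average latencies in this dimension.
func latencyPenalty(avgMs, expectedMs, maxRating float64) float64 {
	ratio := avgMs / expectedMs
	if ratio <= 1 {
		// Faster than expected: negligible penalty, so speed only gives a slight edge.
		return ratio * 0.01 * maxRating
	}
	// Slower than expected: the penalty grows quickly with the ratio
	// (the quadratic shape and coefficients are made up).
	penalty := (ratio - 1) * (ratio - 1) * 0.25 * maxRating
	if penalty > maxRating {
		penalty = maxRating
	}
	return penalty
}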

Errors

We have very little tolerance for errors, so if a provider returns 10-20 errors in the last minute, their rating will be zeroed. As we currently observe the system, errors are rare and this is a working strategy.

Served CU

Each provider has a maximum throughput, which is assigned manually by the DRPC team. However, we plan on open-sourcing an internal tool that will allow providers to measure their throughput themselves.

We currently measure throughput in CU/minute and apply a rating penalty as a provider approaches their throughput limit. As our providers have different setups and amounts of resources, we don’t want to overwhelm them with requests; instead, we want requests balanced uniformly across all well-behaving providers.
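
Putting the three signals together, an illustrative base-rating computation could look like the sketch below, shown here directly on the observed metrics. The error threshold and the penalty shapes are assumptions, not DRPC’s actual formulas.

const maxRating = 100_000

// baseRating is an illustrative combination of the penalties described above.
func baseRating(avgLatencyMs, expectedLatencyMs float64, errorsLastMinute int, servedCU, maxCUPerMinute float64) float64 {
	// Errors: very little tolerance, a burst of errors zeroes the rating.
	if errorsLastMinute >= 10 {
		return 0
	}

	rating := float64(maxRating)
	rating -= latencyPenalty(avgLatencyMs, expectedLatencyMs, maxRating)

	// Served CU: penalise a provider as it approaches its throughput limit,
	// so traffic spreads across providers instead of piling onto one.
	utilisation := servedCU / maxCUPerMinute
	rating -= utilisation * utilisation * 0.3 * maxRating // made-up shape

	if rating < 0 {
		rating = 0
	}
	return rating
}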

Rating modifiers

There are some provider parameters that we want to include in the final rating, but don’t want to affect the base rating or the EMA (see the sketch after this list).

  • We exclude public (free) providers from the best latency dimension, but we still want to downgrade them in the all dimension, so there is only a low probability of selecting them in case of fallback.
  • If a provider is lagging on the current chain (their chain head is significantly behind other providers), we deprioritise their rating, because it is a sign of problems. However, this does not affect the EMA, so when the provider stops lagging, their rating restores immediately.
  • We also exclude providers that are not in the same region as the request origin from best latency, and we deprioritise any provider whose region differs from the dimension’s SourceRegion.
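
A sketch of how these modifiers could be applied on top of the smoothed rating is shown below. Since they are applied after the EMA, their effect disappears as soon as the condition clears; the Provider fields and the coefficients are illustrative assumptions.

// applyModifiers adjusts the smoothed rating without feeding back into the
// EMA, so the effect vanishes as soon as the condition clears.
func applyModifiers(dim RatingDimension, p Provider, rating float64) float64 {
	if p.IsPublic {
		rating *= 0.1 // keep free providers only as a low-probability fallback
	}
	if p.IsLaggingOn(dim.Chain) {
		rating *= 0.2 // chain head far behind the others: a sign of problems
	}
	if p.Region != dim.SourceRegion {
		rating *= 0.5 // prefer providers close to the request origin
	}
	return rating
}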

Best latency filtering

We previously said that we filter badly behaving providers out of the best latency dimension. Let’s talk about how we do that. The goal is to keep only providers that we consider well-behaving.

There are several observations we have made so far; we use them as assumptions to build the filtering algorithm.

  • There are more well-behaving providers than badly behaving ones.
  • Sometimes all providers’ ratings dip because of some service-wide event.
  • When we can distinguish badly behaving providers, they are outliers.

Thus, we decided to use an outlier filtering algorithm called the “modified Z-score”. After recalculating ratings for a given dimension, we calculate the modified Z-score for each provider and filter out providers whose score falls below the 2.5 threshold.
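
A sketch of this filtering is shown below, assuming the standard modified Z-score formulation (0.6745 · (x − median) / MAD) and treating the documented 2.5 as a magnitude threshold on the low side; the exact convention DRPC uses may differ. ProviderID is hypothetical, and the snippet needs the math and sort packages.

// filterOutliers keeps only providers whose rating is not a low-side outlier
// in the given dimension, according to the modified Z-score.
func filterOutliers(ratings map[ProviderID]float64) map[ProviderID]float64 {
	if len(ratings) == 0 {
		return ratings
	}
	values := make([]float64, 0, len(ratings))
	for _, r := range ratings {
		values = append(values, r)
	}
	med := median(values)

	// Median absolute deviation.
	devs := make([]float64, len(values))
	for i, v := range values {
		devs[i] = math.Abs(v - med)
	}
	mad := median(devs)
	if mad == 0 {
		return ratings // ratings are (almost) identical: nothing to filter
	}

	kept := make(map[ProviderID]float64, len(ratings))
	for id, r := range ratings {
		z := 0.6745 * (r - med) / mad
		if z > -2.5 { // drop only providers far below the median
			kept[id] = r
		}
	}
	return kept
}

func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}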

Combined with the balancing rounds approach described here, this allows us to stop sending requests to the outliers completely.

One disadvantage of this approach is that it stops working when a significant share of providers (close to 50%) are badly behaving.