In order to blend in with legitimate traffic, threat actors try to make the bots they deploy “unpredictable.” That is to say, they configure the bots to behave as much like humans as they can, rather than simply repeating the exact same sequence of actions over and over.
For example, these bots will still go to login pages, but they will attempt to make each request appear different and unique. They will cycle among different IP addresses, change their user-agents, and change the properties and headers of their requests. This is called “rotation” of identifiers. But rotation isn’t in and of itself unpredictable.
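Just to make “rotation” concrete, here is roughly what it looks like in code. The identifier pools below are made up purely for illustration, not taken from any real attack:

```python
import random

# Hypothetical identifier pools a bot operator might rotate through (purely illustrative).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXY_IPS = ["203.0.113.7", "198.51.100.42", "192.0.2.15"]  # reserved documentation addresses
LANGUAGES = ["en-US,en;q=0.9", "fr-FR,fr;q=0.8", "de-DE,de;q=0.7"]

def rotated_identity():
    """Every request gets a freshly shuffled 'identity' so no two look alike."""
    return {
        "ip": random.choice(PROXY_IPS),
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(LANGUAGES),
    }
```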
Humans are unpredictable, but there’s a paradox: they too follow specific patterns, just patterns they don’t know they’re following, patterns that only exist at a high, meta level. We can, as statistical nerds, predict those patterns given enough data.
Consider the bell curve. We can’t with any real accuracy predict the outcome of any individual event, but over time and with a large enough sample, we can reliably predict the range and general frequency of outcomes of random events:
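If you want to see this for yourself, here is a quick toy simulation (three dice, a million rolls, nothing to do with real traffic): any single roll is anyone’s guess, but the overall shape comes out the same every time.

```python
import random
from collections import Counter

# One outcome is anyone's guess...
print(sum(random.randint(1, 6) for _ in range(3)))  # a single roll of three dice: 3..18

# ...but a million outcomes settle into the same bell-shaped curve every time.
rolls = Counter(sum(random.randint(1, 6) for _ in range(3)) for _ in range(1_000_000))
for total in sorted(rolls):
    share = rolls[total] / 1_000_000
    print(f"{total:2d}: {share:6.2%} {'#' * round(share * 300)}")
```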
From an attacker’s perspective, this means that rotating identifiers is very hard to pull off believably. Even if they attempt to use each value in the same proportions as real humans do, they may end up with some very unlikely values appearing too often in the distribution… exposing them as threat actors!
Below, we can see how humans have a tendency to use certain web browsers over others. If a threat actor were to randomly rotate among the same number of browsers, they would overshoot on the rare browsers, making their behavior really obvious when viewed like this:
Threat actors get caught if they make their distribution too normal or if they make their distribution too abnormal. That’s the paradox.
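To make the browser example concrete, here is a toy comparison with invented market shares: humans pile onto a handful of popular browsers, while a bot rotating uniformly across the same list massively over-represents the rare ones.

```python
# Invented market shares, just to show the shape of the problem.
human_share = {
    "Chrome": 0.65, "Safari": 0.20, "Edge": 0.05, "Firefox": 0.04,
    "Opera": 0.02, "Samsung Internet": 0.02, "UC Browser": 0.01,
    "Vivaldi": 0.005, "Pale Moon": 0.003, "Lynx": 0.002,
}
uniform_bot_share = 1 / len(human_share)  # the bot picks each browser 10% of the time

for browser, share in human_share.items():
    ratio = uniform_bot_share / share
    print(f"{browser:<17} humans {share:6.1%}   rotating bot {uniform_bot_share:6.1%}   ({ratio:4.1f}x)")
```

The rarest browsers end up dozens of times over-represented in the bot’s traffic, which is exactly the kind of skew a distribution check picks up on.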
The Human Defense Platform observes more than 20 trillion interactions a week. With that information, researchers create complex statistical models that attempt to predict which values are expected behavior and flag anything that is suspicious. We call this mechanism by the very original name - “expected values.”
This method is upside-down from how “traditional” bot mitigation works: rather than attempting exclusively to detect bot behavior and block it, expected-values attempts to detect human behavior and block the rest.
Let’s look at a slightly more technical example: request headers.
When sending any request online, your browser also sends a list of all the languages it supports. The website uses this information to give you a response in the language that you prefer. This is called the header_accept_language, which we’ll call HAL going forward.
A browser supporting English isn’t unexpected. But the same browser also supporting French, Spanish, Persian, Japanese and Samoan… that feels statistically far less likely. Unexpected, if you will.
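For reference, a typical value for this header looks something like `Accept-Language: en-US,en;q=0.9`, while the polyglot browser described above would be sending something closer to `Accept-Language: en-US,en;q=0.9,fr;q=0.8,es;q=0.7,fa;q=0.6,ja;q=0.5,sm;q=0.4` (these exact values are made up for illustration).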
As researchers, we could go and map each value manually, create a list of what we think is expected, and block the rest. Or we could create a statistical model to automate the predictions for us.
Statistics nerds, head this way please!
The first thing we do is define two very important properties of each value: its probability and its consistency.
Let's start with probability. We just need to count how many times each value appeared and divide that by the total number of instances. Easy, right?
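In code, it really is that easy. A minimal sketch, using an invented sample of HAL values:

```python
from collections import Counter

# Invented sample of HAL values observed in traffic.
observations = ["en-US"] * 9_000 + ["fr-FR"] * 700 + ["x-piglatin"] * 250 + ["rw-RW"] * 50

counts = Counter(observations)
total = sum(counts.values())
probability = {value: count / total for value, count in counts.items()}

print(probability)  # {'en-US': 0.9, 'fr-FR': 0.07, 'x-piglatin': 0.025, 'rw-RW': 0.005}
```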
Now all we need to do is define a certain probability threshold (which we’ll call TH going forward) for expected behavior. Everything above TH is considered expected and everything below it is unexpected. But how do we set the value of TH? Let’s examine a couple of methods:
Method number one - cumulative probability. In this method, we will add up the probabilities of the most common values and stop after we reach 99%. Every value that didn’t make it into the list (those that are particularly uncommon and cumulatively represent less than 1% of the total population) gets the “unexpected” categorization.
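Here is a sketch of that cumulative approach, reusing the `probability` dictionary from the previous snippet (the 99% figure is the example from above; the function name is mine):

```python
def expected_by_cumulative_probability(probability, coverage=0.99):
    """Keep the most common values until they cover `coverage` of all traffic;
    whatever is left over is labeled unexpected."""
    expected, covered = set(), 0.0
    for value, p in sorted(probability.items(), key=lambda item: item[1], reverse=True):
        if covered >= coverage:
            break
        expected.add(value)
        covered += p
    return expected

print(expected_by_cumulative_probability(probability))
# {'en-US', 'fr-FR', 'x-piglatin'} -- the rare 'rw-RW' is labeled unexpected
```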
Now, how do we know that 1% is the right threshold? Maybe the outlying 5% of values are bad, or maybe only 0.00001% are. Maybe nothing is unexpected in this instance.
We need to find a measurement that represents how predictable our set of values is. If the data set is very predictable (for example, if it has a few overwhelmingly frequent values), we can set a lower TH, and if it's not - we’ll set a higher TH.
Luckily, math presents us with a solution: entropy!
Entropy is a scary word. If you have never heard it in your life, I envy you. Think of entropy as a function that takes a set of probabilities as an input and spits out a measurement between 0 and 1 reflecting how predictable the set is. 1 is very unpredictable, meaning the probabilities of each entry are similar to one another, so it is harder to predict what the next value in the set is likely to be. 0 is the opposite: it means that we can reliably predict the next value.
Now we only need to integrate the entropy of the set into the definition of the TH, and make it so that the more predictable our set is, the lower the TH is.
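One way to wire this together is to normalize Shannon entropy so it lands between 0 and 1, then scale TH by it. Consider this a sketch: the base threshold below is arbitrary, and the exact scaling HUMAN uses isn’t described here.

```python
import math

def normalized_entropy(probability):
    """Shannon entropy divided by its maximum, so 0 means perfectly
    predictable and 1 means every value is equally likely."""
    probs = [p for p in probability.values() if p > 0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

def adaptive_threshold(probability, base_th=0.05):
    """The more predictable the set (lower entropy), the lower the TH."""
    return base_th * normalized_entropy(probability)

print(adaptive_threshold(probability))  # a small TH for our very predictable sample
```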
Method number two - a fixed probability cutoff. Here, we will simply choose a certain probability above which all entries are defined as expected. For example, a value with a probability of more than 1% is considered expected.
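This one is even shorter (again reusing `probability` from before, with the 1% example above):

```python
def expected_by_fixed_cutoff(probability, th=0.01):
    """Any value whose own probability clears TH counts as expected."""
    return {value for value, p in probability.items() if p > th}

print(expected_by_fixed_cutoff(probability))
# {'en-US', 'fr-FR', 'x-piglatin'} -- 'rw-RW' (0.5%) falls below TH again
```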
Question: What is the problem with both of these methods?
Answer: If a value is rare, it’s unexpected, by definition. But that punishes users who consistently use niche products, and it repeatedly punishes the same population. That’s a big issue, even if it’s a small percentage of the users.
This is why we also need to look at consistency.
If my browser is set to work with Kinyarwanda (a language spoken in Rwanda), I am probably going to represent a very small probability when compared against all traffic. But if I’m regularly using my browser with Kinyarwanda, even my very small probability is predictable and expected. It's like that one regular customer who always orders the McFish at McDonald’s.
There are a couple ways to define whether a value is consistent, but let’s start with a simple count:
| HAL | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|---|
| English | 1,423,523 | 1,532,552 | 1,786,340 | 1,800,328 | 1,634,589 | 1,522,679 | 1,529,820 |
| Kinyarwanda | 123 | 100 | 133 | 150 | 98 | 201 | 127 |
| Maltese | 1,201 | 1,679 | 2,309 | 2,397 | 1,877 | 1,672 | 1,100 |
| Pig Latin | 0 | 0 | 0 | 12,423 | 34,303 | 0 | 8 |
| Spanish, Croatian, Hebrew, Arabic and Korean | 2 | 0 | 3 | 12 | 0 | 0 | 1 |
What can we infer from this data?
And how can we statistically get to the same conclusions?
Option number one - relative standard deviation. A term that keeps statistics students up at night. The relative standard deviation is simply the standard deviation divided by the mean. The closer the values of the set are to one another, the closer this number is to 0.
Option number two - counting like a kindergartener. Yes. For example, we can count the number of days with more than 0 appearances; it works as a great measurement of consistency. Similarly, summing up the total traffic throughout the week is also a great and simple measurement.
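Both options take only a few lines. Here is a sketch using the weekly counts from the table above (I’m assuming the sample standard deviation, which appears to match the results below; the variable names are mine):

```python
import statistics

weekly_counts = {
    "English": [1_423_523, 1_532_552, 1_786_340, 1_800_328, 1_634_589, 1_522_679, 1_529_820],
    "Kinyarwanda": [123, 100, 133, 150, 98, 201, 127],
    "Maltese": [1_201, 1_679, 2_309, 2_397, 1_877, 1_672, 1_100],
    "Pig Latin": [0, 0, 0, 12_423, 34_303, 0, 8],
    "Spanish, Croatian, Hebrew, Arabic and Korean": [2, 0, 3, 12, 0, 0, 1],
}

for hal, days in weekly_counts.items():
    mean = statistics.mean(days)
    stddev = statistics.stdev(days)              # sample standard deviation
    rel_stddev = stddev / mean                   # option one
    empty_days = sum(1 for d in days if d == 0)  # option two: kindergarten counting
    total = sum(days)
    print(f"{hal}: rel_stddev={rel_stddev:.2f}, sum={total:,}, empty_days={empty_days}")
```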
After implementing these methods, we get these results:
| HAL | stddev | mean | rel_stddev | sum | empty days |
|---|---|---|---|---|---|
| English | 142,898.7 | 1,604,261.6 | 0.09 | 11,229,831 | 0 |
| Kinyarwanda | 35.0 | 133.1 | 0.26 | 932 | 0 |
| Maltese | 497.0 | 1,747.9 | 0.28 | 12,235 | 0 |
| Pig Latin | 13,032.1 | 6,676.3 | 1.95 | 46,734 | 4 |
| Spanish, Croatian, Hebrew, Arabic and Korean | 4.3 | 2.6 | 1.68 | 18 | 3 |
Notice how the standard deviation of “Spanish, Croatian, Hebrew, Arabic and Korean” is very small (4.3, the smallest value in the set), yet the relative standard deviation is high (1.68)! This is the strength of this particular measurement.
We can aggregate all of the information here into a single super-secret formula, to get a consistency score for each value.
Together with the probability THs, we can create quite a complex machine to predict whether a value is expected or not!
You may be thinking - “but my customers are multilingual! They could be falsely labeled by this model.”
It’s true, this mechanism is prone to false positives by design. Humans are expectedly unexpected, and it is functionally impossible to reach an accuracy of 100% with any expected values model, especially when the decision is generated by a machine using nothing but statistics without common sense or context.
That’s why at HUMAN, we have multiple methods to alert us if a model runs the risk of flagging humans by mistake. We can take the results of this model, feed them into a different statistical model that assesses expected false positives, and see what comes out. If the number of false positives is too high, we tweak the different thresholds identified above and run it again until we reach low levels of false positives. These thresholds are updated constantly, even as the expected-values detections are running in production, and fitted to the unique traffic of each customer.
HUMAN’s standard for detection is an accuracy of 99.99%. In other words, as long as no more than 1 in 10,000 decisions is wrong, a model is considered effective.
I like this mechanism not only because I worked on implementing it, but for three main reasons: