Building Resilience to Reverse Engineering: A Study in CAPTCHA
Cybercriminals have found several different workarounds for CAPTCHA, and while recent updates promise to revitalize it, the technology’s focus on “bot-like” behaviors leaves it unprepared for modern and sophisticated threats.
If you’ve ever signed up for a service or account on the internet, you’ve probably experienced a CAPTCHA. The bot detection system (which stands for the mouthful “Completely Automated Public Turing test to tell Computers and Humans Apart”) prompts users to enter codes, choose the right images in a sequence, or solve a math problem in order to prove that they’re human.
CAPTCHA was an important initial step in distinguishing human and bot activity on the web, but for several reasons, it’s had trouble keeping up with determined attackers. Though companies are revamping the technology for a new era, CAPTCHA’s 17-year history is a good case study in how a profit incentive, combined with observable behaviors, allows cybercriminals to overcome most technological barriers.
In this blog post, we explain why CAPTCHA can be reverse engineered and describe alternative methodologies, like the ones we use at White Ops, that keep cybercriminals on their toes.
A Brief History of CAPTCHA Reverse Engineering
The first incarnation of CAPTCHA was born in 2000, when engineers at Carnegie Mellon came up with a simple way for internet users to prove they were human: look at an image featuring distorted text a computer couldn’t read, type it into the box, and pass through. Bots at the time were rudimentary and consistently failed these tests. Other than CAPTCHA farms that paid human workers to pass the tests, no workaround for the system existed for more than a decade.
In 2013, researchers at Vicarious, a California-based AI firm with funding from both Amazon and Facebook, proved they could successfully train computers to crack CAPTCHA codes using neural networks. The networks were able to analyze the shapes in these warped images and correctly identify them as letters and numbers, enabling Vicarious’ bots to solve 90% of Google, Yahoo!, PayPal, and Captcha.com codes.
In response to the Vicarious research, many companies phased out these tests in favor of more advanced ones like reCAPTCHA. These new versions relied on deeper passive analysis of a user’s behavior and browser information, monitoring for hallmarks of user activity like solving, navigation, and submission time to create a portrait of that user’s behaviors. Clicks that happened too fast, too slowly, or in a pattern unusual for that user were seen as possible indications of bot activity. Once flagged, the suspicious user was asked to complete an image-based test that would be harder for an AI bot to solve.
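As a rough illustration, the kind of timing check described above might look like the sketch below. The field names and thresholds are hypothetical, chosen only for this example; they are not reCAPTCHA’s actual signals or values.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One observed challenge attempt (hypothetical fields for illustration)."""
    solve_seconds: float   # time from challenge display to answer
    submit_seconds: float  # time from answer to form submission


# Hypothetical thresholds; a real system would tune these per user and context.
MIN_SOLVE_SECONDS = 1.5
MAX_SOLVE_SECONDS = 120.0
MIN_SUBMIT_SECONDS = 0.2


def looks_suspicious(event: Interaction) -> bool:
    """Flag attempts that are implausibly fast or slow for a human."""
    too_fast = (
        event.solve_seconds < MIN_SOLVE_SECONDS
        or event.submit_seconds < MIN_SUBMIT_SECONDS
    )
    too_slow = event.solve_seconds > MAX_SOLVE_SECONDS
    return too_fast or too_slow
```

A flagged attempt would then be routed to the harder image-based test rather than blocked outright.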
But it was only a matter of time before cybercriminals figured out how to reverse engineer this solution, too. Some graduate students at Columbia University broke through reCAPTCHA in 2016 with success rates between 70.8% and 83.5%. These white hat hackers modified user-agent reputation, used reverse image searches to find keywords associated with the photos used in bot tests, then cross-referenced those keywords with the test’s prompt to select the appropriate images. For example, if a bot were asked to choose images with birds in them from the usual set of 9 photos, it would reverse-image search each photo, pick out the ones associated with the word “bird,” and select those images to pass the test.
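The cross-referencing step is simple enough to sketch in a few lines. The example below is our own toy illustration, not the researchers’ code: the function name is hypothetical, and the keyword sets are hand-written stand-ins for whatever a reverse image search would return.

```python
def select_tiles(prompt_keyword, tile_keywords):
    """Return the indices of tiles whose keywords include the prompt keyword."""
    target = prompt_keyword.lower()
    return [
        index
        for index, keywords in enumerate(tile_keywords)
        if target in {keyword.lower() for keyword in keywords}
    ]


# Hypothetical keyword sets standing in for reverse-image-search results.
tiles = [
    {"bird", "sparrow"},
    {"car", "street"},
    {"Bird", "sky"},
    {"tree"},
]

matches = select_tiles("bird", tiles)  # indices of the tiles tagged "bird"
```

Nothing here requires the bot to understand the images; it only has to match words, which is why the approach worked so well.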
This ongoing arms race between CAPTCHA and bot developers demonstrates a simple principle of cybersecurity: as long as people have enough time on their hands, there’s no single test that criminals won’t eventually figure out how to pass.
Why CAPTCHA Can Be Reverse Engineered
Two aspects of CAPTCHA make the task of bypassing it pretty easy on hackers. First, there’s the tool’s relatively straightforward set of observable inputs; second, the adversary gets real-time feedback about the success of their efforts as soon as they finish. CAPTCHA presents hackers with a fairly clear objective — enter these characters, display these characteristics, or select these photos — and immediately lets them know whether their attempts to achieve that objective were successful.
These two characteristics allow hackers to enter into a rapid “iteration loop,” continually fine-tuning their inputs until they get the result they want. That rapid iteration loop is especially easy to exploit for hackers with access to machine learning technology.
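To see why immediate feedback matters, consider a simulated detector that passes any attempt slower than a hidden delay threshold. The sketch below is a toy model under our own assumptions (the detector, the threshold, and the function names are all hypothetical), but it shows how a pass/fail signal alone lets an adversary recover a hidden parameter in a handful of tries.

```python
def make_oracle(hidden_min_delay):
    """Simulated detector: passes any attempt at least as slow as a hidden threshold."""
    calls = {"count": 0}

    def oracle(delay):
        calls["count"] += 1
        return delay >= hidden_min_delay

    return oracle, calls


def probe_threshold(oracle, low=0.0, high=10.0, tolerance=0.01):
    """Binary-search the smallest passing delay using only pass/fail feedback.

    Assumes a delay of `low` fails and a delay of `high` passes.
    """
    while high - low > tolerance:
        mid = (low + high) / 2
        if oracle(mid):
            high = mid  # passed: the threshold is at or below mid
        else:
            low = mid   # failed: the threshold is above mid
    return high


oracle, calls = make_oracle(hidden_min_delay=1.5)
estimate = probe_threshold(oracle)
```

In this toy setup, the estimate lands within the tolerance of the hidden threshold after roughly ten probes. Withholding or delaying the pass/fail signal is what breaks this loop.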
CAPTCHA also suffers from attempts to balance human accessibility with security. If a cybercriminal doesn’t want to task a botnet to solve CAPTCHAs, all he or she has to do is find a third-party accessibility plugin like CAPTCHA Be Gone, which solves CAPTCHA puzzles for blind or visually impaired people. So long as bot detection methods are visible to the user, bots will find a way to get around them.
Building Resilience to Reverse Engineering
While detection mechanisms such as CAPTCHA can be an effective deterrent for less sophisticated bots, they will never be a perfect defense against a motivated, highly advanced adversary.
So how can a bot defense system be designed to avoid such reverse engineering? At White Ops we think of this solution across three dimensions:
As security specialists, we shouldn’t be shocked when, after putting all our time and effort into developing one test, our adversaries pour their time and effort into successfully passing it. We need to expand our approach to cybersecurity and force cybercriminals to work harder, continuously developing and improving our tactics – even when they appear to be working.
CAPTCHA shows us that bots will eventually pass any tests we throw at them. What remains to be seen is just how many tests cybercriminals are willing to pass, and how many times they’re willing to pass them, in order to achieve a single objective.