HUMAN BLOG

Controlling AI-driven content scraping with HUMAN

Read time: 6 minutes

Alexa Levine

November 22, 2024

AI, Automated Threats, Bot Mitigation, content scraping

In the process of creating and summarizing content, large language models (LLMs) and generative AI platforms rely on extensive automated scraping of sites across the web. For publishers and other content-driven digital platforms, scraping by AI agents is unwanted at best and copyright infringement at worst.

Understanding AI-driven scraping bots

Content-driven platforms face several types of unauthorized scraping by AI agents:

AI agents scraping content to summarize it in a response.

AI agents use bots to scrape web content and return a response to user queries. For example, a user could ask ChatGPT “What is the latest news in my area?” and it will perform a search, scrape the results, and return a summary, all without the user ever visiting the source application. Beyond potentially taking that information without permission, the summary is sometimes taken out of context or is factually incorrect, which compounds the problems associated with content scraping: misinformation is now added to the mix.

Scraping of content to train LLMs.

LLMs are essentially giant knowledge bases, and they can grow only by being fed more information. AI developers sometimes use scraping bots to capture information from websites to feed the beast. Responsible operators identify their crawlers, allowing organizations to decide whether to allow or block them. The challenge arises with AI agents that may not declare themselves or that actively try to obfuscate their intention to scrape information. Recent articles have shown that some new AI players are scraping even more aggressively while ignoring robots.txt (a file that tells bots which pages they can and cannot access).

Performance impact.

Large-scale scraping can also degrade web application and site performance. This can result in a poor experience for users, loss of revenue, or, in extreme cases, denial of service. Scraping bots can also inflate invalid traffic (IVT) rates by registering ad impressions as they crawl an application, devaluing your ad space.

The challenges of using robots.txt to manage AI-driven scrapers

The robots.txt file has long been a standard for managing web crawlers, but its limitations are increasingly evident with AI-driven scraping bots. As mentioned above, many LLMs and their scrapers do not identify themselves, making it hard for website operators to block them effectively. Others change their user agents frequently as they update their scrapers, which can break user-agent-based blocking rules even when the change is not meant to evade them.

The robots.txt file’s 500 KB size limit also poses challenges for organizations managing complex content ecosystems, especially as the number of AI scrapers grows. Manual updates to accommodate every bot quickly become untenable. 
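As a point of reference, opting a declared crawler out via robots.txt means adding a block keyed to the vendor’s published user-agent token. Below is a minimal sketch; GPTBot (OpenAI) and CCBot (Common Crawl) are two widely documented tokens, and the exact set of entries a given site needs depends on which vendors it wants to address.

# Each declared AI crawler gets its own block, keyed to the vendor's
# published user-agent token.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers keep normal access.
User-agent: *
Allow: /

Entries like these bind only crawlers that identify themselves and choose to honor the file, and every newly announced scraper means another block to add by hand.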

For large crawlers like Googlebot and Bingbot, there is currently no way to differentiate between data used for traditional search engine indexing—where publishers and search engines have an implicit “agreement” based on citations to the original source—and data used to train or power generative AI products. This lack of granularity forces publishers into a difficult position, where blocking Googlebot or Bingbot entirely to prevent data usage for generative AI also hurts search engine results. 
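To make the enforcement gap concrete, the minimal Python sketch below shows the robots.txt check a well-behaved crawler performs before fetching a page, using the standard library’s urllib.robotparser. The domain, path, and user-agent strings are illustrative placeholders; nothing obliges a scraper to run this check at all.

import urllib.robotparser

# Illustrative robots.txt for a hypothetical site.
ROBOTS_TXT = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that declares the published token is matched and refused.
print(parser.can_fetch("GPTBot", "https://example.com/articles/latest"))
# -> False

# A scraper that hides or rotates its user agent matches only the catch-all
# rule, so the very same file waves it through.
print(parser.can_fetch("StealthScraper/1.0", "https://example.com/articles/latest"))
# -> True

In other words, robots.txt expresses policy but does not enforce it, which is why it needs to be paired with detection and blocking on the server side.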

How HUMAN helps

HUMAN makes it easy to identify and manage AI-related traffic. Customers can choose from three primary response options:

Block all LLM bots by default

HUMAN blocks LLM bots by default to ensure that publishers are protected from unwanted scraping and theft of their proprietary content. HUMAN’s industry-leading decision engine uses advanced machine learning, behavioral analysis, and intelligent fingerprinting to block scraping bot traffic at the edge, often before a bot can access a single page. Bots can access your application only if you have specifically allowed them to do so.

Allow known AI bots and crawlers

Customers can choose to allow trusted bots to access their content unimpeded. With a simple On/Off toggle, users can quickly decide whether to allow or block AI traffic. This adds a layer of enforcement to your robots.txt file, which LLM scraping agents sometimes ignore. If you choose to allow the scrapers, you can also set up custom policies to suppress ads or show alternative content.

HUMAN allows custom policies and rule responses for individual bot and LLM scrapers. The screenshot shows a list of LLM bots alongside custom rules for how the HUMAN Cyberfraud Defense platform will respond to them.

Monetize LLMs on a per-use basis 

If an AI bot that is not on the customer’s allow list is detected, HUMAN offers the option to send that bot to TollBit’s scraping paywall. There, you can set and enforce payment policies that require bots to pay per scrape. This enables publishers to prevent unauthorized AI agents from scraping their proprietary content unless they provide fair compensation.

An example code snippet for setting up custom block messages with TollBit. The snippet reads: “You are not authorized to access this content without a valid TollBit Token. Please follow this URL to find out more: https://tollbit.com”

HUMAN provides visibility into AI bots and agents

Traffic from known bots, crawlers, and AI agents is automatically highlighted in HUMAN activity dashboards, and the system notifies you when new bots appear on your applications. This makes it easy for organizations to understand the volume of traffic hitting their applications and websites, along with a granular breakdown of which bots are contributing to that traffic, and to what degree.

HUMAN’s activity dashboards track overall bot traffic, overall legitimate requests, the percentages of bots blocked and allowed, CAPTCHAs served and solved, and traffic sources.

HUMAN also surfaces information about which bots and AI agents are accessing your content and which paths they are targeting. This allows publishers to monitor the impact and make informed decisions to protect their assets.

A dashboard displaying the daily requests from known legitimate bots, as well as the request rate as a percentage of all traffic.

Effective strategies to combat AI-powered scraping

Keeping an up-to-date robots.txt file is a strong first step to manage legitimate bots and crawlers—but robots.txt alone is no longer enough. Organizations need granular visibility into AI scraping activity and complete control to respond in a way that works for their unique business.

With HUMAN Cyberfraud Defense, you can see exactly how much AI scraping is hitting your site.

Request a free demo and see:

  • How to track the volume and paths of LLM bots on your domain
  • How to block, allow, or monetize them with a single toggle
  • Real-time dashboards built for publishers

Book your demo now and start protecting your content today.
