This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs — but really, like the plants it is named after, it'll eat just about anything that finds its way inside.

It works by generating an endless sequence of pages, each with dozens of links that lead right back into the tarpit. Pages are generated randomly but deterministically, so they appear to be flat files that never change. Intentional delay is added to waste crawlers' time, in addition to keeping them from bogging down your server. Lastly, Markov-babble is added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
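
The "random but deterministic" page generation can be sketched in a few lines of Python. This is an illustration only, not the project's actual code: the helper name, link shapes, and stand-in wordlist are all hypothetical. The key idea is seeding a PRNG with a hash of the request path, so the same URL always yields the same page.

```python
import hashlib
import random

def page_links(path: str, count: int = 20) -> list[str]:
    """Deterministically derive child links from the request path.

    Seeding the PRNG with a hash of the path means a given URL always
    renders the same page, so the maze looks like static flat files.
    """
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    words = ["nectar", "pitcher", "tendril", "lid", "trap"]  # stand-in wordlist
    return [f"{path.rstrip('/')}/{rng.choice(words)}-{rng.randrange(10**6)}"
            for _ in range(count)]

# Same path, same links -- different paths diverge:
assert page_links("/maze/a") == page_links("/maze/a")
assert page_links("/maze/a") != page_links("/maze/b")
```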

You can see what this looks like in the interactive demo. (Note: simulated in-browser!)

WARNING

THIS IS DELIBERATELY MALICIOUS SOFTWARE INTENDED TO CAUSE HARMFUL ACTIVITY. DO NOT DEPLOY IF YOU AREN'T FULLY COMFORTABLE WITH WHAT YOU ARE DOING.

ANOTHER WARNING

LLM scrapers are relentless and brutal. You may be able to keep them at bay with this software; but it works by providing them with a neverending stream of exactly what they are looking for. YOU ARE LIKELY TO EXPERIENCE SIGNIFICANT CONTINUOUS CPU LOAD.

YET ANOTHER WARNING

There is not currently a way to differentiate between web crawlers that are indexing sites for search purposes, vs crawlers that are training AI models. ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS.

So why should I run this, then?

So that, as I said to Ars Technica, we can fight back. Make your website indigestible to scrapers and grow some spikes.

Instead of rolling over and letting these assholes do what they want, make them have to work for it instead.

Further questions? I made a FAQ page.

Latest Version

Nepenthes-Py 1.0

Docker Image
All downloads


Installation

You can use Docker, or install manually. Python 3.11+ is required.

> Docker (recommended)

docker compose up -d

> Manual Installation

pip install -r requirements.txt
python -m nepenthes config.yml

The tarpit starts on port 8893 by default. Sending SIGTERM or SIGINT will shut the process down.


Webserver Configuration

Expected usage is to hide the tarpit behind nginx or Apache; exposing it directly to the internet is ill-advised. We want it to look as innocent and normal as possible.

location /maze/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

The X-Forwarded-For header is technically optional, but not setting it will make your statistics significantly less useful.

The proxy_buffering directive is important. LLM crawlers typically disconnect if not given a response within a few seconds; Nepenthes counters this by drip-feeding a few bytes at a time. Buffering breaks this workaround.
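
The drip-feed behavior can be sketched as an async generator. This is a simplified illustration of the technique, not Nepenthes's actual implementation; the function name and parameters are hypothetical. The body is yielded a few bytes at a time, spread across the configured delay, so the crawler always sees a live, trickling response rather than a timeout.

```python
import asyncio
from collections.abc import AsyncIterator

async def drip(body: bytes, total_delay: float = 30.0,
               chunk: int = 16) -> AsyncIterator[bytes]:
    """Yield `body` a few bytes at a time, spread over `total_delay`
    seconds. This is exactly the trickle that proxy_buffering would
    break by holding the response until it is complete."""
    pause = total_delay / max(1, len(body) // chunk)
    for i in range(0, len(body), chunk):
        yield body[i:i + chunk]
        await asyncio.sleep(pause)
```

A framework that supports streaming responses (aiohttp, Starlette, etc.) can feed each yielded chunk straight to the socket.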


Nepenthes Configuration

A simple configuration that matches the above nginx block:

---
http_host: '::'
http_port: 8893
templates:
  - 'templates'
seed_file: 'seed.txt'

min_wait: 10
max_wait: 65

silos:
  - name: default
    wordlist: 'corpus/words.txt'
    corpus: 'corpus/sample.txt'
    prefixes:
      - /maze

> Environment Variable Overrides

In addition to the YAML file, the following environment variables can be used as overrides:
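
A typical override mechanism looks like the sketch below. Note that the variable name NEPENTHES_HTTP_PORT is purely illustrative; the actual variable names are not reproduced here.

```python
import os

def env_override(config: dict, key: str, env_name: str, cast=str) -> None:
    """Let an environment variable, if set, take precedence over the
    YAML value. `env_name` (e.g. NEPENTHES_HTTP_PORT) is illustrative
    only -- it is not necessarily a real Nepenthes variable."""
    raw = os.environ.get(env_name)
    if raw is not None:
        config[key] = cast(raw)

config = {"http_port": 8893}
os.environ["NEPENTHES_HTTP_PORT"] = "9000"
env_override(config, "http_port", "NEPENTHES_HTTP_PORT", int)
assert config["http_port"] == 9000
```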


Markov

Nepenthes-Py keeps the corpus entirely in memory. The Markov chain is trained at startup from the configured corpus file. For reasonable corpus sizes (~60,000 lines), training takes several seconds on modern hardware.


Templates

Template files consist of two parts: a YAML front-matter prefix and a Jinja2 template body.
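
Splitting such a file can be sketched with stdlib string handling; the helper name and the exact delimiter handling are assumptions. The front-matter half would then go to a YAML parser and the body half to Jinja2.

```python
def split_front_matter(source: str) -> tuple[str, str]:
    """Split a template file into its YAML front-matter and its
    Jinja2 body. Front-matter sits between '---' delimiters at the
    top of the file; files without it are treated as all-body."""
    if source.startswith("---"):
        _, meta, body = source.split("---", 2)
        return meta.strip(), body.lstrip("\n")
    return "", source

meta, body = split_front_matter("---\ntitle: maze\n---\n<h1>{{ title }}</h1>\n")
assert meta == "title: maze"
assert body == "<h1>{{ title }}</h1>\n"
```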


Silos

Silos work similarly to virtual hosts on a web server. Each silo can have its own configuration: Markov corpus, wordlist, delay times, statistics, templates, etc.

silos:
  - name: fast
    corpus: corpus/fast.txt
    wordlist: corpus/words.txt
    default: true
    min_wait: 5
    max_wait: 15
    prefixes:
      - /maze

  - name: slow
    corpus: corpus/slow.txt
    wordlist: corpus/words.txt
    min_wait: 60
    max_wait: 300
    prefixes:
      - /deep

Statistics

curl http://localhost:8893/stats | jq

{
  "hits": 10015,
  "addresses": 1850,
  "agents": 145,
  "bytes_sent": 14733541,
  "delay": 56020.624,
  "active": 25,
  "bogons": 4,
  "redirects": 57
}

Additional endpoints:


Configuration Reference

> Global

> Per-Silo


License

Nepenthes-Py is distributed under the terms of the MIT License.