This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs — but really, like the plants it is named after, it'll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, so they appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, Markov-babble is added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
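The determinism comes from seeding the random generator with the requested path, so the same URL always yields the same page. A minimal sketch of the idea, not Nepenthes-Py's actual implementation (all names here are illustrative):

```python
import hashlib
import random

def links_for_page(path: str, count: int = 20) -> list[str]:
    # Seed a private RNG from a hash of the URL path, so the same
    # path always produces the same "random" set of outgoing links.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    words = ["fern", "pitcher", "nectar", "trap", "leaf", "vine"]
    return [
        f"{path.rstrip('/')}/{rng.choice(words)}-{rng.randrange(10**6)}"
        for _ in range(count)
    ]
```

Because the RNG is keyed only by the path, every page looks like a stable flat file to a crawler even though nothing is ever stored.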
You can see what this looks like in the interactive demo. (Note: it is simulated in-browser!)
WARNING
THIS IS DELIBERATELY MALICIOUS SOFTWARE INTENDED TO CAUSE HARMFUL ACTIVITY. DO NOT DEPLOY IF YOU AREN'T FULLY COMFORTABLE WITH WHAT YOU ARE DOING.
ANOTHER WARNING
LLM scrapers are relentless and brutal. You may be able to keep them at bay with this software; but it works by providing them with a neverending stream of exactly what they are looking for. YOU ARE LIKELY TO EXPERIENCE SIGNIFICANT CONTINUOUS CPU LOAD.
YET ANOTHER WARNING
There is not currently a way to differentiate between web crawlers that are indexing sites for search purposes, vs crawlers that are training AI models. ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS.
So why should I run this, then?
So that, as I said to Ars Technica, we can fight back. Make your website indigestible to scrapers and grow some spikes.
Instead of rolling over and letting these assholes do what they want, make them have to work for it instead.
Further questions? I made a FAQ page.
Latest Version
Installation
You can use Docker, or install manually. Python 3.11+ is required.
> Docker (recommended)
```shell
docker compose up -d
```

> Manual Installation

```shell
pip install -r requirements.txt
python -m nepenthes config.yml
```

The tarpit starts on port 8893 by default. Sending SIGTERM or SIGINT will shut the process down.
Webserver Configuration
Expected usage is to hide the tarpit behind nginx or Apache; exposing it directly to the internet is ill-advised. We want it to look as innocent and normal as possible.
```nginx
location /maze/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}
```

The X-Forwarded-For header is technically optional, but not setting it will make your statistics significantly less useful.
The proxy_buffering directive is important. LLM crawlers typically disconnect if not given a response within a few seconds; Nepenthes counters this by drip-feeding a few bytes at a time. Buffering breaks this workaround.
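The drip-feed can be pictured as a generator that yields the payload a few bytes at a time with a pause between chunks. A hypothetical sketch, not the actual server loop:

```python
import time
from typing import Iterator

def drip(payload: bytes, chunk_size: int = 8, pause: float = 0.5) -> Iterator[bytes]:
    # Yield tiny chunks with a delay between them, so the crawler
    # keeps the connection open while the response trickles out.
    for i in range(0, len(payload), chunk_size):
        yield payload[i:i + chunk_size]
        time.sleep(pause)
```

If nginx buffered the response, these tiny chunks would be held back until the whole body was ready, defeating the point; hence `proxy_buffering off`.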
Nepenthes Configuration
A simple configuration that matches the above nginx block:
```yaml
---
http_host: '::'
http_port: 8893
templates:
  - 'templates'
seed_file: 'seed.txt'
min_wait: 10
max_wait: 65
silos:
  - name: default
    wordlist: 'corpus/words.txt'
    corpus: 'corpus/sample.txt'
    prefixes:
      - /maze
```

> Environment Variable Overrides
In addition to YAML, these env vars can be used:
- `NEPENTHES_HOST` → `http_host`
- `NEPENTHES_PORT` → `http_port`
- `NEPENTHES_MIN_WAIT` → `min_wait`
- `NEPENTHES_MAX_WAIT` → `max_wait`
- `NEPENTHES_LOG_LEVEL` → `log_level`
- `NEPENTHES_SEED_FILE` → `seed_file`
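One plausible way such overrides are merged, sketched here for illustration only (the real loader may differ):

```python
import os

# Mapping of environment variable names to config keys.
ENV_MAP = {
    "NEPENTHES_HOST": "http_host",
    "NEPENTHES_PORT": "http_port",
    "NEPENTHES_MIN_WAIT": "min_wait",
    "NEPENTHES_MAX_WAIT": "max_wait",
    "NEPENTHES_LOG_LEVEL": "log_level",
    "NEPENTHES_SEED_FILE": "seed_file",
}

def apply_env_overrides(config: dict, environ=os.environ) -> dict:
    # Environment variables win over values loaded from YAML.
    merged = dict(config)
    for env_name, key in ENV_MAP.items():
        if env_name in environ:
            merged[key] = environ[env_name]
    return merged
```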
Markov
Nepenthes-Py keeps the corpus entirely in memory. The Markov chain is trained at startup from the configured corpus file. For reasonable corpus sizes (~60,000 lines), training takes several seconds on modern hardware.
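A word-level Markov chain of this kind can be built in a few lines. A simplified sketch (the real trainer likely differs in detail):

```python
import random
from collections import defaultdict

def train(corpus: str) -> dict:
    # Map each word to the list of words observed immediately after it.
    chain = defaultdict(list)
    words = corpus.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain: dict, start: str, length: int, seed: int = 0) -> str:
    # Walk the chain with a seeded RNG so output is reproducible.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)
```

Training is a single pass over the corpus, which is why startup cost scales roughly linearly with corpus size.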
Templates
Template files consist of two parts: a YAML front-matter prefix and a Jinja2 template body.
- markov — Fills a variable with Markov babble.
- markov_array (name, min, max) — Creates a random number of paragraphs of babble.
- link — Creates a single named link.
- link_array — Creates a variable-sized array of links.
- booleans — Creates probabilistic flags (0-100).
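The exact front-matter schema is not documented here, so the keys and variable names below are purely illustrative; a hypothetical template might look like:

```jinja
---
markov:
  - intro
markov_array:
  - name: paragraphs
    min: 3
    max: 8
link_array:
  - name: links
    min: 10
    max: 40
---
<h1>{{ intro }}</h1>
{% for p in paragraphs %}<p>{{ p }}</p>{% endfor %}
{% for l in links %}<a href="{{ l }}">{{ l }}</a>{% endfor %}
```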
Silos
Silos work similarly to virtual hosts on a web server. Each silo can have its own configuration: Markov corpus, wordlist, delay times, statistics, templates, etc.
```yaml
silos:
  - name: fast
    corpus: corpus/fast.txt
    wordlist: corpus/words.txt
    default: true
    min_wait: 5
    max_wait: 15
    prefixes:
      - /maze
  - name: slow
    corpus: corpus/slow.txt
    wordlist: corpus/words.txt
    min_wait: 60
    max_wait: 300
    prefixes:
      - /deep
```

Statistics
```shell
curl http://localhost:8893/stats | jq
```

```json
{
  "hits": 10015,
  "addresses": 1850,
  "agents": 145,
  "bytes_sent": 14733541,
  "delay": 56020.624,
  "active": 25,
  "bogons": 4,
  "redirects": 57
}
```

Additional endpoints:
- `/stats/agents` — User-agent strings
- `/stats/addresses` — Client IPs
- `/stats/buffer` — Raw request buffer
- `/stats/silo/{name}` — Per-silo stats
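The stats endpoint returns plain JSON, so it is easy to poll from a script. A small sketch (the `mean_delay` helper is ours, not part of Nepenthes-Py):

```python
import json
from urllib.request import urlopen

def fetch_stats(base: str = "http://localhost:8893") -> dict:
    # Pull the rolling-window counters from the /stats endpoint.
    with urlopen(f"{base}/stats") as resp:
        return json.load(resp)

def mean_delay(stats: dict) -> float:
    # Average seconds each trapped request spent waiting.
    return stats["delay"] / stats["hits"] if stats["hits"] else 0.0
```

Against the sample output above, this works out to roughly 5.6 seconds of wasted crawler time per hit.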
Configuration Reference
> Global
- `http_host` — Listen host (default: `localhost`)
- `http_port` — Listen port (default: `8893`)
- `unix_socket` — Unix socket path
- `templates` — Template directories
- `seed_file` — Persistent instance ID
- `min_wait` / `max_wait` — Default delay range
- `log_level` — Logging level
- `stats_remember_time` — Rolling window (default: `3600` s)
> Per-Silo
- `name` — Silo identifier (required)
- `corpus` — Markov corpus file (required)
- `wordlist` — Dictionary file (required)
- `template` — Template name
- `zero_delay` — Disable all delays
- `redirect_rate` — % of requests answered with a 302 redirect (0-100)
- `bogon_filter` — Validate URLs (default: `true`)
License
Nepenthes-Py is distributed under the terms of the MIT License.