This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs — but really, like the plants it is named after, it'll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, so they appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, Markov-babble is added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
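The determinism comes from seeding the random generator with the requested path, so the same URL always yields the same page. A minimal sketch of the idea, not Nepenthes-Py's actual implementation (all names here are illustrative):

```python
import hashlib
import random

def links_for_page(path: str, count: int = 20) -> list[str]:
    # Seed a private RNG from a hash of the URL path, so the same
    # path always produces the same "random" set of outgoing links.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    words = ["fern", "pitcher", "nectar", "trap", "leaf", "vine"]
    return [
        f"{path.rstrip('/')}/{rng.choice(words)}-{rng.randrange(10**6)}"
        for _ in range(count)
    ]
```

Because the RNG is keyed only by the path, every page looks like a stable flat file to a crawler even though nothing is ever stored.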
You can see what this looks like in the interactive demo. (Note: it is simulated in-browser!)
WARNING
THIS IS DELIBERATELY MALICIOUS SOFTWARE INTENDED TO CAUSE HARMFUL ACTIVITY. DO NOT DEPLOY IF YOU AREN'T FULLY COMFORTABLE WITH WHAT YOU ARE DOING.
ANOTHER WARNING
LLM scrapers are relentless and brutal. You may be able to keep them at bay with this software; but it works by providing them with a neverending stream of exactly what they are looking for. YOU ARE LIKELY TO EXPERIENCE SIGNIFICANT CONTINUOUS CPU LOAD.
YET ANOTHER WARNING
There is not currently a way to differentiate between web crawlers that are indexing sites for search purposes, vs crawlers that are training AI models. ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS.
So why should I run this, then?
So that, as I said to Ars Technica, we can fight back. Make your website indigestible to scrapers and grow some spikes.
Instead of rolling over and letting these assholes do what they want, make them have to work for it instead.
Further questions? I made a FAQ page.
Latest Version
Installation
You can use Docker, or install manually. Python 3.11+ is required.
> Docker (recommended)
```shell
docker compose up -d
```

> Manual Installation

```shell
pip install -r requirements.txt
python -m nepenthes config.yml
```

The tarpit starts on port 8893 by default. Sending SIGTERM or SIGINT will shut the process down.
Webserver Configuration
Expected usage is to hide the tarpit behind nginx or Apache; exposing it directly to the internet is ill-advised. We want it to look as innocent and normal as possible.
```nginx
location /maze/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}
```

The X-Forwarded-For header is technically optional, but not setting it will make your statistics significantly less useful.
The proxy_buffering directive is important. LLM crawlers typically disconnect if not given a response within a few seconds; Nepenthes counters this by drip-feeding a few bytes at a time. Buffering breaks this workaround.
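The drip-feed can be pictured as a generator that yields the payload a few bytes at a time with a pause between chunks. A hypothetical sketch, not the actual server loop:

```python
import time
from typing import Iterator

def drip(payload: bytes, chunk_size: int = 8, pause: float = 0.5) -> Iterator[bytes]:
    # Yield tiny chunks with a delay between them, so the crawler
    # keeps the connection open while the response trickles out.
    for i in range(0, len(payload), chunk_size):
        yield payload[i:i + chunk_size]
        time.sleep(pause)
```

If nginx buffered the response, these tiny chunks would be held back until the whole body was ready, defeating the point; hence `proxy_buffering off`.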
Nepenthes Configuration
A simple configuration that matches the above nginx block:
```yaml
---
http_host: '::'
http_port: 8893
templates:
  - 'templates'
seed_file: 'seed.txt'
min_wait: 10
max_wait: 65
silos:
  - name: default
    wordlist: 'corpus/words.txt'
    corpus: 'corpus/sample.txt'
    prefixes:
      - /maze
```

> Environment Variable Overrides
In addition to YAML, these env vars can be used:
- `NEPENTHES_HOST` → `http_host`
- `NEPENTHES_PORT` → `http_port`
- `NEPENTHES_MIN_WAIT` → `min_wait`
- `NEPENTHES_MAX_WAIT` → `max_wait`
- `NEPENTHES_LOG_LEVEL` → `log_level`
- `NEPENTHES_SEED_FILE` → `seed_file`
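One plausible way such overrides are merged, sketched here for illustration only (the real loader may differ):

```python
import os

# Mapping of environment variable names to config keys.
ENV_MAP = {
    "NEPENTHES_HOST": "http_host",
    "NEPENTHES_PORT": "http_port",
    "NEPENTHES_MIN_WAIT": "min_wait",
    "NEPENTHES_MAX_WAIT": "max_wait",
    "NEPENTHES_LOG_LEVEL": "log_level",
    "NEPENTHES_SEED_FILE": "seed_file",
}

def apply_env_overrides(config: dict, environ=os.environ) -> dict:
    # Environment variables win over values loaded from YAML.
    merged = dict(config)
    for env_name, key in ENV_MAP.items():
        if env_name in environ:
            merged[key] = environ[env_name]
    return merged
```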
Markov
Nepenthes-Py keeps the corpus entirely in memory. The Markov chain is trained at startup from the configured corpus file. For reasonable corpus sizes (~60,000 lines), training takes several seconds on modern hardware.
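A word-level Markov chain of this kind can be built in a few lines. A simplified sketch (the real trainer likely differs in detail):

```python
import random
from collections import defaultdict

def train(corpus: str) -> dict:
    # Map each word to the list of words observed immediately after it.
    chain = defaultdict(list)
    words = corpus.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain: dict, start: str, length: int, seed: int = 0) -> str:
    # Walk the chain with a seeded RNG so output is reproducible.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)
```

Training is a single pass over the corpus, which is why startup cost scales roughly linearly with corpus size.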
Templates
Template files consist of two parts: a YAML front-matter prefix and a Jinja2 template body.
- markov — Fills a variable with Markov babble.
- markov_array (name, min, max) — Creates a random number of paragraphs of babble.
- link — Creates a single named link.
- link_array — Creates a variable-sized array of links.
- booleans — Creates probabilistic flags (0-100).
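The exact front-matter schema is not documented here, so the keys and variable names below are purely illustrative; a hypothetical template might look like:

```jinja
---
markov:
  - intro
markov_array:
  - name: paragraphs
    min: 3
    max: 8
link_array:
  - name: links
    min: 10
    max: 40
---
<h1>{{ intro }}</h1>
{% for p in paragraphs %}<p>{{ p }}</p>{% endfor %}
{% for l in links %}<a href="{{ l }}">{{ l }}</a>{% endfor %}
```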
Silos
Silos work similarly to virtual hosts on a web server. Each silo can have its own configuration: Markov corpus, wordlist, delay times, statistics, templates, etc.
```yaml
silos:
  - name: fast
    corpus: corpus/fast.txt
    wordlist: corpus/words.txt
    default: true
    min_wait: 5
    max_wait: 15
    prefixes:
      - /maze
  - name: slow
    corpus: corpus/slow.txt
    wordlist: corpus/words.txt
    min_wait: 60
    max_wait: 300
    prefixes:
      - /deep
```

Statistics
```shell
curl http://localhost:8893/stats | jq
```

```json
{
  "hits": 10015,
  "addresses": 1850,
  "agents": 145,
  "bytes_sent": 14733541,
  "delay": 56020.624,
  "active": 25,
  "bogons": 4,
  "redirects": 57
}
```

Additional endpoints:
- `/stats/agents` — User-agent strings
- `/stats/addresses` — Client IPs
- `/stats/buffer` — Raw request buffer
- `/stats/silo/{name}` — Per-silo stats
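The stats endpoint returns plain JSON, so it is easy to poll from a script. A small sketch (the `mean_delay` helper is ours, not part of Nepenthes-Py):

```python
import json
from urllib.request import urlopen

def fetch_stats(base: str = "http://localhost:8893") -> dict:
    # Pull the rolling-window counters from the /stats endpoint.
    with urlopen(f"{base}/stats") as resp:
        return json.load(resp)

def mean_delay(stats: dict) -> float:
    # Average seconds each trapped request spent waiting.
    return stats["delay"] / stats["hits"] if stats["hits"] else 0.0
```

Against the sample output above, this works out to roughly 5.6 seconds of wasted crawler time per hit.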
Configuration Reference
> Global
- `http_host` — Listen host (default: `localhost`)
- `http_port` — Listen port (default: `8893`)
- `unix_socket` — Unix socket path
- `templates` — Template directories
- `seed_file` — Persistent instance ID
- `min_wait` / `max_wait` — Default delay range
- `log_level` — Logging level
- `stats_remember_time` — Rolling window (default: `3600` s)
> Per-Silo
- `name` — Silo identifier (required)
- `corpus` — Markov corpus file (required)
- `wordlist` — Dictionary file (required)
- `template` — Template name
- `zero_delay` — Disable all delays
- `redirect_rate` — % of requests answered with a 302 redirect (0-100)
- `bogon_filter` — Validate URLs (default: `true`)
License
Nepenthes-Py is distributed under the terms of the MIT License.