CONNECT://
Step-by-step guide to deploy Nepenthes-Py and connect it to your website. Follow each step in order. Your tarpit will be live in under 5 minutes.
01 // Prerequisites
> Option A: Docker (recommended)
All you need is Docker and Docker Compose installed:
# Check Docker is installed
docker --version
docker compose version

If Docker is installed, skip directly to the next step, 02 // Install.
> Option B: Manual
You will need:
- Python 3.11+ — check with python3 --version
- pip — Python package manager
- A web server — nginx, Apache, Caddy, or similar
- A corpus file — any text file for Markov training (books, articles, etc.)
- A wordlist — /usr/share/dict/words works on most Linux systems
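The manual prerequisites above can be confirmed in one pass. A minimal sketch — the `preflight` helper, its default paths, and the script name are illustrative, not part of Nepenthes:

```python
# preflight.py — hypothetical helper that checks the prerequisites above
import shutil
import sys
from pathlib import Path

def preflight(wordlist="/usr/share/dict/words", corpus="corpus/my_corpus.txt"):
    """Return a list of problems; an empty list means ready to install."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append(f"Python 3.11+ required, found {sys.version.split()[0]}")
    if shutil.which("pip") is None and shutil.which("pip3") is None:
        problems.append("pip not found on PATH")
    for label, path in (("wordlist", wordlist), ("corpus", corpus)):
        if not Path(path).is_file():
            problems.append(f"{label} missing: {path}")
    return problems

if __name__ == "__main__":
    issues = preflight()
    print("\n".join(issues) or "All prerequisites satisfied")
```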
IMPORTANT
Do NOT run Nepenthes as root. Create a dedicated user:
useradd -m nepenthes
su - nepenthes

02 // Install
> Clone the repository
cd /home/nepenthes
git clone https://github.com/YOUR_USERNAME/nepenthes-py.git
cd nepenthes-py

> Install dependencies
pip install -r requirements.txt

Dependencies installed:
- aiohttp — async HTTP server
- pyyaml — YAML config parser
- jinja2 — template engine
> Prepare your corpus
The corpus is the text that Nepenthes uses to generate Markov babble. The better the corpus, the more convincing the garbage output. Good sources:
- Project Gutenberg books (public domain)
- Wikipedia article dumps
- Any large collection of plain text
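Project Gutenberg files wrap the actual book in license boilerplate that would pollute the Markov model. A small sketch that keeps only the body — the `*** START OF` / `*** END OF` marker lines are the usual Gutenberg layout, so verify them against your download:

```python
def strip_gutenberg(text: str) -> str:
    """Keep only the text between the Gutenberg START and END markers."""
    lines = text.splitlines()
    # Fall back to the whole file if the markers are absent
    start = next((i for i, line in enumerate(lines)
                  if line.startswith("*** START OF")), -1)
    end = next((i for i, line in enumerate(lines)
                if line.startswith("*** END OF")), len(lines))
    return "\n".join(lines[start + 1:end]).strip()
```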
# Example: download a public domain book
wget https://www.gutenberg.org/files/1342/1342-0.txt -O corpus/my_corpus.txt

> Verify installation
python -m nepenthes --help
# or just test with sample corpus:
python -m nepenthes config.yml

03 // Configure
Edit config.yml to match your setup:
---
# Network
http_host: '::'            # Listen on all interfaces
http_port: 8893            # Internal port (not exposed directly)

# Templates & seed
templates:
  - 'templates'
seed_file: 'seed.txt'      # Persistent instance ID

# Default delay range (seconds)
min_wait: 10
max_wait: 65

# Logging
log_level: info

# Statistics rolling window
stats_remember_time: 3600  # 1 hour

# Silos
silos:
  - name: default
    default: true
    corpus: 'corpus/my_corpus.txt'
    wordlist: '/usr/share/dict/words'
    prefixes:
      - /maze
    redirect_rate: 5       # 5% of requests get 302'd
    bogon_filter: true     # Reject impossible URLs

> Key configuration decisions
PREFIX CHOICE
The prefixes setting determines what URLs the tarpit responds to. Choose something that looks natural on your site: /blog, /archive, /docs, /pages. Avoid obviously fake paths.
DELAY TUNING
min_wait: 10 and max_wait: 65 are good defaults. Lower values serve content faster (more CPU, less crawler time wasted). Higher values waste more of their time but serve fewer pages per hour. The sweet spot depends on your server capacity.
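The trade-off is easy to quantify: with delays drawn from the range, each crawler connection averages (min_wait + max_wait) / 2 seconds per page. A quick sanity check, assuming a uniform delay distribution:

```python
def pages_per_hour(min_wait: float, max_wait: float) -> float:
    """Pages one crawler connection can pull per hour, assuming uniform delays."""
    avg_delay = (min_wait + max_wait) / 2
    return 3600 / avg_delay

print(pages_per_hour(10, 65))   # defaults -> 96.0 pages/hour per connection
print(pages_per_hour(60, 300))  # a slower silo -> 20.0 pages/hour
```

Raising the delays cuts the pages served per connection (and your CPU load) roughly in proportion.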
REDIRECT RATE
Set redirect_rate: 100 for maximum trolling — creates an infinite 302 redirect chain. Set to 0 to always serve pages. 5-15 is a reasonable middle ground.
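If each request is 302'd independently with probability redirect_rate/100, the number of redirects a crawler eats before finally getting a page is geometric, with mean p / (1 - p). A back-of-envelope sketch (not code from Nepenthes) makes the tuning concrete:

```python
def expected_redirects(redirect_rate: float) -> float:
    """Mean consecutive 302s before a content page (geometric distribution)."""
    p = redirect_rate / 100
    if p >= 1.0:
        return float("inf")  # redirect_rate: 100 -> infinite chain
    return p / (1 - p)

print(round(expected_redirects(5), 3))   # default -> ~0.053 extra hops per page
print(round(expected_redirects(50), 3))  # -> 1.0 extra hop per page
```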
> Environment variable overrides
For containerized deployments, override config via env vars:
export NEPENTHES_HOST="::"
export NEPENTHES_PORT=8893
export NEPENTHES_MIN_WAIT=15
export NEPENTHES_MAX_WAIT=45
export NEPENTHES_LOG_LEVEL=info

04 // Web Server Setup
Nepenthes must sit behind a reverse proxy. Never expose it directly to the internet. The proxy makes the tarpit look like a normal part of your website.
> nginx
# Add this inside your server {} block

# The tarpit
location /maze/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;  # CRITICAL: enables drip-feed
}

# Optional: link to tarpit from your sitemap/pages
# to lure crawlers in
location /sitemap-extra.xml {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

CRITICAL: proxy_buffering off
Without proxy_buffering off, nginx will buffer the entire response before sending it to the client. This defeats the drip-feed mechanism entirely. Crawlers will just get a normal fast page load. This is the #1 misconfiguration mistake.
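You can verify the drip-feed end to end by timing first byte against full body: with buffering off, the first bytes arrive almost immediately while the body trickles in over the configured delay; a buffered proxy delivers everything at once. A rough check — the URL below is a placeholder for your own tarpit path:

```python
import time
import urllib.request

def first_byte_vs_total(url: str):
    """Seconds until the first body byte vs. until the body completes."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        resp.read(1)                        # block until the first byte lands
        first = time.monotonic() - start
        resp.read()                         # drain the rest of the drip-feed
        total = time.monotonic() - start
    return first, total

# first, total = first_byte_vs_total("https://example.com/maze/test")
```

A large gap between the two numbers means the drip-feed survives your proxy; if both are nearly equal (and roughly the configured delay), something in the path is buffering.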
> Apache
# Enable required modules
a2enmod proxy proxy_http
# Add to your VirtualHost
ProxyPass /maze/ http://localhost:8893/maze/
ProxyPassReverse /maze/ http://localhost:8893/maze/
ProxyPreserveHost On
# Pass real client IP (Apache 2.4 expression syntax;
# mod_proxy also sets X-Forwarded-For automatically)
RequestHeader set X-Forwarded-For "expr=%{REMOTE_ADDR}"
# Disable buffering
SetEnv proxy-sendchunked 1

> Caddy
example.com {
    handle /maze/* {
        reverse_proxy localhost:8893 {
            flush_interval -1
            header_up X-Forwarded-For {remote_host}
        }
    }

    # Your normal site
    handle {
        file_server
    }
}

> Multiple silos
If using multiple silos, set the X-Silo header per location:
location /maze/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Silo 'fast';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

location /deep/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Silo 'slow';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

Then reload your web server:
# nginx
sudo nginx -t && sudo systemctl reload nginx
# Apache
sudo apachectl configtest && sudo systemctl reload apache2
# Caddy
sudo systemctl reload caddy

05 // Deploy
> Docker (fastest)
cd nepenthes-py
# Edit config.yml first, then:
docker compose up -d
# Check logs
docker compose logs -f

> Manual
# Start in foreground (for testing)
python -m nepenthes config.yml
# Start as background service
nohup python -m nepenthes config.yml > /var/log/nepenthes.log 2>&1 &

> systemd service (production)
Create /etc/systemd/system/nepenthes.service:
[Unit]
Description=Nepenthes Anti-AI Tarpit
After=network.target
[Service]
Type=simple
User=nepenthes
Group=nepenthes
WorkingDirectory=/home/nepenthes/nepenthes-py
ExecStart=/usr/bin/python3 -m nepenthes config.yml
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target

Then enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable nepenthes
sudo systemctl start nepenthes
# Check status
sudo systemctl status nepenthes

> Lure the crawlers
Crawlers need to find your tarpit. Add hidden links somewhere on your real pages:
<!-- Hidden link that only crawlers follow -->
<a href="/maze/welcome" style="position:absolute;left:-9999px">
Archive
</a>
<!-- Or in your sitemap.xml -->
<url>
  <loc>https://example.com/maze/index</loc>
  <priority>0.8</priority>
</url>

06 // Verify
> Test the tarpit
# Direct test (should be slow, ~10-65 seconds)
curl -v http://localhost:8893/maze/test/page
# Through nginx (same, but via your domain)
curl -v https://example.com/maze/test/page
# Test that non-tarpit URLs still work
curl -v https://example.com/ # Should be your normal site

> Check statistics
# Overview
curl http://localhost:8893/stats | python3 -m json.tool
# Watch in real-time (refresh every 5s)
watch -n 5 'curl -s http://localhost:8893/stats | python3 -m json.tool'

> What to look for
| METRIC | HEALTHY | PROBLEM |
|---|---|---|
| cpu_percent | < 10% | > 50% = too many crawlers or zero_delay |
| unsent_bytes_percent | < 5% | > 30% = crawlers disconnecting early |
| active | 10-100 | > 500 = may need rate limiting |
| bogons | low | high = someone probing the tarpit |
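The thresholds above are easy to turn into an automated check against /stats. A sketch — the field names are taken from the table, so confirm them against your actual /stats payload:

```python
def health_warnings(stats: dict) -> list:
    """Map the metric thresholds above onto human-readable warnings."""
    warnings = []
    if stats.get("cpu_percent", 0) > 50:
        warnings.append("cpu_percent > 50: too many crawlers or zero_delay")
    if stats.get("unsent_bytes_percent", 0) > 30:
        warnings.append("unsent_bytes_percent > 30: crawlers disconnecting early")
    if stats.get("active", 0) > 500:
        warnings.append("active > 500: consider rate limiting")
    return warnings
```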
> Monitor agents
curl http://localhost:8893/stats/agents | python3 -m json.tool

You should start seeing crawler user-agents within hours. Common ones: GPTBot, ClaudeBot, CCBot, Bingbot, SemrushBot, Barkrowler.
07 // Advanced
> Multiple silos
Run different tarpits for different crawler profiles:
silos:
  - name: fast
    corpus: corpus/tech_articles.txt
    wordlist: /usr/share/dict/words
    default: true
    min_wait: 5
    max_wait: 20
    redirect_rate: 10
    prefixes:
      - /blog

  - name: deep
    corpus: corpus/literature.txt
    wordlist: /usr/share/dict/words
    min_wait: 60
    max_wait: 300
    redirect_rate: 0
    prefixes:
      - /archive

> Custom templates
Create your own Jinja2 templates to make the tarpit look like your site:
---
markov:
  - name: title
    min: 3
    max: 8
  - name: body
    min: 80
    max: 250
link_array:
  - name: nav_links
    min_count: 15
    max_count: 40
    depth_min: 2
    depth_max: 4
booleans:
  - name: show_comments
    probability: 40
---
<!DOCTYPE html>
<html>
<head><title>{{ title }}</title></head>
<body>
  <nav>
    {% for link in nav_links[:6] %}
    <a href="{{ link }}">{{ link.split('/')[-1] }}</a>
    {% endfor %}
  </nav>
  <article>
    <h1>{{ title }}</h1>
    <p>{{ body }}</p>
  </article>
  {% if show_comments %}
  <div class="comments">
    <!-- fake comments section -->
  </div>
  {% endif %}
  <footer>
    {% for link in nav_links[6:] %}
    <a href="{{ link }}">{{ link }}</a>
    {% endfor %}
  </footer>
</body>
</html>

> Export stats to external DB
# Poll the buffer endpoint and insert into PostgreSQL
while true; do
  curl -s "http://localhost:8893/stats/buffer/from/$LAST_ID" | \
    python3 import_to_db.py
  sleep 60
done

> Rate limiting with iptables
If crawlers are too aggressive, add rate limiting at the firewall level:
# Limit connections per IP to port 8893
iptables -A INPUT -p tcp --dport 8893 \
  -m connlimit --connlimit-above 20 -j REJECT

> Monitoring with Prometheus
Poll /stats periodically and export as Prometheus metrics. Track cpu_percent, hits, active, and unsent_bytes_percent over time for alerting.
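One low-effort route is the node_exporter textfile collector: poll /stats, rewrite it in Prometheus exposition format, and drop the file where node_exporter scans for it. A sketch — the field names follow the metrics table in the Verify step, and the output path is whatever your node_exporter is configured to read:

```python
import json
import urllib.request

FIELDS = ("cpu_percent", "hits", "active", "unsent_bytes_percent")

def to_prometheus(stats: dict, prefix: str = "nepenthes") -> str:
    """Render selected /stats fields as Prometheus exposition lines."""
    lines = [f"{prefix}_{name} {stats[name]}" for name in FIELDS if name in stats]
    return "\n".join(lines) + "\n"

def export_once(url: str = "http://localhost:8893/stats",
                out: str = "/var/lib/node_exporter/nepenthes.prom") -> None:
    # Fetch the current stats snapshot and write it atomically enough
    # for the textfile collector to pick up on its next scrape.
    with urllib.request.urlopen(url) as resp:
        stats = json.load(resp)
    with open(out, "w") as fh:
        fh.write(to_prometheus(stats))
```

Run export_once from cron (or a small loop) at your scrape interval.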
DEPLOYMENT CHECKLIST
- Nepenthes running as non-root user
- systemd service enabled for auto-restart
- nginx/Apache configured with proxy_buffering off
- Hidden links placed on real pages to lure crawlers
- Stats endpoint accessible for monitoring
- Corpus file is large enough (>10,000 lines ideal)
- Seed file configured for deterministic pages across restarts
- Log rotation configured for access logs