CONNECT://

Step-by-step guide to deploy Nepenthes-Py and connect it to your website. Follow each step in order. Your tarpit will be live in under 5 minutes.

01 // Prerequisites

> Option A: Docker (recommended)

All you need is Docker and Docker Compose installed:

# Check Docker is installed
docker --version
docker compose version

If Docker is installed, skip the manual prerequisites below and go straight to 02 // Install.

> Option B: Manual

You will need:

  • Python 3.11+ — check with python3 --version
  • pip — Python package manager
  • A web server — nginx, Apache, Caddy, or similar
  • A corpus file — any text file for Markov training (books, articles, etc.)
  • A wordlist — /usr/share/dict/words works on most Linux systems
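If you want to sanity-check the manual prerequisites in one shot, a small script like this works (illustrative only; the wordlist path is the same default used later in config.yml):

```python
import os
import sys

def check_prereqs(wordlist="/usr/share/dict/words"):
    """Return pass/fail results for the manual-install prerequisites."""
    return {
        "python_3_11_plus": sys.version_info >= (3, 11),
        "wordlist_present": os.path.exists(wordlist),
    }

if __name__ == "__main__":
    for name, ok in check_prereqs().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```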

IMPORTANT

Do NOT run Nepenthes as root. Create a dedicated user:

useradd -m nepenthes
su - nepenthes

02 // Install

> Clone the repository

cd /home/nepenthes
git clone https://github.com/YOUR_USERNAME/nepenthes-py.git
cd nepenthes-py

> Install dependencies

pip install -r requirements.txt

Dependencies installed:

  • aiohttp — async HTTP server
  • pyyaml — YAML config parser
  • jinja2 — template engine

> Prepare your corpus

The corpus is the text that Nepenthes uses to generate Markov babble. The better the corpus, the more convincing the garbage output. Good sources:

  • Project Gutenberg books (public domain)
  • Wikipedia article dumps
  • Any large collection of plain text

# Example: download a public domain book
wget https://www.gutenberg.org/files/1342/1342-0.txt -O corpus/my_corpus.txt
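To see why corpus quality matters, here is a toy word-level Markov generator in the same spirit (a sketch, not Nepenthes' actual implementation): each run of words in the corpus determines what can follow, so richer corpora produce more convincing babble.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` words to the words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, length=30, order=2):
    """Random-walk the chain: locally plausible, globally meaningless text."""
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

demo = ("it is a truth universally acknowledged that a single man "
        "in possession of a good fortune must be in want of a wife")
print(babble(build_chain(demo)))
```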

> Verify installation

python -m nepenthes --help
# or just test with sample corpus:
python -m nepenthes config.yml

03 // Configure

Edit config.yml to match your setup:

---
# Network
http_host: '::'          # Listen on all interfaces
http_port: 8893          # Internal port (not exposed directly)

# Templates & seed
templates:
  - 'templates'
seed_file: 'seed.txt'    # Persistent instance ID

# Default delay range (seconds)
min_wait: 10
max_wait: 65

# Logging
log_level: info

# Statistics rolling window
stats_remember_time: 3600   # 1 hour

# Silos
silos:
  - name: default
    default: true
    corpus: 'corpus/my_corpus.txt'
    wordlist: '/usr/share/dict/words'
    prefixes:
      - /maze
    redirect_rate: 5        # 5% of requests get 302'd
    bogon_filter: true       # Reject impossible URLs
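Before deploying it can help to sanity-check these values. A minimal validator over the parsed config might look like this (a sketch, assuming config.yml loads into a plain dict via pyyaml; the checks are illustrative, not the project's own validation):

```python
def validate_config(cfg):
    """Return a list of problems found in a parsed config dict (illustrative checks)."""
    errors = []
    if not 0 < cfg.get("http_port", 0) < 65536:
        errors.append("http_port must be between 1 and 65535")
    if cfg.get("min_wait", 0) > cfg.get("max_wait", 0):
        errors.append("min_wait must not exceed max_wait")
    defaults = [s for s in cfg.get("silos", []) if s.get("default")]
    if len(defaults) != 1:
        errors.append("exactly one silo should be marked default: true")
    for silo in cfg.get("silos", []):
        if not 0 <= silo.get("redirect_rate", 0) <= 100:
            errors.append(f"silo {silo.get('name')}: redirect_rate must be 0-100")
    return errors
```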

> Key configuration decisions

PREFIX CHOICE

The prefixes setting determines what URLs the tarpit responds to. Choose something that looks natural on your site: /blog, /archive, /docs, /pages. Avoid obviously fake paths.

DELAY TUNING

min_wait: 10 and max_wait: 65 are good defaults. Lower values serve pages faster: more load on your CPU, less crawler time wasted per page. Higher values waste more of each crawler's time per page but serve fewer pages per hour. The sweet spot depends on your server capacity.

REDIRECT RATE

Set redirect_rate: 100 for maximum trolling — creates an infinite 302 redirect chain. Set to 0 to always serve pages. 5-15 is a reasonable middle ground.
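Per request, redirect_rate amounts to a weighted coin flip. A sketch (not the project's actual code):

```python
import random

def should_redirect(redirect_rate, rng=random.random):
    """Return True for roughly `redirect_rate` percent of requests."""
    return rng() * 100 < redirect_rate
```

At 100 every request 302s into the chain; at 0 none do.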

> Environment variable overrides

For containerized deployments, override config via env vars:

export NEPENTHES_HOST="::"
export NEPENTHES_PORT=8893
export NEPENTHES_MIN_WAIT=15
export NEPENTHES_MAX_WAIT=45
export NEPENTHES_LOG_LEVEL=info
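Overlaying env vars onto the file-based config typically looks like this (a sketch; the mapping mirrors the variable names above, but Nepenthes' exact precedence rules may differ):

```python
import os

# Env var -> (config key, type) mapping, mirroring the names above
ENV_MAP = {
    "NEPENTHES_HOST": ("http_host", str),
    "NEPENTHES_PORT": ("http_port", int),
    "NEPENTHES_MIN_WAIT": ("min_wait", int),
    "NEPENTHES_MAX_WAIT": ("max_wait", int),
    "NEPENTHES_LOG_LEVEL": ("log_level", str),
}

def apply_env_overrides(cfg, environ=os.environ):
    """Overlay any set NEPENTHES_* variables onto a parsed config dict."""
    for var, (key, cast) in ENV_MAP.items():
        if var in environ:
            cfg[key] = cast(environ[var])
    return cfg
```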

04 // Web Server Setup

Nepenthes must sit behind a reverse proxy. Never expose it directly to the internet. The proxy makes the tarpit look like a normal part of your website.

> nginx

# Add this inside your server {} block

# The tarpit
location /maze/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;    # CRITICAL: enables drip-feed
}

# Optional: link to tarpit from your sitemap/pages
# to lure crawlers in
location /sitemap-extra.xml {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

CRITICAL: proxy_buffering off

Without proxy_buffering off, nginx will buffer the entire response before sending it to the client. This defeats the drip-feed mechanism entirely. Crawlers will just get a normal fast page load. This is the #1 misconfiguration mistake.
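What the proxy must preserve is the drip-feed: the application sends tiny chunks spaced out over the whole delay window. In sketch form (plain asyncio here for illustration, rather than the project's aiohttp handler):

```python
import asyncio

async def drip_feed(payload, total_seconds, send, chunk_size=16):
    """Send `payload` in small chunks spread evenly across `total_seconds`."""
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    delay = total_seconds / max(len(chunks), 1)
    for chunk in chunks:
        send(chunk)                 # with buffering on, the proxy hoards these...
        await asyncio.sleep(delay)  # ...and the crawler never feels this delay

received = []
asyncio.run(drip_feed(b"<html>slow page</html>", 0.1, received.append))
print(b"".join(received))
```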

> Apache

# Enable required modules
a2enmod proxy proxy_http

# Add to your VirtualHost
ProxyPass /maze/ http://localhost:8893/maze/
ProxyPassReverse /maze/ http://localhost:8893/maze/
ProxyPreserveHost On

# Client IP: mod_proxy_http adds an X-Forwarded-For header with the
# real client address automatically; no RequestHeader directive needed

# Disable buffering
SetEnv proxy-sendchunked 1

> Caddy

example.com {
    handle /maze/* {
        reverse_proxy localhost:8893 {
            flush_interval -1
            header_up X-Forwarded-For {remote_host}
        }
    }

    # Your normal site
    handle {
        file_server
    }
}

> Multiple silos

If using multiple silos, set the X-Silo header per location:

location /maze/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Silo 'fast';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

location /deep/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Silo 'slow';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

Then reload your web server:

# nginx
sudo nginx -t && sudo systemctl reload nginx

# Apache
sudo apachectl configtest && sudo systemctl reload apache2

# Caddy
sudo systemctl reload caddy

05 // Deploy

> Docker (fastest)

cd nepenthes-py

# Edit config.yml first, then:
docker compose up -d

# Check logs
docker compose logs -f

> Manual

# Start in foreground (for testing)
python -m nepenthes config.yml

# Start as background service
nohup python -m nepenthes config.yml > ~/nepenthes.log 2>&1 &

> systemd service (production)

Create /etc/systemd/system/nepenthes.service:

[Unit]
Description=Nepenthes Anti-AI Tarpit
After=network.target

[Service]
Type=simple
User=nepenthes
Group=nepenthes
WorkingDirectory=/home/nepenthes/nepenthes-py
ExecStart=/usr/bin/python3 -m nepenthes config.yml
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Then enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable nepenthes
sudo systemctl start nepenthes

# Check status
sudo systemctl status nepenthes

> Lure the crawlers

Crawlers need to find your tarpit. Add hidden links somewhere on your real pages:

<!-- Hidden link that only crawlers follow -->
<a href="/maze/welcome" style="position:absolute;left:-9999px">
  Archive
</a>

<!-- Or in your sitemap.xml -->
<url>
  <loc>https://example.com/maze/index</loc>
  <priority>0.8</priority>
</url>
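Generating those sitemap entries for a handful of entry points can be scripted; the paths here are examples:

```python
def sitemap_entries(base_url, paths, priority=0.8):
    """Render <url> blocks for sitemap.xml pointing crawlers at the tarpit."""
    entry = ("<url>\n  <loc>{base}{path}</loc>\n"
             "  <priority>{prio}</priority>\n</url>")
    return "\n".join(entry.format(base=base_url, path=p, prio=priority)
                     for p in paths)

print(sitemap_entries("https://example.com", ["/maze/index", "/maze/welcome"]))
```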

06 // Verify

> Test the tarpit

# Direct test (should be slow, ~10-65 seconds)
curl -v http://localhost:8893/maze/test/page

# Through nginx (same, but via your domain)
curl -v https://example.com/maze/test/page

# Test that non-tarpit URLs still work
curl -v https://example.com/   # Should be your normal site

> Check statistics

# Overview
curl http://localhost:8893/stats | python3 -m json.tool

# Watch in real-time (refresh every 5s)
watch -n 5 'curl -s http://localhost:8893/stats | python3 -m json.tool'

> What to look for

METRIC                  HEALTHY    PROBLEM
cpu_percent             < 10%      > 50% = too many crawlers or zero_delay
unsent_bytes_percent    < 5%       > 30% = crawlers disconnecting early
active                  10-100     > 500 = may need rate limiting
bogons                  low        high = someone probing the tarpit
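Those thresholds translate directly into a quick health check you can run against the /stats JSON (the field names match the metrics listed; the thresholds are the ones given above):

```python
def health_problems(stats):
    """Flag stats values that cross the problem thresholds listed above."""
    problems = []
    if stats.get("cpu_percent", 0) > 50:
        problems.append("cpu_percent high: too many crawlers or zero_delay")
    if stats.get("unsent_bytes_percent", 0) > 30:
        problems.append("unsent_bytes_percent high: crawlers disconnecting early")
    if stats.get("active", 0) > 500:
        problems.append("active connections high: consider rate limiting")
    return problems
```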

> Monitor agents

curl http://localhost:8893/stats/agents | python3 -m json.tool

You should start seeing crawler user-agents within hours. Common ones: GPTBot, ClaudeBot, CCBot, Bingbot, SemrushBot, Barkrowler.
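To tally which of those bots are hitting you, a naive case-insensitive substring match over the agents list is enough (illustrative; the bot names are the ones listed above):

```python
KNOWN_CRAWLERS = ("GPTBot", "ClaudeBot", "CCBot", "Bingbot",
                  "SemrushBot", "Barkrowler")

def is_known_crawler(user_agent):
    """Case-insensitive substring match against the bot names above."""
    ua = user_agent.lower()
    return any(bot.lower() in ua for bot in KNOWN_CRAWLERS)
```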

07 // Advanced

> Multiple silos

Run different tarpits for different crawler profiles:

silos:
  - name: fast
    corpus: corpus/tech_articles.txt
    wordlist: /usr/share/dict/words
    default: true
    min_wait: 5
    max_wait: 20
    redirect_rate: 10
    prefixes:
      - /blog

  - name: deep
    corpus: corpus/literature.txt
    wordlist: /usr/share/dict/words
    min_wait: 60
    max_wait: 300
    redirect_rate: 0
    prefixes:
      - /archive
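Prefix-based silo selection then works roughly like this (a sketch; as noted in the web server section, Nepenthes can also route by X-Silo header, and its actual routing logic may differ):

```python
def pick_silo(path, silos):
    """Route a request path to the first silo whose prefix matches, else the default."""
    for silo in silos:
        if any(path.startswith(p) for p in silo.get("prefixes", [])):
            return silo["name"]
    return next(s["name"] for s in silos if s.get("default"))
```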

> Custom templates

Create your own Jinja2 templates to make the tarpit look like your site:

---
markov:
  - name: title
    min: 3
    max: 8
  - name: body
    min: 80
    max: 250
link_array:
  - name: nav_links
    min_count: 15
    max_count: 40
    depth_min: 2
    depth_max: 4
booleans:
  - name: show_comments
    probability: 40
---
<!DOCTYPE html>
<html>
<head><title>{{ title }}</title></head>
<body>
  <nav>
    {% for link in nav_links[:6] %}
    <a href="{{ link }}">{{ link.split('/')[-1] }}</a>
    {% endfor %}
  </nav>
  <article>
    <h1>{{ title }}</h1>
    <p>{{ body }}</p>
  </article>
  {% if show_comments %}
  <div class="comments">
    <!-- fake comments section -->
  </div>
  {% endif %}
  <footer>
    {% for link in nav_links[6:] %}
    <a href="{{ link }}">{{ link }}</a>
    {% endfor %}
  </footer>
</body>
</html>
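Conceptually, a link_array stanza like nav_links produces random descent paths built from the wordlist. A toy version (not the project's generator) looks like this:

```python
import random

def make_links(words, count, depth_range=(2, 4), prefix="/maze"):
    """Build random descent paths like those a link_array stanza produces."""
    links = []
    for _ in range(count):
        depth = random.randint(*depth_range)
        segments = "/".join(random.choice(words) for _ in range(depth))
        links.append(f"{prefix}/{segments}")
    return links
```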

> Export stats to external DB

# Poll the buffer endpoint and insert into PostgreSQL
while true; do
  curl -s "http://localhost:8893/stats/buffer/from/$LAST_ID" | \
    python3 import_to_db.py
  sleep 60
done
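The import_to_db.py script referenced above is left to you; its core would look something like this (the buffer payload shape — a JSON list of events with incrementing id fields — is an assumption, so adjust to whatever your instance actually returns):

```python
import json

def process_buffer(raw, last_id):
    """Parse a /stats/buffer payload; return (new_rows, highest_id_seen)."""
    events = json.loads(raw)
    rows = [e for e in events if e.get("id", 0) > last_id]
    new_last = max((e["id"] for e in rows), default=last_id)
    return rows, new_last
```

The returned id would feed back into $LAST_ID for the next poll, so restarts never re-insert rows.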

> Rate limiting with iptables

If crawlers are too aggressive, add rate limiting at the firewall level:

# Limit connections per IP to port 8893
iptables -A INPUT -p tcp --dport 8893 \
  -m connlimit --connlimit-above 20 -j REJECT

> Monitoring with Prometheus

Poll /stats periodically and export as Prometheus metrics. Track cpu_percent, hits, active, and unsent_bytes_percent over time for alerting.

DEPLOYMENT CHECKLIST

  • Nepenthes running as non-root user
  • systemd service enabled for auto-restart
  • nginx/Apache configured with proxy_buffering off
  • Hidden links placed on real pages to lure crawlers
  • Stats endpoint accessible for monitoring
  • Corpus file is large enough (>10,000 lines ideal)
  • Seed file configured for deterministic pages across restarts
  • Log rotation configured for access logs