Crawltest Industries

A synthetic website that exists only to be crawled.

Every page here is generated from a static dataset, so the crawl is finite and deterministic. The site deliberately mixes route shapes (static, dynamic, catch-all, paginated), content types (prose, tables, lists, images, forms, structured data), and crawler edge cases (redirects, canonicals, robots rules, slow responses, client-only content).

Start here

About — long-form prose, headings h1–h6, JSON-LD
Blog — listing, pagination, /blog/[slug], tags, RSS
Products — /products/[id], schema.org Product, categories
Docs — a real catch-all (/docs/[...path]) with nesting
Kitchen sink — one page exercising every HTML content element

Crawler edge cases

/old-page → 301 to /about (de-dup on canonical)
/redirect-me → 307 temporary redirect to /about
/blog/legacy/hello-crawler → 308 to the real post
/private/secret — renders fine but is disallowed in robots.txt and sends X-Robots-Tag: noindex
/slow — delays ~3.5s before responding (tests timeouts / waitFor)
/js-rendered — real content is injected client-side only
/maze/start — links cycle back on themselves (tests visited-set + depth limits)
/this-page-does-not-exist — 404
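
The visited-set and depth-limit behaviour that /maze/start is meant to test can be sketched as follows. This is a minimal illustration, not the site's own tooling: the MAZE link graph and the crawl function are hypothetical stand-ins for whatever links the real maze pages contain, and the traversal runs over an in-memory dict rather than over HTTP.

```python
from collections import deque

def crawl(graph, start, max_depth):
    """Breadth-first traversal that visits each URL at most once."""
    visited = {start}            # URLs already queued or fetched
    queue = deque([(start, 0)])  # (url, depth from the start page)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not expand this page's links
        for link in graph.get(url, []):
            if link not in visited:  # the visited set breaks link cycles
                visited.add(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical link graph (assumption): maze pages that link back to each other.
MAZE = {
    "/maze/start": ["/maze/a", "/maze/b"],
    "/maze/a": ["/maze/b", "/maze/start"],
    "/maze/b": ["/maze/a", "/maze/c"],
    "/maze/c": ["/maze/start"],
}

if __name__ == "__main__":
    print(crawl(MAZE, "/maze/start", max_depth=10))
```

Without the visited set the loop would never drain the queue on a cyclic graph; the depth limit is a second guard for mazes that generate fresh URLs on every hop.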

Machine-readable

/sitemap.xml
/robots.txt
/llms.txt
/feed.xml (RSS 2.0)
/api/products (JSON)
/files/datasheet.pdf (non-HTML resource)

Full page index (61 crawlable HTML pages, excluding the robots-disallowed /private/*)

A crawler that follows links from /, obeys robots.txt, and de-duplicates on the canonical URL should converge on this set. (The 307 from /blog to /blog/page/1 and the 301 from /old-page to /about mean a few extra URLs get *visited* before collapsing.)
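
As a rough sketch of that convergence, the snippet below filters discovered paths through robots rules and collapses redirects onto their targets. Several things here are assumptions made for illustration: the ROBOTS_TXT contents (the real /robots.txt may differ), the converge and resolve helpers, and the simplification of treating the final redirect target as the canonical URL instead of reading a rel="canonical" link. It does not re-check robots rules on redirect targets.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots rules for illustration; the real file may contain more.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# The two redirects named in the paragraph above.
REDIRECTS = {"/old-page": "/about", "/blog": "/blog/page/1"}

def resolve(path, redirects, max_hops=5):
    """Follow redirects until a non-redirecting path (or the hop limit)."""
    for _ in range(max_hops):
        if path not in redirects:
            return path
        path = redirects[path]
    return path

def converge(discovered, robots_txt, redirects):
    """Return the de-duplicated set of allowed pages, in discovery order."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    seen = set()
    final = []
    for path in discovered:
        if not robots.can_fetch("*", path):
            continue  # robots-disallowed: never fetched, never indexed
        target = resolve(path, redirects)
        if target not in seen:  # de-duplicate on the resolved URL
            seen.add(target)
            final.append(target)
    return final

if __name__ == "__main__":
    found = ["/", "/old-page", "/about", "/blog", "/blog/page/1", "/private/secret"]
    print(converge(found, ROBOTS_TXT, REDIRECTS))
```

Here /old-page and /blog are both visited, but each collapses onto a page that is counted once, which is why the visited count can exceed the size of the final index.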
