Crawltest Industries

The Kitchen Sink

If a content extractor handles this page, it handles most of the web. Inline bits: strong, bold, emphasis, italic, underline, strikethrough, highlighted, small print, subscript, superscript, inline code, Ctrl+C, sample output, x, an HTML abbreviation, a real external link, an internal link, a mailto link, a tel link, and a same-page anchor. There is also a line break here →
← and a short inline quotation.

Lists

Unordered, nested

Ordered, with start offset

  3. Third
  4. Fourth
  5. Fifth

Definition list

Crawler
A program that follows links and records pages.
Fixture
A controlled input used for testing.

Table

Quarterly nonsense figures

Quarter  Widgets  Gizmos  Total
Q1       120      80      200
Q2       150      95      245
Total    270      175     445

Quotes & code

The web is a worse-is-better system. — Somebody, probably

function crawl(url) {
  const seen = new Set();
  const queue = [url];
  while (queue.length) {
    const next = queue.shift();
    if (seen.has(next)) continue;
    seen.add(next);
    // ... fetch, parse, enqueue links
  }
  return seen;
}
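
The fetch, parse, and enqueue step is elided in the function above. One way it could be filled in, as a sketch only, is to pass the link lookup in as a function so the traversal can be exercised without a network. The names `crawlWith`, `getLinks`, and the in-memory `site` map are hypothetical and not part of the original page.

```javascript
// Hypothetical variant of crawl(): the caller supplies getLinks(url),
// which returns the outgoing links of a page as an array of URLs.
function crawlWith(startUrl, getLinks) {
  const seen = new Set();
  const queue = [startUrl];
  while (queue.length) {
    const next = queue.shift();
    if (seen.has(next)) continue; // already visited, skip
    seen.add(next);
    for (const link of getLinks(next)) {
      if (!seen.has(link)) queue.push(link); // enqueue unseen links only
    }
  }
  return seen;
}

// Made-up in-memory site used in place of real fetching and parsing.
const site = {
  "/": ["/about", "/blog"],
  "/about": ["/"],
  "/blog": ["/about", "/blog/post-1"],
  "/blog/post-1": [],
};

const visited = crawlWith("/", (url) => site[url] ?? []);
console.log([...visited]);
```

Because the queue is first-in, first-out, pages are visited breadth-first, and the `seen` set keeps the cycle between `/` and `/about` from looping forever.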

Media & figures

A coloured tile with the number one
Figure 1 — an inline SVG served as an <img> (has alt text).
Lazily loaded tile three
Intentionally broken image

A <picture> element with multiple <source> candidates:

Responsive tile
An <iframe> embedding /about — crawlers should not treat the framed URL as new.
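
A sketch of how a crawler might avoid counting the framed URL twice: resolve every discovered reference against the page URL, drop the fragment, and keep one set of normalized URLs. The origin `https://crawltest.example/`, the helper name `normalize`, and the sample references are assumptions for illustration only.

```javascript
// Resolve a reference against the page it was found on and drop the fragment,
// so "/about", "about", and "/about#team" all map to the same key.
function normalize(href, base) {
  const url = new URL(href, base);
  url.hash = "";
  return url.href;
}

const base = "https://crawltest.example/"; // placeholder origin (assumption)
const seenUrls = new Set();

// The same page referenced three ways, e.g. from a link, an iframe src,
// and an anchor carrying a fragment.
const refs = ["/about", "about", "/about#team"];

const isNew = refs.map((href) => {
  const key = normalize(href, base);
  const fresh = !seenUrls.has(key);
  seenUrls.add(key);
  return fresh;
});

console.log(isNew); // [ true, false, false ]
```

Only the first reference is treated as new. The later ones, including the one that would come from the iframe's `src`, resolve to a URL already in the set.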

Disclosure widgets

Click to expand a <details> block

Hidden-by-default content. The text is still in the HTML, so extractors should see it.

A small form

Progress & meter

70%

Disk usage: 60%


— end of the kitchen sink.