External & non-HTML links

A crawler scoped to http://localhost:3005 should not follow any of the links below into another origin, though it may record them as outbound references. Links with non-HTTP schemes (mailto:, tel:, ftp:, javascript:) should be skipped entirely.

example.com (plain) — https://example.com/
example.org with path + query — https://example.org/some/path?x=1
example.net over plain HTTP — http://example.net/
a different subdomain of example.com — https://sub.example.com/
iana.org reserved domains — https://www.iana.org/domains/reserved
a mailto: link — mailto:nobody@example.com
a tel: link — tel:+15550100
an ftp: link (non-HTTP scheme) — ftp://example.com/file.txt
a javascript: pseudo-link (should be ignored) — javascript:void(0)
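
The expected handling of the links above can be sketched roughly as follows. This is a minimal illustration, not the behaviour of any particular crawler: the helper names `origin` and `classify` and the labels "follow", "record" and "skip" are assumptions made for this example, and it expects hrefs that are already absolute.

```python
from urllib.parse import urlsplit

SCOPE = "http://localhost:3005"  # crawl scope named in the text above


def origin(url):
    """Return (scheme, host, port), filling in the default port if none is given."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, parts.port or default_port)


def classify(href, scope=SCOPE):
    """Hypothetical helper: decide what to do with an absolute href."""
    scheme = urlsplit(href).scheme.lower()
    if scheme not in ("http", "https"):
        return "skip"      # mailto:, tel:, ftp:, javascript: and similar
    if origin(href) == origin(scope):
        return "follow"    # same scheme, host and port as the scope
    return "record"        # other origin: note as outbound, do not fetch


for href in ("https://example.com/", "https://sub.example.com/",
             "mailto:nobody@example.com", "javascript:void(0)"):
    print(href, classify(href))
```

Comparing the full (scheme, host, port) triple rather than only the hostname is what keeps https://sub.example.com/ and http://example.net/ out of scope.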

For contrast: internal links it should follow

A protocol-relative link

//example.com/protocol-relative — resolves to the current scheme, still a different origin.
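
As a small sketch of that resolution, Python's standard `urljoin` can be used. The base URL below is an assumption (the root page of the scoped site), not something stated on this page.

```python
from urllib.parse import urljoin, urlsplit

BASE = "http://localhost:3005/"  # assumed URL of the page containing the link

# A protocol-relative reference inherits the scheme of the base URL.
resolved = urljoin(BASE, "//example.com/protocol-relative")
print(resolved)

# The host differs from the base, so the link is still cross-origin.
same_host = urlsplit(resolved).hostname == urlsplit(BASE).hostname
print(same_host)
```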

A link with a fragment to another page

/kitchen-sink#anchor — same page as /kitchen-sink, just a different fragment; should not be a separate URL.
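
One way to get this behaviour is to strip the fragment before a URL is checked against the set of pages already seen. The sketch below assumes the same base URL as the scope, and `canonical` is a hypothetical helper name.

```python
from urllib.parse import urljoin, urldefrag

BASE = "http://localhost:3005/"  # assumed URL of the page containing the link


def canonical(href, base=BASE):
    """Resolve href against base and drop any #fragment."""
    absolute = urljoin(base, href)
    url, _fragment = urldefrag(absolute)
    return url


seen = {canonical("/kitchen-sink")}
# The fragment variant maps to the same key, so it is not queued again.
print(canonical("/kitchen-sink#anchor") in seen)
```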
