External & non-HTML links
A crawler scoped to http://localhost:3005 should not follow any of the links below into another origin (it may record them as outbound references). The non-HTTP schemes should be skipped entirely.
- example.com (plain): https://example.com/
- example.org with path + query: https://example.org/some/path?x=1
- example.net over plain HTTP: http://example.net/
- a different subdomain of example.com: https://sub.example.com/
- iana.org reserved domains: https://www.iana.org/domains/reserved
- a mailto: link: mailto:nobody@example.com
- a tel: link: tel:+15550100
- an ftp: link (non-HTTP scheme): ftp://example.com/file.txt
- a javascript: pseudo-link (should be ignored): javascript:void(0)
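
One way a crawler might apply the scoping rule above is sketched below. It is only an illustration built on Python's standard `urllib.parse`; the helper names `origin` and `classify` and the three labels ("follow", "outbound", "skip") are assumptions made for this example, not part of any particular crawler.

```python
from urllib.parse import urljoin, urlsplit

SCOPE = "http://localhost:3005"  # crawl scope named in the text above


def origin(url):
    """Return (scheme, host, port), filling in the default port if absent."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, parts.port or default_port)


def classify(href, base=SCOPE + "/"):
    """Hypothetical helper: label a link 'follow', 'outbound', or 'skip'."""
    absolute = urljoin(base, href)  # resolve relative links against the page
    if urlsplit(absolute).scheme not in ("http", "https"):
        return "skip"      # mailto:, tel:, ftp:, javascript:, ...
    if origin(absolute) == origin(SCOPE):
        return "follow"    # same origin as the crawl scope
    return "outbound"      # HTTP(S), but a different origin


for href in [
    "https://example.com/",
    "https://sub.example.com/",
    "mailto:nobody@example.com",
    "tel:+15550100",
    "ftp://example.com/file.txt",
    "javascript:void(0)",
]:
    print(href, "->", classify(href))
```

Comparing the full (scheme, host, port) triple rather than the hostname alone is what keeps `https://sub.example.com/` and `http://example.net/` out of scope; a crawler with a looser notion of scope would need a different comparison.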
For contrast: internal links it *should* follow
A protocol-relative link
//example.com/protocol-relative — resolves to the current scheme, still a different origin.
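
A small sketch of how the protocol-relative link resolves, using `urljoin` from Python's standard library. The page path `/external-links` is a made-up base URL for illustration; only the scheme and host of the base matter here.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical URL of the page that contains the link.
base = "http://localhost:3005/external-links"

resolved = urljoin(base, "//example.com/protocol-relative")
print(resolved)                    # http://example.com/protocol-relative
print(urlsplit(resolved).netloc)   # example.com

# The same link seen from an HTTPS page inherits https instead.
resolved_https = urljoin("https://localhost:3005/", "//example.com/protocol-relative")
print(resolved_https)              # https://example.com/protocol-relative
```

In both cases the host is example.com, so the link stays outside the localhost:3005 origin regardless of which scheme it inherits.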
A link with a fragment to another page
/kitchen-sink#anchor — same page as /kitchen-sink, just a different fragment; should not be a separate URL.
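
Deduplicating on the fragment-free URL could look roughly like the following. This is a sketch using `urldefrag` from Python's standard library; the `canonical` helper and the `seen` set are names assumed for the example.

```python
from urllib.parse import urldefrag, urljoin


def canonical(href, base="http://localhost:3005/"):
    """Hypothetical helper: resolve a link and drop its #fragment."""
    absolute = urljoin(base, href)
    url, _fragment = urldefrag(absolute)  # splits off everything after '#'
    return url


seen = set()
for href in ["/kitchen-sink", "/kitchen-sink#anchor"]:
    seen.add(canonical(href))

print(seen)  # {'http://localhost:3005/kitchen-sink'}
```

Both links collapse to a single entry, so the page is fetched once.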