The @stephenhay regex is just about perfect despite being the shortest. The subtleties of hyphen placement aren't very important, and a regex is a dumb place to filter out private IP addresses when a domain could always resolve to one. Checking whether an IP is valid should be a later step.
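To illustrate the "later step" idea: a minimal sketch, assuming Python's standard `ipaddress` module and assuming the address has already been obtained elsewhere (for example from a DNS lookup). The helper name `is_public_ip` is hypothetical, and the set of ranges it rejects is one reasonable choice, not a definitive policy.

```python
import ipaddress


def is_public_ip(address: str) -> bool:
    """Return True if `address` parses as an IP and is not in a special-purpose range."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        # Not an IP address at all.
        return False
    # Assumption: these are the ranges we want to refuse after resolution.
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

Running this on the resolved address, rather than on the URL text, also covers the case where a harmless-looking domain points at an internal host.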
The goal was to come up with a good regular expression to validate URLs as user input, and not to match any URL that browsers can handle (as per the URL Standard).
a.b--c.de is supposed to fail because `--` can only occur in Punycoded domain name labels, and those can only start with `xn--` (not `b--`).
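The rule as worded above can be sketched as a plain string check. This is only an illustration of that one rule, not of the full IDNA requirements; the function names are hypothetical.

```python
def label_hyphens_ok(label: str) -> bool:
    """Apply the rule as stated above: '--' is only allowed in labels starting with 'xn--'."""
    if "--" in label:
        return label.lower().startswith("xn--")
    return True


def host_hyphens_ok(host: str) -> bool:
    """Check every dot-separated label of `host` against the hyphen rule."""
    return all(label_hyphens_ok(label) for label in host.split("."))
```

With this check, `a.b--c.de` is rejected because the label `b--c` contains `--` without the `xn--` prefix, while a Punycoded label such as `xn--bcher-kva` passes.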
Many of these regexes should not be used on user input, at least not if your regex library backtracks (uses NFAs), because of the risk of ReDoS: http://en.wikipedia.org/wiki/ReDoS
Trying to shoehorn NFAs into parsing stuff that isn't a regular language is generally a bad idea. (See: Langsec.)
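As one possible alternative to a large backtracking regex, a parser can split the input first and the checks can then run on the parts. A minimal sketch, assuming Python's standard `urllib.parse.urlsplit`; the allowed scheme set and the name `looks_like_url` are assumptions for illustration, and the check is deliberately lenient rather than a full validator.

```python
from urllib.parse import urlsplit

# Assumption: the schemes we are willing to accept from users.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def looks_like_url(text: str) -> bool:
    """Lenient structural check: known scheme plus a non-empty host."""
    try:
        parts = urlsplit(text)
        host = parts.hostname
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # e.g. an unbalanced IPv6 bracket.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(host)
```

Stricter rules (hostname grammar, address ranges) can then be applied to `parts.hostname` separately instead of being folded into one pattern.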
I'm not sure about the first one: it looks like a valid FQDN (equivalent to www.foo.bar, where the trailing dot is implicit). The second one, I guess, has to do with Punycode and international URI encoding?
> The trailing dot makes it an invalid URL, which defines hostname as * [ domainlabel "." ] toplabel.
I disagree, but this is a tricky one. The relevant specs are:
| Spec | Status | Validity | Definition |
| --- | --- | --- | --- |
| URL (RFC1738) | obsolete | invalid | `hostname = *[ domainlabel "." ] toplabel` |
| HTTP/1.0 (RFC1945) | current | invalid | `host = <A legal Internet host domain name or IP address (in dotted-decimal form), as defined by Section 2.1 of RFC 1123>` |
| HTTP/1.1 (RFC2068) | obsolete | invalid | same as RFC1945 |
| HTTP/1.1 (RFC2616) | obsolete | valid | `hostname = *( domainlabel "." ) toplabel [ "." ]` |
| URI (RFC3986) | current | valid | `host = IP-literal / IPv4address / reg-name`, with `reg-name = *( unreserved / pct-encoded / sub-delims )` |
| HTTP/1.1 (RFC7230) | current | valid | `uri-host = <host, see [RFC3986], Section 3.2.2>` |
The only way that URL is invalid is if we are in a strict HTTP/1.0 context.
As a note about RFC1738 being obsolete: these days a URL is just a URI that (1) has a scheme designated as a URL scheme and (2) is valid according to that scheme's specification.
As the given URL is a valid URI, and is valid according to the current http URL scheme specification (RFC7230), that URL is valid.
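The RFC3986 `reg-name` rule is loose enough to check directly. A rough sketch in Python, transcribing the `unreserved`, `pct-encoded`, and `sub-delims` character sets from RFC3986; the host `www.foo.bar.` is an assumed example based on the name mentioned earlier in the thread, and `is_reg_name` is a hypothetical helper name.

```python
import re

# Character sets from RFC3986 (unreserved and sub-delims), written for use inside [...].
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="

# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = re.compile(rf"(?:[{UNRESERVED}{SUB_DELIMS}]|%[0-9A-Fa-f]{{2}})*")


def is_reg_name(host: str) -> bool:
    """Return True if `host` matches the RFC3986 reg-name rule in full."""
    return REG_NAME.fullmatch(host) is not None
```

Under this rule a trailing dot is just another `unreserved` character, so `www.foo.bar.` matches the same way `www.foo.bar` does.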
> The trailing dot makes it an invalid URL, which defines hostname as *[ domainlabel "." ] toplabel.
RFC2396§3.2.2 defines hostname as:
`hostname = *( domainlabel "." ) toplabel [ "." ]`
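For comparison, the RFC2396 rule can be transcribed into a pattern. This is a sketch only: the `domainlabel` and `toplabel` sub-rules are taken from RFC2396 (a `domainlabel` starts and ends with an alphanumeric, a `toplabel` starts with a letter), the name `is_rfc2396_hostname` is hypothetical, and input length is assumed to be bounded before matching.

```python
import re

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMAINLABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# toplabel = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}\.?")


def is_rfc2396_hostname(host: str) -> bool:
    """Return True if `host` matches the RFC2396 hostname rule in full."""
    return HOSTNAME.fullmatch(host) is not None
```

The optional final `\.?` is what makes a single trailing dot acceptable here, while a doubled dot or a leading hyphen still fails.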
RFC3986, which obsoletes RFC2396, seems to make fewer claims about the authority section, and specifically says in Appendix D.2 that the toplabel rule has been removed.