Hacker News

Why are http://www.foo.bar./ and http://a.b--c.de/ supposed to fail?

The @stephenhay regex is just about perfect despite being the shortest. The subtleties of hyphen placement aren't very important, and a URL regex is the wrong place to filter out private IP addresses, since a domain could always resolve to one. Checking whether an IP is valid should be a later step.
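To make the "later step" concrete, here is a minimal sketch in Python, assuming the check runs on the addresses a hostname resolves to rather than on the URL text. The function names (`is_public_ip`, `resolve_host`, `host_is_safe`) and the `resolver` parameter are made up for illustration, and the exact set of address ranges to reject is a policy choice, not something settled in this thread.

```python
import ipaddress
import socket


def is_public_ip(address):
    """Return True if the address is not in a private or special-purpose range."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_host(hostname):
    """Resolve a hostname to its distinct IP address strings (does a DNS lookup)."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def host_is_safe(hostname, resolver=resolve_host):
    """Hypothetical policy: every resolved address must be public."""
    addresses = resolver(hostname)
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

Passing the resolver in as a parameter keeps the policy testable without touching the network, e.g. `host_is_safe("intranet.example", resolver=lambda h: ["192.168.1.10"])`.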



As for why the trailing dot is disallowed, see <http://saynt2day.blogspot.com/2013/03/danger-of-trailing-dot....

The goal was to come up with a good regular expression to validate URLs as user input, and not to match any URL that browsers can handle (as per the URL Standard).

a.b--c.de is supposed to fail because `--` can only occur in Punycoded domain name labels, and those can only start with `xn--` (not `b--`).
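A rough sketch of that rule in Python, taking the claim above at face value (a label containing `--` is only accepted when it starts with `xn--`). The name `label_allowed` is hypothetical, and the letters-digits-hyphen pattern with a 63-character limit is an added assumption about what counts as a well-formed label.

```python
import re

# Letters, digits and hyphens; no leading or trailing hyphen; 1 to 63 characters.
_LDH_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?", re.IGNORECASE)


def label_allowed(label):
    """Apply the rule as stated above: '--' only inside 'xn--' labels."""
    if not _LDH_LABEL.fullmatch(label):
        return False
    if "--" in label and not label.lower().startswith("xn--"):
        return False
    return True


def host_allowed(host):
    """Check every dot-separated label of a hostname."""
    return all(label_allowed(label) for label in host.split("."))
```

Under this rule `host_allowed("a.b--c.de")` is false because of the `b--c` label, while a Punycoded label such as `xn--bcher-kva` passes.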


Many of these regexes should not be used on user input, at least not if your regex library backtracks (uses NFAs), because of the risk of ReDoS: http://en.wikipedia.org/wiki/ReDoS

Trying to shoehorn NFAs into parsing stuff that isn't a regular language is generally a bad idea. (See: Langsec.)
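One alternative, sketched here under the assumption that Python's standard library is acceptable: let `urllib.parse.urlsplit` do the splitting and then check the pieces, so no backtracking regex ever sees the user input. `basic_url_check` is a made-up name, and `urlsplit` is a lenient splitter rather than a validator, so this is only a first pass.

```python
from urllib.parse import urlsplit


def basic_url_check(url):
    """Accept only http(s) URLs that have a host and a sane port."""
    try:
        parts = urlsplit(url)
        parts.port  # accessing .port raises ValueError if the port is invalid
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

Note that this sketch accepts `http://www.foo.bar./`, since `urlsplit` has no opinion on trailing dots; any stricter hostname policy would be a separate check on `parts.hostname`.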


I'm not sure about the first one; it looks like a valid FQDN (equivalent to www.foo.bar, where the trailing dot is implicit). The second one, I guess, has to do with Punycode and international URI encoding?


You'd only have to filter out xn--, though.


The trailing dot makes it an invalid URL, which defines hostname as *[ domainlabel "." ] toplabel.

"b--c" seems to be a valid domainlabel, though, so I'm not sure why it's on there.


> The trailing dot makes it an invalid URL, which defines hostname as * [ domainlabel "." ] toplabel.

I disagree. But, this is a tricky one. The relevant specs are:

    Spec               |          | Validity | Definition
    URL      (RFC1738) | obsolete |  invalid | hostname = *[ domainlabel "." ] toplabel
    HTTP/1.0 (RFC1945) | current  |  invalid | host     = <A legal Internet host domain name
                       |          |          |             or IP address (in dotted-decimal form),
                       |          |          |             as defined by Section 2.1 of RFC 1123>
    HTTP/1.1 (RFC2068) | obsolete |  invalid | ; same as RFC1945
    HTTP/1.1 (RFC2616) | obsolete |    valid | hostname = *( domainlabel "." ) toplabel [ "." ]
    URI      (RFC3986) | current  |    valid | host     = IP-literal / IPv4address / reg-name
                       |          |          | reg-name = *( unreserved / pct-encoded / sub-delims )
    HTTP/1.1 (RFC7230) | current  |    valid | uri-host = <host, see [RFC3986], Section 3.2.2>

The only way that URL is invalid is if we are in a strict HTTP/1.0 context.

As a note about RFC1738 being obsolete: these days a URL is just a URI (1) whose scheme specifies it as a URL scheme, and (2) is valid according to the scheme specification.

As the given URL is a valid URI, and is valid according to the current http URL scheme specification (RFC7230), that URL is valid.
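For anyone who wants to poke at the difference, here is a small Python sketch of the two `hostname` grammars from the table, RFC1738 versus RFC2616. The `domainlabel` and `toplabel` patterns are transcribed from the RFC1738 definitions (alphanumeric at both ends, hyphens allowed inside, `toplabel` starting with a letter); the constant and function names are made up for illustration.

```python
import re

# domainlabel: alphanumeric at both ends, hyphens allowed in between.
DOMAINLABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# toplabel: like domainlabel, but must start with a letter.
TOPLABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

# RFC1738: hostname = *[ domainlabel "." ] toplabel
HOSTNAME_RFC1738 = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}")
# RFC2616: hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME_RFC2616 = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}\.?")


def valid_hostname(host, allow_trailing_dot):
    """Match a hostname against one of the two grammars above."""
    pattern = HOSTNAME_RFC2616 if allow_trailing_dot else HOSTNAME_RFC1738
    return pattern.fullmatch(host) is not None
```

With these patterns `www.foo.bar.` only matches when `allow_trailing_dot` is true, and `a.b--c.de` matches either way, since neither grammar says anything about double hyphens.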


You forgot the most relevant spec, the URL Standard: http://url.spec.whatwg.org/


> The trailing dot makes it an invalid URL, which defines hostname as *[ domainlabel "." ] toplabel.

RFC2396§3.2.2 defines hostname as:

   hostname      = *( domainlabel "." ) toplabel [ "." ]

RFC3986, which obsoletes RFC2396, seems to make fewer claims about the authority section, and specifically says in Appendix D.2 that the toplabel rule has been removed.



