It catches most dupes, just not on sites like the NYT that have so many differen...

far33d · on March 24, 2008

Maybe you could also check whether the title of the page is the same, not as automatic detection, but as a reason to ask "are you sure this isn't the same as foo?" This might prevent most of the NYTimes dupes.
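
A rough sketch of that title check, in Python. Everything here is an assumption made for illustration: the function names, the (url, title) pair format, and the normalization rules are hypothetical, not how the site actually stores or compares submissions.

```python
import re

def normalize_title(title):
    """Lowercase, drop punctuation, collapse whitespace (hypothetical rules)."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", title.lower())
    return " ".join(cleaned.split())

def possible_dupes(new_title, existing):
    """Return URLs of earlier submissions whose normalized title matches.

    existing: iterable of (url, title) pairs (an assumed storage format).
    The result is meant to drive an "are you sure?" prompt, not an
    automatic rejection.
    """
    key = normalize_title(new_title)
    return [url for url, title in existing if normalize_title(title) == key]

# Made-up data: same story, different URL and slightly different punctuation.
submitted = [
    ("http://example.com/a?ref=rss", "Example Story: A Headline"),
    ("http://example.com/b", "Something Else"),
]
matches = possible_dupes("Example story - a headline", submitted)
```

A non-empty `matches` would only trigger the question to the submitter; two different stories can legitimately share a title.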

derefr · on March 24, 2008

Would it be that hard to just take a fulltext index of each page that hits the hot page? From there, just show anything with more than N% similarity (N probably 98 or so, as text ads can change a page's text a little bit).
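
One way the similarity check could look, sketched in Python. This is not a real fulltext index: it swaps in Jaccard similarity over three-word shingles as a stand-in, and the function names, the dict of already-fetched page text, and the shingle size are all assumptions. Only the 98% starting point comes from the comment above, and it would need tuning against real pages.

```python
def shingles(text, k=3):
    """Set of overlapping k-word tuples from the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def similar_pages(new_text, indexed, threshold=0.98):
    """URLs of indexed pages at least `threshold` similar to new_text.

    indexed: dict mapping url -> extracted page text (hypothetical store).
    """
    return [url for url, text in indexed.items()
            if similarity(new_text, text) >= threshold]
```

Shingle overlap is sensitive to small edits on short texts, so a threshold that works for full articles may be too strict for short pages.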

ph0rque · on March 24, 2008

With your comment, I saw for the first time how the semantic web might be useful.