Maybe you could also check whether the page title is the same, not as automatic detection, but as a reason to ask "are you sure this isn't the same as foo?" That might prevent most of the NYTimes dupes.
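Something like this rough sketch is what I have in mind for the title check. The names `normalize_title` and `possible_dupes` are made up for illustration, and I'm assuming the site can hand over a mapping of story id to page title; nothing here reflects how the site actually stores submissions.

```python
import re

def normalize_title(title):
    """Lowercase and collapse punctuation/whitespace so trivial differences don't matter."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def possible_dupes(new_title, existing):
    """Return ids of existing stories whose normalized title matches new_title.

    `existing` is a hypothetical mapping of story id -> page title.
    """
    key = normalize_title(new_title)
    return [sid for sid, title in existing.items() if normalize_title(title) == key]

existing = {
    1: "Google Buys YouTube: New York Times",
    2: "Some Unrelated Story",
}
matches = possible_dupes("google buys youtube - new york times", existing)
if matches:
    # Only a prompt for the submitter, never an automatic rejection.
    print("Are you sure this isn't the same as story", matches[0], "?")
```

The match only triggers a question to the submitter, so a false positive costs one extra click rather than a lost submission.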
Would it be that hard to just take a fulltext index of each page that hits the hot page? From there, just show anything with more than N% similarity (N probably 98 or so, since text ads can change the page text a little bit).
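As a minimal sketch of the similarity idea, here is one way to score two page texts. I'm swapping in Jaccard similarity over 3-word shingles as the measure, which is my own choice and not necessarily what a real fulltext index would do; `shingles`, `similarity`, `similar_pages`, and the `indexed` mapping are all hypothetical names.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts count as identical
    return len(sa & sb) / len(sa | sb)

def similar_pages(new_text, indexed, threshold=0.98):
    """Return ids of indexed pages at least `threshold` similar to new_text.

    `indexed` is a hypothetical mapping of page id -> extracted page text.
    """
    return [pid for pid, text in indexed.items()
            if similarity(new_text, text) >= threshold]
```

One caveat with this measure: a single changed word touches up to three shingles, so a rotating text ad lowers the score more than its word count suggests. The 98 figure would need tuning against real pages, and comparing every new page against every indexed page gets slow as the index grows.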