
It catches most dupes, just not on sites like the NYT that serve the same story under so many different URLs.


Maybe you could also check whether the title of the page is the same, not as automatic detection but as a reason to ask, "Are you sure this isn't the same as foo?" This might prevent most of the NYTimes dupes.
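
A rough sketch of that title check, in Python. Everything here is hypothetical: the function names, the (url, title) pairs, and the normalization rules are assumptions for illustration, not how the site actually works. It only returns candidates to ask the submitter about; it does not reject anything on its own.

```python
import re

def normalize_title(title):
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", title.lower())
    return " ".join(cleaned.split())

def possible_dupes(new_title, existing):
    """Return the (url, title) pairs whose normalized title matches new_title.

    `existing` is a hypothetical list of (url, title) pairs for
    previously submitted stories.
    """
    key = normalize_title(new_title)
    return [(url, title) for url, title in existing
            if normalize_title(title) == key]
```

If possible_dupes returns anything, the submit form could show the matching stories and ask for confirmation before posting.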


Would it be that hard to just keep a full-text index of each page that hits the hot page? From there, just flag anything above some N% similarity (probably 98 or so, since text ads can change the page a little bit).
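
One way to sketch the similarity idea, in Python. This uses Jaccard similarity over three-word shingles, which is just one common choice and not necessarily what a real index would do. The function names, the shingle size, and the (url, text) pairs are all assumptions, and it presumes the page text has already been fetched and stripped of markup.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word sequences that appear in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty pages count as identical
    return len(sa & sb) / len(sa | sb)

def near_duplicates(new_text, indexed, threshold=0.98):
    """Return URLs from `indexed` whose text is at least `threshold` similar.

    `indexed` is a hypothetical list of (url, page_text) pairs.
    """
    return [url for url, text in indexed
            if similarity(new_text, text) >= threshold]
```

The threshold is a parameter because 98% is only a guess. On short pages a single changed ad can shift the score by more than a couple of percent, so the right cutoff would need tuning against real pages. Comparing against every stored page is also linear in the index size, which is fine for a hot-page-sized set but not for a whole archive.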


With your comment, I saw for the first time how the semantic web might be useful.



