
The issue I was having was with the query "term+wikipedia": it shows the Wikipedia article in Czech, Hungarian, Russian, some form of Arabic, and other languages before finally showing the English version. Many of those results also appear 2, 3, or 4+ times with the same URL, differing only in crawl time by a few minutes.


It's a difficult problem to fix. You can set an Accept-Language header on crawl requests, but this only works if the target website uses "Content Negotiation." Some sites ignore the header and determine the language from the IP address (Geo-IP) or the URL structure (e.g., /es/ vs /en/). Basically a mess...
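To make the two signals above concrete, here is a minimal sketch, not anyone's actual crawler code. The function names `build_crawl_request` and `language_hint_from_url` are hypothetical, and the URL check is only a heuristic: it assumes a two-letter code in the first path segment or the leftmost subdomain, which will misfire on some sites.

```python
import re
from typing import Optional
from urllib.parse import urlparse
from urllib.request import Request

# Matches tags such as "en", "es" or "pt-br" (a heuristic, not full BCP 47).
_LANG_SEGMENT = re.compile(r"^[a-z]{2}(?:-[a-z]{2})?$", re.IGNORECASE)


def build_crawl_request(url: str, language: str = "en") -> Request:
    """Build a request that states a language preference.

    The header is only a hint: servers without content negotiation ignore it.
    """
    return Request(url, headers={"Accept-Language": f"{language};q=1.0, *;q=0.1"})


def language_hint_from_url(url: str) -> Optional[str]:
    """Guess a language from the URL structure, or return None."""
    parsed = urlparse(url)

    # Subdomain convention, e.g. cs.wikipedia.org
    host_parts = (parsed.hostname or "").split(".")
    if len(host_parts) >= 3 and _LANG_SEGMENT.match(host_parts[0]):
        return host_parts[0].lower()

    # Path convention, e.g. example.com/es/page
    segments = [s for s in parsed.path.split("/") if s]
    if segments and _LANG_SEGMENT.match(segments[0]):
        return segments[0].lower()

    return None
```

Neither signal covers Geo-IP based sites, which is why the URL hint is best treated as one input among several rather than an answer.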


I don't see the problem you describe. You crawl something and get a document in whatever language the site delivers to you. You know the language of that document from its lang=... attribute. Which results you show for a given language is under your control and is not influenced by what the crawled site chose to serve to the crawler.
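Reading the declared language is straightforward with the standard library. The sketch below assumes the attribute sits on the root `<html>` element; `declared_language` is a hypothetical helper name, and it returns None when the attribute is missing or empty, which is exactly the case where this approach gives no answer.

```python
from html.parser import HTMLParser
from typing import Optional


class _LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()


def declared_language(html_text: str) -> Optional[str]:
    """Return the language the document declares, or None if it declares none."""
    parser = _LangAttrParser()
    parser.feed(html_text)
    return parser.lang or None
```

The value is whatever the site author wrote, so it says what the page claims to be, not what the body text actually is.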


I'm working on the language improvements presently, but I need to clean out a lot of bad entries in my index. In essence what I am trying to say is many servers ignore "Accept-Language" so you have to rely on other means of detecting the language of the page reliably, e.g. inspecting the body content of the response. It's a non-trivial problem online.
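As a rough illustration of what "inspecting the body content" can mean, here is a deliberately naive stopword counter. It is a sketch under stated assumptions, not the method used by the search engine discussed here: the `STOPWORDS` table, the three languages in it, and the `min_hits` threshold are all made up for the example, and a real index would more likely use a trained language detector.

```python
import re
from typing import Optional

# Tiny illustrative stopword lists; far too small for production use.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu", "mit", "von"},
    "es": {"el", "la", "los", "que", "y", "en", "es", "una", "por", "con"},
}


def guess_language(text: str, min_hits: int = 3) -> Optional[str]:
    """Return the language whose stopwords occur most often in the text.

    Returns None when even the best candidate has fewer than min_hits
    matches, so short or unknown-language text is not mislabeled.
    """
    # Runs of letters only (Unicode-aware), lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

A guess like this can then be compared with the declared lang attribute, and disagreements flagged as candidates for the index cleanup.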


So html lang=... is wrong, or doesn't exist?

> I am trying to say is many servers ignore "Accept-Language"

I wouldn't have expected that to be a hard rule. It's more of a factor: when there are multiple pages to return, it helps decide which one the user most likely wants.
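Treating the header as a weighting factor rather than a rule could look like the sketch below. The helper names `parse_accept_language` and `pick_variant` are hypothetical, and the matching is simplified: it compares primary language prefixes and falls back to a `*` entry, which is looser than the matching schemes defined for HTTP.

```python
from typing import List, Optional, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Parse 'en-US,en;q=0.9,cs;q=0.2' into (tag, weight) pairs."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        weight = 1.0  # a tag without an explicit q value counts as 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat a malformed weight as "not wanted"
        prefs.append((tag.strip().lower(), weight))
    return prefs


def pick_variant(available: List[str], header: str) -> Optional[str]:
    """Pick the available language with the highest weight in the header.

    If nothing matches, every variant scores 0.0 and the first one wins,
    so some page is still returned.
    """
    prefs = parse_accept_language(header)

    def weight(lang: str) -> float:
        lang = lang.lower()
        specific = [
            q for tag, q in prefs
            if tag == lang or tag.startswith(lang + "-") or lang.startswith(tag + "-")
        ]
        if specific:
            return max(specific)
        wildcard = [q for tag, q in prefs if tag == "*"]
        return max(wildcard) if wildcard else 0.0

    if not available:
        return None
    return max(available, key=weight)
```

With variants `["cs", "hu", "en"]` and the header `en-US,en;q=0.9,cs;q=0.2`, the English page wins; with only `["hu", "ru"]` available and a header of `en`, a page is still served, which is the soft behavior described above.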



