
You'd be correct. The largest portion of all languages in Common Crawl (aka the "whole open internet" training corpus) is English with 43%. No other language even reaches double-digit percentages. The next biggest one is Russian at 6%, followed by German at 5%.


I wonder where you are getting your data. According to Wikipedia, Russian is #7: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

The only place where Russian is in the top 5 is Wikipedia views. The Russian part of the internet is steadily shrinking, as Russian imperialism crumbles.


> The largest portion of all languages in Common Crawl

https://commoncrawl.github.io/cc-crawl-statistics/plots/lang...


Thanks!

I wonder where this discrepancy comes from.


Probably under-indexing of non-English sources by these crawlers.

Would be interesting if Yandex opened some data sets!


And lots of people write on the web using English as a second language, which both reduces the presence of their native language and increases the presence of English.


Yep, not a native English speaker here, and yet my online footprint is mostly English due to software pushing me to learn it.


My guess is that reference counting at depth=1 only captures non-$LANG content whose text parts don't matter much, e.g. photo galleries.



