Hacker News

Any advice on stripping wiki markup to obtain plain text from the wikipedia dump? A friend is doing linguistic research and could benefit from large bodies of text in different languages. Ideally this would be a C# library, but a simple command line tool in any other language would do as well - accept .xml.bz2, strip the wiki markup, return something that's easily processable by further tools in a single file. Thanks in advance.
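For the "accept .xml.bz2" part of the pipeline, a streaming approach avoids decompressing the multi-gigabyte dump to disk. Below is a minimal sketch (my own, not from any of the tools discussed here) that combines Python's stdlib `bz2` and `xml.etree.ElementTree.iterparse` to pull the raw wikitext out of each `<text>` element; the sample XML and the `iter_page_texts` name are illustrative assumptions, and real dumps use a versioned export namespace, which the suffix match below papers over:

```python
import bz2
import io
import xml.etree.ElementTree as ET

def iter_page_texts(fileobj):
    """Stream the wikitext of each <text> element from a MediaWiki XML export."""
    for event, elem in ET.iterparse(fileobj):
        # Dump files qualify tags with an export namespace, e.g.
        # {http://www.mediawiki.org/xml/export-0.10/}text -- match on the suffix.
        if elem.tag.endswith("}text") or elem.tag == "text":
            yield elem.text or ""
            elem.clear()  # release the element so memory stays bounded

# Tiny in-memory stand-in for a real pages-articles.xml.bz2 dump.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Example</title>
    <revision><text>'''Bold''' wikitext here.</text></revision>
  </page>
</mediawiki>"""

# bz2.open also accepts a plain path, so a real dump would be
# bz2.open("enwiki-latest-pages-articles.xml.bz2").
with bz2.open(io.BytesIO(bz2.compress(sample))) as f:
    texts = list(iter_page_texts(f))

print(texts[0])  # -> '''Bold''' wikitext here.
```

The output is still raw wikitext; stripping the markup itself is the harder half of the problem, which the replies below address.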


Since I am using the MediaWiki extracts API, I never had to find/write my own Wikitext parser. However, I did run into a couple in my research that seemed relatively popular:

- https://github.com/dcramer/py-wikimarkup (converts wikitext to HTML using Python; you would then need to extract text with BeautifulSoup or something similar)

- http://wiki.eclipse.org/Mylyn/Incubator/WikiText (also to HTML, but in Java)

- https://github.com/earwig/mwparserfromhell

I'm sure if you did a bit more digging you could find a C# library that does this, or you could roll your own pretty easily using the others as a model.
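To give a sense of what "rolling your own" looks like: a handful of regexes covers the most common constructs, though nothing like the full grammar (templates nest, tables have their own syntax, and so on), which is why a real parser such as mwparserfromhell (whose `parse(...).strip_code()` does this properly) is usually the better bet. This is a rough stdlib-only sketch; the function name and regex coverage are my own choices, not any library's API:

```python
import re

def strip_wikitext(text):
    """Very rough plain-text extraction; real parsers handle far more cases."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                       # drop simple, non-nested {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)    # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                                # ''italic'' / '''bold''' quote runs
    text = re.sub(r"<[^>]+>", "", text)                              # bare HTML-style tags
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)  # == headings ==
    return text

print(strip_wikitext(
    "== History ==\n'''Foo''' is a [[metasyntactic variable|placeholder]].{{cn}}"
))
# -> History
#    Foo is a placeholder.
```

Accuracy aside, this kind of pass-per-construct structure is easy to port to C#, since `System.Text.RegularExpressions.Regex.Replace` works the same way.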


Easy? Haven't they got a Turing-complete language hiding in there?


Thanks!



