Hacker News

Any advice on stripping wiki markup to obtain plain text from the wikipedia dump? A friend is doing linguistic research and could benefit from large bodies of text in different languages. Ideally this would be a C# library, but a simple command line tool in any other language would do as well - accept .xml.bz2, strip the wiki markup, return something that's easily processable by further tools in a single file. Thanks in advance.
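For the "accept .xml.bz2" part of the pipeline, a streaming approach avoids decompressing the multi-gigabyte dump to disk. Below is a minimal sketch (my own, not from any of the tools discussed here) that combines Python's stdlib `bz2` and `xml.etree.ElementTree.iterparse` to pull the raw wikitext out of each `<text>` element; the sample XML and the `iter_page_texts` name are illustrative assumptions, and real dumps use a versioned export namespace, which the suffix match below papers over:

```python
import bz2
import io
import xml.etree.ElementTree as ET

def iter_page_texts(fileobj):
    """Stream the wikitext of each <text> element from a MediaWiki XML export."""
    for event, elem in ET.iterparse(fileobj):
        # Dump files qualify tags with an export namespace, e.g.
        # {http://www.mediawiki.org/xml/export-0.10/}text -- match on the suffix.
        if elem.tag.endswith("}text") or elem.tag == "text":
            yield elem.text or ""
            elem.clear()  # release the element so memory stays bounded

# Tiny in-memory stand-in for a real pages-articles.xml.bz2 dump.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Example</title>
    <revision><text>'''Bold''' wikitext here.</text></revision>
  </page>
</mediawiki>"""

# bz2.open also accepts a plain path, so a real dump would be
# bz2.open("enwiki-latest-pages-articles.xml.bz2").
with bz2.open(io.BytesIO(bz2.compress(sample))) as f:
    texts = list(iter_page_texts(f))

print(texts[0])  # -> '''Bold''' wikitext here.
```

The output is still raw wikitext; stripping the markup itself is the harder half of the problem, which the replies below address.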


Since I am using the MediaWiki extracts API, I never had to find/write my own Wikitext parser. However, I did run into a couple in my research that seemed relatively popular:

- https://github.com/dcramer/py-wikimarkup (converts wikitext to HTML using Python; you would then need to extract text with BeautifulSoup or something similar)

- http://wiki.eclipse.org/Mylyn/Incubator/WikiText (also to HTML, but in Java)

- https://github.com/earwig/mwparserfromhell

I'm sure if you did a bit more digging you could find a C# library that does this, or you could roll your own pretty easily using the others as a model.
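To give a sense of what "rolling your own" looks like: a handful of regexes covers the most common constructs, though nothing like the full grammar (templates nest, tables have their own syntax, and so on), which is why a real parser such as mwparserfromhell (whose `parse(...).strip_code()` does this properly) is usually the better bet. This is a rough stdlib-only sketch; the function name and regex coverage are my own choices, not any library's API:

```python
import re

def strip_wikitext(text):
    """Very rough plain-text extraction; real parsers handle far more cases."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                       # drop simple, non-nested {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)    # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                                # ''italic'' / '''bold''' quote runs
    text = re.sub(r"<[^>]+>", "", text)                              # bare HTML-style tags
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)  # == headings ==
    return text

print(strip_wikitext(
    "== History ==\n'''Foo''' is a [[metasyntactic variable|placeholder]].{{cn}}"
))
# -> History
#    Foo is a placeholder.
```

Accuracy aside, this kind of pass-per-construct structure is easy to port to C#, since `System.Text.RegularExpressions.Regex.Replace` works the same way.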


Easy? Haven't they got a Turing-complete language hiding in there?


Thanks!



