Any advice on stripping wiki markup to obtain plain text from a Wikipedia dump? A friend is doing linguistic research and could benefit from large bodies of text in different languages. Ideally this would be a C# library, but a simple command-line tool in any other language would do as well: accept a .xml.bz2 dump, strip the wiki markup, and return a single file that's easily processable by further tools. Thanks in advance.
Since I use the MediaWiki extracts API, I've never had to find or write my own wikitext parser. However, in my research I did run into a couple of parsers that seemed relatively popular.
I'm sure that with a bit more digging you could find a C# library that does this, or you could roll your own fairly easily using one of those parsers as a model.
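To give a sense of what "rolling your own" looks like, here is a minimal sketch in Python (not C#, but easy to port). It is only a rough regex-based stripper handling the most common constructs (templates, links, bold/italic quotes, headings, inline HTML); real wikitext has nested templates and many edge cases that a dedicated parser handles far better. The `iter_plain_text` helper shows one way to stream pages straight out of a pages-articles `.xml.bz2` dump without decompressing it to disk.

```python
import bz2
import re
import xml.etree.ElementTree as ET

def strip_wiki_markup(text: str) -> str:
    """Very rough wikitext-to-plain-text conversion (sketch, not a full parser)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                      # {{templates}} (non-nested)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                               # '''bold''' / ''italic''
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M) # == headings ==
    text = re.sub(r"<[^>]+>", "", text)                             # inline HTML/ref tags
    return text

def iter_plain_text(dump_path):
    """Yield stripped article text for each page in a pages-articles .xml.bz2 dump."""
    with bz2.open(dump_path, "rb") as f:
        for _event, elem in ET.iterparse(f):
            # Tag names carry the MediaWiki export namespace, so match by suffix.
            if elem.tag.endswith("}text") and elem.text:
                yield strip_wiki_markup(elem.text)
            elem.clear()  # free memory as we stream

sample = "'''Paris''' is the [[capital city|capital]] of [[France]].{{citation needed}}"
print(strip_wiki_markup(sample))
```

Running the sample prints `Paris is the capital of France.` A single pass of `iter_plain_text` over a dump, with the results written to one output file, would give your friend exactly the kind of corpus described in the question.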