The frequency data used in JMDict is a little dated (1998) and biased (it's based exclusively on newspapers, which tend to use specific vocabulary), which I guess is why you mentioned it was problematic. However, there are more recent datasets available on the Monash FTP archive: http://ftp.monash.edu.au/pub/nihongo/ . For example, there's a 2008 dataset built from blog entries on goo.ne.jp, and another built from novels. They could be good additions to your corpus (fwiw, 自問自答 has 18038 occurrences in the goo.ne.jp dataset and 68 in the novels one).

Doing similar work with current data from the net would certainly be useful too. I wish there were some regular scraping being done, so that we could always use fresh data. Hell, I wish Google, Bing or any other search engine would just publish word frequencies from their crawler data (and not just for Japanese).
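For what it's worth, the counting part of such a project is simple once you have the text. Here's a minimal Python sketch that counts occurrences of fixed surface forms (like 自問自答) in a pile of scraped documents; the toy corpus is made up, and matching on the exact surface form with a regex is a shortcut that avoids proper Japanese segmentation (a real frequency list would want MeCab or fugashi to tokenize first):

```python
import re
from collections import Counter

def count_word_occurrences(texts, words):
    """Count how often each fixed surface form appears across a list of texts.

    This counts raw substring matches, which works for multi-character
    compounds like 自問自答 but would over- or under-count for words that
    need real morphological segmentation.
    """
    counts = Counter()
    for text in texts:
        for word in words:
            counts[word] += len(re.findall(re.escape(word), text))
    return counts

# Toy stand-in for scraped blog entries (hypothetical data).
corpus = [
    "彼は自問自答を繰り返した。",
    "自問自答の末、答えが出た。",
]

print(count_word_occurrences(corpus, ["自問自答"]))  # → Counter({'自問自答': 2})
```

A regular scraping job would just feed fresh pages into something like this and periodically dump the counter to disk.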