On Windows, SysInternals's RamMap is your friend. Also the System Internals book...

On Windows, SysInternals's RamMap is your friend. Also the System Internals book (#5 I think).

Every file on Windows keeps ~1024 bytes of info in the file cache. The more the files the more cache would be used.

Recent finding, that sped up our systems from 15->3sec on 300,000+ files filestamp check was to move from _stat to GetFileAttributesEx.

One would not think of doing such things, after all the C api is nice, open, access, bread, _stat are almost all there, but some of these functions do a lot of CPU intensive work (and one is not aware, until just a little bit of disassembly is done).

For example _stat does lots of divisions, dozen of kernel calls, strcmp, strchr, and few other things. If you have Visual Studio (or even Express) the CRT source code is included for one to see.

access() for example is relatively "fast", in terms that there is just mutex (I think, I'm on OSX right now), and then calling GetFileAttributes.

And back to RamMap - it's very useful in the sense that it shows you which files are in the cache, and what portion of them, also very useful that it can flush the cache, so one can run several tests, and optimize for hot-cache and cold-cache situations.

Few months ago, I came up with a scheme, borrowing idea from mozilla - where they would just pre-read certain DLLs that would eventually came up to be loaded (in a second thread).

I did the same for one our tools, the tool is single threaded, reads, operates, then writes. And it usually reads 10x more than it writes. So while the process operation was able to get multi-threaded through OpenMP, reading was not, so instead I had a list of what to read ahead, in a second thread, so that when it comes to the first thread, and it wanted to read, it was taking it from the cache. If the pre-fetcher was behind, it was skipping ahead. There was not even need to keep the contents in memory, just enough to read, and that's it.

For some other tool, where reading patterns cannot be found easily (deep-tree hierarchy) I've made something else instead - saving in a binary file what was read before for the given set of command-line arguments (filtering some). Later that was reused. It cut down on certain operations 25-50%.

One lesson I've learned though is to let the I/O do it's job from one thread... Unless everyone has some proper OS, with some proper drivers, with some proper HW, with....

Tim Bradshaws' widefinder, and widefinder2 competition had also good information. The guy that win it, has on his site some good analysis of multi-threaded I/O (can't find the site now),... But the analysis was basically that it's too erratic - sometimes you get speedups, but sometimes you get slowdowns (especially with writes).