i still fail to see the enormous fuss about hash tables as if they are some revelatory data structure.
the special cases where you can construct perfect hashes are normally amenable to assigning a fixed index 'hash' to each item and storing the items in an array, then never referring to their values for lookups in code, but always using the indices.
in heavily compiled languages like C and C++, when you know everything at compile-time the run-time cost should be zero. anything more is a failure to use the language to sufficiently inform the compiler of what is happening.
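a minimal sketch of that compile-time-index idea (names here are hypothetical, just for illustration): each "key" gets a fixed enum index, flat arrays are keyed by that index, and code only ever passes the index around.

```c
/* hypothetical sketch: when every "key" is known at compile time, give each
 * one a fixed index via an enum and key flat arrays by that index; code
 * passes the index around and never hashes or compares the original value */
enum colour { COLOUR_RED, COLOUR_GREEN, COLOUR_BLUE, COLOUR_COUNT };

static const char *colour_names[COLOUR_COUNT] = { "red", "green", "blue" };
static const unsigned colour_rgb[COLOUR_COUNT] = { 0xFF0000u, 0x00FF00u, 0x0000FFu };

/* the "lookup" is a single indexed load -- nothing is computed at run time */
unsigned rgb_of(enum colour c) { return colour_rgb[c]; }
```

the point being that there is no hash function at all, perfect or otherwise: the compiler resolves everything, which is the O(0) claim above.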
i suspect this is why perfect hashes do not see widespread use: there is a solution for the most common use cases with the exceptional run-time complexity of O(0).
this is enormously faster than the necessarily more complicated perfect hash functions, and only falls down when you have a run-time requirement to do the lookup from the value instead of the key... which in my experience goes hand in hand with not knowing your data until run-time, which rules out a perfect hashing function.
in most run-time cases a binary tree is good enough, it's very rare that you need better, and even when you do, the optimisation is usually found by tuning that structure, e.g. pool-allocating its nodes. the usual array-of-linked-lists approach can be better for small data sets too... but a perfect hash function isn't even applicable.
all that being said, it is the right solution in those exceptionally rare cases where you know your data ahead of time, but need to look it up by value, rather than hash, at run-time.
Constrained input that is processed according to what the program knows about each "key" it contains, with keys defined in advance and effectively specifying a data structure or file format, is actually very common.
Example: HTML, XML and similar kinds of text markup; tagged file formats, e.g. PNG; the traditional application to programming language keywords.
right, that's a good use case, although i'm not sure it's terribly important. the slow part of processing a png file is not identifying its contents... although this may be different with XML and HTML parsers. i'm still not sure i'd call it common, or even desirable.
for lots of chunk-based formats, using a switch statement on the chunk ids is clean and hands the optimisation task to the compiler, which tends to do a pretty good job of these things... even if it generates a gigantic conditional statement. since these ids are 4 or 8 bytes long, the 'memcmp is slow' argument falls apart.
in fact this sort of switch statement usage is such a good optimisation here that it is used by this perfect hash mechanism itself to solve its own problem, effectively offloading some of the cleverness requirement to the compiler...
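a rough sketch of what that looks like for a png-style format (the enum names are made up, the four-byte ids are the real png critical chunks): pack the 4-byte id into a uint32_t and switch on integer constants, letting the compiler pick the dispatch strategy.

```c
#include <stdint.h>

/* build a 32-bit key from four bytes at compile time; because the bytes are
 * packed the same way at run time below, the comparison is endian-neutral */
#define FOURCC(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* hypothetical classification; the chunk ids themselves are png's */
enum chunk_kind { CHUNK_HEADER, CHUNK_PALETTE, CHUNK_DATA, CHUNK_END, CHUNK_UNKNOWN };

enum chunk_kind classify_chunk(const unsigned char id[4])
{
    /* one 32-bit key, one switch -- no memcmp, no hash function */
    uint32_t key = (uint32_t)id[0] | ((uint32_t)id[1] << 8)
                 | ((uint32_t)id[2] << 16) | ((uint32_t)id[3] << 24);
    switch (key) {
    case FOURCC('I', 'H', 'D', 'R'): return CHUNK_HEADER;
    case FOURCC('P', 'L', 'T', 'E'): return CHUNK_PALETTE;
    case FOURCC('I', 'D', 'A', 'T'): return CHUNK_DATA;
    case FOURCC('I', 'E', 'N', 'D'): return CHUNK_END;
    default:                         return CHUNK_UNKNOWN;
    }
}
```

the compiler is free to lower that switch into a jump table, a comparison tree, or whatever else it judges fastest, which is exactly the "offload the cleverness" point.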
also, in the context of programming languages this problem falls into the category i described of being beatable at compile-time. usually this is implemented by tokenising (hashing pieces of) all the string input then only ever using the tokens (hashes, indices) outside of that context. whatever lookups need doing can then become indexing into a nice flat array defined at compile-time.
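a toy sketch of the tokenise-once idea (all names hypothetical): the keyword strings are compared exactly once, at tokenisation time; everything downstream sees only the token index and any per-keyword data is a flat compile-time array indexed by it.

```c
#include <string.h>

/* hypothetical token set; real lexers are bigger but identical in shape */
enum token { TOK_IF, TOK_ELSE, TOK_WHILE, TOK_RETURN, TOK_IDENT, TOK_COUNT };

static const char *keywords[] = { "if", "else", "while", "return" };

/* the only place strings are ever compared; a linear scan here for brevity */
enum token tokenise(const char *lexeme)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return (enum token)i;
    return TOK_IDENT;
}

/* per-token attributes become flat arrays indexed by the token -- the
 * "lookup" after tokenisation is just an indexed load */
static const int starts_statement[TOK_COUNT] = { 1, 0, 1, 1, 0 };
```

after `tokenise` runs, `starts_statement[tok]` and friends are the only kind of lookup left, which is the compile-time-defined flat array described above.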
in the other cases where lookups are needed, they are for things like variables and constants, which are not known at compile-time, and the perfect hash cannot help...
you might also deliberately want hash collisions in cases where languages have aliases for keywords, so that every spelling maps to the same token
sure, you could lift some of the performance cost of the tokenisation itself outside of run-time with that, but tokenising programming language input is not the dominant cost of compiling code.