They don't mention BM25, which still outperforms much of semantic search. A fun ...

salamo · on June 15, 2024

I've had pretty good success with BM25 + stemming, or even easier, BM25 with trigram tokenization. If the index isn't too big, the whole search can be done client-side and is lightning fast.
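For anyone curious what BM25 with trigram tokenization might look like, here is a minimal in-memory sketch. It is not the setup described above: the `trigrams` helper, the `BM25Index` class, and the `k1`/`b` defaults are illustrative assumptions, and the IDF uses the common "+1 inside the log" variant so scores stay non-negative.

```python
import math
from collections import Counter


def trigrams(text):
    """Lowercase, collapse whitespace, and split into overlapping character trigrams."""
    s = " ".join(text.lower().split())
    if len(s) < 3:
        return [s] if s else []
    return [s[i:i + 3] for i in range(len(s) - 2)]


class BM25Index:
    """Hypothetical toy index: BM25 scoring over character trigrams."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        # Per-document trigram counts and lengths (in trigrams).
        self.tfs = [Counter(trigrams(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lengths) / len(docs) if docs else 0.0
        # Document frequency: in how many documents each trigram occurs.
        df = Counter()
        for tf in self.tfs:
            df.update(tf.keys())
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, dl = self.tfs[i], self.lengths[i]
        total = 0.0
        for t in set(trigrams(query)):
            f = tf.get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=5):
        """Return indices of matching documents, best score first."""
        scored = [(self.score(query, i), i) for i in range(len(self.tfs))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda p: (-p[0], p[1]))
        return [i for _, i in scored[:top_k]]


docs = [
    "the quick brown fox",
    "lazy dogs sleep all day",
    "searching with trigram tokenization",
]
index = BM25Index(docs)
print(index.search("tokenizing"))
```

Because "tokenizing" and "tokenization" share trigrams such as "tok" and "eni", the query can match without a stemmer, which is the appeal of the trigram approach. The same logic is small enough to port to client-side code.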

mhuffman · on June 15, 2024

Isn't the big problem that BM25 (and friends) will help you find (and rank) exact search terms (or stemmed varieties of that search term), whereas semantic search can typically find items out-of-dictionary but "close" semantically? SPLADE, on my reading of it, seems to do a "pre-materialization" of the out-of-dictionary part.

VivaLaPanda · on June 15, 2024

It's worse than SPLADE: https://i.imgur.com/oGliEIg.png

We've tested various hybrid approaches as well, but that's too much to go into in one post.

mozman · on June 15, 2024

what’s that output from and how can i understand it? i’m happy to rtfm - just want a pointer. thanks!

gregw134 · on June 16, 2024

It's various measures of recall rate. Recall@500 is the percentage of queries for which the target document shows up in the top 500 results from the retrieval system.
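As a rough sketch of that definition, assuming exactly one target document per query: the function name `recall_at_k` and the toy document IDs below are made up for illustration and are not taken from the linked screenshot.

```python
def recall_at_k(ranked_results, targets, k):
    """Fraction of queries whose target document appears in the top k results.

    ranked_results: one ranked list of document IDs per query (best first).
    targets: the single relevant document ID for each query (an assumption).
    """
    if len(ranked_results) != len(targets):
        raise ValueError("need exactly one target per query")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(ranked_results, targets)
        if target in ranked[:k]
    )
    return hits / len(targets)


# Toy data: three queries, each with a ranked result list and one target.
ranked = [
    ["d3", "d1", "d7"],
    ["d2", "d9", "d4"],
    ["d5", "d6", "d8"],
]
targets = ["d1", "d4", "d0"]

for k in (1, 2, 3):
    print(k, recall_at_k(ranked, targets, k))
```

Recall@k can only stay the same or go up as k grows, which is why numbers like Recall@500 look high even for weak rankers; the smaller cutoffs are usually the more telling ones.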

throwaway81523 · on June 15, 2024

I found BM25 and everything resembling it (like TF-IDF) to be near useless. Back in the day, it was really necessary to use external semantic info, or at least data gathered by examining the whole document set for signals beyond term frequency. I was excited by the first part of the SPLADE article because I thought it was going to use LLMs to somehow find concept embeddings in documents and let you search for those. But as someone said, it turns out to be a version of synonym search, except the thesaurus is generated automatically. I remember someone did that with Word2Vec some years back and it was sort of useful, but generally the problem with search systems is too many results rather than missing some that are relevant.