I'd say Postgres + pgvector is even simpler if you're doing small-scale document search (e.g. internal knowledge bases, documentation sources, codebase indexing, etc.).
pgvector is even supported out of the box on Azure and AWS RDS.
Just spin up a docker container [1], add a vector column to your table and you're ready for embedding search.

[1] https://hub.docker.com/r/ankane/pgvector
If you're starting out with a prototype, do yourself a favour and steer clear of the chromadb examples with langchain. In fact, steer clear of langchain in general :) Just go with the OpenAI API and PostgreSQL + pgvector. You'll have to write some boilerplate, but the stuff in langchain is just terrible; you'll end up rewriting it and doing the boilerplate at some point anyway, and this stack is super simple to deploy.
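For a sense of how small that boilerplate actually is, here's a rough sketch of the whole loop. The `documents` table, connection string and `text-embedding-ada-002` model are illustrative, and it assumes the 1.x `openai` Python client plus `psycopg2`:

```python
# Sketch: embed text with the OpenAI API and search it with pgvector.
# Table/column names, connection string and model are illustrative.
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
conn = psycopg2.connect("dbname=app user=app")

def embed(text):
    # text-embedding-ada-002 returns 1536-dimensional vectors
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def to_vector_literal(vec):
    # pgvector accepts vectors as '[x1,x2,...]' text literals
    return "[" + ",".join(str(x) for x in vec) + "]"

with conn, conn.cursor() as cur:
    # One-time setup: enable the extension and add a vector column
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("ALTER TABLE documents ADD COLUMN IF NOT EXISTS embedding vector(1536)")

    # Index a document
    doc = "How we rotate API keys"
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
        (doc, to_vector_literal(embed(doc))),
    )

    # Nearest-neighbour search (<-> is L2 distance in pgvector)
    cur.execute(
        "SELECT content FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
        (to_vector_literal(embed("key rotation")),),
    )
    print(cur.fetchall())
```

That's essentially the entire stack: one table, one extension, one API call per document.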
Now that OpenSearch has IVF-PQ, and I assume pgvector will get it within six months, the current typical case for most teams, big or small, should be handled. Regular databases are all adding it, further chiseling away at the need for dedicated vector DB startups. This seems consistent with what we guessed would happen at the bottom of https://gradientflow.com/the-vector-database-index/ .
AFAICT standalone optimized vector stores will still have their place, e.g. in super latency-sensitive or high-throughput scenarios. Unfortunately for many stakeholders, it's unclear whether there are venture-scale returns to justify the megarounds from the peak VC FOMO of the last few years. The TBD hail mary here may be generative AI: even if most modern vector search workloads are largely fine with regular DB extensions that support a few kinds of vector indexes, the continued growth of generative AI, knowledge graphs, etc. may somehow grow the pie for non-standard DBs here. That's not obvious to me. For example, with Databricks LakehouseIQ, a lot of the use case may be eaten by the data warehouses.
"Just spin up a docker container" is a self-contradicting sentence. For non-complex applications, anything needing to touch containers is itself too complicated.
Even with pgvector, there's no good way to write a simple tutorial for embedding newbies.
Anything involving Lucene is non-trivial from the start.
I'm coming at this from a perspective of a competent dev trying to build a tool with OpenAI APIs - which, from what I can tell, is a growing topic.
If you're experienced with building APIs and new to the LLM stuff - skip the langchain and chromadb nonsense. Just use the OpenAI APIs and pgvector.
Chromadb and langchain are useful when writing notebook prototypes to get an idea of how this stuff works - but discard them immediately after that phase and save yourself the trouble of porting later.
I recently spun up a MySQL instance outside a container for what I hope is the last time.
MySQL is very attached to /etc/mysql, /var/log/mysql and /var/lib/mysql, which makes sense if one thinks of it as a piece of a distribution and makes no sense if one thinks of it as a service that stores data in a filesystem or directory that one sets up for the purpose. Apparmor and (don’t get me started) SELinux dig this in deeper. What if you want two MySQLs on one host? What if you don’t want to mount something on /var/lib/mysql?
If mysqld were invoked by pointing it at a configuration and data and it just worked, I’d be more okay with it.
The fact that I really don’t want to couple upgrades of MySQL to distro upgrades is just icing on the cake.
For development, perhaps. For production, absolutely not.
I wish HN had some bot that would just delete any comment from people recommending installing databases. Because 99.9% of the time it's from those who have no experience running one in a production environment. Keeping it secure and ensuring the backup/restore cycle works is seriously non-trivial.
What's wrong with Postgres + pgvector in production? pgvector is supported by RDS. Keeping an RDS instance running in production isn't exactly rocket science.
It's no match for Lucene (in practice ES or Solr) in terms of performance and features, as it's a very different model of indexing and operating.
Keeping it running while thousands of customers run search queries AND the app uses the same db AND users want faceting or other features is not an option.
I think if you're past pgvector performance you won't be listening to a random guy talking about pgvector; you'll already have a good understanding of the space.
If you're new (like I was a few months ago), save yourself the time I wasted on the noob traps I mentioned. pgvector scales way better than the OpenAI API does for my use cases.
Lucene (or rather Elastic/OpenSearch) is way overkill for my needs.
But for an internal knowledge base, backup/restore may be irrelevant (the documents are all copies of data held elsewhere and the database can be reconstructed quickly at will), and security is really not that difficult with a single system user.
Security and backups are important regardless of the technology you’re using.
Managed Postgres is a thing, probably much more common than managed Lucene. I can name three managed Postgres providers off the top of my head (AWS, Vultr, Supabase).
Right. So now my existing stack needs a backup strategy that's independent of the database. That's really tough in production; Lucene is adding complexity.
To be clear I object to this:
> I wish HN had some bot that would just delete any comment from people recommending installing databases.
Most applications already use a database. Pooh-poohing Postgres as a solution is silly. For many people, it’s actually going to be less complex to take to production than installing a library, especially one with its own complex storage needs.
It's an opinionated blog post published on Arxiv, masquerading as research.
IMHO, it's a gigantic self-own and doesn't promote Lucene in a good way. For example, by demonstrating how they get only 10 QPS out of a system with 1 TB of memory and 96 vCPUs (after 4 warmups).
The HNSW implementation in Lucene is fair, and within the same order of magnitude as others. But to get comparable performance you must merge all immutable segments into a single segment, which all Lucene-oriented benchmarks do, but which is not that realistic for many production workloads where docs are updated/added in near real-time.
> but which is not that realistic for many production workloads where docs are updated/added in near real-time.
it really depends on how real time you need the search to be tho.
What I've seen is a green/blue Lucene index. The updates happen on one (let's say the blue), while searches happen on the other (green). The segment merging happens periodically for the blue (or, even smarter, after some known combination of elapsed time and number of updates), and then the indexes are switched. Depending on how often new documents come in and how "real time" you need to be, this may be sufficient.
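A rough sketch of that swap with the Elasticsearch Python client (index and alias names are made up, and it assumes a recent 8.x elasticsearch-py; the same pattern works on OpenSearch or raw Lucene):

```python
# Sketch of a blue/green index swap: force-merge the idle index,
# then atomically repoint the search alias. Names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def promote(idle="docs-blue", live="docs-green", alias="docs-search"):
    # Collapse the freshly updated index to a single segment so HNSW
    # search only has to traverse one graph.
    es.indices.forcemerge(index=idle, max_num_segments=1)

    # Atomically switch the alias: searches now hit the merged index,
    # and the previously live index becomes the next write target.
    es.indices.update_aliases(actions=[
        {"remove": {"index": live, "alias": alias}},
        {"add": {"index": idle, "alias": alias}},
    ])
```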
I gotta be honest, I find it almost a little disrespectful that everyone started naming their shit "x is all you need" even for very mundane stuff.
"Attention Is All You Need" was a breakthrough paper. It fundamentally changed the ML landscape and got us past a huge roadblock with RNNs.
If you seriously think you have something similarly impactful on your hands, then sure go ahead with that name.
But there's been a bunch of papers where I found it distasteful. At best it's just not funny. But this isn't even really much of a paper. I've seen blog posts with more substance. Hell, even YouTube videos.
I don’t know, I guess I just don’t really get the joke.
You should write a paper: "all you need considered harmful"
I don't think referencing well-known papers has ever (well, maybe not ever) implied that the authors feel their work is on par with the original. It's a pretty common practice in some academic circles when there is an impactful paper with a catchy name and you simply want to pay a bit of homage and have a less boring title than you might have otherwise.
"Attention is all you need" was a silly name and most people who name their stuff are poking fun at it. Obviously, attention is not all you need. Just like lucene isn't all you need.
In terms of "All You Need" for Vector Search, ANN Benchmarks (https://ann-benchmarks.com/) is a good site to review when deciding what you need. As with anything complex, there often isn't a universal solution.
txtai (https://github.com/neuml/txtai) can build indexes with Faiss, Hnswlib and Annoy. All 3 libraries have been around at least 4 years and are mature. txtai also supports storing metadata in SQLite, DuckDB and the next release will support any JSON-capable database supported by SQLAlchemy (Postgres, MariaDB/MySQL, etc).
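For a sense of the API, a minimal txtai sketch (the sentence-transformers model name is illustrative; it uses the default Faiss backend with SQLite content storage):

```python
# Minimal txtai sketch: Faiss ANN index plus SQLite-backed content storage.
# The model name is illustrative; swap in whatever embedding model you use.
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,  # store text/metadata in SQLite alongside the vectors
})

docs = [
    (0, "pgvector adds vector similarity search to Postgres", None),
    (1, "Lucene ships an HNSW implementation for dense vectors", None),
]
embeddings.index(docs)

# Returns the stored text plus similarity scores
print(embeddings.search("vector search in a relational database", 1))
```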
I think this depends entirely on scale and performance metrics. For many smaller use cases using Lucene (or postgres, or elasticsearch or whatever else you already have running in your stack) is perfectly adequate as this paper shows. But as soon as you add a large dataset or high index/search volume you are likely better served with an actual vector datastore. The paper even acknowledges slow indexing performance and a low 9.8 queries per second on decent hardware. Will it perform fine for your couple of hundred pages of internal wiki? Sure. But I think your time is likely better spent learning to deploy and manage a new tech in your stack than figuring out how to work around these significant limitations at scale.
"We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection......This suggests that, from a simple cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern "AI stack" for search, since such applications have already received substantial investments in existing, widely deployed infrastructure."
Curious why they stop there: why use OpenAI embeddings and not, say, LLaMA embeddings, and create a truly open stack?
To intentionally oversimplify, embeddings fall into 2 main categories if you do a cursory search:
- Good at semantic similarity: Most common case, check if two strings have similar meaning, even if the words don't match exactly.
- Good at Q/A: Finds text that can answer a question. Sounds similar to semantic similarity, but "What is a dog" and "What is a cat" are very similar sentences yet very different questions. These models cluster questions closer to their answers and further from other questions. They also handle the length difference between questions and answers better.
(Some leading embedding models use prefixes during training to let you adjust performance between those two tasks on the fly)
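To make that concrete: E5-style models, for instance, are trained with "query: "/"passage: " prefixes, so you embed the two sides differently. A sketch (the model name is illustrative and the exact prefixes are model-specific):

```python
# Sketch: E5-style models expect task prefixes, so queries and documents
# are embedded differently. Exact prefixes depend on the model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")  # illustrative model

passages = ["passage: A dog is a domesticated descendant of the wolf."]
query = "query: What is a dog"

passage_embs = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# Cosine similarity (vectors are already normalized)
print(passage_embs @ query_emb)
```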
LLaMA embeddings weren't optimized for any specific task; you'd just be hoping that they tangentially align with some arbitrary goal.
OpenAI had to fine-tune their embedding model, and despite it being massively oversized you can see it's not at the top of the leaderboard compared to much smaller models.
This is completely obvious, and I am surprised that they had to write a paper about it, but with $100m investments in vector databases I guess it needed to be officially said. For those companies to be successful they will have to also become either a better search engine or a better database in addition to vector search, and compete either with folks like elastic/redis/opensearch/vespa or with postgres/mongodb/oracle/mysql. An independent vector-only search system doesn't make sense.
Also, all embeddings are basically equivalent for this use case.
History repeats itself endlessly in the database world.
Late last century Oracle felt urgency to compete with the new hotness back then, object databases. The guys in charge of the Oracle database itself pushed back, not too much was done, and in the end those competitors all flamed and died.
Object databases gave way to graph databases. They broke through a little but RDBMS continued to rule.
Then came the NoSQL movement. RDBMS vendors ended up adding json columns and the pure NoSQL vendors flamed and died (excepting Mongo DB which is web scale (https://www.youtube.com/watch?v=b2F-DItXtZs)).
I also do not understand why VCs are investing in this space. The base case is that they are almost completely interchangeable c/o langchain and other intermediaries.
I do understand that some vector databases have strengths over others w/r/t scaling OUT; however, they do not have stickiness, and time erodes the scaling advantages, both via competitors catching up and via machines getting cheaper and allowing for scaling UP.
The history of commercial DBs was usually supported by a variety of use cases, proprietary hooks to keep customers, choices on the CAP theorem, etc. Almost none of that applies here given the minimal interaction modes we have with vector DBs.
Could anyone speak to the case to invest in vector DBs?
(And to address the parent's take, "elastic/redis/opensearch/vespa or with postgres/mongodb/oracle/mysql": this is one of the most crowded spaces in the marketplace, and I have no idea why customers would choose an upstart for the sake of consolidation rather than clear winners in a best-of-breed solution.)
I'm guessing they chose to use OpenAI embeddings since that's the dominant use case for most people.
The embeddings are just the "data" that's in the database. Swapping out getting embeddings from OpenAI with Llama is as trivial as putting information about your own customers in your own database as opposed to using info on someone else's customers.
Marqo lets you use state of the art e5 embeddings (which are significantly more performant in retrieval than the openai embeddings), and will handle the embedding generation and retrieval on lucene indexes: https://www.marqo.ai/
Every week I feel like we get a few papers closer to "SQLite is all you need".
A voice in my head seems adamant that the solution to this whole space of problems is neatly managed by one clever schema and minimal computational resources. It has only been growing louder and more confident in this over time.
For decades, banks, insurance companies, airlines, multinationals and generally the entire world ran their entire IT operations on databases less powerful than a single SQLite running on a modern PC. Makes sense that it's all you need for many problems.
From this preprint I'm fairly convinced that Lucene works well for embedding + retrieval tasks. I hope the paper provides a direct comparison against Pinecone, Chroma, etc. With enough budget you could probably do a user study too.
Finer points:
1. I remember FAISS is heavily accelerated on GPUs (sketch below). How does Lucene compare there?
2. "we’re not convinced that enterprises will make the (single, large) leap
from an existing solution to a fully managed service" --> Fair point but not everyone uses Lucene? I feel it is weird that this "existing solution" (Lucene) is assumed to be already adopted.
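On point 1, for reference: moving a Faiss index onto a GPU is essentially a one-liner, which is a big part of why it benchmarks so well there. A minimal sketch (dimensions and data are made up; requires the faiss-gpu build):

```python
# Sketch: build a flat Faiss index on CPU, then copy it to GPU 0.
# Sizes and data are illustrative.
import numpy as np
import faiss

d = 1536                                            # embedding dimensionality
xb = np.random.rand(10_000, d).astype("float32")    # "database" vectors
xq = np.random.rand(5, d).astype("float32")         # query vectors

cpu_index = faiss.IndexFlatL2(d)                    # exact L2 search
cpu_index.add(xb)

res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)

distances, ids = gpu_index.search(xq, 5)
print(ids)
```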
Sorry, I wasn't trying to argue against the post or anything. I was just trying to say that, indeed, a lot of things people already use can do the vector-store job, and I'm not sure anymore what the use case is for something like Pinecone or Chroma (I would genuinely like to know).
Umm, how is it going to scale? How do you handle millions of vectors per client and multiple clients? Vector stores, like any DB, exist to simplify managing large-scale data.
I don't think this is unreasonable since the target audience is other researchers. When you try and reproduce this paper you won't think "wow there's a lot of API errors I must be doing something wrong".
OpenAI's APIs error out a lot. You get a lot of 502 and other similar errors. This would probably affect someone trying to reproduce the paper. I think that's what they meant.
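If you're reproducing something like this, a small retry wrapper with exponential backoff around the embedding calls goes a long way. A sketch assuming the 1.x `openai` client (the model name is illustrative):

```python
# Sketch: retry transient OpenAI API failures with exponential backoff
# so a long embedding run survives intermittent 5xx/rate-limit errors.
import time
from openai import OpenAI, APIError

client = OpenAI()

def embed_with_retry(text, retries=5):
    # Retries any OpenAI API error, backing off 1s, 2s, 4s, ...
    # In real code you'd probably narrow this to rate-limit / 5xx errors.
    for attempt in range(retries):
        try:
            resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
            return resp.data[0].embedding
        except APIError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)
```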
That is possibly intended as a useful signal to other researchers that the team has a lot of hard-won technical knowledge working with the API which may not necessarily be elaborated upon within the paper.
"Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead."