I'd say Postgres + pgvector is even simpler if you're doing small-scale document search (e.g. internal knowledge bases, documentation sources, codebase indexing, etc.).
pgvector is even supported out of the box on Azure and AWS RDS.
Just spin up a docker container [1], add a vector column to your table and you're ready for embedding search.

[1] https://hub.docker.com/r/ankane/pgvector
If you're starting out with a prototype, do yourself a favour and steer clear of the chromadb examples with langchain. In fact, steer clear of langchain in general :) Just go with the OpenAI API and PostgreSQL + pgvector. You'll have to write some boilerplate, but the stuff in langchain is just terrible; you'll end up rewriting it and doing the boilerplate at some point anyway, and this stack is super simple to deploy.
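For a sense of how small that boilerplate actually is, here's a rough sketch of the whole loop. The `documents` table, connection string and `text-embedding-ada-002` model are illustrative, and it assumes the 1.x `openai` Python client plus `psycopg2`:

```python
# Sketch: embed text with the OpenAI API and search it with pgvector.
# Table/column names, connection string and model are illustrative.
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
conn = psycopg2.connect("dbname=app user=app")

def embed(text):
    # text-embedding-ada-002 returns 1536-dimensional vectors
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def to_vector_literal(vec):
    # pgvector accepts vectors as '[x1,x2,...]' text literals
    return "[" + ",".join(str(x) for x in vec) + "]"

with conn, conn.cursor() as cur:
    # One-time setup: enable the extension and add a vector column
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("ALTER TABLE documents ADD COLUMN IF NOT EXISTS embedding vector(1536)")

    # Index a document
    doc = "How we rotate API keys"
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
        (doc, to_vector_literal(embed(doc))),
    )

    # Nearest-neighbour search (<-> is L2 distance in pgvector)
    cur.execute(
        "SELECT content FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
        (to_vector_literal(embed("key rotation")),),
    )
    print(cur.fetchall())
```

That's essentially the entire stack: one table, one extension, one API call per document.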
Now that OpenSearch has IVF-PQ, and I assume pgvector will get it within six months, the current typical case for most teams, big or small, should be handled. Regular databases are all adding it, further chiseling away at the need for dedicated vector DB startups. This seems consistent with what we guessed would happen at the bottom of https://gradientflow.com/the-vector-database-index/ .
AFAICT standalone optimized vector stores will still have their place, e.g. in super latency-sensitive or high-throughput scenarios. Unfortunately for many stakeholders, it's unclear whether there are venture-scale returns to justify the megarounds from the peak VC FOMO of the last few years. The TBD hail mary here may be generative AI: even if most modern vector search workloads are largely fine with regular DB extensions that support a few kinds of vector indexes, the continued growth of generative AI, knowledge graphs, etc. may somehow grow the pie for non-standard DBs here. That's not obvious to me. For example, with Databricks LakehouseIQ, a lot of the use case may be eaten by the data warehouses.
"Just spin up a docker container" is a self-contradicting sentence. For non-complex applications, anything needing to touch containers is itself too complicated.
Even with pgvector, there's no good way to write a simple tutorial for embedding newbies.
Anything involving Lucene is non-trivial from the start.
I'm coming at this from a perspective of a competent dev trying to build a tool with OpenAI APIs - which, from what I can tell, is a growing topic.
If you're experienced with building APIs and new to the LLM stuff - skip the langchain and chromadb nonsense. Just use the OpenAI APIs and pgvector.
Chromadb and langchain are useful when writing notebook prototypes to get an idea of how this stuff works - but discard them immediately after that phase and save yourself the trouble of porting later.
I recently spun up a MySQL instance outside a container for what I hope is the last time.
MySQL is very attached to /etc/mysql, /var/log/mysql and /var/lib/mysql, which makes sense if one thinks of it as a piece of a distribution and makes no sense if one thinks of it as a service that stores data in a filesystem or directory that one sets up for the purpose. Apparmor and (don’t get me started) SELinux dig this in deeper. What if you want two MySQLs on one host? What if you don’t want to mount something on /var/lib/mysql?
If mysqld were invoked by pointing it at a configuration and data and it just worked, I’d be more okay with it.
The fact that I really don’t want to couple upgrades of MySQL to distro upgrades is just icing on the cake.
For development, perhaps. For production, absolutely not.
I wish HN had some bot that would just delete any comment from people recommending installing databases. Because 99.9% of the time it's from those who have no experience running one in a production environment. Keeping it secure and ensuring the backup/restore cycle works is seriously non-trivial.
What's wrong with Postgres + pgvector in production? pgvector is supported by RDS. Keeping an RDS instance running in production isn't exactly rocket science.
It's no match for Lucene (in practice ES or Solr) in terms of performance and features, as it's a very different model of indexing and operating.
Keeping it running while thousands of customers run search queries AND the app uses the same db AND users want faceting or other features is not an option.
I think if you're past pgvector performance you won't be listening to a random guy talking about pgvector; you'll already have a good understanding of the space.
If you're new (like I was a few months ago), save yourself the time I wasted on the noob traps I mentioned. pgvector scales way better than the OpenAI API does for my use cases.
Lucene (or rather Elastic/OpenSearch) is way overkill for my needs.
But for an internal knowledge base, backup/restore may be irrelevant (the documents are all copies of data held elsewhere and the database can be reconstructed quickly at will), and security is really not that difficult with a single system user.
Security and backups are important regardless of the technology you’re using.
Managed Postgres is a thing, probably much more common than managed Lucene. I can name three managed Postgres providers off the top of my head (AWS, Vultr, Supabase).
Right. So now my existing stack needs a backup strategy that's independent of the database. That's really tough in production; Lucene is adding complexity.
To be clear I object to this:
> I wish HN had some bot that would just delete any comment from people recommending installing databases.
Most applications already use a database. Pooh-poohing Postgres as a solution is silly. For many people, it’s actually going to be less complex to take to production than installing a library, especially one with its own complex storage needs.
It's an opinionated blog post published on Arxiv, masquerading as research.
IMHO, it's a gigantic self-own and doesn't promote Lucene in a good way. For example, by demonstrating how they get only 10 QPS out of a system with 1 TB of memory and 96 vCPUs (after 4 warmups).
The HNSW implementation in Lucene is fair, and within the same order of magnitude as others. But to get comparable performance you must merge all immutable segments into a single segment, which all Lucene-oriented benchmarks do, but which is not that realistic for many production workloads where docs are updated/added in near real-time.
> but which is not that realistic for many production workloads where docs are updated/added in near real-time.
it really depends on how real time you need the search to be tho.
What I've seen is a green/blue Lucene index. The updates happen on one (let's say the blue), while searches happen on the other (green). The segment merging happens periodically for the blue (or, even smarter, after some known combination of elapsed time and number of updates), and then the indexes are switched. Depending on how often new documents come in and how "real time" you need to be, this may be sufficient.
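A rough sketch of that swap with the Elasticsearch Python client (index and alias names are made up, and it assumes a recent 8.x elasticsearch-py; the same pattern works on OpenSearch or raw Lucene):

```python
# Sketch of a blue/green index swap: force-merge the idle index,
# then atomically repoint the search alias. Names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def promote(idle="docs-blue", live="docs-green", alias="docs-search"):
    # Collapse the freshly updated index to a single segment so HNSW
    # search only has to traverse one graph.
    es.indices.forcemerge(index=idle, max_num_segments=1)

    # Atomically switch the alias: searches now hit the merged index,
    # and the previously live index becomes the next write target.
    es.indices.update_aliases(actions=[
        {"remove": {"index": live, "alias": alias}},
        {"add": {"index": idle, "alias": alias}},
    ])
```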
I gotta be honest, I find it almost a little disrespectful that everyone started naming their shit "x is all you need" even for very mundane stuff.
"Attention Is All You Need" was a breakthrough paper. It fundamentally changed the ML landscape and got us past a huge roadblock with RNNs.
If you seriously think you have something similarly impactful on your hands, then sure go ahead with that name.
But there's been a bunch of papers where I found it distasteful. At best it's just not funny. But this isn't even really much of a paper. I've seen blog posts with more substance. Hell, even YouTube videos.
I don’t know, I guess I just don’t really get the joke.
You should write a paper: "all you need considered harmful"
I don't think referencing well-known papers has ever (well, maybe not ever) implied that the authors feel their work is on par with the original. It's a pretty common practice in some academic circles when there is an impactful paper with a catchy name and you simply want to pay a bit of homage and have a less boring title than you might have otherwise.
"Attention is all you need" was a silly name and most people who name their stuff are poking fun at it. Obviously, attention is not all you need. Just like lucene isn't all you need.
In terms of "All You Need" for Vector Search, ANN Benchmarks (https://ann-benchmarks.com/) is a good site to review when deciding what you need. As with anything complex, there often isn't a universal solution.
txtai (https://github.com/neuml/txtai) can build indexes with Faiss, Hnswlib and Annoy. All 3 libraries have been around at least 4 years and are mature. txtai also supports storing metadata in SQLite, DuckDB and the next release will support any JSON-capable database supported by SQLAlchemy (Postgres, MariaDB/MySQL, etc).
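For a sense of the API, a minimal txtai sketch (the sentence-transformers model name is illustrative; it uses the default Faiss backend with SQLite content storage):

```python
# Minimal txtai sketch: Faiss ANN index plus SQLite-backed content storage.
# The model name is illustrative; swap in whatever embedding model you use.
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,  # store text/metadata in SQLite alongside the vectors
})

docs = [
    (0, "pgvector adds vector similarity search to Postgres", None),
    (1, "Lucene ships an HNSW implementation for dense vectors", None),
]
embeddings.index(docs)

# Returns the stored text plus similarity scores
print(embeddings.search("vector search in a relational database", 1))
```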
I think this depends entirely on scale and performance metrics. For many smaller use cases using Lucene (or postgres, or elasticsearch or whatever else you already have running in your stack) is perfectly adequate as this paper shows. But as soon as you add a large dataset or high index/search volume you are likely better served with an actual vector datastore. The paper even acknowledges slow indexing performance and a low 9.8 queries per second on decent hardware. Will it perform fine for your couple of hundred pages of internal wiki? Sure. But I think your time is likely better spent learning to deploy and manage a new tech in your stack than figuring out how to work around these significant limitations at scale.
"We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection......This suggests that, from a simple cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern "AI stack" for search, since such applications have already received substantial investments in existing, widely deployed infrastructure."
Curious why they stop there: why use OpenAI embeddings and not, say, LLaMA embeddings, and create a truly open stack?
To intentionally oversimplify, embeddings fall into 2 main categories if you do a cursory search:
- Good at semantic similarity: Most common case, check if two strings have similar meaning, even if the words don't match exactly.
- Good at Q/A: Finds text that can answer a question. Sounds similar to semantic similarity, but "What is a dog" and "What is a cat" are very similar sentences yet very different questions. These models cluster questions closer to their answers and further from other questions. They also handle the length difference between questions and answers better.
(Some leading embedding models use prefixes during training to let you adjust performance between those two tasks on the fly)
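To make that concrete: E5-style models, for instance, are trained with "query: "/"passage: " prefixes, so you embed the two sides differently. A sketch (the model name is illustrative and the exact prefixes are model-specific):

```python
# Sketch: E5-style models expect task prefixes, so queries and documents
# are embedded differently. Exact prefixes depend on the model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")  # illustrative model

passages = ["passage: A dog is a domesticated descendant of the wolf."]
query = "query: What is a dog"

passage_embs = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# Cosine similarity (vectors are already normalized)
print(passage_embs @ query_emb)
```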
LLaMA embeddings weren't optimized for any specific task; you'd just be hoping that they tangentially align with some arbitrary goal.
OpenAI had to fine-tune their embedding model, and despite it being massively oversized you can see it's not at the top of the leaderboard compared to much smaller models.
This is completely obvious, and I am surprised that they had to write a paper about it, but with $100m investments in vector databases I guess it needed to be officially said. For those companies to be successful they will have to also become either a better search engine or a better database in addition to vector search, and compete either with folks like elastic/redis/opensearch/vespa or with postgres/mongodb/oracle/mysql. An independent vector-only search system doesn't make sense.
Also, all embeddings are basically equivalent for this use case.
History repeats itself endlessly in the database world.
Late last century Oracle felt urgency to compete with the new hotness back then, object databases. The guys in charge of the Oracle database itself pushed back, not too much was done, and in the end those competitors all flamed and died.
Object databases gave way to graph databases. They broke through a little but RDBMS continued to rule.
Then came the NoSQL movement. RDBMS vendors ended up adding json columns and the pure NoSQL vendors flamed and died (excepting Mongo DB which is web scale (https://www.youtube.com/watch?v=b2F-DItXtZs)).
I also do not understand why VCs are investing in this space. The base case is that they are almost completely interchangeable c/o langchain and other intermediaries.
I do understand that some vector databases have strengths over others w/r/t scaling OUT; however, they do not have stickiness, and time erodes the scaling advantages, both via competitors catching up and via machines getting cheaper and allowing for scaling UP.
The history of commercial DBs was usually supported by a variety of use cases, proprietary hooks to keep customers, choices on the CAP theorem, etc. Almost none of that applies here given the minimal interaction modes we have with vector DBs.
Could anyone speak to the case to invest in vector DBs?
(And to address the parent's take, "elastic/redis/opensearch/vespa or with postgres/mongodb/oracle/mysql": this is one of the most crowded spaces in the marketplace, and I have no idea why customers would choose an upstart for the sake of consolidation rather than clear winners in a best-of-breed solution.)
I'm guessing they chose to use OpenAI embeddings since that's the dominant use case for most people.
The embeddings are just the "data" that's in the database. Swapping out getting embeddings from OpenAI with Llama is as trivial as putting information about your own customers in your own database as opposed to using info on someone else's customers.
Marqo lets you use state of the art e5 embeddings (which are significantly more performant in retrieval than the openai embeddings), and will handle the embedding generation and retrieval on lucene indexes: https://www.marqo.ai/
Every week I feel like we get a few papers closer to "SQLite is all you need".
A voice in my head seems adamant that the solution to this whole space of problems is neatly managed by one clever schema and minimal computational resources. It has only been growing louder and more confident in this over time.
For decades, banks, insurance companies, airlines, multinationals and generally the entire world ran their entire IT operations on databases less powerful than a single SQLite running on a modern PC. Makes sense that it's all you need for many problems.
From this preprint I'm fairly convinced that Lucene works well for embedding + retrieval tasks. I hope the paper provides a direct comparison against Pinecone, Chroma, etc. With enough budget you could probably do a user study too.
Finer points:
1. I remember FAISS is heavily accelerated on GPUs (sketch below). How does Lucene compare there?
2. "we’re not convinced that enterprises will make the (single, large) leap
from an existing solution to a fully managed service" --> Fair point but not everyone uses Lucene? I feel it is weird that this "existing solution" (Lucene) is assumed to be already adopted.
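On point 1, for reference: moving a Faiss index onto a GPU is essentially a one-liner, which is a big part of why it benchmarks so well there. A minimal sketch (dimensions and data are made up; requires the faiss-gpu build):

```python
# Sketch: build a flat Faiss index on CPU, then copy it to GPU 0.
# Sizes and data are illustrative.
import numpy as np
import faiss

d = 1536                                            # embedding dimensionality
xb = np.random.rand(10_000, d).astype("float32")    # "database" vectors
xq = np.random.rand(5, d).astype("float32")         # query vectors

cpu_index = faiss.IndexFlatL2(d)                    # exact L2 search
cpu_index.add(xb)

res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)

distances, ids = gpu_index.search(xq, 5)
print(ids)
```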
Sorry, I wasn't trying to argue against the post or anything. I was just trying to say that, indeed, a lot of things people already use can do the vector-store job, and I'm not sure anymore what the use case is for something like Pinecone or Chroma (I would genuinely like to know).
Umm, how is it going to scale? How do you handle millions of vectors per client and multiple clients? Vector stores, like any DB, exist to simplify managing large-scale data.
I don't think this is unreasonable since the target audience is other researchers. When you try and reproduce this paper you won't think "wow there's a lot of API errors I must be doing something wrong".
OpenAI's APIs error out a lot. You get a lot of 502 and other similar errors. This would probably affect someone trying to reproduce the paper. I think that's what they meant.
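If you're reproducing something like this, a small retry wrapper with exponential backoff around the embedding calls goes a long way. A sketch assuming the 1.x `openai` client (the model name is illustrative):

```python
# Sketch: retry transient OpenAI API failures with exponential backoff
# so a long embedding run survives intermittent 5xx/rate-limit errors.
import time
from openai import OpenAI, APIError

client = OpenAI()

def embed_with_retry(text, retries=5):
    # Retries any OpenAI API error, backing off 1s, 2s, 4s, ...
    # In real code you'd probably narrow this to rate-limit / 5xx errors.
    for attempt in range(retries):
        try:
            resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
            return resp.data[0].embedding
        except APIError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)
```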
That is possibly intended as a useful signal to other researchers that the team has a lot of hard-won technical knowledge working with the API which may not necessarily be elaborated upon within the paper.
"Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead."