I've been working on an implementation of graph RAG (GRAG) using Neo4j as the underlying store.
The overall DX is quite nice. The apoc-extended set of plugins[0] makes it very seamless to work with embeddings and LLMs during local dev/testing. The Graph Data Science package comes preloaded with a series of community detection algorithms[1] like Louvain and Leiden.
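For a flavor of what that looks like: a rough sketch of generating embeddings from Cypher via APOC Extended (the `Chunk` label and `text`/`embedding` property names are my own; exact yield fields may vary by APOC version):

```cypher
// Embed chunk text and store the vector back on the node (APOC Extended 5.x)
MATCH (c:Chunk) WHERE c.embedding IS NULL
WITH collect(c) AS chunks
CALL apoc.ml.openai.embedding([c IN chunks | c.text], $apiKey)
YIELD index, embedding
WITH chunks[index] AS chunk, embedding
SET chunk.embedding = embedding;
```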
Performance has been very, very good as long as your strategy to enter the graph is sound and you've structured your graph in such a way that you can meaningfully traverse the adjacent properties/nodes.
We've currently deployed the Community edition to AWS ECS Fargate using AWS Copilot + EFS as a persistent volume. There were some kinks with respect to the docs, but it works great otherwise.
It's worth a look for any teams that are trying to improve their RAG or are exploring GRAG in general. It's not a silver bullet; you still need to have some "insight" into how to process your input data source for the graph to do its magic. But the combination of the built-in graph algorithms and the ergonomics of Cypher make it possible to perform certain types of queries and "explorations" that would otherwise be either harder to optimize or more expensive in a relational store.
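As an example of the built-in algorithms, running Louvain from GDS is roughly two calls (sketch; the `entities` projection name and `Entity`/`RELATES_TO` schema are placeholders for your own graph):

```cypher
// Project the entity graph into GDS, then stream Louvain communities
CALL gds.graph.project('entities', 'Entity', 'RELATES_TO');

CALL gds.louvain.stream('entities')
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS entity, communityId
ORDER BY communityId;
```

Doing the equivalent in a relational store means either pulling the whole graph into application memory or bolting on an extension.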
My general point on GraphRAG is that it extracts and compresses the horizontal topic-clustering across many documents and makes that available for retrieval.
And that by creating the semantic network of entities, you can use patterns in the graph structure to answer questions that rely on information coming together from different documents. Think the detectives board connecting facts with strings from many different sources.
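The "detective's board" query is where Cypher shines. A sketch, assuming a `(Document)-[:HAS_CHUNK]->(Chunk)<-[:MENTIONED_IN]-(Entity)` shape (labels and relationship types here are illustrative, not a fixed schema):

```cypher
// Connect two entities through evidence drawn from *different* documents
MATCH (a:Entity {name: $entityA})-[:MENTIONED_IN]->(c1:Chunk)<-[:HAS_CHUNK]-(d1:Document),
      (b:Entity {name: $entityB})-[:MENTIONED_IN]->(c2:Chunk)<-[:HAS_CHUNK]-(d2:Document),
      path = shortestPath((a)-[*..4]-(b))
WHERE d1 <> d2
RETURN path, d1.title AS sourceA, d2.title AS sourceB;
```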
Feel free to ping me for a deeper discussion: michael at neo4j
During our initial testing, we ran ~1M nodes on a local Docker container with 1 GB RAM and 1 vCPU.
But here I mean "performance" in both retrieval time and the overall quality of the fragments retrieved for RAG compared to a `pgvector`-only implementation. It is possible to "simulate" these types of graph traversals in pg as well, but you'll have to work much harder to get the performance (we tried it first).
Huh. I've had the opposite experience. Neo4j has a pretty nice interface and package overall, but I was not impressed with the performance, and the developer experience was about on-par with Elasticsearch (not comparing the two databases, just the developer resources and communities). For general purpose use I've still not found anything better than Postgres (and yes, knowledge graphs I would consider general purpose). For my day-to-day work I'm constantly querying a regularly-updated knowledge graph consisting of >10M active, highly-connected nodes - I keep previous versions in the same database so I can traverse backwards through time. This is all on my laptop. No problems with latency or performance.
I'm always curious what people's use cases are with graph databases; do people find Cypher and SPARQL helpful? I've tried several times, but SQL is just so expressive. Postgres is still my favorite graph database (and CRUD RDBMS, and filesystem, and "data conversion tool").
If your performance is poor, try running your query with `PROFILE {your_query}`. It's very easy to write a query that ends up loading way more nodes than expected. Years ago we had one query that progressively performed worse -- turned out one leg was loading the full node space!
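Concretely, you just prefix the query and read the operator tree it prints, watching for a `NodeByLabelScan` or `AllNodesScan` where you expected an index seek (query shape below is illustrative):

```cypher
// PROFILE prints each operator with its db hits and row counts
PROFILE
MATCH (e:Entity {name: $name})-[:RELATES_TO*1..3]-(other)
RETURN other.name, count(*) AS hits
ORDER BY hits DESC
LIMIT 25;
```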
What I have found is that "land and expand" using an index to find the landing spots is key for performance. Reason being once you "land" effectively, "expand" is cheap and fast.
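A minimal sketch of the pattern (index, label, and relationship names are assumptions for illustration):

```cypher
// Make the "land" an index seek rather than a label scan
CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.name);

MATCH (e:Entity {name: $name})      // "land": one index seek
MATCH (e)-[:RELATES_TO*1..2]-(n)    // "expand": cheap pointer chasing from there
RETURN DISTINCT n;
```

The expansion is fast because Neo4j stores relationships as direct pointers on the node, so once you've landed, hops don't touch an index at all.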
Some of it will also come down to your graph design. If you have a lot of super-dense nodes (analogous to a large JOIN), traversing them creates a lot of memory pressure, which Neo4j does not handle well.
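You can audit for these hot spots up front. A sketch using the Neo4j 5 `COUNT {}` subquery (the `Entity` label and the 10,000 threshold are arbitrary placeholders):

```cypher
// Flag super-dense nodes before they blow up a traversal
MATCH (n:Entity)
WITH n, COUNT { (n)--() } AS degree
WHERE degree > 10000
RETURN n.name, degree
ORDER BY degree DESC;
```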
But in a RAG use case, I don't see these as being issues.
Number of nodes means nothing. What matters for measuring performance is how interconnected your network is and how complex the relationships you want to extract are.
Right. I created a Neo4j db once with millions of nodes and relationships. Individual queries were very performant for all of my access patterns. Where it failed was with queries/sec. Throw more users at it, and it slowed to a crawl. Yes, read replicas are an option, but I was really discouraged with Neo4j performance with more than a few users.
If you are using Community edition, check out the DozerDB plugin, which adds enterprise features to Neo4j Community such as multi-database support. It's still in its infancy but has already implemented multi-db and enterprise constraints. https://dozerdb.org
[0] https://neo4j.com/labs/apoc/5/ml/openai/
[1] https://neo4j.com/docs/graph-data-science/current/algorithms...