Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Replacing Elasticsearch with Rust and SQLite (2017) (nickb.dev)
227 points by nethunters on May 16, 2021 | hide | past | favorite | 48 comments


This article doesn't even get into SQLite's full-text search feature, which really is surprisingly good. It's also very widely used - desktop and mobile apps that offer a search feature often build that using SQLite, so it's a very robust and well-trodden path at this point.

I've written a few tools to help work with FTS in SQLite:

- https://sqlite-utils.datasette.io/en/stable/cli.html#configu... - command-line utilities for enabling full-text search against an existing SQLite table

- https://sqlite-utils.datasette.io/en/stable/python-api.html#... - that same functionality as a Python library API

- https://docs.datasette.io/en/stable/full_text_search.html - Datasette spots full-text search enabled tables and adds a search interface to them

I also put together this article exploring classic search relevance algorithms with SQLite - SQLite FTS5 has relevance built in, but FTS4 leaves it as an exercise for the developer which makes it a really fun tool for understanding how algorithms like BM25 actually work: https://simonwillison.net/2019/Jan/7/exploring-search-releva...


SQLite's FTS5 also has experimental support for trigram indexing[0], which is quite nice for having substring matches instead of just whole-token matches with LIKE and GLOB syntax.

No support yet for regexp searches over trigram indexes, though, unfortunately (Postgres has this.) I personally think that much more expressive indexed search syntax (more so than just regexp, even) is going to be the future and am actively working on a search engine in Zig for this purpose.

[0] https://sqlite.org/fts5.html#the_experimental_trigram_tokeni...


Actually pg_tgrm, like all things Postgres generally, is superior. It can rank searches by similarity (e.g. proportion of query trigrams present in the matching column's trigrams) and SQLite doesn't, which is a dealbreaker when dealing with anything that potentially includes typos or mutable word order; in many situations you wanna match (or rank high) 'Johnny Doe' when the query is 'Doe, John', also considering that languages with clear markings of grammatical case rely less on word order (like most Slavic ones) pg_tgrm is a godsend for properly hitting rows with such user input queries.


Oh cool I didn't know PostgreSQL can use those for regular expressions! I was playing with those indexes for PostgreSQL just this morning: https://til.simonwillison.net/django/enabling-gin-index


If you're using Postgres Trigram indexes at larger scales, I might suggest reading over just the conclusions at the end of my article here:

https://devlog.hexops.com/2021/postgres-regex-search-over-10...

e.g. I discuss the fact that pg_tgrm indexes being built does not take advantage of multiple cores, so at large scales you'll benefit from splitting data into multiple tables.


> SQLite's full-text search feature

Doesn't support languages other than English. At least, that was the case when I last tried it.


Is there a way to easily use SQLite FTS on Django?


Not that I've seen. You'd likely have to use a raw SQL query - or the RawSQL class as part of an expression in the ORM.


I’m always surprised when people haven’t heard of Xapian which is also in this space.

I tell people Xapian is the SQLite of search.

https://xapian.org/


I know it for being the backbone of the equally austere notmuch MUA (https://notmuchmail.org/), where it does a great job of churning instantly through dozens of thousands of mails.


Looks like its GPL not LGPL. Seems not usable for most people


I used this pretty extensively at a startup in 2012 or so. It was very easy to use and modify, but definitely lacked the polish of ElasticSearch at the time. There were SWIG bindings that let us use it with Python.


I'm not surprised. Low-tech website with zero marketing and no big players using it or throwing their weight behind it.

I've perused their docs and the immediate question for me is, why would I use this over another more popular search library or database that already has a wealth of stackoverflow and blog posts?

I'm sure it's a fine tool, but I don't see a reason why an enterprise should hedge their bets on this unknown.


Notmuch[1] is using it. It has a good reputation.

[1]: <https://notmuchmail.org/>


Thanks for the example, I'm aware of Notmuch but had no idea it relied on Xapian. However, back to my original point, there's simply not enough people or large projects/companies that rely on and use Xapian. I'm 99% positive that there is less than a handful of people at any of the last two Fortune 500 companies I've worked for that even knows what Notmuch is. But everyone knows Elastic, MongoDB, etc..

Again, I don't think Xapian is a bad tool or that it's underperforming by any stretch, but OP should not be surprised at all that most people have not heard of it.


I tried, and was never able to get the accuracy of results I expected when searching a SaaS address book. I had better results doing the search in SQL FWIW (though that was slower)


Yeah, I did a POC with Xapian ages ago. You would have a hard time convincing anyone to add a C++ indexer to their stack. I know I know, bindings, and all, but still.


Over one in Java? I’m not a Xapian proponent or anything, but my understanding was that ElasticSearch was slightly annoying to maintain because you need to worry about heap memory usage.


If my app was already running on the JVM, it's a pretty easy decision to run it embedded.


Does it do precise group-by queries? (facets) Can you do custom ranking by providing a class / plugin? (even a c++ plugin would be fine) thanks..



I think replacing Elasticsearch with SQLite is a great idea. Even more if you use the full-text search functions that SQLite includes.

I recommend checking out scout[0], which, I think, can be a good replacement for Elasticsearch in some cases. I'm also working on an Elasticsearch replacement built on top of SQLite for my litements[1] project, but it will still take a few weeks to have a working version.

[0] https://github.com/coleifer/scout [1] https://github.com/litements/


It definitely depends on what you're doing with it. In the article the author is using it to parse only nginx logs to report page hits counts - I probably would have not gone with Elasticsearch to begin with. I find it strange that the issue with InfluxDB was the lack of 'Group by'; you can use 'Distinct' with other aggregation functions with the same result - and without having to write and maintain your own stack.

I appreciate - and often indulge - in writing stacks of software that are better implemented other open source, but I wouldn't say that the article describes a solution that is viable as a replacement for Elasticsearch.


I think it's best put as a replacement for ElasticSearch for those who don't really need ElasticSearch.

Edit - i.e. where ES is overkill for the use case, however for an appropriate use case ElasticSearch is amazing


I didn't read the article, but I think this is a sizable audience.


Similar to litements litecache is DiskCache[0]. It is a pure python Redis/Memcached alternative which is faster than them for reads. Under the hood it uses shared Sqlite.

[0] https://github.com/grantjenks/python-diskcache


Yes, I’m aware of DiskCache. Wonderful library and also very useful, I enjoyed reading the source code too. I have taken a few ideas from it that I would like to implement in the future.


This is quite interesting! How can be a disk based cache be faster than redis? (They claim this on their Github page)


I would say:

1. No network round trips 2. SQLite runs in the same process as your app, so reading from SQLite is basically like calling a function in your program (from a performance point of view).


I checked the benchmarks page [0], they run the Redis in same machine. And also, running Redis in different machine and connecting over networking isn't same compared to using diskcache.

> SQLite runs in the same process as your app, so reading from SQLite is basically like calling a function in your program

But SQLlite needs to write to disk, which will be much costly.

[0] - http://www.grantjenks.com/docs/diskcache/cache-benchmarks.ht...


I wonder how effective this would be with the SQLite with a new virtual file system fetching pages via XHR that someone posted last month. Then you would be able to do full text search on a huge database only sending a few KB per search

https://news.ycombinator.com/item?id=27016630


I’ve been a bit obsessed with that post since it came out as well.

Would it be conceivable to replace Algolia in this way for static data sets (i.e. those that don’t need the write/update API)?


I tried it (FTS5 on sql.js-httpvfs) with a 600mb FTS table containing OSM data. It works but the queries are quite a bit slower and there is quite a large difference between browsers. Some queries that took 1s in chrome took 10s in FF although I'm not sure why yet.


I guess it would be quite slow because of the network latency.


> This query can’t be represented in Diesel’s DSL (a sample of the DSL is demonstrated in Diesel’s getting started).

I feel like most of the time, SQL DSLs end up being bad. SQL is already a very high level language. There are a ton of examples of how to do stuff with SQL queries. DSLs, even if they have a way to express the SQL, are not nearly as widespread as SQL and then you will have to spend time figuring out how to translate the SQL into the DSL.

In addition, with libraries like sqlx for Rust, you can also get the type safety of a DSL using regular SQL.

I would say that as a developer, the time spent getting familiar with SQL is a good investment that will likely pay off across many projects and programming languages.


Its a low traffic website. Simple non distributed systems usually outperform complex distributed systems when the scale is small. Right tool for the right job and all.


There are also two good Elasticsearch alternatives in Rust - Sonic[1] and Toshi[2].

[1] https://github.com/valeriansaliou/sonic

[2] https://github.com/toshi-search/Toshi


Great work.

This is a really excellent exercise for understanding material need for refactoring.

I don't think I share the author's conclusions but it makes a really good test case.

"At idle, Elasticsearch uses:

    1GB of RAM (out of 8GB total)
    30% utilization of a CPU
    1GB of Disk space after a week
This doesn’t sound bad, but considering this server houses a dozen containers and the next CPU intensive container uses only 2% of a CPU – Elasticsearch is too heavy. For a server that gets 15 requests a minute, the resource consumption of Elasticsearch doesn’t justify it’s use. What would happen if I received a spike of traffic? I feel like the server would fall over not because the app couldn’t handle it, but log management couldn’t handle the load."

The thing is 'uses more resources than the thing next to it' is maybe a nice referential starting point, but it doesn't mean much.

What is the actual cost/risk/benefit work out to be? Cost in terms of development, maintenance, hosting, opportunity for future expansion?

Are we going to need to 'do more soon'? Are the hosting costs even material? Can Rust be easily supported?

The bit about 'what if there are more requests' does give pause for thought, but, it could be that there's a threshold/minimum that the service needs to operate, above which there's only marginal, incremental resource consumption. I don't know, I'm only pointing out the possibility.

And the choice of Rust, the author denoted has a few nice attributes ... but is this a personal choice ... or an optimal choice? Would the Python or Java solution be cleaner? Elastisearch is based on Lucene (Java), so it might be possible to do something very fast, powerful and extensible there as well.

Thanks to the author for both 'doing it' and 'writing about it' - but I think the meta issue here hinges is the technical product case.


>What is the actual cost/risk/benefit work out to be? Cost in terms of development, maintenance, hosting, opportunity for future expansion?

Right below where you quoted the author he stated his biggest risk - denial of service not due to his site having performance issues but his logging solution not being able to keep up if he had an unanticipated traffic spike.

That would indeed be a rather nasty case of self-ownage; the benefits of avoiding that should be pretty obvious.


I imagine Lucene would have fit the use case better than Elasticsearch, especially if the author found SQLite to work well. Given that the indexing and searching seems very minimal SQLite seems like the right choice, but more people should really know about Lucene. It's a great library and I've used it directly rather than go with ES for some smaller, or local projects.



For full-text search on Rust (and replacing Elastic) have a look at: https://www.meilisearch.com


also if you can, set up your logging to output JSON/GELF.


For anyone else unfamiliar

"""

Greylog Extended Log Format (GELF) is a log format that avoids the shortcomings of classic plain syslog:

* Limited to length of 1024 bytes – Not much space for payloads like backtraces

* No data types in structured syslog. You don’t know what is a number and what is a string.

* The RFCs are strict enough but there are so many syslog dialects out there that you cannot possibly parse all of them.

* No compression

""" -- https://docs.graylog.org/en/4.0/pages/gelf.html

It can be sent over UDP or TCP. Looks like each log entry is a JSON object, with standardised field names/formats.


I built a full text search functionality with SQLite in Go for searching a sql database of Magic cards, and I was impressed with how well SQLite handled it. I'm not surprised by this finding at all, I've found that even fairly complex queries on different cards are still plenty fast.


> for my nginx access logs

Interesting how we can refine use-cases and come up with simpler solutions that work as well or better. It's not a general replacement for Elasticsearch, notably may not scale for searching all the nginx logs for a distributed multi-tenant system.


Didn't java get play-nice-with-docker support only quite recently, at about 2019 (JDK 8u191, JDK-8146115) ?

My gut feeling says it won't make too much of a difference.


for nginx a Prometheus exporter module can be used https://github.com/vozlt/nginx-module-vts




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: