Whoosh – Fast, full-text indexing and searching library in Python (bitbucket.org/mchaput)
146 points by albertzeyer on Nov 19, 2013 | hide | past | favorite | 44 comments


If you are using it with Django, take a look at Haystack [1]. It supports Whoosh as well as Solr, Elasticsearch, and others.

For example, you could use Whoosh in your development environment and switch to Elasticsearch in production if you need something more robust.

[1] http://haystacksearch.org/
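As a rough sketch of that dev/prod swap in Django settings (engine paths as documented for Haystack 2.x; the `DJANGO_ENV` variable, index name, and index path are arbitrary choices for illustration):

```python
# settings.py -- pick the search backend per environment.
# Whoosh needs no external service; Elasticsearch points at a running cluster.
import os

if os.environ.get("DJANGO_ENV") == "production":
    HAYSTACK_CONNECTIONS = {
        "default": {
            "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
            "URL": "http://127.0.0.1:9200/",
            "INDEX_NAME": "myapp",
        },
    }
else:
    HAYSTACK_CONNECTIONS = {
        "default": {
            "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
            "PATH": os.path.abspath("whoosh_index"),
        },
    }
```

The rest of the code (indexes, views, templates) stays the same; only the connection settings change between environments.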



And Xapian, which I've had good results with.


Solr is just so easy to get up and running, so powerful with so many options for scaling, I don't see the benefit of "pure Python" here. (To be fair, despite living and breathing Python most days, I've never thought "pure Python" is a selling point.)


Pure Python is definitely a selling point for those of us who regularly deploy to multiple platforms. Knowing you can 'pip install pure-python-lib' on any platform and have it work every single time, without regard to which C compilers and C libraries you have available, is a major boon.


Or, more importantly (in my case), very easily bundle pure Python code with your application. I understand the arguments against it, but in any project that might produce deliverables that will be in play for more than a few weeks the first thing I do is create a "lib" folder where any third party dependencies live and get imported from.

With Python this is usually very easy.

I've maintained my fair share of legacy code and more often than not one of the biggest issues is tracking down ancient libraries which work with my code. There are very few packages I trust to maintain fairly stable interfaces and be available years into the future (things like Apache HTTPd, PostgreSQL and memcached); everything else lives in the project lib folder. I can "hg clone" my app at any time and get the exact version of the third party packages I need.

Every so often I update the packages and test the integration.


This is why good dependency management systems support both specifying compatible semver ranges but also pinning/locking specific releases. You get nice one-step install/update/add/remove for your deps with the tool, but a fresh checkout can get the last known working versions and deploys can be guaranteed to use the same versions as were run in CI, etc.

Looking now, it seems this is finally possible via pip-compile (from the pip-tools project).
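The range-vs-pin distinction can be illustrated with the `packaging` library (the same version-matching machinery the pip toolchain builds on); the version numbers here are made up:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

compatible = SpecifierSet(">=1.4,<2.0")   # what you declare for a dependency
pinned = SpecifierSet("==1.6.2")          # what the lock file records

print(Version("1.6.2") in compatible)  # True
print(Version("2.0.0") in compatible)  # False
print(Version("1.6.2") in pinned)      # True
```

The tool resolves against the compatible range when updating, but deploys and CI install from the pinned set, so every environment gets identical versions.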


Solr and its backend Lucene are written in Java, so it arguably runs on any platform even more easily than Python.


If you're assuming a known-good Python stack, then a pure-python solution will just work. Using Solr/Lucene means having a known-good Python stack and a known-good JVM. Python is installed with most (all?) Linux distributions by default, but Java is almost never part of the base install.

Also, a pure-python solution allows for an in-process index/search, rather than adding another external process dependency which needs to be monitored and maintained.
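To make "in-process" concrete, here is a toy inverted index (nothing like Whoosh's actual data structures, just the shape of the idea): it lives in the same interpreter as your app, so there is no daemon to install, monitor, or restart.

```python
from collections import defaultdict

class TinyIndex:
    """Toy in-process inverted index: term -> set of doc ids."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Naive tokenization: lowercase whitespace split.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: return docs containing every query term.
        sets = [self.postings[t] for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()

idx = TinyIndex()
idx.add(1, "full text search in Python")
idx.add(2, "pure Python indexing library")
print(idx.search("python indexing"))  # {2}
```

A real library adds persistence, scoring, stemming, and so on, but the whole thing stays a library call rather than a network round-trip to a separate service.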


> so it runs on any platform arguably even more easily than Python

I think that would be a very difficult argument to make.


Understood but I was addressing whether 'pure Python' is a selling feature which it undeniably is.

edit: I don't mean to say that it is a selling feature for everyone (i.e. people who aren't already using Python) but for those who are already using Python it definitely is.


Having had nothing but extremely negative experiences with production JVM deployments, anything being not-JVM-based is a huge selling point for me. Furthermore, "pure python" itself means that I can run it in any Python implementation I want, including pypy, without worrying about any C modules.


All of the trouble I've had has been with compiling the Python bindings for Lucene and linking them correctly against my Java binaries at build time. Once I got that working, I found Lucene to be quite simple, and faster than Whoosh!


What kinds of JVM issues did you run into in deployment? For personal projects I have always used C++, Python, RoR, MEAN, etc., but at work, where my role is non-dev, our software dev team likes Java and the heavyweight GWT... frankly shutting me out. I'd love to know how to convince them to switch away from Java.


Memory issues out the ass. Which are obviously all in my head because there are no memory leaks in Java. I must be doing it wrong. Am I sure I know how to multiply? Of course you shouldn't be using version 1.2.3.4.5, our code only works with 1.2.3.4.4, everybody knows that. Well, except for that code, which of course you have to run on 1.2.3.4.4c. And this third-party daemon over here, that only runs on an obscure 6-year-old JVM and you have to put it in /var/deadchicken/ and insert three squirrels before executing.

You won't convince them, they've already jumped off the cliff. It's all perfectly normal to them. If you want to do development, find a company that hasn't been contaminated yet.


> Memory issues out the ass.

As others have posted, this is indicative of bad code. Java can certainly leak memory in the form of retained live references.

> Of course you shouldn't be using version 1.2.3.4.5, our code only works with 1.2.3.4.4, everybody knows that.

That's completely independent of the language or runtime being used, and is purely a project management issue (which breaking changes go into which version).

> that only runs on an obscure 6-year-old JVM

If there is code that imports classes specific to a JVM (com.sun, etc), then that code is doing something pretty much universally agreed on in the Java community to be a 'bad thing'. Otherwise, bytecode from Java 1.0 still runs on the latest JVM without issue.

You can use any language or runtime badly. There is a lot of code out there written in Java, and a lot of 'commodity' Java developers writing it. Of course there is going to be a greater volume of bad code, that's not an indictment of the platform itself.


> That's completely independent of the language or runtime being used, and is purely a project management issue (which breaking changes go into which version).

That passage was about the runtime. At one point, I had to deal with four different JVMs in the same server farm.

> You can use any language or runtime badly.

No other ecosystem has so consistently fucked me over. Not even PHP. I actually prefer Java to PHP for just writing code, but PHP to absolutely anything running on the JVM for operations.

Know when I stopped regularly getting woken up in the middle of the night? 2011, when I stopped supporting anything on a JVM. Ironically, it would have been two years earlier, but I foolishly self-inflicted Cassandra on myself. Never again.


I don't actually write Java professionally.

But it sounds like your problem isn't Java/JVM, but rather really shitty code. If it breaks when you change minor versions of the JVM, the code is broken.


Just from my own personal experience, you've just indicted code from three well-known SV companies, at least two more non-SV fortune-500s, a mess of B2B vendors, the Apache foundation, and a few minor, independent open-source projects. I've heard stories from friends in a few other companies about both internal and third-party code.

At some point you reach the conclusion that the problem with an ecosystem is more fundamental than a few lazy developers. But even if not, your statement isn't actionable. If all the code in a certain set happens to be really shitty, the only rational choice is to not use that code. Any other argument ends up being one only of semantics.


Culture. It is all about culture.

You'll have all seen the pattern. The sort of people who use language X tend to be the same people who use VCS Y tend to be the same people who think doing Z is a good idea tend to be the same people who don't see any problem with IDE W and protocol V.

Culture (by definition) is something that propagates amongst a community, and sometimes it's just poisonous. Or at least seems poisonous from an outsider's point of view. Except if it's java, in which case there's no two ways about it.


I think I've deployed only one Java web app (puppet-db) that doesn't have huge memory-leak issues. With the lack of process recycling in Tomcat to mitigate "bad code", I cringe every time I have to deploy another one.


I am very happy with Maven for tracking dependencies and building and deploying artifacts.

Please educate me, what am I missing out on?


> I am very happy with Maven

I have two responses to that, which may seem contradictory but are both heartfelt:

1) Your company uses Maven? You're lucky!

2) Maven? The prosecution rests.

> Please educate me, what am I missing out on?

You're happy. I'm obviously just another annoying, delusional ops guy who doesn't know what he's talking about. Why would you think you're missing out on anything? Go about your business. Everything is just fine.


Don't be so bitter. Your post above was informative to me, and it's telling that you alluded to the community blaming ops troubles on "bad code", and the two people who replied to you did just that.


Solr, Sphinx and ElasticSearch are all reliable search-at-scale options for those of us without petabyte problems.

Whoosh is fine for very small datasets (megabytes) and low loads, just watch out for file permission issues.


ElasticSearch is actually designed to scale to petabytes and beyond:

http://elasticsearch.com/product/ ("petabytes of data with ease")


Solr is pretty good stuff, but I've had a bit more luck clustering and scaling elasticsearch horizontally with many hundreds of terabytes of data. From what I've read however, the newer solr versions took a lot of ideas from Elasticsearch and they are somewhat comparable at the moment.


Not the most popular, but test suite veracity is my favorite "pure Python" selling point.


You'll have to explain this. Most tests we write in Python are redundant with static typing.


Whoosh is great in that it fits the Django philosophy of building sites fast, but I wouldn't use it for anything harder than a small site search. Once my site grew past 50 MB of text, queries started slowing down. Indexing (Python, single-threaded) took a while, and the larger the index got, the slower queries were returned.

It's probably the best first-iteration search app that I've come across, and you can always slip in Solr or something with more oomph when you need it.


> Indexing (python, single threaded) took a while

The documentation indicates there is a multiprocessing option. (Also configurable memory limits, if you didn't see that.)

http://whoosh.readthedocs.org/en/latest/batch.html


Ooh - handy, I hadn't seen that thanks. I'll try it out.


Can you clarify why you would not use it on anything harder than a small site search? Without context, the comment is not especially helpful.


For very small datasets, Whoosh is fine. For anything larger than 'site search' on a website that has no UGC, you're much better off with SOLR (or ES, Xapian or Sphinx - whatever your poison).

When document counts get into the hundreds, you see orders of magnitude faster queries with SOLR (etc), not to mention much more sophisticated querying options.


> Once my site grew larger than 50 mb of text queries started slowing things down. Indexing (python, single threaded) took a while and the larger the index the slower the queries were returned.

I think this makes it pretty clear. Not sure how much more explanation you need.


Wasn't in the original comment... yet another example of why HN shouldn't have an edit feature.


ElasticSearch with Haystack is the next logical move after outgrowing Whoosh.


Based on my own experiences, by the time search is an important enough feature to outgrow Whoosh, you've also outgrown Haystack's API.


Whoosh is fantastic. I'm using it on an ecommerce web project just now for indexing CMS pages, products, datasheets and product hierarchy (categories, product families, etc).

FWIW I'm using Flask, Flask-SQLAlchemy, Whoosh and Flask-WhooshAlchemy [0]. Quick to get going with but the mix-and-match approach lets me easily rip pieces out as the project grows.

[0] http://pythonhosted.org/Flask-WhooshAlchemy/


Sounds cool. Does it scale well? Has anyone used it for a large-ish amount of data (gigs)?


It works well enough for me. I can't say how large my index set is (I honestly don't have the figures to hand), but on the current project, an uncached search across around 50,000 products, 10,000 product families, and a lot of associated data (product attributes, datasheets, etc.) takes around 65ms.

To put it in context, the psycopg2 calls to PostgreSQL take about 100ms to retrieve the associated data once I've found my search results with Whoosh.

(Most of my response time sits with SQLAlchemy ORM, building up matrices of data, which is why in production I'm caching the more complex queries with memcached).

Overall, for a project of this size (I can't imagine having to index more than a few hundred thousand objects) I'm very happy with Whoosh. If I get to the point of indexing millions of objects, I'll optimize it then.


Not in my experience. I put ID3 tags from ~32K songs into it, and a query for an artist took around 700ms. The same query in ES was more like 10ms, and the same query against a sqlite db (select * from track where artist=...) was about the same as ES. The Whoosh docs claim it's fast, but I haven't seen anything to back up those claims, in unit tests or otherwise. I'm sure it's still useful for some purposes, but ES is so easy to set up that I don't bother.
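The sqlite baseline in that comparison is easy to reproduce with the stdlib (the schema is guessed from the quoted query, and the data here is synthetic, not the poster's actual library):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (id INTEGER PRIMARY KEY, artist TEXT, title TEXT)")
conn.execute("CREATE INDEX idx_artist ON track (artist)")  # the B-tree index doing the work
rows = [(i, "artist%d" % (i % 1000), "title%d" % i) for i in range(32000)]
conn.executemany("INSERT INTO track (id, artist, title) VALUES (?, ?, ?)", rows)

hits = conn.execute("SELECT id FROM track WHERE artist = ?", ("artist7",)).fetchall()
print(len(hits))  # 32 matching tracks
```

Worth noting: this is an exact-match lookup against a B-tree index, not a full-text query, so it is not quite apples-to-apples with what Whoosh or ES does; that may account for part of the gap.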


I have found Whoosh to be easy to use, though I haven't put it to use with a large number of documents, so I don't know how well it holds up.


I recommend taking a look at Whoosh -- I worked with it extensively a few years back, adding a key-value backend for it.

It seems to be designed for indexing manpages. That is, a medium-sized semi-static database with a few different dimensions per document.



