Solr is just so easy to get up and running, and so powerful, with so many options for scaling, that I don't see the benefit of "pure Python" here. (To be fair, despite living and breathing Python most days, I've never thought "pure Python" is a selling point.)
Pure Python is definitely a selling point for those of us who regularly deploy to multiple platforms. Knowing you can 'pip install pure-python-lib' on any platform and have it work every single time, without regard to what C compilers and C libraries you have available, is a major boon.
Or, more importantly (in my case), you can very easily bundle pure Python code with your application. I understand the arguments against it, but in any project that might produce deliverables that will be in play for more than a few weeks, the first thing I do is create a "lib" folder where any third-party dependencies live and get imported from.
With Python this is usually very easy.
I've maintained my fair share of legacy code and more often than not one of the biggest issues is tracking down ancient libraries which work with my code. There are very few packages I trust to maintain fairly stable interfaces and be available years into the future (things like Apache HTTPd, PostgreSQL and memcached); everything else lives in the project lib folder. I can "hg clone" my app at any time and get the exact version of the third party packages I need.
Every so often I update the packages and test the integration.
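The vendored "lib" folder approach described above takes only a couple of lines at the application's entry point. A minimal sketch (the folder name and the `add_vendor_dir` helper are illustrative, not from any particular project):

```python
import os
import sys

def add_vendor_dir(base_dir, name="lib"):
    """Prepend a vendored-dependency folder to sys.path so packages
    checked into the repo shadow any system-installed versions."""
    lib_dir = os.path.join(base_dir, name)
    if lib_dir not in sys.path:
        sys.path.insert(0, lib_dir)
    return lib_dir

# Typical usage at the top of the app's entry-point module:
base = os.path.dirname(os.path.abspath(__file__)) if "__file__" in globals() else os.getcwd()
vendor = add_vendor_dir(base)
```

After this runs, `import somepackage` finds the copy checked into the repo before anything installed system-wide, which is what makes the "hg clone and run" workflow reliable.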
This is why good dependency management systems support both specifying compatible semver ranges but also pinning/locking specific releases. You get nice one-step install/update/add/remove for your deps with the tool, but a fresh checkout can get the last known working versions and deploys can be guaranteed to use the same versions as were run in CI, etc.
Looking now, it seems that pip can finally do this with pip-compile.
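The pip-tools workflow looks roughly like this (file contents and versions below are illustrative, not recommendations):

```text
# requirements.in -- loose, human-edited semver constraints:
Whoosh>=2.5,<3.0

# $ pip-compile requirements.in     (pip-compile ships with the pip-tools package)

# requirements.txt -- exact generated pins, checked into version control:
whoosh==2.5.7
```

A fresh checkout then installs the pinned set with `pip install -r requirements.txt`, while re-running `pip-compile` re-resolves within the declared ranges when you choose to update.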
If you're assuming a known-good Python stack, then a pure-Python solution will just work. Using Solr/Lucene means having a known-good Python stack and a known-good JVM. Python is installed by default with most (all?) Linux distributions, but Java is almost never part of the base install.
Also, a pure-Python solution allows for in-process indexing and search, rather than adding another external process that needs to be monitored and maintained.
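To make the "in-process" point concrete, here's a toy inverted index in stdlib-only Python. This is purely an illustration of the idea, not Whoosh's actual implementation (which layers scoring, persistence, and a query parser on the same principle):

```python
from collections import defaultdict

class TinyIndex:
    """Toy in-process inverted index: term -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Naive whitespace tokenizer; real engines normalize far more.
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: return documents containing every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        hits = set(self._postings.get(terms[0], ()))
        for term in terms[1:]:
            hits &= self._postings.get(term, set())
        return hits

idx = TinyIndex()
idx.add("doc1", "the quick brown fox")
idx.add("doc2", "the lazy brown dog")
hits = idx.search("brown")        # matches both documents
one = idx.search("quick brown")   # matches only doc1
```

There's no daemon to start, monitor, or restart; the index lives inside the application's own process, which is exactly the operational simplification being described.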
Understood, but I was addressing whether 'pure Python' is a selling feature, which it undeniably is.
edit: I don't mean to say that it is a selling feature for everyone (i.e. people who aren't already using Python) but for those who are already using Python it definitely is.
Having had nothing but extremely negative experiences with production JVM deployments, anything being not-JVM-based is a huge selling point for me. Furthermore, "pure python" itself means that I can run it in any Python implementation I want, including pypy, without worrying about any C modules.
All of the trouble I've had has been with compiling the Python–JVM bridge code and linking it correctly against my Java binaries at build time. Once I got that working, I've found Lucene to be quite simple - and faster than Whoosh!
What kinds of JVM issues did you run into in deployment? For personal projects I've always used C++, Python, RoR, MEAN, etc., but at work, where my role is non-dev, our software dev team likes Java and the heavyweight GWT... frankly shutting me out. I'd love to know how to convince them to switch out of Java.
Memory issues out the ass. Which are obviously all in my head because there are no memory leaks in Java. I must be doing it wrong. Am I sure I know how to multiply? Of course you shouldn't be using version 1.2.3.4.5, our code only works with 1.2.3.4.4, everybody knows that. Well, except for that code, which of course you have to run on 1.2.3.4.4c. And this third-party daemon over here, that only runs on an obscure 6-year-old JVM and you have to put it in /var/deadchicken/ and insert three squirrels before executing.
You won't convince them, they've already jumped off the cliff. It's all perfectly normal to them. If you want to do development, find a company that hasn't been contaminated yet.
As others have posted, this is indicative of bad code. Java can certainly leak memory in the form of retained live references.
> Of course you shouldn't be using version 1.2.3.4.5, our code only works with 1.2.3.4.4, everybody knows that.
That's completely independent of the language or runtime being used, and is purely a project management issue (which breaking changes go into which version).
> that only runs on an obscure 6-year-old JVM
If there is code that imports classes specific to a JVM (com.sun, etc), then that code is doing something pretty much universally agreed on in the Java community to be a 'bad thing'. Otherwise, bytecode from Java 1.0 still runs on the latest JVM without issue.
You can use any language or runtime badly. There is a lot of code out there written in Java, and a lot of 'commodity' Java developers writing it. Of course there is going to be a greater volume of bad code, that's not an indictment of the platform itself.
> That's completely independent of the language or runtime being used, and is purely a project management issue (which breaking changes go into which version).
That passage was about the runtime. At one point, I had to deal with four different JVMs in the same server farm.
> You can use any language or runtime badly.
No other ecosystem has so consistently fucked me over. Not even PHP. I actually prefer Java to PHP for just writing code, but PHP to absolutely anything running on the JVM for operations.
Know when I stopped regularly getting woken up in the middle of the night? 2011, when I stopped supporting anything on a JVM. Ironically, it would have been two years earlier, but I foolishly self-inflicted Cassandra on myself. Never again.
But it sounds like your problem isn't Java/JVM, but rather really shitty code. If it breaks when you change minor versions of the JVM, the code is broken.
Just from my own personal experience, you've just indicted code from three well-known SV companies, at least two more non-SV fortune-500s, a mess of B2B vendors, the Apache foundation, and a few minor, independent open-source projects. I've heard stories from friends in a few other companies about both internal and third-party code.
At some point you reach the conclusion that the problem with an ecosystem is more fundamental than a few lazy developers. But even if not, your statement isn't actionable. If all the code in a certain set happens to be really shitty, the only rational choice is to not use that code. Any other argument ends up being one only of semantics.
You'll have all seen the pattern. The sort of people who use language X tend to be the same people who use VCS Y tend to be the same people who think doing Z is a good idea tend to be the same people who don't see any problem with IDE W and protocol V.
Culture (by definition) is something that propagates amongst a community, and sometimes it's just poisonous. Or at least seems poisonous from an outsider's point of view. Except if it's java, in which case there's no two ways about it.
I think I've deployed only one Java web app (puppet-db) that doesn't have huge memory-leak issues. With the lack of process recycling in Tomcat to mitigate "bad code", I cringe every time I have to deploy another one.
I have two responses to that, which may seem contradictory but are both heartfelt:
1) Your company uses Maven? You're lucky!
2) Maven? The prosecution rests.
> Please educate me, what am I missing out on?
You're happy. I'm obviously just another annoying, delusional ops guy who doesn't know what he's talking about. Why would you think you're missing out on anything? Go about your business. Everything is just fine.
Don't be so bitter. Your post above was informative to me, and it's telling that you alluded to the community blaming ops troubles on "bad code", and the two people who replied to you did just that.
Solr is pretty good stuff, but I've had a bit more luck clustering and scaling Elasticsearch horizontally with many hundreds of terabytes of data. From what I've read, however, newer Solr versions took a lot of ideas from Elasticsearch, and the two are somewhat comparable at the moment.
Whoosh is great in that it fits the Django philosophy of building sites fast, but I wouldn't use it for anything harder than a small site search. Once my site grew larger than 50 MB of text, queries started slowing things down. Indexing (Python, single-threaded) took a while, and the larger the index grew, the slower queries were returned.
It's probably the best first-iteration search app that I've come across, and you can always slip in Solr or something with more oomph when you need it.
For very small datasets, Whoosh is fine. For anything larger than 'site search' on a website that has no UGC, you're much better off with Solr (or ES, Xapian or Sphinx - whatever your poison).
When document counts get into the hundreds of thousands, you see orders of magnitude faster queries with Solr (etc.), not to mention much more sophisticated querying options.
> Once my site grew larger than 50 mb of text queries started slowing things down. Indexing (python, single threaded) took a while and the larger the index the slower the queries were returned.
I think this makes it pretty clear. Not sure how much more explanation you need.
Whoosh is fantastic. I'm using it on an ecommerce web project just now for indexing CMS pages, products, datasheets and product hierarchy (categories, product families, etc).
FWIW I'm using Flask, Flask-SQLAlchemy, Whoosh and Flask-WhooshAlchemy [0]. Quick to get going with but the mix-and-match approach lets me easily rip pieces out as the project grows.
It works well enough for me. I can't say how large my index set is (because I honestly don't have the figures to hand), but on the current project, searching around 50,000 products, 10,000 product families and a whole lot of associated data (product attributes, datasheets, etc), an uncached search takes around 65ms.
To put it in context, the psycopg2 calls to PostgreSQL take about 100ms to retrieve the associated data once I've found my search results with Whoosh.
(Most of my response time sits with SQLAlchemy ORM, building up matrices of data, which is why in production I'm caching the more complex queries with memcached).
Overall, for a project of this size (I can't imagine having to index more than a few hundred thousand objects) I'm very happy with Whoosh. If I get to the point of indexing millions of objects, I'll optimize it then.
Not in my experience. I put ID3 tags from ~32K songs in it, and a query for an artist took around 700ms. The same query in ES was like 10ms, and the same query against a SQLite DB (select * from track where artist=...) was about the same as ES. The Whoosh docs make these claims that it's fast, but I haven't seen anything to back up the claims in unit tests or otherwise. I'm sure it's still useful for some purposes, but ES is so easy to set up that I don't bother.
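For what it's worth, the SQLite side of that comparison is easy to reproduce with the stdlib alone. The schema, row counts, and artist names below are made up to mimic a ~32K-song library; with an index on artist, the lookup is effectively instant:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (id INTEGER PRIMARY KEY, artist TEXT, title TEXT)")
# 32,000 synthetic tracks spread across 1,000 artists (32 tracks each).
conn.executemany(
    "INSERT INTO track (artist, title) VALUES (?, ?)",
    [("Artist %d" % (i % 1000), "Song %d" % i) for i in range(32000)],
)
conn.execute("CREATE INDEX idx_track_artist ON track (artist)")
conn.commit()

start = time.perf_counter()
rows = conn.execute("SELECT * FROM track WHERE artist = ?", ("Artist 42",)).fetchall()
elapsed_ms = (time.perf_counter() - start) * 1000.0
```

An exact-match lookup like this plays to a B-tree index's strengths, of course; a full-text engine is doing strictly more work, which is what makes the 700ms figure stand out.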
For example, you could use Whoosh in a development environment and Elasticsearch in production if you need something more robust.
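With Haystack that swap is just configuration. A sketch of a Django settings fragment, assuming a stock django-haystack install (the DEBUG-based switch, paths, and index name are illustrative):

```python
import os

BASE_DIR = os.getcwd()  # stand-in for a real Django project's BASE_DIR
DEBUG = True            # True in development, False in production

if DEBUG:
    # Development: pure-Python Whoosh index on the local filesystem.
    HAYSTACK_CONNECTIONS = {
        "default": {
            "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
            "PATH": os.path.join(BASE_DIR, "whoosh_index"),
        },
    }
else:
    # Production: Elasticsearch over HTTP.
    HAYSTACK_CONNECTIONS = {
        "default": {
            "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
            "URL": "http://127.0.0.1:9200/",
            "INDEX_NAME": "haystack",
        },
    }
```

Application code written against Haystack's `SearchQuerySet` API is unchanged either way; only the backend engine differs between environments.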
[1] http://haystacksearch.org/