On parsing: it's actually a somewhat tedious process that involves digesting a couple of GB of data, but here is the bottom line.
I look for amazon.com links in the content in general - I will broaden that to other publishers and full-text extraction too later on.
The content itself comes from the StackOverflow dump (for SO) and a mixture of a crawler allowed by PG + the previous database dump that was available at some point.
I extract all the books, quotes, and user data from both, conform these to a common schema, and index the whole result.
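To make the link-spotting step concrete, here is a minimal sketch of how Amazon book links could be pulled out of post bodies and normalized to an ASIN (the stable product identifier). The regex and function names are illustrative assumptions, not the site's actual code.

```python
import re

# Match the common amazon.com product URL shapes (/dp/ASIN and
# /gp/product/ASIN) and capture the 10-character ASIN.
AMAZON_LINK = re.compile(
    r"https?://(?:www\.)?amazon\.com/(?:[^\s\"']*/)?(?:dp|gp/product)/([0-9A-Z]{10})"
)

def extract_asins(html_body):
    """Return the unique ASINs referenced in a chunk of post HTML."""
    return sorted(set(AMAZON_LINK.findall(html_body)))

post = ('I recommend <a href="http://www.amazon.com/dp/0132350882">'
        "Clean Code</a> and http://amazon.com/gp/product/0201633612 too.")
print(extract_asins(post))  # ['0132350882', '0201633612']
```

Normalizing everything to an ASIN is what lets quotes from SO and HN for the same book collapse into one record under the common schema.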
Hope I answered your question properly - feel free to ask again if you'd wish.
On ranking: I know what you mean! I need to find some way to balance the number of quotes with textual relevance, which requires me to dive a bit more into Solr. I currently use textual relevance first because it gives more useful results so far.
For ranking, instead of textual relevance (which will be hard to get right :/) and # of quotes (it can be easily hacked/spammed, and newly announced books would carry huge weight), I suggest you check timelines: if a book is quoted once or twice a month, regularly, I'm pretty sure it's worth reading.
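The timeline idea above can be sketched as a regularity score: reward books quoted steadily month after month rather than in one burst, then blend that with textual relevance. Everything here (function names, the 12-month window, the 0.7/0.3 weights) is an illustrative assumption, not the site's actual formula.

```python
from datetime import date

def month_index(d):
    """Map a date to a monotonically increasing month number."""
    return d.year * 12 + (d.month - 1)

def regularity_score(quote_dates, now, window_months=12):
    """Fraction of the last `window_months` months with at least one quote."""
    months = {month_index(d) for d in quote_dates}
    current = month_index(now)
    hits = sum(1 for i in range(window_months) if (current - i) in months)
    return hits / window_months

def rank_score(text_relevance, quote_dates, now):
    # Arbitrary blend of Solr-style textual relevance and quote regularity.
    return 0.7 * text_relevance + 0.3 * regularity_score(quote_dates, now)

now = date(2011, 12, 31)
steady = [date(2011, m, 15) for m in range(1, 13)]  # quoted every month
burst = [date(2011, 6, d) for d in range(1, 13)]    # 12 quotes, one month
print(regularity_score(steady, now))  # 1.0
print(round(regularity_score(burst, now), 3))  # 0.083
```

A spammer posting many links in one week barely moves the steady-usage signal, which is exactly the property the suggestion is after.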
I'd love to read about the architecture behind the site... yep, technically curious :)
I will offer a more advanced search with similar features, definitely!
On the architecture: I'll create a side-blog that will outline all I learned while working on this. It's been a crazy ride actually (especially because I started using chef and vagrant full speed).
Tweaking relevance is not super-easy with Solr. We designed IndexTank to have a very simple way to play with ranking. You can modify your formulas in your dashboard or through the API and see the results order change in real time.
I really love what you got here. I'd be happy to help you try out IndexTank and make it better. It would really take the Solr configuration burden off of you.
Well, I considered using IndexTank earlier on, especially because I didn't know yet how to deploy Solr. The relevance is mostly done already; I just needed to learn to use formulas :)
One thing that put me off is your cap in queries per day. The smallest paid plan (50k items in index) is capped at 1,000 queries per day.
Isn't that an issue for most sites? Do people usually cache your results?
The 1k cap for the 50k doc plan is old, we will be upping that significantly. What would be a number of queries per day that would make you comfortable?
Questions:
Apart from the "Quoted by" section, is any of the content from SO or HN displayed on the site? E.g. are any comments incorporated in the book descriptions, or are those written by you/wife?
Currently, no content from HN/SO is displayed apart from the Quoted by area.
We're not editing anything manually; what I will do is display the actual conversations in the "Quoted by" area when you click on a conversation.
I may move the quotes above to make them stand out more.
Aaah - I see :) The book description is provided by the Amazon API itself; I'm not aware of any issue with publishing it if you are a registered API user.