Hacker News

Glad you like it! Really :)

On parsing: it's actually a somewhat tedious process that involves digesting a couple of GB of data, but here is the bottom line.

I look for amazon.com links in the content in general - later on I will broaden that to other publishers and add full-text extraction too.

The content itself comes from the StackOverflow dump (for SO) and a mixture of a crawler allowed by PG + the previous database dump that was available at some point.

I extract all the books, quotes, and user data from both, conform these into a common schema, and index the whole result.
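As a rough illustration of the link-extraction step: Amazon product URLs usually embed a 10-character ASIN after `/dp/` or `/gp/product/`. The regex below is an assumption covering common URL shapes, not the exact pattern the site uses.

```python
import re

# Hypothetical pattern for common amazon.com product-link shapes;
# captures the 10-character ASIN that identifies the book.
ASIN_RE = re.compile(
    r"https?://(?:www\.)?amazon\.com/(?:[\w\-%]+/)?(?:dp|gp/product)/([A-Z0-9]{10})"
)

def extract_asins(text):
    """Return the unique ASINs found in a blob of comment text/HTML."""
    return sorted(set(ASIN_RE.findall(text)))

comment = ('I recommend <a href="http://www.amazon.com/dp/0132350882">'
           'Clean Code</a> and http://www.amazon.com/gp/product/0201633612')
print(extract_asins(comment))  # ['0132350882', '0201633612']
```

The extracted ASINs can then be joined against the Amazon API to fetch titles and descriptions before indexing.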

Hope I answered your question properly - feel free to ask again if you wish.

On ranking: I know what you mean! I need to find some way to balance the number of quotes with textual relevance, which requires me to dive a bit deeper into Solr. I currently use textual relevance first because it gives more useful results so far.
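One common Solr approach to this kind of balancing (an assumption here, not the site's actual config) is an edismax query with a multiplicative `boost` function, using a log dampener so heavily quoted books don't drown out textual matches. The field names below are hypothetical:

```text
q=distributed systems
&defType=edismax
&qf=title^2 description
&boost=log(sum(quote_count,1))
```

`sum(quote_count,1)` avoids `log(0)` for books with no quotes yet, and the logarithm keeps a book with 500 quotes from scoring only modestly above one with 50.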



Thanks :)

For ranking, instead of textual relevance (which will be hard to get right :/) and # of quotes (it can easily be hacked/spammed, and newly announced books will have a huge weight in the ranking), I suggest checking timelines: if a book is quoted once/twice/... a month regularly, I'm pretty sure it's worth reading
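The timeline idea could be prototyped roughly like this: bucket quotes by month and favor books quoted in many distinct months over books whose quotes all cluster in one announcement spike. This is a sketch of the suggestion, not an implemented feature:

```python
from collections import Counter

def regularity_score(quote_dates):
    """quote_dates: list of (year, month) tuples, one per quote.

    Counts distinct months rather than raw quotes, so a book quoted
    once a month for six months beats one quoted six times in the
    week it was announced.
    """
    months = Counter(quote_dates)
    return len(months)  # distinct months with at least one quote

steady = [(2011, m) for m in range(1, 7)]  # quoted monthly for 6 months
spike = [(2011, 6)] * 6                    # 6 quotes in a single month
print(regularity_score(steady), regularity_score(spike))  # 6 1
```

A real version would likely blend this with recency and textual relevance rather than use it alone.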

I'd love to read about the architecture behind the site... yep, technically curious :)


I will offer a more advanced search with similar features, definitely!

On the architecture: I'll create a side blog that will outline all I learned while working on this. It's been a crazy ride actually (especially because I started using Chef and Vagrant full speed).

I'll post it back here in all cases.


Tweaking relevance is not super-easy with Solr. We designed IndexTank to have a very simple way to play with ranking. You can modify your formulas in your dashboard or through the API and see the results order change in real time.

I really love what you've got here. I'd be happy to help you try out IndexTank and make it better. It would really take the Solr configuration burden off of you.
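The selling point here is editing the scoring formula without reindexing. A toy illustration of that idea (local Python, not IndexTank's actual API) keeps per-document variables and re-evaluates a swappable formula at query time:

```python
import math

# Hypothetical documents: a textual relevance score from the engine,
# plus a per-document variable (number of quotes).
docs = [
    {"title": "Book A", "relevance": 0.9, "quotes": 2},
    {"title": "Book B", "relevance": 0.6, "quotes": 40},
]

def rank(docs, formula):
    """Re-rank documents with a tweakable scoring formula, mimicking
    how a hosted search service lets you change ranking on the fly."""
    return sorted(docs,
                  key=lambda d: formula(d["relevance"], d["quotes"]),
                  reverse=True)

# Pure relevance puts Book A first...
print([d["title"] for d in rank(docs, lambda rel, q: rel)])
# ...while blending in log(quotes) flips the order in Book B's favor.
print([d["title"] for d in rank(docs, lambda rel, q: rel * math.log(q + 1))])
```

In a real engine the formula change happens server-side, so results reorder immediately without touching the index.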


Hi!

Well, I considered using IndexTank earlier on, especially because I didn't know yet how to deploy Solr. The relevance is mostly done already; I just needed to learn to use formulas :)

One thing that put me off is your cap on queries per day. The smallest paid plan (50k items in the index) is capped at 1,000 queries per day.

Isn't that an issue for most sites? Do people usually cache your results?


The 1k cap for the 50k doc plan is old, we will be upping that significantly. What would be a number of queries per day that would make you comfortable?


That's good news, I think :)

I can't really tell yet, but my guess is that a daily query cap at least around the number of indexed documents would make for a more usable subscription.

It would also be nice for people to know what happens if you go beyond the cap: do you offer some tolerance?


We don't deny service no matter how far beyond the cap you go. If the cap is exceeded very often, we just contact the user.


Good to hear! Thanks for all those clarifications.


Looks like a very useful site, thanks!

Questions: Apart from the "Quoted by" section, is any of the content from SO or HN displayed on the site? E.g. are any comments incorporated in the book descriptions, or are those written by you/wife?


Glad you like it!

Currently, no content from HN/SO is displayed apart from the Quoted by area.

We're not editing anything manually; what I will do is display the actual conversations in the "Quoted by" area when you click on a conversation.

I may move the quotes above to make them stand out more.

Did I properly answer your question?


Ok, so the book description is written manually?

Was just wondering if the content might be touched by any copyright issues, etc.


Aaah - I see :) The book description is provided by the Amazon API itself; I'm not aware of any issue with publishing it if you are a registered API user.


Ok, thanks! :-)


Are you making any revenue from referrals?


The site has just been released, so it's currently $0 :)

We'll see how it goes.



