ArchiveTeam already has a system for this. It's called "the warrior": a little VirtualBox image that checks in with a centralized tracker to get jobs and upload rescued data. You can see the progress on the "upcoming" project here: http://tracker.archiveteam.org/upcoming/
Still, it's much more complex than a simple webapp (so the audience willing to run it will be smaller). If there are ~12 million events, we'd need 10,000 people each downloading 1,200 pages...
(On the leaderboard, the number of "to do" items goes up instead of down: what does that mean?)
I don't know for sure, but I expect it's standard for any spidering. Say you start at one page and find 10 links; your tracker then has 10 items. If each of those pages has 10 links, you can then have up to 100 items to download. And so on.
Over time you start seeing duplicates (i.e., things you've already archived/spidered), so the number goes down. But while you're still in the 'discovery' stage it's going to go up and up.
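To make that concrete, here's a toy breadth-first spider over a made-up link graph (the site structure and names are invented for illustration, not Upcoming's). After each page is processed it records how many items the tracker would hold; the count climbs fast during discovery and then flattens once most links are duplicates:

```python
from collections import deque

def crawl_counts(links, start):
    """Breadth-first spider over a link graph. After processing each
    page, record how many items the tracker holds (every page
    discovered so far, finished or still queued)."""
    seen = {start}
    queue = deque([start])
    history = []
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:   # duplicate links add nothing
                seen.add(target)
                queue.append(target)
        history.append(len(seen))
    return history

# Toy site: a handful of pages, some linking back to each other.
site = {
    "a": ["b", "c", "d"],
    "b": ["e", "f", "a"],   # link back to "a" is a duplicate
    "c": ["f", "g"],        # "f" was already found via "b"
    "d": [],
    "e": ["h"],
}
print(crawl_counts(site, "a"))  # [4, 6, 7, 7, 8, 8, 8, 8]
```

The item count rises steeply early on (4, 6, 7...) and stalls at 8 once every newly crawled page only yields links the tracker has already seen.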
But here's what the OP said: "All of Upcoming's events and venues use autoincremented ids, making it dead simple to generate a list of URLs to scrape."
So if you go from 1 to {current_id} you get them all, and you have the list at the start.
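With autoincremented ids, "generating the list" really is one line. A sketch, assuming the event pages live at `upcoming.yahoo.com/event/{id}/` (the exact URL pattern here is a guess):

```python
def event_urls(current_id):
    """Every event URL from id 1 up to the newest known id.
    The URL pattern is assumed, not taken from Upcoming's docs."""
    return ["http://upcoming.yahoo.com/event/%d/" % i
            for i in range(1, current_id + 1)]

urls = event_urls(12_000_000)
print(len(urls))        # 12000000
print(urls[0])          # http://upcoming.yahoo.com/event/1/
```

No discovery phase needed: the whole work queue exists before the first request is made.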
Sometimes the "items" can get huge. Usually one user counts as one item, so some users run into multiple gigabytes and take days to finish. How would you handle that with a webapp?
I'm not sure about the details for this project, but sometimes it's not easy to get a full list of the items you want up front. So the warrior process catalogs new items as it encounters them and updates the list on the tracker.
Ok, I get that now; but in this particular project every event id is known (see my other comment below).
A webapp would simply need to receive a list of ids and download one page at a time (with a comfortable delay between calls so as not to trip Yahoo's security).
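The client side of that could be as small as the loop below. This is a sketch, not a real client: the URL pattern and the 2-second delay are assumptions, and the fetcher is injectable so the loop can be tried without actually hitting Yahoo.

```python
import time
import urllib.request

POLITE_DELAY = 2.0  # seconds between requests; the right value is a guess

def fetch_all(ids, fetch=None, delay=POLITE_DELAY):
    """Download one event page at a time, pausing between calls.
    `fetch` can be swapped out (e.g. for a stub in testing); the
    default uses urllib. The event URL pattern is an assumption."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    pages = {}
    for event_id in ids:
        url = "http://upcoming.yahoo.com/event/%d/" % event_id
        pages[event_id] = fetch(url)
        time.sleep(delay)  # comfortable gap so we don't look like an attack
    return pages

# Dry run with a stub fetcher (no network, no delay):
result = fetch_all([1, 2, 3], fetch=lambda url: url, delay=0)
print(sorted(result))  # [1, 2, 3]
```

The tracker would hand each volunteer a slice of ids, and the delay keeps any one client's traffic looking like a slow human browser rather than a scraper.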