
I really wish that the Internet Archive would provide bulk access to the Wayback Machine dataset. It would allow for a lot of interesting experimentation and research.


Is that even possible? I don't know the current size of the IA, but it must be ridiculously huge by now (1 billion pages added per week), so the bandwidth cost would be massive.

Maybe they could offer a mail-us-a-multi-petabyte-HDD service, with the drive returned a few weeks later full of data :)


It's totally possible: they already have the infrastructure in place and 14PB of data available for download. Unfortunately, the Wayback Machine data is not currently exposed publicly.
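
For a rough sense of scale, here is a back-of-envelope sketch of how long a 14PB download would take. The link speeds are assumptions picked for illustration, a petabyte is taken as 10^15 bytes, and protocol overhead is ignored. The 14PB figure is the downloadable data mentioned above, not the size of the Wayback Machine dataset, which nobody in this thread has stated.

```python
# Back-of-envelope transfer time for a large dataset over a sustained link.
# Assumptions (not from the thread): decimal petabytes, no protocol overhead,
# and the link running at full speed the whole time.

PETABYTE = 10**15  # bytes, decimal definition
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Return the number of days needed to move size_bytes over the link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    dataset_bytes = 14 * PETABYTE  # the 14PB figure quoted above
    # Hypothetical link speeds, chosen only for comparison.
    for label, bps in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
        print(f"{label}: {transfer_days(dataset_bytes, bps):,.0f} days")
```

Under these assumptions a 1 Gbit/s link needs about 1,296 days (roughly three and a half years), which is why shipping drives can look attractive at this scale.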


Why do you think that is? They seem really open with most of their stuff, so why haven't they exposed the Wayback Machine through an API?

Then again, wouldn't it be pretty trivial to scrape?

(I say this as I'm working on a hellish scraping project, and the Wayback Machine seems like it would be a walk in the park to scrape by comparison.)


> I really wish that the Internet Archive would provide bulk access to the Wayback Machine dataset.

Have you asked them? Did they refuse outright?


I have not personally asked, but I think the Archive Team has. I don't know the reasons behind the policy.



