
So record actual user input data and generate similar input patterns stochastically.

That said, if you try to scale this up beyond what a reasonable, normal user would do in one sitting, you are bound to stand out.

Although, that said, I already trigger such rate-limiting mechanisms as a human, just by searching Google and clicking through every last search result page.
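
To make the "record real input, then generate similar patterns stochastically" idea concrete, here is a rough sketch. It assumes you already have a list of recorded pauses between a real user's requests, in seconds; the function name `sample_delays` and the `jitter` parameter are made up for illustration, and resampling with jitter is only one of several ways to do this.

```python
import random

def sample_delays(recorded, n, jitter=0.1, rng=None):
    """Draw n delays that resemble the recorded ones.

    recorded: pauses (in seconds) observed from a real user (assumed input).
    jitter:   fractional noise applied to each draw (hypothetical knob).
    """
    rng = rng or random.Random()
    delays = []
    for _ in range(n):
        base = rng.choice(recorded)                 # resample an observed pause
        noise = 1 + rng.uniform(-jitter, jitter)    # perturb it slightly
        delays.append(max(0.0, base * noise))
    return delays
```

You would then sleep for each sampled delay between requests. This only imitates timing; it does nothing about volume, which is the part that makes you stand out.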



You'd have to scrape slowly to mimic a real slow user. Maybe at that point it'd be cheaper to get Mechanical Turk to do it. That should solve IP rate limiting, captchas, and just about everything except the endless arms race. Sooner or later the site asks: why are so many people going directly to these same-formatted internal URLs without clicking through from random other places? So the site can change the internal URLs and break it all over again.


You'd use a browser extension, scoped to requests of sites you're interested in, and stream your data back to your infrastructure for processing. You're limited only by your install base and your ingest infrastructure.

RECAP [1] does this to extract PACER court documents, which are in the public domain but whose access is restricted due to draconian public policy.

[1] https://free.law/recap/


>You'd have to scrape slowly to mimic a real slow user.

Sure, but that's easily mitigated by running multiple scrapers as different users. You don't need to get all the data from a single scrape.
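
As a rough illustration of splitting the work, something like the following would hand each scraper its own slice of the URL list. The name `partition` and the round-robin split are just assumptions for the sketch; how each worker gets its own user identity is left out entirely.

```python
def partition(urls, n_workers):
    """Split urls round-robin into n_workers disjoint lists."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Worker i takes items i, i + n_workers, i + 2 * n_workers, ...
    return [urls[i::n_workers] for i in range(n_workers)]
```

Each worker then crawls only its own list at a slow, human-like pace, and the results are merged afterwards.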



