I still think there is an opportunity for a hosted browser vpn type service or a...

cookiengineer · on April 18, 2021

Except tracking services know about centralized VPN IP ranges that do not rotate. That's why everybody in algotrading switched to mobile apps and 3G/4G proxies that run on actual smartphones, and usually they have dozens of SIM cards lying around.

KirillPanov · on April 18, 2021

Er, wait, your comment sounds really interesting, but what is the threat model for algorithmic traders? Who are they trying to hide from and why?

Almost none of the web breaks simply because you're using a (very well known, very old) VPN IP. Purchases using a credit card are pretty much the only thing that is likely to get blocked. And you face a few more captchas sometimes.

Do algorithmic traders need to use something on the web that blocks VPN IPs more aggressively than this? They must, because juggling all those SIM cards sounds like a huge headache, and cell phone data has awful latency. I'm wondering what they're scraping that the average person doesn't use, and why they want to look like an average person if that's the case.

cookiengineer · on April 18, 2021

AFAIK it's mostly details, news, stories and other public information about a company, which they factor in to the real HFT data. It's limited to that kind of data because of - as you already said - the high latencies in mobile networks.

The issue that arises is mostly Cloudflare-related: they have a huge influence on hosting, and the recaptchas they force when they detect anomalies in traffic behaviour make a web scraping workflow real shitty.

From an algotrader's point of view it's a fix to make their web scrapers work again. I'm not sure how Chrome/ium headless could fix this (if it could). I'm a bit sceptical as it's just yet another cat and mouse game.

But as of late I've seen a huge scene start to develop around extension-building for headless Chrome specifically, so that they can run headless and still get the data the way they want it by integrating a content script that sends the data to another service.

KirillPanov · on April 18, 2021

Ah, thanks, now I get it: it doesn't bother them that fingerprinting lets them be tracked; it bothers them that fingerprinting (and VPN IPs) mean their scrapers get hit with captchas.

FWIW, it takes very little effort to completely conceal the fact that Firefox is running headless under marionette/webdriver/geckodriver. Chrom(ium) takes more effort, but these guys have solved it (and built a business around it):

https://intoli.com/blog/making-chrome-headless-undetectable/

https://github.com/intoli/remote-browser

Of course neither of these addresses fingerprinting -- all your scrape requests will have the same or similar fingerprint, which will lead to captchas pretty quickly. This might help (and might even be part of the reason for buying piles of cellphones):

https://news.ycombinator.com/item?id=25379342
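
To illustrate the fingerprinting point above, here is a minimal sketch of why rotating IPs alone does not help. It assumes a tracker that simply hashes a handful of browser-visible attributes; the attribute set, the `fingerprint` and `requests_from` helpers, and the IP addresses (taken from the reserved documentation ranges) are all made up for illustration, and real trackers combine far more signals.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Hash browser-visible attributes into a short identifier (hypothetical scheme)."""
    # Sort keys so the same attributes always serialize to the same string.
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Assumed attribute set for one headless scraper; not what any real tracker uses.
SCRAPER_PROFILE = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "screen": "1920x1080",
    "timezone": "UTC",
    "languages": ["en-US", "en"],
    "fonts": ["DejaVu Sans", "Liberation Serif"],
}


def requests_from(ips, profile):
    """Pair each exit IP with the fingerprint the tracker would compute."""
    return [(ip, fingerprint(profile)) for ip in ips]


# Same browser profile sent through three different proxies.
observed = requests_from(["203.0.113.5", "198.51.100.7", "192.0.2.44"], SCRAPER_PROFILE)
distinct_ips = {ip for ip, _ in observed}
distinct_fingerprints = {fp for _, fp in observed}

print(len(distinct_ips), "IPs,", len(distinct_fingerprints), "fingerprint(s)")
# prints: 3 IPs, 1 fingerprint(s)
```

In this toy model the three requests arrive from three addresses but collapse to one fingerprint, which is the kind of anomaly that would plausibly trigger a captcha. Changing any single attribute (say, the timezone) yields a different hash, which is roughly what a pile of distinct physical phones would give you for free.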

nextaccountic · on April 18, 2021

How do you make headless Firefox undetectable to fingerprinting?

sixothree · on April 19, 2021

Reason 1 of 1000 why I am not working on any projects related to this.