Hacker News

I'll bite harder. That's how the public Internet works. If you don't trust clients at all, serve them a login page instead of content.


In fairness this appears to be the direction we are headed anyway


This is how it's going. Half the websites I go to have Cloudflare captchas guarding them at this point. Every time I visit StackOverflow I get a 5 second wait while Cloudflare decides I'm kosher.


Are you using Tor or a VPN, spoofing your User-Agent to something uncommon, or doing anything else that adds extra privacy?

That kind of user experience is one that I've seen a lot on HN, and every time, without fail, it's because they're doing something that makes them look like a bot, and then being all Surprised Pikachu when they get treated like a bot by websites.


I started having similar experiences when I switched to Brave, which blocks a lot of tracking. Many websites that never used to show me captchas or Cloudflare protection layers now pop them up regularly.


I use regular Edge with uBlock and get cloudflare crapchas all the time.


I've found Vivaldi to be a much better experience: Chromium-based, and it supports uBlock in all its glory.


I tried it and didn't like it. Currently migrating to Firefox though, their vertical tab implementation is really decent.


I get this and I assume it's because I clear cookies pretty frequently. It used to be the case that that didn't matter, but nowadays everyone shields their websites using JS.


It sucks that we're living in a landscape where bad actors take advantage of that way of doing things.


The really bad actors are going to ignore robots.txt entirely. You might as well be nice to the crawlers that respect robots.txt.
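Respecting robots.txt from the crawler side takes only a few lines. A deliberately simplified sketch in JavaScript (it only handles `User-agent: *` groups and prefix `Disallow` rules; a real parser also handles `Allow`, wildcards, and per-agent sections):

```javascript
// Minimal robots.txt check: collect Disallow rules under "User-agent: *"
// and test a path against them. Simplified on purpose -- real parsers
// also handle Allow, wildcards, and specific per-agent groups.
function parseRobots(robotsTxt) {
  const disallows = [];
  let inStarGroup = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    if (!line) continue;
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    switch (field.trim().toLowerCase()) {
      case "user-agent":
        inStarGroup = value === "*";
        break;
      case "disallow":
        if (inStarGroup && value) disallows.push(value);
        break;
    }
  }
  return disallows;
}

function allowed(disallows, path) {
  return !disallows.some((prefix) => path.startsWith(prefix));
}

const rules = parseRobots(`
User-agent: *
Disallow: /admin/
Disallow: /search
`);
console.log(allowed(rules, "/admin/users")); // false
console.log(allowed(rules, "/posts/42"));    // true
```

A polite crawler would call `allowed` before every fetch and simply skip disallowed paths.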


Even if you want to play nice, robots.txt is a catch-22: accessing it is taken as a bot signal by misconfigured anti-bot 'solutions'.


It sucks more that Cloudflare/similar have responded to this with "if your handshake fingerprints more like curl than like Chrome/Firefox, no access for you".


I now write all of my bots in JavaScript and run them from the Chrome console with CORS turned off. It seems to defeat even Google's anti-bot stuff. Of course, I need to restart Chrome every few hours because of memory leaks, and the last time I got banned from their ecosystem it wasn't a fun three days, with my kids asking why they couldn't watch YouTube.
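A console bot along those lines is essentially a fetch loop with throttling. A hypothetical skeleton (the fetch function is injectable so the loop can be dry-run without touching the network; in the Chrome console you would pass the real `fetch`, with CORS relaxed by launching Chrome with `--disable-web-security`):

```javascript
// Hypothetical console-bot skeleton: visit URLs with a pause between
// requests so the traffic looks less bursty. fetchFn is injectable so
// the scheduling logic can be exercised without a network.
async function crawl(urls, fetchFn, delayMs = 2000) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchFn(url));
    // Naive fixed throttle; real bots add jitter to avoid a steady cadence.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return results;
}

// Dry run with a fake fetch -- no requests leave the machine.
const fakeFetch = async (url) => `fetched ${url}`;
crawl(["https://example.com/a", "https://example.com/b"], fakeFetch, 10)
  .then((r) => console.log(r));
```

In the console you'd swap `fakeFetch` for `fetch` and parse each `Response` as needed.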


Where can I learn more about custom bots in JS and Chrome?


Or getting a CAPTCHA from Chrome when visiting a site you've been to dozens of times (Stack Overflow). Now I just skip that content, probably in my LLM already anyway.


Keep in mind that those LLMs are one of the bigger reasons why we see more and more anti-bot behaviour on sites like SO.

The aggressive crawling to train them on everything is insane.


It's the same as anti-piracy ads: you only annoy legitimate customers. This aggressive captcha campaign just makes Stack Overflow decline even faster than it otherwise would, by making it lower quality.


There are tools like curl-impersonate (https://github.com/lwthiker/curl-impersonate) that let you pretend to be any browser you like. It might take some trial and error, but this mechanism can be bypassed with some persistence in identifying what the resource is actually trying to block.


Bad actors will always exploit whatever systems are available to them. Always have, always will.


Because if they played by the rules, they wouldn't be bad actors.



