You can't create your own link previewer, cloudflare will put a captcha in front...

dmix · on Nov 5, 2020

I’ve long said Cloudflare is a dangerous threat to the open internet as well as some privacy tools like TOR.

But it doesn’t always get much traction on here because both the founder and employees of cloudflare are quite popular users on HN. Some have given me brief, half-assed counter-answers that conveniently miss the harder questions, like a good PR person does (and which you seem to have gotten in your reply).

I hope every web admin gives it serious second thought before adopting Cloudflare. Just as with cellphone OSes/operators, the one thing I’d dream of is a tool that offers a limited set of what Cloudflare does (DDoS protection, hosting privacy layer) but is pro-internet and pro-privacy. They seem hostile to it in many ways, likely because it directly affects their bottom line.

The bigger question is whether such a tool could be created without all the downsides. For the two I listed, I think yes. But their web app security system is overly strict and bad for the internet IMO.

And I say that knowing they protect some serious defenders of human rights and face a lot of abuse from the ‘bad guys’. I just wish there were a better middle ground.

hombre_fatal · on Nov 5, 2020

> But it doesn’t always get much traction on here because both the founder and employees of cloudflare are quite popular users on HN.

I don't think it gets much traction because you're barking up the wrong tree. Also, suggesting that YC is out to silence you and that nobody actually has a counter argument isn't very good for traction, either.

Until my website can't get taken offline by a $5 rental of an internet-of-shit botnet, Cloudflare gives me and my users recourse against the bad actors of the world. (I also enjoy its host cloaking for my privacy)

You simply gloss over bad actors and attack one of the only solutions that works. The biggest threat to the open internet was its naive "there are no bad actors" design, not the people giving us one of the only bulwarks against bad design.

I agree with your last sentence that it would be nice to have a better middle ground, but notice that's not the "cloudflare bad" thesis of your comment.

The internet needs to be improved so that Cloudflare is redundant. It's not Cloudflare's fault that fundamental design oversights (like optional ISP egress filtering) have created a lucrative niche. And things like faster, unlimited data plans accessible by smart toasters and smart doorbells on top of the internet's naive architecture only entrench Cloudflare further.

gjs278 · on Nov 6, 2020

I hosted a server over a Comcast connection that was attacked all the time, and I was always able to figure it out without Cloudflare proxy blocking for me.

KMag · on Nov 6, 2020

Cloudflare even puts multiple captcha challenges for any request from the default browser on the Samsung S7 Edge. Granted it's an old phone at this point, and most users install Chrome on their phones, but I end up skipping a lot of websites on my phone rather than participate in furthering the misconception that "Chrome is the only browser".

pvg · on Nov 5, 2020

> because both the founder and employees of cloudflare are quite popular users on HN.

It seems a lot more likely people aren't finding your argument as convincing as you'd like. There are plenty of well-known users (and users who identify their employer) around whose companies' HN-perception fortunes change quite a bit over time.

wdb · on Nov 6, 2020

Normally skip those sites that ask for a Cloudflare captcha if the site isn't too important. Luckily this is the case most of the time.

Would be annoying when online banking or governmental sites start asking for them.

ShamelessC · on Nov 5, 2020

Edit: my bad. Misinterpreted your comment.

Can you elaborate on how Tor is a threat to the open internet? That's a non-obvious statement to me. I'm aware that it's compromisable via controlling exit nodes (NSA, various nations) but that's not really the threat profile for the average person. Are there any other reasons?

Because despite its flaws, afaik TOR is an attempt to make the internet _more_ open to those who are being surveilled.

What am I missing?

JosephRedfern · on Nov 5, 2020

I think OP is suggesting that Cloudflare is a threat to TOR, not that TOR is a threat to the internet.

input_sh · on Nov 5, 2020

Website owners can actually whitelist Tor traffic as a "country", but not a lot of them know/care/want to do that.

grumpitron · on Nov 5, 2020

My read of it was that CloudFlare is a threat to both the open internet and TOR, not that TOR was also a threat to the open internet.

m-ee · on Nov 5, 2020

I read that as Cloudflare is a threat to tools like Tor

secfirstmd · on Nov 6, 2020

Hey, try https://deflect.ca if you want an ethical DDoS protection service.

meowface · on Nov 6, 2020

Any company that a high percentage of web traffic is not only routed through but fully reverse-proxied by should, of course, always be a significant concern and be subject to extreme scrutiny. But why exactly do you think they're anti-internet and anti-privacy? To me it seems like being pro-internet and pro-privacy aligns both with their general incentives and their monetary incentives.

I genuinely think they're a net positive for and supporter of Tor users. Before, site owners and security providers who faced issues with abusive/malicious traffic behind Tor connections (spam, illicit content, security scanning, password stuffing) nearly always resorted to outright blocking all Tor exit node IPs, because they had no other feasible option. I've been in that position. Cloudflare at least provides any site owner an ability to easily allow the traffic; just with a fairly quick occasional bot check.

Additionally, as of 2018 they now have an "Onion Routing" option which site owners can enable, which results in Tor users being able to access your site 100% through the Tor network. As a result, Tor users no longer experience any captchas, load your site faster, and never have to touch the clearnet.

>But their web app security system is overly strict and bad for the internet IMO.

Their WAF seems to have a pretty low false positive rate, compared to others I've seen. (Though the flipside of that is it also has a pretty high false negative rate and isn't very helpful against a dedicated non-automated attacker, like many other WAFs.)

>But it doesn’t always get much traction on here because both the founder and employees of cloudflare are quite popular users on HN.

They do post a lot here, but I doubt that's really responsible for defensive responses from other HN users. The most common criticism I see here (presenting a captcha for people using Tor, which site owners can now disable) makes me think the majority of people making the criticism have never run large websites or worked infosec for any organization with a large website.

Tor is of course not a threat itself, but anecdotally I'd estimate 90 - 95% of traffic that the average website owner receives from Tor is highly abusive/malicious, and Cloudflare empirically estimated 94% as of 2016 (https://blog.cloudflare.com/the-trouble-with-tor/). And anecdotally, not only is a high percentage of Tor traffic malicious, in many cases a significant percentage of all malicious traffic is Tor traffic. Naturally, due to Tor by design making it impossible to distinguish the ~94% connections from the ~6%, it's extremely difficult to mitigate this without just blocking 100% of Tor traffic. This is obviously not Tor or anyone's fault; it's just a practical reality for website owners. This sort of situation will always be the case for any kind of robust privacy-protecting application.

Cloudflare is possibly the first free service that actually enables anyone to easily allow normal traffic from Tor without much increase in security/abuse risk. They seem explicitly pro-Tor, especially with the explicit Onion Routing feature that lets Tor users access your site 100% through the Tor network without ever experiencing captchas, and statements like in https://blog.cloudflare.com/the-trouble-with-tor/ and https://blog.cloudflare.com/cloudflare-onion-service/

One may certainly have lots of other justified, legitimate concerns regarding the company and their disproportionate control of a huge chunk of the internet and web, but I'm not sure how someone could read those, see how the traffic is handled in practice, and conclude they're anti-Tor or a dangerous threat to Tor.

rainingcatndogs · on Nov 5, 2020

And unfortunately, cloudflare is everywhere. This trend will make it even harder for projects like a new search engine to enter the game.

raverbashing · on Nov 5, 2020

Because if you don't have it, some a-hole will go and DDoS your site, or you want to prevent a hug-of-death because of reasons.

It seems a lot of issues happen because bad players are continually allowed to thrive. For example, everybody uses a big provider because they're the only ones that solved the spam issue.

cblconfederate · on Nov 5, 2020

cloudflare can just allow a fair crawl rate instead of a captcha on first request

beagle3 · on Nov 5, 2020

The problem is that bad actors can masquerade as a lot of independent clients (The first D in DDoS stands for "distributed").

Figuring out whether a site is under a DDoS attack or getting legitimate requests from many sources is a very hard problem, and can just be worded "telling good actors from bad actors" -- no simple solution works; also, who YOU consider a good actor and who the website owner considers a good actor may be at odds.

Most people (and CloudFlare by default) consider Facebook a good actor; but as far as I'm concerned, Facebook is as evil an actor as one can be.

cblconfederate · on Nov 5, 2020

> sources is a very hard problem

We're talking about virtually unknown blogs that get 1 http request from my server's IP, which is not blacklisted anywhere. It's not hard at all, i just think cloudflare's tech is not that good

hombre_fatal · on Nov 5, 2020

You're really pulling a "how hard could it really be??" to DDoS prevention?

You should at least be humbled by how few services can even offer DDoS protection that works against volumetric attacks and isn't just based on null-routing. The people with skin and money in the game might know something you don't.

cblconfederate · on Nov 5, 2020

here's how simple it is :

    if (!website.underDDoS && website.requestedTimesToday[ip] < 10) showCaptcha = 0;

beagle3 · on Nov 5, 2020

How do you implement "website.underDDoS"?

Through a proxy, mind you: CloudFlare makes its decision without access to your CPU or DB metrics, and doesn't know which page load times are legitimately slow and which aren't supposed to be.

cblconfederate · on Nov 6, 2020

how about "haven't had requests for the past 2 minutes". Again, i m talking about links to obscure blogs that barely anyone reads, let alone DDoSes

I think another comment here may be closer to the truth, CF may only be running heuristics on the user agent
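For what it's worth, the rule being argued about in this subthread can be written down in a few lines. This is only a sketch of the heuristic as stated here (skip the captcha when the site has had no requests for two minutes and the IP has made fewer than 10 requests today). The class name `CaptchaPolicy` and its fields are made up for illustration, and none of this is a claim about how Cloudflare actually decides.

```python
import time
from collections import defaultdict


class CaptchaPolicy:
    """Toy version of the rule from this subthread; not Cloudflare's logic."""

    def __init__(self, quiet_seconds=120, max_requests_per_ip=10):
        self.quiet_seconds = quiet_seconds              # "no requests for the past 2 minutes"
        self.max_requests_per_ip = max_requests_per_ip  # "requestedTimesToday[ip] < 10"
        self.last_request_at = None                     # time of the previous request, any IP
        self.requests_today = defaultdict(int)          # per-IP counter (daily reset omitted)

    def should_show_captcha(self, ip, now=None):
        now = time.time() if now is None else now
        # The site counts as quiet if nobody has hit it recently.
        site_is_quiet = (
            self.last_request_at is None
            or now - self.last_request_at >= self.quiet_seconds
        )
        ip_is_light = self.requests_today[ip] < self.max_requests_per_ip
        # Record this request before answering.
        self.last_request_at = now
        self.requests_today[ip] += 1
        return not (site_is_quiet and ip_is_light)
```

Note that with this definition any site that has seen traffic in the last two minutes falls back to showing the captcha, so it only addresses the obscure-blog case and says nothing about telling a DDoS from a hug of death on a busy site.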

beagle3 · on Nov 6, 2020

If hardly anyone reads or DDoSes them, why did they go to the trouble of setting up CloudFlare? It’s free for those obscure blogs, but it’s definitely a non trivial hassle. Usually people set it up only after they experienced their first attack.

I get it that you are upset Google gets to scrape them and you don’t. But bad actors really are making it difficult for everyone to just “be” on the internet.

cblconfederate · on Nov 6, 2020

i dont know! but they do it, everyone does it because everyone else does it. it s not unusual

londons_explore · on Nov 5, 2020

I got round it by just making sure the user agent is set to the latest version of Chrome rather than a version from a few years ago that I had hardcoded before. It seems Cloudflare's protection is pretty much "is your user agent in the top 10 user agents?".

Did you try that?
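Roughly what that looks like with nothing but the Python standard library, as a sketch. The User-Agent value below is a placeholder to be swapped for whatever a current browser sends, and `build_request` / `fetch_html` are names made up for this example. Whether it gets past any particular challenge is not guaranteed.

```python
import urllib.request

# Placeholder value: substitute the User-Agent string of a current browser release.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a urllib Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Fetch the page body; urlopen raises HTTPError on 4xx/5xx responses."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```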

cblconfederate · on Nov 5, 2020

I have, iirc it worked some times, but not always. Is it a reliable solution for you?

londons_explore · on Nov 6, 2020

It's at least a 95% reliable solution, which seems to be about the same as a real user sees.

sfifs · on Nov 6, 2020

Well if you have an easy solution that you think would work, why don't you put up a website, commission a DDOS attack from a skilled actor and try to demonstrate mitigation?

Companies pay big money to CloudFlare. If a simpler and cheaper solution is workable, they'll pay you instead.

SiempreViernes · on Nov 6, 2020

Just like telling if it's raining is easy but stopping rain once it has started is hard, the claim is that it's not hard to detect if a site is being ddosed.

beagle3 · on Nov 7, 2020

It is not at all easy to tell the difference between a DDoS and the slashdot effect (or HN hug of death, depending on your age). At least not without a man in the loop.

Kaze404 · on Nov 5, 2020

I use Zoho.com and I rarely get spam, if ever.

snazz · on Nov 5, 2020

Zoho isn't Google-size, but it isn't irrelevant, either. Sending mail from a self-hosted email server is far harder since the big providers might put it in spam or drop it even earlier.

dzhiurgis · on Nov 5, 2020

To add to sibling - running your own mail server is the only way to ensure your email is not read by someone else which is so messed up.

londons_explore · on Nov 5, 2020

> running your own mail server is the only way to ensure your email is not read by someone else

But any mail you send to someone else probably ends up read by Google/Microsoft anyway, since that's where their mailbox is.

Also, email security is a joke. It's 2020, and even TLS encrypted SMTP connections tend not to check for a valid certificate, making them trivial to MITM.
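To make the certificate point concrete, here is a sketch of what a client that does check looks like, using Python's `smtplib` and `ssl`. The function names and the default port are assumptions for illustration, and this says nothing about what any particular mail provider does on its side.

```python
import smtplib
import ssl


def verified_tls_context():
    """TLS context that checks the certificate chain and the hostname."""
    context = ssl.create_default_context()
    # These are already the defaults; spelled out to make the point explicit.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def send_with_verified_tls(host, sender, recipient, message, port=25):
    """Send a message, refusing to continue if the server's certificate is invalid."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls(context=verified_tls_context())  # raises ssl.SSLError on a bad cert
        smtp.sendmail(sender, recipient, message)
```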

55555 · on Nov 6, 2020

Practically speaking how does one MITM an SMTP connection? For example, from Google to Microsoft. They connect directly to the IP addresses they get from MX records + lookup. What's the actual threat vector/execution here?

londons_explore · on Nov 6, 2020

Anyone with hardware on the network path can do it... Or anyone who can inject BGP routes can do it too.

pvorb · on Nov 5, 2020

I use it as well and I get sooo much more spam than I get on Gmail.

bgirard · on Nov 5, 2020

Long term, a new HTTP META method would be interesting. I wonder if something like that has ever been considered. Providers like Cloudflare would hopefully be more lenient with these requests.

nbadg · on Nov 5, 2020

Huh. It's certainly an interesting idea! Strictly speaking, individual people could implement this today, since nonstandard HTTP verbs don't break anything that doesn't know to request with them. (It wouldn't be of much use, because clients wouldn't know to use it, but still -- something that could easily be prototyped).
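A quick prototype of that, as a sketch: a local server that answers a made-up `META` verb with a small JSON summary, and a client that sends it. The verb, the response shape, and the names `MetaHandler` and `demo` are all invented for this example; nothing here is a standard.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetaHandler(BaseHTTPRequestHandler):
    """Answers the made-up META verb with a small JSON summary of the page."""

    def do_META(self):  # BaseHTTPRequestHandler dispatches to do_<METHOD>
        body = json.dumps({"title": "Example page", "path": self.path}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):  # keep the demo quiet
        pass


def demo():
    server = HTTPServer(("127.0.0.1", 0), MetaHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("META", "/some/article")  # http.client accepts arbitrary verbs
        response = conn.getresponse()
        result = response.status, json.loads(response.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

A server that doesn't implement the verb would just answer with an error status, so a client could fall back to a normal GET.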

I don't think FAANG (or any other big players) would have much interest in making it happen in the standard though, since it would undercut their big-player advantage.

fiddlerwoaroof · on Nov 5, 2020

I wonder if

    Accept: application/json

Would be a reasonable alternative? Wasn't this supposed to be the point of content negotiation?

stingraycharles · on Nov 5, 2020

Maybe, but not really; seems like this thread is more about intent (“I just want a preview”) while content type is more about representation (“I want the content as json”). I can imagine that there will be websites that are actively using the Accept header to distinguish “regular visitors” from API clients, with their APIs at the same paths (didn’t Reddit do this at some point?), and thus your approach would break in this case.

I guess what this is really about is, I hate to say it, but something in the direction of the semantic web, where web servers (and in this case, CloudFlare et al) actually gain a deeper understanding of the content they serve, and a web browser / crawler being able to query this content directly.

fiddlerwoaroof · on Nov 5, 2020

It seems to me that what "previews" really want is an API for the page's content in a structured format: OpenGraph tags and other microformats are one representation, but it's annoying to have to load _all_ the HTML just to grab title and the OG tags.
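For reference, this is about all a previewer does with the HTML once it has it. A minimal sketch using the standard library's `html.parser`; the names `PreviewParser` and `extract_preview` are made up for this example, and real pages need more care (encodings, `name=` vs `property=`, relative image URLs).

```python
from html.parser import HTMLParser


class PreviewParser(HTMLParser):
    """Collects <title> and <meta property="og:*"> values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.og = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.og[prop] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_preview(html):
    """Return the page title plus any Open Graph properties found."""
    parser = PreviewParser()
    parser.feed(html)
    return {"title": parser.title.strip(), **parser.og}
```

The parsing is the easy part; the annoyance is still having to download the whole document to get at a handful of tags in the head.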

stickfigure · on Nov 5, 2020

Accept: text/preview

stingraycharles · on Nov 6, 2020

In what content type? Json? Xml? Html?

mtberatwork · on Nov 6, 2020

Doesn't the oembed spec [1] already solve this? I think the OP could solve their problem by simply creating an oembed endpoint with all the necessary meta data.

[1] https://oembed.com/
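For anyone who hasn't used it, the consumer side is just a GET with the page URL as a query parameter. A sketch of building that request URL; the endpoint below is hypothetical (each provider publishes its own), and `oembed_request_url` is a name made up for this example.

```python
from urllib.parse import urlencode


def oembed_request_url(endpoint, page_url, fmt="json", maxwidth=None):
    """Build a consumer request URL in the shape the oEmbed spec describes."""
    params = {"url": page_url, "format": fmt}
    if maxwidth is not None:
        params["maxwidth"] = maxwidth
    return endpoint + "?" + urlencode(params)


# Hypothetical provider endpoint, for illustration only.
print(oembed_request_url("https://example.com/oembed", "https://example.com/post/1"))
```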

67868018 · on Nov 7, 2020

Yea but when your request to fetch the oembed data is blocked by a CAPTCHA...

This is a real problem, we experience it in the Fediverse

sergiotapia · on Nov 5, 2020

The cloudflare and google captcha are terrible. It's so bad that at this point I just close the tab if they challenge me with it. I use Brave and always have Shields UP, it seems having it up makes the captchas extremely difficult. Mission accomplished I guess.

coderholic · on Nov 6, 2020

At https://host.io we scrape every registered domain once a month, and make the meta data available freely over an API. You could use that to get a title for a domain (although not for a URL that's not the main domain), eg:

    $ curl https://host.io/api/web/facebook.com?token=$TOKEN
    {
      "domain": "facebook.com",
      "rank": 2,
      "url": "https://www.facebook.com/",
      "ip": "157.240.11.35",
      "date": "2020-08-26T17:39:17.981Z",
      "length": 160817,
      "encoding": "utf8",
      "copyright": "Facebook © 2020",
      "title": "Facebook - Log In or Sign Up",
      "description": "Create an account or log into Facebook. Connect with friends, family and other people you know. Share photos and videos, send messages and get updates.",
      "links": [
        "messenger.com",
        "oculus.com"
      ]
    }

See https://host.io/docs for more details about the API and what else you can do with it (eg. find backlinks to domains, domains with the same adsense ID etc)

neilparikh · on Nov 5, 2020

> Frankly, i wish facebook or cloudflare offered their previewer as a free service, since most websites have them whitelisted.

Yup, and exposing just a few key pieces of information (title, and some of the meta/og tags) without the body would limit the potential for abuse, while still being fairly useful for legitimate uses.

qwerty456127 · on Nov 5, 2020

There are hardly any "illegitimate" uses. The web is meant to be machine-readable (we wouldn't have Google or anything nearly as convenient in the first place if it wasn't). Whatever has been published is public and should not come with artificial limitations on how you read and process it. Blocking crawling should be outlawed as it clearly is a monopolistic practice. E.g. I want to build my own crawler to index and categorize the web subset I choose for me. I believe this is a perfectly legitimate use. But they will probably try to stop me.

cblconfederate · on Nov 5, 2020

> Blocking crawling should be outlawed

That's overly broad. But maybe it should be illegal to have exceptions only for major monopolies.

sokoloff · on Nov 6, 2020

Turn it around at least for a few minutes. Does a website operator have to handle whatever arbitrary traffic you want to throw at them from your crawler?

They’re the ones choosing to use tech that’s blocking you. Proposing to make it illegal for them to make that choice or to speak to you differently than they speak to other users of their site may give you some idea of the resistance you’re likely to face to this proposal.

samoa42 · on Nov 7, 2020

i think there is a line somewhere along with being accountable in a business sense.

ie. if your internet host just hands out information, you are free to block/throttle as you please.

as soon as you are taking money (operate as a business), you are accountable and must not discriminate.

so: > Does a website operator have to handle whatever arbitrary traffic you want to throw at them

absolutely, yes!

seniorgarcia · on Nov 5, 2020

I don't get what value link previews add. Someone shares a link with me (on skype, slack, teams... whatever) and I care about the content because the person sharing it with me thinks I could/should care about it, or someone shares a link on an aggregator and then I don't think it is too much to ask for that someone to write a summary. If the link is worth sharing writing 1 sentence to explain why isn't too much to ask.

What is the value a link preview adds? And why should I, as a content provider care about the value you add? Cloudflare does something for me, what is your service doing for me and why should I whitelist you (or care about you)?

dasil003 · on Nov 5, 2020

They're sending you traffic.

Imagine Twitter or Facebook without link preview, it's much harder to use and overall reduces the chance I'll click on a link. Do you think only Twitter and Facebook should be allowed to publish previews?

def_true_false · on Nov 5, 2020

Half the time the link preview picks the wrong picture and sometimes even the quote. Twitter and Facebook would both be improved by disabling it. Hell, it might even stop people from thinking they need a hero image for their 2 paragraph medium shitpost.

input_sh · on Nov 5, 2020

I'd place that blame towards website owners. Both Facebook and Twitter are pretty open where they read that info from, and an owner can pretty easily pass those fields (it's just some <meta> tags in the <head> element).

They also have their own validators: https://cards-dev.twitter.com/validator and https://developers.facebook.com/tools/debug/

The only issue I'm aware of is that Facebook's crawler breaks about every two months or so.
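For a sense of how little is involved, here is a sketch that renders the usual handful of tags from a page's own data. The `og:*` properties and `twitter:card` are the commonly used ones; the function name `render_preview_tags` and the choice of `summary_large_image` are just for this example, and each platform's docs are the authority on what it actually reads.

```python
from html import escape


def render_preview_tags(title, description, image_url, page_url):
    """Return <meta> lines for the <head>: Open Graph tags plus a Twitter card type."""
    tags = [
        ("property", "og:title", title),
        ("property", "og:description", description),
        ("property", "og:image", image_url),
        ("property", "og:url", page_url),
        ("name", "twitter:card", "summary_large_image"),
    ]
    return "\n".join(
        f'<meta {attr}="{key}" content="{escape(value, quote=True)}">'
        for attr, key, value in tags
    )
```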

seniorgarcia · on Nov 5, 2020

What meta tags do I have to fill and why is Twitter's/FB's preview suddenly my problem?

>https://developer.twitter.com/en/docs/twitter-for-websites/c...

So, I should have to include twitter specific meta tags even though I personally don't care about twitter? Maybe twitter should make it clear which tags they read? Maybe it's SEO bullshit I don't care about? Maybe even the OG: tags don't work all the time and result in dumb previews?

thatfunkymunki · on Nov 6, 2020

If you don't want to fill them out, don't... Filling them out lets you customize your link preview on twitter. If you don't care about Twitter, why would this affect you at all?

67868018 · on Nov 7, 2020

They're used by instant messengers too: slack, iMessage, WhatsApp, telegram, signal, ...

einpoklum · on Nov 5, 2020

> it's much harder to use

Actually it's easier to use, in that the preview doesn't take up screen real estate. Perhaps you mean the experience is less pleasant?

seniorgarcia · on Nov 5, 2020

>They're sending you traffic.

Irrelevant traffic for every metric I care about.

>Imagine Twitter or Facebook without link preview

That's exactly what I'm saying. Either I care about what that person thinks might interest me or I don't. The link preview abstract is shit anyway. Does the site title and the 2 sentence abstract really sway you? If someone wants to send traffic my way, writing an interesting abstract is not too much to ask.

>it's much harder to use and overall reduces the chance I'll click on a link

Maybe you should re-evaluate who you follow on twitter? I frankly couldn't care less about facebook.

>Do you think only Twitter and Facebook should be allowed to publish previews?

I think previews are worthless regardless, I thought I made that clear. Either you care about me linking it to you or you do not.

EDIT: And just for fun, here is the link preview stuff from my latest skype call with my brother: https://imgur.com/a/yO5OP36

Look at all the value those previews added.

cblconfederate · on Nov 5, 2020

when you paste a link on reddit and it autocompletes the title

update a bookmark title, or check if it exists.

is it not self-evident that a link being crawlable is useful?

seniorgarcia · on Nov 5, 2020

>when you paste a link on reddit and it autocompletes the title

Oh no, you have to copy/paste the title?

>update a bookmark title, or check if it exists.

I can access the site without a captcha, my browser can fetch the title.

>is it not self-evident that a link being crawlable is useful?

No, it is not. Maybe a site owner does not want crawlers to index the site?

Me being able to access the title and any html meta tags is not the same as some crawler being able to access it. It seems like your beef is with cloudflare and that is fine but please state that that is your issue and don't try to frame it as something else. What I don't get is how everybody places the blame at Cloudflare's feet. It is my choice as a host to use cloudflare and to use their protection features.

cblconfederate · on Nov 6, 2020

i m not sure if you re being serious

CF is so widespread that it breaks a significant part of the web for simple things like getting the page title. That's all. The End.

seniorgarcia · on Nov 6, 2020

'I' can get the page title though. That's all. The End. I don't care about your crawler. Or your ability to post the link to my site to twitter/fb and if I did maybe I'd revise my cloudflare settings.

skybrian · on Nov 5, 2020

Is this the case for any web crawler?

cblconfederate · on Nov 5, 2020

not sure but it s a very common problem: https://www.google.com/search?q=cloudflare+attention+require...