Hacker News

I've run sites that have a lot of pages where 80%+ of the traffic is web crawlers.

Google sends some traffic so I can afford to let them scrape. Bing crawls almost as much as Google but sends 5% as much traffic. Baidu crawls more than Google and never sends a single visitor.

I hate reinforcing the Google monopoly, but a crawler that doesn't send any traffic is expensive to serve.
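For crawlers that cost more than they return, the standard lever is robots.txt. A sketch of what that might look like here: disallow Baidu's crawler outright and slow Bing down (Baiduspider and bingbot are the bots' published user-agent tokens; whether a given bot actually honors these directives is up to the bot, and Crawl-delay in particular is respected by Bing but ignored by Google):

```text
# robots.txt — refuse crawlers that send no visitors,
# throttle ones that crawl far more than they refer.
User-agent: Baiduspider
Disallow: /

User-agent: bingbot
Crawl-delay: 10

User-agent: *
Allow: /
```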



You might want to ask yourself, or your readers, what it is people are trying to access on your site that they cannot by other means.

The interfaces for many sites actively and with brutal effectiveness deny ready access to any content not currently featured on the homepage or stream feed. Search features are frequently nonexistent, crippled, or dysfunctional.

Last week I found myself stumbling across a now-archived radio programme on a website which afforded exceedingly poor access to the content. The show ran weekly for over a decade, with 531 episodes. Those are visible, ten at a time, through either the Archive or Search features.

Scraping the site gives me the full archive listing, all 11 years, in a single webpage that loads in under a second. I can search that by date, title, or guest to find episodes of interest.
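The pattern above, walk the paginated archive once, hold the listing locally, filter in memory, can be sketched with nothing but the standard library. The page structure here (episode links with a `class="episode"` attribute and a `data-date`) is hypothetical; the real site would need its own parser:

```python
from html.parser import HTMLParser

class EpisodeParser(HTMLParser):
    """Collect (date, title) pairs from anchors in a hypothetical
    archive page whose episode links carry class="episode"."""
    def __init__(self):
        super().__init__()
        self.episodes = []
        self._in_link = False
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "episode":
            self._in_link = True
            self._current = {"date": attrs.get("data-date", ""), "title": ""}

    def handle_data(self, data):
        if self._in_link:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            self.episodes.append(self._current)

def search(episodes, term):
    """Case-insensitive search over the locally held archive listing."""
    term = term.lower()
    return [e for e in episodes
            if term in e["title"].lower() or term in e["date"]]

# One page's worth of (hypothetical) markup; in practice you'd feed
# the parser each of the ~54 paginated archive pages once.
page = '<a class="episode" data-date="2008-03-14">Interview with a luthier</a>'
parser = EpisodeParser()
parser.feed(page)
hits = search(parser.episodes, "luthier")
```

Once the listing exists locally, every subsequent search is a list comprehension rather than a round-trip to a crippled Search feature.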

The utility of this (a few hours work on my part) is much higher than that of the site itself.

Often current web sites / apps are little more than wrappers around a JSON delivery-and-parsing engine. Dumping the raw JSON can be much more useful for a technical user. (Reddit, Diaspora, Mastodon, and Ello are several sites in which this is at least somewhat possible.)
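Reddit is the concrete case: appending `.json` to most page URLs returns the underlying JSON document. The URL rewrite below is that documented convention; actually fetching the result is left as a comment because Reddit expects a descriptive User-Agent header:

```python
from urllib.parse import urlsplit, urlunsplit

def to_json_url(url: str) -> str:
    """Rewrite a Reddit page URL into its raw-JSON equivalent
    by appending '.json' to the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") + ".json"
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))

url = to_json_url("https://www.reddit.com/r/programming/")
# Fetch with urllib.request or similar, sending a descriptive
# User-Agent header, and you get the page's data without the wrapper.
```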

Much of the suck is imposed by monetisation schemes. One project of mine, decrufting the Washington Post's website, resulted in article pages with two percent of the weight of the originally-delivered payload. The de-cruftified version strips not only extraneous JS and CSS, but nags and teasers which are really nothing but distraction to me. Again, that's typical.
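The core of that kind of decrufting is simple: drop `<script>` and `<style>` elements (and everything inside them) and keep the rest of the markup. A minimal sketch, not the actual project, which would also target site-specific nag and teaser selectors:

```python
from html.parser import HTMLParser

class Decrufter(HTMLParser):
    """Strip <script>/<style> blocks, including their contents,
    from an HTML page, passing the remaining markup through."""
    STRIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._depth = 0  # nesting depth inside stripped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.STRIP:
            self._depth += 1
        elif self._depth == 0:
            attr_text = "".join(f' {k}="{v}"' for k, v in attrs)
            self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in self.STRIP:
            self._depth = max(0, self._depth - 1)
        elif self._depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._depth == 0:
            self.out.append(data)

def decruft(html: str) -> str:
    d = Decrufter()
    d.feed(html)
    return "".join(d.out)

page = ("<p>Story text</p>"
        "<script>trackEverything()</script><style>.nag{}</style>")
lean = decruft(page)
```

On a real article page the payload savings come mostly from the fetched assets those stripped tags would have pulled in, not from the tags themselves.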

I'm aware that many scrapers are not benign. More than you might think are, and the fact that casual scraping is a problem for your delivery system reflects more poorly on it than them.


Ad-junk, sidebar-junk, JavaScript-junk, popup-window-junk, and the outlook that the most precious resource in the world is a few seconds of your distracted attention are what motivate the Washington Post and most of the commercial web.

What you are doing, stripping out the junk, threatens those organizations at their core.



On mobile, ads, trackers, and all that crap cost the consumer more than the ads make.

If mobile phone companies kicked back a fraction of the revenue they take in to content creators, those creators would be better paid than they are now, and Verizon would get the love it has sought in vain. (Who would say a bad word about the phone company then?)


That's my argument.

Global ad spend, which mostly accrues to the wealthiest 1 billion or so, is about $600 billion. Some complex maths tells us that's $600 per person in the industrialised countries (G-20 / OECD, close enough). Global content spend is somewhere around $100--200/year per capita. That's roughly the annual online ad spend.

Bundled into network provisioning, that's about $30--40 per household per month, all-you-can-eat. Information as a public good.
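Making the back-of-envelope explicit (the household size of ~2.5 is my assumption, not a figure from the comment):

```python
# Global ad spend spread over the wealthiest billion people.
ad_spend = 600e9                 # annual global ad spend, USD
population = 1e9                 # roughly who it accrues to
per_person = ad_spend / population          # $600/person/year

# Per-capita content spend, bundled into a household network bill.
content_spend = 150              # midpoint of the $100-200/year range
household = 2.5                  # assumed persons per household
per_household_month = content_spend * household / 12
```

At those assumptions the bundle lands at roughly $31 per household per month, inside the $30--40 range above.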

(My preference is for higher rates in more affluent areas, ideally by income.)

Trying to figure out WCPGW.


My personal model (emerging; there is a manifesto, but I am rewriting it as we speak) is to rigorously control costs, focus on quality, and stay small.

Think of the old phone company slogan "reach out and touch someone." If I can accomplish that and spend less than I do on food or clothes or my car then I win.


I'd be interested in seeing what you're developing.

The challenge, as I see it, is that information is a public good (in the economic sense: nonrivalrous, nonexcludable, zero marginal cost, high fixed costs), and provision at scale requires either a complementary rents service (advertising, patronage, propaganda, a fancy professional-services "shingle") or a tax. Busking, or its public-broadcasting equivalent, is another option, though that's highly lossy.

Any truthful publishing also requires a strong self-defence mechanism (protection against lawsuits, coercion, intimidation, protection rackets, etc.), a frequently underappreciated role played by publishers.

Charles Perrow's description of the music industry (recorded and broadcast) circa 1945--1985 is informative here (see his Complex Organizations https://www.worldcat.org/title/complex-organizations-a-criti...), notably the roles of publishers vs. front-line and studio musicians.


Not sure why you are attacking a poster specifically talking about Google, Bing and Baidu doing massive scraping. What you are talking about is something entirely different.


I don't feel attacked. I also don't blame him for being inflamed about the problem, because I am inflamed about it too!


Fortunately I think we both managed to realise that before too many rounds of this ;-)


Bing powers other search engines like DuckDuckGo, Ecosia, and Yahoo!. But I’m sure that even cumulatively the numbers are still small.





