Interview with Viktor Lofgren from Marginalia Search (nlnet.nl)
100 points by luu on Nov 30, 2023 | 21 comments


> I can for example look at the HTML of a document and if it has too many ads or if it has too many tracking elements I can downrank the website for example. Or enable a user to have a check box to say I don’t want to many ads. I prefer content that does not have ads, for example. It is hard to get it perfectly right, but even to remove 75% of the ads that’s still a huge improvement.

That's a pretty big opening for a search engine that Google basically cannot fight since it is against their core interests.
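
To make the quoted mechanism concrete, here is a rough sketch of what an ad/tracker-density downranking signal could look like. This is purely illustrative: the tracker hints and the penalty weight are invented, and it is not Marginalia's actual implementation.

    # Illustrative sketch of an ad/tracker-density ranking penalty.
    # The tracker domain list and the 0.2 weight are made-up examples,
    # not Marginalia's actual signal.
    from html.parser import HTMLParser

    TRACKER_HINTS = ("doubleclick.net", "googlesyndication.com", "adsbygoogle")

    class AdCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.ad_elements = 0

        def handle_starttag(self, tag, attrs):
            attr_text = " ".join(value or "" for _, value in attrs)
            if tag in ("script", "iframe") and any(h in attr_text for h in TRACKER_HINTS):
                self.ad_elements += 1

    def ad_penalty(html: str) -> float:
        """Score multiplier in (0, 1]; more ad/tracker elements means a lower rank."""
        counter = AdCounter()
        counter.feed(html)
        return 1.0 / (1.0 + 0.2 * counter.ad_elements)

    # A page with two tracker elements gets multiplied down to about 0.71.
    sample = ('<script src="https://x.doubleclick.net/a.js"></script>'
              '<iframe src="https://googlesyndication.com/ad"></iframe>')
    print(ad_penalty(sample))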


> Ten, fifteen years ago you needed a large budget to be able to play in this space. You needed to demonstrate that you were going to be making a profit. Because nobody is going to throw tens of millions of dollars at something just for fun. But now we are at the point where regular human beings can dabble in this space.

Great quote. Enabling human-scale experimentation supercharges creativity and discovery.


"We don't have to have a Google and a Twitter and a Facebook."

Many years ago, before Facebook or Twitter existed, I recall being promised that a project called "Nutch" would allow web users to crawl the web themselves.

Perhaps that promise is similar to the promises being made about "AI" today.

The project did not turn out to be used in the way it was predicted (marketed), or even used by web users at all.


> ... a project called "Nutch" would allow web users to crawl the web themselves. Perhaps that promise is similar to the promises being made about "AI" today. The project did not turn out to be used in the way it was predicted (marketed), or even used by web users at all.

Actually Nutch is used to produce the Common Crawl[0] and 60% of GPT-3's training data was Common Crawl[1], so in a way it is being used by a lot of web users today.

I did very briefly look into the possibility of self-hosting a search based on the Common Crawl data, but the latest snapshot is 390 TB, and given that the largest consumer drives are around 10-14 TB, that would be a lot of drives and a lot of money. Plus, by all accounts, most of Common Crawl is poor-quality, spammy content, which OpenAI, Google etc. try to filter out before using it to train their LLMs. That's why I don't think the solution is a search engine for the whole internet any more, but one which just indexes the relatively small minority of good bits.
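
A quick back-of-the-envelope check of the storage problem, using the sizes mentioned above (ignoring redundancy and filesystem overhead):

    # Drives needed to hold one 390 TB Common Crawl snapshot on
    # 10 TB or 14 TB consumer drives (no redundancy or overhead).
    import math

    crawl_tb = 390
    for drive_tb in (10, 14):
        print(f"{drive_tb} TB drives: {math.ceil(crawl_tb / drive_tb)}")
    # 10 TB drives: 39
    # 14 TB drives: 28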

[0] https://commoncrawl.org/

[1] https://en.wikipedia.org/wiki/GPT-3


Common Crawl's data is a bit inflated due to how it's stored. WARC is the opposite of a compression format ;-)

I think you are correct in that one of the primary problems faced in search is selecting interesting documents. I think I throw out like 75% of what I crawl because it's not very likely to ever be a good search result. You can probably filter it even more aggressively.

It's a smart avenue to approach search not in the sense of "how can I make this computer cluster big enough to fit the data" but "how can I make this data small enough to fit my computer".


Think like a www user. Scale down, not up.

For me, the best improvements that have occurred have been XML sitemaps. When paired with HTTP/1.1 pipelining I can peruse and download all the pages that I want from a site in a single TCP connection. No CSS, Javascript, images, etc. (Won't find that cruft in Common Crawl either.) This is not "crawling", it's bulk text retrieval. It's starting with a list of URLs from a sitemap XML file and pipelining a series of pages, usually MIME type text/html, in sequential order, into a compressed archive. Like the KC and the Sunshine Band song, that's the way I like it.
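
As a rough sketch of that kind of bulk text retrieval over a single TCP connection (using sequential keep-alive requests via Python's http.client rather than true HTTP/1.1 pipelining; the host and sitemap path are placeholders):

    # Bulk text retrieval sketch: read an XML sitemap, then fetch every
    # listed page over one keep-alive connection into a gzip archive.
    # Host and sitemap path are placeholders; this issues sequential
    # requests, not true HTTP/1.1 pipelining.
    import gzip
    import http.client
    import xml.etree.ElementTree as ET

    HOST = "example.com"            # placeholder host
    SITEMAP_PATH = "/sitemap.xml"   # placeholder sitemap location
    LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"
    HEADERS = {"User-Agent": "bulk-fetch-sketch"}

    conn = http.client.HTTPSConnection(HOST)

    # Fetch and parse the sitemap to get the list of page URLs.
    conn.request("GET", SITEMAP_PATH, headers=HEADERS)
    root = ET.fromstring(conn.getresponse().read())
    paths = [url.text.split(HOST, 1)[1] for url in root.iter(LOC)]

    # Retrieve each page over the same connection; no CSS/JS/images.
    with gzip.open("pages.txt.gz", "wb") as archive:
        for path in paths:
            conn.request("GET", path, headers=HEADERS)
            archive.write(conn.getresponse().read())

    conn.close()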

Almost nothing on the www is universal, and neither sitemaps nor pipelining are exceptions. Some sites do not even have a robots.txt. (One reason I like HTTP headers is that they are arguably one of the only universals on the www. All sites have them. Not much variation, either. If "Big Tech" has its way, who knows, that may change too. HTTP/2 and HTTP/3 seem to give short shrift to HTTP headers. If I'm wrong about that, then that's good to hear.)

Web search engines are opinionated. They are created by people who must make decisions about what to include and what to exclude, what to favour and what to disfavour. Obviously, Google's opinion is influenced by advertisers, and it hides behind a veil of secrecy as it stacks the deck in its own favour; the company completely abandoned the "search engine in the academic realm" idea Page and Brin wrote about in 1998. It's commercial, but in an insidious, passive-aggressive, "haha no one will notice" sort of way. Marginalia.nu (or wiby.me) might be one of the few non-commercial projects.

Anyway, AFAIK, downloading sitemap XML and then HTML pages in bulk does not give so-called "tech" companies much data to use for advertising or other commercial purposes. It's not like sending one's www search queries to a "search engine".


speaking as a sysadmin: random aggressive crawlers are annoying and usually get blocked/filtered out. I think the best solution would be some sort of selfcrawler which uploads your site to a search provider. uploaders have to pay and bad uploads/seo-spam gets punished somehow (thus seo-abuse becomes expensive)


>I think the best solution would be some sort of selfcrawler which uploads your site to a search provider. uploaders have to pay and bad uploads/seo-spam gets punished somehow (thus seo-abuse becomes expensive)

I had the same idea; website owners and webmasters could crawl their own sites and work together with search engines to get a revenue share from ads (a sort of licensing deal, similar to the YouTube partner revenue-share program). But the problem is that webmasters cannot efficiently crawl their own sites, because they do not have the money to do it every day like Google does. Maybe we could live with a one-month-old index for a search engine: somewhat outdated and not ideal, but still useful for some casual users.

This idea would be better for a commercial web archiving service where website owners and webmasters crawl/archive their own website and license it to the web archiving service.


I listened to Viktor's podcast with this organization, I think a month ago, but since then I've had a bad case of COVID so I still feel dizzy... but my take is, he might be right.

The biggest problem with the web today is perhaps the search index; if we want to fight the problems of SEO spam, fraud and scams on the web, we need an open, transparent crawl index so we get a better understanding of what websites exist and what they are doing. On top of that, an open web index would bring the opportunity of better discovery of smaller, unknown websites too.

I'm not that optimistic that we can compete with Google and Bing, because with their enormous computing resources they operate at gigantic scale, but we can start with a smaller index first and then try to scale up.


I wonder if the arrival of ChatGPT has sort of taken the wind out of Marginalia's sails as a human-directed search engine. It seems likely that the future of answering one's questions using the internet is an LLM giving a straightforward, concise answer that is free of the quirks of a human author. Therefore, there is less motivation to search for personal websites and read website makers' own writing.

For example, imagine an old-school hobbyist website that contains information about some obscure band or author that can't be found elsewhere on the web, and doesn't readily show up in a Google search. Yet at the same time the author writes terrible prose, uses annoying HTML/CSS, or goes into tiresome political rants, etc. Instead of using a Marginalia-like search engine to discover those sites and read them directly, wouldn't it be a superior experience to have an LLM gorge on all those sites and then tell you just the facts that you care about?


Based on my experience with Large Language Models (LLMs) so far, I see more value in the niche sites that Marginalia will help the user find.

An LLM is regurgitating what it reads, true or false. There's a bias toward things it sees more often. There's randomness in the response. Sure, humans are error-prone, but a problem with computer-generated responses is that humans tend to think "a computer did this so it must be right." And LLMs are not pumping out true, thoughtful answers but simply putting a string of probable words together.

The internet is vast and is increasingly filled with garbage, stolen and duplicated data, and monetization. It's nice to be able to look in the nooks and crannies of the internet for data.

I'm actually more interested in sites that do personal curation of interesting links, vetted and of genuine interest to someone, than in what's found through Google or ChatGPT.


I do agree that the question-answering search engine is probably headed the way of the dodo, but on the other hand, no search engine ever did question-answering particularly well in the first place (even beyond that, such projects are deeply problematic from an epistemological point of view).

LLMs aren't search engines though. They offer a new way of interacting with information, but they need to be provided with the appropriate context to answer questions well (e.g. as in RAG), which is where a traditional search engine still has a role even in an LLM maximalist future.
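
A toy sketch of that retrieval-augmented pattern, where a search step supplies the context before the model is asked anything. Both search_index and call_llm here are stand-in stubs, not any real engine's or model's API.

    # Toy retrieval-augmented generation (RAG) loop: retrieve first,
    # then ask the model to answer only from the retrieved context.
    # search_index and call_llm are placeholder stubs, not real APIs.
    def search_index(query: str, k: int = 3) -> list[str]:
        """Stand-in for a traditional search engine returning top-k snippets."""
        corpus = [
            "Marginalia Search focuses on small, independent websites.",
            "Common Crawl is a public corpus of crawled web pages.",
            "WARC files store raw HTTP requests and responses.",
        ]
        terms = query.lower().split()
        return [doc for doc in corpus if any(t in doc.lower() for t in terms)][:k]

    def call_llm(prompt: str) -> str:
        """Stand-in for an LLM call; a real system would query a model here."""
        return "(model answer grounded in the prompt)\n" + prompt

    def answer(query: str) -> str:
        context = "\n".join("- " + s for s in search_index(query))
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
        return call_llm(prompt)

    print(answer("what does marginalia search focus on"))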

Though I'd note there are other aspects of the web that are criminally underserved these days. The itch I'm trying to scratch with Marginalia Search is just letting humans with websites find each other.


> ChatGPT has sort of taken the wind out of Marginalia's sails as a human-directed search engine.

Not at all. ChatGPT (I've never used it) seems to me to be about answering questions or generating some text.

But for me the real pleasure and enjoyment of the web in general is reading essays and posts with a person's individual voice, and organically exploring topics, skipping from one website to the next. In that regard, Marginalia is fast becoming for me an essential part of the web experience, as much as Wikipedia or the Internet Archive is.


> Not at all. ChatGPT (I've never used it) seems to me to be about answering questions or generating some text.

You have never used ChatGPT, but you are convinced it hasn't taken the wind out of Marginalia's sails... I would suggest you try it so you can see that perspective as well. I get that your use case might be different, but I don't think we know if that is the main use case of most users.


Is it functionally any different to Bard? Because I've used that.

If so, my reasoning remains entirely unchanged.


> that is free of the quirks of a human author

Uh, is that supposed to be a bad thing? Humans don't want to read uncanny-valley output devoid of personality and subjectivity. Even a Wikipedia article tends to have a small amount of flavor in the prose.

LLM output is currently designed to be entirely without personality, because its developers are well aware of how disturbing it would be if the AI tried to sound like a person. But human people generally want to read and watch and talk to other people, genuine people--which is the problem new-old-web projects like Marginalia are trying to solve, what with most of the Internet nowadays being so full of SEO junk and for-profit websites and corporate-curated walled gardens... and now machine-generated "AI" content.

The Internet can be (used to be) much more than just machine-generated summaries of raw data, and content that's just there to sell you things you don't need.

(Also, if everyone just asks an AI instead of searching the Web, why would anyone bother posting new content for those AIs to digest? No human would ever visit their site...)


> free of the quirks of a human author

Sometimes those quirks are what matters.


The pendulum always swings.

Ten or fifteen years from now, there will be a widespread “rediscovery” of the human-made web by those who grew up with LLMs.


If that were true, they would be well advised to get out; nobody's going to ride out a fifteen year cycle.


LLMs have no concept of facts. They can hide this to an extent with alignment when it comes to sacred platitudes, but they certainly can't tell you if a singleton website with archaic HTML/CSS is accurate or not.


No



