It is reasonable. It is also likely that whatever meta information reddit is sending back (in headers or tags) is not dated correctly for the time of the original post.
Google COULD offer more time-machine features and perform diffing on pages. But a reddit "page" will always have content changes, since everything is generated from a database and kept fresh on the page. The ONLY metric Google could therefore use would be whatever meta tag or header reddit provides.
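For what it's worth, many sites (reddit included) expose an Open Graph `article:published_time` meta tag on post pages, which is exactly the kind of site-provided signal meant here. A minimal sketch of extracting it with Python's stdlib parser (the HTML snippet is invented for illustration):

```python
from html.parser import HTMLParser

class PublishedTimeParser(HTMLParser):
    """Pull the Open Graph article:published_time value out of a page."""
    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("property") == "article:published_time":
            self.published = a.get("content")

# Made-up page head; real reddit markup is more complex but carries the same tag.
html = ('<html><head>'
        '<meta property="article:published_time" content="2019-06-01T12:00:00Z">'
        '</head></html>')

parser = PublishedTimeParser()
parser.feed(html)
# parser.published now holds the site-claimed publication date (if any).
```

Of course, this only works if the site sets the tag honestly, which is the crux of the complaint.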
That doesn't explain why Google lists the old search results as being from this month, while Duck correctly lists them as being from years past.
Google does cache results, and could, by comparison with its cache, notice changes and claim the page was updated sometime in between.
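That cache comparison could be as simple as diffing the stored copy against a fresh fetch. A rough sketch with Python's `difflib` (the page text is invented):

```python
import difflib

# Hypothetical cached snapshot vs. a fresh crawl of the same page.
cached = "Post title\nFirst comment"
fresh = "Post title\nFirst comment\nSecond comment"

changed = cached != fresh  # cheap "was it updated since the last crawl?" check

# A line-level diff shows *what* changed between the two crawls.
diff = list(difflib.unified_diff(cached.splitlines(),
                                 fresh.splitlines(),
                                 lineterm=""))
```

All this tells you is that the page changed between two crawl timestamps, not when the underlying content was created, which is why it only supports an "updated sometime between" claim.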
I've always wondered how search engines get hold of timestamps. Locally with a cached sample, like I explained above? By parsing a page's content or some metadata? It's not like the HTTP protocol sends me a "file created/last modified" date along with the payload, does it?
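HTTP actually does define a `Last-Modified` response header, though dynamically generated pages often omit it or stamp it with the time of the request, which is exactly the problem being discussed. A sketch of parsing it with the stdlib (the header value here is made up):

```python
from email.utils import parsedate_to_datetime

# Hypothetical response headers from a crawl. Static files usually carry
# Last-Modified; database-generated pages like reddit's often don't, or
# set it to "now", making it useless as a content date.
headers = {"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}

def extract_timestamp(headers):
    """Return the Last-Modified header as a datetime, or None if absent."""
    value = headers.get("Last-Modified")
    return parsedate_to_datetime(value) if value else None

ts = extract_timestamp(headers)
```

So the header exists in the protocol; the catch is that nothing forces a dynamic site to populate it truthfully.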
> It is reasonable. It is also likely that whatever meta information reddit is sending back (in headers or tags) is not dated correctly for the time of the original post.
That could explain the first screenshot, but definitely not the second, where Google has it tagged as years old.
Seems like an easy solution to this problem would be to use two functions.
The first function takes the output of the page and renders it so that only what's user-visible actually gets indexed. So no headers, no JSON data, nothing, unless it actually appears in the final rendered page. This would require jsdom or some other DOM implementation; hardly hard for Google (Chrome) to achieve, and it's been done multiple times.
The second function makes the same call twice, passing the page to the first function each time, then compares the two. If you make two calls right after each other and some data differs, you discard that data from your search index. You only index data that appears in both calls.
Now you don't have the issue of "dynamic content" anymore...
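The two-call idea above can be sketched in a few lines. In practice the inputs would come from a real DOM renderer (jsdom or headless Chrome, as suggested); here the sketch just compares two pre-rendered text snapshots line by line (sample strings invented):

```python
def stable_lines(render_a: str, render_b: str) -> list[str]:
    """Keep only lines that appear in both renders, preserving order.

    Anything that differs between two back-to-back fetches is treated
    as dynamic (scores, ads, timestamps) and excluded from the index.
    """
    b_lines = set(render_b.splitlines())
    return [line for line in render_a.splitlines() if line in b_lines]

# Two renders of the same page, fetched seconds apart.
first = "Post title\n5 points - 3 comments\nGreat article!"
second = "Post title\n6 points - 3 comments\nGreat article!"

# The fluctuating score line drops out; only stable content is indexed.
indexable = stable_lines(first, second)
```

A real crawler would compare at the DOM-node level rather than raw lines, but the filtering principle is the same.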
Typically, dynamic content doesn't change from second to second; it changes after 5 minutes, an hour, or a day, and it is extremely site-specific, too.
But I do like your idea.
To go a bit further on your idea: you could apply machine learning to analyse the changes. For example, ML could determine what is probably the "content area" of the page by building out a NN for each website that self-expires its training data after about a month (to account for redesigns over time).
The major problem will still be ads in the middle of the content, especially odd scroll-design ads that show a different "picture" at each scroll position, as well as video ads that are likely to differ in each screenshot.
Another form of ad is the "linked" words, like when random words in a paragraph become links to shitty websites that define the word but show a bunch of other ads.
I suppose Google could simply install uBlock in its training-data collector harness to help with that stuff. :)