I don't understand why Yahoo, or whoever is orchestrating this shutdown, is so hostile to the archival process. People are offering their own time and storage space. How hard would it be for whoever is working on the shutdown to just let people easily download all the raw data, instead of making them scrape it manually? Wouldn't the latter be much worse for both parties?
This is data going back decades, there has to be someone in power somewhere on that team that realizes the value of this data? We're not asking them to keep it up, just to make it easy to archive.
Also, serving content for scraping would use more resources on Yahoo's side than providing a single downloadable tar or equivalent.
But as others have said, this no doubt comes back to something the legal team said could not happen. RIP
sometimes people want to let things die and for good reason. i can only imagine what a burden it has been to police something like a newsgroup server over the years. i can also imagine that they haven't caught everything, so they just want it to go away.
remember that not everything is precious and needs to be preserved.
I completely agree, but the way they handled this was appalling:
(a) The "We're deleting all your content" message was only visible on group home pages, not when viewing messages, meaning a lot of the people who used the web-interface didn't know about the shutdown.
(b) The email sent out to all members about the deletion was sent weeks after the initial notice, often didn't arrive, and was inaccurate (dates had changed by the time people received the email). So many folks that used the email interface also didn't know about the deletion.
(c) The amount of time given was far too short. Yahoo Groups had existed as a brand since January 2001, and was based on Yahoo Clubs / eGroups, both of which had existed for several years before that. Two months' notice for a 20+ year old service is laughable.
(d) The site has been unreliable during the last few months, due to the high traffic of people trying to save their stuff. This has significantly complicated efforts to archive it.
(e) Yahoo's solution to the data destruction, their "Get My Data" feature, is poorly thought out and often doesn't work. No photos are included in the download unless they were uploaded by that account. Many groups had members who uploaded photos / scans / schematics / etc. and have since left the group. Those photos are now lost. We've also found the downloads to be missing content, which meant the request had to be made multiple times until you could patch together a complete version from all the incomplete ones.
Some don't want their old posts archived for eternity, for privacy reasons.
Yahoo had a nice interface to allow everyone that was a member of any group to download everything for their personal and private use, and it worked fine. Random external parties republishing other people's old posts, though, is not OK at all; it is a massive breach of privacy and a violation.
Because of the threat that one's opinions may be archived and forevermore propagated by random third parties, at this point in history one would have to be insane to express a genuine opinion publicly. Therefore these efforts have a chilling effect on free speech and are inimical to democracy.
If Yahoo Groups wasn't going down your posts would be forever accessible anyways. What does it matter if they are archived by first party or third party?
Mostly because you've entered into an agreement with the first party (however dubious, that's another argument) and have consented to their terms to use their platform. There wasn't a checkbox that said "I consent to third parties storing all of my information after we decide to shut down the platform".
Consent is paramount to privacy, and archiving everything without consent is breaking a huge privacy and trust boundary.
Going into this, we did understand that this could be a concern. We are committed to working with group moderators as we work to make what we've saved available, and will be honoring valid take-down requests for content that people would prefer not to be online.
Would you mind clarifying your objection for me? Don't authors of posts assume the risk that third parties will copy and reuse the content when they publish it?
Beginning March 2nd, 2020 the Mailing Lists functionality on RootsWeb will be discontinued. Users will no longer be able to send outgoing emails or accept incoming emails. Additionally, administration tools will no longer be available to list administrators and mailing lists will be put into an archival state.
Administrators may save the email addresses in their list prior to March 2nd. After that, mailing list archives will remain available and searchable on RootsWeb.
As an alternative to RootsWeb Mailing Lists, Ancestry message boards are a great option to network with others in the genealogy community. Message boards are available for free with an Ancestry registered account.
Thank you for being part of the RootsWeb family and contributing to this community.
There's also the issue that archives may seemingly be up, but not at all complete.
For example, many of the mid-90s posts about Jim Bell's "Assassination Politics" proposal are missing from public archives of the cypherpunks list. The explanation there seems to be proactive evidence destruction that occurred as news about the ongoing federal investigation spread.
Also, it seems like some old Usenet stuff has disappeared from Google Groups. There's stuff that I archived years ago that's no longer findable.
>Also, it seems like some old Usenet stuff has disappeared from Google Groups. There's stuff that I archived years ago that's no longer findable.
>Anyone else noticed that?
Yeah, like everyone who even casually uses usenet realized that Google Groups search has been broken for some years now. Somewhere in one of the G+ refactors they broke like 60% of the search, and nobody at Google seems to care.
Google Groups at this point mostly exists as a frontend for clueless people to stumble on and bump decade-old threads that active subscribers often can't even pull. It's honestly worse than the spam problem used to be.
I guess that it's been that many years since I used Usenet.
If stuff is really gone, that would suck. Because I've read that Google has the only comprehensive Usenet archive, acquired with Deja News, supposedly going back to 1982.
It would. Though it's probably not. It's more likely they have their search tuned not to burn extra cycles looking for archived shit not in RAM like the web archive does.
That's my guess at least, there have been HN threads in the past where people run obscure searches and it takes some time for Google to crunch it.
I found myself wondering, more with an eye to future efforts, about the writing already being on the wall for Yahoo Groups years ago (e.g. the multi-day outage in 2017, IIRC). Why wait until the official shutdown notification before getting a scrape of the public groups? As someone who still had a modicum of presence on the site, I had watched it grow increasingly prone to random errors, and I even wondered whether one day it would go down and refuse to come back up no matter how much effort Yahoo put forth attempting to revive it. That could've easily been an even worse "Yahoo-geddon."
(Not that I expect the effort Yahoo would put forth would amount to much, since in the now-defunct Yahoo help forums a Yahoo team member revealed Groups had been "deprioritized" a few years back, i.e. no official support, no help for groups with deceased mods, etc.)
> Why wait until the official shutdown notification before getting a scrape of the public groups?
Archive Team does have projects to scrape some sites that are looking like they might not survive the long term (e.g. right now there's a LiveJournal project) but they're all volunteers and I think they just don't have the time to do everything all the time.
A member of ArchiveTeam did a fairly comprehensive crawl of the messages of public groups between 2015 and 2018. And there had been discussion about how to archive Yahoo Groups for a few years prior to that as well.
The reality is that in order to archive a site as large as Yahoo Groups, you need a critical mass of people willing to commit time and energy to the project. And people are more likely to commit to an "urgent" project, where there is a set deadline.
Contrary to the title, the post doesn't contain anything specific about the progress of archiving. Like how much of the content is saved and how much isn't.
That's because it's not 100% clear, even to us. Think of it a bit like a bunch of people pulling stuff out of a burning building. [0] We know there's more to be saved, because we can see more boxes in the building, but we haven't taken the time to see what's in the boxes we already have.
Yahoo aren't making this easier, either. (Blocking our efforts, unreliable servers, limiting what can be downloaded, providing incomplete archives, etc...)
Once the dust has settled we'll be able to start going through what was saved. There have been a lot of really interesting aspects to this project that I hadn't encountered before, and I hope to document them when I have the time.
OP here. Ok, thank you, this is the kind of criticism I crave!
I suppose by "vague" you mean there weren't any hard numbers.
Well, I kind of linked the Archive Team tracker, but you had to do some clicking to find it, and unfortunately I didn't have a statistic on the SYG's progress at the time of posting. However, since posting, I do have a rough idea of things:
Archive Team's own Tracker reports 2.76 TB of data to have been saved. The SYG team hasn't fully tallied up their data yet, but have counted the number of groups they've retrieved and/or are retrieving from to be around 123K.
I have since inserted this data into the post and would like to thank you for your brutal honesty :)
I don't want to say something I can't substantiate, however in 2013 Yahoo rolled out the "Neo" interface to Groups that was vastly unpopular with existing users [0], and many groups chose that point to switch away.