The problem with data.gov and sites like it is that they are built on faulty premises about data:
1. Fiction: Data doesn't require lots of work to make it usable, so we can just upload whatever we have and it will be useful to somebody. Fact: the big usable datasets (census, IPUMS, NLSY, all the private marketing datasets) have armies of people cleaning and integrating them. It costs money, it takes time, and it is easy to screw up.
2. Fiction: Links are worth something. Fact: links are worthless.
3. Fiction: XML adds value. Fact: ASCII tab-delimited files in consistent formats add value, while XML SUBTRACTS value.
4. Fiction: a good dataset is easy to use. Fact: even a good dataset (google IPUMS for an example) takes a lot of work to learn how to manipulate, presuming one can use some sort of statistical programming language in the first place.
5. Fiction: simple summaries of common data are useful. Fact: everybody has already done the simple summaries. (This is just a bonus item, and doesn't apply to data.gov, but does apply to faulty thinking about data in general.)
6. Fiction: Federated data is just fine. Fact: Data that is curated, cleaned, and integrated into one big monolithic package is FAR better, because an analyst can then learn the conventions and names and such in one piece, and parallel categories are more likely to align.
7. Fiction: Good data is easy for a layperson to use. Fact: good data still requires a lot of skill. Well, maybe in nations with decent public schools a layperson can do something with data, but not in the US.
What I WOULD like is the following (taken from another post, now deleted):
An ideal data.gov would have a lot of staff who put together a few integrated and curated datasets from the agencies. These would be hierarchies of data in a few formats (shp, txt, raster, SQL text dump, and ...?), along with well written codebooks and narrative READMEs. They would be distributed using git or subversion. The staff would have the expertise to make such nice data packages for you and me, and they would have the political oomph to demand that the agencies release the data to them. The staff would also give classes on how to use the data with some open source statistical packages to do useful work.
Good examples of curated data that I know are IPUMS and the Portland Metro's RLIS (both google-able).
I don't understand what you're getting at with this list.
1. Yes, datasets need to be cleaned. But you need to have the dataset before you can clean it, and different people will want to clean it in different ways. Get it up there first, and keep the political debates confined to the gathering methods. Griping about raw datasets only gives them an excuse to keep delaying putting anything out (in other words, this critique is actively harming the movement, please stop making it).
2. I don't understand what you mean by this. If a link points to a high-quality dataset that's otherwise hard to find, then it's very valuable.
3. Not all data is expressible in tab-delimited ASCII tables. I'd like my SEC filings in well-structured XML, for instance.
4. This is a strawman. Nobody serious has ever said a good data set is easy to use and understand.
5. Ironically, this is the one point you make that I agree with, and then you claim it doesn't apply to data.gov. I think this is actually the worst thing about data.gov right now: they think they're giving us something when they post their little summaries. Give us the raw data, please.
6. Isn't this just restating a combination of #1 and #3? Yes, big clean monolithic data sets are nice, but the priority is getting access to the data in the first place.
Well structured XML is almost impossible to beat as a data interchange format (since it was designed for that)... if you can't load XML, a format that's been around since the 1990s, you are using the wrong tools.
OK, we disagree. Except that #4 IS sort of redundant, though I want to make the point that data is almost impossible for a layperson, and still really hard for a practiced analyst.
I actually meant it when I said I didn't understand what you were getting at. I initially read it as you saying that there shouldn't be a data.gov at all (because raw data's useless, curated data's expensive and difficult, and simplified data summaries are likely to be misinterpreted by lay people), but that can't be right, so I'm really curious what you were actually trying to say. What would an ideal data.gov look like, to you?
That's in a perfect world. The first step is to get the data out there. Then we can start wondering about structure and presentation. It takes a long time to build a data infrastructure, but you shouldn't stop people who have the skills and interest from getting their hands on it. Hopefully data.gov will work with data producers to become a quality data resource, but at least there is a resource in the first place, a place to go and find material.
Perhaps a model might be people starting from data.gov and then creating different views into the data for different purposes. They don't have to reside on data.gov, and personally that's what I hope the site evolves to. Making data available in reasonable formats that can then be converted into information by other people.
I moved my reply to my top comment and screwed up the reply tree here. So this is for the comment that follows this one:
data.gov is fundamentally flawed, and won't be anything but annoying until it is reworked into something along the lines of what I suggest. Or so I think...
Responding to your edit: I don't think data.gov should be in the business of curating too aggressively. It delays data being released, and it brings in a lot of questions about the politics of how that curation is done. I agree, though, on the need for a variety of release formats, detailed descriptions of the data, open and versioned repositories, and the political clout to get the data all in one place.
I think of data.gov as a layer below IPUMS - you go to data.gov to source your IPUMS-like project (or your company that's built around doing the detailed curation you're talking about).
Interesting point, though I more or less disagree about the level of curation necessary to make a minimally useful product. I think data.gov is FAR below that level, and is just noise so far. But these are practical questions, and I think we agree in principle about a lot.
I mean that for all the practicing data analysts I know, XML is a pain in the ass (parsers, XPath, etc., all just to get it into a CSV you can import), while nice ASCII text is easy to work with.
If you want metadata, a well written narrative paragraph along with a code book is INFINITELY better than embedding the metadata in the data.
Furthermore, a lot of supposed metadata in XML is just dross like "<column>blah</column>".
Finally, all the crap in XML drives the signal-to-noise ratio way down; if you do need something that maps to a complex data structure, use JSON or something rational. Such needs are not very common in data analysis, in contrast to web applications; data analysts use multiple tables and are usually pretty close to relational databases and SQL (even if they don't call it that).
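To make the comparison concrete, here is a small sketch (with made-up records, not any real dataset) of reading the same two rows from tab-delimited text versus XML in Python. The tab-delimited version is one call; the XML version needs a parse step and a tree walk just to get back to flat rows:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records: the same two rows, tab-delimited and as XML.
tsv_text = "name\tage\nalice\t34\nbob\t29\n"
xml_text = (
    "<rows>"
    "<row><name>alice</name><age>34</age></row>"
    "<row><name>bob</name><age>29</age></row>"
    "</rows>"
)

# Tab-delimited: one call, straight into a list of dicts.
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# XML: parse the tree, then walk it to rebuild the same flat rows.
root = ET.fromstring(xml_text)
xml_rows = [{child.tag: child.text for child in row} for row in root.findall("row")]

print(tsv_rows == xml_rows)  # both roads end at the same flat table
```

Either way you end up with the same table; the XML route just adds machinery in the middle.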
Also, it's incredibly difficult to deal with large (>100 MB) datasets in XML format. Loading the whole thing into RAM for an XML parser is ridiculous. Tab-delimited data is really the best format possible, as you can easily build MapReduce scripts to manipulate it if needed.
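The line-oriented property is what makes tab-delimited data MapReduce-friendly: a mapper only ever holds one line in memory, so file size is irrelevant. A minimal sketch, with invented columns (household, person, age):

```python
import io

# Hypothetical tab-delimited input; in practice this would be
# a file object from open(), possibly hundreds of GB on disk.
data = io.StringIO(
    "h1\tp1\t34\n"
    "h1\tp2\t5\n"
    "h2\tp1\t61\n"
)

# A mapper-style pass: one line in memory at a time,
# so the same loop works unchanged on arbitrarily large files.
adults = 0
for line in data:
    household, person, age = line.rstrip("\n").split("\t")
    if int(age) >= 18:
        adults += 1

print(adults)  # 2
```

An equivalent pass over an XML file would need either the whole tree in RAM or a streaming parser with explicit state.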
I almost always write my own stream parser with regular expressions to deal with large XML files (especially very regular ones), though it should be noted that there are stream XML parsers.
To be honest, I just kind of think I know that there are stream XML parsers? I've used cElementTree when I have small XML documents and written my own regex for larger ones. (cElementTree is definitely not a stream parser)
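For the record, the standard library does ship a streaming mode: `xml.etree.ElementTree.iterparse` yields elements as their end tags arrive, and clearing each element after use keeps memory bounded. A small sketch on a toy document (a real use would pass a filename instead of a StringIO):

```python
import io
import xml.etree.ElementTree as ET

# Toy document; in practice this would be a large file on disk.
xml_text = "<rows><row><age>34</age></row><row><age>29</age></row></rows>"

total = 0
# iterparse fires as each element's end tag is seen, so the whole
# tree never has to sit in RAM if we clear elements as we go.
for event, elem in ET.iterparse(io.StringIO(xml_text)):
    if elem.tag == "row":
        total += int(elem.find("age").text)
        elem.clear()  # free the subtree we just consumed

print(total)  # 63
```

This is the same feed-and-discard pattern a hand-rolled regex stream parser implements, minus the risk of the regex misfiring on attributes, comments, or CDATA.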
I can imagine some circumstances where the hierarchical structure of XML would be useful, but in just about every data processing job I've undertaken that involved XML my first step was to get rid of XML and convert it to something like .csv or ascii tab delimited.
If you need hierarchies, use JSON, IMHO, or keys that reference between tables (the census PUMS data does this with persons nested within households, using two tables).
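The two-table pattern is easy to see in miniature. The field names below are illustrative, not the real PUMS schema: households carry a serial number, each person row repeats it, and you rebuild the nesting only when you actually need it:

```python
# Two flat tables linked by a key, in the spirit of the PUMS layout
# (illustrative field names, not the actual census schema).
households = [
    {"serial": 1, "state": "OR"},
    {"serial": 2, "state": "WA"},
]
persons = [
    {"serial": 1, "name": "alice"},
    {"serial": 1, "name": "bob"},
    {"serial": 2, "name": "carol"},
]

# Rebuild the person-within-household nesting with a plain key lookup.
by_household = {}
for p in persons:
    by_household.setdefault(p["serial"], []).append(p["name"])

print(by_household[1])  # ['alice', 'bob']
```

Both tables stay flat and tab-delimited on disk; the hierarchy lives in the key, which is exactly what a relational join does.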