Happily, this bill actually does care about the format of the data:
"The policy and procedures shall include ... technical requirements with the goal of making electronic data sets available to the greatest number of users and for the greatest number of applications, including, whenever practicable, the use of machine readable, non-proprietary technical standards for web publishing"
"The Treasurer shall consider various means by which to develop a set of universal data formatting standards to effectuate the purposes of this act, including working with other State departments and agencies, and contracting, if deemed necessary, with nonprofit organizations, commercial vendors or third party groups for this purpose. If such standards are developed and adopted by the Treasurer, they shall be the format that each State department and agency will use to provide existing and future electronic data sets"
> the use of machine readable, non-proprietary technical standards for web publishing
Wouldn't that cover PDF, even though it might be deeply inappropriate for its content? PDF has several ISO standards, and is most definitely machine readable.
PDF should also meet this goal:
> available to the greatest number of users and for the greatest number of applications
There is a huge number of PDF clients, such as PDF.js, which powers Firefox's reader.
> If such standards are developed and adopted by the Treasurer, they shall be the format that each State department and agency will use to provide existing and future electronic data sets
Knowing how much government uses PDF, it is probably incoming.
But PDF makes little to no sense for spreadsheet data. It is great for graphs and prose, however.
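To make the spreadsheet point concrete, here is a minimal sketch of how little code tabular data needs when it is published as CSV. The sample data, column names, and the `total_by_department` helper are all hypothetical, invented for illustration; nothing here comes from the bill or any real State data set.

```python
import csv
import io

# Hypothetical sample of what an agency might publish; the column names
# and figures are invented purely for illustration.
SAMPLE_CSV = """department,fiscal_year,amount_usd
Treasury,2015,1200.50
Education,2015,980.25
Treasury,2016,1310.00
"""


def total_by_department(csv_text):
    """Sum the amount_usd column per department."""
    totals = {}
    # DictReader uses the header row as keys, so no layout guessing is needed.
    for row in csv.DictReader(io.StringIO(csv_text)):
        dept = row["department"]
        totals[dept] = totals.get(dept, 0.0) + float(row["amount_usd"])
    return totals


if __name__ == "__main__":
    print(total_by_department(SAMPLE_CSV))
```

The same table locked inside a PDF would need a table-extraction step before any of this could run, which is the practical cost of picking a page-layout format for spreadsheet data.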
While deplorable from an accessibility standpoint, at least the data is available and out in the open to the public. One small step in the right direction.
EDIT: Tabula[1] is an open source tool that extracts tables from PDFs into CSV/Excel format, in case one really needs to process data trapped in PDF documents.
> Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
> My city's school district makes images of the spreadsheets and then publishes them in a pdf. USELESS
You should be able to sue them for misuse of public funds. Dedicating tax money to converting a useful format (CSV, spreadsheet, etc.) into a useless or at least less useful one (e.g., PDF) is a clear violation.
Had the same problem with NJ. I FOIA'd Chris Christie's schedules and got some of the original PDF reports (which was fine; they were readable with some Python library) and some SCANS of the original PDF reports. Even with OCR, the scans were useless.
Yay: “…the bill requires the Department of the Treasury to establish an unique, dedicated, easily navigable Internet website which will offer to the public all available appropriate[0] existing and future electronic data sets maintained by each State department and agency.”
Not so yay: generally nobody’s going to be liable for incomplete or inaccurate data, with the exception of “gross negligence, willful and wanton misconduct, or intentional misconduct”.
Understandable: “No personally identifiable information would be posted online”
Unclear: “State departments and agencies would not be required to make data sets available upon demand”
[0] It seems that availability and appropriateness are where things might stall. What if a department gives the Treasurer a dataset full of personally identifiable information? They thus fulfill their duty under this bill, the dataset is available, and yet the public sees nothing unless the Treasury goes the extra mile and takes care of anonymizing the dataset.
I'm involved with Bath Hacked (UK), a community interest company set up in partnership with the local council, which actively hunts out council data sets that can be published in the data store.
We run hackathons and also reach out to various organisations for data sets that, although we may not be able to publish them in the data store, can result in very interesting tools.
I'm unsure why I'm being downvoted, but I think it's probably because I didn't qualify why Bath Hacked is important: it has been mentioned in Parliament and is used as a national example.
Bath and North East Somerset council now have a policy to open as much data as possible to the public, directly as a result of BH.
There are zero accountability clauses in this bill. If administrators can't be held accountable for providing complete, accurate, and parsable data, then what the public can expect is cherry-picked, bottom-of-the-barrel information that is of little to no value.
Iowa Code Chapter 22 is the most transparent I know of. There was one uproar when concealed carry permits were published online, but the underlying problem is that Iowans have a blacklist of those banned from having a weapon, so the whitelist is a poll tax. My only problem so far was having pics of my house pulled from property records and sent to me in a threatening email from a Bitcoin-paid account in Sweden. "Mr. Dendler" had poor opsec and logged a profile view to my LinkedIn. One request for LinkedIn to archive "Dendler"'s IP logs and it abruptly stopped. Must not have been using Tor.
"No personally identifiable information shall be posted online unless the identified individual has consented to the posting or the posting is necessary to fulfill the lawful purposes or duties of the department or agency."
My city's school district makes images of the spreadsheets and then publishes them in a pdf. USELESS