New Jersey Open Data Initiative (nj.us)
121 points by rmason on Jan 11, 2017 | 22 comments


It's all about the format:

My city's school district makes images of the spreadsheets and then publishes them in a pdf. USELESS


Happily, this bill actually does care about the format of the data:

"The policy and procedures shall include ... technical requirements with the goal of making electronic data sets available to the greatest number of users and for the greatest number of applications, including, whenever practicable, the use of machine readable, non-proprietary technical standards for web publishing"

"The Treasurer shall consider various means by which to develop a set of universal data formatting standards to effectuate the purposes of this act, including working with other State departments and agencies, and contracting, if deemed necessary, with nonprofit organizations, commercial vendors or third party groups for this purpose. If such standards are developed and adopted by the Treasurer, they shall be the format that each State department and agency will use to provide existing and future electronic data sets"


> the use of machine readable, non-proprietary technical standards for web publishing

Wouldn't that cover PDF, even though it might be deeply inappropriate for its content? PDF has several ISO standards, and is most definitely machine readable.

PDF should also meet the goal as well:

> available to the greatest number of users and for the greatest number of applications

There is a huge number of PDF clients, such as PDF.js, which powers Firefox's built-in PDF viewer.

> If such standards are developed and adopted by the Treasurer, they shall be the format that each State department and agency will use to provide existing and future electronic data sets

Knowing how much government uses PDF... It is probably incoming.

But it makes little/no sense for spreadsheet data. Great for graphs and prose, however.


PDF is really a container and can have just about anything inside of it.

https://www.w3.org/2013/04/odw/odw13_submission_52.pdf

PDF with csv attachments is a good compromise but I rarely see that. Usually it is PDF or Excel and the PDF is a nightmare to munge.
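
For contrast, here is roughly what consuming a dataset looks like when it ships as CSV. This is only a minimal sketch: the school names, column names, numbers, and the `total_by_column` helper are all made up for illustration, not taken from any real dataset.

```python
import csv
import io

# Made-up sample standing in for a published dataset (hypothetical values).
SAMPLE_CSV = """school,enrollment,budget_usd
Lincoln Elementary,412,3100000
Roosevelt Middle,655,5200000
Jefferson High,1210,9800000
"""


def total_by_column(csv_text: str, column: str) -> int:
    """Sum an integer column of a CSV document that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row[column]) for row in reader)


print(total_by_column(SAMPLE_CSV, "enrollment"))  # 2277
```

No OCR, no table detection, no guessing at column boundaries. That is the whole argument for publishing the CSV alongside (or inside) the PDF.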


While deplorable from an accessibility standpoint, at least the data is available and out in the open to the public. One small step in the right direction.

EDIT: Tabula[1] is an open source tool to extract tables from PDFs into CSV/Excel format in case one really needs to process data trapped in PDF documents.

[1] http://tabula.technology


Does Tabula do OCR? That's what would be required for tables embedded as images.


Unfortunately, it doesn't. From https://github.com/tabulapdf/tabula:

> Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
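
The click-and-drag test can be roughly approximated in code. The sketch below is a crude heuristic and is not taken from Tabula: it assumes a text-based PDF declares at least one `/Font` somewhere, either in the raw bytes or inside a zlib (Flate) compressed stream, while a pure scan contains only images. The name `looks_text_based` is hypothetical. Expect false positives on scans that carry an OCR text layer, and false negatives on encrypted files or streams using other filters.

```python
import re
import zlib

# Matches the raw payload between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def looks_text_based(pdf_bytes: bytes) -> bool:
    """Heuristic: True if a /Font name appears in the file or in any
    zlib-compressed stream; False otherwise (likely image-only)."""
    if b"/Font" in pdf_bytes:
        return True
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data (e.g. a JPEG image); skip it
        if b"/Font" in inflated:
            return True
    return False
```

Usage would be something like `looks_text_based(open(path, "rb").read())`; if it returns False, the file most likely needs OCR before any table extraction.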


Tabula is great!


In their defense, that's still better than most academic publications!


> My city's school district makes images of the spreadsheets and then publishes them in a pdf. USELESS

You should be able to sue them for misuse of public funds. Dedicating tax money to converting a useful format (CSV, spreadsheet, etc.) into a useless or at least less useful one (e.g. PDF) is a clear violation.


It isn't a violation if there isn't any policy or law against it.


Had the same problem with NJ. I FOIA'd Chris Christie's schedules and got some of the original PDF reports (which was fine, they were readable with some Python library) and some SCANS of the original PDF reports. Even with OCR, the scans were useless.


Yay: “…the bill requires the Department of the Treasury to establish an unique, dedicated, easily navigable Internet website which will offer to the public all available appropriate[0] existing and future electronic data sets maintained by each State department and agency.”

Not so yay: generally nobody’s going to be liable for incomplete or inaccurate data, with the exception of “gross negligence, willful and wanton misconduct, or intentional misconduct”.

Understandable: “No personally identifiable information would be posted online”

Unclear: “State departments and agencies would not be required to make data sets available upon demand”

[0] It seems that availability and appropriateness are where things might stall. What if a department gives the Treasurer a dataset full of personally identifiable information? They thus fulfill their duty per this bill, the dataset is available, and yet the public sees nothing, unless the Treasury goes the extra mile and takes care of anonymizing the dataset.


Agreed. Given how things work in NJ, I'm pretty skeptical that real insight will come of this. It's much more likely to just be a PR thing for state agencies.


I'm involved with Bath Hacked (UK), a community interest company set up in partnership with the local council, which actively hunts out council data sets that can be published in the data store.

The site: https://www.bathhacked.org/ The data store: https://data.bathhacked.org/

We run hackathons and also reach out to various organisations for data sets that, although we may not be able to publish them in the data store, can result in very interesting tools.

The most recent data we got our hands on was the bike share hire programme. https://www.youtube.com/watch?v=4AfocuESnSg


I'm unsure why I'm being downvoted, but I think it's probably to do with qualifying why Bath Hacked is important. It has been mentioned in Parliament and is used as a national example.

Bath and North East Somerset council now have a policy to open as much data as possible to the public, directly as a result of BH.


Nice to see one of my company's (Socrata) customers on HN!


There are zero accountability clauses in this bill. If administrators can't be held accountable for providing complete, accurate, and parsable data then what the public could expect is cherry-picked, bottom of the barrel information that is of little to no value.


Has Google won and coerced a government into making personal data available for them to harvest?


Iowa Code Chapter 22 is the most transparent I know of. There was one uproar when concealed carry permits were published online, but the underlying problem is that Iowans have a blacklist of those banned from having a weapon, so the whitelist is a poll tax. My only problem so far was getting pics of my house doxxed from property records and sent to me in a threatening email from a bitcoin-paid account in Sweden. "Mr. Dendler" had poor opsec and logged a profile view to my LinkedIn. One request for LinkedIn to archive "Dendler"'s IP logs and it abruptly stopped. Must not have been using Tor.


From the text of the law:

"No personally identifiable information shall be posted online unless the identified individual has consented to the posting or the posting is necessary to fulfill the lawful purposes or duties of the department or agency."


Always read the T&Cs!



