
This sounds neat. Thanks for the work, vortex_ape and others. When I last needed this, I used tabula via tabula-py. I tried camelot on the PDF [1] I had worked on, and unfortunately the default options returned a less workable dataframe than tabula-py did. I think it's just the area detection in stream, and you are working on that anyway, so I'm really looking forward to seeing the results.

btw, I think the pip install requirements are missing opencv-python (on Windows?). And in this doc [2], it should be "top left and bottom right" instead of "left-top and right-bottom".

[1] https://www.boj.or.jp/en/statistics/set/kess/release/2018/ke...

[2] https://camelot-py.readthedocs.io/en/master/user/advanced.ht...



Hey squaresmile! Yes, right now table detection with Stream doesn't work well when the table doesn't span the full page. For that case you can pass the table_area kwarg described in [2].

You should use "pip install camelot-py[all]" to install Camelot (which will install opencv-python too). I had to take it out of the requirements since it wasn't available in any conda channel while I was creating the conda package. I'm looking to remove opencv as a requirement altogether, either by vendoring the opencv code that Camelot uses or by reimplementing it with something lightweight like Pillow.

Thanks for the catch in [2], I'll correct it!
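
For anyone following along, here is a rough sketch of the table-area suggestion above. It assumes the area is given as an "x1,y1,x2,y2" string with (x1, y1) the top-left and (x2, y2) the bottom-right corner in PDF coordinate space (origin at the bottom-left, so the top edge has the larger y). The helper names area_string and read_region are made up for illustration, and the kwarg is spelled table_areas here (a list of strings), which is how I understand later Camelot releases name it; the comment above writes table_area, so check the docs for the version you have installed.

```python
def area_string(top_left, bottom_right):
    """Format two (x, y) corners as the "x1,y1,x2,y2" string.

    Hypothetical helper, not part of Camelot. Coordinates are assumed to be
    in PDF space, where y grows upward from the bottom of the page.
    """
    x1, y1 = top_left
    x2, y2 = bottom_right
    # The top-left corner must be left of and above the bottom-right one.
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected top-left then bottom-right in PDF coordinates")
    return f"{x1},{y1},{x2},{y2}"


def read_region(pdf_path, top_left, bottom_right, pages="1"):
    """Run Stream on one page region and return the tables as DataFrames."""
    # Imported lazily so area_string can be used without Camelot installed.
    import camelot

    tables = camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor="stream",
        # Assumption: the kwarg is `table_areas` in the installed version.
        table_areas=[area_string(top_left, bottom_right)],
    )
    return [table.df for table in tables]
```

Something like read_region("report.pdf", (316, 499), (566, 337)) would then restrict Stream to that rectangle instead of letting it guess the table boundary from the whole page; the file name and coordinates here are placeholders.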


Are you also working on extracting tabular data from scanned image files?



