Extracting Data from PDFs using Tabula

PDFs come in all shapes and forms. If you’re facing a nicely formatted PDF that is not scanned, give Tabula a shot to extract the information. How? Read the short walkthrough below:

You’ll need:

Walkthrough: Extracting data from PDF tables

  1. Download the PDF at: http://www.unhabitat.org/pmss/getElectronicVersion.aspx?nr=3387&alt=1

  2. Start Tabula (most likely by double-clicking on the Tabula icon)

  3. Point your browser to http://127.0.0.1:8080

  4. Choose the file you want to upload and click Submit

    https://i2.wp.com/farm6.staticflickr.com/5484/9500458533_91f9a6cdb4_o_d.png?w=616

  5. Wait until the PDF is fully loaded

  6. Scroll down to page 167 – we’ll extract that table.

  7. Click and pull a selection box over the table

    https://i1.wp.com/farm4.staticflickr.com/3726/9500458669_96dbc7f6e5_o_d.png?w=616

  8. A window will pop up to show how Tabula would extract the data.

    https://i1.wp.com/farm4.staticflickr.com/3703/9500458729_333885f7a3_z_d.jpg?w=616

  9. Now download the data as CSV

    https://i0.wp.com/farm8.staticflickr.com/7397/9500458755_4e9e802e54_o_d.png?w=616

  10. Fantastic, you liberated the table from the PDF. Quick and easy, wasn’t it?
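
Once you have the CSV from step 9, you will usually want to tidy it up a little before working with it. Below is a minimal sketch in Python, using only the standard library, of how that could look. It is not part of Tabula: the function names `load_table` and `to_number` are made up for this example, and the sample rows are invented placeholders, not values from the PDF above. It assumes the export may contain stray spaces, empty rows and numbers written with thousands separators.

```python
import csv
import io


def load_table(csv_text):
    """Parse CSV text into a list of rows, trimming cells and dropping empty rows."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row]
        if any(cells):  # skip rows in which every cell is empty
            rows.append(cells)
    return rows


def to_number(cell):
    """Turn a cell such as '1,234' into a float; return None if it is not numeric."""
    try:
        return float(cell.replace(",", ""))
    except ValueError:
        return None


if __name__ == "__main__":
    # Invented sample standing in for the contents of the downloaded CSV.
    sample = 'Name , Value\n\nExample A,"1,234"\nExample B,n/a\n'
    for row in load_table(sample):
        print(row, [to_number(cell) for cell in row])
```

To run it on your own export, read the file you saved in step 9 (for example with `open(path, newline="", encoding="utf-8").read()`, where `path` is wherever you saved it) and pass the text to `load_table`.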

Any questions? Got stuck? Ask School of Data!

Last updated on Sep 02, 2013.
