Extracting Data from PDFs using Tabula

PDFs come in all shapes and forms. If you’re facing a nicely formatted PDF that is not scanned, give Tabula a shot to extract the information. How? Read the short walkthrough below:

You’ll need:

Walkthrough: Extracting data from PDF tables

  1. Download the PDF at: http://www.unhabitat.org/pmss/getElectronicVersion.aspx?nr=3387&alt=1

  2. Start Tabula (most likely by double-clicking on the Tabula icon)

  3. Point your browser to http://127.0.0.1:8080

  4. Choose the file you want to upload and click Submit

    https://i2.wp.com/farm6.staticflickr.com/5484/9500458533_91f9a6cdb4_o_d.png?w=616

  5. Wait until the PDF is fully loaded

  6. Scroll down to page 167 – we’ll extract that table.

  7. Click and drag a selection box over the table

    https://i1.wp.com/farm4.staticflickr.com/3726/9500458669_96dbc7f6e5_o_d.png?w=616

  8. A window will pop up to show how Tabula would extract the data.

    https://i1.wp.com/farm4.staticflickr.com/3703/9500458729_333885f7a3_z_d.jpg?w=616

  9. Now download the data as CSV

    https://i0.wp.com/farm8.staticflickr.com/7397/9500458755_4e9e802e54_o_d.png?w=616

  10. Fantastic, you liberated the table from the PDF. Quick and easy, wasn’t it?
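
Once you have the CSV from step 9, you may want a quick look at it in code. The snippet below is only a minimal sketch using Python's standard csv module, not part of Tabula itself. The function names are made up for illustration, and it assumes the export is a plain comma-separated, UTF-8 file. It trims stray whitespace from each cell and drops rows that are completely empty, which sometimes appear when a selection box covers blank space around a table.

```python
import csv
import io


def load_tabula_csv(text):
    """Parse CSV text (e.g. a Tabula export) into a list of cleaned rows.

    Hypothetical helper for illustration; not part of Tabula.
    """
    rows = []
    for row in csv.reader(io.StringIO(text)):
        # Trim whitespace that may surround the extracted cell text
        cells = [cell.strip() for cell in row]
        # Skip rows in which every cell is empty
        if any(cells):
            rows.append(cells)
    return rows


def load_tabula_csv_file(path):
    """Read a CSV file from disk and clean it with load_tabula_csv.

    Assumes the file is UTF-8 encoded; adjust if your export differs.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return load_tabula_csv(f.read())
```

To try it, call load_tabula_csv_file with the path of the file you downloaded and print the rows it returns. The first row will usually be the table header, if your selection included it.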

Any questions? Got stuck? Ask School of Data!

Last updated on Sep 02, 2013.
