Liberating HTML Data Tables

Section author: Tony Hirst (psychemedia on Twitter)

It’s not uncommon to see small data sets published on the web using an HTML table element. If you have a quick click around Wikipedia, you’re likely to find a wide variety of examples. Some sites will use Javascript libraries to enhance the presentation or usability of a table, for example, by making columns sortable; but most of the time, we are faced with a flat HTML table, and the data locked in it.

In this section, we look at some quick tricks for liberating data from HTML tables on public webpages and turning them into something more useful.

Screenscraping HTML Tables Using Google Spreadsheets

The Google spreadsheet formula:

=importHTML("","table",N)

will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 1) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for UK city population):

https://i1.wp.com/farm9.staticflickr.com/8303/7850933084_b188c02992_o_d.jpg?w=616

Grab the URL, fire up a new Google spreadsheet, and start to enter the formula =importHTML into one of the cells:

https://i2.wp.com/farm9.staticflickr.com/8284/7850932578_b5db80ed9d_o_d.jpg?w=616

Autocompletion works a treat, so finish off the expression and add in the URL and table number:

=importHTML("http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population","table",2)

The table numbers are not always obvious – start with 1 and increment the table number until you get the correct one.

https://i2.wp.com/farm9.staticflickr.com/8438/7850932674_ef1514b761_o_d.jpg?w=616

As if by magic, a data table appears in the spreadsheet, pulled in directly from the Wikipedia page:

https://i1.wp.com/farm9.staticflickr.com/8425/7850932816_b5598830e0_o_d.jpg?w=616

If the data in the HTML table is updated, the data in the spreadsheet will also be updated when you refresh or call the spreadsheet page.

Any questions? Got stuck? Ask School of Data!

Last updated on Sep 02, 2013.

11 Comments

  1. Great recipe. Only one remark. For some reason (change in GoogleSpreadsheets interface?), the working formula is now =importHTML(“http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population”,”table”,2). As the table is still goes second at the page, I suspect it’s because Google decided to start counting at 1. Or maybe I’m wrong. Any ideas?

  2. NRP

    thanks for sharing, liked it..!!! 😀

  3. David Meyers

    Is it possible to do this if you don’t know the URL? For example, if I want to enter baseball player names into the cells in column A, and run a script to put their batting average in column B. Since I don’t know beforehand which players will be in the list, I can’t preset the URL for that player’s page of stats. Any suggestions?

    • Michael Bauer

      David,

      If you know the name-url combinations that would be possible. The URL doesn’t have to be hard-coded it can also be in a cell that is referenced in the formula.

  4. Britney Muller

    Amazing! Can you do this through multiple pages?

    Thanks so much!

    • Britney Muller

      To clarify: Have it auto go through multiple pages?

    • SchoolofData

      Hello Britney, not it cannot, you can only get one table per formula.

      If you want to scrape several pages, you should take a look at http://webscraper.io/ or https://import.io/. Webscraper is a Google Chrome extension and import.io is a dedicated program.

      Cédric Lombion
      School of Data Community Manager

  5. Amrapalli

    Thank you so much, this is simply awesome. Work got reduced a lot.

Leave a Reply

Theme by Anders Norén