Scraping

  • Tools:
    • OutWit Hub
    • Needlebase
    • Scraperwiki
    • Google Spreadsheets
    • Formulae
  • Walkthru using Google Docs (=import)
    1. Open a spreadsheet
    2. In A1, type the URL of a page with a table.
    3. In cell A2, type: =ImportHTML(A1, “table”, 1)
    4. Function importHTML($source, $element, $index)
      • Source = Where you’re getting data from. Can be a spreadsheet cell.
      • Object = Which type of object in the HTML document you want to parse. Likewise.
      • Index = Which object? Ditto.
  • Use Google News RSS; Google Alerts
  • Set up a regular supply of data:
    • RSS for regulators, campaigns, gov, EU, ONS, data.gov.uk
    • RSS feeds for WDTK, OpenlyLocal, OpenCorporates, OpenCharities, disclosure logs
  • Advanced spreadsheet stuff:

    • “filetype:”, “site:” do what you expect.
    • “~” is for synonyms
      lunchbreak
  • Using importXML($url, $xpath)

    • Useful xpaths:
      • “//div[starts-with(@class, ‘jobWrap’)]”
      • “//p[starts-with(@style, ‘font-size: 10pt’)]”
    • =transpose($range) changes from rows to columns.

For next class:

  • Play around with some scraping tools and write a blog about it.
  • Start shaping your project.