Archive for July 2008
Learning page scraping and mashups
As a side project, I’ve been extending my programming skills with a novel mashup project.
“Mashing up” is the art of taking two elements and combining them to produce something new. A musical mashup can create something as interesting as Bootystition (Destiny’s Child vs Stevie Wonder). But mashups of online data exist, too, and can be very useful or beautiful.
Some examples of data mashups:
- Visual Headlines grabs Flickr images related to keywords in CNN headlines
- BBC News Map, another headlines service, plots news headlines on a map
- Housing Maps, another map service, plots Craigslist listings on a map
In some cases these mashups are built on APIs, which let one website request data from another in a structured format. But sometimes, where websites aren’t set up to export data in easy-to-manipulate formats, other methods must be used.
Page scraping is one such method. Many programming languages support methods of downloading a web page’s contents and extracting particular data from it. The ways in which this can be done are too numerous to go into here, but the upshot is that page scraping allows a programmer to acquire structured data for the purposes of integrating it with another system.
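To make that concrete, here is a minimal sketch of the extraction half of page scraping. It is written in Python using only the standard library, not the PHP I have been working in, and the sample page and its URLs are made up for illustration. It pulls every link and its label out of a chunk of HTML:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample markup; a real page would be downloaded first.
SAMPLE_PAGE = """
<ul>
  <li><a href="/notice/1">Example Cafe</a></li>
  <li><a href="/notice/2">Sample Bakery</a></li>
</ul>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_PAGE):
        print(href, text)
```

Once the data is in a list like this, it is structured enough to hand to another system.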
My expedition into page scraping and mashups was inspired by my discovery of the recently released NSW Food Authority Register of Penalty Notices, a list of penalty notices served to food establishments for breaches of food safety. I want to take this data – which includes business names and addresses – and plot it on a Google map, to allow consumers to see which of their neighbourhood restaurants have been issued a penalty notice.
Using the instructions in the article “PHP: Write a Web Page Scraper”, I have been able to grab a link to each penalty notice’s details page. I can then grab the contents of that page using cURL, and then use a regular expression search (thanks for the suggestion, Kunaal) to pick out the offence details. (I love government websites; they’re so structured, orderly, and predictable.) At the moment, after a few hours’ programming, that’s all I have achieved. Here is an example of the script output. The next steps will be:
- Saving the information to a database
- Setting up periodic crawling for new information
- Building an interface to attach the penalty notice data to labels on a Google map
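For the curious, the fetch, extract and save steps might look something like the sketch below. This is Python rather than my actual PHP script, and several things in it are assumptions for illustration only: the table layout in the sample page, the field labels, the notice ID and the database schema are all invented, and the real register's markup will differ. Keying each row on a notice ID means a periodic crawl can re-save a notice without creating duplicates.

```python
import re
import sqlite3
from urllib.request import urlopen


def fetch(url):
    """Download a page and return it as text (the job cURL does in PHP)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Assumed layout: each field is a table row with a label cell and a value cell.
ROW_PATTERN = re.compile(
    r"<tr>\s*<th>(.*?)</th>\s*<td>(.*?)</td>\s*</tr>",
    re.IGNORECASE | re.DOTALL,
)


def parse_notice(html):
    """Return a dict mapping each field label to its value."""
    return {label.strip(): value.strip()
            for label, value in ROW_PATTERN.findall(html)}


def save_notice(conn, notice_id, fields):
    """Insert or update one notice; the schema here is hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notices ("
        "notice_id TEXT PRIMARY KEY, trade_name TEXT, address TEXT, offence TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO notices VALUES (?, ?, ?, ?)",
        (notice_id, fields.get("Trade name"), fields.get("Address"),
         fields.get("Offence")),
    )
    conn.commit()


# Invented sample details page, standing in for fetch(some_details_url).
SAMPLE_DETAILS = """
<table>
  <tr><th>Trade name</th><td>Example Cafe</td></tr>
  <tr><th>Address</th><td>1 Sample Street, Sydney</td></tr>
  <tr><th>Offence</th><td>Failure to maintain clean premises</td></tr>
</table>
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_notice(conn, "example-1", parse_notice(SAMPLE_DETAILS))
    print(conn.execute("SELECT * FROM notices").fetchall())
```

Regular expressions are brittle against markup changes, so this only works as long as the pages stay as orderly as they are now.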
What other potential uses does this data have? I am not sure, yet. A running tally of the amount of fines issued is one idea. Perhaps a tag cloud of common keywords in the offence descriptions. Chronic Infoholic suggested a possible domain name, which I’ll keep between us for now, but I’m not sure whether I want to develop this into more than a curiosity at the moment. We’ll see.
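Both of those ideas are only a few lines of code once the notices are stored. Here is a rough Python sketch; the notices, the dollar amounts and the stop-word list are all made up for illustration, and the word counts are just the raw numbers a tag cloud would be drawn from.

```python
import re
from collections import Counter

# Common words that would only clutter a tag cloud (an arbitrary short list).
STOP_WORDS = {"to", "the", "of", "a", "and", "in", "for"}


def total_fines(notices):
    """Sum the penalty amounts across all notices."""
    return sum(notice["amount"] for notice in notices)


def keyword_counts(notices):
    """Count how often each non-trivial word appears in offence descriptions."""
    counts = Counter()
    for notice in notices:
        words = re.findall(r"[a-z]+", notice["offence"].lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


# Invented records, not real register data.
SAMPLE_NOTICES = [
    {"offence": "Failure to maintain clean premises", "amount": 330},
    {"offence": "Failure to store food safely", "amount": 660},
]

if __name__ == "__main__":
    print("Total fines:", total_fines(SAMPLE_NOTICES))
    print(keyword_counts(SAMPLE_NOTICES).most_common(5))
```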
Back to blogging
After a long hiatus, I’ve decided to switch the blog on this website back on and keep a few notes about what I’m up to. I keep having to trim my thoughts down to 140 characters or fewer (on Plurk and Twitter) and some things just don’t reduce to that size.
I’m assuming most of my readers are people who know me in real life or through a social networking website. You can keep up with what I’m writing by subscribing to my RSS feed. Follow me elsewhere or contact me by following the links on my ‘About’ page.
