For a project that I’m currently working on I need a list of all the US National Parks in XML format. Google didn’t come up with anything so I decided that I would need to somehow grab the data from this list on Wikipedia. The problem is that the list is in messy HTML but I want some nice clean XML ready for parsing with E4X in Flash.
There are a number of ways I could parse the data. If I knew Ruby and had an environment set up I’d probably use Hpricot. Or I could get my hands dirty again with PHP and DOMDocument. Or if the Wikipedia page were XML, or could be converted into XML easily, then I could use an XSL transform. Or I’m sure there are hundreds of other approaches… But in this case I just wanted to very quickly and easily write a script which would grab and translate the data so I could get on with the rest of the project.
That’s when I thought of using jQuery to parse the data – it is the perfect tool for navigating an HTML DOM and extracting information from it. So I wrote a script which would use AJAX to load the page from Wikipedia. And that’s where I hit the first hurdle: “Access to restricted URI denied” – you can’t make cross-domain AJAX calls because of security restrictions in the browser :(
At this point I had at least a couple of ways to proceed with my jQuery approach:
- Copy the HTML file from Wikipedia to my server, thus avoiding the cross-domain issues.
- Write a quick server-side redirect script to live on my server, grab the page from Wikipedia and echo it back out.
The YQL platform provides a single endpoint service that enables developers to query, filter and combine data across Yahoo! and beyond.
After a quick flick through the documentation and some testing in the YQL Console I put together a query which grabs the relevant page from Wikipedia and can be requested as a JSONP call, which allows us to get around the cross-domain AJAX issues. As an added extra it’s really easy to add some XPath to your YQL, so I’m grabbing only the relevant table from the Wikipedia document, which cuts down on the complexity of my JavaScript. Here’s what I ended up with:
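A minimal sketch of how such a request could be assembled. The helper name `buildYqlUrl`, the endpoint host and the placeholder page URL are assumptions and are not taken from the original script; the XPath is the one quoted in the comments below.

```javascript
// Assumed address of YQL's public endpoint (not taken from the original script).
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: builds the request URL for a YQL query against an HTML page.
function buildYqlUrl(pageUrl, xpath) {
  // YQL statement: load the page through the "html" table and keep
  // only the nodes that match the XPath expression.
  var query = 'select * from html where url="' + pageUrl +
              '" and xpath="' + xpath + '"';
  // format=json asks for JSON; callback=? is the marker jQuery
  // replaces with a generated JSONP callback name.
  return YQL_ENDPOINT + "?q=" + encodeURIComponent(query) +
         "&format=json&callback=?";
}

// Placeholder page URL, for illustration only.
var requestUrl = buildYqlUrl(
  "http://en.wikipedia.org/wiki/EXAMPLE_PAGE",
  "//table[@class='wikitable sortable']"
);
```

In the browser the resulting URL would be handed to something like `$.getJSON(requestUrl, callback)`, which treats it as a JSONP request because of the `callback=?` parameter.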
If you run this code in the console you’ll see that it grabs the relevant table from Wikipedia and returns it as XML or JSON. From here it’s easy to make the AJAX call from jQuery and then loop over the returned JSON, creating an XML document. If you are interested in the details of that you can check out the complete example.
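The loop-and-build step can be sketched roughly as follows. The input shape (an array of objects with `name` and `state` fields), the output element names and both function names are hypothetical; the JSON that YQL actually returns would first need flattening into that shape.

```javascript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical input shape: [{ name: "...", state: "..." }, ...].
function parksToXml(parks) {
  var lines = ["<parks>"];
  // A plain indexed loop, which works across browsers.
  for (var i = 0; i < parks.length; i++) {
    lines.push('  <park name="' + escapeXml(parks[i].name) +
               '" state="' + escapeXml(parks[i].state) + '"/>');
  }
  lines.push("</parks>");
  return lines.join("\n");
}

// Illustrative input, not data taken from the scraped table.
var xml = parksToXml([{ name: "Yosemite", state: "California" }]);
```

Building the document as a string keeps the sketch independent of any browser DOM API; escaping each value first means a park name containing `&` or a quote cannot break the markup.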
I was really impressed with how easy it was to quickly figure out YQL and I think it’s a really useful service. Even if you just use it to convert an HTML page to a valid XML document, it is still invaluable for all sorts of screen scraping purposes (it’s always much easier to parse XML than HTML tag soup). One improvement I’d love to see is the addition of a CSS-style selection engine as well as the XPath one. And the documentation could maybe be clearer (I figured out the above script by checking examples on other blogs rather than by reading the docs). But overall I give Yahoo! a big thumbs up for YQL and look forward to using it again soon…
23 Comments, Comment or Ping
Really cool, but is Firefox the only supported browser? I can't get it to work in IE or Chrome.
February 1st, 2009
Hey Gil, glad you like it. I didn't test it on any other browsers because it was just a quick and dirty thing and I'm lazy! I kind of assumed that jQuery would "just work" as well...
Took a look at it now though and it turns out that amongst the browsers I have installed only FF3 supports "for each" loops. So I changed that to a normal for loop which got it working in everything except IE. And then I made another change (commented in the code) which got it working in IE...
So take another look at it and it should now be cross-browser :) Let me know if you notice any other issues,
Cheers,
Kelvin :)
February 2nd, 2009
Hey there, Kelvin. Just saw your post here, and wanted to let you know that CSS style selection is now possible thanks to YQL Execute... check out the second example at this link:
"http://ajaxian.com/archives/yql-execute-now-allows-you-to-convert-scraped-data-with-server-side-javascript"
Hope this helps with any future YQL work. Thanks for the post and sharing your work with YQL with others.
May 15th, 2009
Hey :)
Thanks for the comment :) I've seen the stuff with YQL Execute and I'm really keen to try it out some time... It looks like it makes the whole YQL concept many times more powerful... I'm looking forward to finding an excuse to play with it!
May 15th, 2009
CSS selectors have been added with "YQL Execute" http://developer.yahoo.net/blog/archives/2009/04/yql_execute.html
May 25th, 2009
Aha, see that Execute has already been mentioned! Also found YQL works well with Pipes to serve up customized RSS feeds.
May 25th, 2009
Hello Kevin,
Thank you for such interesting and useful info.
I knew nothing about YQL and jQuery before, and when I needed to do data scraping, I wrote the program in VB.net and this took a lot of time.
I am going to learn YQL & jQuery now, thank you for your article!
Warmest wishes,
Alexey.
October 24th, 2009
Is there a maximum usage limit for YQL?
Thanks.
October 31st, 2009
Yes. See the YQL intro page:
http://developer.yahoo.com/yql/
* Per application limit (identified by your Access Key): 100,000 calls per day
* Per IP limits: /v1/public/*: 1,000 calls per hour; /v1/yql/*: 10,000 calls per hour
November 2nd, 2009
Kelvin,
Thank you for this article.
I also work with jQuery + YQL integration. I was working on one client's site and YQL was the only possible option for making a custom Facebook Fan Box Widget. I found that my script becomes obsolete immediately after any single DOM change in the scraped data, e.g. you query Wikipedia with xpath="//table[@class='wikitable sortable']", but suddenly Wikipedia changes its structure and renames its tables. Or, as happened to me, Facebook moved their widgets from http://www.facebook.com to http://www.connect.facebook.com, so you need to keep an eye on your scraped data.
May 8th, 2010
Really nice article. Found an interesting jQuery & YQL plugin that helps with YQL queries.
jQuery YQL
jQuery YQL plugin download
August 1st, 2010
You can also run jQuery on YQL's servers using YQL-Execute.
Here's an example: http://chiarg.com/?p=403
December 4th, 2010
That is an interesting approach. Generally you'll want a server-side programming language to do the heavy lifting, and just let JavaScript grab the output of the server-side script.
You're right, XPath is an amazing way to traverse web pages, especially if you're wanting to manipulate objects further. I wrote some about Advanced Data Scraping using cURL and XPath.
Writing your own PHP, Python, etc. script using cURL and XPath would get around the API max requests of YQL.
December 18th, 2010
Thanks for the example. After reading about your scraping project I started using YQL via a Python library that I found and have had great success with it.
March 10th, 2011
Reply to “Data scraping with YQL and jQuery”