I've been doing quite a bit of Ruby programming lately and I'm always surprised by how much I can get done with such a small amount of code. Ruby is not my primary development language, but for many of the things I need to do, Ruby is the perfect tool for the job.
For instance, I'm in the middle of looking for a home to buy, so I've spent a lot of time on real estate websites waiting for the newest listings in my area to become available. Logging on to each real estate site every day is a repetitive and time-consuming task. Worse still, there typically aren't many new listings in a given week. With Ruby, I've been able to write some small screen scraper applications that log onto each site, check for new listings, and report back to me with the results. This has saved me a lot of time and effort.
Since most web pages are simply made of HTML, it is easy for a computer to parse and store the information contained within these documents. Most programming languages have a host of libraries to assist in the screen scraping/parsing process, and Ruby is no exception. To create simple screen scrapers in Ruby I have been using a library called Scrubyt. Scrubyt provides a simple DSL to access a given website and scrape its content. All the programmer needs to do is provide the XPath string for the desired information.
As an example, let's say that I want to extend a friendly challenge to all of my co-workers to see who can write the most blog entries in 2009. Some of the information I would need to grab would be a list of my co-workers who have contributed to the blog, a list of their blog entries, and the date their blog entries were written.
Without a doubt, all of this information is stored in the database that manages the content of this blog and could easily be accessed by running a few simple SQL queries. The problem, however, is that I don't have access to the database. This is the beauty of screen scraping: any webpage that is exposed over a network can be scraped by a computer program.
To see how this is done, let's dive into some code.
To begin with, we'll need to create some simple classes to store the information that we scrape from the Platinum Solutions Blog. We'll start with a BlogEntry class that will store the url, title, date, and number of reads for each blog entry:
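require 'date'

# Holds the details scraped for a single blog entry.
class BlogEntry
  attr_accessor :url, :title, :date, :num_of_reads

  def initialize(blog_hash)
    # Hash keys match the pattern names used in the extractor further down.
    @url = blog_hash[:blog_url]
    @title = blog_hash[:blog_title]
    @date = Date.parse(blog_hash[:blog_date])
    @num_of_reads = blog_hash[:number_of_reads].to_i
  end
end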
Next, we'll store information about the blog author including a unique id, a name, and an Array of entries written by that author. We'll store this information in a class called BlogEntryAuthor:
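# Groups all of the entries written by a single blog author.
class BlogEntryAuthor
  attr_accessor :id, :name, :entries

  def initialize(blog_hash)
    @id = blog_hash[:blog_author_id]
    @name = blog_hash[:blog_author]
    @entries = []
  end

  def num_of_entries
    @entries.size
  end

  def total_reads
    @entries.inject(0) { |sum, item| sum + item.num_of_reads }
  end

  # Integer division, so the average is truncated to a whole number of reads.
  def reads_per_entry
    return 0 if @entries.empty? # avoid dividing by zero
    total_reads / num_of_entries
  end
end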
To parse the Platinum Solutions blog, we need to make sure that the Scrubyt Ruby Gem is properly installed. For this example I actually modified the Scrubyt library to add some functionality, and I am hosting the code on my GitHub page (side note: GitHub is awesome and you should definitely host your open source project code there, but more on that in another blog post). To install the gem, make sure that you have GitHub set up as one of your gem source repositories:
gem sources -a http://gems.github.com
Then run the following gem install command:
gem install jspradlin-scrubyt
Now we're all set up and ready to start scraping. If we look closely at the Platinum Solutions blog, we can see that all of the information we are trying to gather is contained within the header of each blog entry.

If we view the source of the page, we will notice that the HTML responsible for generating this header looks like this:
<div class="entry-head">
  <h2 class="entry-title">
    <a href="/Book-Review-Programming-Groovy" ...>
      Book Review: Programming Groovy</a>
  </h2>
  <small class="entry-meta">
    <span class="chronodata">
      <abbr class="published" title="Fri, 2008-09-19 13:31">
        Fri, 2008-09-19 13:31</abbr>
    </span>
    ...
    <ul class="links inline">
      <li class="first blog_usernames_blog">
        <a href="/blog/72" ...>Justin Spradlin's blog</a>
      </li>
      ...
      <li class="last statistics_counter">
        <span class="statistics_counter">580 reads</span>
      </li>
    </ul>
  </small>
</div>
Finally, here is the Ruby code responsible for scraping this information from the website:
blog_url = "http://blog.platinumsolutions.com"
blog_data = Scrubyt::Extractor.define do
  fetch blog_url
  blog_entry '//div[@class="entry-head"]' do
    blog_title "//h2/a"
    blog_url "//h2/a/@href",
      :format_output => lambda {|x| blog_url + x }
    blog_date "//span/abbr",
      :format_output => lambda {|x| x.split(' ')[1]}
    blog_author "//ul/li/a[@class='blog_usernames_blog']",
      :format_output => lambda {|x| x.split('&')[0]}
    blog_author_id "//ul/li/a[@class='blog_usernames_blog']/@href",
      :format_output => lambda {|x| x.split('/')[2]}
    number_of_reads "//span[@class='statistics_counter']",
      :format_output => lambda {|x| x.split(' ')[0]}
  end
  next_page "//a[@title='Go to next page']"
end
If you look at the header, the HTML code, and the Ruby code, I've tried to color-coordinate each separate piece of information to illustrate how it is parsed and then stored (sorry for the horrible colors, they didn't turn out quite as expected). The Scrubyt library will "fetch" the given URL, locate the HTML elements matching each XPath, and then store the data in an Array of Hashes, using the name of each method call (blog_title, blog_url, etc.) as the key. Once a page has been completely scraped, the "next_page" method finds the URL for the next page, loads it, and parses that page as well, repeating until there are no pages left. That's it. Simple, right?
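To make the mapping from XPath to hash keys a little more concrete, here is a rough sketch of the same extraction done by hand with REXML, the XML library that ships with Ruby. This is not how Scrubyt works internally, and the helper name scrape_headers is something I made up for illustration. It runs against a trimmed copy of the header above with the elided parts left out, so a few things are assumptions: the author link is selected as the first link in the list (in the markup shown, the class sits on the li element), and the name is split on the apostrophe rather than on '&' because the sample text contains no HTML entity.

```ruby
require 'rexml/document'

# A trimmed, well-formed copy of the entry header shown earlier.
# Elided attributes and elements ("...") are simply left out.
HEADER_HTML = <<~HTML
  <div class="entry-head">
    <h2 class="entry-title">
      <a href="/Book-Review-Programming-Groovy">Book Review: Programming Groovy</a>
    </h2>
    <small class="entry-meta">
      <span class="chronodata">
        <abbr class="published" title="Fri, 2008-09-19 13:31">Fri, 2008-09-19 13:31</abbr>
      </span>
      <ul class="links inline">
        <li class="first blog_usernames_blog">
          <a href="/blog/72">Justin Spradlin's blog</a>
        </li>
        <li class="last statistics_counter">
          <span class="statistics_counter">580 reads</span>
        </li>
      </ul>
    </small>
  </div>
HTML

BASE_URL = 'http://blog.platinumsolutions.com'

# Hypothetical helper: returns one hash per entry header, keyed by the
# same names as the Scrubyt patterns above.
def scrape_headers(html, base_url)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, '//div[@class="entry-head"]').map do |head|
    link   = REXML::XPath.first(head, './/h2/a')
    author = REXML::XPath.first(head, './/ul/li/a')
    date   = REXML::XPath.first(head, './/span/abbr')
    reads  = REXML::XPath.first(head, ".//span[@class='statistics_counter']")
    {
      blog_title:      link.text.strip,
      blog_url:        base_url + link.attributes['href'],
      blog_date:       date.text.split(' ')[1],           # "2008-09-19"
      blog_author:     author.text.split("'")[0],         # drop "'s blog"
      blog_author_id:  author.attributes['href'].split('/')[2],
      number_of_reads: reads.text.split(' ')[0]           # drop " reads"
    }
  end
end

entries = scrape_headers(HEADER_HTML, BASE_URL)
```

Each element of entries is a plain Hash, which is the shape the BlogEntry and BlogEntryAuthor constructors expect. Real pages are rarely well-formed XML, so a forgiving HTML parser would be needed outside of a toy example like this one.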
When we have all of the data we can shuffle it around a bit to conform to the data model we defined above by executing the following code:
blog_entries = {}
blog_data.to_hash.each do |bh|
  author_id = bh[:blog_author_id]
  # move on if there is no author
  next unless author_id
  # create the author the first time we see this id
  blog_entries[author_id] ||= BlogEntryAuthor.new(bh)
  blog_entries[author_id].entries << BlogEntry.new(bh)
end
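If it helps to see that reshaping step run on its own, here is a small sketch with hand-written input. Entry and Author are hypothetical Struct stand-ins for the BlogEntry and BlogEntryAuthor classes, and only the first sample row comes from the header shown earlier; the other two rows are invented purely to exercise the grouping and the "no author" case.

```ruby
# Stand-ins for the BlogEntry and BlogEntryAuthor classes defined earlier,
# reduced to Structs so this snippet runs by itself.
Entry  = Struct.new(:title, :num_of_reads)
Author = Struct.new(:id, :name, :entries)

# Sample hashes in the shape the extractor is described as producing.
# Only the first row is real; the second and third are made up.
scraped = [
  { blog_author_id: '72', blog_author: 'Justin Spradlin',
    blog_title: 'Book Review: Programming Groovy', number_of_reads: '580' },
  { blog_author_id: '72', blog_author: 'Justin Spradlin',
    blog_title: 'Invented sample post', number_of_reads: '100' },
  { blog_author_id: nil, blog_author: nil,
    blog_title: 'Invented entry with no author', number_of_reads: '5' }
]

authors = {}
scraped.each do |row|
  id = row[:blog_author_id]
  next unless id # skip entries with no author
  authors[id] ||= Author.new(id, row[:blog_author], [])
  authors[id].entries << Entry.new(row[:blog_title], row[:number_of_reads].to_i)
end

total_reads = authors['72'].entries.sum(&:num_of_reads)
```

The result is a Hash keyed by author id, so looking up an author and appending an entry never requires scanning a list.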
Once the data is in our defined format, we can find out some interesting information about the Platinum Solutions blog. For example, we can get a list of the all-time top contributors among current employees:
Top 10 Contributors of All Time (By Number of Entries)
| Name | Entries | Total Reads | Reads/Entry |
|---|---|---|---|
| Mike McKinney | 10 | 21172 | 2117 |
| Rick Witter | 10 | 25065 | 2506 |
| Duane Taylor | 10 | 11949 | 1194 |
| Mike Marmen | 8 | 20427 | 2553 |
| Brian Rosenthal | 8 | 13527 | 1690 |
| Justin Spradlin | 6 | 37180 | 6196 |
| Christopher Pierce | 6 | 56474 | 9412 |
| Randy Avis | 6 | 6879 | 1146 |
| William Hunt | 5 | 8402 | 1680 |
| John Howard | 4 | 52224 | 13056 |
| Maliha Nowrouz | 4 | 6756 | 1689 |
| Bunni Bates | 4 | 11651 | 2912 |
We can also get a rough sense of the quality of each author's posts by sorting the authors by the number of reads per entry:
Top 10 Contributors of All Time (By Reads per Entry)
| Name | Entries | Total Reads | Reads/Entry |
|---|---|---|---|
| John Howard | 4 | 52224 | 13056 |
| Christopher Pierce | 6 | 56474 | 9412 |
| Justin Spradlin | 6 | 37180 | 6196 |
| Bunni Bates | 4 | 11651 | 2912 |
| Bob Barry | 2 | 5544 | 2772 |
| Mike Marmen | 8 | 20427 | 2553 |
| Rick Witter | 10 | 25065 | 2506 |
| Ryan Hamerski | 3 | 7077 | 2359 |
| Mike McKinney | 10 | 21172 | 2117 |
| Brian Rosenthal | 8 | 13527 | 1690 |
To be sure, there are a number of factors that could contribute to how many reads a blog entry gets, so these numbers should be taken with a grain of salt.
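For anyone curious how rankings like the two above can be produced, here is a minimal sketch using sort_by. AuthorStats is a hypothetical stand-in for BlogEntryAuthor that holds precomputed totals, and the four rows are copied from the tables above. It is one way to do the sorting, not necessarily the code behind the tables.

```ruby
# Hypothetical stand-in for BlogEntryAuthor with precomputed totals.
AuthorStats = Struct.new(:name, :entries, :total_reads) do
  # Integer division, matching the truncated figures in the tables.
  def reads_per_entry
    entries.zero? ? 0 : total_reads / entries
  end
end

# A few rows copied from the tables above.
stats = [
  AuthorStats.new('Mike McKinney', 10, 21172),
  AuthorStats.new('Justin Spradlin', 6, 37180),
  AuthorStats.new('Christopher Pierce', 6, 56474),
  AuthorStats.new('John Howard', 4, 52224)
]

# Negating the key sorts in descending order.
by_entries = stats.sort_by { |a| -a.entries }
by_reads   = stats.sort_by { |a| -a.reads_per_entry }

# Print the reads-per-entry ranking as table rows.
by_reads.first(10).each do |a|
  puts format('| %s | %d | %d | %d |', a.name, a.entries, a.total_reads, a.reads_per_entry)
end
```

Note that sort_by makes no promise about the order of authors with the same number of entries, so a second sort key would be needed to make ties deterministic.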
As far as 2009 is concerned, there have been only a few entries:
Top Contributors in 2009
| Name | Entries | Total Reads | Reads/Entry |
|---|---|---|---|
| Christopher Pierce | 1 | 1517 | 1517 |
| Randy Avis | 1 | 373 | 373 |
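Narrowing the data down to a single year is just a filter on each entry's date. Here is a small sketch, assuming the date is stored as a Date object as in the BlogEntry class; Post is a hypothetical stand-in for that class, and the second sample post is invented for illustration.

```ruby
require 'date'

# Hypothetical stand-in for BlogEntry, reduced to the fields needed here.
Post = Struct.new(:title, :date, :num_of_reads)

posts = [
  # Taken from the header shown earlier.
  Post.new('Book Review: Programming Groovy', Date.parse('2008-09-19'), 580),
  # Invented sample entry so that the filter has something to keep.
  Post.new('Invented 2009 post', Date.parse('2009-01-15'), 42)
]

# Keep only the entries published in 2009.
posts_2009 = posts.select { |post| post.date.year == 2009 }
```

Running the same filter over each author's entries before counting them gives per-author totals for the year.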
Since there have been only a few blog posts so far this year, we are all pretty much on a level playing field. I'd like to extend the challenge to the rest of my co-workers to see who can write the most blog entries this year and who can get the most reads per entry. I'll re-run the script at the end of the year and announce the winners in another blog post.
To download and/or view the code examples from this blog post please click here.
Comments
I'm very new to coding and have never used Ruby, but the way you have explained everything and put it makes perfect sense to me.
I'm gonna give Ruby a try
Great read!