Platinum Solutions Corporate Website


Ruby, Screen Scraping, and a Little Friendly Competition

I've been doing quite a bit of Ruby programming lately and I'm always surprised by how much I can get done with such a small amount of code. Ruby is not my primary development language, but for many of the things I need to do, Ruby is the perfect tool for the job.

For instance, I'm in the middle of looking for a home to buy, so I've spent a lot of time on real estate websites waiting for the newest listings in my area to become available. Logging on to each real estate site every day is a repetitive and time-consuming task. Worse still, there typically aren't many new listings in a given week. With Ruby, I've been able to write some small screen scraper applications that log on to each site, check for new listings, and report the results back to me. This has saved me a lot of time and effort.

Since most web pages are simply made of HTML, it is easy for a computer to parse and store the information contained within these documents. Most programming languages have a host of libraries to assist in the screen scraping and parsing process, and Ruby is no exception. To create simple screen scrapers in Ruby I have been using a library called Scrubyt. Scrubyt provides a simple DSL to access a given website and scrape its content. All the programmer needs to do is provide the XPath expression that locates the desired information.

As an example, let's say that I want to extend a friendly challenge to all of my co-workers to see who can write the most blog entries in 2009. Some of the information I would need to grab would be a list of my co-workers who have contributed to the blog, a list of their blog entries, and the date their blog entries were written.

Without a doubt, all of this information is stored in the database that manages the content of this blog and could easily be retrieved with a few simple SQL queries. The problem, however, is that I don't have access to the database. This is the beauty of screen scraping: any web page that is exposed over a network can be scraped by a computer program.

To see how this is done, let's dive into some code.

To begin with, we'll need to create some simple classes to store the information that we scrape from the Platinum Solutions Blog. We'll start with a BlogEntry class that will store the url, title, date, and number of reads for each blog entry:

require 'date'

class BlogEntry
  attr_accessor :url, :title, :date, :num_of_reads

  # blog_hash is one scraped entry, keyed by the pattern names used in the scraper
  def initialize(blog_hash)
    @url = blog_hash[:blog_url]
    @title = blog_hash[:blog_title]
    @date = Date.parse(blog_hash[:blog_date])
    @num_of_reads = blog_hash[:number_of_reads].to_i
  end
end
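
The two conversions in the constructor can be tried on their own. This is a minimal sketch, not part of the scraper: the hash is a hand-built stand-in for one scraped entry, filled with the sample date and read count that appear in the blog header later in this post.

```ruby
require 'date'

# One scraped entry as a Hash of raw strings (hand-built stand-in).
sample = { :blog_date => "2008-09-19", :number_of_reads => "580" }

date  = Date.parse(sample[:blog_date])   # String -> Date
reads = sample[:number_of_reads].to_i    # String -> Integer

puts date.year
puts reads
```

Storing a real Date rather than a string is what later makes it easy to pick out the entries written in a given year.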

Next, we'll store information about the blog author including a unique id, a name, and an Array of entries written by that author. We'll store this information in a class called BlogEntryAuthor:

class BlogEntryAuthor
  attr_accessor :id, :name, :entries
 
  def initialize(blog_hash)
    @id = blog_hash[:blog_author_id]
    @name = blog_hash[:blog_author]
    @entries = []
  end
 
  def num_of_entries
    @entries.size
  end
 
  def total_reads
    @entries.inject(0){|sum,item| sum + item.num_of_reads}
  end
 
  def reads_per_entry
    # integer division; an author with no entries has no reads per entry
    return 0 if @entries.empty?
    total_reads / num_of_entries
  end
end
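
The arithmetic behind total_reads and reads_per_entry can be sketched with a plain Array of read counts. Only the 580 comes from the blog header shown later; the other two counts are hypothetical values chosen for illustration.

```ruby
# Read counts for three entries: 580 is from the sample header,
# 1200 and 345 are hypothetical.
reads = [580, 1200, 345]

# inject starts the running sum at 0 and adds each count in turn
total = reads.inject(0) { |sum, n| sum + n }

# Integer / Integer truncates toward zero for positive values
per_entry = total / reads.size

# Converting one operand to Float keeps the fraction
per_entry_exact = total.to_f / reads.size

puts total
puts per_entry
puts per_entry_exact.round(1)
```

Because both operands are Integers, reads_per_entry reports whole reads and drops any fraction, which is why the Reads/Entry column in the tables below contains no decimals.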

To parse the Platinum Solutions blog, we need to make sure that the Scrubyt Ruby gem is properly installed. For this example I modified the Scrubyt library to add some functionality, and I am hosting the code on my GitHub page (side note: GitHub is awesome and you should definitely host your open source project code there, but more on that in another blog post). To install the gem, make sure that you have GitHub set up as one of your gem source repositories:

gem sources -a http://gems.github.com

Then run the following gem install command:

gem install jspradlin-scrubyt

Now we're all set up and ready to start scraping. If we look closely at the Platinum Solutions blog we can see that all of the information we are trying to gather is contained within the header of each blog entry:

[Image: header of a Platinum Solutions blog entry]

If we view the source of the page, we can see that the HTML responsible for generating this header looks like this:

<div class="entry-head">
  <h2 class="entry-title">
    <a href="/Book-Review-Programming-Groovy" ...>
      Book Review: Programming Groovy</a>
  </h2>
  <small class="entry-meta">
    <span class="chronodata">
      <abbr class="published" title="Fri, 2008-09-19 13:31">
        Fri, 2008-09-19 13:31</abbr>
    </span>
    ...
    <ul class="links inline">
      <li class="first blog_usernames_blog">
        <a href="/blog/72" ...>Justin Spradlin's blog</a>
      </li>
      ...
      <li class="last statistics_counter">
        <span class="statistics_counter">580 reads</span>
      </li>
    </ul>
  </small>
</div>

Finally, we can see the Ruby code responsible for scraping this information from the website:

base_url = "http://blog.platinumsolutions.com"

blog_data = Scrubyt::Extractor.define do
  fetch base_url

  # one blog_entry result per entry header on the page
  blog_entry '//div[@class="entry-head"]' do
    blog_title "//h2/a"
    # the href is site-relative, so prepend the base URL
    blog_url "//h2/a/@href",
      :format_output => lambda {|x| base_url + x }
    # keep only the date portion of the timestamp
    blog_date "//span/abbr",
      :format_output => lambda {|x| x.split(' ')[1]}
    blog_author "//ul/li/a[@class='blog_usernames_blog']",
      :format_output => lambda {|x| x.split('&')[0]}
    # the author id is the last segment of the author's blog path
    blog_author_id "//ul/li/a[@class='blog_usernames_blog']/@href",
      :format_output => lambda {|x| x.split('/')[2]}
    number_of_reads "//span[@class='statistics_counter']",
      :format_output => lambda {|x| x.split(' ')[0]}
  end

  # follow the pager until there are no more pages
  next_page "//a[@title='Go to next page']"
end

Across the header, the HTML code, and the Ruby code, I've tried to color coordinate each separate piece of information to illustrate how it is parsed and then stored (sorry for the horrible colors, they didn't turn out quite as expected). The Scrubyt library will "fetch" the given URL, locate the HTML elements matching each XPath, and store the data in an Array of Hashes, using the name of each method call (blog_title, blog_url, etc.) as the key. Once a page has been completely scraped, the "next_page" method finds the URL for the next page, loads it, and parses that page in turn, repeating until there are no pages left. That's it. Simple, right?
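
To see what the :format_output lambdas do to the raw scraped text, they can be run on their own against the sample values from the header HTML above. This is a minimal sketch in plain Ruby: it does not call Scrubyt at all, and the variable names (base_url, to_url, to_date, and so on) are mine, chosen for illustration.

```ruby
base_url = "http://blog.platinumsolutions.com"

# The same lambda bodies that are passed as :format_output in the scraper
to_url       = lambda { |x| base_url + x }        # relative href -> absolute URL
to_date      = lambda { |x| x.split(' ')[1] }     # keep the date, drop weekday and time
to_author_id = lambda { |x| x.split('/')[2] }     # "/blog/<id>" -> "<id>"
to_reads     = lambda { |x| x.split(' ')[0] }     # "<n> reads" -> "<n>"

# Raw values copied from the sample header HTML
url       = to_url.call("/Book-Review-Programming-Groovy")
date      = to_date.call("Fri, 2008-09-19 13:31")
author_id = to_author_id.call("/blog/72")
reads     = to_reads.call("580 reads")

puts url, date, author_id, reads
```

Note that every value is still a String at this point; the conversion to Date and Integer happens later, in the BlogEntry constructor.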

When we have all of the data we can shuffle it around a bit to conform to the data model we defined above by executing the following code:

blog_entries = {}

blog_data.to_hash.each do |bh|
  author_id = bh[:blog_author_id]
  # move on if there is no author
  next unless author_id
  # create the author the first time we see this id
  blog_entries[author_id] ||= BlogEntryAuthor.new(bh)
  blog_entries[author_id].entries << BlogEntry.new(bh)
end
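
The grouping step can also be tried without Scrubyt or the classes above. The sketch below swaps in Enumerable#group_by on plain hashes, which is a different technique from the loop in this post but produces the same shape of result: one bucket of entries per author id. The author id "72" and the first title come from the sample header; the other two rows are hypothetical.

```ruby
# Stand-in for blog_data.to_hash: an Array of Hashes keyed by pattern name
rows = [
  { :blog_author_id => "72", :blog_title => "Book Review: Programming Groovy" },
  { :blog_author_id => nil,  :blog_title => "Hypothetical entry with no author" },
  { :blog_author_id => "72", :blog_title => "Hypothetical second entry" }
]

# Drop rows without an author, then bucket the rest by author id
entries_by_author = rows.select { |r| r[:blog_author_id] }
                        .group_by { |r| r[:blog_author_id] }

entries_by_author.each do |author_id, entries|
  puts "#{author_id}: #{entries.size} entries"
end
```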

When the data is in our defined format, we can find out some interesting information about the Platinum Solutions blog. For example, we can get a list of the all-time top contributors among current employees:

Top 10 Contributors of All Time (By Number of Entries)

Name                Entries  Total Reads  Reads/Entry
Mike McKinney            10        21172         2117
Rick Witter              10        25065         2506
Duane Taylor             10        11949         1194
Mike Marmen               8        20427         2553
Brian Rosenthal           8        13527         1690
Justin Spradlin           6        37180         6196
Christopher Pierce        6        56474         9412
Randy Avis                6         6879         1146
William Hunt              5         8402         1680
John Howard               4        52224        13056
Maliha Nowrouz            4         6756         1689
Bunni Bates               4        11651         2912

We can also get a rough sense of the quality of each author's posts by sorting the authors by the number of reads per entry:

Top 10 Contributors of All Time (By Reads per Entry)

Name                Entries  Total Reads  Reads/Entry
John Howard               4        52224        13056
Christopher Pierce        6        56474         9412
Justin Spradlin           6        37180         6196
Bunni Bates               4        11651         2912
Bob Barry                 2         5544         2772
Mike Marmen               8        20427         2553
Rick Witter              10        25065         2506
Ryan Hamerski             3         7077         2359
Mike McKinney            10        21172         2117
Brian Rosenthal           8        13527         1690
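
Both rankings come down to sorting the authors on a different key. Here is a minimal sketch of that idea, assuming a small Struct named Author as a stand-in for BlogEntryAuthor (the Struct and the variable names are mine) and using three rows taken from the tables:

```ruby
# Stand-in for BlogEntryAuthor with precomputed totals
Author = Struct.new(:name, :num_of_entries, :total_reads) do
  def reads_per_entry
    total_reads / num_of_entries   # integer division, as in the tables
  end
end

authors = [
  Author.new("Mike McKinney",      10, 21172),
  Author.new("John Howard",         4, 52224),
  Author.new("Christopher Pierce",  6, 56474)
]

# Negating the key sorts in descending order
by_entries = authors.sort_by { |a| -a.num_of_entries }
by_reads   = authors.sort_by { |a| -a.reads_per_entry }

# first(10) caps the list at ten rows
by_reads.first(10).each do |a|
  puts format("%-20s%7d  %11d  %11d",
              a.name, a.num_of_entries, a.total_reads, a.reads_per_entry)
end
```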

To be sure, there are a number of factors that could contribute to how many reads a blog entry gets, so these numbers should be taken with a grain of salt.

As far as 2009 is concerned there have only been a few entries:

Top Contributors in 2009

Name                Entries  Total Reads  Reads/Entry
Christopher Pierce        1         1517         1517
Randy Avis                1          373          373
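
Narrowing the results to a single year only needs the Date stored on each entry. A minimal sketch, using plain hashes as stand-ins for BlogEntry objects: the first entry uses the title and date from the sample header, and the second entry is hypothetical.

```ruby
require 'date'

entries = [
  { :title => "Book Review: Programming Groovy", :date => Date.parse("2008-09-19") },
  # hypothetical entry, included only so the filter has something to keep
  { :title => "Hypothetical 2009 entry",         :date => Date.parse("2009-01-15") }
]

# Keep only the entries whose date falls in 2009
entries_2009 = entries.select { |e| e[:date].year == 2009 }

entries_2009.each { |e| puts "#{e[:date]} #{e[:title]}" }
```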

Since there have only been a few blog posts so far this year we are all pretty much on a level playing field. I'd like to extend the challenge to the rest of my co-workers to see who can write the most blog entries this year and who can get the most reads per entry. I'll re-run the script at the end of the year and announce the winners in another blog post.

To download and/or view the code examples from this blog post please click here.

Comments

Gray (not verified) Wed, 1969-12-31 19:00

I'm very new to coding and have never used Ruby, but the way you have explained everything and put it makes perfect sense to me.

I'm gonna give Ruby a try

Great read!
