Close scrapes with Ruby / Mechanize
A few years ago I agreed to help with a side project that needed to scrape some websites and also talk to some webservices.
Various front ends would generate the requests and my process would go through the db and process them.
The core of the engine was not a webapp – but Rails/ActiveRecord (AR) provided a good way of interacting with MySQL.
I tried a few threading strategies but seemed to hit problems combining db access with MRI threading. In retrospect, I think the issues were more of my own making: overlapping threads and poor design.
Moving to JRuby seemed to address some of the threading issues.
I initially used the Parallel gem, which seemed to do largely what I wanted. However, I was still getting AR issues, so I switched to JRuby. It then seemed more appropriate to use a Java-based parallelisation gem, so I went for ActsAsExecutor, which is a thin wrapper around the Java concurrency features. This was used to manage a pool of threads that handle the scraping and webservice calls:
[code]
ActsAsExecutor::Executor::Factory.create 15, false # 15 max threads, schedulable-false?
[/code]
To poll for any work to be done, the Rufus scheduler gem was used. It was set up to check for pending records every few seconds, like so:
[code]
scheduler = Rufus::Scheduler.start_new
scheduler.every "2s", :allow_overlapping => false do
  # do some work
end
[/code]
This was kicked off in a Rails initializer.
One mistake I made was to leave out the allow_overlapping flag at first. Without it, if any job was slow, the next one would start anyway and try to do the same work again.
Another trick which I think helped was to wrap the sections that access the db like so:
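To show the overlap problem in isolation, here is a minimal sketch using only Ruby's standard library. It is not how Rufus implements the flag; `NonOverlappingJob` and its `tick` method are hypothetical names made up for this illustration. The idea is simply that a tick which arrives while the previous run is still busy gets skipped rather than started alongside it.

```ruby
# Hypothetical stand-in for a scheduler's "no overlapping runs" guard.
# Not Rufus code: just the idea, using a Mutex from the standard library.
class NonOverlappingJob
  attr_reader :runs, :skipped

  def initialize(&work)
    @work = work
    @run_lock = Mutex.new      # held for the duration of one run
    @counter_lock = Mutex.new  # protects the two counters
    @runs = 0
    @skipped = 0
  end

  # Called on every scheduler tick.
  # Returns true if the work ran, false if it was skipped.
  def tick
    unless @run_lock.try_lock
      # A previous run is still in progress, so do nothing this time.
      @counter_lock.synchronize { @skipped += 1 }
      return false
    end

    begin
      @work.call
      @counter_lock.synchronize { @runs += 1 }
      true
    ensure
      @run_lock.unlock
    end
  end
end
```

With a guard like this, a slow run just causes a few ticks to be missed; the pending records are picked up on the next tick that does get through.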
[code]
ActiveRecord::Base.connection_pool.with_connection do
# db work…
end
[/code]
The scheduled task analyses each request and generates work items to be done in the thread pool. Another mistake I made was to set the work items' status to NEW and then, separately, re-query the db for NEW items to queue up for the thread pool. Their status only advanced once they were picked off the queue, which left a window in which a subsequent scheduled run could re-queue the same items. The fix was to not re-query the db: I am generating the items, so I already know them without going to the db. Thus each run only works on the items that it generated itself.
Each task in the thread pool used the above trick to re-connect to the db and then loaded its work item.
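The shape of that fix can be sketched with plain Ruby threads and a `Queue`. This is only an illustration under stated assumptions: `WorkItem`, `generate_work_items` and `process_request` are hypothetical names, the items live in memory rather than in ActiveRecord, and the real code used the Java-backed pool rather than raw threads. The point is that the run enqueues exactly the objects it just created, so there is no "find all NEW items" query for a later run to trip over.

```ruby
# Hypothetical in-memory stand-in for a work item record.
WorkItem = Struct.new(:id, :status)

# Pretend analysis step: turn one request into several work items.
def generate_work_items(request_id, count)
  Array.new(count) { |i| WorkItem.new("#{request_id}-#{i}", :new) }
end

# Generate the items and hand those same objects straight to the workers.
# Nothing here re-queries for "all items with status NEW".
def process_request(request_id, count, pool_size: 3)
  items = generate_work_items(request_id, count)

  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # after the last item, pop returns nil and workers stop

  workers = Array.new(pool_size) do
    Thread.new do
      while (item = queue.pop)
        item.status = :done # stand-in for the scrape or webservice call
      end
    end
  end
  workers.each(&:join)

  items
end
```

Calling `process_request("req1", 5)` returns the five items that this call created, each marked `:done` by one of the worker threads.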
To keep the various web scraping and webservice calls separate, the code for each one is held in a text field in the db and loaded dynamically as each call is required. It is then instance_eval'd into the work item object, so it has access to the work item's details.
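As a rough sketch of the instance_eval approach: the class name `ScriptedWorkItem`, its `target_url` attribute and the `SCRIPT_FROM_DB` constant are all assumptions for illustration, and the script text here is a literal string rather than something read from a db column.

```ruby
# Hypothetical work item. In the real system the script text came from a
# text field in the db; here it is passed in as a plain string.
class ScriptedWorkItem
  attr_reader :target_url, :result

  def initialize(target_url, script_source)
    @target_url = target_url
    @script_source = script_source
  end

  def run
    # instance_eval evaluates the string with self set to this object, so
    # the script can call target_url and assign @result directly.
    # The second argument is only a label used in backtraces.
    instance_eval(@script_source, "(work item script)")
    self
  end
end

# Stand-in for the stored script. The single-quoted heredoc keeps the
# interpolation unevaluated until the script itself runs.
SCRIPT_FROM_DB = <<~'RUBY'
  @result = "would fetch #{target_url}"
RUBY
```

Running `ScriptedWorkItem.new("http://example.com/a", SCRIPT_FROM_DB).run.result` gives the string built inside the script. One thing to keep in mind with this design is that anyone who can write to that db field can run arbitrary Ruby in the process.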
There are broadly two kinds of work items: web scrapers and webservice calls. The scraping is done via Mechanize and the webservices via Savon.
For Mechanize, the various pages/frames/forms are navigated to achieve the desired results.
For Savon, the message body is constructed and the service called.
The results are then saved back to the db.