So you have a million objects in memory, you’ve got them in a Map – so you can access them by key very quickly, but now you want to find all that match on another field. Well, they are in memory, so you can go through them all. However that is going to be slow.
[code]
static Collection
You could stick them in a DB (like MySQL or MongoDB), however it will still be pretty slow to query and convert to/from, or you could write your own query engine from scratch. Admittedly it's optimised for that, but it's still a separate process and the data has been converted into your
If only you could index the collection by that other field…
Well you can – just put the collection into a Boon Repo and add an index on that field and it will take care of ensuring the index is kept up to date.
[code]
Repo<Integer, Map> dbRepo = Repos.builder()
    .primaryKey("name")
    .lookupIndex("colour")
    .build(int.class, Map.class);
…
dbRepo.addAll(database);
…
List
For these simple queries the search changes from O(n) to O(1) – bucket loads quicker!
It provides a rich set of criteria facilities that allows you to query in a variety of ways – equality, in, between, etc and these can be composed using and/or …
[code]
List
It effectively provides a searchable in-memory object database. You define the indexes you want to use, and they don't have to be unique.
– it's pure Java (1.7), although I have a 1.6 backport.
Ideal for the case where you have low frequency updates but lots of varied queries.
It's a little quirky when it comes to updating items. To update an item, you need to either delete and re-add it, or get the existing object, clone it, amend it and re-add it.
[code]
managers = dbRepo.query(eq("job", "manager"));
Map updatedMgr = Maps.copy(managers.get(0)); // Maps.copy is a boon cloning util
updatedMgr.put("colour", "blue");
dbRepo.modify(updatedMgr);
[/code]
I have found performance to vary somewhat on more complex queries – but I am sure that is something that will improve greatly.
As with all libraries, you need to use it and see if it helps with what you need.
Links to similar data repo articles
– Boon’s Data Repo by Example
– Boon Data Repo Indexed Collections and more
– Background on Data Repo
– What if Java Collections and Java Hierarchies were Easily Searchable?
– Unrelated, but… Boon beer
Thanks to Rick Hightower for giving us this boon!
A few weeks ago I attended Uncle Bob’s TDD and Refactoring course – here are my highlights…
Number one for me was this quote: “TDD is double entry bookkeeping for programmers!”
Skills you need when looking at code:
– being able to identify problems
– knowing what's better
– transforming in small steps towards the better version
When the code makes you do something ugly – the design is wrong!
An interesting point of presentation style was the intermission discussions, which were completely unrelated. A few of us were trying to work out how they related to the course, until someone asked and found out they didn't :)
Thanks again, Mr Martin.
As mentioned previously, it's live!
The app is built on nodejs, using coffeescript with a MongoDB database.
I chose coffeescript, as I prefer the Ruby-like syntax and really don't like JavaScript's curly braces. I also found NodeJS to be very fast and lightweight. That said, the callback model in node takes some getting used to.
Libraries like async help make the callbacks more manageable and avoid the pyramid of death.
I used the rapidly growing expressjs framework – which was great at keeping out of your way and letting you just do what you need.
HTML was put together via Jade templates and Stylus stylesheet helper.
The core of the app accesses the Armory via a great little library – node-armory.
The data from the armory is saved directly as JSON to MongoDB and, as further updates come in, the jsondiffpatch library is used to determine what's changed.
There are some tests written with mocha and sinon for stubs/mocks. My testing style is to use them when there is a problem – so they are probably broken at the moment.
I found a few features missing with the node-armory library and so forked the project to address them, such as support for using a proxy (for debug purposes) and using Mike Reinstein’s version of the http lib request that supports compressed request/replies.
See package.json for details of all the libraries used.
A couple of MongoDB’s features that came in handy, like:
Definitely the worst bit of the code is the RSS feed entry formatting – need to think of some ways to refactor it sensibly.
To make the search feature nice and responsive, backbone was used in a very basic way.
The main thing I missed from Rails was the asset pipeline, which lets you combine all the client side assets (javascript and css files) into just a few minified files.
One of the most interesting parts of the app (at least to me) is the scheduled job that checks the armory for updates. Its core uses an async queue to kick off many calls to the WoW API and collect the results.
There have been quite a few Rails issues over the last few weeks… so I had better upgrade mine too :( Note I use rvm with gemsets to separate gem versions. I could probably stop using gemsets given the latest bundler, but I haven't got on that bandwagon yet.
link Not currently live on a public site, but a good basic example to test the upgrade. This is currently Rails 3.2.8 (Ruby 1.9.3/sqlite), which is not too far behind the latest, 3.2.11, so it should be easy…
First step: rvm implode and re-install. I have lots of rvm/gemset cruft and now seems like a good time to tidy that up. Then run bundle to get the gems that should currently work. Then I realised that rvm wasn't loaded properly, so the gem installs did not go into my gemset. I restarted the terminal session and ran bundle again (I probably could have source'd rvm, but this way is probably better).
Make sure the db is up to date (rake db:migrate), then try running the app (rails server) [yup, no tests…]. It seems to be working fine. I thought for a moment I was done, but then realised I had not upgraded Rails yet. Doh!
Run bundle update and now I am on Rails 3.2.11. Let's try again: rake db:migrate; rails server. It seems to be working OK. Checked the log files: one error about binary data in a string field (encrypted_password), but we had that previously. I also used the rails_admin gem for a quick built-in db viewer, and that's working OK too. Time to see if it works on Heroku too…
link Largely a javascript based site – but does use Devise, so that upgrade might make it problematic. Rails 3.2.1 (ruby 1.9.2/sqlite/postgres)
gem 'twitter', :git => 'https://github.com/sferik/twitter.git', :tag => 'v1.6.0'
Largely a backend site, but it has a few public URLs with stats on. It uses Mechanize and Savon, which will probably need updating too.
Not deployed, so perhaps leave… last version used – 3.0.9 …
Other useful links:
Hi,
A few years ago, the WoW Armory had a way (unofficially) to access a feed of updates for your in game character, eg gaining a level etc. (or so I seem to remember, but maybe my mind is playing tricks…).
Then along came the new Armory site and that feed wasn’t available anymore :( ..
Last year, Blizzard came out with an API to access WoW character progress data and it was thought that an official feed would be produced. However 12 months later, there is no sign of a feed.
Recently I wanted such a feed, had a look around and, finding none, decided to write one. And it's now live http://wowactivity.kimptoc.net/ :)
As users of the WoW API have to be open source – the code is here on github.
The site provides both character and guild RSS feeds. They are largely based on the news/feed items that come with character and guild lookups, but changes are also tracked, eg when a level changes. The RSS feed can be used in many ways, from piping it into a feed reader to sending it to twitter, facebook or guild websites.
It's currently tracking over 150 characters and guilds.
It's built using coffeescript (aka javascript) and MongoDB. I will probably do a separate post on that.
Enjoy!
Chris
There is (still) a general issue with timeouts in Ruby, relating to the use of Thread.kill etc, see this link for details.
Unfortunately it's an issue in JRuby too. The solutions suggested in the link are quite low-level (and no higher-level option seems to be exposed).
I have some code that uses Net::HTTP that we need to set timeouts on (we need to know if the other end has a problem) which under load, hits the above issue – we start running out of resources (too many open files). Being a lazy/pragmatic programmer and having the advantage of working with JRuby – I decided to cheat and switch to a Java library that solves the timeout problem properly. Meet my friend, httpcomponents (formerly httpclient)
I tried to use it directly in Ruby, but that gets messy, so I wrapped it up a bit:
[code]
package com.x;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.ClientContext;
import org.apache.http.conn.params.ConnRoutePNames;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.CoreConnectionPNames;
import org.apache.http.params.HttpParams;
import org.apache.http.params.SyncBasicHttpParams;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/**
 * User: kimptonc
 * Date: 09/01/13
 * Time: 12:05
 * Thin wrapper around HttpClient lib – used by the ruby code
 */
public class HttpClient {
    private String host;
    private int port;
    private String user;
    private String pass;
    private DefaultHttpClient httpclient;
    private HttpHost targetHost;
    private BasicAuthCache authCache;
    private int conn_timeout;
    private int so_timeout;
    private static Object lock = new Object();

    public HttpClient(String host, int port, String user, String pass, int conn_timeout, int so_timeout) {
        this.host = host;
        this.port = port;
        this.user = user;
        this.pass = pass;
        this.conn_timeout = conn_timeout;
        this.so_timeout = so_timeout;
        while (httpclient == null)
        {
            synchronized (lock) {
                connect();
            }
        }
    }

    public Map post(String url, String requestBody) throws Exception
    {
        try {
            // reconnect if a previous failure shut the client down
            while (httpclient == null)
            {
                synchronized (lock) {
                    connect();
                }
            }
            HttpPost post = new HttpPost(url);
            StringEntity requestEntity = new StringEntity(requestBody);
            requestEntity.setContentType("application/xml");
            post.setEntity(requestEntity);
            // Add AuthCache to the execution context
            BasicHttpContext localcontext = new BasicHttpContext();
            localcontext.setAttribute(ClientContext.AUTH_CACHE, authCache);
            HttpResponse response = httpclient.execute(targetHost, post, localcontext);
            HttpEntity entity = response.getEntity();
            Map responseMap = new HashMap();
            String responseBody = EntityUtils.toString(entity);
            responseMap.put("body", responseBody);
            responseMap.put("status", response.getStatusLine());
            EntityUtils.consume(entity); // needed?
            return responseMap;
        } catch (Exception e) {
            // on any failure, release the connection resources and force a reconnect next time
            httpclient.getConnectionManager().shutdown();
            httpclient = null;
            throw e;
        }
    }

    private void connect() {
        if (httpclient != null)
            return; // looks like we are connected, so ignore
        targetHost = new HttpHost(host, port, "http");
        HttpParams params = new SyncBasicHttpParams();
        params
            .setIntParameter(CoreConnectionPNames.SO_TIMEOUT, so_timeout)
            .setIntParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, conn_timeout);
        httpclient = new DefaultHttpClient(params);
        httpclient.getCredentialsProvider().setCredentials(
            new AuthScope(targetHost.getHostName(), targetHost.getPort()),
            new UsernamePasswordCredentials(user, pass));
        // Create AuthCache instance
        authCache = new BasicAuthCache();
        // Generate BASIC scheme object and add it to the local
        // auth cache
        BasicScheme basicAuth = new BasicScheme();
        authCache.put(targetHost, basicAuth);
        // for testing against a proxy in dev (eg Charles)
        // HttpHost proxy = new HttpHost("localhost", 8888);
        // httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY, proxy);
    }

    public static void main(String[] args) throws Exception {
        HttpClient hc = new HttpClient("localhost", 8080, "one", "two", 1, 1);
        Map resp = hc.post("/v1/bond", "<bond/>");
        System.out.println("Resp:" + resp);
    }
}
[/code]
Then, to use it in the Ruby code, it's just this:
[code]
@http_client = com.x.HttpClient.new(@svc_host, @svc_port.to_i,@svc_user,@svc_pass, open_timeout.to_i, read_timeout.to_i)
resp = @http_client.post @url_path, xml
[/code]
And it works. When we put this under load, we don't run out of resources :)
For the last few weeks I have been trying to tune a rails app… getting to the point where we are wondering which way to scale – bigger box or more boxes…
I have tried to summarise the problem on Stackoverflow here.
If you have any ideas/suggestions, please add them there.
Thanks in advance :)
A DSL is a Domain Specific Language, that is, a small language that should help frame solutions to particular problems.
In this case, we had some config for a report writer program that was parsed using custom code, things like so:
[code]
<report name>
{
FORMAT = CSV
RECORD = A.PATH.TO.RECORD
MODE = streaming
KEYFIELD = TradeId
FIELDS {
Id
TradeId
RecordType
Date
Time
}
WHERE Date = "20120924"
}
[/code]
The parser was getting hairier and we needed to add more features, such as more complex WHERE clauses. So, it seemed an opportunity to throw in a DSL. This is the new DSL format for the above:
[code]
report "<report name>" do
  format "CSV"
  record "A.PATH.TO.RECORD"
  mode "streaming"
  keyfield "TradeId"
  fields do
    column "Id"
    column "TradeId"
    column "RecordType"
    column "Date"
    column "Time"
  end
  where do
    Date == "20120924"
  end
end
[/code]
When I first saw things like this, it seemed like magic. Ruby must be doing really complex stuff to handle it, but it isn't :)
It's all down to defining methods with the above names and handling the blocks passed to them.
For example there is a “report” method, like so:
[code]
def report(name, &block)
  # save report name and &block of code for later use
end
[/code]
The “&block” bit is a little funky. It's a way of capturing the code passed in the “do … end” block above.
To handle the next level down, the code within the report block, I defined a class with those methods, and the block is evaluated (via instance_eval) in the context of an instance of that class.
[code]
class WriterDefInRuby < Java::com.WriterDefBase # to make things fun, the class needs to implement a Java interface :)
  def where(&block)
    @where_clause = block
  end
  def setup(&block)
    self.instance_eval &block
  end
  ...
end
[/code]
So, the above class is used like this:
[code]
def report(name, &block)
  writer_def = WriterDefInRuby.new
  writer_def.name = name
  writer_def.setup(&block)
end
[/code]
And that's it. We had a third level down for the fields stuff, but it's just more of the same: another class, with the block eval'd in the context of that class.
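To make the three levels concrete, here is a minimal, self-contained sketch of the same idea in plain Ruby. It is not the real report writer: `ReportDef` and `FieldsDef` are made-up stand-ins for the classes described above (there is no Java base class here), and the `where` block takes a hypothetical `row` hash argument just so that it can be called in the example.

```ruby
# Hypothetical stand-in for the third level: collects the column names.
class FieldsDef
  attr_reader :columns

  def initialize
    @columns = []
  end

  # Each `column "X"` line in the DSL appends to the list.
  def column(name)
    @columns << name
  end
end

# Hypothetical stand-in for the second level: one method per DSL keyword.
class ReportDef
  attr_reader :name, :settings, :columns, :where_clause

  def initialize(name)
    @name = name
    @settings = {}
    @columns = []
  end

  def format(value)
    @settings[:format] = value
  end

  def mode(value)
    @settings[:mode] = value
  end

  # Third level: eval the nested block against a FieldsDef instance.
  def fields(&block)
    fields_def = FieldsDef.new
    fields_def.instance_eval(&block)
    @columns = fields_def.columns
  end

  # The where block is only stored; it is called later, per record.
  def where(&block)
    @where_clause = block
  end
end

# First level: the top-level `report` method evals its block against a ReportDef.
def report(name, &block)
  report_def = ReportDef.new(name)
  report_def.instance_eval(&block)
  report_def
end

trades = report "trades" do
  format "CSV"
  mode "streaming"
  fields do
    column "Id"
    column "TradeId"
  end
  where do |row|
    row["Date"] == "20120924"
  end
end

puts trades.columns.inspect
puts trades.where_clause.call({ "Date" => "20120924" })
```

The only moving part is `instance_eval`: it runs the block with `self` set to the definition object, so bare words like `format` and `column` resolve to that object's methods.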
PS Thanks to several sites explaining DSLs in much better detail and several example gems, like Rose and Docile
I thought I’d try one of those link bait articles…
I am still in the “honeymoon” phase, where its benefits seem to outweigh the disadvantages. Let's see what happens over the next 6-12 months…
A few years ago I agreed to help with a side project that needed to scrape some websites and also talk to some webservices.
Various front ends would generate the requests and my process would go through the db and process them.
The core of the engine was not a webapp – but Rails/ActiveRecord (AR) provided a good way of interacting with MySQL.
I tried a few threading strategies, but hit (seemingly) issues with accessing the db and MRI threading, although in retrospect, I think the issues were more of my own making – overlapping threads/poor design.
Moving to JRuby seemed to address some of the threading issues.
I initially used the Parallel gem, which seemed to do largely what I wanted. However I was still getting AR issues, so I switched to JRuby. It then seemed more appropriate to use a Java based parallelisation gem, so I went for ActsAsExecutor, which is a thin wrapper around the Java concurrency features. This was used to manage a pool of threads that can handle the scraping/webservice calling.
[code]
ActsAsExecutor::Executor::Factory.create 15, false # 15 max threads, schedulable-false?
[/code]
To do the polling for any work to be done, the Rufus scheduler gem was used. It was set up to check for pending records every few seconds, like so:
[code]
scheduler = Rufus::Scheduler.start_new
scheduler.every "2s", :allow_overlapping => false do
  # do some work
end
[/code]
This was kicked off in a Rails initializer.
One mistake I made was to initially leave out the allow_overlapping flag, which meant that if any job was slow, the next one would be started and would try to do the same work again.
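To illustrate what that flag protects against, here is a small sketch in plain Ruby that does not use Rufus at all. `NonOverlappingJob` is a made-up helper, and the `Mutex#try_lock` guard is just one way to get "skip this run if the last one is still going"; I am not claiming this is how the scheduler implements it internally.

```ruby
# Hypothetical helper: skips a run if the previous one is still in progress.
class NonOverlappingJob
  attr_reader :runs, :skipped

  def initialize(&work)
    @work = work
    @lock = Mutex.new          # held for the duration of a run
    @counter_lock = Mutex.new  # protects the skipped counter
    @runs = 0
    @skipped = 0
  end

  # Called on every tick; returns true if the work ran, false if it was skipped.
  def tick
    unless @lock.try_lock
      @counter_lock.synchronize { @skipped += 1 }
      return false
    end
    begin
      @work.call
      @runs += 1 # safe: only one thread can hold @lock at a time
      true
    ensure
      @lock.unlock
    end
  end
end

job = NonOverlappingJob.new { sleep 0.3 } # a deliberately slow job

# Fire three ticks 0.1s apart, each on its own thread, like an eager scheduler would.
threads = 3.times.map do
  t = Thread.new { job.tick }
  sleep 0.1
  t
end
threads.each(&:join)

puts "ran: #{job.runs}, skipped: #{job.skipped}"
```

Without the guard, all three ticks would start the slow job and each would pick up the same pending records.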
Another trick which I think helped was to wrap db accessing sections like so:
[code]
ActiveRecord::Base.connection_pool.with_connection do
# db work…
end
[/code]
The scheduled task analyses each request and generates work items to be done in the thread pool. Another mistake I made was to set the work items' status to NEW and then, separately, re-query the db for NEW items to queue up for the thread pool. Only when they were picked off the queue did their status advance. This left a window of opportunity for a subsequent scheduled job/task analysis to re-queue the same items. The change to address this was to not re-query the db: I am generating the items, so I know them without going to the db. Thus subsequent runs will only work on items that they generate themselves.
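A rough sketch of that fix, with everything simplified: the "database" is just an in-memory array, the thread pool is two plain Ruby threads reading from a `Queue`, and `Dispatcher`, `WorkItem` and `analyse` are names I made up for the example rather than anything from the real app.

```ruby
# Hypothetical in-memory stand-in for the work item table.
WorkItem = Struct.new(:id, :request_id, :status)

class Dispatcher
  attr_reader :queue, :table

  def initialize
    @table = []        # pretend database table
    @queue = Queue.new # feeds the worker threads
    @next_id = 0
  end

  # One scheduled run: generate items for a request and queue exactly those.
  def analyse(request_id, count)
    items = Array.new(count) do
      item = WorkItem.new(@next_id += 1, request_id, "NEW")
      @table << item # still saved, but never re-queried for queuing
      item
    end
    items.each { |item| @queue << item }
    items
  end
end

dispatcher = Dispatcher.new
dispatcher.analyse("req-1", 2)
dispatcher.analyse("req-2", 3) # a later run queues only its own three items

queued = dispatcher.queue.size
dispatcher.queue.close # no more work; pop returns nil once the queue drains

workers = 2.times.map do
  Thread.new do
    while (item = dispatcher.queue.pop)
      item.status = "DONE"
    end
  end
end
workers.each(&:join)

puts "queued #{queued} items, #{dispatcher.table.count { |i| i.status == "DONE" }} done"
```

Because each run only queues the items it has just created, a NEW item left over from an earlier run can never be queued a second time.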
Each work item in the thread pool used the above trick to re-connect to the DB and then loaded its own record.
To keep the various webscraping/webservice calls separate, the code for each is held in a text field in the DB. It is loaded dynamically as each call is required and is “instance_eval”'d into the work item object, so it has access to the work item details.
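As a minimal illustration of that last step, the sketch below evaluates a snippet of Ruby held in a string against a work item object. `ScriptedWorkItem` and its `url` attribute are hypothetical, and the string literal stands in for the text that would come out of the DB field.

```ruby
# Hypothetical work item; in the real app the snippet text comes from a DB text field.
class ScriptedWorkItem
  attr_reader :url, :result

  def initialize(url)
    @url = url
  end

  # Evaluate the stored snippet with this work item as `self`,
  # so the snippet can call the item's methods (here, `url`) directly.
  def run(snippet)
    @result = instance_eval(snippet, "stored_snippet")
  end
end

# Single quotes keep #{url} un-interpolated until the snippet is evaluated.
stored_snippet = '"would fetch #{url}"'

item = ScriptedWorkItem.new("http://example.com/page")
item.run(stored_snippet)
puts item.result
```

The second argument to `instance_eval` is only a label that shows up in backtraces, which helps when a stored snippet blows up. This approach does mean that whatever is in that DB field runs with full access to the process, so it only makes sense when the field is written by people you trust.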
There are largely 2 kinds of work items – web scrapers and webservice calls. The scraping is done via Mechanize and the webservices via Savon.
For Mechanize, the various pages/frames/forms are navigated to achieve the desired results.
For Savon, the message body is constructed and the service called.
The results are then saved back to the db.