Wednesday, November 05, 2008

ActiveRecord lessons learnt: #1 Never forget there's a database

We have been using Castle's ActiveRecord very heavily on my current project, and we have recently spent a lot of time tuning the performance of the application and its use of ActiveRecord.  We learnt a lot over the last few weeks, and in the next few posts I'll try to capture some of these lessons.

ActiveRecord is a great abstraction of the database for a .NET application.  NHibernate and Hibernate are OK, but leave you with a legacy of XML mapping files in your application, which is no fun at all.  The appeal of ActiveRecord over vanilla NHibernate was the chance to shed all that XML cruft.  Now, with a few simple attributes and a simple repository class I can transform my plain old C# object into a persistent domain object with ease.  It is almost as if the database was not there at all...

[ActiveRecord]
public class MyAggregate
{
    private int id;
    private readonly IList<SomeObject> items = new List<SomeObject>();

    [PrimaryKey(Access = PropertyAccess.FieldCamelcase, UnsavedValue = Entity.IdUnassignedString)]
    public int Id
    {
        get { return id; }
    }

    [HasMany(typeof(SomeObject), Access = PropertyAccess.FieldCamelcase)]
    public IList<SomeObject> SomeItems
    {
        get { return items; }
    }
}

Look at that!  Just three Attributes and my object is ready for the database.  Marvelous!

This has been the source of many problems for us.  It was just too easy to work with the database.  In the green field days of the project, you could almost forget that the database existed at all.  We had ActiveRecord auto-generate the database, and we wrote test fixtures which could create test data in memory or in the database.  This was superb, as we could write standalone unit tests to drive out the behaviour of the domain and application components; soon we had hundreds of fast, database-free unit tests.  We could also write wired tests for our repositories to prove that we could persist domain objects to SQL Server and read them back.  The same test fixtures could be reused in the automated acceptance tests to modify the application data as required.
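The fixture arrangement was roughly this shape (the interface and class names below are hypothetical, for illustration only): the same fixture works against an in-memory repository for fast unit tests, or an ActiveRecord-backed repository for wired tests.

```csharp
// Hypothetical sketch of a dual-mode test fixture. The repository
// interface hides whether data lives in memory or in SQL Server.
public interface IBookRepository
{
    void Save(Book book);
    IList<Book> AllBooks();
}

public class BookFixture
{
    private readonly IBookRepository repository;

    // Pass an in-memory implementation for unit tests,
    // or an ActiveRecord-backed one for wired/acceptance tests.
    public BookFixture(IBookRepository repository)
    {
        this.repository = repository;
    }

    public Book CreateBook(string title)
    {
        Book book = new Book(title);
        repository.Save(book);
        return book;
    }
}
```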

Fantastic!  So what went wrong?  Well, we built a slow app, at least it started out slow when it hit UAT (it's much, much faster now).  We fell victim to Joel's Law of Leaky Abstractions, which states that "All non-trivial abstractions, to some degree, are leaky".  In this case, we have ActiveRecord abstracting NHibernate, which in turn abstracts the database access.  The minute you forget this, you will get bitten.  The Hibernate team certainly acknowledge this.  Just three pages into the first chapter of Java Persistence with Hibernate, Christian Bauer and Gavin King remind you that:

"To use Hibernate effectively, a solid understanding of the relational model and SQL is a prerequisite.  You need to understand the relational model and topics such as normalization to guarantee the integrity of your data, and you'll need to use your knowledge of SQL to tune the performance of your Hibernate application"

In other words, try to remember that you are still talking to a relational database; we're just making it easier for you to do so from your OO world.

I really like ActiveRecord and I would recommend using it on .NET projects when there is a need for an O/R mapping tool.  The fault lay with us, the developers.  We didn't pay enough attention to the fact that there was a database at the end of the calls that the repository was making.  We didn't spot that some of the calls which our test code was making were potentially very expensive in real usage scenarios.  And so we got bitten by the law of leaky abstractions, which manifested itself through a number of bad usage patterns that caused immediate performance problems.

Avoidable Mistake - Absence of good quality test data

We took too long to recognise we had performance problems because we didn't work hard enough to obtain truly representative test data for our app.  If we had made more of an effort to use 'real' data sooner, then we would have spotted much more, much earlier in our project's lifecycle.  There is no substitute for realistic data.  Get or generate some and plan to integrate it into your application build process.  Make it easy for developers to use realistic data.  We now have an NAnt target which restores a UAT or production database onto our developer machines and upgrades it to the latest version using dbdeploy.NET.  This is great as it means we can also automate testing our data and schema migration scripts with a production-like database.
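That build target looks something like the following sketch.  The paths, database names and task attributes here are illustrative assumptions, not our actual build file; check the dbdeploy.NET documentation for the exact task attributes it supports.

```xml
<!-- Hypothetical NAnt target: restore a production-like backup, then
     bring the schema up to date with dbdeploy.NET delta scripts. -->
<target name="restore-uat-db">
  <!-- Restore a recent UAT backup over the local developer database -->
  <exec program="sqlcmd">
    <arg line='-S localhost -Q "RESTORE DATABASE MyApp FROM DISK = N&apos;C:\backups\uat.bak&apos; WITH REPLACE"' />
  </exec>
  <!-- Generate and apply the change scripts needed to upgrade the
       restored database to the latest schema version -->
  <dbdeploy dbType="mssql"
            dbConnection="Server=localhost;Database=MyApp;Trusted_Connection=True;"
            dir="Database\migrations"
            outputFile="Database\upgrade.sql" />
</target>
```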

Avoidable Mistake - Adoption of the "select all and filter with a bit of LINQ" anti pattern

This is a real killer.  The code snippet below illustrates the problem:

public Book FindBookWithTitle(string title)
{
    return repository.AllBooks().Where(book => book.Title == title).FirstOrDefault();
}

This is fine from a C# perspective, and is nice, clean, expressive code.  The problem is that we had to pull all the books out of the database to run the query.  Fine with a small test database, but very bad with the Amazon book catalogue.

Provide a richer API through the repository instead and push the filtering to the database:

public class BookRepository : IBookRepository
{
    //Snip

    public Book GetFirstBookWithTitle(string title)
    {
        DetachedCriteria crits = DetachedCriteria.For(typeof(Book));
        crits.Add(Expression.Eq("Title", title));
        return ActiveRecordMediator<Book>.FindFirst(crits);
    }

    //Snip
}

This is a specific example of a general problem - making the application do work it doesn't need to do.

I'd generally be very suspicious of repositories which return unfiltered lists of any big aggregates.  They can be very useful for test purposes, but can really kill you if they get into the production code.  Keep an eye out for filtering being performed in C# code which would be simpler and more efficient as a relational database query.  These tasks should be pushed down into your repository classes and rewritten in HQL or Criteria queries.

Avoidable Mistake - Loading in data that you don't need

Bear in mind that calls to the database are expensive, especially if said database is on a remote machine.  Every time you go to the database for something that you don't need, you are going to slow down your app and needlessly waste resources.  Here are some examples of mistakes we made:

  • Loading objects from the database to perform unnecessary validation.
  • Eagerly loading objects on page load "just in case" the app might use them.
  • Repeated loads of objects from the database due to page lifecycle events.
  • Loading aggregates when a report query or a summary object would be better.
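On that last point, a scalar HQL query via Castle's ScalarQuery lets the database produce the summary figure instead of hydrating every aggregate.  A sketch, assuming a mapped Book class with an Author property (both are assumptions for illustration):

```csharp
// Instead of loading every Book aggregate into memory just to count
// them, push the counting down to the database as a scalar HQL query.
ScalarQuery<long> countQuery = new ScalarQuery<long>(
    typeof(Book),
    "select count(*) from Book b where b.Author = ?",
    "Gavin King");

long bookCount = countQuery.Execute();
```

The query returns a single long rather than a list of fully-populated Book objects, so the cost stays roughly constant no matter how large the catalogue grows.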

Avoidable Mistake - Forgetting to check the SQL being generated

Make sure you check the SQL statements being used by NHibernate to satisfy your requests.  Sometimes you will get quite a surprise.  To enable logging of the SQL statements to the console when running your tests, add 

<add key="show_sql" value="true" />

to the activerecord section in your app.config file.  If you want to log application output, Ken Egozi has a post showing you how to do it here.

4 comments:

Alexandre Piau said...

Very good post Stuart! :)

Well done,
Bernardo Heynemann

Colin Jack said...

> Keep an eye out for filtering
> being performed in C# code which
> would be simpler and more
> efficient as a relational
> database query. These tasks
> should be pushed down into your
> repository classes and rewritten
> in HQL or Criteria queries.

I basically agree, but this path can lead to premature optimisation.  Performance testing is important, so do it early and often; if you do, you can avoid getting too caught up in worrying about performance, safe in the knowledge that you can fix any issues cheaply.

Samnang Chhun said...

It's a very interesting post.  I have had some experience with NHibernate, so where should I start with Active Record?

Stoofer said...

@samnang: I'd start with the getting started guide at the Castle website: http://www.castleproject.org/activerecord/gettingstarted/index.html

@colin: I agree that premature optimisation is not a good thing, but when developing it is worth keeping an eye out for obviously expensive calls to the database.
The point I was trying to get across was that if you know that you are likely to have tens of thousands of a particular type of object, then you should avoid requesting huge graphs of them from the database.