Thursday, November 13, 2008

Bowling Cards

Recently, my team decided we had to tackle a number of problems with our aging application code base which were causing us pain. There were a number of technology and design problems which needed sorting out. The challenge for us was to continue to deliver new features while taking measures to not only stop the signs of rot, but also to make an effort to improve the code base.  If we could improve the code where we were most active we would be able to deliver features more rapidly. 

The first stage was a technical retrospective, facilitated by Liv, our project manager. The whole team got together and we brainstormed the problems with the code base which were slowing down active development and causing us pain.  With everything hurting us out in the open, the team voting for the issues they thought were causing us the most pain. The focus was delivery, we weren't looking to make the code more beautiful for aesthetic reasons; we were trying to reduce friction to help us deliver more code more rapidly.  With the key problems agreed, we moved on to discuss how we could fix them. So far so good. 

Some of our problems could be fixed the traditional agile way write a technical card, estimate the cost and agree who would do the work and when.

Others were ongoing "try not to repeat this anti pattern" or "remember to do X" type solutions, so there was no action as such, other than to remember to not do "X" or to do "Y".

We also had a third class of problem. Long running repair or rework tasks. For example, the team has switched mocking libraries from NMock to Rhino Mocks.  While we find we are more productive when we mock using Rhino mocks, we still have a legacy of old unit tests which periodically bite us when we need to refactor.  The idea of scheduling a task to rewrite all our unit tests using Rhino Mocks is just plain wrong for a number of reasons.  On humanitarian grounds, I couldn't bring myself to ask a member of my team to spend a week rewriting all our tests - we'd need to put her on suicide watch.  Also, from a business value perspective, we only need to rewrite those tests which cover the active areas of the code.  The application has been around for a while, and we are only adding features in some areas of the code. If we are not going to be slowed down by NMock riddled tests in a remote area of the code, why waste client money fixing them?  Enter the bowling card.

in-progress-card

The anatomy of a bowling card is very simple. The title should be a simple statement of intent.  Here, we seek annihilation of all NMock calls in our unit tests. 

In the middle is a grid, with a box for each step we need to take to win. In this case, we have a box for each test class which includes the NMock library.

Anyone gets to work on the bowling card, whenever the time is right. As a team you can decide when this should be.  For this card the rules were that if you find yourself maintaining or extending an NMock based class, take the time to fix it by rewriting the tests to use Rhino mocks, or no mock at all.

There may or may not be a deadline.  For this card, we don't have a deadline.  There is no pressing need to replace all the NMock calls, only those in code hotspots.

You may or may not schedule the card to be played.  For this card we didn't schedule work - there is no real business justification to do that.  For another technical card which tracked removing inappropriate methods from a very large legacy repository, we were so slowed down by the problem that we did 'play' the card whenever we had capacity during an iteration.  In this case there was clear business value completing the card would not only speed us up, but would allow bigger architectural refactorings to take place later on in the project.

The boxes are also very important.  Each box is a step towards the goal.  We take the steps in two stages.

Spares and Strikes

When a developer or a pair have fixed one of the steps locally, they get to draw a single line over the box, or a 'spare'.  It is not converted to a 'strike' until the code change is checked in, built and tested if appropriate. We have found this to be really useful.  The spare is a reminder to check in!  With a lot of these tasks, we found it was easy to get carried away and bite off too much.  Invariably this led to trouble, broken code and despair, leading to reverting the code and feelings of hopelessness.  Once a pair has more than a couple of spares on the board, they start to feel the pressure to 'bank' and tend to check in more frequently.

We really like our bowling cards, they are great to remind us to make the world better, and to celebrate progress, no matter how small it feels at times.  In fact Bernardo and Sinan started another card today to track legacy classes with inadequate unit test coverage - and there are a lot of boxes on that sucker!

complete-cardCompleting a bowling card is a celebration event.  Navya was so happy, she made our first one a party hat! Huzzah!

Bowling cards have helped us improve our code.  They remind us that we chose to fix something, rather than allow our code to tip. It is going to potentially take a long time,  but we can all help by getting strikes when we have the chance.  We also have a measure of progress that we can track, and proof that we are making things better one step at a time.

Saturday, November 08, 2008

ActiveRecord lessons learnt: #2 Use lazy loading and batching to improve performance

By default Castle ActiveRecord eagerly loads your object data.  For a small set of connected objects, this default behaviour is fine.  The problems start when it is possible to navigate from the object that you have loaded to other potentially large object graphs.  As an example let's imagine that we are writing some code to track books in our house.  Say I want to track which books I have by author, along with the bookshelf that I'm keeping them on.  I'd also like to track chapters of interest and which editions of the book that I have.  I also want to tag my bookshelves with keywords describing the types of books on the shelf. 

image

The Bookshelf and the Author classes both contain collections of Books.  Nominally, the Author owns the books and the shelf has a reference to the object.  I've also decided that I need to navigate bi-directionally between the book and the author classes.   Here's a basic implementation using ActiveRecord.

using System.Collections.Generic;
using Castle.ActiveRecord;

namespace BooksAndStuff.Models
{

[ActiveRecord(
"Bookshelf")]
public class Bookshelf
{
[PrimaryKey(PrimaryKeyType.HiLo)]
public int Id { get; set; }

[HasMany(
typeof(string), Table = "Tags", ColumnKey = "BookShelfId",
Cascade
= ManyRelationCascadeEnum.AllDeleteOrphan, Element = "TagName",
Lazy
= Switches.LazyCollections, BatchSize = Switches.BatchSize)]
public IList<string> Tags { get; set; }

[HasMany(
typeof(Book), Table = "Books", ColumnKey = "BookShelfId",
Cascade
= ManyRelationCascadeEnum.None,
Lazy
= Switches.LazyCollections, BatchSize = Switches.BatchSize)]
public IList<Book> Books { get; set; }
}

[ActiveRecord(
"Books")]
public class Book
{
public Book() {}

public Book(string title, Author author, string isbn,
IList
<string> chapters, IList<string> editions) {
Title
= title;
Author
= author;
Chapters
= chapters;
ISBN
= isbn;
Editions
= editions;
}

[PrimaryKey(PrimaryKeyType.HiLo)]
public int Id { get; set; }

[Property(NotNull
= true)]
public string Title { get; set; }

[HasMany(
typeof(string), Table = "Editions", ColumnKey = "BookId",
Access
= PropertyAccess.Property, Element = "Edition",
Lazy
= Switches.LazyCollections, BatchSize = Switches.BatchSize)]
public IList<string> Editions { get; set; }


[HasMany(
typeof(string), Table = "Chapters", ColumnKey = "BookId",
Access
= PropertyAccess.Property, Element = "Chapter",
Lazy
= Switches.LazyCollections, BatchSize = Switches.BatchSize)]
public IList<string> Chapters { get; set; }

[Property(NotNull
= true)]
public string ISBN { get; set; }


[BelongsTo(NotNull
= true)]
public Author Author { get; set; }
}


[ActiveRecord(
"Authors", Lazy = Switches.LazyAuthor, BatchSize = Switches.AuthorBatchSize)]
public class Author
{
[PrimaryKey(PrimaryKeyType.HiLo)]
public virtual int Id { get; set; }

[Property(NotNull
= true)]
public virtual string Name { get; set; }

[HasMany(Cascade
= ManyRelationCascadeEnum.AllDeleteOrphan,
Lazy
= Switches.LazyCollections, BatchSize = Switches.BatchSize)]
public virtual IList<Book> Books { get; set; }

}

public static class Switches
{
public const bool LazyCollections = false;
public const bool LazyAuthor = false;
public const int BatchSize = 1;
public const int AuthorBatchSize = 1;
}
}

I wouldn't normally have a switches class, but it is very useful for the next part.  We are going to change the lazy loading and batch size properties of the active record attributes to understand how it influences the way NHibernate fetches the data from the database.  To get a feel for the behaviour of NHibernate, we will need some test data.  Say that we have a single shelf with 3 books on it, each with a different author.  We have another 2 books by one of our authors, but they are not on a shelf (I lent them to Steve). 


To test this we need to start up active record:

[TestFixtureSetUp]
public void FixtureSetUp()
{
var connectionString
= @"Data Source=localhost;initial catalog=BooksAndStuff;
user=BooksAndStuff_user;password=Password01
";
var properties
= new Dictionary<string, string>
{
{
"show_sql", "true"},
{
"connection.driver_class", "NHibernate.Driver.SqlClientDriver"},
{
"dialect", "NHibernate.Dialect.MsSql2005Dialect"},
{
"connection.provider", "NHibernate.Connection.DriverConnectionProvider"},
{
"connection.connection_string", connectionString}
};

var source
= new InPlaceConfigurationSource();
source.Add(
typeof(ActiveRecordBase), properties);
ActiveRecordStarter.Initialize(source,
typeof(BookShelf).Module.GetTypes());
ActiveRecordStarter.CreateSchema();
}

[SetUp]
public void SetUp()
{
ActiveRecordStarter.CreateSchema();
}

This method creates this in the database:

private void GenerateTestData() {
var lewisCarol
= new Author {Name = "Lewis Carol"};
var xmasCarol
= new Author {Name = "Xmas Carol"};
var luisLuisCarol
= new Author {Name = "Luis Luis Carol"};

var chapters
= new List<string>{ "Chapter 1", "Chapter 2" };
var editions
= new List<string>{"hard back", "spanish"};

var book1
= new Book("Alice in Wonderland", lewisCarol,"xxx-yyy", chapters, editions);
var book2
= new Book("Alice in Wonderland II - Revenge of the Cheshire Cat", xmasCarol,"xxx-yyy", chapters, editions);
var book3
= new Book("Not well known",lewisCarol, "xxx-yyy", chapters, editions);
var book4
= new Book("Not well known either", lewisCarol, "xxx-yyy", chapters, editions);
var book5
= new Book("Some book",luisLuisCarol,"xxx-yyy", chapters, editions);

lewisCarol.Books
= new List<Book> {book1, book3, book4};
xmasCarol.Books
= new List<Book>{book2};
luisLuisCarol.Books
= new List<Book>{ book5 };

var bookshelf
= new Bookshelf
{
Tags
= new List<string>{"Fantasy", "Old"},
Books
= new List<Book> {book1, book2, book5}
};

using (new SessionScope())
{
ActiveRecordMediator
<Author>.Save(lewisCarol);
ActiveRecordMediator
<Author>.Save(xmasCarol);
ActiveRecordMediator
<Author>.Save(luisLuisCarol);
ActiveRecordMediator
<BookShelf>.Save(bookshelf);
}
}

Now we have everything in place, we can write a test to load in the data:

[Test]
public void PlayingWithBatchFetchingAndLazyLoading()
{
GenerateTestData();

using (new SessionScope())
{
Console.WriteLine(
"*******************");
Console.WriteLine(
"Batch Size:{0} LazyCollections:{1} LazyAuthor:{2}",
Switches.BatchSize, Switches.LazyCollections, Switches.LazyAuthor);

Console.WriteLine(
"***\nLoading my book shelves");
var shelves
= ActiveRecordMediator<Bookshelf>.FindAll();

Console.WriteLine(
"***\nCounting books on shelf: ");
Console.WriteLine(
"shelves[0].Books.Count =" + shelves[0].Books.Count);

Console.WriteLine(
"***\nChecking I have the right book");
Console.WriteLine(
"\tshelves[0].Books[0].Id:" + shelves[0].Books[0].Id);

Console.WriteLine(
"***\nCounting the book's chapters");
Assert.That(shelves[
0].Books[0].Chapters.Count, Is.EqualTo(2));

Console.WriteLine(
"***\nHow many books does the first book's author have to his name?");
Assert.That(shelves[
0].Books[0].Author.Books.Count, Is.EqualTo(3));

Console.WriteLine(
"***\nHow many books does the third book's author have to his name?");
Assert.That(shelves[
0].Books[2].Author.Books.Count, Is.EqualTo(1));

Console.WriteLine(
"***\nHow many books does the second book's author have to his name?");
Assert.That(shelves[
0].Books[1].Author.Books.Count, Is.EqualTo(1));

Console.WriteLine(
"***\nWhat is the count of the chapters in a given book?");
Assert.That(shelves[
0].Books[0].Author.Books[1].Chapters.Count, Is.EqualTo(2));

Console.WriteLine(
"***\nHow many chapters does the third book have?");
Assert.That(shelves[
0].Books[2].Chapters.Count, Is.EqualTo(2));
}
}

When you run this, you should see the following output from NHibernate:

 *******************
Batch Size:1 LazyCollections:False LazyAuthor:False
***
Loading my book shelves
NHibernate: SELECT this_.Id as Id0_0_ FROM BookShelf this_
NHibernate: SELECT books0_.BookShelfId as BookShel5___2_, books0_.Id as Id2_, books0_.Id as Id2_1_, books0_.Title as Title2_1_, books0_.ISBN as ISBN2_1_, books0_.Author as Author2_1_, author1_.Id as Id5_0_, author1_.Name as Name5_0_ FROM Books books0_ inner join Authors author1_ on books0_.Author=author1_.Id WHERE books0_.BookShelfId=@p0; @p0 = '98305'
NHibernate: SELECT chapters0_.BookId as BookId__0_, chapters0_.Chapter as Chapter0_ FROM Chapters chapters0_ WHERE chapters0_.BookId=@p0; @p0 = '65541'
NHibernate: SELECT editions0_.BookId as BookId__0_, editions0_.Edition as Edition0_ FROM Editions editions0_ WHERE editions0_.BookId=@p0; @p0 = '65541'
NHibernate: SELECT books0_.Author as Author__1_, books0_.Id as Id1_, books0_.Id as Id2_0_, books0_.Title as Title2_0_, books0_.ISBN as ISBN2_0_, books0_.Author as Author2_0_ FROM Books books0_ WHERE books0_.Author=@p0; @p0 = '32771'
NHibernate: SELECT chapters0_.BookId as BookId__0_, chapters0_.Chapter as Chapter0_ FROM Chapters chapters0_ WHERE chapters0_.BookId=@p0; @p0 = '65540'
NHibernate: SELECT editions0_.BookId as BookId__0_, editions0_.Edition as Edition0_ FROM Editions editions0_ WHERE editions0_.BookId=@p0; @p0 = '65540'
NHibernate: SELECT books0_.Author as Author__1_, books0_.Id as Id1_, books0_.Id as Id2_0_, books0_.Title as Title2_0_, books0_.ISBN as ISBN2_0_, books0_.Author as Author2_0_ FROM Books books0_ WHERE books0_.Author=@p0; @p0 = '32770'
NHibernate: SELECT chapters0_.BookId as BookId__0_, chapters0_.Chapter as Chapter0_ FROM Chapters chapters0_ WHERE chapters0_.BookId=@p0; @p0 = '65537'
NHibernate: SELECT editions0_.BookId as BookId__0_, editions0_.Edition as Edition0_ FROM Editions editions0_ WHERE editions0_.BookId=@p0; @p0 = '65537'
NHibernate: SELECT books0_.Author as Author__1_, books0_.Id as Id1_, books0_.Id as Id2_0_, books0_.Title as Title2_0_, books0_.ISBN as ISBN2_0_, books0_.Author as Author2_0_ FROM Books books0_ WHERE books0_.Author=@p0; @p0 = '32769'
NHibernate: SELECT chapters0_.BookId as BookId__0_, chapters0_.Chapter as Chapter0_ FROM Chapters chapters0_ WHERE chapters0_.BookId=@p0; @p0 = '65539'
NHibernate: SELECT editions0_.BookId as BookId__0_, editions0_.Edition as Edition0_ FROM Editions editions0_ WHERE editions0_.BookId=@p0; @p0 = '65539'
NHibernate: SELECT chapters0_.BookId as BookId__0_, chapters0_.Chapter as Chapter0_ FROM Chapters chapters0_ WHERE chapters0_.BookId=@p0; @p0 = '65538'
NHibernate: SELECT editions0_.BookId as BookId__0_, editions0_.Edition as Edition0_ FROM Editions editions0_ WHERE editions0_.BookId=@p0; @p0 = '65538'
NHibernate: SELECT tags0_.BookShelfId as BookShel1___0_, tags0_.TagName as TagName0_ FROM Tags tags0_ WHERE tags0_.BookShelfId=@p0; @p0 = '98305'
***
Counting books on shelf: 
shelves[0].Books.Count =3
***
Checking I have the right book
    shelves[0].Books[0].Id:65537
***
Counting the book's chapters
***
How many books does the first book's author have to his name?
***
How many books does the third book's author have to his name?
***
How many books does the second book's author have to his name?
***
What is the count of the chapters in a given book?
***
How many chapters does the third book have?

Blimey - that was a lot of SQL!  16 Select statements.  Also, notice how it was all executed on the initial load of the Bookshelf.  The logic for loading ran something like this:

  1. Load all the data from the Bookshelf table for all the bookshelves
  2. For each bookshelf 
    1. Load in all the data from the books and authors table for each book on the shelf and its corresponding author
    2. For each book on the shelf 
      1. Load all the Chapters for the first book on the shelf
      2. Load all the editions for the first book in the shelf
    3. For each author 
      1. Load all the data from the books table for the author
      2. Repeat the book loading logic for every book belonging to the author which was not on the shelf
    4. Load in all the tags for the shelf

All that so that we can ask for the first shelf.  We haven't even begun to touch the object yet.  Why did NHibernate do this?  All of these objects can theoretically be reached from the bookshelf.  Because we may want to touch any of these objects after we have retrieved the BookShelf from the database, NHibernate decided it better go and grab the lot so that we could get to these objects if we had to.  This is eager loading, and is the default behaviour of ActiveRecord unless you explicitly instruct it otherwise.  Notice also, that we have an N+1 selects problem too.  There is a single select statement being issued for every collection for every object.  This is a a very expensive way to fetch data from the database.

The key to optimising the fetch behaviour is how you use the the Lazy and BatchSize properties of the various ActiveRecord attributes.

Removing N+1 Select with BatchSize

The BatchSize property can be used on collections or on ActiveRecord classed directly.  When you use the setting on a collection, NHibernate will attempt to populate the collection for [BatchSize] of the objects which NHibernate is aware of an has not populated yet.

We'll start by adding batching

public static class Switches
{
public const bool LazyCollections = false;
public const bool LazyAuthor = false;
public const int BatchSize = 2;
public const int AuthorBatchSize = 10;
}

Batch Size:2 LazyCollections:False LazyAuthor:False
***
Loading my book shelves 
SELECT this_.Id as Id3_0_ FROM BookShelf this_
SELECT books0_.BookShelfId as BookShel5___2_, books0_.Id as ...
SELECT chapters0_.BookId...WHERE chapters0_.BookId in (@p0, @p1); @p0 = '65541', @p1 = '65540'
SELECT editions0_.BookId...WHERE editions0_.BookId in (@p0, @p1); @p0 = '65541', @p1 = '65540'
SELECT books0_.Author...WHERE books0_.Author in (@p0, @p1); @p0 = '32771', @p1 = '32770'
SELECT chapters0_.BookId...WHERE chapters0_.BookId=@p0; @p0 = '65537'
SELECT editions0_.BookId...WHERE editions0_.BookId=@p0; @p0 = '65537'
SELECT books0_.Author...WHERE books0_.Author=@p0; @p0 = '32769'
SELECT chapters0_.BookId...WHERE chapters0_.BookId in (@p0, @p1); @p0 = '65539', @p1 = '65538'
SELECT editions0_.BookId...WHERE editions0_.BookId in (@p0, @p1); @p0 = '65539', @p1 = '65538'
SELECT tags0_.BookShelfId...WHERE tags0_.BookShelfId=@p0; @p0 = '98305'
***

That improved the behaviour somewhat.  We're down to 11 statements. NHibernate is now resolving the objects in the collections 2 at a time.  This is not very useful as we typically have more than 2 books on a shelf, so we have 2 trips round the loop to load in all the data for our shelf.  If we up the BatchSize to 5 we can get everything in 7 statements:

Batch Size:5 LazyCollections:False LazyAuthor:False
***
Loading my book shelves
SELECT this_.Id as Id3_0_ FROM BookShelf this_
SELECT books0_.BookShelfId...WHERE books0_.BookShelfId=@p0; @p0 = '98305'
SELECT chapters0_.BookId..WHERE chapters0_.BookId in (@p0, @p1, @p2); @p0 = '65541', @p1 = '65537', @p2 = '65540'
SELECT editions0_.BookId...WHERE editions0_.BookId in (@p0, @p1, @p2); @p0 = '65541', @p1 = '65537', @p2 = '65540'
SELECT books0_.Author...WHERE books0_.Author in (@p0, @p1, @p2); @p0 = '32771', @p1 = '32769', @p2 = '32770'
SELECT chapters0_.BookId...WHERE chapters0_.BookId in (@p0, @p1); @p0 = '65539', @p1 = '65538'
SELECT editions0_.BookId...WHERE editions0_.BookId in (@p0, @p1); @p0 = '65539', @p1 = '65538'
SELECT tags0_.BookShelfId...WHERE tags0_.BookShelfId=@p0; @p0 = '98305'
***

Bear in mind that the BatchSize only influences how many of the objects which will be populated in the current known set of objects.  In the example above, even though we only have 5 books we still have make two trips to the database to load all the books in.  The first time NHibernate decides it needs to load in a Book is when it loads the shelf.  At this point in time, there are only 3 known books so they are fetched.  Later on, while loading in the authors, the 2 remaining books are identified and fetched in a second batch.

Using Lazy Loading

The biggest problem with the fetch patterns above is the large amount of data being retrieved which may not be required.  This is potentially very wasteful.  Lazy loading allows you to control this.  With our current object model, it probably doesn't make sense to always pull in the author (and the related books of the author) every time we load a bookshelf.  How about we look at the results of lazy loading the author.

public static class Switches
{
public const bool LazyCollections = false;
public const bool LazyAuthor = true;
public const int BatchSize = 5;
public const int AuthorBatchSize = 10;
}

*******************
Batch Size:5 LazyCollections:False LazyAuthor:True
***
Loading my book shelves
SELECT this_.Id as Id3_0_ FROM BookShelf this_
SELECT books0_.BookShelfId... WHERE books0_.BookShelfId=@p0; @p0 = '98305'
SELECT chapters0_.BookId ... WHERE chapters0_.BookId in (@p0, @p1, @p2); @p0 = '65541', @p1 = '65537', @p2 = '65540'
SELECT editions0_.BookId ... WHERE editions0_.BookId in (@p0, @p1, @p2); @p0 = '65541', @p1 = '65537', @p2 = '65540'
SELECT tags0_.BookShelfId ... WHERE tags0_.BookShelfId=@p0; @p0 = '98305'
***
Counting books on shelf: 
shelves[0].Books.Count =3
***
Checking I have the right book
    shelves[0].Books[0].Id:65537
***
Counting the book's chapters
***
How many books does the first book's author have to his name?
SELECT author0_.Id ... WHERE author0_.Id in (@p0, @p1, @p2); @p0 = '32769', @p1 = '32770', @p2 = '32771'
SELECT books0_.Author ... WHERE books0_.Author in (@p0, @p1, @p2); @p0 = '32771', @p1 = '32769', @p2 = '32770'
SELECT chapters0_.BookId ... WHERE chapters0_.BookId in (@p0, @p1); @p0 = '65539', @p1 = '65538'
SELECT editions0_.BookId ... WHERE editions0_.BookId in (@p0, @p1); @p0 = '65539', @p1 = '65538'
***
How many books does the third book's author have to his name?
***
How many books does the second book's author have to his name?
***
What is the count of the chapters in a given book?
***
How many chapters does the third book have?

This has dramatically changed the way we retrieve data from the database.  Now we do not load any of the books which are not on the shelf until we touch the author property of the book.  The initial load is much cheaper, although we still pull in a load of data which is not required.  How about we lazy load the collections as well?

*******************
Batch Size:5 LazyCollections:True LazyAuthor:True
***
Loading my book shelves
NHibernate: SELECT this_.Id as Id3_0_ FROM BookShelf this_
***
Counting books on shelf:
SELECT books0_.BookShelfId... WHERE books0_.BookShelfId=@p0; @p0 = '98305'
shelves[0].Books.Count =3
***
Checking I have the right book
shelves[0].Books[0].Id:65537
***
Counting the book's chapters
SELECT chapters0_.BookId... WHERE chapters0_.BookId in (@p0, @p1, @p2); @p0 = '65537', @p1 = '65540', @p2 = '65541'
***
How many books does the first book's author have to his name?
SELECT ... WHERE author0_.Id in (@p0, @p1, @p2); @p0 = '32769', @p1 = '32770', @p2 = '32771'
SELECT books0_.Author... WHERE books0_.Author in (@p0, @p1, @p2); @p0 = '32769', @p1 = '32770', @p2 = '32771'
***
How many books does the third book's author have to his name?
***
How many books does the second book's author have to his name?
***
What is the count of the chapters in a given book?
NHibernate: SELECT chapters0_.BookId... WHERE chapters0_.BookId in (@p0, @p1); @p0 = '65538', @p1 = '65539'
***
How many chapters does the third book have?

As you can see, the up front cost of loading the bookshelves is now very cheap, we are progressively loading in data as we touch more and more of the object graph.  I would suggest that this is generally the best behaviour.  In this case we never retrieve the tags or editions from the database because we don't touch them.  If another method wants to load a bookshelf to edit its tags, we won't incur the cost of loading all the books on the shelf (and there could be 100's of those).

In summary, I'd recommend that you default to setting Lazy loading and batch fetching for all collections and consider lazy loading large aggregate classes - such as the Author in this example, so that you don't inadvertently pull out huge chunks of the database.  Also, keep watching your SQL output by setting show_sql to true now and again.  Quite innocent seeming changes to your domain model can quickly magnify the number of objects that can be navigated to from another domain object.  If you don't have a Lazy constraint between your objects to act as an NHibernate firebreak, then you will cause the amount of SQL issued to increase dramatically.

Don't forget that lazy loading can only work when you remain inside the hibernate session.  If you detach your object from the session then you will get a LazyInitializationException thrown when you attempt to access a collection of object that has not been initialised.

Wednesday, November 05, 2008

ActiveRecord lessons learnt: #1 Never forget there's a database

We have been using Castle's Active Record very heavily on my current project we have recently spent a lot of time performance tuning the performance of the application and its use of ActiveRecord.  We learnt a lot over the last few of weeks and in the next few posts I'll try to capture some of these lessons.

ActiveRecord is a great abstraction of the database for a .NET application.  NHibernate and Hibernate are OK, but leave you with a legacy of XML mapping files in your application, which is no fun at all.  The appeal of ActiveRecord over vanilla NHibernate was the chance to shed all that XML cruft.  Now, with a minimal few simple attributes and a simple repository class I can transform my plain old C# object into a persistent domain object with ease.  It is almost as if the database was not there at all...

[ActiveRecord]
public class MyAggregate
{
private int id;
private readonly IList<SomeObject> items;

[PrimaryKey(Access
= PropertyAccess.FieldCamelcase, UnsavedValue = Entity.IdUnassignedString)]
public int Id{get { return id; } }

[HasMany(
typeof(SomeObject), Access=PropertyAccess.FieldCamelCase)]
public IList<SomeObject> SomeItems{ get { return items; } }
}

Look at that!  Just three Attributes and my object is ready for the database.  Marvelous!

This has been the source of many problems for us.  It was just too easy to work with the database.  In the green field days of the project, you could almost forget that the database existed at all.  We had ActiveRecord auto generate the database, and we wrote test fixtures which could create test data in memory or in the database.  This was superb, as we could write standalone unit tests to drive out the behaviour of the domain and application components, soon we had hundreds of fast database free unit tests.  We could also write wired tests for our repositories to prove that we could persist domain objects to and read them back from SQL server.  The same test fixtures could be reused in the automated acceptance tests to modify the application data as required.

Fantastic!  So what went wrong?  Well, we built a slow app, at least it started out slow when it hit UAT (it's much, much faster now).  We fell victim to Joel's Law of Leaky Abstractions which states that "All non-trivial abstractions, to some degree, are leaky".  In this case, we have ActiveRecord abstracting NHibernate which in turn abstracts the database access.  The minute you forget this, you will get bitten.  The Hibernate team certainly acknowledge this.   Just three pages into the first chapter of Java Persistence with Hibernate Christian Bauer and Gavin King remind you that:

"To use Hibernate effectively, a solid understanding of the relational model and SQL is a prerequisite.  You need to understand the relational model and topics such as normalization to guarantee the the integrity of your data, and you'll need to use your knowledge of SQL to tune the performance of your Hibernate application" 

In other words, try to remember that you are still talking to a relational database; we're just making it easier for you to do so from your OO world.

I really like ActiveRecord and I would recommend using it on .NET projects when there is the need an OR Mapping tool.  The fault lay was us, the developers.  We didn't pay enough attention to the fact that there was a database at the end of the calls that the repository was making.  We didn't spot that some of the calls which our test code was making were potentially very expensive in real usage scenarios.  And so we got bitten by the law of leaky abstractions, which manifest itself through a number of bad usage patterns which caused immediate performance problems.

Avoidable Mistake - Absence of good quality test data

We took too long time to recognise we had performance problems because we didn't work hard enough to obtain truly representative test data for our app.  If we had made more of an effort to use 'real' data sooner, then we would have spotted much more, much earlier in our project's lifecycle.  There is no substitute for realistic data.  Get or generate some and plan to integrate it into your application build process.  Make it easy for developers to use realistic data.  We now have a nant target which restores a UAT or production database onto our developer machines and upgrades it to the latest version using dbdeploy.NET.  This is great as it means we can also automate testing our data and schema migration scripts with a production-like database.

Avoidable Mistake - Adoption of the "select all and filter with a bit of LINQ" anti pattern

This is a real killer.  The code snippet below illustrates the problem

public Book FindBooksWitTitle(string title)
{
return repository.AllBooks().Select(book=>book.Title == title).FirstOrDefault();
}

This is fine from a C# perspective, and is nice, clean, expressive code.  The problem is that we had to pull all the books out of the database to run the query.  Fine with a small test database, but very bad with the Amazon book catalogue.

Provide a richer API through the repository instead and push the filtering to the database:

public class BookRepository : IBookRepository
{
//Snip

public IList<Book> GetFirstBookWithTitle(string title)
{
DetachedCriteria crits
= DetachedCriteria.For(typeof(Book));
crits.Add(Expression.Eq(
"Title", title));
return ActiveRecordMediator<Book>.FindFirst(crits);
}

//Snip
}

This is a specific example of a general problem - making the application do work it doesn't need to do.

I'd generally be very suspicious of repositories which return unfiltered lists of any big aggregates.  They can be very useful for test purposes, but can really kill you if they get into the production code.  Keep an eye out for filtering being performed in C# code which would be simpler and more efficient as a relational database query.  These tasks should be pushed down into your repository classes and rewritten in HQL or Criteria queries.

Avoidable Mistake - Loading in data that you don't need

Bear in mind that calls to the database are expensive, especially if said database is on a remote machine.  Every time you go to the database for something that you don't need, you are going to slow down your app and needlessly waste resources.  Here are some examples of mistakes we made:

  • Loading objects from the database to perform unnecessary validation.
  • Eagerly loading objects on page load "just in case" the app might use them.
  • Repeated loads of objects from the database due to page lifecycle events.
  • Loading aggregates when a report query or a summary object would be better

Avoidable Mistake - Forgetting to check the SQL being generated

Make sure you check the SQL statements being used by NHibernate to satisfy your requests.  Sometimes you will get quite a surprise.  To enable logging of the SQL statements to the console when running your tests, add 

<add key="show_sql" value="true" />

to the activerecord section in your app.config file .   If you want to log application output, Ken Egozi has a post showing you how to do it here.