Wednesday, November 4, 2015

Reducing Latency in CQRS Applications


It's been almost a year and a half since I started practicing event sourcing architecture. The application I am working on is built with domain-driven design, event sourcing, and CQRS (Command-Query Responsibility Segregation).

One of the most challenging tasks for me during this journey was minimizing latency between write and read models.

The classical model we implemented was really slow: the latency was around one second, while the recommended latency is considerably lower.

The model is shown in the picture below:




The main reason for the high latency was the path that event messages had to travel to reach the backend server, which then updated the read model accordingly.

We use our own implementation of an event store, which runs on SQL Server.

In my implementation, the backend server used a catch-up subscription, implemented as a poller over the globally sequenced list of events in the event store. The backend server polled for events that had arrived in the global list after the global "position", i.e. the last position the backend server had already processed.
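A minimal sketch of such a catch-up poller, in Python for brevity (the names `read_after`, `Event.position`, and `poll_once` are illustrative, not the real event store's API):

```python
from dataclasses import dataclass


@dataclass
class Event:
    position: int   # global sequence number assigned by the event store
    payload: str


class CatchUpSubscription:
    """Polls the store for events past the last processed global position."""

    def __init__(self, read_after, handler, start_position=0):
        self.read_after = read_after  # callable: events with position > arg, in order
        self.handler = handler        # receives each event exactly once
        self.position = start_position

    def poll_once(self):
        """One polling pass; returns the number of events processed."""
        batch = self.read_after(self.position)
        for event in batch:
            self.handler(event)
            self.position = event.position  # advance the checkpoint
        return len(batch)
```

In production this `poll_once` would run in a loop with a sleep between empty passes, which is exactly where the one-second lag came from.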

The system experienced lag during this process, because each event was first persisted and only eventually polled.

To get rid of the lag, I started thinking about how to notify the backend server about new events through an alternative, faster mechanism.

If the command processor and the backend event handler are hosted in a single process, that faster mechanism could be an in-process queue.

If they are on different servers, I could use some fast in-memory (non-persisted) queue.

But what should I do when messages are lost, for example because the in-memory queue was down? And what about when the backend process restarts? At that moment there may be unprocessed events in the event store, so the read model has to catch up with them.

I came up with a solution in which events are delivered to the backend event processor through two channels.

One is the durable channel shown in the picture above; the other is a fast in-memory or in-process channel.

When the system operates normally, the fast channel is always ahead of the durable channel. When the system is restarted, or when messages are lost, the missing events are delivered through the catch-up subscription.

To implement this, I needed to somehow join those two channels into one, and subscribe the backend event handler to that single joined channel.

Given that you know an event's global position (sequence number) at the moment it is saved to the event store, this is possible: the position lets you deduplicate events that arrive through both channels.

The last thing to solve at this point was doing this in an efficient and thread-safe way.

I found a great open-source library that was exactly what I needed: LMAX Disruptor.

The library provides a ring buffer construct that allows multiple publishers to push messages into predefined, sequenced cells in a thread-safe way, and lets multiple subscribers consume those messages in strict sequential order, one cell at a time. This way, whichever channel inserts a message into the next cell triggers the subscriber to advance to that cell and process the message inside it.
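The sequenced-cell idea can be shown in miniature. This is a toy Python sketch of the concept, not the Disruptor API (Disruptor is lock-free, whereas this sketch uses a simple condition variable): both channels publish an event into the cell determined by its global position, so duplicates collapse into one slot, and the consumer advances strictly one position at a time.

```python
import threading


class SequencedRing:
    """Toy ring buffer keyed by global sequence number.

    Both the fast and the durable channel publish each event into the cell
    for its global position; the consumer advances one position at a time,
    so a duplicate delivery of the same position is harmless.
    """

    def __init__(self, size=1024):
        self.size = size
        self.cells = [None] * size        # (position, event) or None
        self.next_to_consume = 1          # global position expected next
        self.cond = threading.Condition()

    def publish(self, position, event):
        with self.cond:
            if position < self.next_to_consume:
                return  # already consumed via the other channel; drop duplicate
            # A real ring buffer would apply backpressure if the publisher
            # is more than `size` positions ahead; omitted in this sketch.
            self.cells[position % self.size] = (position, event)
            self.cond.notify_all()

    def consume(self, timeout=None):
        """Block until the cell for `next_to_consume` is filled, then return its event."""
        with self.cond:
            while True:
                slot = self.cells[self.next_to_consume % self.size]
                if slot is not None and slot[0] == self.next_to_consume:
                    self.cells[self.next_to_consume % self.size] = None
                    self.next_to_consume += 1
                    return slot[1]
                if not self.cond.wait(timeout):
                    return None  # timed out waiting for the next position
```

The key property is that progress is driven by whichever channel fills the next cell first, which is normally the fast channel.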

The implementation was really simple.
The old catch-up subscription was modified to insert entries into the Disruptor instead of passing them to event handlers directly.

The event handler was modified to become a subscriber of the Disruptor; it was also modified to persist the global position of processed events after processing.
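The checkpointing part can be sketched like this (a Python sketch; `apply_event`, `save_position`, and `load_position` are hypothetical stand-ins for the read-model update and a small checkpoint table):

```python
class CheckpointedHandler:
    """Wraps the read-model event handler with a persisted position.

    After each event is applied, its global position is saved so that on
    restart the catch-up subscription can resume from the last processed
    position. Events at or below the checkpoint are skipped, which also
    makes duplicate deliveries from the two channels safe.
    """

    def __init__(self, apply_event, save_position, load_position):
        self.apply_event = apply_event
        self.save_position = save_position
        self.position = load_position()  # resume point after a restart

    def handle(self, position, event):
        if position <= self.position:
            return False               # already applied; skip duplicate
        self.apply_event(event)        # update the read model first...
        self.save_position(position)   # ...then record the new checkpoint
        self.position = position
        return True
```

Saving the position after applying the event means an event may be applied twice after a crash between the two steps, so the read-model updates should be idempotent.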

The event store interface's Append method was modified to try to push messages to the Disruptor right after the events are persisted.
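A sketch of that modified Append, in Python (the in-memory `log` list stands in for the SQL Server event table, and `fast_publish` for the push into the Disruptor):

```python
class AppendingStore:
    """Sketch: durable write first, then a best-effort fast-channel push."""

    def __init__(self, fast_publish):
        self.log = []                 # stands in for the SQL Server event table
        self.fast_publish = fast_publish

    def append(self, events):
        positions = []
        for event in events:
            self.log.append(event)            # durable write happens first
            positions.append(len(self.log))   # global position = row number here
        # Only after persistence do we push to the fast channel. If this
        # fails, the catch-up subscription still delivers the events, so
        # the push is best-effort.
        for pos, event in zip(positions, events):
            try:
                self.fast_publish(pos, event)
            except Exception:
                pass  # durable channel is the safety net
        return positions
```

Ordering matters: publishing before the durable write could let the read model see an event that is later lost.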

With this solution, the lag between the write model and the read model has practically disappeared, as events are processed almost instantly after they are persisted.
