Chris Dew

writes about software development

UPDATE Considered Harmful

For some applications, typically those where the volume of data is reasonable small (e.g. configuration databases), it is useful to access a database in a INSERT-only fashion. i.e. no UPDATEs or DELETEs.

Every time you perform an UPDATE or DELETE you are throwing away data.

Even if you currently consider the old data to be useless, that may not always be the case.

Amending and Hiding Data

I can imagine your response – OK, how do I amend or hide data?

Assume your domain object is a foo. Normally you would create a table called foo, and that would be the end of the matter.

When using an Event Sourcing design, you have similar table called foo_created which stores the details of foos at the time they were created.

Amendments are simply stored in various other tables, linked to the foo_created table.

Deletetions are simply yet another table, called foo_deleted, linked to the foo_created table.

The Current State of Foo

Getting the current state requires linking the various tables together. This will always be slower that retrieving one record, but you can cache the results if required.

By writing queries with care, the complexity of joining the tables can be kept to O(1). i.e. constant time, regardless of the size of the database.

MySQL Examples

To demonstrate an implementation with MySQL, we will model a simple ‘users’ class, which has an email address, a password hash, the time of last password change, and the time of last login.

This would traditionally be modelled in MySQL as:

1
2
3
4
5
6
7
8
9
CREATE TABLE `traditional_users` (
  `id` INT(10) NOT NULL AUTO_INCREMENT,
  `email` VARCHAR(50) NOT NULL,
  `passwd_hash` VARCHAR(50) NOT NULL,
  `passwd_changed` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `last_login` TIMESTAMP NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  INDEX `email` (`email`)
)

Each time that a user changed their password, or logged in, the old state would be thrown away.

To store the same information in an Event Sourcing system, we need tables to store events, and queries or views to serve up the current state.

The events/tables will be user_created, user_password_changed and user_logged_in.

The schemas for these tables are as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
CREATE TABLE `user_created` (
  `timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `user_id` INT(10) NOT NULL AUTO_INCREMENT,
  `email` VARCHAR(50) NOT NULL,
  `password_hash` VARCHAR(50) NOT NULL,
  PRIMARY KEY (`user_id`)
)

CREATE TABLE `user_password_changed` (
  `pk` INT(10) NOT NULL AUTO_INCREMENT,
  `timestamp` TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
  `user_id` INT(10) NULL DEFAULT NULL,
  `new_password_hash` VARCHAR(50) NULL DEFAULT NULL,
  PRIMARY KEY (`pk`),
  INDEX `Index 2` (`user_id`, `pk`),
  CONSTRAINT `FK_user_password_changed_user_created`
          FOREIGN KEY (`user_id`)
          REFERENCES `user_created` (`user_id`)
)

CREATE TABLE `user_logged_in` (
  `pk` INT(10) NOT NULL AUTO_INCREMENT,
  `timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `user_id` INT(10) NOT NULL,
  PRIMARY KEY (`pk`),
  INDEX `Index 2` (`user_id`, `pk`),
  CONSTRAINT `FK_user_logged_in_user_created`
          FOREIGN KEY (`user_id`)
          REFERENCES `user_created` (`user_id`)
)

Next we need to be able to find the last event, for a particular user_id, efficiently.

The following form of query does that (for user_logged_in):

1
2
3
4
5
6
7
SELECT *
FROM user_logged_in u0,
( SELECT MAX(pk) pk
  FROM user_logged_in
  WHERE user_id = ?
) u1
WHERE u0.pk = u1.pk;

First we’ll form a query which uses user_created as its base, with the password_hash value being superseded by the latest user_password_changed new_password_hash (if any).

1
2
3
4
5
6
7
8
9
10
SELECT timestamp, user_id, email,
COALESCE(( SELECT new_password_hash FROM user_password_changed u0,
           ( SELECT MAX(pk) pk
             FROM user_password_changed
             WHERE user_id = ?
           ) u1
           WHERE u0.pk = u1.pk
         ), password_hash) password_hash,
FROM user_created uc
WHERE uc.user_id = ?

Note: COALESCE is used to give us either the password_hash from user_password_changed or (if the former is NULL) from user_created.

Now we can add in the latest values for the password_changed and logged_in fields.

Here is the Current State Query for a user:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
SELECT timestamp, user_id, email,
COALESCE(( SELECT new_password_hash FROM user_password_changed u0,
           ( SELECT MAX(pk) pk
             FROM user_password_changed
             WHERE user_id = ?
           ) u1
           WHERE u0.pk = u1.pk
         ), password_hash) password_hash,
( SELECT timestamp FROM user_password_changed u0,
  ( SELECT MAX(pk) pk
    FROM user_password_changed
    WHERE user_id = ?
  ) u1
  WHERE u0.pk = u1.pk
) password_changed,
( SELECT timestamp FROM user_logged_in u0,
  ( SELECT MAX(pk) pk
    FROM user_logged_in
    WHERE user_id = ?
  ) u1
  WHERE u0.pk = u1.pk
) last_logged_in
FROM user_created uc
WHERE uc.user_id = ?

Isn’t this Query the Spawn of Satan?

No.

If you run ‘explain’ on it you’ll see that three foreign field selects have a note ‘Select tables optimized away’.

This means that MySQL is not having the look at the tables in order to find out which record contains the latest event for the user_id (for the fields user_password_changed and user_logged_in).

It is still more expensive than a query on the traditional_user table, but it’s not too bad.

Advantages

What can you do with an Event Sourcing design, that you couldn’t with a traditional one?

  • Locking – with no updates, no locking is required, and instead we can move from prohibiting clashes to resolving them later. This works well with sharding and high availablity. It’s what banks do – they don’t use transactions for everything – it’s important that a cash machine works even if its link is down.

  • History – by adding timestamp < X clauses to the Current State Query we can see the state of an object at any point in the past.

  • Undo – because we store every event, we can undo changes by simply adding a new event. We can undo any change, not just the latest change.

  • Debugging – we now know about every event in the system, which makes debugging a breeze.

Links

See Martin Fowler’s page if you are unfamiliar with Event Sourcing.