Data Migrations for NoSQL with Curator

This post is cross-posted at Data Migrations for NoSQL with Curator.

The NoSQL movement has brought us a wave of new data stores beyond the
traditional relational databases. These data stores come with their own
tradeoffs, but they provide some incredible benefits. At
Braintree, we are moving in the direction of using Riak as
our next generation data store. We love its focus on scalability and
availability. Servers can fail without causing any downtime, and we can
add more capacity by simply adding more servers to the cluster.

One great feature of relational databases, however, is the consistency
in the shape of the data. You know if you have a people table, every
row has the same columns. Some fields might be null, but there won’t be
any surprises. Furthermore, if you want to rename or modify a column,
it’s a simple operation. In the case of
PostgreSQL and other databases, a rename is nearly instantaneous. We lose this ability with Riak and most NoSQL
databases. We can easily add attributes (columns), but we cannot easily
rename them or change the data within each document (row).
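
For example, renaming a column in PostgreSQL is a quick catalog change, since no row data has to be rewritten (the column names here are illustrative):

ALTER TABLE people RENAME COLUMN last_name TO surname;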

Since our apps are always evolving at Braintree, we needed a way for our
data to keep up with our code. Our solution is something we’re calling
lazy data migrations, and we’ve built it into our repository and model
framework, curator. You can read more about curator on our blog at Untangle Domain and Persistence Logic with Curator.

The problem

Say we have a collection of people in Riak. This is analogous to a
people table in a relational database. When we first built the app, we added fields for first_name and last_name:

person = Person.new(:first_name => "Joe", :last_name => "Smith")  

Some time has passed, our app has data, and we now realize that names
are a pain. What do we do with middle names? What about people with
multiple first or last names? We want to simplify the system and collect just a name; we no longer care about a separate first and last name. The problem is that we have a ton of data in the old format. How do we handle the fact that old records have a first_name and last_name when, going forward, we want just name?

In a relational database, we would simply write a database migration
that looks like:

ALTER TABLE people ADD COLUMN name VARCHAR;  
UPDATE people SET name = first_name || ' ' || last_name;  
ALTER TABLE people DROP COLUMN first_name, DROP COLUMN last_name;  

This migration might take a while to run, but once it’s done, we know
that all data has been migrated. We can then change all of our code to
only deal with name, knowing we no longer have first_name or
last_name.

In a NoSQL database like Riak, we cannot simply change the schema. We
have to come up with a different solution. Here are the steps we went
through in trying to come up with the solution that made its way into
curator:

Solution attempt 1: Scattered conditionals

The first solution is to make the Person class smart enough to handle
both cases.

class Person  
  attr_accessor :first_name, :last_name, :name
end  

We can populate whatever fields we get back from the data store. Then,
when we want to do something with the name, we have to use code like:

if person.name  
  puts "Name is #{person.name}"
else  
  puts "Name is #{person.first_name} #{person.last_name}"
end  

The problem with this approach is that we have to use branching code
like this whenever we want to use the name. It quickly gets messy.

Solution attempt 2: Gathered conditionals

The second solution is to move this logic to the place where we read the
Person out of the data store:

attributes = fetch_from_riak  
if attributes[:name]  
  person = Person.new(:name => attributes[:name])
else  
  person = Person.new(:name => "#{attributes[:first_name]} #{attributes[:last_name]}")
end  

Now, we only have to handle the old format in one place, and we can change our Person class to know only about name.

This solution works well, but what happens a year down the road when
we’ve made lots of data changes to many different models? We don’t want
a bunch of conditionals all over our persistence code.

Our solution: Lazy data migrations

We took the idea from solution 2 and formalized it as a migration (similar to ActiveRecord migrations). Migrations target a given collection at a given version. They look like this:

class ConsolidateName < Curator::Migration  
  def migrate(attributes)
    first_name = attributes.delete(:first_name)
    last_name = attributes.delete(:last_name)
    attributes.merge(:name => "#{first_name} #{last_name}")
  end
end  
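
Since a migration simply takes a hash of attributes and returns a new one, you can try it out directly in a console (constructing it with its version number, as in the test below):

migration = ConsolidateName.new(1)
migration.migrate(:first_name => "Joe", :last_name => "Smith")
#=> {:name => "Joe Smith"}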

This migration is stored in
db/migrate/people/0001_consolidate_name.rb. We’ve also added the concept of a version to each Model. By default, models start at version
0. When a model is read from the Repository, its attributes are run through any migrations with a greater version (based on the number in the migration's filename):

person = PersonRepository.find_by_key("person_id")  
person.version #=> 1  
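
Under the hood, the idea is straightforward. Here is a simplified sketch of the lazy migration step (illustrative only, not curator's actual internals); it assumes each migration object knows its version and responds to migrate:

def migrate_attributes(attributes, migrations)
  version = attributes[:version] || 0
  migrations.sort_by(&:version).each do |migration|
    # Only apply migrations newer than the document's stored version
    next unless migration.version > version
    attributes = migration.migrate(attributes)
    version = migration.version
  end
  attributes.merge(:version => version)
end

migrate_attributes({:first_name => "Joe", :last_name => "Smith"}, [ConsolidateName.new(1)])
#=> {:name => "Joe Smith", :version => 1}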

Now, the migration logic is isolated from the rest of the application.
The rest of the app can safely assume that all Person objects have
only a name:

class Person  
  current_version 1
  attr_accessor :name
end  

We mark the Person class with current_version 1 to signify that new
instances start at version 1, since they have a name attribute rather
than first_name/last_name.

These migrations run when models are read, so they are lazy: data is migrated as it's used and written back in the new format when it is saved. This means that, unlike with relational databases, the website can stay up and serve requests while the data is migrated.
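
Concretely, with the example above (return values illustrative):

# A version 0 document is migrated in memory when it is read...
person = PersonRepository.find_by_key("person_id")
person.name    #=> "Joe Smith"
person.version #=> 1

# ...and persisted in the new shape once it is saved.
PersonRepository.save(person)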

If you want to force the data to migrate (rather than wait for all of it to be used), you can simply find the models that haven't been migrated and save them. The version attribute is indexed by default:

PersonRepository.find_by_version(0).each do |person|  
  PersonRepository.save(person)
end  

Testing

Unlike ActiveRecord migrations, curator migrations have no side effects.
They simply accept a hash and return a new hash. This makes them easy to
call from a unit test:

require 'spec_helper'  
require 'db/migrate/people/0001_consolidate_name'

describe ConsolidateName do  
  describe "migrate" do
    it "concatenates first_name and last_name" do
      attributes = {:first_name => "Joe", :last_name => "Smith"}
      ConsolidateName.new(1).migrate(attributes)[:name].should == "Joe Smith"
    end
  end
end  

Limitations

Curator migrations are lazy, so at any given time you might have
documents with different versions in the data store. This is not
normally a problem since the migrations will run as soon as the objects
are read. However, if you add a migration that changes an indexed field,
you cannot rely on that index to return all of the correct values until
you migrate them all. In this case, you might want to force migration by
reading and saving all of the documents.
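
As a hypothetical example, suppose a later version 2 migration renames an indexed email field to email_address (the finder name here is assumed, following the find_by_version pattern above):

# Hypothetical: documents still at version 1 were indexed under the old
# field, so the new index may miss them until they are migrated:
PersonRepository.find_by_email_address("joe@example.com")

# Force-migrate the stragglers first, then the index can be trusted:
PersonRepository.find_by_version(1).each do |person|
  PersonRepository.save(person)
end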

Next Steps

You can see these migrations in action in the
curator_rails_example.

Let us know what you think about lazy data migrations in
curator. Feel free to open issues on GitHub, submit pull requests, and help us make it better.