Sep 14, 2015
 

One great new feature of the upcoming Java 9 release is JShell, a Java REPL. Java is long overdue for an official REPL, so I was eager to try it out. I found this article to be a great starting point: Java 9 Early Access: A Hands-on Session with JShell – The Java REPL.

It’s easy to interact with the built-in Java classes out of the box, but JShell becomes a lot more useful if you can interact with your existing application code. We use Bazel at Braintree, so I decided to add a Bazel target that would let me run my app code in JShell.

First, I downloaded the JShell (codenamed Kulla) jar from the project’s Jenkins server and added it to the third_party directory:

% cd third_party
% wget https://adopt-openjdk.ci.cloudbees.com/view/OpenJDK/job/langtools-1.9-linux-x86_64-kulla-dev/lastSuccessfulBuild/artifact/kulla-0.819-20150913005850.jar

Then, I imported this jar in Bazel by adding this to third_party/BUILD:

java_import(
  name = "kulla_jshell",
  jars = ["kulla-0.819-20150913005850.jar"],
)

Finally, I added a java_binary target to the app. This target sets the main class to JShell and adds both the app and kulla as dependencies:

java_binary(
  name = "repl",
  main_class = "jdk.internal.jshell.tool.JShellTool",
  runtime_deps = [
    ":app",
    "//third_party:kulla_jshell",
  ],
)

One issue is that JShell requires Java 9, but Bazel does not currently support Java 9. Bazel-built jars are compatible, however, so we can build a deploy jar with Bazel on Java 8 and then run it with Java 9:

% bazel build //app:repl_deploy.jar
 
% /usr/lib/jvm/java-9-oracle/bin/java -jar bazel-bin/app/repl_deploy.jar
|  Welcome to JShell -- Version 0.819
|  Type /help for help
 
-> import app.*
 
-> App a = new App()
|  Added variable a of type App with initial value app.App@1761e840
 
-> a.go()
|  Expression value is: "go go go"
|    assigned to temporary variable $3 of type String

Now, I can interactively play with app code in a REPL. Once Bazel adds support for Java 9, it will be even easier.

Sep 01, 2015
 

At Braintree, we recently switched a collection of Java applications and libraries (in a monorepo) from Gradle to Bazel. Overall, the transition went well, and we are much happier with Bazel than we were with Gradle.

The problems with Gradle

We had a number of issues with Gradle that led us to seek alternate build tools:

Gradle is slow

One of our biggest issues with Gradle was the speed. Everything felt sluggish and unnecessarily slow. For example, here are some comparisons between Gradle and Bazel:

With no files changed, rerunning build:

% time ./gradlew compileJava compileTestJava
10.547 total
 
% time bazel build //...
0.520 total

Running all of the tests takes half the time:

% time ./gradlew test
1:58.84 total
 
% time bazel test //...
1:01.14 total

Changing a single file, recompiling and rerunning dependent tests also takes half the time:

% time ./gradlew test
25.802 total
 
% time bazel test //...
12.860 total

Gradle is just slow to figure out what has changed and what needs to be run. It’s probably due to the number of subprojects (modules) we have inside this monorepo (about 20). Bazel seems to handle this case much better, and this is before we have even optimized our build dependencies. Gradle’s support for parallel execution is still incubating, and even with it on, we didn’t notice much improvement.

Gradle’s build language is Groovy

Maybe Groovy is a great language on its own merits, but it’s a difficult build language for a team that isn’t familiar with it. Most of us are familiar with build languages in the host language (e.g. Rake and Ruby), or common scripting languages like Bash, Python, etc. Groovy adds yet another thing to learn, and has its own quirks.

For example, here is how you execute a command and redirect standard output in a Gradle task:

task myTask(type: Exec) {
    commandLine 'someCommand', 'param1'
 
    doFirst {
        standardOutput = new FileOutputStream('output.txt')
    }
}

This is pretty different from a simple someCommand param1 > output.txt. Gradle doesn’t prevent us from writing Bash scripts, but if we want custom commands to run as dependencies of other tasks, it is harder to keep them as separate scripts.

Bazel’s build languages (Core and Skylark) are subsets of Python, so they are more familiar to us. They also aren’t general-purpose languages, which forces us to write our scripts in actual scripting languages. That means we’re using the right tool for the job, instead of Gradle for everything.
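
When we do need to run a command as part of the build graph, a genrule lets us write it as plain shell. Here is a minimal sketch of the earlier example as a genrule (the target and output names are made up for illustration):

genrule(
  name = "my_task",
  outs = ["output.txt"],
  cmd = "someCommand param1 > $@",  # $@ expands to the declared output file
)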

Gradle is error prone

We found Gradle to be full of gotchas. For example, we had a section in our build.gradle like this to turn on compiler warnings:

subprojects {
  compileJava {
    options.compilerArgs << "-Xlint:all"
    options.compilerArgs << "-Xlint:-processing"
    options.compilerArgs << "-Werror"
  }
 
  apply plugin: "java"
}

It turned out that since apply plugin: "java" came after the compileJava block, the options were never applied; moving the apply plugin line above the compileJava block fixed it. There was no error or warning: Gradle just silently ignored our code.

We also ran into a lot of frustrations with the Gradle IntelliJ IDEA plugin. It would sometimes fail to refresh after we made changes to our build files (even with auto-import turned on), and then it was hard to get it back into a good state. We’d often have to manually synchronize the project, or even bounce IntelliJ.

Another frustration was libraries pulling in transitive dependencies that conflicted with each other (for example, kafka -> slf4j-log4j12). We were able to fix this in Gradle with code like:

dependencies {
  compile("org.apache.kafka:kafka_2.11:0.8.2.1") {
    exclude module: "slf4j-log4j12"
  }
}

But this didn’t affect the IntelliJ Gradle plugin. To fix that one, we had to manually check in a file called .idea/libraries/Gradle__org_slf4j_slf4j_log4j12_1_6_1.xml:

<component name="libraryTable">
  <library name="Gradle: org.slf4j:slf4j-log4j12:1.6.1">
    <CLASSES />
    <JAVADOC />
    <SOURCES />
  </library>
</component>

This worked, but if you accidentally imported the project again, or made any other manual changes to dependencies, IntelliJ would remove this file and the error would return.

In the Bazel world, every dependency is explicitly stated. In our case, we just left out the conflicting library and only included the good one. It was much simpler.
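
For example, a third_party BUILD entry for kafka can list exactly the jars we want and nothing else (a sketch; the target and external repository names are illustrative, not our exact ones):

java_library(
  name = "kafka",
  visibility = ["//visibility:public"],
  exports = [
    "@org_apache_kafka_kafka_2_11//jar",
    "@org_slf4j_slf4j_api//jar",
    # slf4j-log4j12 is simply never listed, so it cannot sneak in
  ],
)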

More benefits of Bazel

Bazel brings more to the table than just fixing our issues with Gradle. Here are a few notable features:

Docker support

Bazel recently announced support for building Docker images (i.e. directly in Bazel, without Dockerfiles): Building deterministic Docker images with Bazel

With Bazel, we were able to create a macro that unified a bunch of our Docker config. Now, our apps can build Docker images with only a few lines of config (and no Dockerfile):

app_docker_image(
  java_binary = "myapp",
  main_class = "braintree.myapp.MyApplication",
)
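
The macro itself is ordinary Skylark. Here is a trimmed-down sketch of the idea (the docker_build load path, base image target, and library target are illustrative and depend on your Bazel version; the real macro handles more configuration):

load("//tools/build_defs/docker:docker.bzl", "docker_build")

def app_docker_image(java_binary, main_class):
  # Build the app as a deployable fat jar.
  native.java_binary(
    name = java_binary,
    main_class = main_class,
    runtime_deps = [":lib"],  # hypothetical target holding the app's code
  )

  # Package the deploy jar into a Docker image.
  docker_build(
    name = java_binary + "_docker",
    base = "//third_party/docker:java_base",  # hypothetical base image target
    files = [":" + java_binary + "_deploy.jar"],
    entrypoint = ["java", "-jar", "/" + java_binary + "_deploy.jar"],
  )
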
Bazel query

Bazel supports a powerful query language. For example, if you change a library, you can query and run all of the dependent tests (API changes with extra cheese, hold the fear):

bazel test $(bazel query 'kind(test, rdeps(//..., //mylibrary))')

You can also query and graph your dependencies (Have you ever looked at your build? I mean, really looked at your build?):

bazel query 'deps(//:main)' --output graph > graph.in
 
dot -Tpng < graph.in > graph.png

Test tagging

Bazel lets you specify the size of your test suites (small, medium, etc.), and it will alert you if a suite exceeds its allotted time. This helps keep test times under control.

There are also a handful of special behavior tags, such as exclusive, manual, external, and flaky: http://bazel.io/docs/bazel-user-manual.html#tags_keywords
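
In a BUILD file, this is just a couple of attributes on the test rule. A made-up example:

java_test(
  name = "payments_integration_test",
  size = "medium",        # Bazel alerts if the suite runs past the timeout for its size
  tags = ["exclusive"],   # don't run this suite concurrently with other tests
  srcs = ["PaymentsIntegrationTest.java"],
  deps = [":app"],
)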

How we switched

Once we decided to switch, we had to do a number of things to cut over.

Porting the build files

The first step was porting our existing build.gradle files over to BUILD files. We went through a few iterations on scripting this, and the latest version is on GitHub: bazel-deps.

With this tool, we were able to generate the majority of our WORKSPACE and BUILD files. We explicitly depend on fewer than 30 libraries (from Maven), but the transitive dependencies come out to over 200! For example, dropwizard alone brings in about 80 dependent libraries.
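
The WORKSPACE side of this boils down to maven_jar rules, one per dependency (direct or transitive). A couple of entries of the general shape (the coordinates and versions here are examples, not our exact pins):

maven_jar(
  name = "io_dropwizard_dropwizard_core",
  artifact = "io.dropwizard:dropwizard-core:0.8.2",
)

maven_jar(
  name = "org_slf4j_slf4j_api",
  artifact = "org.slf4j:slf4j-api:1.7.12",
)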

IntelliJ

We wrote a script to create an IntelliJ project from our Bazel BUILD files. We based it on the script that comes with Bazel (setup-intellij.sh) but modified it for our needs.

For example, our script generates an IntelliJ project with modules for each of our subprojects. This means we need to create a top level modules.xml, a subproject.iml for each project, and then add dependent projects as project level dependencies. Our script also adds Bazel generated files as dependencies to the subprojects, such as generated protobuf classes.

Aliases

We added some aliases for common Bazel commands:

alias bzb="bazel build //..."
alias bzt="bazel test //..."

Aftermath

We’ve now been on Bazel exclusively for a few weeks (after running both builds in parallel for a while). The cutover was a little painful while we figured out things like CI (Bazel is a memory hog by default, and our build boxes are memory-constrained). We also had to migrate all of our processes, tools, READMEs, etc. over to Bazel. Inevitably, we missed a few at first, which caused headaches for some of the team. Thankfully, we seem to be past all of this, and we’re all happily running Bazel now.

If you’re interested in more cool Bazel features, check out the Bazel Blog.

Apr 23, 2015
 

I’ve been doing some Java lately, and the new functional additions in Java 8 are interesting. Java still has a long way to go, but these additions make functional programming in Java easier. For example, we can now do simple partial application.

For reference, here is what partial application looks like in Clojure (using the built-in partial function):

(defn add [x y] (+ x y))
 
(def adder (partial add 5))
 
(adder 1)
;; 6

And here is a simple implementation with Java 8:

public class Example {
    public static int add(int x, int y) {
        return x + y;
    }
 
    public static <T, U, R> Function<U, R> partial(BiFunction<T, U, R> f, T x) {
        return (y) -> f.apply(x, y);
    }
 
    public static void main(String[] args) {
        Function<Integer, Integer> adder = partial(Example::add, 5);
        System.out.println(adder.apply(2)); // 7
    }
}

We can also define functions using lambdas:

BiFunction<Integer, Integer, Integer> minus = (x, y) -> x - y;
Function<Integer, Integer> subtractor = partial(minus, 10);
System.out.println(subtractor.apply(4)); // 6

This Java version is more limited than the Clojure one. It can only take a two argument function, whereas Clojure’s supports arbitrarily many arguments.

Java has a BiFunction (function with two arguments), but no TriFunction or more. If we want to write a version of partial that accepts more arguments, we need to write our own TriFunction:

@FunctionalInterface
interface TriFunction<T, U, V, R> {
    R apply(T a, U b, V c);
}

Then, we can make a version of partial that accepts a TriFunction:

public static <T, U, V, R> Function<V, R> partial(TriFunction<T, U, V, R> f, T x, U y) {
    return (z) -> f.apply(x, y, z);
}

And use it in much the same way:

Function<Integer, Integer> adder3 = partial(Example::add3, 1, 2);
System.out.println(adder3.apply(3)); // 6

This version allows us to pass in two arguments to partial, but it doesn’t allow us to pass in a single argument and return a function that takes two arguments:

BiFunction<Integer, Integer, Integer> adder3_2 = partial(Example::add3, 1);
System.out.println(adder3_2.apply(2, 3)); // 6

To implement this, we need yet another version of partial:

public static <T, U, V, R> BiFunction<U, V, R> partial(TriFunction<T, U, V, R> f, T x) {
    return (y, z) -> f.apply(x, y, z);
}

The moral of the story is that Java 8 allows us to implement partial, but we have to implement all of the variations separately.

Here is the full example:

@FunctionalInterface
interface TriFunction<T, U, V, R> {
    R apply(T a, U b, V c);
}
 
public class Example {
    public static int add(int x, int y) {
        return x + y;
    }
 
    public static int add3(int x, int y, int z) {
        return x + y + z;
    }
 
    public static <T, U, R> Function<U, R> partial(BiFunction<T, U, R> f, T x) {
        return (y) -> f.apply(x, y);
    }
 
    public static <T, U, V, R> Function<V, R> partial(TriFunction<T, U, V, R> f, T x, U y) {
        return (z) -> f.apply(x, y, z);
    }
 
    public static <T, U, V, R> BiFunction<U, V, R> partial(TriFunction<T, U, V, R> f, T x) {
        return (y, z) -> f.apply(x, y, z);
    }
 
    public static void main(String[] args) {
        Function<Integer, Integer> adder = partial(Example::add, 5);
        System.out.println(adder.apply(1)); // 6
 
        BiFunction<Integer, Integer, Integer> minus = (x, y) -> x - y;
        Function<Integer, Integer> subtractor = partial(minus, 10);
        System.out.println(subtractor.apply(4)); // 6
 
        Function<Integer, Integer> adder3 = partial(Example::add3, 1, 2);
        System.out.println(adder3.apply(3)); // 6
 
        BiFunction<Integer, Integer, Integer> adder3_2 = partial(Example::add3, 1);
        System.out.println(adder3_2.apply(2, 3)); // 6
    }
}

May 14, 2014
 

This post is cross-posted at Safe Operations For High Volume PostgreSQL.

We use PostgreSQL extensively at Braintree, and it backs many of our highly available services (including our main payments API).

We are constantly building and refining our products, and this often means evolving our database schema. In general, PostgreSQL is great at this, and we can make many different types of schema changes without downtime. There are some gotchas, however, that this post will cover.

Background

We almost never take scheduled downtime for our payments API. This means we run our database schema migrations while the gateway is up and serving requests, so we have to be very careful about which database operations we run. If we run a bad command, it can lock out updates to a table for a long time.

For example, if we create a new index on our customers table, we cannot create new customers while that index is building. Anyone who tries to perform a customer create will block, and possibly time out, causing a partial outage.

In general, we are ok with database operations taking a long time. However, any operation that locks a table for updates for more than a few seconds means downtime for us.

You can learn more about our high availability approaches: Ruby Conf Australia: High Availability at Braintree

We derived the lists below through extensive testing and trial and error.

The good

Here’s what we can safely do in a migration without downtime:

  • Add a new column
  • Drop a column
  • Add an index concurrently
  • Drop a constraint (for example, non-nullable)
  • Add a default value to an existing column

The bad

Here’s what we cannot do on big tables, and our current workarounds:

  • Add an index: add the index using the CONCURRENTLY keyword
  • Change the type of a column: add a new column, change the code to write to both columns, and backfill the new column
  • Add a column with a default: add the column, add the default as a separate command, and then backfill the column with the default value
  • Add a column that is non-nullable: create a new table with the addition of the non-nullable column, write to both tables, backfill, and then switch to the new table [1]
  • Add a column with a unique constraint: add the column, add a unique index concurrently, and then add the constraint onto the table [2]
  • VACUUM FULL [3]: we use pg_repack instead

If you have other workarounds, please share them in the comments.

  1. This workaround is incredibly onerous, and we very rarely do it. We try to be really careful when creating new tables and figuring out the initial set of non-nullable columns.
  2. For example:

    CREATE UNIQUE INDEX CONCURRENTLY token_is_unique ON large_table(token);
    ALTER TABLE large_table ADD CONSTRAINT token UNIQUE USING INDEX token_is_unique;
  3. Dropping a column is very quick, but PostgreSQL won’t reclaim the disk space until you run a “VACUUM FULL” or use another tool like pg_repack.

Dec 20, 2013
 

I had the opportunity this year to present my talk on “High Availability at Braintree” at four conferences. Here is the roundup:

I updated the content between presentations, so I’d recommend the Velocity version if you haven’t seen the slides yet: High Availability at Braintree.

Dec 20, 2013
 

We primarily use Mingle for project management at Braintree. Personally, I find that the Mingle card wall wastes too much space. Each card is the same size, so short cards are filled with empty space. I prefer a denser view, so I made a Greasemonkey script which changes the card heights.

It changes the card wall from this:

to this:

You can find the code and installation instructions on GitHub: remove_mingle_card_height

Oct 04, 2013
 

This post is cross-posted at Development Hacks to Prevent Mistakes.

Bugs are an inevitable part of software development. We do our best to write higher quality software, but we never fully escape releasing bugs into production. At Braintree, we deal with payments, so we’re extra sensitive to bugs. Therefore, we’ve made some interesting choices on how to fight back. In some cases, we’ve done some crazy hacks in the name of code safety. I’ll cover the most interesting ones in this post.

Scoping queries to a merchant

Braintree allows merchants to accept payments online. Almost everything we do is scoped to a merchant. Each merchant is distinct, and carries all of its own data. For example, one merchant’s vaulted credit cards, transaction history, and more should not be visible to other merchants. We are very serious about this isolation, but it’s easy to introduce bugs that would violate it. For example:

class CustomerController < ApplicationController
  def show
    @customer = Customer.where(:token => params[:token]).first
  end
end

This innocuous-looking code has a big problem: if the token in the URL belongs to a different merchant, the code will happily show that merchant’s customer to the end user. The simple fix is to use something like this instead (assuming we’ve already looked up the user’s merchant):

@customer = @merchant.customers.where(:token => params[:token]).first

The problem is that this is an easy mistake to make. Since we know people are fallible, and are going to make mistakes, we added a monkey patch to prevent these unscoped finds:

>> Customer.where(:token => 'Paul')
RuntimeError: #finds must be scoped on Customer
 
>> merchant = Merchant.find(1)
>> merchant.customers.where(:token => 'Paul')
[#<Customer id: 1 ...

We whitelist models (such as Merchant and User), and if there are cases where we really need to perform an unscoped find, we also have a backdoor:

>> Customer.allow_unscoped_find.where(:token => 'Paul')

This explicitly calls out that we’re intending to do an unscoped find, rather than using one by accident.

The code looks like this: scoped_find.rb

Scoped find hook

The code above makes sure that we scope ActiveRecord methods to a merchant, but it doesn’t cover all cases. For example, it doesn’t stop code like this:

Customer.find_by_sql(["SELECT * FROM customers WHERE token = ?", "Paul"])

We write a decent amount of custom SQL in our application for performance, and we wanted to make sure these cases were safe as well. Therefore, we have a similar hack called the scoped find hook that checks ActiveRecord objects as they load and makes sure they all have the correct merchant. Since our URLs are scoped by merchant (/merchants/<merchant_id>/customers/<token>), we can check each object loaded from the database against the merchant specified in the URL:

>> RequestContext.with_merchant(Merchant.find(1)) { Customer.find(1) }
ScopedFindHook::ScopeError: Customer cannot return objects scoped by the incorrect merchant. Got 6, expected 1.

In this case, RequestContext holds the merchant from the URL, populated by the ApplicationController.

The code looks like this: scoped_find_hook.rb

Recurring billing consistency

Braintree has a recurring billing system that takes care of automatic billing on a schedule. While it seems simple at first, recurring billing is actually quite complicated. For example, a subscription can be past due for several months, accruing a balance. Then, when the credit card is updated, or the merchant chooses to wipe the balance, we need to reactivate the subscription and get it back into the monthly billing cycle. This involves updating the balance, paid_through_date, next_billing_date, billing_period and more.

These complex operations are thoroughly tested, but we’ve still had bugs creep in. This is another case where we wanted runtime safety, so we added a consistency check. When saving a Subscription, we run a series of consistency checks, and if any of them fail, we alert and abort. Some of these checks are:

  • Billing period end date is after the billing period start date
  • If the Subscription is past due, then the failure count is less than the maximum number of retries
  • The next_billing_date is one billing period greater than the last billing date

If a check fails, we can investigate the cause. Sometimes, our consistency checks miss an edge case and we update them. Other times, we’ve caught a legitimate bug. Then, we fix the bug, deploy, and let the next run of recurring billing pick up the subscription.

The implementation is very simple: subscription.rb

Sanity Specs

Besides runtime checks, we also have a lot of development time checks. We call these sanity specs, and the goal is to check for developer mistakes during development. For example, we have a public_id column on most of our tables that represents the externally facing identifier (instead of using our internal database ids). Since we almost always look objects up by this id, we want to make sure it’s always indexed. We can codify this requirement into a spec that will fail if we forget the index:

require 'spec_helper'

describe 'Sanity Specs' do
  describe "database tables" do
    they "have a unique index on the public_id column" do
      indexes = ActiveRecord::Base.connection.select_values("SELECT relname FROM pg_class WHERE relkind = 'i'")
      ActiveRecord::Base.connection.tables.each do |table|
        # Find the model backing this table; skip tables without one (e.g. schema_migrations).
        klass = ActiveRecord::Base.descendants.detect { |k| k.table_name == table }

        if klass && klass.has_public_id?
          indexes.grep(/index_#{table}_on.*public_id/).size.should eql(1), "#{table} does not have an index on public_id"
        end
      end
    end
  end
end

Here are some other sanity specs we have:

  • Database tables have non-null created_at/updated_at columns
  • Spec files and app files are named consistently
  • Crontabs have a blank line at the end (to work around an old Ubuntu bug)
  • Every controller defines authorization_data (used for our authorization framework)
  • Migrations do not rename columns, since this cannot be done while the app is running *
  • Migrations do not drop columns unless they are marked in the app as deleted *

* See Ruby Conf Australia: High Availability at Braintree for why these are dangerous operations

Our rough policy is that if you get bitten by something writing code, and others are likely to fall into the same trap, we write a sanity spec to prevent future problems.

Oct 16, 2012
 

This post is cross-posted at Scaling PostgreSQL at Braintree: Four Years of Evolution.

We love PostgreSQL at Braintree. Although we use many different data stores (such as Riak, MongoDB, Redis, and Memcached), most of our core data is stored in PostgreSQL. It’s not as sexy as the new NoSQL databases, but PostgreSQL is consistent and incredibly reliable, two properties we value when storing payment information.

We also love the ad-hoc querying that we get from a relational database. For example, if our traffic looks fishy, we can answer questions like “What is the percentage of Visa declines coming from Europe?” without having to pre-compute views or write complex map/reduce queries.

Our PostgreSQL setup has changed a lot over the last few years. In this post, I’m going to walk you through the evolution of how we host and use PostgreSQL. We’ve had a lot of help along the way from the very knowledgeable people at Command Prompt.

2008: The beginning

Like most Ruby on Rails apps in 2008, our gateway started out on MySQL. We ran a couple of app servers and two database servers replicated using DRBD. DRBD uses block level replication to mirror partitions between servers. This setup was fine at first, but as our traffic started growing, we began to see problems.

2010: The problems with MySQL

The biggest problem we faced was that schema migrations on large tables took a very long time with MySQL. As our dataset grew, our deploys started taking longer and longer. We were iterating quickly, and our schema was evolving; we couldn’t afford to keep taking downtime whenever we changed our schema or even just added a new index to a large table.

We explored various options with MySQL (such as oak-online-alter-table), but decided that we would rather move to a database that supported online schema changes directly. We were also starting to see deadlock issues with MySQL on operations we felt shouldn’t deadlock. PostgreSQL solved this problem as well.

We migrated from MySQL to PostgreSQL in the fall of 2010. You can read more about the migration in the slides from my PgEast talk. PostgreSQL 9.0 had recently been released, but we chose to go with version 8.4 since it had been out longer and was better known.

2010 – 2011: Initial PostgreSQL

We ran PostgreSQL on modest hardware, and we kept DRBD for replication. This worked fine at first, but as our traffic continued to grow, we needed some upgrades. Unlike most applications, we are much heavier on writes than reads. For every credit card that we charge, we store a lot of data (such as customer information, raw responses from the processing networks, and table audits).

Over the next year, we performed the following upgrades:

  • Tweaked our configs around checkpoints, shared buffers, work_mem and more (this is a great start: Tuning Your PostgreSQL Server)
  • Moved the Write Ahead Log (WAL) to its own partition (so fsyncs of the WAL don’t flush all of the dirty data files)
  • Moved the WAL to its own pair of disks (so the sequential writes of the WAL are not slowed down by the random read/write of the data files)
  • Added more RAM
  • Moved to better servers (24 cores, 16 disks, even more RAM)
  • Added more RAM again (kept adding to keep the working set in RAM)

Fall 2011: Sharding

These incremental improvements worked great for a long time, and our database was able to keep up with our ever increasing volume. In the summer of 2011, we started to feel like our traffic was going to outgrow a single server. We could keep buying better hardware, but we knew there was a limit.

We talked about a lot of different solutions, and in the end, we decided to horizontally shard our database by merchant. A merchant’s traffic would all live on one shard to make querying easier, but different merchants would live on different shards.

We used data_fabric to introduce sharding into our Rails app. data_fabric lets you specify which models are sharded, and gives you methods for activating a specific shard. In conjunction with data_fabric, we also wrote a fair amount of custom code for sharding. We sharded every table except for a handful of global tables, such as merchants and users. Since almost every URL has the merchant id in it, we were able to activate shards in application_controller.rb for 99% of our traffic with code that looked roughly like:

class ApplicationController < ActionController::Base
  around_filter :activate_shard
 
  def activate_shard(&block)
    merchant = Merchant.find_by_public_id(params[:merchant_id])
    DataFabric.activate_shard(:shard => merchant.shard, &block)
  end
end

Making our code work with sharding was only half the battle. We still had to migrate merchants to a different shard (without downtime). We did this with londiste, a trigger-based replication tool from the Skytools project. We set up the new database servers and used londiste to mirror the entire database between the current cluster (which we renamed to shard 0) and the new cluster (shard 1).

Then, we paused traffic[1], stopped replication, updated the shard column in the global database, and resumed traffic. The whole process was automated using capistrano. At this point, some requests went to the new database servers, and some to the old. Once we were sure everything was working, we removed the shard 0 data from shard 1 and vice versa.

The final cutover was completed in the fall of 2011.

Spring 2012: DRBD Problems

Sharding took care of our performance problems, but in the spring of 2012, we started running into issues with our DRBD replication:

  • DRBD made replicating between two servers very easy, but more than two required complex stacked resources that were harder to orchestrate. It also required more moving pieces, like DRBD Proxy to prevent blocking writes between data centers.
  • DRBD is block level replication, so the filesystem is shared between servers. This means it can never be unmounted and checked (fsck) without taking downtime. We became increasingly concerned that filesystem corruption would go unnoticed and corrupt all servers in the cluster.
  • The filesystem can only be mounted on the primary server, so the standby servers sit idle. It is not possible to run read-only queries on them.
  • Failover required unmounting and remounting filesystems, so it was slower than desired. Also, since the filesystem had been unmounted on the target server, the filesystem cache was empty once it came back up. This meant that our backup PostgreSQL was slow after failover, and we would see slow requests and sometimes timeouts.
  • We saw a couple of issues in our sandbox environment where DRBD issues on the secondary prevented writes on the primary node. Thankfully, these never occurred in production, but we had a lot of trouble tracking down the issue.
  • We were still using manual failover because we were scared of the horror stories with Pacemaker and DRBD causing split brain scenarios and data corruption. We wanted to get to automated failover, however.
  • DRBD required a kernel module, so we had to build and test a new module every time we upgraded the kernel.
  • One upgrade of DRBD caused a huge degradation in write performance. Thankfully, we discovered the issue in our test environment, but it was another reason to be wary of kernel-level replication.

Given all of these concerns, we decided to leave DRBD replication and move to PostgreSQL streaming replication (which was new in PostgreSQL 9). We felt it was a better fit for what we wanted to do: we could replicate to many servers easily, standby servers were queryable (letting us offload some expensive queries), and failover was very quick.

We made the switch during the summer of 2012.

Summer 2012: PostgreSQL 9.1

We updated our code to support PostgreSQL 9.1 (which involved very few code changes). Along with the upgrade, we wanted to move to fully automated failover. We decided to use Pacemaker and these great open source scripts for managing PostgreSQL streaming replication: https://github.com/t-matsuo/resource-agents/wiki. These scripts handle promotion, moving the database IPs, and even switching from sync to async mode if there are no more standby servers.

We set up our new database clusters (one per shard). We used two servers per datacenter, with synchronous replication within the datacenter and asynchronous replication between our datacenters. We configured Pacemaker and had the clusters ready to go (but empty). We performed extensive testing on this setup to fully understand the failover scenarios and exactly how Pacemaker would react.

We used londiste again to copy the data. Once the clusters were up to date, we did a similar cutover: we paused traffic, stopped londiste, updated our database.yml, and then resumed traffic. We did this one shard at a time, and the entire procedure was automated with capistrano. Again, we took no downtime.

Fall 2012: Today

Today, we’re in a good state with PostgreSQL. We have fully automated failover between servers (within a datacenter). Our cross datacenter failover is still manual since we want to be sure before we give up on an entire datacenter. We have automated capistrano tasks to orchestrate controlled failover using Pacemaker and traffic pausing. This means we can perform database maintenance with zero downtime.

One of our big lessons learned is that we need to continually invest in our PostgreSQL setup. We’re always watching our PostgreSQL performance and making adjustments where needed (new indexes, restructuring our data, config tuning, etc). Since our traffic continues to grow and we record more and more data, we know that our PostgreSQL setup will continue to evolve over the coming years.

[1] For more info on how we pause traffic, check out How We Moved Our Data Center 25 Miles Without Downtime and High Availability at Braintree