Over my career so far, I’ve worked in a number of payments companies, including several startups. At the last startup, I was involved in building out a payments platform from scratch (from first line of code). This post is a collection of thoughts and lessons learned. Hopefully, at least some of this is useful to others.
The sections are relatively independent, so here are some quick links:
- Use The Tools You Have
- Optimize For Change
- Focus on Iteration
- Modular Monolith
- Put Everything in the Database
- Make It Easy To Query The Database
- Job Drain Pattern
- Check in Generated Files
- Decision Logs
- Continuous Deployment
Use The Tools You Have (Before Adding New Tools)
Every new tool, language, database, etc. adds an enormous amount of complexity. You have to set it up and manage it (even managed offerings still require work), integrate with it, learn the ins and outs (often only after it’s failed in some way), and you will discover things you didn’t even know to think about.
So before I reach for something new, I try to use what we have, even if it’s not the optimal thing. For example, my projects have often used PostgreSQL as the database. PostgreSQL is quite full featured, so I try to use it for as much as possible. This includes job queues, search, and even simple caching (e.g. a table that stores temporary values which get cleared out over time). It’s not necessarily the ideal platform for these, but it’s so much easier to manage the one database than a whole suite of data systems. At some point, the app will outgrow PostgreSQL’s capability for one or more of these, but even deferring that decision and work is hugely valuable.
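To make the “simple caching in the database” idea concrete, here is a rough sketch: a table of temporary values with an expiry column, purged periodically. This is only an illustration under assumptions; the projects described used PostgreSQL, but `sqlite3` from the Python standard library stands in here so the snippet is self-contained, and the table and key names are invented.

```python
import sqlite3
import time

# A cache table of temporary values with an expiry timestamp.
# sqlite3 stands in for PostgreSQL; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, value TEXT, expires_at REAL)")

def cache_set(key, value, ttl_seconds):
    conn.execute(
        "INSERT OR REPLACE INTO cache (key, value, expires_at) VALUES (?, ?, ?)",
        (key, value, time.time() + ttl_seconds),
    )

def cache_get(key):
    # Only return entries that haven't expired yet.
    row = conn.execute(
        "SELECT value FROM cache WHERE key = ? AND expires_at > ?",
        (key, time.time()),
    ).fetchone()
    return row[0] if row else None

def cache_purge():
    # Run from a periodic job to clear out expired entries over time.
    conn.execute("DELETE FROM cache WHERE expires_at <= ?", (time.time(),))

cache_set("fx:usd-eur", "0.92", ttl_seconds=60)
value = cache_get("fx:usd-eur")  # "0.92" while the entry is fresh
```

In PostgreSQL the purge would run from a scheduled job, and an index on `expires_at` keeps it cheap. Not as fast as a dedicated cache, but one less system to operate.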
The same goes for introducing new languages and frameworks. When possible, I like to use what we have and only introduce something new once we’ve pushed the existing stuff to the breaking point.
Another advantage is that over time, a lot of software becomes deprecated, but not removed. Some product or feature is no longer maintained, but since it’s in active use, it’s not fully shut down or deleted. It’s bad enough to leave deprecated code and services running, but it’s even worse if this means you now have extra databases or other platform systems that still have to be maintained, but don’t provide any current value. Even deprecated systems still need security upgrades, migrations to new servers, and more.
Optimize For Change
This is especially true of startups, but change is a part of any software project. Requirements change, our understanding of the problem changes, technology changes, and even the focus of a company can change. So it’s important to ensure that the software can change as well. Sometimes this is subjective (which architecture is most amenable to change) and other times it’s concrete.
For example, I worked on one system which had both a customer-installed on-premise system and a cloud-hosted system. The on-premise system was extremely hard to change, as it required customers to do their own upgrades (often on their own schedules). In contrast, the cloud-hosted system was fully under our control. So optimizing for change meant putting as much into the cloud-hosted system as possible and keeping the on-premise portion thin. That way, we needed fewer changes to the hard-to-change parts, and we could roll out as many changes as we needed to the cloud piece on our own schedule.
Optimizing for change can also help with architecture discussions and decisions. When deciding between alternatives, picking the one that is easiest to change later can be helpful. It’s easier to try new things when the cost of undoing that change isn’t as high. If the new framework or tool doesn’t work out, you can switch back or switch to something else that’s new.
In my opinion, one of the best ways to optimize for change is to keep things as simple as possible. Sometimes, folks will over-engineer current systems to try to predict how they will evolve in the future and to try to future-proof them now. One example of this is making things generic when there is only one type today. I think this is a mistake. Our guesses for how things will change are often incorrect, and it’s easier to change a simple system than a complex one. It’s also easier to maintain a simpler system today than carry the over-engineered baggage around with us.
Focus on Iteration
It’s super important to be able to break down work into small, deliverable pieces. I’ve seen too many projects go months without showing any value. Sometimes they do finally deliver, but other times they get canceled or significantly altered instead. It’s far better to release piecemeal, even if it’s not fully featured. Feature flags and other ways to partially roll out features are great here. They let you get production feedback from a subset of customers, or even just internal folks, and they make progress visible throughout a long project.
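One common way to partially roll out a feature is a percentage-based flag keyed on a stable hash of the customer. This is a hedged sketch, not any particular flag library; the flag names and rollout table are invented:

```python
import hashlib

# Illustrative rollout table: what percent of customers see each feature.
# Real projects typically keep this in a flag service or database.
ROLLOUT = {
    "new_checkout_flow": 25,
    "internal_dashboard": 100,
}

def is_enabled(flag, customer_id):
    percent = ROLLOUT.get(flag, 0)
    # Hash flag + customer so each customer lands in a stable bucket per flag.
    digest = hashlib.sha256(f"{flag}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

Because the bucket is a deterministic hash, the same customers stay in (or out of) the rollout as the percentage grows, so their experience is consistent across requests.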
I find that a lot of the frustration over software estimates and delivery time frames goes away if folks can see visible progress over time, rather than a nebulous future delivery date.
One thing I wish I had a better solution for was making the stability of features more obvious. For example, I want to ship quickly to get feedback, but then I want to still be able to change that feature or API. However, once customers start using something, they often implicitly assume that it won’t change.
It would be great to find a way to mark features or APIs as alpha, beta, stable, etc and set clear expectations and time frames for those features. For example, encouraging customers to try out an alpha API, but knowing that it will change and they will have to update their integration periodically. Personally, I haven’t seen this done super well yet.
Testing code is super valuable, and there are many different approaches with different trade-offs. A lot can be said on this topic, but I’ll just mention one aspect that I’ve been thinking about a lot: balancing speed and quality of tests.
In general, having a lot of tests lets you make changes with confidence. If a large, thorough suite of tests pass, you can be reasonably sure you haven’t broken something. It can even let you upgrade core components with confidence, such as the application framework or language version.
However, the more tests you have, the longer they take to run. What starts as a suite that runs in a few seconds can easily take minutes or longer if you aren’t careful. One way this is addressed is by isolating tests from other systems, often with mocking. For example, testing the core business logic without the database, testing the API without actually opening connections and making API calls, or mocking out responses from third-party systems.
But the trade-off here is that as you isolate tests to make them faster, you may also make them less realistic and less able to catch problems. The mock based tests are fast, but perhaps the mock doesn’t work the same way as the real component in certain edge cases.
Or you want to change something about the interaction between components, and now you have to update hundreds of cases where you set up mocks for testing.
I don’t have a great answer for this one. I try to isolate code from external dependencies when I can (e.g. by writing business logic as simple functions that take their data as input). And for the rest at the edges or when testing interactions, I just try to be thoughtful about the trade-offs we make for speed vs accuracy with testing. I also tend to prefer fakes over mocks, where you have a mostly stable stand-in that is used across many tests instead of setting up mock expectations per test.
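To make the fakes-over-mocks preference concrete, here is a sketch of a fake: a small, stateful stand-in shared across many tests, rather than per-test mock expectations. The gateway interface here is invented for illustration, not a real payments API:

```python
class FakePaymentGateway:
    """In-memory stand-in for a third-party payment gateway, reused across tests."""

    def __init__(self):
        self.charges = {}
        self._next_id = 1

    def charge(self, amount_cents, currency):
        # Enforce the same edge cases every time, like the real gateway would.
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        charge_id = f"ch_{self._next_id}"
        self._next_id += 1
        self.charges[charge_id] = {
            "amount": amount_cents,
            "currency": currency,
            "refunded": False,
        }
        return charge_id

    def refund(self, charge_id):
        if charge_id not in self.charges:
            raise KeyError(f"unknown charge {charge_id}")
        self.charges[charge_id]["refunded"] = True

def refund_order(gateway, charge_id):
    # Business logic under test takes the gateway as a plain input.
    gateway.refund(charge_id)
    return True
```

Because the fake enforces the same behavior in every test (unknown charge ids raise, non-positive amounts are rejected), it can catch interaction bugs that a loosely configured per-test mock would silently allow, and changing the interaction means updating one fake instead of hundreds of mock setups.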
Modular Monolith
A lot has been said on modular monoliths elsewhere, so I’ll just add that I really like this approach. It’s really hard to know what the eventual seams of a software system will be, and it’s hard for a small team to work on many separate services (including hosting, deployment, monitoring, upgrades, etc).
In the recent cases where we used a monolith, I think it worked out really well. It will always be work to pull a service out of the monolith eventually, but we can try to be thoughtful about the code separation within the monolith to help make it easier (and to crystallize our thinking on what is a separate domain area). And we’re deferring these decisions until later, so we can focus on building more quickly now (which is especially important in a startup).
Put Everything in the Database
I’m a big fan of storing almost everything in the database. I find that it makes things so much easier to understand and debug if you can query all of the relevant data together. Often, I will prefer the database to logging, since you can’t easily correlate logs with stored data (e.g. Storing External Requests).
For example, in payment systems, payments often move through many different states. It’s really helpful to have entries in the database that represent what changed and when, even if only the final state is important. Then, when trying to debug why a payment is in a weird state, we can see all of the relevant data in all of the tables in one place (e.g. in an events or audits table).
Adding a unique request identifier makes it even more useful. Then, you can associate a failed API request with all of its database records.
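As a sketch of what this can look like (with an invented schema, and `sqlite3` standing in for PostgreSQL): an append-only events table that records each state transition along with the request identifier that caused it.

```python
import sqlite3

# Append-only table of payment state transitions, keyed by a request id
# so a failed API request can be correlated with every change it made.
# Schema, states, and ids are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payment_events (
        id INTEGER PRIMARY KEY,
        payment_id TEXT,
        request_id TEXT,
        old_state TEXT,
        new_state TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def record_transition(payment_id, request_id, old_state, new_state):
    conn.execute(
        "INSERT INTO payment_events (payment_id, request_id, old_state, new_state)"
        " VALUES (?, ?, ?, ?)",
        (payment_id, request_id, old_state, new_state),
    )

record_transition("pay_1", "req_abc", "created", "authorized")
record_transition("pay_1", "req_xyz", "authorized", "failed")

# Debugging: everything that happened during the failing request.
rows = conn.execute(
    "SELECT old_state, new_state FROM payment_events WHERE request_id = ?",
    ("req_xyz",),
).fetchall()
```

The same `request_id` would also be written to any other tables the request touched, so one query can pull together the full picture of a weird payment state.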
There are practical considerations, however, as data sizes really start to grow. One strategy I’ve used is to store some of this data with a shorter lifespan. For example, log style data may only be useful for a few weeks, so it can be deleted after that. Or exported to files and archived separately.
Another issue is with Personally Identifiable Information (PII). There are often legal and ethical requirements for this type of data, so it needs to be considered on a case by case basis. Sometimes, it can still be stored, but only for a short time. Other times, it should be scrubbed or excluded from the database.
Make It Easy To Query The Database
Once you get everything into the database, I find it super helpful to give folks an easy way to query it. Recently, I used Metabase and really enjoyed how it allowed easy, web based querying and graphing of our data. We set it up with a read-only connection to a read replica, so there was little concern of impacting production or accidentally changing data. We found that both developers and non-technical folks used it extensively.
For example, we made dashboards where you could enter an orderId and see all of the data from all of the tables that stored associated data. This was hugely valuable for debugging and for our support folks.
Again, there are considerations of who can see the data, and how much of it. But in general, giving folks the ability to answer their own data questions is super powerful, and it takes load off developers. Building shared dashboards and graphs so everyone can watch the same metrics was also very valuable.
Job Drain Pattern
Once a system outgrows a single database, data consistency issues start to pop up. Even introducing a background job system or a search tool can surface them. For example, the write to the main database succeeded, but the process that copied the data to the search tool failed. Or a background job was queued before the main database transaction was committed.
There are various ways to solve this problem, and in particular, I like the job drain pattern, written up well in Transactionally Staged Job Drains in Postgres. I’ve used this pattern successfully on several different projects.
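A minimal sketch of the idea, using `sqlite3` from the standard library in place of PostgreSQL (table and job names invented): jobs are staged in a table within the same transaction as the business write, and a separate drainer moves committed rows into the real queue.

```python
import sqlite3

# Transactionally staged job drain: the staged job commits (or rolls back)
# atomically with the business data, so a job can never exist for data
# that was never committed. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, state TEXT)")
conn.execute("CREATE TABLE staged_jobs (id INTEGER PRIMARY KEY, job_name TEXT, args TEXT)")

def capture_payment(payment_id):
    with conn:  # ONE transaction for the business write AND the staged job
        conn.execute(
            "INSERT INTO payments (id, state) VALUES (?, 'captured')", (payment_id,)
        )
        conn.execute(
            "INSERT INTO staged_jobs (job_name, args) VALUES ('send_receipt', ?)",
            (payment_id,),
        )

real_queue = []  # stand-in for the actual background job system

def drain_jobs():
    # In PostgreSQL the drainer runs in its own transaction and only ever
    # sees committed staged rows; it enqueues them and deletes them.
    with conn:
        rows = conn.execute("SELECT id, job_name, args FROM staged_jobs").fetchall()
        for job_id, name, args in rows:
            real_queue.append((name, args))
            conn.execute("DELETE FROM staged_jobs WHERE id = ?", (job_id,))

capture_payment("pay_1")
drain_jobs()
```

This closes both failure modes above: a job is never enqueued for a rolled-back write, and a committed write can’t permanently miss its job, since the staged row persists until the drainer succeeds (giving at-least-once delivery).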
Check in Generated Files
Similar to putting everything in the database is putting everything into git. For me, this includes generated files when possible. I know a lot of ecosystems prefer generating files only at build time into temporary directories, but I really like having them in git. I find it really useful to be able to diff these files when making changes, such as upgrading the generation library or code. Otherwise, it can be hard to tell whether anything meaningful has changed, or whether more has changed than you expected.
When working with Gradle, I also like to check in the generated lockfiles that specify the exact version of every transitive dependency. Then, when Dependabot/Renovate/etc perform automated upgrades, it’s easy to see which transitive dependencies have also changed.
Decision Logs
I think in general, a lot of internal documentation is wasted effort. People spend countless hours writing up product plans or docs that are never looked at again.
However, I do think some documentation is often valuable. In particular, I like Decision Logs. The idea is that whenever the team needs to make a decision, that decision is captured in some light documentation. I think it serves two purposes:
Writing up the options along with the advantages and disadvantages of each helps clarify thinking, and helps make better decisions. It shows what you’ve considered, and allows others to note gaps or misunderstandings. It’s also often helpful to clarify what you are not trying to address with the decision, i.e. what’s out of scope.
Months or even years later, looking back at the Decision Log can be useful to understand why the system is designed a certain way. For example, someone new is hired and doesn’t understand why you chose Database X over Database Y. They can go read the entry. Or when someone proposes something new that’s already been considered, you can go back and see why it wasn’t chosen previously and if anything in the situation has changed (e.g. with the company or the capabilities of the tool). The Decision Log helps to remove “institutional knowledge” where only a handful of old-timers know the reasons for anything.
I do think that these Decision Logs (and other documentation) should be kept relatively light, however. Folks should not spend days writing them up.
Continuous Deployment
I’m a big fan of continuous deployment. This can look different on different projects, but ideally, every commit to the main branch will deploy to both test and production environments. I see a number of benefits:
- It means the time between commit and production is small, so completed work gets into the hands of users quickly. You also don’t have to worry about when code will be released, and when other code that depends on it can also be released. You can merge a change, let it deploy, and then merge another change.
- It requires the deploys to be fully automated, which both makes them repeatable and also generally discoverable. Anyone can see what steps are run for every deployment and they are always the same (no hidden steps). Furthermore, if there’s ever a need for a manual deployment, someone can go look at the scripts and run the same commands.
- It removes an often time consuming developer chore. Now, deploys just happen and you don’t have to spend time coordinating or performing them.
For beta features, or features that aren’t ready to be visible to everyone, I think feature flags work well. There are lots of libraries and products in this space, but it’s possible to start simple with what is built into GitLab: https://docs.gitlab.com/ee/operations/feature_flags.html