I work extensively with JSON day to day, and I often reach for jq when exploring documents. I love jq, but I find it hard to use. The syntax is super powerful, but I have to study the docs anytime I want to do anything beyond just selecting fields.
Once I learned DuckDB could read JSON files directly into memory, I realized that I could use it for many of the things where I’m currently using jq. In contrast to the complicated and custom jq syntax, I’m very familiar with SQL and use it almost daily.
Here’s an example:
First, we fetch some sample JSON to play around with. I used the GitHub API to grab the repository information from the golang org:
% curl 'https://api.github.com/orgs/golang/repos' > repos.json
Now, as a sample question to answer, let’s get some stats on the types of open source licenses used.
The JSON structure looks like this:
[
{
"id": 1914329,
"name": "gddo",
"license": {
"key": "bsd-3-clause",
"name": "BSD 3-Clause \"New\" or \"Revised\" License",
...
},
...
},
{
"id": 11440704,
"name": "glog",
"license": {
"key": "apache-2.0",
"name": "Apache License 2.0",
...
},
...
},
...
]
This might not be the best way, but here is what I cobbled together after searching and reading some docs for how to do this in jq:
% cat repos.json | jq \
'group_by(.license.key)
| map({license: .[0].license.key, count: length})
| sort_by(.count)
| reverse'
[
{
"license": "bsd-3-clause",
"count": 23
},
{
"license": "apache-2.0",
"count": 5
},
{
"license": null,
"count": 2
}
]
And here is what it looks like in DuckDB using SQL:
% duckdb -c \
"select license->>'key' as license, count(*) as count \
from 'repos.json' \
group by 1 \
order by count desc"
┌──────────────┬───────┐
│ license │ count │
│ varchar │ int64 │
├──────────────┼───────┤
│ bsd-3-clause │ 23 │
│ apache-2.0 │ 5 │
│ │ 2 │
└──────────────┴───────┘
For me, this SQL is much simpler, and I was able to write it without looking at any docs. The only tricky part is querying nested JSON with the ->> operator. The syntax is the same as the PostgreSQL JSON functions, however, so I was familiar with it.
And if we do need the output in JSON, there’s a DuckDB flag for that:
% duckdb -json -c \
"select license->>'key' as license, count(*) as count \
from 'repos.json' \
group by 1 \
order by count desc"
[{"license":"bsd-3-clause","count":23},
{"license":"apache-2.0","count":5},
{"license":null,"count":2}]
We can even still pretty print with jq at the end, after using DuckDB to do the heavy lifting:
% duckdb -json -c \
"select license->>'key' as license, count(*) as count \
from 'repos.json' \
group by 1 \
order by count desc" \
| jq
[
{
"license": "bsd-3-clause",
"count": 23
},
{
"license": "apache-2.0",
"count": 5
},
{
"license": null,
"count": 2
}
]
JSON is just one of the many ways of importing data into DuckDB. This same approach would work for CSVs, parquet, Excel files, etc.
And I could choose to create tables and persist locally, but often I’m just interrogating data and don’t need the persistence.
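If I did want to persist, a sketch of what that could look like with the same file (repos.db is just a database file name I picked for this example):

% duckdb repos.db -c "create table repos as select * from 'repos.json'"
% duckdb repos.db -c "select count(*) from repos"

After that, the data lives in the local DuckDB file and can be queried without re-reading the JSON.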
Read more about DuckDB’s great JSON support in this blog post: Shredding Deeply Nested JSON, One Vector at a Time
Update:
I also learned that DuckDB can read the JSON directly from a URL, not just a local file:
% duckdb -c \
"select license->>'key' as license, count(*) as count \
from read_json('https://api.github.com/orgs/golang/repos') \
group by 1 \
order by count desc"
The sections are relatively independent, so feel free to jump to whichever ones interest you.
Every new tool, language, database, etc adds an enormous amount of complexity. You have to set it up and manage it (even managed offerings still require work), integrate with it, learn the ins and outs (often only after it’s failed in some way), and you will find out things you didn’t even know to think about.
So before I reach for something new, I try to use what we have, even if it’s not the optimal thing. For example, my projects have often used PostgreSQL as the database. PostgreSQL is quite full featured, so I try to use it for as much as possible. This includes job queues, search, and even simple caching (e.g. table that stores temporary values which get cleared out over time). It’s not necessarily the ideal platform for these, but it’s so much easier to just manage the one database than a whole suite of data systems. And at some point, the app will outgrow PostgreSQL’s capability for one or more of these, but even deferring that decision and work is hugely valuable.
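For example, the simple caching case can be nothing more than a table with an expiry. A minimal sketch (the table and column names here are made up for illustration):

CREATE TABLE simple_cache (
  key TEXT PRIMARY KEY,
  value JSONB NOT NULL,
  expires_at TIMESTAMPTZ NOT NULL
);

-- Upsert a value that is good for one hour
INSERT INTO simple_cache (key, value, expires_at)
VALUES ('exchange_rates:usd', '{"eur": 0.92}', now() + interval '1 hour')
ON CONFLICT (key) DO UPDATE
  SET value = EXCLUDED.value, expires_at = EXCLUDED.expires_at;

-- Cleared out over time by a periodic job
DELETE FROM simple_cache WHERE expires_at < now();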
The same goes for introducing new languages and frameworks. When possible, I like to use what we have and only introduce something new once we’ve pushed the existing stuff to the breaking point.
Another advantage is that over time, a lot of software becomes deprecated, but not removed. Some product or feature is no longer maintained, but since it’s in active use, it’s not fully shut down or deleted. It’s bad enough to leave deprecated code and services running, but it’s even worse if this means you now have extra databases or other platform systems that still have to be maintained, but don’t provide any current value. Even deprecated systems still need security upgrades, migrations to new servers, and more.
Change is especially constant at startups, but really it’s a part of any software project. Requirements change, our understanding of the problems changes, technology changes, and even the focus of a company can change. So it’s important to ensure that the software can change as well. Sometimes this is subjective (which architecture is the most amenable to change) and other times it’s concrete.
For example, I worked on one system which had both a customer-installed on-premise system and a cloud hosted system. The on-premise system was extremely hard to change as it required customers to do their own upgrades (often on their own schedules). In contrast, the cloud hosted system was fully under our control. So optimizing for change meant putting as much into the cloud hosted system as possible and keeping the on-premise portion thin. That way, we needed fewer changes to the hard-to-change parts, and we could roll out as many changes as we needed to the cloud piece on our own schedule.
Optimizing for change can also help with architecture discussions and decisions. When deciding between alternatives, picking the one that is easiest to change later can be helpful. It’s easier to try new things when the cost of undoing that change isn’t as high. If the new framework or tool doesn’t work out, you can switch back or switch to something else that’s new.
In my opinion, one of the best ways to optimize for change is to keep things as simple as possible. Sometimes, folks will over-engineer current systems to try to predict how they will evolve in the future and to try to future-proof them now. One example of this is making things generic when there is only one type today. I think this is a mistake. Our guesses for how things will change are often incorrect, and it’s easier to change a simple system than a complex one. It’s also easier to maintain a simpler system today than carry the over-engineered baggage around with us.
It’s super important to be able to break down work into small, deliverable pieces. I’ve seen too many projects go months without showing any value. Sometimes they do finally deliver, but other times, they will get canceled or significantly altered instead. It’s far better to release piecemeal, even if it’s not fully featured. Feature flags and other ways to partially roll out features are great here. It allows you to get production feedback from a subset of customers, or even just internal folks. And it allows visible progress throughout a long project.
I find that a lot of the frustration over software estimates and delivery time frames goes away if folks can see visible progress over time, rather than a nebulous future delivery date.
One thing I wish I had a better solution for was making the stability of features more obvious. For example, I want to ship quickly to get feedback, but then I want to still be able to change that feature or API. However, once customers start using something, they often implicitly assume that it won’t change.
It would be great to find a way to mark features or APIs as alpha, beta, stable, etc and set clear expectations and time frames for those features. For example, encouraging customers to try out an alpha API, but knowing that it will change and they will have to update their integration periodically. Personally, I haven’t seen this done super well yet.
Testing code is super valuable, and there are many different approaches with different trade-offs. A lot can be said on this topic, but I’ll just mention one aspect that I’ve been thinking about a lot: balancing speed and quality of tests.
In general, having a lot of tests lets you make changes with confidence. If a large, thorough suite of tests pass, you can be reasonably sure you haven’t broken something. It can even let you upgrade core components with confidence, such as the application framework or language version.
However, the more tests you have, the longer they take to run. What starts as a suite that finishes in a few seconds can easily take minutes or longer if you aren’t careful. One way this is addressed is by trying to isolate tests from other systems, often with mocking. For example, testing the core of the business logic without the database, or testing the API without actually opening connections and making API calls, or mocking out responses from 3rd party systems.
But the trade-off here is that as you isolate tests to make them faster, you may also make them less realistic and less able to catch problems. The mock based tests are fast, but perhaps the mock doesn’t work the same way as the real component in certain edge cases.
Or you want to change something about the interaction between components, and now you have to update hundreds of cases where you set up mocks for testing.
I don’t have a great answer for this one. I try to isolate code from external dependencies when I can (e.g. by writing business logic as simple functions that take their data as input). And for the rest at the edges or when testing interactions, I just try to be thoughtful about the trade-offs we make for speed vs accuracy with testing. I also tend to prefer fakes over mocks, where you have a mostly stable stand-in that is used across many tests instead of setting up mock expectations per test.
A lot has been said on modular monoliths elsewhere, so I’ll just add that I really like this approach. It’s really hard to know what the eventual seams of a software system will be, and it’s hard for a small team to work on many separate services (including hosting, deployment, monitoring, upgrades, etc).
In the recent cases where we used a monolith, I think it worked out really well. It will always be work to pull a service out of the monolith eventually, but we can try to be thoughtful about the code separation within the monolith to help make it easier (and to crystallize our thinking on what is a separate domain area). And we’re deferring these decisions until later, so we can focus on building more quickly now (which is especially important in a startup).
I’m a big fan of storing almost everything in the database. I find that it makes things so much easier to understand and debug if you can query all of the relevant data together. Often, I will prefer the database to logging, since you can’t easily correlate logs with stored data (e.g. Storing External Requests).
For example, in payment systems, payments often move through many different states. It’s really helpful to have entries in the database that represent what changed and when, even if only the final state is important. Then, when trying to debug why a payment is in a weird state, we can see all of the relevant data in all of the tables in one place (e.g. in an events or audits table).
Adding a unique request identifier makes it even more useful. Then, you can associate a failed API request with all of its database records.
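A rough sketch of what that might look like (the table and column names here are illustrative, not from a real system):

-- One row per state change, with the request that caused it
CREATE TABLE payment_events (
  id BIGSERIAL PRIMARY KEY,
  payment_id BIGINT NOT NULL,
  request_id TEXT,
  old_status TEXT,
  new_status TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Later, when debugging a failed API request:
SELECT * FROM payment_events WHERE request_id = 'req_123abc' ORDER BY created_at;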
There are practical considerations, however, as data sizes really start to grow. One strategy I’ve used is to store some of this data with a shorter lifespan. For example, log style data may only be useful for a few weeks, so it can be deleted after that. Or exported to files and archived separately.
Another issue is with Personally Identifiable Information (PII). There are often legal and ethical requirements for this type of data, so it needs to be considered on a case by case basis. Sometimes, it can still be stored, but only for a short time. Other times, it should be scrubbed or excluded from the database.
Once you get everything into the database, I find it super helpful to give folks an easy way to query it. Recently, I used Metabase and really enjoyed how it allowed easy, web based querying and graphing of our data. We set it up with a read-only connection to a read replica, so there was little concern of impacting production or accidentally changing data. We found that both developers and non-technical folks used it extensively.
For example, we made dashboards where you could enter an orderId and see all of the data from all of the tables that stored associated data. This was hugely valuable for debugging and for our support folks.
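Under the hood, a dashboard card like that can be as simple as a parameterized query. Metabase turns a {{variable}} placeholder into an input box; the table name below is hypothetical:

SELECT *
FROM payments
WHERE order_id = {{order_id}};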
Again, there are considerations of who can see the data, and how much of it. But in general, giving folks the ability to answer their own data questions is super powerful, and it takes load off developers. And building shared dashboards and graphs so everyone can watch the same metrics was very powerful.
Once a system outgrows a single database, data consistency issues start to pop up. Even introducing a background job system or a search tool can start to show issues. For example, the main database was written, but the process that copied to the search tool failed. Or the background job was queued before the main database was committed.
There are various ways to solve this problem, and in particular, I like the job drain pattern, written up well at Transactionally Staged Job Drains in Postgres. I’ve used this pattern on several different projects successfully.
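Roughly, the shape of the pattern looks like this (a simplified sketch, not the exact schema from the article):

-- Staged jobs are written in the same transaction as the business data,
-- so a job only exists if the data it refers to actually committed.
CREATE TABLE staged_jobs (
  id BIGSERIAL PRIMARY KEY,
  job_name TEXT NOT NULL,
  job_args JSONB NOT NULL
);

-- Alongside the other writes in the same application transaction:
INSERT INTO staged_jobs (job_name, job_args)
VALUES ('send_receipt', '{"order_id": 123}');

-- A separate drainer process then repeatedly selects a batch, enqueues it
-- into the real job system, and deletes it, all in one transaction, so jobs
-- are never enqueued before their data is committed.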
Similar to putting everything in the database is putting everything into git. For me, this includes generated files when possible. I know a lot of ecosystems prefer generating only at build time into temporary directories, but I really like having them in git. I find it really useful to be able to diff these files when making changes, such as upgrading the generation library or code. Otherwise, it can be hard to tell if anything meaningful has changed, or if more has changed than you expected.
When working with Gradle, I also like to check in the generated lockfiles that specify the exact version of every transitive dependency. Then, when Dependabot/Renovate/etc perform automated upgrades, it’s easy to see which transitive dependencies have also changed.
I think in general, a lot of internal documentation is wasted effort. People spend countless hours writing up product plans or docs that are never looked at again.
However, I do think some documentation is often valuable. In particular, I like Decision Logs. The idea is that whenever the team needs to make a decision, that decision is captured in some light documentation. I think it serves two purposes:
Writing up the options along with the advantages and disadvantages of each helps clarify thinking, and helps make better decisions. It shows what you’ve considered, and allows others to note gaps or misunderstandings. It’s also often helpful to clarify what you are not trying to address with the decision, i.e. what’s out of scope.
Months or even years later, looking back at the Decision Log can be useful to understand why the system is designed a certain way. For example, someone new is hired and doesn’t understand why you chose Database X over Database Y. They can go read the entry. Or when someone proposes something new that’s already been considered, you can go back and see why it wasn’t chosen previously and if anything in the situation has changed (e.g. with the company or the capabilities of the tool). The Decision Log helps to remove “institutional knowledge” where only a handful of old-timers know the reasons for anything.
I do think that these Decision Logs (and other documentation) should be kept relatively light, however. Folks should not spend days writing them up.
I’m a big fan of continuous deployment. This can look different on different projects, but ideally, every commit to the main branch will deploy to both test and production environments. I see a number of benefits.
For beta features, or features that aren’t ready to be visible to everyone, I think feature flags work well. There are lots of libraries and products in this space, but it’s possible to start simple with what is built into GitLab: https://docs.gitlab.com/ee/operations/feature_flags.html
Most software systems make calls out to other systems. Some of these calls will be to external companies which provide APIs, and some will be to other services/teams within the same organization, but outside the current scope of the software system.
In all cases, managing 3rd party API calls can be tricky. There’s the initial integration, where the API requests/responses might not be what you expect or what the docs say. Then you go to production and find out the behavior there doesn’t quite match the sandbox/test environment. Even once it’s all working, APIs change over time, and what worked yesterday might not work tomorrow. And then there are weird edge cases where you get back a response for one call out of a million which blows up your processing and handling.
To help with these issues, I like when systems store all external requests.
The first step is storing all inbound and outbound API calls. People will often use logs for this, but I think it’s far more valuable to put them in the database instead. This makes them easier to query, easier to aggregate, and easier to join against other data.
You can use one table for both inbound and outbound calls or separate them depending on preference, but generally it’s useful to store most of the available information: the method and URL, the request and response headers and bodies, the response status code (e.g. 200, 500), the request time, and metadata.

For metadata, I like to use a JSON column and add in any metadata that links this request to an entity in the system. For example, it could include user_id, order_id, request_id, etc. As a JSON column, you can include more than one, and even include other complex nested information.
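Putting that together, a rough sketch of such a table might look like this (PostgreSQL-flavored; the exact columns and names are up to you):

CREATE TABLE external_requests (
  id BIGSERIAL PRIMARY KEY,
  direction TEXT NOT NULL, -- 'inbound' or 'outbound' if using a single table
  method TEXT NOT NULL,
  url TEXT NOT NULL,
  request_headers JSONB,
  request_body TEXT,
  response_status INT,
  response_headers JSONB,
  response_body TEXT,
  duration_ms INT,
  metadata JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);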
Storing the request time provides several benefits. One, it makes it easy to do some basic analysis of the API calls (e.g. what’s the average time for this call, or average time if we provide these fields but not these other fields?). It also helps with debugging, such as when you see a certain error code after exactly 5 seconds every time, you can infer there’s some 3rd party timeout.
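With a duration column like duration_ms in the sketch above, that kind of analysis is just SQL, for example:

SELECT url, count(*) AS calls, avg(duration_ms) AS avg_ms
FROM external_requests
GROUP BY url
ORDER BY avg_ms DESC;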
Response headers often include extra debugging information, such as request or trace Ids, version numbers, etc. It’s common when asking for 3rd party support to provide these values so they can go look in their own logging to find your requests.
Once you start storing this information, it will often be immediately useful, even in development. And you can use it in integration tests (e.g. ensuring you pass a certain field with a certain value when making the API call). But the real power comes from debugging in production, such as finding all of the API calls associated with a failed payment to see what went wrong using simple SQL queries.
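For example, using the metadata column from the sketch above, pulling up every call tied to one order is a single query:

SELECT method, url, response_status, created_at
FROM external_requests
WHERE metadata->>'order_id' = '12345'
ORDER BY created_at;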
Rather than try to write code for every API call, it’s often better to hook into the request/response lifecycle in one place and instrument all calls. The way you do this depends on the language and libraries, but these hooks are generally called interceptors.
Where possible, I like to record the request fields before the outbound call is started (e.g. request body, request headers) and then go back and update the row to store the response fields once the call is completed. There are several advantages over a single write at the end of the request cycle; for example, if the call hangs or the process dies mid-request, there is still a record that the call was attempted.
There are obviously security and privacy considerations when recording external requests. One basic approach is to filter parts of the bodies and headers that you don’t want stored. For example, authentication headers, PII (Personally Identifiable Information) and other sensitive information. I like replacing this information with something like [REDACTED] rather than just removing it, so it’s clear that a value was present.
Some filtering can be global (redact all Authentication headers) and some can be request specific (for this request, redact password).
I also recommend adding a lifecycle to these tables. For example, delete or redact all data after 2 weeks. That way, if something does creep in unexpectedly, it won’t last very long. Most of the debugging value is in recent data. And pruning old data also keeps the table size in check.
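With the sketch above, that cleanup can be a scheduled statement as simple as:

DELETE FROM external_requests
WHERE created_at < now() - interval '2 weeks';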
PostgreSQL 15 added a feature that makes UNIQUE constraints with NULL values more sane.

A well known but annoying weirdness with NULL values is that NULL != NULL, so a UNIQUE column can still have multiple NULL values.
(Examples use numeric id columns for simplicity, but I generally prefer more complex ids such as ULIDs.)
CREATE TABLE test (
id serial PRIMARY KEY,
value TEXT UNIQUE
);
INSERT INTO test (value) VALUES ('a');
-- This fails on the duplicate:
-- ERROR: duplicate key value violates unique constraint "test_value_key"
-- DETAIL: Key (value)=(a) already exists.
INSERT INTO test (value) VALUES ('a');
-- But this does not:
INSERT INTO test (value) VALUES (null);
INSERT INTO test (value) VALUES (null);
SELECT * from test;
id | value
----+-------
1 | a
3 |
4 |
(3 rows)
However, PostgreSQL 15 released a new feature which can change this behavior: UNIQUE NULLS NOT DISTINCT.
CREATE TABLE test (
id serial PRIMARY KEY,
value TEXT UNIQUE NULLS NOT DISTINCT
);
-- Now this fails on the second insert:
-- ERROR: duplicate key value violates unique constraint "test_value_key"
-- DETAIL: Key (value)=(null) already exists.
INSERT INTO test (value) VALUES (null);
INSERT INTO test (value) VALUES (null);
Read more about it in the release notes: PostgreSQL Release 15 Notes
A common use case that I’ve run into is a table which has multiple foreign keys, but only one is expected to be populated. For example, say we have a notifications table to represent notifications we send out (e.g. emails, text messages, etc). These notifications might be triggered by and related to a specific entity in our system, such as an order, user, company, etc. We want to add foreign keys to represent what this notification is for, but we only want to populate one of them.
For example:
CREATE TABLE notifications (
id serial,
company_id INT REFERENCES companies (id),
order_id INT REFERENCES orders (id),
user_id INT REFERENCES users (id)
);
INSERT INTO notifications (company_id) VALUES (100);
INSERT INTO notifications (order_id) VALUES (200);
SELECT * from notifications;
id | company_id | order_id | user_id
----+------------+----------+---------
1 | 100 | |
2 | | 200 |
(2 rows)
Often, I’ll see a table like this add a constraint to ensure that at least one of the columns is populated:
ALTER TABLE notifications
ADD CONSTRAINT notifications_reference
CHECK (company_id IS NOT NULL OR order_id IS NOT NULL OR user_id IS NOT NULL);
However, this does not stop you from accidentally populating more than one column. This can happen easily if you use an Object Relational Mapper (ORM) which generates the SQL for you, and you’ve accidentally set more than one attribute of your object:
INSERT INTO notifications (company_id, order_id, user_id)
VALUES (NULL, 300, 400);
SELECT * from notifications;
id | company_id | order_id | user_id
----+------------+----------+---------
1 | 100 | |
2 | | 200 |
3 | | 300 | 400
(3 rows)
There’s a pair of PostgreSQL functions which let us write a better constraint check, though, called num_nulls and num_nonnulls. These let us check that there is exactly one non-NULL value among a set of columns. For example:
ALTER TABLE notifications
ADD CONSTRAINT notifications_reference
CHECK (num_nonnulls(company_id, order_id, user_id) = 1);
-- Now we get an error on insert if there is more than one value:
-- ERROR: new row for relation "notifications" violates check constraint "notifications_reference"
-- DETAIL: Failing row contains (3, null, 300, 400).
INSERT INTO notifications (company_id, order_id, user_id)
VALUES (NULL, 300, 400);
-- Or if there are no values:
-- ERROR: new row for relation "notifications" violates check constraint "notifications_reference"
-- DETAIL: Failing row contains (4, null, null, null).
INSERT INTO notifications (company_id, order_id, user_id)
VALUES (NULL, NULL, NULL);
Read more about them in the docs: Comparison Functions
Early on, most of my projects used auto-incrementing integer ids. Later, many of my teams/projects switched to random or pseudorandom string identifiers. These have many advantages over incrementing integers, especially when used as public identifiers (e.g. in URLs).
One easy way to generate unique, random identifiers is by using a UUID. But lately, I’ve been using ULIDs instead. ULID stands for Universally Unique Lexicographically Sortable Identifier, which is like a time sortable UUID.
ULIDs look like 01GPC4NAN03RXV2EXS7308BHJ6, and we can include extra information by prepending. For example, a Payment Id could be PAY01GPC4NAN03RXV2EXS7308BHJ6.
Benefits from the spec: ULIDs are lexicographically sortable, case insensitive, URL safe (no special characters), and 128-bit compatible with UUIDs.
A few more benefits: since the timestamp is embedded in the id, you don’t need a separate created_at column if desired.

And there are implementations in many languages.
One downside of ULIDs, however, is their lack of tooling. Periodically, I’d want a quick way to generate new ULIDs. Or I’d want to parse an existing ULID and see when it was generated (since they embed the timestamp).
So I made a simple website which used the javascript ULID library: https://pgr0ss.github.io/ulid-tools/
It currently does 3 things:
The code is at https://github.com/pgr0ss/ulid-tools. (Note: my html/javascript skills are pretty rusty.)
In fairness, everything comes with tradeoffs and ULIDs aren’t without their faults. For example, the ids are long, which really shows when they are prefixed and nested in URLs (e.g. /users/US01GPC6NGM662XD35QWYERHW6B6/payments/PAY01GPC6NSA8P3DWX6ATS29ABV84).

There’s a draft spec for new UUID versions which are time sorted (inspired by ULID and others): https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format
Maybe these will be accepted and gain widespread adoption in the future.
In short, I wanted an easy way to decode a JWT locally from the command line. I have been a longtime fan of jq, and figured it could do this. I don’t generally care about validating the signature. I just want to see the contents.
After some searching and reading the docs, I wrote a simple function which I added to my ~/.zshrc:
jwt-decode() {
jq -R 'split(".") |.[0:2] | map(@base64d) | map(fromjson)' <<< $1
}
I use it like this:
% jwt-decode eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c
[
{
"alg": "HS256",
"typ": "JWT"
},
{
"sub": "1234567890",
"name": "John Doe",
"iat": 1516239022
}
]
The only dependency is jq, which is easy to install (e.g. brew install jq on a Mac). This command splits the token on ., base64 decodes each piece, parses them as JSON, and then pretty prints.
I’ve been using JDBI on a recent project, and most things work with very little ceremony (e.g. binding arguments with @BindMethods).
However, one thing that is not super straightforward is how to make it work with custom types. For example, I’ve been playing with ULIDs lately as identifiers (specifically f4b6a3/ulid-creator), and I want those stored in TEXT columns in the database. It took me a while to figure out the necessary pieces.
In short, you need both an ArgumentFactory (for custom type -> db) and a ColumnMapper (for db -> custom type). It looks like this:
Jdbi jdbi = ...
jdbi.registerArgument(
new AbstractArgumentFactory<Ulid>(Types.VARCHAR) {
@Override
protected Argument build(Ulid value, ConfigRegistry config) {
return (position, statement, ctx) -> statement.setString(position, value.toString());
}
});
jdbi.registerColumnMapper(
new ColumnMapper<Ulid>() {
@Override
public Ulid map(ResultSet r, int columnNumber, StatementContext ctx) throws SQLException {
return Ulid.from(r.getString(columnNumber));
}
});
With those two registrations in place, everything else should just work. These could even be bundled together into a JdbiPlugin, which I might do at some point.
Co-authored by Leigh McCulloch
Cross-posted to Braintree’s Product and Technology Blog: Continuous Deployment Isn’t Just for Applications
We now continuously deploy important internal libraries written in both Ruby and Java, and plan to ramp up this effort to include internal tools as well. The process our engineers go through when contributing to shared libraries is simple: open a pull request (which automatically publishes a test version of the library), get it reviewed, and merge to master, which automatically publishes a new release version.
We’ve found that Continuously Deploying our libraries has eliminated repetitive tasks and helped us move quickly with shared code initiatives where a monolithic repository was not an option. New engineers can now easily integrate and contribute to libraries by reading simple scripts rather than long documentation describing our release processes. Deploying test versions from open pull requests has given engineers the ability to iterate on functionality spanning applications and shared libraries without needing to merge to master early.
Here are the steps we took to make this process as easy as possible.
The first big question was how to version libraries. For semantic versioning we’d use major.minor.patch (e.g. v2.3.1), but that doesn’t work as well when you are releasing every commit of the master branch.
We’ve iterated a few times on the versioning, and our current version scheme looks like major.datetime.description.git_revision. For example, a version for consumption might be 4.20181210231852.master.ef57297 whereas a pull request might be 0.94.20181210231646.pr.24fcfd5.
We use a major version of 0 for testing versions, and we only increment the major version when making backwards incompatible changes. The description is master for master builds and pr for PRs. We do this to clearly signal where a version has been built from; while we use the major versions to signal this as well, it’s much clearer to a human to see the description.

We also wanted to encourage engineers to iterate on changes in a shared library without merging their work-in-progress to master too early. We wanted to make working across shared library boundaries as simple as possible.
To support this behavior, we build and publish a version of each shared library on every pull request with a version that never supersedes master builds. We use a major version of 0 to signal the build is not a release for the current or a future major version. We use a minor version that matches the pull request number so that consumers of the library can pin to that minor version to get updates to their pull request.
Our build system posts a comment back to the pull request with the published version to ensure new engineers discover this workflow immediately.
Jenkins posting a message to our pull request
Our versioning schema means that a ruby consumer of the library can pin themselves to a major version and pick up upgrades without breaking their integration.
For example, this would reference the latest version in a ruby Gemfile by using the tilde operator:
gem "mylib", "~> 2.0"
For pull requests, it looks like this in the ruby Gemfile:
gem "mylib", "~> 0.89.0"
In ruby, this means you can update the library to the latest release in the current major version with bundle update mylib.
Consumers are also welcome to pin to the specific version, such as this in ruby:
gem "mylib", "4.20181210231852.master.ef57297"
And in a maven pom.xml:
<dependency>
<groupId>braintree</groupId>
<artifactId>mylib</artifactId>
<version>4.20181210231852.master.ef57297</version>
</dependency>
We’re converting more of our libraries and internal tools to this approach. Let us know how you’re building Continuous Deployment into your culture!
Ghost served me well for a long time, but it finally reached a point where it was more effort to maintain than I was getting out of it.
I self host Ghost on my server, and it’s become hard to keep up with Ghost upgrades. Some upgrades are very simple, but there have been at least two major upgrades that required significant work. And I worry that if I don’t keep up, I’m leaving my site open to security vulnerabilities.
In contrast, Jekyll is a static site generator. The resulting blog is just a bunch of static files served by nginx, so the risk of security issues is vastly reduced.
A static site is also much simpler to host and uses far fewer CPU and memory resources.
While I appreciated many of the features of Ghost, it’s not much more difficult for me to write blog posts on my laptop in markdown. And it’s nice being able to see the exact same blog locally and on my hosted server.
An ancillary benefit of switching is that some of my old posts look a lot better now. The formatting was corrupted in the conversion to Ghost originally, and the converted Jekyll markdown looks better now.
I started with jekyll_ghost_importer, which gave me a nice baseline but left a few issues: bare URLs were not converted into links, the author was missing from the front matter, and the post dates lacked a timezone.
I fixed these issues with a ruby snippet like:
Dir.glob("_posts/*").each do |post|
contents = File.read(post)
updated = contents
.gsub(%r{(\s)(https?://.*)(\s)}, "\\1[\\2](\\2)\\3")
.sub("---\n\n", "author: Paul\n---\n\n")
.sub(/date: '(.+)'/, "date: '\\1 UTC'")
File.open(post, "w") { |f| f.write(updated) }
end
Next, I browsed a bunch of Jekyll themes until I found a simple one I liked: tale. I also tweaked the About page.
Then, I configured Disqus, Google Analytics, and added redirects for my old RSS feed. I also changed the permalink to include a trailing slash to match Ghost: permalink: /:year/:month/:day/:title/
I added a simple script to deploy:
JEKYLL_ENV=production jekyll build
rsync -avz --delete _site/ pgrs.net:/var/www/blog/
Finally, I checked to make sure my top posts still worked and looked good. I spot checked a few, but wanted to make sure all my popular posts maintained their current URL. I downloaded my 50 most popular posts from Google Analytics and put them in a file. Then, I wrote a simple wget command to check that they all returned a successful response:
wget -q --spider -i <file of URLs>
echo $? # exit code of 0 means success
After I made the cutover, I watched my server logs for a few days for abnormalities. I discovered a few important things:
If a page wasn’t found, my nginx setup would render the 404 error page, but as a successful 200 response instead of the proper 404 response code. It would look correct in a browser but be incorrect for search engines and other bots.
I fixed this by adding this to my server block:
error_page 404 /404/index.html;
And then this to my location block:
try_files $uri $uri.html $uri/index.html =404;
The =404 tells nginx to serve a 404 response if it can’t find the static file, and the error_page tells it which page to render.
My Ghost blog used to serve Accelerated Mobile Pages (AMP) with a trailing /amp/ at the end of every post URL. My new Jekyll site doesn’t do this, which was causing a lot of 404s in the logs. I’m not sure how many errors were shown to users vs Google just serving a stale version of the page.
In any case, I fixed it by redirecting AMP links back to the main post:
rewrite ^(.*)/amp/?$ $1 last;
I verified that the non-AMP pages look good on my phone.
Google still seems to have some AMP pages cached, but hopefully these go away over time.
I noticed many 404s in my logs of requests to posts with an off-by-one date. For example, they would request /2018/11/27/foo instead of /2018/11/26/foo.
I discovered that Ghost doesn’t seem to mind if the date is incorrect. It will properly redirect to the correct URL. Jekyll’s static site does care, however, and these will 404.
I tried to figure out where these incorrect URLs came from. My best guess is that an old Ghost bug served the post at the wrong URL for some amount of time: https://github.com/TryGhost/Ghost/issues/7655. I deployed the fixed version a long time ago, but I think the incorrect URLs made their way into search indexes.
Since the 404s in the logs seem to only be from bots, and the issue on my Ghost blog has likely been fixed for a long time, I figured it was ok to leave these 404s. I double checked that search results in the major US search engines returned the correct URLs.
I write a lot of code in Vim, and depending on the programming language, mostly stay inside the editor.
However, I have a common workflow where I want to send a link of what I’m looking at to someone. I used to open GitHub or GitHub Enterprise, browse to the file and line, and then copy/paste the URL to someone. This is onerous, so I decided to automate it with a Vim function (that lives in the vimrc file).
Here’s the function:
function! GitHubURL() range
let branch = systemlist("git name-rev --name-only HEAD")[0]
let remote = systemlist("git config branch." . branch . ".remote")[0]
let repo = systemlist("git config --get remote." . remote . ".url | sed 's/\.git$//' | sed 's_^git@\\(.*\\):_https://\\1/_' | sed 's_^git://_https://_'")[0]
let revision = systemlist("git rev-parse HEAD")[0]
let path = systemlist("git ls-files --full-name " . @%)[0]
let url = repo . "/blob/" . revision . "/" . path . "#L" . a:firstline . "-L" . a:lastline
echomsg url
endfunction
command! -range GitHubURL <line1>,<line2>call GitHubURL()
Now, I can select a block of code and run :GitHubURL and it will print something like https://github.com/braintreeps/vim_dotfiles/blob/f6c550529b16b48f9ac99d5dd60c354373aa3fa1/vimrc#L275-L284
It should work for both GitHub and GitHub Enterprise.