
Storage, cost, and time

What I wish I'd known about modeling scaling over time.


At Stripe, I've been spending some time thinking about the growth of our fastest-growing online dataset. I finally landed on a model that's been useful for planning for the future.

We have some tricky constraints:

  • Self-managed, imperfect data layer with a lengthy shard-split process (O(weeks))
  • Latency-sensitive, so it must be SSD-backed, which makes our storage scaling options inelastic
  • Can't delete any of the data that dominates storage utilization; we can TTL some data around the fringes

What I found not useful:

  • MoM growth rate: we made the mistake of claiming we had a 10% MoM growth rate. It's not that the metric was wrong at the time; it's that it wouldn't stay that way for more than a few months. This was a relatively new dataset/workload, so we were measuring our growth rate against a small denominator. Over time, the growth rate will normalize (the toy numbers below illustrate this).
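
To see why the percentage metric decays on its own, here's a quick Python sketch with made-up numbers (the 10 TB starting size and 1 TB/month growth are purely illustrative, not our real figures): even with perfectly constant absolute growth, the MoM percentage shrinks as the denominator grows.

```python
# Hypothetical numbers: a dataset that adds a steady 1 TB every month
# only looks like explosive percentage growth while it's still small.
sizes_tb = [10 + m for m in range(13)]  # start at 10 TB, grow 1 TB/month

for month in range(1, 13):
    prev, curr = sizes_tb[month - 1], sizes_tb[month]
    mom_pct = (curr - prev) / prev * 100
    print(f"month {month:2d}: {curr:3d} TB  (+{curr - prev} TB, {mom_pct:4.1f}% MoM)")
```

Same +1 TB every month, but the "growth rate" drifts from 10% down toward 5% without anything changing operationally.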

What I found very useful:

  • Absolute growth numbers: we have a horizontally scalable data layer, but scaling it requires splitting shards in Mongo, which takes a while. It's workable overhead, not a perfect solution, so the real problem was understanding when we would need to go through it again. To answer that, I looked at absolute growth (X TB a month or whatever).
  • Increase capacity & decrease absolute growth rate: I then split our efficiency efforts into two buckets: those that decrease our absolute growth rate (a smaller data model, TTLing some metadata after X months) and those that increase our capacity (adopting more storage-dense instances), and scored each by how much time it buys us before we have to take action (split shards) again. The sketch after this list shows the arithmetic.
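
Here's a back-of-the-envelope version of that model in Python. The capacity, usage, and growth figures are placeholders, not Stripe's numbers; the point is just that every efficiency effort gets scored in months of runway bought before the next shard split.

```python
def months_of_runway(capacity_tb: float, used_tb: float, growth_tb_per_month: float) -> float:
    """Months until we hit capacity and have to split shards again."""
    return (capacity_tb - used_tb) / growth_tb_per_month

# Illustrative baseline: 100 TB capacity, 60 TB used, growing 5 TB/month.
baseline = months_of_runway(capacity_tb=100, used_tb=60, growth_tb_per_month=5)

# Bucket 1: decrease absolute growth (smaller data model, TTL fringe metadata).
slower_growth = months_of_runway(capacity_tb=100, used_tb=60, growth_tb_per_month=4)

# Bucket 2: increase capacity (adopt more storage-dense instances).
more_capacity = months_of_runway(capacity_tb=130, used_tb=60, growth_tb_per_month=5)

print(f"baseline runway:       {baseline:.1f} months")
print(f"slower growth buys us: {slower_growth - baseline:+.1f} months")
print(f"more capacity buys us: {more_capacity - baseline:+.1f} months")
```

Framing the two buckets in the same unit (months bought) is what makes them comparable, and it makes "when do we split shards next" a date instead of a vibe.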

Ultimately, we need tiered storage. We're not using Postgres, but it's great to see work like pg_tier heading in that direction.