Recently, Kyle Kingsbury published a Jepsen report on NATS 2.12.1. Synadia welcomes the efforts of those who work to help make NATS a better product. To that end, we would like to thank Kyle for volunteering his time.
We have been engaging with Kyle via GitHub Issues on findings where injecting certain faults could result in data loss or cluster split-brain scenarios.
The first reported issue, where a stream could be unexpectedly deleted after process kills in versions 2.10.20 through 2.10.22, had already been identified and fixed in 2.10.23 earlier this year as part of a series of patches to improve the correctness and reliability of our clustering logic. Many more such fixes were included in 2.11 and 2.12.
Further issues around how the filestore handles corrupted blocks or metadata, such as those introduced by bitflips or truncation, are still being worked on. While JetStream was designed from day one with message checksums that are verified on each load, we are aware of certain cases where a node may not automatically recover corrupted information from other peers in the cluster, particularly where data is lost from the beginning or middle of a stream rather than the end.
We also engaged on the handling of cluster membership changes, although this was not a focus area of the final report. We have identified and implemented a number of fixes in 2.12.3 that reduce the ways in which peer removal operations can go wrong. We have also contributed fixes back to the Jepsen testing library for NATS that improve the safety of membership changes and yield better results, and we have improved our documentation in this area.
One area of particular discussion was the fsyncing of data. Because one size rarely fits all, NATS has a configuration option called sync_interval that lets the operator control how often JetStream writes are fsynced to disk, providing a choice between performance and durability guarantees. This matters particularly when guarding against total power loss, where writes sitting in the kernel page cache that have not yet been synced to disk can be lost. The sync_interval is generally treated as the maximum potential data loss window: the kernel may flush writes to disk sooner, but there are important caveats, particularly when writes are merged or reordered before they reach stable storage.
There are a number of considerations involved when selecting a sync_interval. By default, the server syncs any outstanding writes every two minutes. A system that deals mostly with replicated, ephemeral, or lossy data can benefit from the improved performance of a longer sync interval, whereas one that requires strict durability should set this value as low as is practically reasonable.
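As a minimal sketch, a time-based interval is set in the jetstream block of the server configuration; the store_dir path and the 30-second value below are purely illustrative, not recommendations:

```
jetstream {
  store_dir: "/data/jetstream"

  # How often outstanding JetStream writes are fsynced to disk.
  # The default is 2m; a shorter interval narrows the potential
  # data loss window after a power failure, at some cost to throughput.
  sync_interval: "30s"
}
```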
Clustered deployments spanning multiple availability zones are ideal in this respect: they are less likely to experience simultaneous power or storage failures, can remain available while a minority of nodes is lost, and can often be reconciled over the network afterwards.
Non-clustered deployments do not have this luxury built in, although it is possible to replicate data to another location by sourcing or mirroring streams over leafnode connections, as sketched below. We have since updated the documentation under “Syncing data to disk” with clarifications around sync_interval and how it behaves during multiple coordinated failures or degraded environments, and we are continuing to improve our documentation so that users have the information they need to make the right decision for their environment and business needs.
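As a rough illustration of the mirroring approach, the sketch below uses the nats.go client to create a stream that mirrors an existing stream on another site. The stream names and URL are hypothetical, and a real cross-leafnode setup would also need the appropriate account and JetStream domain configuration on both sides.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the server that should hold the replica (URL is illustrative).
	nc, err := nats.Connect("nats://remote-site:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Create a stream that mirrors the hypothetical ORDERS stream.
	// Messages written to ORDERS are copied into ORDERS_MIRROR
	// asynchronously, providing an off-site copy of the data.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:    "ORDERS_MIRROR",
		Storage: nats.FileStorage,
		Mirror:  &nats.StreamSource{Name: "ORDERS"},
	}); err != nil {
		log.Fatal(err)
	}
}
```

Sourcing works the same way via the Sources field of the stream configuration, aggregating messages from one or more upstream streams rather than keeping a one-to-one copy.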
In addition to time-based intervals, sync_interval can also be set to always, forcing JetStream to call fsync after every write. This is a common recommendation for deployments where loss must be minimized absolutely, or where the underlying platform cannot itself guarantee uptime, as with many embedded single-server deployments of NATS. This higher level of durability comes at the expense of performance.
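In the same illustrative configuration as above, the strictest setting would look like this:

```
jetstream {
  store_dir: "/data/jetstream"

  # fsync after every write: maximum durability, lowest throughput.
  sync_interval: always
}
```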
We have previously discussed the work that has gone into improving the reliability of NATS in versions 2.11 and 2.12. An increased focus on testing, along with our partnership with Antithesis, has yielded significant improvements thus far. We are therefore happy to read that, barring disk corruption or OS failures, no data loss was observed:
We did not observe data loss with simple network partitions, process pauses, or crashes in version 2.12.1.
Looking forward, we have projects lined up to further improve the reliability and consistency of JetStream’s replication in the face of disk corruption or OS failures.
Today, when a stream’s backing storage is corrupted, the server detects this and truncates the remaining data in the affected block to preserve service availability. We will develop new ways for the server to detect lost data and recover it from other cluster peers where possible, reducing the need for manual intervention or recovery. Recovering from OS failures could be handled in a similar way, ensuring that consistency is preserved across one or more OS failures.
A number of improvements from these findings have already been made available with the release of NATS 2.12.3.