Data Integrity

Understanding rqlite data integrity guarantees

Data files at startup

When a node starts, rqlite computes a CRC over every on-disk data file and compares it to the value it stored in a sidecar file when it originally wrote that file. The node refuses to run if any file fails the check. This catches bitrot, or any other modification, that occurred while the node was shut down.
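As a rough illustration of the startup check, the sketch below computes a CRC over a file and compares it with a value kept in a sidecar file. The choice of CRC-32, the `.crc32` suffix, the hexadecimal sidecar format, and all function names are assumptions made for this example. They are not rqlite's actual on-disk layout or code.

```python
import zlib
from pathlib import Path

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time so large files are never loaded whole


def file_crc32(path: Path) -> int:
    """Return the CRC-32 of the file's contents, computed incrementally."""
    crc = 0
    with path.open("rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            crc = zlib.crc32(chunk, crc)
    return crc


def sidecar_path(path: Path) -> Path:
    # Hypothetical naming convention, used only in this sketch.
    return path.with_name(path.name + ".crc32")


def write_sidecar(path: Path) -> None:
    """Record the file's current CRC next to it, as 8 hex digits."""
    sidecar_path(path).write_text(f"{file_crc32(path):08x}\n")


def verify(path: Path) -> bool:
    """Return True if the file still matches the CRC recorded in its sidecar."""
    expected = int(sidecar_path(path).read_text().strip(), 16)
    return file_crc32(path) == expected
```

A startup routine built on this sketch would call `verify` for each data file and refuse to continue if any call returns `False`.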

Snapshots

rqlite computes a CRC over each Raft snapshot at write time and stores the CRC alongside the snapshot file. When a Leader sends a snapshot to a Follower, it sends the stored CRC with it. The Follower recomputes the CRC over the bytes it received and rejects the snapshot if the values do not match.

This gives end-to-end integrity from the moment the snapshot was first written. It catches both transport corruption and any on-disk corruption that may have struck the sender’s copy between snapshot creation and send.
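The receiving side of this exchange can be sketched as follows. The important detail is that the expected value is the CRC stored when the snapshot was written, not one recomputed at send time, so corruption of the sender's on-disk copy is also caught. The function name, exception type, and stream-based interface are hypothetical, and CRC-32 is assumed. This is not rqlite's transfer code.

```python
import zlib
from typing import BinaryIO


class SnapshotRejected(Exception):
    """Raised when the received bytes do not match the CRC sent with them."""


def receive_snapshot(
    stream: BinaryIO, expected_crc: int, sink: BinaryIO, chunk_size: int = 64 * 1024
) -> int:
    """Copy a snapshot from stream to sink, verifying its CRC on the way.

    Returns the number of bytes received. The caller should treat the
    contents of sink as unusable if SnapshotRejected is raised.
    """
    crc = 0
    total = 0
    while chunk := stream.read(chunk_size):
        crc = zlib.crc32(chunk, crc)  # running CRC over the bytes as received
        sink.write(chunk)
        total += len(chunk)
    if crc != expected_crc:
        raise SnapshotRejected(
            f"CRC mismatch: expected {expected_crc:08x}, got {crc:08x}"
        )
    return total
```

Because the data is written to `sink` before the final comparison, a real receiver would likely write to a temporary location and discard it on rejection.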

Database state after snapshot merging

After rqlite merges previously snapshotted SQLite data, it runs SQLite’s PRAGMA quick_check against the resulting database. The check walks most of the database’s internal structure: B-trees and page references. A failure causes rqlite to log the error and refuse to proceed with the new state.

PRAGMA quick_check is a structural check. It detects malformed pages, structurally damaged indexes, and similar damage, but it does not detect every form of bitrot. It does not verify that index content matches table content, and a flipped bit inside a column value that leaves the surrounding page structure intact will pass the check.
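The limitation can be reproduced with Python's standard `sqlite3` module. The sketch below runs PRAGMA quick_check, which returns a single row containing `ok` when it finds no problems. The second helper deliberately flips one bit inside a stored value so the effect can be observed. Both function names are hypothetical, and the code is only an illustration, not how rqlite invokes the check.

```python
import sqlite3
from contextlib import closing


def passes_quick_check(db_path: str) -> bool:
    """Return True if PRAGMA quick_check reports no problems."""
    try:
        with closing(sqlite3.connect(db_path)) as conn:
            rows = [row[0] for row in conn.execute("PRAGMA quick_check")]
    except sqlite3.DatabaseError:
        # Damage severe enough that SQLite cannot even run the check.
        return False
    return rows == ["ok"]


def flip_bit_in_value(db_path: str, marker: bytes) -> None:
    """Flip the lowest bit of the first byte of marker inside the file.

    Simulates bitrot that lands inside a column value. Raises ValueError
    if marker does not occur in the file.
    """
    with open(db_path, "r+b") as f:
        data = f.read()
        offset = data.index(marker)
        f.seek(offset)
        f.write(bytes([data[offset] ^ 0x01]))
```

Given a small database holding a text value such as `integrity-marker`, calling `flip_bit_in_value` on that value changes what a later `SELECT` returns, yet `passes_quick_check` should still return `True` because the page structure is untouched.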

What rqlite does not check

rqlite does not scan every byte on disk on a continuous schedule while a node runs. A file that sits unread between the startup check and its next access could suffer silent bitrot that rqlite might not detect.

This division of responsibility matches SQLite’s own design. SQLite’s atomic commit documentation states that SQLite assumes the detection and correction of bit errors – from cosmic rays, thermal noise, device driver bugs, or other causes – is the responsibility of the underlying hardware and operating system. rqlite inherits that assumption and names it here so operators can make informed choices.

The checks rqlite does perform – CRCs at startup, snapshot CRCs end-to-end, and PRAGMA quick_check after merging – add negligible runtime cost over the lifetime of a node. Continuous on-disk scrubbing is the expensive piece, and it is the piece best handled by the storage layer.

For continuous protection, deploy rqlite on a file system that performs its own checksumming and periodic scrubbing. ZFS and Btrfs, for example, do this natively. Most cloud block storage – AWS EBS, GCP Persistent Disk, Azure Managed Disks – handles this at the storage layer.

Recovery from detected corruption

When a node detects corruption – at startup, on receiving a snapshot, or during PRAGMA quick_check – it refuses to use the bad data. To recover, ensure the node is stopped, remove its data directory, restart it, and let it rejoin the cluster. Alternatively, remove and deprovision the node, and add a new node to the cluster. In either case, rqlite will replicate the current state from the Leader.

This path depends on the cluster having quorum and at least one healthy peer with intact data. A correlated failure across nodes – a firmware bug affecting every disk of the same model, or a shared-storage outage – may force you to perform an emergency recovery. Operators running production clusters should also keep independent backups against this case.

Last modified July 25, 2026: Typos (70feb4a)