Practical Byzantine Fault Tolerance

Reference: Castro, M. & Liskov, B. (1999). Practical Byzantine Fault Tolerance. In 3rd Symposium on Operating Systems Design and Implementation (OSDI ’99), pp. 173–186, USENIX. Companion: Castro, M. (2001). Practical Byzantine Fault Tolerance. PhD thesis, MIT. Journal version: ACM Transactions on Computer Systems 20(4), pp. 398–461, 2002. Open access PDF (MIT CSAIL) · USENIX OSDI ’99 page · Journal version (BFT-TOCS)

Summary

Castro and Liskov demonstrate that Byzantine fault tolerance (agreement among 3f+1 replicas in the presence of up to f arbitrarily faulty nodes, whether malicious, buggy, or compromised) can be made practical: their protocol, PBFT, achieves throughput within a small factor of unreplicated service for realistic workloads, where prior BFT protocols had been orders of magnitude slower. The protocol assumes the partial-synchrony model (an eventual upper bound on message delay) for liveness; safety holds even in fully asynchronous networks. The core protocol is a three-phase primary-backup scheme (pre-prepare, prepare, commit) driven by a designated primary (replica p such that p ≡ v mod n for view number v, where n is the number of replicas). The primary orders client requests; the prepare phase ensures that 2f+1 replicas agree on the order within the current view; the commit phase ensures that the order survives view changes. A view-change protocol replaces the primary if it is suspected of failure: backups time out, exchange their certified message logs, and move to the next primary in round-robin order; the new primary re-proposes every request that the received certificates show as prepared, so that nothing that may have committed is lost. The two key engineering moves that make BFT practical are (1) MAC vectors (one symmetric MAC per recipient) instead of public-key signatures on every message, with public-key crypto reserved for view changes, and (2) a careful checkpoint-and-garbage-collect mechanism that bounds memory and accelerates recovery. The paper applies PBFT to a Byzantine-fault-tolerant NFS implementation; performance is within 3% of unreplicated NFS for realistic file-system workloads. PBFT inaugurated 25+ years of practical BFT research and is the direct ancestor of modern blockchain consensus protocols including Tendermint, HotStuff, and Diem/Aptos’s BFT family.
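
The replica-count arithmetic and the round-robin primary rule from the summary can be checked in a few lines. This is a minimal sketch rather than code from the paper; the function names (`replicas_needed`, `quorum_size`, `min_quorum_overlap`, `primary_for_view`) are illustrative assumptions.

```python
def replicas_needed(f: int) -> int:
    """Replicas required to tolerate f Byzantine faults (n = 3f + 1)."""
    return 3 * f + 1


def quorum_size(f: int) -> int:
    """Quorum used by the prepare and commit phases (2f + 1)."""
    return 2 * f + 1


def min_quorum_overlap(f: int) -> int:
    """Smallest possible intersection of two quorums among 3f + 1 replicas.

    Two sets of size q drawn from n replicas share at least 2q - n members,
    which works out to f + 1 here, so at least one shared replica is honest.
    """
    return 2 * quorum_size(f) - replicas_needed(f)


def primary_for_view(view: int, n: int) -> int:
    """Round-robin primary selection: p = v mod n."""
    return view % n


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f, replicas_needed(f), quorum_size(f), min_quorum_overlap(f))
```

The overlap of f+1 is the reason 2f+1 quorums are enough: even if f of the shared replicas lie, one honest replica has seen both decisions.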

Key Ideas

3f+1 replicas tolerate f Byzantine failures: the standard BFT bound (Lamport, Shostak & Pease 1982). Crash-fault protocols get by with 2f+1 replicas because any two majorities share at least one replica and a crashed replica never lies. A Byzantine replica in that intersection can equivocate, telling each quorum a different story. With 3f+1 replicas and quorums of 2f+1, any two quorums overlap in at least f+1 replicas, so at least one replica in the overlap is honest.
Three-phase protocol driven by a primary: pre-prepare (the primary assigns sequence number n to a request and multicasts it to the backups), prepare (each backup that accepts the pre-prepare multicasts a prepare message; a replica treats the request as prepared once it holds the pre-prepare plus 2f matching prepares from distinct backups, i.e. 2f+1 replicas in total), commit (each prepared replica multicasts a commit; once a replica holds 2f+1 matching commits, possibly including its own, the request is committed and can be executed in sequence-number order).
View changes for primary failure: when a backup times out without progress, it multicasts a view-change message carrying its latest stable checkpoint and the prepared certificates it holds from earlier views; the new primary (next in round-robin order) collects 2f+1 view-change messages, its own included, and issues a new-view message containing the pre-prepares that re-propose those prepared requests in the new view.
MAC vectors instead of signatures: every message carries a vector of MACs (an authenticator), one per recipient, each computed under the symmetric key the sender shares with that recipient. The paper measures MAC computation as roughly three orders of magnitude faster than generating a public-key signature; pairwise MACs suffice because a Byzantine replica cannot forge a MAC under a key it does not know.
Checkpoints and garbage collection: every K requests, replicas take a checkpoint of state and broadcast a checkpoint message; once 2f+1 matching checkpoints exist (a stable checkpoint), older log entries can be discarded. Stable checkpoints also accelerate state transfer for recovering or lagging replicas.
Byzantine-fault-tolerant NFS: end-to-end demonstration that BFT can be deployed in production-style systems with manageable performance overhead — essential evidence that BFT was practical, not just theoretically interesting.
Safety always, liveness under partial synchrony: PBFT preserves safety (no two committed values disagree) under fully asynchronous networks; liveness requires partial synchrony (timeouts must eventually become accurate). FLP-compatible: the asynchronous gap is in liveness, not safety.
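
The prepared and committed conditions in the three-phase bullet above can be sketched as the bookkeeping one replica does for a single (view, sequence number) slot. This is a simplified illustration under stated assumptions: the `Slot` class and its method names are hypothetical, messages are assumed to be already authenticated, and view changes, watermarks, and the check that a prepare does not come from the primary are all omitted.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Slot:
    """Per-replica state for one (view, sequence number) pair."""
    f: int                                           # tolerated faults
    digest: Optional[str] = None                     # digest from the accepted pre-prepare
    prepares: Set[int] = field(default_factory=set)  # ids of backups that sent a matching prepare
    commits: Set[int] = field(default_factory=set)   # ids of replicas that sent a matching commit

    def on_pre_prepare(self, digest: str) -> bool:
        """Accept the first pre-prepare; reject a conflicting one for the same slot."""
        if self.digest is not None and self.digest != digest:
            return False
        self.digest = digest
        return True

    def on_prepare(self, sender: int, digest: str) -> None:
        # Simplification: prepares that arrive before the pre-prepare are dropped.
        if digest == self.digest:
            self.prepares.add(sender)

    def on_commit(self, sender: int, digest: str) -> None:
        if digest == self.digest:
            self.commits.add(sender)

    @property
    def prepared(self) -> bool:
        """Pre-prepare plus 2f matching prepares from distinct backups."""
        return self.digest is not None and len(self.prepares) >= 2 * self.f

    @property
    def committed_local(self) -> bool:
        """Prepared, plus 2f + 1 matching commits from distinct replicas."""
        return self.prepared and len(self.commits) >= 2 * self.f + 1
```

Sets keyed by sender id mean a replica that repeats a message is still counted once, which is what makes the thresholds meaningful.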

Connections

Conceptual Contribution

Claim: Byzantine fault tolerance can be made practical (within a small factor of unreplicated performance) by combining a three-phase primary-backup protocol with view-change recovery, MAC vectors instead of public-key signatures on the common path, and checkpoint-driven garbage collection. Safety holds in asynchronous networks; liveness requires partial synchrony.
Mechanism: Primary-backup protocol with 3f+1 replicas; three phases (pre-prepare, prepare, commit), with prepare and commit each completing on a quorum certificate of 2f+1 replicas; view-change protocol triggered by backup timeouts, moving to the next primary by round-robin; MAC vectors with public-key crypto reserved for view changes; periodic stable checkpoints for log truncation and state transfer.
Concepts introduced/used: PBFT, Byzantine Agreement, View Change, MAC Vector, Stable Checkpoint, 3f+1 Quorum, Partial Synchrony.
Stance: systems-engineering paper / dissertation summary.
Relates to: Implements Byzantine agreement (Lamport, Shostak & Pease 1982 / Pease, Shostak & Lamport 1980) practically. Subject to the same FLP impossibility as crash-fault consensus, with the same partial-synchrony resolution. Pre-blockchain, BFT was largely a theoretical curiosity; PBFT proved deployment viable, but it took the explicit blockchain-as-economic-system framing of Bitcoin (Nakamoto 2008, not in vault) and especially of Tendermint (Buchman 2016) to drive industrial BFT adoption. Direct ancestor of HotStuff (Yin et al. 2019), which inherits PBFT’s three-phase structure but achieves linear communication complexity (vs PBFT’s quadratic) and responsiveness (no waiting for max network delay during normal operation). PBFT-style three-phase protocols underlie the consensus layers of Diem / Aptos, much of Hyperledger Fabric, and (with adaptations) Cosmos SDK chains. Compared to Raft / multi-Paxos: PBFT tolerates malicious nodes at the cost of 3f+1 replicas (vs 2f+1), three communication phases (vs two), and quadratic message complexity (vs linear); for crash-only environments Raft is simpler and faster.
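
The MAC-vector idea named in the mechanism above can be illustrated with the Python standard library. This is a sketch under assumptions, not the paper's implementation: HMAC-SHA256 stands in for whatever MAC construction a deployment would choose, and the helper names (`make_authenticator`, `verify_entry`) and the key layout are hypothetical.

```python
import hashlib
import hmac
from typing import Dict


def make_authenticator(keys_by_recipient: Dict[int, bytes], message: bytes) -> Dict[int, bytes]:
    """Build one MAC per recipient, each under the key shared with that recipient."""
    return {
        recipient: hmac.new(key, message, hashlib.sha256).digest()
        for recipient, key in keys_by_recipient.items()
    }


def verify_entry(my_id: int, shared_key: bytes, message: bytes,
                 authenticator: Dict[int, bytes]) -> bool:
    """A recipient checks only its own entry of the vector."""
    tag = authenticator.get(my_id)
    if tag is None:
        return False
    expected = hmac.new(shared_key, message, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)  # constant-time comparison
```

One consequence of this design is that a MAC convinces only the replica holding the matching key, so a recipient cannot use it to prove to a third party what the sender said. That is one reason signatures are kept for view changes.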

Tags

#consensus #byzantine-fault-tolerance #pbft #castro #liskov #distributed-systems #foundations

Backlinks

Linked Pages

Raft

Ongaro & Ousterhout’s (2014) crash-fault-tolerant consensus algorithm equivalent in capability to multi-Paxos but designed primarily for understandability. Decomposes consensus into Leader Election / Log Replication / safety; enforces a strong-leader discipline (logs flow only leader→follower); uses randomised election timeouts and term-numbered logical clocks. The dominant CP-style consensus algorithm in modern systems: etcd, Kubernetes, CockroachDB, TiKV, Consul.

In this vault

HotStuff

Yin et al.’s (2019) BFT consensus protocol with linear communication complexity (O(n) per decision in both the common case and view change) and responsiveness (commits at network speed, not max-timeout speed), both improvements over PBFT. Achieves linearity via threshold-signature quorum certificates and a uniform three-chain commit rule; the chained variant pipelines the three phases so that each leader sends one proposal per view. The consensus core of Diem and Aptos (through HotStuff-derived protocols) and of many recent BFT-PoS blockchains.

In this vault

Impossibility of Distributed Consensus with One Faulty Process

Reference

Fischer, M. J., Lynch, N. A., & Paterson, M. S. (1985). “Impossibility of Distributed Consensus with One Faulty Process.” Journal of the ACM, 32(2), 374-382. URL

Summary

The FLP result is the canonical impossibility theorem of asynchronous distributed computing. Its statement is sharp: no deterministic consensus protocol can guarantee termination in an asynchronous message-passing system if even a single process may crash. Unlike earlier results that required Byzantine faults or lossy networks, FLP assumes reliable messaging and only one benign crash failure — yet still derives impossibility.

The proof proceeds by showing that every consensus protocol admits an initial bivalent configuration (one from which either decision value is still reachable), and that from any bivalent configuration an adversary scheduler can always delay one message to force the system into another bivalent configuration. Thus an admissible run exists in which no process ever decides. The core technical tool is the commutativity of disjoint process steps (Lemma 1) and a careful analysis of “critical” configurations where a specific process’s next step is decision-forcing.

The result cleaves distributed computing into what is possible under various synchrony assumptions. Real-world protocols respond by weakening one axis: Paxos and Raft adopt partial synchrony and accept that liveness can only be guaranteed “eventually”; randomized consensus (Ben-Or, Rabin) achieves termination with probability 1; failure detectors (Chandra-Toueg ◊S) encapsulate the synchrony needed. FLP remains the bedrock boundary against which all consensus engineering is measured.

Key Ideas

Consensus problem: N processes, binary inputs; non-faulty processes must all decide the same value; some initial configuration must admit each decision.
Asynchronous model: unbounded message delays; no clocks; no timeouts.
One crash failure: the weakest possible fault assumption that still breaks consensus.
Bivalent configurations: states from which both 0 and 1 outcomes are still reachable.
Adversary scheduler: by reordering message deliveries, keeps the system in a bivalent configuration forever.
Safety vs. liveness: FLP shows safety + liveness + fault-tolerance cannot coexist in pure async.
Escape hatches: partial synchrony, randomization, failure detectors, or accepting non-termination in corner cases.

Connections

CAP Theorem — CAP is a direct relative: in partition-prone systems, atomic read/write also unattainable.
CALM Theorem — monotonic logic sidesteps consensus by avoiding it.
Keeping CALM - When Distributed Consistency is Easy
Coordination Avoidance — the design pattern motivated by FLP.
Gossip Protocols — probabilistic convergence as an alternative to deterministic agreement.
Time Clocks and the Ordering of Events in a Distributed System — Lamport’s logical time underlies the proof’s commutation arguments.
Knowledge and Common Knowledge in a Distributed Environment — common knowledge likewise unattainable in async systems.

Conceptual Contribution

Byzantine Agreement

The classical problem (Pease, Shostak & Lamport 1980; Lamport, Shostak & Pease 1982, the Byzantine Generals Problem): get n processes to agree on a value when up to f of them may be Byzantine (arbitrarily faulty, including coordinated malicious behaviour). The standard bound is n ≥ 3f+1 for deterministic agreement with unauthenticated messages, even in a synchronous network; with authenticated (signed) messages the bound can be relaxed under synchrony, but n ≥ 3f+1 remains necessary in the partially synchronous model. Practical algorithms: PBFT (Castro & Liskov 1999), HotStuff (Yin et al. 2019), Tendermint, HoneyBadgerBFT.

In this vault

View Change

The recovery sub-protocol of PBFT and successor BFT protocols by which a stuck or faulty primary is replaced. Backups that have not made progress within a timeout broadcast view-change messages containing certificates of their prepared entries; the next primary (round-robin) collects 2f+1 view-change messages, works out which requests may have committed in earlier views, and starts the next view with a new-view message that re-proposes them. View change is the most subtle part of any BFT protocol; many bugs in deployed BFT systems are localised to view-change edge cases. HotStuff simplifies view change to a single message round (one of its principal contributions over PBFT).
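
The step in which the new primary decides what to re-propose can be sketched as follows. This is one simplified reading of that step, not a full view-change implementation: `PreparedCert` and `select_for_new_view` are hypothetical names, every certificate is assumed to have been validated already, and `low` stands for the sequence number of the latest stable checkpoint.

```python
from typing import Dict, List, NamedTuple, Optional


class PreparedCert(NamedTuple):
    view: int      # view in which the request prepared
    seq: int       # sequence number assigned to the request
    digest: str    # digest of the request


def select_for_new_view(view_changes: List[List[PreparedCert]],
                        low: int, f: int) -> Optional[Dict[int, Optional[str]]]:
    """Map each sequence number above `low` to the digest to re-propose.

    Returns None until 2f + 1 view-change messages have been collected.
    A value of None for a sequence number means "fill the gap with a no-op".
    """
    if len(view_changes) < 2 * f + 1:
        return None
    best: Dict[int, PreparedCert] = {}
    for certs in view_changes:
        for cert in certs:
            # Keep, per sequence number, the certificate from the highest view.
            if cert.seq > low and (cert.seq not in best or cert.view > best[cert.seq].view):
                best[cert.seq] = cert
    if not best:
        return {}
    high = max(best)
    return {seq: (best[seq].digest if seq in best else None)
            for seq in range(low + 1, high + 1)}
```

Filling gaps with no-ops keeps the sequence contiguous, so replicas can keep executing in order after the view change.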

In this vault

PBFT

Castro & Liskov’s (1999) Practical Byzantine Fault Tolerance protocol — 3f+1 replicas tolerating up to f Byzantine failures via a three-phase primary-backup protocol (pre-prepare / prepare / commit) plus view-change recovery. The first BFT protocol fast enough for production deployment; demonstrated on Byzantine NFS within 3% of unreplicated performance. Direct ancestor of Tendermint, HotStuff, Diem/Aptos consensus.

In this vault

Brewers Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services

Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

Reference

Gilbert, S., & Lynch, N. (2002, revisited 2012). “Perspectives on the CAP Theorem.” MIT / National University of Singapore. URL

Summary

This paper is the formal proof and later retrospective of Brewer’s CAP conjecture: in a distributed system subject to communication failures, no web service can simultaneously guarantee Consistency (atomic read/write), Availability (every request receives a response), and Partition-tolerance (the system continues to operate when messages are lost between nodes). The proof is elegantly short: partition the servers into two groups; a write on one side and a read on the other must answer, but the read cannot know of the write, so either consistency or availability must fail.

The authors situate CAP within the deeper trade-off between safety and liveness properties in unreliable systems — the very trade-off FLP formalized for consensus. Consistency is a safety property (“nothing bad happens”), availability is a liveness property (“something good eventually happens”), and the unreliability axis includes partitions, crashes, and Byzantine faults. CAP is then one specific instance of the general fact that safety + liveness are jointly unattainable in sufficiently unreliable systems.

The paper distinguishes practical regimes (always-consistent with best-effort availability; always-available with weak/eventual consistency; hybrid tactics) and connects to partial synchrony results (Dwork, Lynch, Stockmeyer) that quantify exactly how much timing reliability is needed. CAP has become a rallying slogan and a misused one — the paper explicitly warns it is a theorem about adversarial partitions, not a license to abandon consistency whenever convenient.

Key Ideas

CAP theorem: pick at most two of consistency, availability, partition-tolerance in an unreliable network.
Asynchronous impossibility: even without actual partitions, async delays force the same trade-off — you cannot distinguish a slow network from a partitioned one.
Safety vs. liveness lens: CAP is a concrete instance of a broader unreliability theorem.
Weak consistency models: eventual, causal, sequential — engineered escapes from strict CAP.
Synchrony continuum: fully synchronous → partially synchronous → fully asynchronous; feasibility varies along it.
Practical taxonomy: CP, AP, and CA-only-without-partitions system designs.
Not a license: the theorem is often cited to justify weaker-than-needed guarantees; read carefully.

Connections

CAP Theorem
CALM Theorem — identifies the programs for which consistency does not require coordination.
Keeping CALM - When Distributed Consistency is Easy
Coordination Avoidance
Gossip Protocols — eventual consistency in the AP regime.
Impossibility of Distributed Consensus with One Faulty Process — FLP is CAP’s consensus cousin.
Time Clocks and the Ordering of Events in a Distributed System

Conceptual Contribution

Replicated State Machine

The architectural pattern in which a fault-tolerant service is built by running identical deterministic state machines on multiple nodes and using a consensus protocol (Raft, Paxos, PBFT, HotStuff) to agree on the order of inputs (commands) applied to the machine. Given identical initial state and identical input sequences, deterministic replicas reach identical states. The pattern underlies essentially every modern fault-tolerant database, configuration store, and blockchain.
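
The determinism argument can be made concrete with a toy state machine. This is a minimal sketch: the command format `(op, key, amount)` and the function names are assumptions for illustration, and the consensus protocol that would produce the agreed log is left out entirely.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, str, int]  # (operation, key, amount)


def apply_command(state: Dict[str, int], cmd: Command) -> Dict[str, int]:
    """Deterministic transition: same state and command always give the same result."""
    op, key, amount = cmd
    new_state = dict(state)
    if op == "set":
        new_state[key] = amount
    elif op == "add":
        new_state[key] = new_state.get(key, 0) + amount
    else:
        raise ValueError(f"unknown operation: {op}")
    return new_state


def replay(log: List[Command]) -> Dict[str, int]:
    """Run a replica from the empty initial state through an ordered log."""
    state: Dict[str, int] = {}
    for cmd in log:
        state = apply_command(state, cmd)
    return state


if __name__ == "__main__":
    agreed_log = [("set", "x", 1), ("add", "x", 2), ("add", "y", 5)]
    replicas = [replay(agreed_log) for _ in range(3)]
    print(all(r == replicas[0] for r in replicas))
```

Replaying the same commands in a different order can give a different state, which is why the consensus layer has to fix the order and not only the set of commands.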

In this vault

Distributed Consensus

The problem of getting a set of distributed processes to agree on a single value despite some of them failing or behaving adversarially. The negative bound is FLP: in a fully asynchronous network with even one crash failure, no deterministic protocol can guarantee both safety and liveness. Practical algorithms (Paxos, Raft, PBFT, HotStuff) circumvent FLP by making partial-synchrony or randomisation assumptions. Crash-fault tolerance requires 2f+1 nodes to tolerate f failures; Byzantine tolerance requires 3f+1.

In this vault

In Search of an Understandable Consensus Algorithm

In Search of an Understandable Consensus Algorithm (Extended Version)

Reference: Ongaro, D. & Ousterhout, J. (2014). In Search of an Understandable Consensus Algorithm. In 2014 USENIX Annual Technical Conference (USENIX ATC ’14), pp. 305–319. (Extended version on arXiv: 1404.4097.) Companion: Ongaro, D. (2014). Consensus: Bridging Theory and Practice. PhD thesis, Stanford University. Open access PDF (raft.github.io) · project home · arXiv:1404.4097 (extended)

Summary

Ongaro and Ousterhout introduce Raft, a consensus algorithm for replicated state machines that is equivalent in fault-tolerance and performance to multi-Paxos but designed primarily for understandability. The paper opens with the observation that despite Paxos’s status as the canonical consensus algorithm (Lamport 1998), it has consistently proved difficult for students and engineers to learn, reason about, and implement correctly: Lamport’s Paxos description is famously oblique, derivative explanations diverge, and most production “Paxos” implementations are actually significantly different algorithms. Raft is a deliberate engineering response to this state of affairs. It decomposes consensus into three relatively independent sub-problems (leader election, log replication, and safety) and adds an explicit strong leader discipline (logs flow only from leader to followers, never the reverse) plus a log-matching invariant that simplifies the consistency argument. Cluster membership changes are handled by joint consensus, a two-phase transition through an intermediate configuration; Ongaro’s thesis adds a simpler single-server-at-a-time alternative. The paper includes a user study comparing student understanding of Raft against Paxos: across two universities, Raft scored substantially higher on comprehension tests after equivalent teaching time. Ongaro’s PhD thesis adds detail on snapshotting, log compaction, and client interaction. Raft is now the consensus algorithm of choice in the systems community: etcd (Kubernetes), CockroachDB, TiKV, Consul, RethinkDB, and many others use Raft directly; the algorithm is taught in distributed-systems courses worldwide. The paper deliberately demotes formal-verification rigour in favour of operator and engineer accessibility, a methodological stance with its own descendants in the systems literature.

Key Ideas

Three sub-problems: leader election (timeout-driven elections with randomised timeouts to break ties), log replication (leader appends entries and replicates to a majority), safety (committed entries must persist; only up-to-date candidates can win elections).
Strong leader: at most one leader exists per term; followers passively accept the leader’s appends. All client requests go through the leader, and logs flow only leader→follower. This removes the symmetry among proposers that Paxos allows and that complicates reasoning about it.
Election restriction: a voter rejects a candidate’s vote request if the voter’s own log is more up-to-date (its last entry has a higher term, or the same term and a higher index). Combined with majority voting, this guarantees that any newly elected leader’s log contains all previously committed entries.
Log-matching invariant: if two logs contain an entry with the same index and term, then they are identical in all entries up to and including that index. This is enforced by the replication protocol (followers reject appends inconsistent with their last entry) and is the key property simplifying the safety argument.
Membership changes via joint consensus: to safely move from cluster C_old to C_new, the leader appends a joint configuration C_old,new that requires majorities of both configurations to commit; once committed, the leader appends the final C_new. (The thesis presents the simpler single-server-at-a-time method.)
Explicit terms as logical clocks: every server maintains a current term number; communications carry the sender’s term, and any server with a stale term steps down. Terms eliminate stale-leader pathologies that Paxos handles less directly.
Comprehensibility-as-design-criterion: the user-study results are presented as a primary contribution — the explicit thesis that algorithm design should weight understandability as it would performance or fault-tolerance.
Production realities: snapshotting for log compaction, linearizable read leases for read-only requests, client session-IDs for at-most-once semantics — covered in the thesis and absorbed into the standard Raft implementation patterns.
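
Two of the rules in the list above, the election restriction and the follower-side consistency check behind the log-matching invariant, can be sketched directly. This is an illustrative fragment and not a complete Raft node: terms on the RPC itself, commit indexes, and persistence are omitted, and the names `Entry`, `candidate_is_up_to_date`, and `append_entries` are assumptions. Log indexes are 1-based, with 0 meaning "before the first entry".

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    term: int
    command: str


def candidate_is_up_to_date(cand_last_term: int, cand_last_index: int,
                            voter_log: List[Entry]) -> bool:
    """Election restriction: grant a vote only if the candidate's log is
    at least as up-to-date as the voter's."""
    voter_last_index = len(voter_log)
    voter_last_term = voter_log[-1].term if voter_log else 0
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower-side consistency check, then append.

    Rejects unless the follower holds an entry at prev_index with prev_term.
    On success, conflicting suffixes are removed and new entries appended.
    """
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False
    for offset, entry in enumerate(entries):
        pos = prev_index + offset            # 0-based position of this entry
        if pos < len(log):
            if log[pos].term != entry.term:  # conflict: drop it and everything after
                del log[pos:]
                log.append(entry)
        else:
            log.append(entry)
    return True
```

Because every successful append has matched the preceding entry, agreement on one (index, term) pair implies agreement on the whole prefix, by induction.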

Connections

Conceptual Contribution

Claim: A consensus algorithm equivalent in fault-tolerance and performance to multi-Paxos can be designed primarily for understandability by decomposing consensus into independent sub-problems (leader election / log replication / safety), enforcing a strong-leader discipline, and adding a log-matching invariant; understandability should be a first-class design criterion alongside fault-tolerance and performance.
Mechanism: Strong-leader replicated-state-machine architecture; randomised election timeouts; AppendEntries RPC carrying previous-entry index+term so followers can reject inconsistent appends; vote-restriction by log up-to-date-ness; joint-consensus membership changes; user-study evaluation with students at two universities.
Concepts introduced/used: Raft, Leader Election, Log Replication, Strong Leader, Log Matching Invariant, Joint Consensus, Term (as logical clock), Election Restriction.
Stance: systems-engineering paper with a methodological thesis (understandability as design criterion).
Relates to: Equivalent in capability to (multi-)Paxos (Lamport 1998 / 2001), explicitly and pointedly so — Raft is the re-presentation of Paxos’s solution space under a different organising principle. Subject to the same FLP impossibility result (Fischer, Lynch & Paterson 1985) and the same CAP Theorem trade-offs as Paxos: Raft chooses CP over AP in a network partition, sacrificing availability of the minority side. Crash-fault-tolerant only — Byzantine variants (PBFT, HotStuff) tolerate adversarial nodes but at the cost of a more expensive message protocol. Foundational for the modern CP-flavoured distributed-systems landscape: etcd / Kubernetes, CockroachDB, TiKV, Consul, MongoDB, and many others use Raft for cluster coordination; many “Paxos” implementations have been quietly rewritten as Raft for the same reasons Ongaro & Ousterhout argue. The paper’s methodological thesis — that designing for human comprehension is itself a research contribution — is influential beyond consensus and finds echoes in the design of Rust (over C++), TLA+ (over CSP-style notations), and the Pact-style choreographies that prefer DSL-shape over raw process-calculus terms.
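
The joint-consensus commit rule mentioned under Mechanism reduces to requiring two separate majorities. A minimal sketch, with hypothetical helper names and server ids represented as plain integers:

```python
from typing import Set


def is_majority(acks: Set[int], config: Set[int]) -> bool:
    """True if strictly more than half of `config` has acknowledged."""
    return len(acks & config) > len(config) // 2


def committed_under_joint(acks: Set[int], c_old: Set[int], c_new: Set[int]) -> bool:
    """While C_old,new is in effect, an entry commits only with a majority
    of the old configuration and a majority of the new one."""
    return is_majority(acks, c_old) and is_majority(acks, c_new)
```

Requiring both majorities means that neither configuration can make a decision on its own during the transition.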

Tags

#consensus #distributed-systems #raft #ongaro #replicated-state-machines #leader-election #foundations

The Part-Time Parliament

Reference: Leslie Lamport (1998). ACM Transactions on Computer Systems, 16(2):133-169 (minor corrections 2000). Source file: lamport-paxos.pdf. URL

Summary

The original exposition of the Paxos consensus algorithm, framed as an archaeological account of the part-time parliament of the (fictional) ancient Greek island of Paxos. Legislators drift in and out of the Chamber and messengers may lose, duplicate, or delay messages; nevertheless the parliament maintains a consistent ledger of decrees. Translated to distributed computing, this is the canonical solution to asynchronous fault-tolerant state-machine replication: a majority quorum protocol that guarantees safety (no two ledgers ever disagree on a decree) and makes progress whenever a majority of processes and their messages are eventually responsive.

The paper develops the single-decree Synod protocol first — choosing one value — using a ballot structure with three conditions (unique ballot numbers, pairwise-intersecting quorums, and a rule forcing a ballot’s decree to equal the latest decree of any earlier ballot whose quorum overlaps). It then generalises to multi-decree Parliament (multi-Paxos). The prose style (Greek-parable framing, archaeological footnotes) famously delayed the paper’s publication and shaped the oral tradition of the algorithm.

Key Ideas

Asynchronous consensus under crash failures and unreliable messaging via majority quorums
Two-phase protocol: prepare (collect promises / latest earlier votes) then accept (commit a value)
Ballot numbers totally ordered; any two quorums intersect; later ballots adopt any value an earlier ballot may already have chosen
Safety is unconditional; liveness depends on eventual synchrony plus a distinguished proposer
Multi-Paxos: amortise phase 1 across a sequence of decrees under a stable leader
State-machine replication as the canonical application
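
The two-phase structure in the list above can be sketched for a single decree. This uses the modern prepare/promise and accept vocabulary, not the paper's parliamentary terms, and it is an illustration under assumptions: the `Acceptor` class and `choose_value` helper are hypothetical names, ballots are assumed to be unique non-negative integers, and messaging, quorum counting, and durable storage are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Promise = Tuple[int, Optional[str]]  # (highest accepted ballot or -1, its value)


@dataclass
class Acceptor:
    promised: int = -1                    # highest ballot promised so far
    accepted_ballot: int = -1             # ballot of the last accepted value
    accepted_value: Optional[str] = None

    def on_prepare(self, ballot: int) -> Optional[Promise]:
        """Phase 1: promise to ignore lower ballots and report any earlier vote."""
        if ballot <= self.promised:
            return None
        self.promised = ballot
        return (self.accepted_ballot, self.accepted_value)

    def on_accept(self, ballot: int, value: str) -> bool:
        """Phase 2: accept unless a higher ballot has been promised."""
        if ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted_ballot = ballot
        self.accepted_value = value
        return True


def choose_value(promises: List[Promise], own_value: str) -> str:
    """Proposer rule: adopt the value from the highest-numbered earlier vote
    reported by the quorum, or use its own value if there is none."""
    ballot, value = max(promises, key=lambda p: p[0])
    return value if ballot >= 0 and value is not None else own_value
```

The proposer rule is what carries an already-chosen value forward: any quorum that answers phase 1 intersects the quorum that accepted the earlier value, so the proposer is bound to see it.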

Connections

Conceptual Contribution

Claim: Fault-tolerant consensus is achievable in an asynchronous message-passing system with crash failures as long as a majority of processes is eventually live and can communicate, by requiring every decision-making quorum to intersect every other.
Mechanism: Ballots with unique, totally ordered numbers; each ballot has a quorum; a ballot succeeds iff its quorum all vote. Condition B3 — a ballot’s decree must match the decree of the latest earlier ballot in which any quorum member voted — propagates any already-committed value forward, preserving safety across leader changes.
Concepts introduced/used: Paxos, Synod protocol, ballots, quorums, proposer/acceptor/learner roles (implicit), state-machine replication, safety vs. liveness, leader election, eventual synchrony.
Stance: formal / algorithmic — an algorithm with a correctness proof disguised as archaeology.
Relates to: Paxos sidesteps Impossibility of Distributed Consensus with One Faulty Process (FLP) by relaxing liveness under worst-case asynchrony. It operationalises the happens-before order of Time Clocks and the Ordering of Events in a Distributed System (another Lamport paper) at the agreement level. Sits firmly in the CP corner of the CAP Theorem, trading availability during partitions for consistency. Contrasts with CALM Theorem, which identifies classes of computations that need no coordination at all.

Tags

Byzantine Fault Tolerance

The property of a distributed protocol to reach correct consensus despite arbitrary, including malicious, failures of up to f of 3f+1 participants. BFT underlies replicated coordination kernels (e.g., DepSpace/EDS) and motivates constraints on server-side extensions to preserve determinism.

In this vault
