Related spec: Overview
Author: @Gusto Bacvinka
Reviewers: 🔺@Marcin Pawlowski 🔺@Daniel Sanchez Quiros 🔺@Álvaro Castro-Castilla
Introduction
The NomosDA dispersal protocol is used by encoders to publish data across the network's subnetworks. A critical design question arises when an encoder cannot establish a connection with one or more of these subnetworks.
This connectivity failure can occur in several scenarios:
- Node Initialization: The node has recently started and is still in the process of establishing its initial connections.
- Transient Network Issues: The node has temporarily lost connections and is attempting to redial the same nodes or find new peers within the unreachable subnetwork.
- Persistent Network Partition: The node is unable to connect to any peer within a specific subnetwork but maintains healthy connections to all others.
This situation presents a fundamental choice: should the protocol enforce a strict connectivity requirement, or should it allow for an "optimistic" dispersal to the reachable subnetworks?
Proposed Solutions
Below are two primary approaches to address this problem, along with an analysis of their respective benefits and drawbacks.
Solution 1: Optimistic Dispersal
In this model, the encoder sends data immediately to all currently reachable subnetworks, even if one or more are unavailable. The protocol would then have to rely on other mechanisms to eventually propagate the missing pieces.
- Advantages:
  - Maximizes Liveness: Data dispersal is not blocked by temporary network glitches or slow peer discovery in a single subnetwork. This leads to higher uptime and throughput.
  - Fault Tolerance: The system can continue to function in a degraded state, tolerating partial network partitions without halting completely.
- Drawbacks:
  - Violates Availability Guarantees: This is the most significant drawback. If the missing data shares are never successfully transmitted to the unreachable subnetwork, the full data block can still be reconstructed from the remaining shares, but sampling requests that target that subnetwork may fail.
  - Increased Protocol Complexity: The network needs a separate, reliable mechanism to identify and repair these incomplete dispersals. This adds complexity to the validator and node logic, requiring state management for pending or failed shares.
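The optimistic model can be illustrated with a minimal sketch. It is not taken from the NomosDA implementation: `SubnetworkId`, `DispersalReport`, `disperse_optimistic`, and the `send` callback are hypothetical names chosen for illustration, and the real network call is replaced by a closure. The point is only to show where the extra state (the `pending` set) comes from.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical subnetwork identifier (an assumption, not from the spec).
type SubnetworkId = u16;

/// Outcome of one optimistic dispersal attempt.
#[derive(Debug, PartialEq)]
struct DispersalReport {
    /// Subnetworks whose share was handed to the network layer.
    delivered: BTreeSet<SubnetworkId>,
    /// Unreachable subnetworks; a separate repair mechanism must retry these.
    pending: BTreeSet<SubnetworkId>,
}

/// Sends each share to its subnetwork if that subnetwork is currently
/// reachable, and records it as pending otherwise.
/// `send` stands in for the real network call (assumption).
fn disperse_optimistic<F>(
    shares: &BTreeMap<SubnetworkId, Vec<u8>>,
    reachable: &BTreeSet<SubnetworkId>,
    mut send: F,
) -> DispersalReport
where
    F: FnMut(SubnetworkId, &[u8]),
{
    let mut report = DispersalReport {
        delivered: BTreeSet::new(),
        pending: BTreeSet::new(),
    };
    for (&subnet, share) in shares {
        if reachable.contains(&subnet) {
            send(subnet, share.as_slice());
            report.delivered.insert(subnet);
        } else {
            // Dispersal is not blocked; the gap is only recorded.
            report.pending.insert(subnet);
        }
    }
    report
}
```

In this sketch the dispersal never fails, but any non-empty `pending` set has to be persisted and retried somewhere, which is the state management cost described in the drawbacks above.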
Solution 2: Strict Dispersal
In this model, an encoder must have an active connection to at least one peer in every required subnetwork before attempting to disperse data. If this condition is not met, the dispersal operation fails immediately or is queued.
- Advantages:
  - Guarantees Data Availability: If a dispersal operation succeeds from the encoder's perspective, it provides a strong guarantee that all subnetworks have received their respective shares. This simplifies the data integrity model.
  - Simpler Protocol Design: The logic is straightforward: either all connections are available and you send, or they are not and you wait/fail. There is no need for a complex state-tracking and repair mechanism for partial dispersals.
- Drawbacks:
  - Reduced Liveness: The entire dispersal process can be halted by a single point of failure (e.g., a temporary partition from one subnetwork). This can lead to significant delays and reduced system throughput.
  - Poor User Experience: For an end-user or application, operations could frequently fail or hang due to transient network conditions that are beyond their control. A buffered or queued approach can mitigate this but adds complexity.
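The strict model can be sketched in the same style. Again, this is an illustration under assumptions rather than the actual NomosDA code: `disperse_strict`, `DispersalError`, the `connected_peers` map (subnetwork to number of currently connected peers), and the `send` callback are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical subnetwork identifier (an assumption, not from the spec).
type SubnetworkId = u16;

#[derive(Debug, PartialEq)]
enum DispersalError {
    /// The listed subnetworks have no connected peer, so nothing was sent.
    MissingSubnetworks(Vec<SubnetworkId>),
}

/// Sends all shares only if every required subnetwork has at least one
/// connected peer at the time of the call; otherwise sends nothing.
/// `send` stands in for the real network call (assumption).
fn disperse_strict<F>(
    shares: &BTreeMap<SubnetworkId, Vec<u8>>,
    connected_peers: &BTreeMap<SubnetworkId, usize>,
    mut send: F,
) -> Result<(), DispersalError>
where
    F: FnMut(SubnetworkId, &[u8]),
{
    // Collect every required subnetwork with zero connected peers.
    let missing: Vec<SubnetworkId> = shares
        .keys()
        .copied()
        .filter(|id| connected_peers.get(id).copied().unwrap_or(0) == 0)
        .collect();
    if !missing.is_empty() {
        // All-or-nothing: fail before any share leaves the encoder.
        return Err(DispersalError::MissingSubnetworks(missing));
    }
    for (&subnet, share) in shares {
        send(subnet, share.as_slice());
    }
    Ok(())
}
```

A caller that prefers queuing over immediate failure could keep the blob in a local buffer on `Err` and call `disperse_strict` again once the listed subnetworks become reachable. That buffer is the added complexity mentioned under Poor User Experience.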