Owner: @Fabio Barone
Support: @Gusto Bacvinka @Daniel Sanchez Quiros
During the Nomos Offsite in June, a first draft of a potential Data-Availability (DA) architecture was devised.
At the offsite, the decision was taken to run a Proof-of-Concept (PoC) to strengthen the assumptions made in that document and to validate its design.
The DA architecture designed at the offsite introduces an executor connecting to x peers, each of which has further peers in sub-networks. The executor therefore has to maintain x connections and can assume that all its content will then be redundantly distributed through the DA network. In our case, x matches the number of columns specified in the Data Availability Network Specification, i.e. 4096.
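The "one connection per subnet" setup can be sketched as follows. This is a minimal illustration, not PoC code: the membership map, node names, and three-members-per-subnet layout are hypothetical, assuming the executor already holds the full subnet membership list.

```python
import random

NUM_SUBNETS = 4096  # x: one subnet per column, per the DA spec

def pick_peers(subnet_members: dict[int, list[str]]) -> dict[int, str]:
    """Pick one peer per subnet at random; the executor keeps one
    connection per subnet, i.e. 4096 in total."""
    return {subnet: random.choice(members)
            for subnet, members in subnet_members.items() if members}

# Toy membership map: 200 nodes spread over 4096 subnets, three per subnet.
nodes = [f"node-{i}" for i in range(200)]
members = {s: [nodes[(s + k) % len(nodes)] for k in range(3)]
           for s in range(NUM_SUBNETS)}

connections = pick_peers(members)
assert len(connections) == NUM_SUBNETS
```

Failing over simply means calling `random.choice` again on the remaining members of the affected subnet; the map of known members does not change.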
| Ref | Question | Answer |
|---|---|---|
| Q1 | Is requiring 4096 connections from an executor feasible? This refers to establishing physical connections, not to the cryptographic protocol. | The NomosDA spec analyzed the cryptographic aspects and concluded that a column size of 4096 is optimal. The corollary is that work is pushed away from nodes towards the executor; this is by design, and executors will therefore be required to run on higher-spec hardware. Given that, establishing 4096 connections is not a technical challenge: the feasibility test proved that this is possible even from a private ISP connection. The engineering challenges lie in maintaining the connections and responding to failures; attack scenarios also need to be considered. |
| Q2 | What failover strategies exist if the upload from an executor fails due to unavailable or byzantine peers? | Following this design, each DA node inside a subnet connects to every other node in the same subnet. From this design, and after implementing the PoC, a couple of failover strategies emerged. It is not the role of this PoC to define the final failover strategy. |
| Q3 | Is burdening the executor with failover acceptable? | There is in fact no alternative if 4096 (or even 2048) physical connections are required. The only alternative would be using gateways as the main design choice. |
| Q4 | What are the bandwidth requirements for the executor? | As far as DA is concerned, this can be answered by multiplying 4096 connections by the column size, plus sampling calls and retries. Note that these figures come on top of the executor's requirements from other components (Cryptarchia, mixnet, etc.). |
| Q5 | What are the resource requirements for the executor? | From the DA perspective: stable networking and adequate connection speed. |
| Q6 | How would an executor find the addresses of the peers it needs to connect to? | The executor requires the complete list of DA node participants and their positions in subnets. It can then pick one node per subnet at random (as done in this PoC) and fail over to others if there are issues. It is also free to analyze the selected peer's performance and choose another peer in a subnet based on availability, speed, latency, etc. |
| Q7 | Is 4096 subnets the right design? This refers to the design of the physical/logical subnets, not to the cryptographic protocol. | |
| Q8 | What is the algorithm used to assign nodes to subnets? | POSTPONED, not part of this PoC. |
| Q9 | What is the protocol used for DA nodes to establish connections with each other, considering subnets? | The observations in this PoC suggest that the best approach is direct connections between nodes in a subnet, ideally without any technology such as pubsub on top (this hinges on nodes being able to determine deterministically where they belong in the subnet structure). The PoC made evident that each subnet will consist of only a limited number of nodes. The implementation will have to address nodes joining/leaving, byzantine behavior, and other engineering challenges. |
| Q10 | Can we use libp2p and its implementation of gossipsub or not? | While the libp2p library for Python used in this PoC is very limited and inadequate for production, for the Rust implementation we should be able to rely on it fully. libp2p should be used as the base layer for every network connection; this will allow us to adapt easily to changing requirements or findings. Once libp2p is in place, it is straightforward to change a transport, switch the peer-discovery mechanism, or adopt a pubsub protocol (or some new upcoming protocol). |
| Q11 | What are the bandwidth requirements for a DA node? | Already measured in the DA experiments; can probably be skipped. |
| Q12 | What are the resource requirements for a DA node? | Network speed is suggested, but mostly fast storage, with sufficient capacity for the expected retrievability of chunks. |
| Q13 | Are byzantine scenarios addressed in case of misbehaving DA nodes? → Enumerate the byzantine scenarios which need to be considered. | The PoC analyzed some very simple byzantine scenarios. Thanks to the replication factor, the chosen design shows surprisingly stable behavior when nodes are unavailable. For example, in a very probable launch scenario of a few hundred nodes, 4096 subnets (a factor of 1:40 to 1:10) mean that nodes are placed in multiple subnets and therefore store chunks for many columns, not just one, while reusing the same connections; trying another node in the same subnet will thus succeed. Care should be devoted to byzantine scenarios where nodes are not merely unavailable but appear to behave well while acting maliciously (covert attacks). These are out of scope for this PoC. Scenarios of complete subnet takeover in particular should be analyzed. |
| Q14 | Do the algorithms and protocols scale to networks of different node counts? | The simple structure of the PoC showed that scaling works very well, both in the number of nodes and in the number of subnets. |
| Q15 | Can the algorithms and protocols handle node churn? DA nodes are supposed to be stable and long-running; however, there might be situations where this cannot be taken for granted (e.g. unstable code updates). | 1. Churn of single DA nodes joining and leaving: the failover described above can handle this, provided a replication factor is used. 2. Node churn should not affect the subnet structure (at least not during a given epoch). Special attention should be devoted in the implementation to subnets becoming completely unavailable; some compensating fallback should be devised. |
| Q16 | Can the algorithms and protocols handle unstable network conditions? | Due to the small packet sizes this is a negligible concern. Trivial mechanisms such as retries on failure and failovers should address most situations. |
| Q17 | Can the same network structure be used for DA upload and sampling? | Yes. |
| Q18 | What is the suggested sampling protocol? Which subnets, which nodes, etc.? | Random node selection. |
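The Q4 estimate (connections × column size) can be made concrete with a back-of-envelope calculation. The column size and retry-overhead values below are illustrative placeholders, not figures from the specification:

```python
def upload_bytes_per_blob(column_size_bytes: int,
                          num_columns: int = 4096,
                          retry_overhead: float = 0.1) -> int:
    """Bytes the executor uploads per dispersal: one column per
    connection, plus a margin for retries and sampling traffic."""
    return int(num_columns * column_size_bytes * (1 + retry_overhead))

# Hypothetical 2 KiB columns: 8 MiB per dispersal before retries.
base = upload_bytes_per_blob(2048, retry_overhead=0.0)
assert base == 4096 * 2048  # 8 MiB
```

The linear shape of the formula is the point: executor bandwidth grows with the column size, while the 4096 factor is fixed by the spec.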
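The launch-scenario arithmetic in Q13 (a few hundred nodes over 4096 subnets) can be sketched as follows; the replication factor of 4 is an illustrative assumption, not a spec value:

```python
def avg_subnets_per_node(num_nodes: int, num_subnets: int = 4096,
                         replication: int = 4) -> float:
    """Average number of subnets each node serves when every subnet
    needs `replication` members (replication factor is assumed)."""
    return num_subnets * replication / num_nodes

# With 400 nodes, each node serves ~41 subnets on average, so each
# column is held by several nodes and in-subnet failover is likely
# to find a live replica.
coverage = avg_subnets_per_node(400)
```

This is why the design stays stable under simple unavailability: losing one node removes only one of several replicas per affected column.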
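The failover behavior discussed in Q2, Q15, and Q16 amounts to retrying the upload against other members of the same subnet. A minimal sketch, assuming a hypothetical `send_fn` transport callback and toy peer names:

```python
import random

def send_with_failover(column: bytes, subnet_peers: list[str],
                       send_fn, max_tries: int = 3) -> str:
    """Try up to max_tries randomly chosen peers of one subnet.
    Because every subnet member stores the same column (replication),
    any live peer can accept the upload."""
    for peer in random.sample(subnet_peers, min(max_tries, len(subnet_peers))):
        try:
            send_fn(peer, column)
            return peer  # success: remember the responsive peer
        except ConnectionError:
            continue  # unavailable peer: try the next one
    raise RuntimeError("all tried peers in the subnet failed")

# Usage: one of three peers is down; the upload still lands.
def flaky_send(peer: str, column: bytes) -> None:
    if peer == "node-down":
        raise ConnectionError(peer)

ok = send_with_failover(b"col", ["node-a", "node-down", "node-b"], flaky_send)
assert ok in ("node-a", "node-b")
```

A production version would also need to handle the Q15 edge case of an entire subnet being unreachable, which this sketch only surfaces as an exception.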
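The "random node selection" answer to Q18 can be illustrated in a few lines; the sample count of 20 and the membership map are hypothetical choices for the sketch, not spec parameters:

```python
import random

def sample_targets(subnet_members: dict[int, list[str]],
                   num_samples: int = 20) -> dict[int, str]:
    """Random node selection: pick num_samples subnets at random,
    then one member of each to query for its column."""
    chosen = random.sample(sorted(subnet_members), num_samples)
    return {s: random.choice(subnet_members[s]) for s in chosen}

# Toy membership map: 50 nodes spread over 4096 subnets.
members = {s: [f"node-{(s + k) % 50}" for k in range(3)] for s in range(4096)}
targets = sample_targets(members)
assert len(targets) == 20
```

Per Q17, these queries reuse the same subnet structure as upload, so no separate sampling topology is needed.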
To answer these questions, the PoC will consist of two main parts: