As the project grows in complexity, and especially once it is operating, it will become harder and harder to understand what is going on, to know what should be happening, and to observe bugs in the system.
The best way to address this is to start building tools that help us analyze and understand the distributed system under both normal and faulty operating conditions.
There is an obvious challenge: we cannot (and do not want to) monitor production nodes set up by users. Doing so goes against our privacy principles, and users can opt out anyway.
There are multiple options: some of the tools would be meant for users, while others would be internal to the Testnet and perhaps opt-in for production.
This is an initial list that we should start elaborating in more detail:
Visualization tools
- Block explorer (public)
- Data blobs visualization in real time
- Transaction visualization in real time
- Find transactions by id
- Find blocks by id
- Find blobs by id
- Delegations (since stake is private, this would be self-reported and available only on testnet, or optionally reported elsewhere if we decide to support that)
- Fork visualization
- Node reporting of forks
- Aggregation of forks (a visualization that shows how many nodes consider each fork to be the tip at any given time).
- Fork reporting is fully opt-in, or even enabled only on the testnet.
- The purpose of fork visualization is to see and better understand how the system behaves in real life.
- Mixnet visualization
- Real-time statistics of the mixnet: usage, paths, and other data that we might find useful to fine-tune the system.
- Mixnet visualization is testnet only and meant for understanding the system, as it would otherwise leak data.
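The fork aggregation idea above can be sketched as follows. This is a minimal illustration, not a design: the `TipReport` format, its field names, and the `aggregate_tips` helper are all assumptions made for the example. It only shows how opt-in, self-reported tips could be reduced to "how many nodes are on which tip" by keeping the latest report per node.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TipReport:
    """One opt-in report from a node (hypothetical format)."""
    node_id: str
    tip: str          # id of the block the node currently considers the tip
    timestamp: float  # seconds, according to the reporting node


def aggregate_tips(reports):
    """Count how many nodes consider each block to be the tip.

    Only the most recent report of each node is counted.
    """
    latest = {}
    for report in reports:
        seen = latest.get(report.node_id)
        if seen is None or report.timestamp >= seen.timestamp:
            latest[report.node_id] = report
    return Counter(r.tip for r in latest.values())


reports = [
    TipReport("node-a", "block-17", 10.0),
    TipReport("node-b", "block-17", 10.5),
    TipReport("node-c", "block-17b", 10.7),  # node-c follows a competing fork
    TipReport("node-a", "block-18", 12.0),   # node-a and node-b extend block-17
    TipReport("node-b", "block-18", 12.1),
]
tip_counts = aggregate_tips(reports)
```

A visualization would plot `tip_counts` over time. Note that the timestamps come from the nodes themselves, so a real collector would probably need to account for clock skew.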
Instrumentation
- Distributed Logging / Tracing
- There are multiple options for this. A simple and effective one could be plain distributed logging, where every node delivers its logs to a centralized logging service so that the logs of all nodes can be analyzed in aggregate. The main disadvantage of this approach is that it loses causality.
- Distributed tracing makes it easier to observe causality (i.e. the causal ordering of events across different nodes). It is harder to implement, but might be useful for some protocols. In my opinion, it would be more useful for chattier protocols (e.g. pBFT), whereas in the case of Cryptarchia the fork visualization tool might be better. For the mixnet, though, it might actually be useful.
- Consider the most suitable tool for each protocol (consensus, data availability, mixnet, p2p network).
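To make the causality point concrete, here is a minimal sketch of one well-known way to keep some causal information in plain aggregated logs: stamping every log line with a Lamport logical clock. This is a swapped-in textbook technique used for illustration, not something the project has chosen. The `LamportClock`, `log_line` and `causal_order` names and the JSON log format are assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock: if event x happened before event y, then time(x) < time(y)."""
    time: int = 0

    def tick(self):
        """Advance the clock for a local event or a message send."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """Merge the timestamp carried by an incoming message."""
        self.time = max(self.time, remote_time) + 1
        return self.time


def log_line(node_id, lamport, message):
    """Serialize one log record (hypothetical format) for the central service."""
    return json.dumps({"node": node_id, "lamport": lamport, "msg": message})


def causal_order(lines):
    """Sort aggregated log lines so that causes always appear before effects."""
    def key(line):
        record = json.loads(line)
        return (record["lamport"], record["node"])
    return sorted(lines, key=key)


a, b = LamportClock(), LamportClock()
b.tick()
b.tick()                     # node-b had two unrelated local events
sent = a.tick()              # node-a sends a message stamped with its clock
line_send = log_line("node-a", sent, "sent block proposal")
received = b.receive(sent)   # node-b merges the stamp on receipt
line_recv = log_line("node-b", received, "received block proposal")

ordered = causal_order([line_recv, line_send])
```

The ordering is consistent with causality but does not prove it: two lines with increasing stamps may still be unrelated. Full distributed tracing, which propagates trace and span identifiers with each message, recovers the actual causal links at the cost of more instrumentation.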
Monitoring
- Health monitoring and reporting
- Report CPU usage, memory usage and overall system health
- Make it visual and simple to understand
- Opt-in or Testnet only
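A health report along these lines could look like the sketch below. Everything in it is an assumption made for illustration: the `HealthSample` fields, the 75% and 90% thresholds, and the three-level status. How the CPU and memory numbers are collected on the node is deliberately left out. The sketch only shows the two properties listed above: the output reduces to a simple status, and nothing is produced unless the operator has opted in.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HealthSample:
    """Resource usage of one node over a sampling window (hypothetical format)."""
    node_id: str
    cpu_percent: float    # 0 to 100, averaged over the window
    mem_used_bytes: int
    mem_total_bytes: int


def status(sample, warn=75.0, critical=90.0):
    """Reduce a sample to a traffic-light status using illustrative thresholds."""
    mem_percent = 100.0 * sample.mem_used_bytes / sample.mem_total_bytes
    worst = max(sample.cpu_percent, mem_percent)
    if worst >= critical:
        return "critical"
    if worst >= warn:
        return "degraded"
    return "ok"


def build_report(sample, opted_in):
    """Return the JSON report to send, or None when the node has not opted in."""
    if not opted_in:
        return None
    payload = asdict(sample)
    payload["status"] = status(sample)
    return json.dumps(payload, sort_keys=True)


sample = HealthSample("node-a", cpu_percent=42.0,
                      mem_used_bytes=6 * 2**30, mem_total_bytes=8 * 2**30)
report = build_report(sample, opted_in=True)
```

A dashboard could then color each node by its `status` field and show the raw numbers on demand.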