Sampling Analysis

When a light client samples with replacement, whether chunk by chunk or column by column, several samples can land on the same column. The sampling probabilities and the detection analysis therefore need to account for this redundancy explicitly. The analysis below does so for both scenarios:

Key Setup

Data Dimensions:
- Original data: $k \times n$ matrix.
- After Reed-Solomon extension: $k \times 2n$ matrix.
- $2n$: Total number of columns (number of subnets).

Adversarial Assumptions:
- $m$: Number of unavailable columns (out of $2n$).
- Adversary can withhold entire columns, meaning all chunks in those columns are unavailable.

Sampling Scenarios:
- $s$: Total samples taken (either chunks or columns).
- Light client samples with replacement, introducing potential redundancy.

Probability of Redundancy

General Sampling Overlap

For both chunk and column sampling, overlap arises because of random sampling with replacement. The probability of sampling $r$ unique columns out of $s$ samples can be modeled using the following approach:

Expected Number of Unique Columns Sampled:
- Let $r$ denote the number of unique columns sampled after $s$ trials.
- Expected unique samples: $\mathbb{E}[r] = 2n \left( 1 - \left( 1 - \frac{1}{2n} \right)^s \right).$
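- This accounts for redundancy: $\left( 1 - \frac{1}{2n} \right)^s$ is the probability that a given column is never sampled in $s$ independent trials, so each column is hit at least once with probability $1 - \left( 1 - \frac{1}{2n} \right)^s$, and summing over all $2n$ columns gives $\mathbb{E}[r]$.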
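
As a quick sanity check on the expectation above, the following is a minimal Python sketch that evaluates the closed form and compares it with a Monte Carlo estimate. The function names and the example values ($2n = 128$ columns, $s = 16$ samples) are illustrative assumptions, not parameters taken from the analysis.

```python
import random


def expected_unique_columns(num_columns: int, samples: int) -> float:
    """Closed form E[r] = N * (1 - (1 - 1/N)^s), with N = 2n columns."""
    return num_columns * (1.0 - (1.0 - 1.0 / num_columns) ** samples)


def simulate_unique_columns(num_columns: int, samples: int,
                            trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean number of distinct columns hit."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Draw `samples` column indices uniformly with replacement
        # and count how many distinct columns were hit.
        total += len({rng.randrange(num_columns) for _ in range(samples)})
    return total / trials


if __name__ == "__main__":
    # Hypothetical parameters: 2n = 128 columns, s = 16 samples.
    print(expected_unique_columns(128, 16))
    print(simulate_unique_columns(128, 16, trials=20_000))
```

With these assumed values, 16 samples cover roughly 15.1 unique columns on average, so the redundancy is small while $s \ll 2n$ and grows as $s$ approaches $2n$.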
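
Distribution of Unique Samples: With $s$ draws with replacement from $2n$ columns, the number of unique columns $r$ follows the classical occupancy distribution:

$P(r \text{ unique columns}) = \binom{2n}{r} \cdot \frac{r! \, S(s, r)}{(2n)^s}, \quad 1 \le r \le \min(s, 2n),$

where $S(s, r)$ is the Stirling number of the second kind, i.e., the number of ways to partition the $s$ samples into $r$ non-empty groups. The factor $\binom{2n}{r}$ chooses which columns are hit, and $r! \, S(s, r)$ counts the ways the $s$ samples can cover all $r$ of them.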
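
The exact distribution of the number of unique columns is the classical occupancy distribution, which counts surjections with Stirling numbers of the second kind, $S(s, r)$. The sketch below computes it with exact rational arithmetic; the helper names are hypothetical, and the inclusion-exclusion formula for $S(s, r)$ is a standard identity rather than something stated in the analysis above.

```python
from fractions import Fraction
from math import comb, factorial


def stirling2(s: int, r: int) -> int:
    """Stirling number of the second kind S(s, r) via inclusion-exclusion."""
    total = sum((-1) ** j * comb(r, j) * (r - j) ** s for j in range(r + 1))
    return total // factorial(r)


def prob_unique(num_columns: int, samples: int, r: int) -> Fraction:
    """P(exactly r distinct columns) for `samples` uniform draws with replacement."""
    if r < 0 or r > min(num_columns, samples):
        return Fraction(0)
    # Choose which r columns are hit, then count the ways the samples
    # can cover all r of them (surjections: r! * S(samples, r)).
    ways = comb(num_columns, r) * factorial(r) * stirling2(samples, r)
    return Fraction(ways, num_columns ** samples)


if __name__ == "__main__":
    # Hypothetical small case: 2n = 8 columns, s = 5 samples.
    dist = {r: prob_unique(8, 5, r) for r in range(0, 6)}
    print(dist)
    print(sum(dist.values()))  # the probabilities sum to 1
```

Summing $r \cdot P(r)$ over this distribution reproduces the expected value $2n \left( 1 - \left( 1 - \frac{1}{2n} \right)^s \right)$, which is a useful consistency check.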

Detection Probability Analysis

Chunk-by-Chunk Sampling

The light client samples $s$ chunks. Each chunk belongs to one of the $2n$ columns.
With $m$ unavailable columns, the client detects unavailability as soon as at least one sampled chunk falls in an unavailable column.

Probability of Sampling an Available Chunk: The probability that a sampled chunk belongs to an available column is:

$P(\text{chunk available}) = \frac{2n - m}{2n}.$

Probability of Sampling $s$ Available Chunks: Because samples are drawn independently with replacement, the probability that all $s$ samples come from available columns is:

$P(\text{all chunks available}) = \left( \frac{2n - m}{2n} \right)^s.$

Probability of Detecting Unavailability: Detection requires at least one sampled chunk to come from an unavailable column:

$P_{\text{detect, chunk}} = 1 - P(\text{all chunks available}) = 1 - \left( \frac{2n - m}{2n} \right)^s.$
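
To make the detection formula concrete, here is a small Python sketch that evaluates $P_{\text{detect, chunk}}$ and finds the smallest $s$ that reaches a target confidence. The function names, the example of $2n = 128$ columns with $m = 64$ withheld, and the 99% target are all illustrative assumptions.

```python
def detection_probability(num_columns: int, withheld: int, samples: int) -> float:
    """P_detect = 1 - ((2n - m) / 2n)^s for sampling with replacement."""
    miss = (num_columns - withheld) / num_columns
    return 1.0 - miss ** samples


def samples_needed(num_columns: int, withheld: int, target: float) -> int:
    """Smallest s with detection probability >= target."""
    if not 0 < withheld <= num_columns:
        raise ValueError("withheld must be in 1..num_columns")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    miss = (num_columns - withheld) / num_columns
    samples, miss_all = 0, 1.0
    # Multiply in one more 'miss' per sample until the target is met.
    while 1.0 - miss_all < target:
        samples += 1
        miss_all *= miss
    return samples


if __name__ == "__main__":
    # Hypothetical parameters: 2n = 128 columns, m = 64 withheld.
    print(detection_probability(128, 64, 8))   # 1 - (1/2)^8 = 0.99609375
    print(samples_needed(128, 64, 0.99))       # 7
```

Because the miss probability decays geometrically in $s$, the required number of samples grows only logarithmically as the target confidence approaches 1.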

Column-by-Column Sampling

Light client samples $s$ columns directly. Each sample is drawn with replacement.
The analysis is analogous to chunk sampling, but now entire columns are the sampling unit.
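
Since the column case uses the same with-replacement model, a simulation can be checked against the closed form $1 - \left( \frac{2n - m}{2n} \right)^s$ with columns as the sampling unit. The sketch below is a minimal Monte Carlo check under that assumption; the function name, the seed, and the parameter values are hypothetical.

```python
import random


def simulate_column_detection(num_columns: int, withheld: int, samples: int,
                              trials: int, seed: int = 0) -> float:
    """Fraction of trials in which at least one sampled column is withheld."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        # Sampling is uniform, so which columns are withheld does not matter;
        # treat indices 0..withheld-1 as the withheld columns.
        if any(rng.randrange(num_columns) < withheld for _ in range(samples)):
            detected += 1
    return detected / trials


if __name__ == "__main__":
    # Hypothetical parameters: 2n = 128 columns, m = 64 withheld, s = 4 samples.
    estimate = simulate_column_detection(128, 64, 4, trials=20_000)
    closed_form = 1.0 - ((128 - 64) / 128) ** 4
    print(estimate, closed_form)
```

The estimate should agree with the closed form up to sampling noise, which shrinks as the number of trials grows.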