Validation Set Construction

Spatial transcriptomics datasets commonly lack replicates, so the entire spot-level dataset is often used for training with no hold-out set. To address this, we adopt a count splitting approach. Given a spot-level count matrix $\mathbf{Y}=[Y_{s,g}]\in\mathbb{N}_0^{S\times G}$ with $S$ spots and $G$ genes, we split each entry as $Y_{s,g}^{(\mathrm{train})}\sim\mathrm{Binomial}\big(2Y_{s,g}, \tfrac{1}{2}\big)$, $Y_{s,g}^{(\mathrm{val})}=2Y_{s,g}-Y_{s,g}^{(\mathrm{train})}$.

Under a Poisson model assumption, count splitting ensures that $Y_{s,g}^{(\mathrm{train})}$ and $Y_{s,g}^{(\mathrm{val})}$ have the same expected value as $Y_{s,g}$ and are conditionally independent given their mean value.
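
As a quick check of the first two conditional moments, using only the Binomial construction stated above (no further modelling assumption is needed for this step):

$$
\mathbb{E}\big[Y_{s,g}^{(\mathrm{train})}\mid Y_{s,g}\big]=2Y_{s,g}\cdot\tfrac{1}{2}=Y_{s,g},\qquad
\mathbb{E}\big[Y_{s,g}^{(\mathrm{val})}\mid Y_{s,g}\big]=2Y_{s,g}-Y_{s,g}=Y_{s,g},
$$

$$
\mathrm{Var}\big(Y_{s,g}^{(\mathrm{train})}\mid Y_{s,g}\big)=\mathrm{Var}\big(Y_{s,g}^{(\mathrm{val})}\mid Y_{s,g}\big)=2Y_{s,g}\cdot\tfrac{1}{2}\cdot\tfrac{1}{2}=\tfrac{1}{2}Y_{s,g}.
$$

Both parts are therefore on the same scale as the original counts, so a model fitted to the training part can be evaluated against the validation part without rescaling.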

This construction therefore yields statistically independent training and validation sets suitable for unbiased assessment of predictive performance.
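
The split can be sketched in a few lines of Python. This is only an illustration of the formula above on a toy matrix, not the contents of generate_count_split.R; the function name `count_split` and the toy data are assumptions made for the example, and the Binomial draw is simulated with fair coin flips so that only the standard library is needed.

```python
import random


def count_split(counts, seed=1):
    """Split a spots-by-genes count matrix into training and validation parts.

    Hypothetical helper mirroring the formula in the text:
    train ~ Binomial(2*y, 1/2) and val = 2*y - train for every entry y.
    """
    rng = random.Random(seed)
    train, val = [], []
    for row in counts:
        train_row, val_row = [], []
        for y in row:
            # Binomial(2*y, 1/2): number of heads in 2*y fair coin flips.
            t = sum(rng.random() < 0.5 for _ in range(2 * y))
            train_row.append(t)
            val_row.append(2 * y - t)
        train.append(train_row)
        val.append(val_row)
    return train, val


# Toy matrix (assumed data): 3 spots (rows) x 4 genes (columns).
counts = [
    [0, 3, 12, 1],
    [5, 0, 7, 2],
    [1, 1, 0, 30],
]
cnts_train, cnts_val = count_split(counts, seed=1)
```

By construction, every pair of entries in `cnts_train` and `cnts_val` sums to twice the original count, and rerunning with the same seed reproduces the same split.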

Rscript generate_count_split.R ${prefix} $cnts_train_name $cnts_val_name $seed

Input: cnts.csv, a file containing the count matrix of your ST data, with each row representing a spot and each column representing a gene.
Parameters:
- ${prefix}: path to the folder containing cnts.csv, e.g. data/.
- $cnts_train_name: name of the training set. Default name: 'cnts_train'.
- $cnts_val_name: name of the validation set. Default name: 'cnts_val'.
- $seed: random seed to use. Default: 1.
Output: cnts_train_seed_1.csv and cnts_val_seed_1.csv (with the default names and seed), files containing the training and validation sets, respectively.