SDRS: Synthetic Designed Experiments for Diagnosing Vision Model Failures

Abstract

Current synthetic data pipelines for computer vision generate images without diagnosing what the downstream model actually needs. We propose Synthetic Designed Experiments for Representational Sufficiency (SDRS), a principled framework based on the statistical theory of Design of Experiments (DoE). SDRS treats the downstream model as a black-box system and the synthetic generator as an experimental apparatus. Using fractional factorial designs, SDRS efficiently audits a model's factor-sensitivity profile via ANOVA decomposition, identifying coverage failures (Type I gaps) and spurious dependencies (Type II gaps).

Theoretical Framework: ANOVA Decomposition

SDRS uses the Analysis of Variance (ANOVA) to decompose the model's response (e.g., loss or accuracy) into contributions from individual scene factors and their interactions. For a set of factors \(\{F_1, F_2, \dots, F_n\}\), the total sum of squares of the response is partitioned into main effects, two-factor interactions, and a residual term:

\[ SS_{\text{total}} = \sum_{i} SS_{F_i} + \sum_{i < j} SS_{F_i \times F_j} + SS_{\text{error}} \]

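The F-statistic for a factor is the ratio of that factor's mean square to the error mean square. A high value indicates that the model's response changes with that factor far more than residual variation would explain, revealing a potential representational gap or bias.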

Type I Gaps: Coverage failures where the model lacks representational sufficiency for certain factor levels.
Type II Gaps: Reliance on spurious nuisance dependencies (e.g., background shortcuts).
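
The audit described above can be illustrated with a minimal sketch. It assumes four two-level factors coded as -1/+1, a half-fraction design with the generator D = ABC, and a toy response function standing in for the real model. The factor names, effect sizes, noise level, and replicate count are all hypothetical, chosen only for illustration. For simplicity the sketch estimates main effects only and pools two-factor interactions into the error term, which is a coarser decomposition than the one in the equation above.

```python
import random
from itertools import product

# Hypothetical factor names, used only for illustration.
FACTORS = ("shape", "scale", "orientation", "position")

def half_fraction_design(replicates=2):
    """2^(4-1) design: full factorial in A, B, C with the generator D = A*B*C."""
    base = [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]
    return base * replicates

def toy_model_accuracy(run, rng):
    """Stand-in for a measured model metric; the effect sizes are made up."""
    shape, scale, orientation, position = run
    signal = 0.70 + 0.10 * shape + 0.04 * orientation + 0.01 * scale
    return signal + rng.gauss(0.0, 0.01)

def anova_main_effects(design, y):
    """Main-effect sums of squares and F-statistics for a balanced +/-1 design.

    The design columns are balanced and mutually orthogonal, so
    SS_F = (sum_i x_i * y_i)^2 / N with one degree of freedom per factor.
    """
    n = len(y)
    mean = sum(y) / n
    ss_total = sum((v - mean) ** 2 for v in y)
    ss = {}
    for j, name in enumerate(FACTORS):
        contrast = sum(run[j] * v for run, v in zip(design, y))
        ss[name] = contrast ** 2 / n
    # Simplification: everything not explained by main effects is treated as error.
    ss_error = ss_total - sum(ss.values())
    df_error = n - 1 - len(FACTORS)
    ms_error = ss_error / df_error
    f_stats = {name: value / ms_error for name, value in ss.items()}
    return ss, ss_error, f_stats

rng = random.Random(0)
design = half_fraction_design()
y = [toy_model_accuracy(run, rng) for run in design]
ss, ss_error, f_stats = anova_main_effects(design, y)
for name in sorted(f_stats, key=f_stats.get, reverse=True):
    print(f"{name:12s} SS={ss[name]:.4f}  F={f_stats[name]:.1f}")
```

In this toy setting the factors with the largest planted effects receive the largest F-statistics. With real data the response would come from evaluating the model on images rendered for each design row, and the half fraction would need 8 distinct scene configurations instead of the 16 of a full factorial.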

Experiment 1: Diagnostic on dSprites

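We planted specific biases in a dSprites-based dataset to test whether SDRS could detect them. The audit correctly identified both gap types, and targeted data raised accuracy from 47.4% with no synthetic data to 79.0%, compared with 53.8% for random synthetic data and 53.5% for domain randomization.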

Figure 1: ANOVA Audit on dSprites. The F-statistics reveal high sensitivity to shape and orientation before correction, which is mitigated after targeted synthetic data intervention.

Accuracy Comparison

Condition	Accuracy
No Synthetic Data (Baseline)	47.4%
Random Synthetic Data	53.8%
Domain Randomization	53.5%
SDRS (Targeted)	79.0%

Experiment 2: Dense Segmentation

In a procedural scene segmentation task, SDRS detected a background-complexity shortcut that limited model generalization. Targeted data raised mIoU from 0.332 to 0.998, compared with 0.976 for random sampling.

Figure 2: Segmentation Audit. The audit identifies background complexity as a major factor influencing model performance (Type II gap).

mIoU Performance

Method	mIoU
Baseline	0.332
Random Sampling	0.976
SDRS (Targeted)	0.998

Experiment 3: Entanglement Detection

SDRS can also be used to audit the generator itself, identifying cross-factor contamination in imperfect synthetic pipelines.

Figure 3: Entanglement Audit. The ANOVA decomposition identifies "leaked" factors where the generator fails to maintain independent control over scene parameters.
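
One way to picture such a generator audit is sketched below. It is an illustrative assumption and not the procedure used in the experiment. The sketch requests factor settings from a full two-level factorial, measures the attributes a hypothetical leaky generator actually produces, and reports every pair in which a requested factor explains a noticeable share of variance in a different measured attribute. The factor names, the leak strength, and the 0.05 threshold are all made up for illustration.

```python
import random
from itertools import product

# Hypothetical factor names, used only for illustration.
REQUESTED = ("object_scale", "lighting", "background")

def leaky_generator(run, rng):
    """Hypothetical imperfect generator returning measured attributes of one render.

    The requested lighting level also shifts the realized background attribute.
    """
    scale, lighting, background = run
    return {
        "object_scale": scale + rng.gauss(0.0, 0.05),
        "lighting": lighting + rng.gauss(0.0, 0.05),
        "background": background + 0.5 * lighting + rng.gauss(0.0, 0.05),
    }

def variance_share(design, values, column):
    """Eta squared: SS of one +/-1 coded factor divided by SS_total."""
    n = len(values)
    mean = sum(values) / n
    ss_total = sum((v - mean) ** 2 for v in values)
    contrast = sum(run[column] * v for run, v in zip(design, values))
    return (contrast ** 2 / n) / ss_total

rng = random.Random(0)
design = list(product((-1, 1), repeat=3)) * 4  # full 2^3 factorial, 4 replicates
renders = [leaky_generator(run, rng) for run in design]

LEAK_THRESHOLD = 0.05  # arbitrary cutoff chosen for this sketch
leaks = []
for j, requested in enumerate(REQUESTED):
    for measured in REQUESTED:
        share = variance_share(design, [r[measured] for r in renders], j)
        if requested != measured and share > LEAK_THRESHOLD:
            leaks.append((requested, measured, round(share, 2)))
print(leaks)
```

With an ideal generator every requested factor would explain variance only in its own measured attribute, so the list would be empty. Here the planted lighting-to-background leak is the only pair reported.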

Conclusion

SDRS turns synthetic data generation from a "hit-or-miss" random process into a principled diagnostic tool. Applying Design of Experiments to vision models lets us identify representational failures systematically and correct them with targeted data, as in the dSprites and segmentation experiments above.