GenMed: A Pairwise Generative Reformulation
of Medical Diagnostic Tasks

One diffusion model of the joint distribution P(X, Y), steered at test time — turning rigid input→output prediction into flexible, training-free output optimization.

Equal contribution  ·  Corresponding author  ·  Fellow, IEEE

¹CVLab, EPFL  ·  ²Fudan University  ·  ³Beihang University  ·  ⁴ELLIS Institute Finland & Aalto University  ·  ⁵The Hong Kong Polytechnic University

A single generative model for diverse medical tasks. Rather than learning a separate discriminative map P(Y|X) per task, GenMed models the joint distribution P(X, Y) once and enforces input consistency through constraint-guided sampling at inference — covering standard, zero-shot cross-modality, few-shot and degraded-input segmentation, as well as shape completion from single-, multi-, tri-plane, partial and noisy cues.

Summary

A joint generator for flexible medical inference

GenMed replaces task-specific medical predictors with a generative model of paired variables P(X, Y). At inference time, the observed input acts as a constraint that steers sampling toward a consistent pair, so the same model can handle standard segmentation, cross-modality transfer, degraded inputs, few-shot settings, and 3D shape completion.

85.4  ·  Avg. Dice on standard 3D segmentation
+24.6  ·  Dice gain on zero-shot CT→MRI
77.7  ·  Dice with only 2 CT training samples
68.7k  ·  Text–shape samples across 139 categories

Formal abstract

Data-driven medical AI is traditionally framed as a discriminative mapping from an input X to an output Y via a learned function f — a formulation that generalizes poorly across the heterogeneous data and modalities of real clinical settings. We propose a fundamentally different, generative paradigm: we model the joint distribution P(X, Y) with a diffusion model and reframe inference as a test-time output optimization problem.

By guiding the generative process to match observed inputs, GenMed enables flexible, gradient-based conditioning at inference time — without architectural changes or retraining — and naturally supports arbitrary, previously unseen combinations of observations. Extensive experiments demonstrate strong performance on standard and cross-modality segmentation, few-shot segmentation with only 2 or 4 training samples, degraded-input segmentation, shape completion from sparse and partial observations, and zero-shot transfer to a new domain.

To support these evaluations, we curated and released a large-scale text–shape dataset derived from MedShapeNet. Our results highlight the versatility of generative joint modeling as a foundation for reusable, task-agnostic medical AI systems.

Method

Pairwise diffusion, guided at test time

GenMed trains an unconditional diffusion model over data pairs (X, Y). At inference, a known observation X guides sampling toward a consistent pair (X′, Y′) with X′ ≈ X — so the model is trained once and conditioned on any signal afterwards.
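
The idea can be illustrated on a toy problem. The sketch below is not GenMed's sampler: it assumes a 2D Gaussian joint prior over a scalar pair (x, y), whose noised score is available in closed form, and uses annealed Langevin dynamics with the gradient of a soft consistency loss added to the prior score. The function name `guided_joint_sampling`, the weight `lam`, the correlation `rho`, and the noise schedule are all hypothetical choices made for illustration.

```python
import numpy as np

def guided_joint_sampling(x_obs, rho=0.9, lam=100.0, n_particles=2000,
                          steps_per_level=200, step=0.005, seed=0):
    """Sample pairs (x', y') from a toy joint prior, steered so that x' ~= x_obs."""
    rng = np.random.default_rng(seed)
    sigmas = np.geomspace(1.0, 0.05, 10)        # assumed noise schedule, coarse to fine
    cov = np.array([[1.0, rho], [rho, 1.0]])    # toy joint prior over (x, y)
    e0 = np.array([1.0, 0.0])                   # selects the observed variable x
    z = rng.normal(scale=np.sqrt(1.0 + sigmas[0] ** 2), size=(n_particles, 2))
    for sigma in sigmas:
        # Precision of the noised joint N(0, cov + sigma^2 I); its score is -prec @ z.
        prec = np.linalg.inv(cov + sigma ** 2 * np.eye(2))
        # Tweedie estimate of the clean pair: z0_hat = (I - sigma^2 prec) @ z.
        denoise = np.eye(2) - sigma ** 2 * prec
        m = denoise.T @ e0                      # gradient of x0_hat with respect to z
        for _ in range(steps_per_level):
            score = -z @ prec                   # prior score for every particle
            x0_hat = z @ m                      # denoised estimate of x
            # Gradient of the soft loss 0.5 * lam * (x0_hat - x_obs)^2.
            grad_loss = lam * (x0_hat - x_obs)[:, None] * m[None, :]
            noise = rng.normal(size=z.shape)
            z = z + 0.5 * step * (score - grad_loss) + np.sqrt(step) * noise
    return z

samples = guided_joint_sampling(x_obs=1.5)
print("mean x':", samples[:, 0].mean(), "mean y':", samples[:, 1].mean())
```

Because the loss acts on the denoised estimate of x while the prior score couples x and y, the unobserved y is pulled toward values consistent with the observation; in this toy setting the mean of y′ should land near rho · x_obs.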

GenMed architecture: a diffusion model over the joint latent space, with inference-time guidance from the observed part.
GenMed architecture. Training learns to denoise samples from P(X, Y) in either explicit or latent space. At inference, the known condition X explicitly or implicitly guides the sampling trajectory so the generated output satisfies the constraint X₀ ≈ X.
P(X, Y)

Joint, not conditional

Modeling the joint distribution removes the need for a dedicated input-conditioning encoder, so a single model serves many tasks and any combination of observations.

∇ ℒ

Guidance, not retraining

A soft loss steers both paired variables together (GenMed-Full), avoiding the semantic gap of naive guidance and adapting to new inputs purely at test time.

Explicit & latent space

Works directly on voxels for segmentation, and on encoded latents for 3D shapes — deferring decoding to the end of the denoising trajectory for stable shape completion.
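
The "joint, not conditional" and "guidance, not retraining" points above can be written out in two lines. This is a sketch under assumptions: the loss \mathcal{L} and the weight \lambda are illustrative symbols, not notation taken from the paper.

```latex
\begin{align*}
% The marginal p(x) does not depend on y, so the joint score already
% contains the conditional score with respect to y.
\nabla_{y} \log p(y \mid x)
  &= \nabla_{y} \log p(x, y) - \nabla_{y} \log p(x)
   = \nabla_{y} \log p(x, y) \\[4pt]
% Soft guidance: reweight the joint prior by a consistency loss on x'.
\tilde{p}(x', y') &\propto p(x', y')\,
   \exp\!\bigl(-\lambda\, \mathcal{L}(x', x)\bigr) \\
\nabla_{(x', y')} \log \tilde{p}(x', y')
  &= \nabla_{(x', y')} \log p(x', y')
   - \lambda\, \nabla_{(x', y')} \mathcal{L}(x', x)
\end{align*}
```

With \mathcal{L}(x', x) = \tfrac{1}{2}\lVert x' - x \rVert^2, increasing \lambda concentrates x' on the observation x, and the distribution of y' approaches p(y \mid x). No conditioning encoder is involved; only the loss changes from task to task.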

Segmentation and completion can be chained

The same pairwise paradigm supports an end-to-end diagnostic pipeline: segment the raw scan, then complete the shape with the pretrained prior to recover geometrically plausible anatomy under corrupted imaging.

Two-stage diagnostic pipeline under different degradations: segmentation, completion, and ground-truth target.
Two stages under different degradations. The first stage produces a segmentation from incomplete imaging evidence, and the completion stage restores missing structure with the learned shape prior. The gain is largest when the observation is most degraded: +13.9 Dice for 4-slice KiTS23 inputs and +11.2 Dice for low-resolution inputs.

Mask-Prompt Guidance

Different observations, one model

GenMed treats a visual prompt as a constraint on the generated pair rather than as a fixed input channel. This lets the same pretrained prior respond to sparse slices, intersecting planes, missing regions, and noisy partial evidence.

One-plane

A single cross-section anchors the anatomy while the prior fills the unobserved volume.

Tri-plane

Three orthogonal slices provide stronger spatial constraints without changing the model.

Multi-plane

Multiple sparse sections guide denser completion while still leaving shape inference to the prior.

Broken

Missing or corrupted regions are treated as partial observations to be reconciled during sampling.
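
One way to picture these prompt types is as boolean masks over a voxel grid, with the consistency constraint evaluated only where the mask is true. The sketch below assumes exactly that; the helper names `plane_mask` and `masked_consistency_loss`, the 32³ grid, and the cube-shaped missing region are hypothetical and not taken from GenMed's code.

```python
import numpy as np

def plane_mask(shape, planes):
    """Boolean mask that is True on the given (axis, index) slices of a 3D grid."""
    mask = np.zeros(shape, dtype=bool)
    for axis, index in planes:
        slicer = [slice(None)] * len(shape)
        slicer[axis] = index
        mask[tuple(slicer)] = True
    return mask

def masked_consistency_loss(generated, observed, mask):
    """Mean squared error restricted to the observed voxels."""
    diff = (generated - observed)[mask]
    return float(np.mean(diff ** 2))

shape = (32, 32, 32)
one_plane = plane_mask(shape, [(0, 16)])                            # single cross-section
tri_plane = plane_mask(shape, [(0, 16), (1, 16), (2, 16)])          # three orthogonal slices
multi_plane = plane_mask(shape, [(0, i) for i in range(4, 32, 8)])  # four parallel sections
broken = np.ones(shape, dtype=bool)
broken[8:24, 8:24, 8:24] = False                                    # assumed missing cube

for name, mask in [("one-plane", one_plane), ("tri-plane", tri_plane),
                   ("multi-plane", multi_plane), ("broken", broken)]:
    print(name, "observed fraction:", mask.mean())
```

Under this view, switching prompt type only swaps the mask passed to the loss; the generative prior itself is untouched.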

Comparison of ground truth, partial prompt, input-conditioning baseline, and GenMed guided result across four prompt types.
Guidance vs. input conditioning. Across broken, one-plane, tri-plane and multi-plane prompts, GenMed uses the observed region as a test-time constraint and recovers shapes that better follow the ground-truth boundary than the input-conditioning baseline.

Results · Segmentation

Robust where discriminative models break

Trained on full 3D volumes, GenMed transfers to a different modality with no retraining, stays accurate with as few as 2–4 training samples, and tolerates severely degraded inputs (low resolution or missing slices).

Average Dice (↑) across five cardiac structures. Baselines vs. GenMed under increasingly hard settings. — = not reported.

Method                | Standard (TS) | CT → CT | Zero-shot CT→MRI | 2-shot CT
nnU-Net               | 83.7          | 62.0    | 35.3             | 47.0
SwinUNETR             | 81.4          | 59.8    | 32.3             | —
Input-Cond. Diffusion | 81.7          | 55.0    | 31.9             | 41.2
GenMed-Full (Ours)    | 85.4          | 83.8    | 59.9             | 77.7

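Datasets: TotalSegmentator (TS) and MM-WHS. The gap widens under distribution shift: GenMed-Full leads nnU-Net, the strongest baseline, by 1.7 Dice in the standard setting, but by 21.8 on CT → CT, 24.6 on zero-shot CT→MRI, and 30.7 with only 2 CT training samples, settings where the conditional baselines fall to between 31.9 and 62.0 Dice.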
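
For reference, the Dice coefficient behind these numbers is 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. The sketch below is a minimal NumPy version; the function names, the choice to return 1.0 when both masks are empty, and the per-class averaging are assumptions rather than the paper's evaluation code. It returns values in [0, 1], whereas the table appears to report Dice × 100.

```python
import numpy as np

def dice_score(pred, target):
    """Dice overlap between two binary masks, in [0, 1]."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

def mean_dice(pred_labels, target_labels, class_ids):
    """Average Dice over a set of structure labels in two label maps."""
    scores = [dice_score(pred_labels == c, target_labels == c) for c in class_ids]
    return float(np.mean(scores))

pred = np.array([1, 1, 0, 0])
target = np.array([1, 0, 0, 0])
print(dice_score(pred, target))  # 2 * 1 / (2 + 1)
```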

Segmentation under degraded inputs: low resolution, missing frames, and few slices.
Segmentation under degraded inputs. Across low-resolution volumes, missing middle slices, and only a handful of slices, GenMed-Full is far less affected than nnU-Net, SwinUNETR and a diffusion-atlas baseline.

Results · Shape Completion

One shape prior, many kinds of partial evidence

A single model trained only on complete shapes completes anatomy from diverse, previously unseen observations — single-, multi-, and tri-plane cross-sections, broken regions, and stochastically sampled corruptions.

Shape completion under different visual prompts compared to input conditioning.
Completion under different visual prompts. Compared with input conditioning, GenMed produces completions that align more closely with the blue ground-truth boundaries across organs and tissues of increasing structural complexity — eyeball, urinary bladder, heart and bone.

Text-only shape generation on MedShapeNet.

Method             | MMD ↓ | COV ↑ | 1-NNA ↓
SDFusion           | 2.07  | 52.33 | 70.44
Diffusion-SDF      | 3.19  | 41.05 | 83.49
OctFusion          | 8.75  | 24.14 | 87.56
GenMed-Base (Ours) | 1.16  | 52.80 | 66.24

A medical-tailored SDF backbone yields higher-fidelity shapes and a generated distribution closer to the real one.
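
Assuming the three columns follow the definitions commonly used in the 3D generation literature (minimum matching distance, coverage, and 1-nearest-neighbor accuracy), they can be sketched as below. Euclidean distance between feature vectors stands in for whatever shape distance the paper actually uses, and the function names are hypothetical. For 1-NNA the ideal value is 0.5, meaning generated and real shapes are indistinguishable to a nearest-neighbor classifier.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance matrix between rows of a and rows of b (placeholder metric)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def mmd_cov_1nna(gen, ref):
    """Return (MMD, COV, 1-NNA) for generated and reference sets of feature vectors."""
    d_gr = pairwise_dist(gen, ref)                       # shape (n_gen, n_ref)
    # MMD: for each reference shape, distance to its closest generated shape.
    mmd = float(d_gr.min(axis=0).mean())
    # COV: fraction of reference shapes that are the nearest neighbor of some generated shape.
    cov = np.unique(d_gr.argmin(axis=1)).size / ref.shape[0]
    # 1-NNA: leave-one-out 1-NN classification accuracy on the union of both sets.
    both = np.concatenate([gen, ref], axis=0)
    labels = np.concatenate([np.zeros(len(gen)), np.ones(len(ref))])
    d_all = pairwise_dist(both, both)
    np.fill_diagonal(d_all, np.inf)                      # a sample may not match itself
    nearest = d_all.argmin(axis=1)
    one_nna = float(np.mean(labels[nearest] == labels))
    return mmd, cov, one_nna

ref = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
gen = ref + np.array([100.0, 0.0])                       # a badly shifted "generator"
print(mmd_cov_1nna(gen, ref))
```

In this toy case every generated point matches the same reference point, so coverage is low and the two sets are trivially separable, which is the failure pattern that high MMD, low COV, and high 1-NNA indicate.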

Beyond generation, GenMed turns the same prior into a completion engine. On the hardest multi-plane prompts it reaches 75.9 Dice and the lowest worst-case error (UHD 10.2), surpassing input-conditioning baselines.
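
For readers unfamiliar with the surface metrics, the sketch below shows one common form of each, assuming CD denotes the Chamfer distance and UHD the unidirectional Hausdorff distance between point sets sampled from two surfaces. The exact variants (squared or unsquared distances, direction of the UHD, scaling) are assumptions here and may differ from the paper's evaluation code.

```python
import numpy as np

def _dist_matrix(a, b):
    """Euclidean distances between every point in a and every point in b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    d = _dist_matrix(a, b)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def unidirectional_hausdorff(src, dst):
    """Worst-case distance from any point in src to its nearest point in dst."""
    d = _dist_matrix(src, dst)
    return float(d.min(axis=1).max())

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
print(chamfer_distance(a, b))          # 0.5 + 1.0
print(unidirectional_hausdorff(a, b))  # worst point of a is 1.0 away from b
print(unidirectional_hausdorff(b, a))  # worst point of b is 2.0 away from a
```

The asymmetry in the last two lines is why a unidirectional Hausdorff distance reads as a worst-case error: a single badly placed point dominates the score.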

A notable finding: for these dense, voxel-level geometric prompts, classifier-free guidance variants underperform direct conditioning — yet GenMed's pairwise formulation adapts naturally, recovering finer boundary detail even when only a single slice is available.

Zero-shot shape completion on a 3D eyeball dataset under different defect prompts.
Zero-shot transfer to a new domain. Applied without any fine-tuning to an unseen 3D eyeball dataset, GenMed still aligns more closely with the ground truth than input conditioning across broken, one-plane, tri-plane and multi-plane prompts.

Interactive

3D comparison, side by side

The updated cases include smoothed broken-organ prompts for pancreas, pulmonary artery, and adrenal gland, alongside gallbladder, eyeball, and kidney examples. Each case compares the ground-truth shape, the input-conditioning baseline, and GenMed (Ours), and reports Dice ↑, CD ↓ and UHD ↓ against the ground truth for each method.

Legend:
Observed: given to the model (cutting plane / region)
Unobserved: must be inferred
Redundant / extra: spurious in the prompt

Plane prompts show the observed cutting planes through a faint ghost of the full shape; broken prompts show the observed, redundant and unobserved regions. CD and UHD use the paper's metric (×100); Dice is computed on the SDF occupancy.

Citation

BibTeX

@article{zhang2026genmed,
  title   = {GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks},
  author  = {Zhang, Hantao and Guo, Weidong and Liu, Yuhe and Yang, Jiancheng and
             Bhagavan, Sathvik and Xu, Mingda and Shi, Danli and Fua, Pascal},
  journal = {arXiv preprint arXiv:2605.10645},
  year    = {2026}
}