Evaluation Pipeline
The evaluation pipeline lets you compare the segmentation quality of one or more trained models side-by-side on a shared test dataset. It runs predictions for each experiment, computes pixel-level metrics from confusion matrices, aggregates results across images, and optionally generates comparison visualizations.
When to Use It
Use the evaluation pipeline when you want to:
- Compare multiple model architectures or training runs on the same test set.
- Compute IoU, F1, precision, and recall across all test images.
- Generate per-image metric CSVs and aggregated JSON reports.
- Produce confusion matrix plots for a qualitative view of class-level errors.
Running the Pipeline
python pytorch_segmentation_models_trainer/evaluate_experiments.py \
  --config-dir ./configs/evaluation \
  --config-name pipeline_config
Override parameters without editing YAML:
python evaluate_experiments.py \
  experiments[0].checkpoint_path=/new/model.ckpt \
  pipeline_options.parallel_inference.enabled=false
Build the evaluation CSV automatically from image and mask folders:
python evaluate_experiments.py \
  evaluation_dataset.build_csv_from_folders.enabled=true \
  evaluation_dataset.build_csv_from_folders.images_folder=/data/images \
  evaluation_dataset.build_csv_from_folders.masks_folder=/data/masks
Pipeline Execution Steps
The EvaluationPipeline runs the following steps in order:
- Prepare dataset — build or load the ground-truth CSV.
- Run predictions — call predict.py as a subprocess for each experiment (sequential or parallel across GPUs).
- Calculate metrics — compute confusion matrices and derive per-image and aggregated metrics.
- Aggregate results — combine per-experiment results into a unified structure.
- Generate visualizations — confusion matrix plots and comparison charts (if enabled).
- Save summary report — write a JSON summary to output.base_dir.
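The control flow of the steps above can be sketched as follows. Note that the function and stage names here are illustrative, not the library's actual API; the real `EvaluationPipeline` class wires these stages internally.

```python
import json
from pathlib import Path

def run_evaluation_pipeline(cfg, experiments, steps):
    """Run the six stages in order. `steps` maps stage names to callables;
    every name here is an illustrative placeholder."""
    csv_path = steps["prepare_dataset"](cfg)            # 1. build/load GT CSV
    for exp in experiments:
        steps["run_predictions"](exp, csv_path)         # 2. predict.py subprocess
        steps["calculate_metrics"](exp, csv_path)       # 3. confusion matrices
    summary = steps["aggregate_results"](experiments)   # 4. unified structure
    if cfg.get("visualization", {}).get("enabled"):
        steps["generate_visualizations"](summary)       # 5. optional plots
    report = Path(cfg["output"]["base_dir"]) / "summary_report.json"
    report.write_text(json.dumps(summary, indent=2))    # 6. JSON summary report
    return report
```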
Config Structure
Dataset Configuration
Three modes are supported for specifying the evaluation dataset:
Mode 1 — Existing CSV:
evaluation_dataset:
  input_csv_path: /data/test_dataset.csv  # CSV with 'image' and 'mask' columns
  build_csv_from_folders:
    enabled: false
Mode 2 — Build CSV from image/mask folders:
evaluation_dataset:
  build_csv_from_folders:
    enabled: true
    images_folder: /data/test/images/
    masks_folder: /data/test/masks/
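In Mode 2, the pipeline pairs each image with the same-named file in the masks folder and writes a CSV with `image` and `mask` columns. A minimal sketch of that behaviour, assuming same-filename pairing (the function name and the `image_pattern` parameter are illustrative):

```python
import csv
from pathlib import Path

def build_csv_from_folders(images_folder, masks_folder, output_csv,
                           image_pattern="*.tif"):
    """Pair each image with a same-named mask and write an 'image'/'mask' CSV.
    Images without a matching mask are skipped."""
    rows = []
    for image in sorted(Path(images_folder).glob(image_pattern)):
        mask = Path(masks_folder) / image.name
        if mask.exists():
            rows.append({"image": str(image), "mask": str(mask)})
    with open(output_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["image", "mask"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)  # number of image/mask pairs written
```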
Mode 3 — Direct folder evaluation:
evaluation_dataset:
  direct_folder_evaluation:
    enabled: true
    ground_truth_folder: /data/gt_masks/
    predictions_folder: /data/predictions/  # optional pre-computed predictions
    gt_pattern: "*.tif"
    pred_pattern: "*.tif"
    force_rebuild: false
Experiments List
Each entry in experiments defines one model run. The pipeline calls predict.py for each, saving outputs to output_folder.
experiments:
  - name: unet_resnet34
    checkpoint_path: /checkpoints/unet_resnet34_best.ckpt
    predict_config: /configs/predict_unet.yaml
    output_folder: /output/eval/unet_resnet34/predictions/
    overrides:  # optional Hydra overrides for this experiment
      inference_threshold: 0.45
  - name: deeplab_resnet50
    checkpoint_path: /checkpoints/deeplab_resnet50_best.ckpt
    predict_config: /configs/predict_deeplab.yaml
    output_folder: /output/eval/deeplab_resnet50/predictions/
The predict_config path must point to a valid predict config YAML (see the Running Inference guide). The pipeline invokes it as a subprocess, passing checkpoint_path and device overrides automatically.
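Conceptually, the command assembled for each experiment looks something like the sketch below. The exact override keys the pipeline passes (`checkpoint_path`, `device`) are assumptions based on the description above, not the verified CLI surface:

```python
import sys

def build_predict_command(exp, device="cuda:0"):
    """Assemble a predict.py invocation for one experiment entry.
    All override keys here are illustrative assumptions."""
    cmd = [
        sys.executable, "predict.py",
        "--config-name", exp["predict_config"],
        f"checkpoint_path={exp['checkpoint_path']}",
        f"device={device}",
    ]
    # per-experiment Hydra overrides are appended last so they take precedence
    cmd += [f"{k}={v}" for k, v in exp.get("overrides", {}).items()]
    return cmd  # the pipeline would then run subprocess.run(cmd, check=True)
```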
Metrics Configuration
metrics:
  num_classes: 2
  class_names:
    - background
    - building
  segmentation_metrics: []  # reserved for future torchmetrics overrides
All metrics are computed from accumulated confusion matrices. The following are reported for each experiment at both the per-image and aggregated levels:
| Metric | Key | Description |
|---|---|---|
| Accuracy | Accuracy | Macro-averaged per-class accuracy |
| IoU / Jaccard | JaccardIndex | Macro-averaged intersection over union |
| F1 Score | F1Score | Macro-averaged Dice / F1 coefficient |
| Precision | Precision | Macro-averaged precision |
| Recall | Recall | Macro-averaged recall |
Per-class metrics are also reported as IoU_{class_name}, F1_{class_name}, Precision_{class_name}, Recall_{class_name}, and Accuracy_{class_name}.
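Deriving these values from an accumulated confusion matrix is straightforward. The sketch below shows the standard computation for IoU, F1, precision, and recall (Accuracy is omitted for brevity); the function name is illustrative, but the result keys mirror the table above:

```python
import numpy as np

def metrics_from_confusion_matrix(cm, class_names):
    """Derive per-class and macro metrics from a (C, C) confusion matrix,
    where cm[i, j] counts pixels with true class i predicted as class j."""
    cm = cm.astype(float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp  # belongs to the class, but missed
    eps = 1e-12               # avoids division by zero for absent classes
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    results = {
        "JaccardIndex": iou.mean(),   # macro average over classes
        "F1Score": f1.mean(),
        "Precision": precision.mean(),
        "Recall": recall.mean(),
    }
    for i, name in enumerate(class_names):
        results[f"IoU_{name}"] = iou[i]
        results[f"F1_{name}"] = f1[i]
    return results
```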
Pipeline Options
pipeline_options:
  skip_existing_predictions: true   # skip experiments whose output_folder already has .tif files
  skip_existing_evaluations: false  # skip metric calculation if results already exist
  parallel_inference:
    enabled: false                    # set true to run experiments in parallel across GPUs
    strategy: one_experiment_per_gpu  # or "sequential"
    gpus: null                        # null = auto-detect; or [0, 1, 2] for specific GPUs
  parallel_image_processing:
    enabled: true      # process images within each experiment in parallel
    num_workers: null  # null = os.cpu_count()
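The skip_existing_predictions check reduces to looking for prediction rasters in each experiment's output folder. A small sketch of the assumed behaviour (the function name is hypothetical):

```python
from pathlib import Path

def should_skip_predictions(output_folder):
    """Assumed behaviour of skip_existing_predictions: an experiment is
    skipped when its output_folder already contains .tif rasters."""
    folder = Path(output_folder)
    return folder.is_dir() and any(folder.glob("*.tif"))
```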
Output Configuration
output:
  base_dir: /output/evaluation/
  timestamp_folders: false
  structure:
    experiments_folder: experiments
    comparison_folder: comparisons
  files:
    summary_report: summary_report.json
    per_image_metrics_pattern: "{experiment_name}_per_image_metrics.csv"
    confusion_matrix_data_pattern: "{experiment_name}_confusion_matrix.npy"
Visualization Configuration
visualization:
  comparison_plots:
    enabled: true
  confusion_matrix:
    save_individual: true  # one plot per experiment
    save_comparison: true  # side-by-side grid of all experiments
GPUDistributor: Parallel Inference
When pipeline_options.parallel_inference.enabled: true, the GPUDistributor class allocates experiments to available GPUs.
Strategies
| Strategy | Behavior |
|---|---|
| one_experiment_per_gpu | Round-robin assignment: experiments are distributed cyclically across available GPUs |
| sequential | All experiments run on the first available GPU or CPU |
The GPUDistributor auto-detects GPUs using PyTorch. Any GPU with less than 1 GB of free memory is excluded. You can pin specific GPUs with pipeline_options.parallel_inference.gpus: [0, 2].
pipeline_options:
  parallel_inference:
    enabled: true
    strategy: one_experiment_per_gpu
    gpus: [0, 1]  # use only GPU 0 and GPU 1
Each parallel experiment loads a separate copy of the model into its assigned GPU. Ensure each GPU has sufficient memory for the model plus the inference batch before enabling parallel mode.
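The round-robin assignment itself is simple. A sketch under the assumption that `gpus` is the list of usable device indices after memory filtering (the function name is illustrative, not the `GPUDistributor` API):

```python
def assign_experiments_to_gpus(experiments, gpus):
    """Round-robin assignment for the one_experiment_per_gpu strategy.
    `gpus` is assumed to already exclude devices with too little free memory."""
    if not gpus:  # no usable GPU: everything falls back to CPU
        return {exp: "cpu" for exp in experiments}
    return {exp: f"cuda:{gpus[i % len(gpus)]}"
            for i, exp in enumerate(experiments)}
```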
Pre-computed Predictions
If you already have prediction rasters and only want to run metric calculation, use load_predictions_from_folder:
pipeline_options:
  load_predictions_from_folder:
    enabled: true
    base_folder: /output/predictions/  # each experiment gets a subfolder named by experiment.name
Or point each experiment to its own folder:
experiments:
  - name: unet_resnet34
    checkpoint_path: /checkpoints/unet_resnet34_best.ckpt
    predict_config: /configs/predict_unet.yaml
    output_folder: /output/eval/unet_resnet34/predictions/
    precomputed_predictions_folder: /precomputed/unet_resnet34/
Prediction-to-Ground-Truth Matching
The MetricsCalculator matches prediction files to ground-truth masks by filename. Matching uses these strategies in priority order:
- Exact stem match — image001.tif matches image001.tif.
- Strip common prefixes — strips mask_ or gt_ from the ground-truth stem before matching.
- Substring match — the cleaned GT stem is a substring of a prediction stem (or vice versa).
Ground-truth rasters and prediction rasters are spatially aligned using rasterio bounding-box intersection before metric calculation. This handles cases where prediction and GT tiles have slightly different extents.
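The three matching strategies can be sketched as a single lookup function. This is a simplified illustration of the priority order described above, not the `MetricsCalculator` implementation itself:

```python
from pathlib import Path

def match_prediction(gt_path, prediction_stems):
    """Match a ground-truth mask to a prediction stem, trying the three
    strategies in priority order; returns None if nothing matches."""
    stem = Path(gt_path).stem
    if stem in prediction_stems:                     # 1. exact stem match
        return stem
    for prefix in ("mask_", "gt_"):                  # 2. strip common prefixes
        if stem.startswith(prefix) and stem[len(prefix):] in prediction_stems:
            return stem[len(prefix):]
    cleaned = stem
    for prefix in ("mask_", "gt_"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
    for pred in prediction_stems:                    # 3. substring match
        if cleaned in pred or pred in cleaned:
            return pred
    return None
```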
Output Files
After a successful run, the output directory contains:
/output/evaluation/
├── summary_report.json # full pipeline summary
├── experiments/
│ ├── unet_resnet34/
│ │ └── metrics/
│ │ ├── unet_resnet34_per_image_metrics.csv
│ │ ├── aggregated_metrics.json
│ │ └── unet_resnet34_confusion_matrix.npy
│ └── deeplab_resnet50/
│ └── metrics/
│ ├── deeplab_resnet50_per_image_metrics.csv
│ ├── aggregated_metrics.json
│ └── deeplab_resnet50_confusion_matrix.npy
└── comparisons/
├── confusion_matrices/
│ ├── unet_resnet34_confusion_matrix.png
│ └── deeplab_resnet50_confusion_matrix.png
└── confusion_matrix_comparison.png
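Given the layout above, each experiment's aggregated metrics can be collected for a quick side-by-side comparison. A small convenience sketch (the function name is illustrative):

```python
import json
from pathlib import Path

def load_aggregated_metrics(base_dir):
    """Collect each experiment's aggregated_metrics.json from the
    experiments/<name>/metrics/ layout shown above."""
    results = {}
    pattern = "experiments/*/metrics/aggregated_metrics.json"
    for metrics_file in Path(base_dir).glob(pattern):
        exp_name = metrics_file.parent.parent.name  # experiment folder name
        results[exp_name] = json.loads(metrics_file.read_text())
    return results
```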
Full Example Config
# ── Dataset ───────────────────────────────────────────────────────────────────
evaluation_dataset:
  input_csv_path: /data/test_dataset.csv
  build_csv_from_folders:
    enabled: false

# ── Experiments ───────────────────────────────────────────────────────────────
experiments:
  - name: unet_resnet34
    checkpoint_path: /checkpoints/unet_resnet34_best.ckpt
    predict_config: /configs/predict_unet.yaml
    output_folder: /output/eval/unet_resnet34/predictions/
    overrides:
      inference_threshold: 0.5
  - name: unet_resnet50
    checkpoint_path: /checkpoints/unet_resnet50_best.ckpt
    predict_config: /configs/predict_unet.yaml
    output_folder: /output/eval/unet_resnet50/predictions/
    overrides:
      inference_threshold: 0.5

# ── Metrics ───────────────────────────────────────────────────────────────────
metrics:
  num_classes: 2
  class_names:
    - background
    - building
  segmentation_metrics: []

# ── Pipeline Options ──────────────────────────────────────────────────────────
pipeline_options:
  skip_existing_predictions: true
  skip_existing_evaluations: false
  parallel_inference:
    enabled: false
    strategy: one_experiment_per_gpu
    gpus: null
  parallel_image_processing:
    enabled: true
    num_workers: 8

# ── Output ────────────────────────────────────────────────────────────────────
output:
  base_dir: /output/evaluation/
  timestamp_folders: false
  structure:
    experiments_folder: experiments
    comparison_folder: comparisons
  files:
    summary_report: summary_report.json
    per_image_metrics_pattern: "{experiment_name}_per_image_metrics.csv"
    confusion_matrix_data_pattern: "{experiment_name}_confusion_matrix.npy"

# ── Visualization ─────────────────────────────────────────────────────────────
visualization:
  comparison_plots:
    enabled: true
  confusion_matrix:
    save_individual: true
    save_comparison: true

# ── Logging ───────────────────────────────────────────────────────────────────
logging:
  level: INFO
  save_to_file: true
  log_file: evaluation.log