Dataset Builder Tools

The dataset builder tools help you prepare segmentation datasets from raw rasters and vector annotations. They are accessible via the pytorch-smt-tools CLI.

Overview

| Tool | CLI command | Description |
| --- | --- | --- |
| Band Combiner | combine-bands | Merge bands from N matching rasters into a single multi-band GeoTIFF |
| Tile Dataset Builder | build-tile-dataset | Tile rasters and rasterize vector polygons as masks |
| Sliding Window Builder | build-sliding-window-dataset | Crop existing image/mask pairs into fixed-size patches |
| MBTiles Multiclass Mask Builder | build-mbtiles-multiclass-masks | Rasterize class polygons onto frame extents using an MBTiles reference grid |

Band Combiner (combine-bands)

Finds files with matching names across multiple source directories and combines their bands into a single GeoTIFF per group.
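
The matching step described above can be pictured with a small standard-library sketch. This is not the library's implementation: group_by_stem is a hypothetical helper, and it assumes that "matching names" means identical file stems and that only stems present in every source directory form a group.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def group_by_stem(source_dirs, glob_pattern="**/*.tif"):
    """Hypothetical helper: map each file stem to one path per source directory.

    Only stems found in every directory are kept (assumed behaviour).
    """
    groups = defaultdict(list)
    for directory in source_dirs:
        stems_here = {}
        for path in sorted(Path(directory).glob(glob_pattern)):
            # If a stem appears twice in one directory, keep the first match.
            stems_here.setdefault(path.stem, path)
        for stem, path in stems_here.items():
            groups[stem].append(path)
    return {
        stem: paths
        for stem, paths in sorted(groups.items())
        if len(paths) == len(source_dirs)
    }


# Demonstration on throwaway directories with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    rgb, nir = Path(tmp, "rgb"), Path(tmp, "nir")
    for directory, names in ((rgb, ["a.tif", "b.tif"]), (nir, ["a.tif", "c.tif"])):
        directory.mkdir()
        for name in names:
            (directory / name).touch()
    demo_groups = group_by_stem([rgb, nir])
    demo_keys = list(demo_groups)
    demo_sizes = [len(paths) for paths in demo_groups.values()]
```

In the demonstration only the stem a exists in both directories, so it is the only group; b and c are left out.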

CLI usage

pytorch-smt-tools combine-bands \
    --source-dir /data/rgb \
    --source-dir /data/nir \
    --output-dir /data/combined \
    --glob "**/*.tif" \
    --workers 4

Options

| Option | Default | Description |
| --- | --- | --- |
| --source-dir | (required) | Source directory. Repeat for each directory. |
| --output-dir | (required) | Output directory for combined TIFFs. |
| --glob | **/*.vrt | Glob pattern inside each source directory. |
| --name-pattern | None | Pattern to extract group key, e.g. MI_{name}.tif. |
| --skip-alpha / --no-skip-alpha | --skip-alpha | Skip the last band when a source has >3 bands. |
| --overwrite | False | Overwrite existing output files. |
| --workers | 4 | Number of worker threads. |
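
To make the --name-pattern and --skip-alpha rows concrete, here is a rough sketch of how they could work. The semantics are assumptions, not confirmed library behaviour: {name} is treated as a wildcard whose captured text becomes the group key, and band indexes are 1-based. The helpers extract_group_key and bands_to_keep are hypothetical.

```python
import re


def extract_group_key(filename, name_pattern):
    """Return the text matched by {name}, or None if the file name does not fit.

    Assumed semantics: everything outside {name} must match literally.
    """
    before, _, after = name_pattern.partition("{name}")
    regex = re.escape(before) + r"(?P<name>.+)" + re.escape(after)
    match = re.fullmatch(regex, filename)
    return match.group("name") if match else None


def bands_to_keep(band_count, skip_alpha=True):
    """1-based band indexes to copy from one source.

    Mirrors the table: the last band is dropped only when there are more than 3.
    """
    if skip_alpha and band_count > 3:
        return list(range(1, band_count))
    return list(range(1, band_count + 1))
```

Under these assumptions, extract_group_key("MI_scene01.tif", "MI_{name}.tif") yields "scene01", and a four-band RGBA source contributes bands 1 to 3 while a single-band NIR source contributes its only band.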

Python API

from pathlib import Path
from pytorch_segmentation_models_trainer.tools.dataset_builder.band_combiner import (
    find_file_groups,
    combine_sources_to_tiff,
    combine_all,
)

# Find groups common to all directories
groups = find_file_groups(
    source_dirs=[Path("rgb/"), Path("nir/")],
    glob_pattern="**/*.tif",
)

# Combine all groups
results = combine_all(
    source_dirs=[Path("rgb/"), Path("nir/")],
    output_dir=Path("combined/"),
    glob_pattern="**/*.tif",
)

Tile Dataset Builder (build-tile-dataset)

Generates fixed-size tile patches from raster images and rasterizes vector polygon annotations as segmentation masks.
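
As a rough illustration of fixed-size tiling with overlap, the sketch below computes tile start offsets along one axis. It assumes the overlap is a percentage of the tile size and that the last tile is shifted back so it stays inside the image; the builder may treat image edges differently (for example by padding). The function tile_origins and its parameter names are hypothetical.

```python
def tile_origins(size, tile, overlap_percent):
    """Start offsets of tiles along one axis of length `size` pixels.

    Assumptions: stride = tile * (1 - overlap_percent / 100), rounded, and the
    final tile is moved back to end exactly at the image border.
    """
    if not 0 <= overlap_percent < 100:
        raise ValueError("overlap_percent must be in [0, 100)")
    if size <= tile:
        # The image is no larger than one tile: a single tile at offset 0.
        return [0]
    stride = max(1, round(tile * (1 - overlap_percent / 100)))
    origins = list(range(0, size - tile + 1, stride))
    if origins[-1] != size - tile:
        origins.append(size - tile)
    return origins


# Columns for a 1000-pixel-wide image, 256-pixel tiles, 20 percent overlap.
x_origins = tile_origins(1000, 256, 20)
```

With these numbers the stride is 205 pixels, so x_origins is [0, 205, 410, 615, 744]; the last tile overlaps its neighbour more than the others because it is pulled back to the border.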

CLI usage

pytorch-smt-tools build-tile-dataset conf/examples/build_tile_dataset.yaml

Config YAML

# conf/examples/build_tile_dataset.yaml
image_paths:
  - /path/to/image1.tif
  - /path/to/image2.tif
vector_path: /path/to/masks.gpkg
vector_layer: cobertura_terrestre
class_attribute: tipo
output_dir: /path/to/output/dataset
tile_width: 256
tile_height: 256
overlap_x_percent: 20
overlap_y_percent: 20
min_valid_pixel_ratio: 0.1
skip_empty_tiles: true
generate_full_size_masks: false
max_workers: 8
background_value: 255

Choose a background_value that no class ID uses. The default, 255, suits masks stored as uint8 whose class IDs start at 0 or 1, because it sits at the top of the uint8 range, well away from low-numbered classes.
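
A quick pre-flight check along these lines can catch a collision before a long run. It is only a sketch: check_background_value and is_background_only are hypothetical helpers, and treating an "empty" tile as one whose mask contains nothing but the background value is an assumption about skip_empty_tiles.

```python
def check_background_value(class_ids, background_value=255):
    """Raise if the background value is outside uint8 or reused by a class."""
    if not 0 <= background_value <= 255:
        raise ValueError("background_value must fit in uint8 (0-255)")
    if background_value in set(class_ids):
        raise ValueError(
            f"background_value {background_value} collides with a class ID"
        )
    return background_value


def is_background_only(mask_rows, background_value=255):
    """True when every pixel of a mask (a list of rows) is background.

    Assumed to be what makes a tile "empty"; the tool's rule may differ.
    """
    return all(value == background_value for row in mask_rows for value in row)
```

For class IDs 0 to 5 the default passes; a dataset that already uses 255 as a class would raise and needs a different background_value.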

Output structure

output_dir/
  {image_stem}/
    images/          ← tile GeoTIFFs
    masks/           ← uint8 class-index masks
    mask_full.tif    ← only when generate_full_size_masks=true
  dataset.csv        ← columns: image_path, label_path, rows, columns
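
The resulting dataset.csv can be read with the standard library. The sketch below uses the four column names listed above; the sample rows are invented for illustration, and treating rows and columns as integer tile dimensions is an assumption.

```python
import csv
import io

# Invented sample content; real paths and sizes come from the builder.
SAMPLE_CSV = """image_path,label_path,rows,columns
image1/images/tile_0.tif,image1/masks/tile_0.tif,256,256
image1/images/tile_1.tif,image1/masks/tile_1.tif,256,256
"""


def load_tile_index(handle):
    """Read dataset.csv rows into dicts, converting the size columns to int."""
    records = []
    for row in csv.DictReader(handle):
        row["rows"] = int(row["rows"])
        row["columns"] = int(row["columns"])
        records.append(row)
    return records


tiles = load_tile_index(io.StringIO(SAMPLE_CSV))
```

To read a real file, pass open(path, newline="") instead of the in-memory sample.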

Python API

from pathlib import Path
from pytorch_segmentation_models_trainer.tools.dataset_builder.tile_dataset_builder import (
    build_tile_dataset,
)

df = build_tile_dataset(
    image_paths=[Path("image1.tif"), Path("image2.tif")],
    vector_path=Path("masks.gpkg"),
    class_attribute="tipo",
    output_dir=Path("dataset/"),
    tile_width=256,
    tile_height=256,
    background_value=255,
)
print(f"{len(df)} tiles saved")

Sliding Window Builder (build-sliding-window-dataset)

Crops the existing image/mask pairs listed in a CSV into fixed-size sliding-window patches, optionally overlapping.

CLI usage

pytorch-smt-tools build-sliding-window-dataset dataset.csv \
    --output-dir /data/patches \
    --window-size 256 \
    --overlap 0.25 \
    --remap "6:4,7:5" \
    --workers 8

Options

| Option | Default | Description |
| --- | --- | --- |
| INPUT_CSV | (required) | CSV with image and mask columns. |
| --output-dir | (required) | Root output directory. |
| --window-size | 256 | Patch size in pixels. |
| --overlap | 0.0 | Overlap fraction in [0, 1). |
| --remap | None | Class remapping, e.g. 7:5,6:4. |
| --blacklist | None | Comma-separated directory segments to skip. |
| --workers | 8 | Number of worker threads. |
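
The --remap string is a comma-separated list of colon-separated integer pairs. A minimal sketch of parsing and applying it follows, assuming each pair reads source:target (consistent with the class_remap mapping used in the YAML and Python examples). Both parse_remap and remap_mask are hypothetical helpers, not the tool's own functions.

```python
def parse_remap(spec):
    """Parse a string such as '6:4,7:5' into {6: 4, 7: 5}."""
    mapping = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        source, _, target = pair.partition(":")
        mapping[int(source)] = int(target)  # ValueError on malformed pairs
    return mapping


def remap_mask(mask_rows, mapping):
    """Apply the mapping in one pass; unlisted class IDs are left unchanged."""
    return [[mapping.get(value, value) for value in row] for row in mask_rows]


remap = parse_remap("6:4,7:5")
remapped = remap_mask([[0, 6], [7, 3]], remap)
```

Because every pixel is looked up exactly once, remaps do not chain: a pixel changed from 6 to 4 is not then affected by an entry whose source is 4.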

Config YAML example

# conf/examples/build_sliding_window_dataset.yaml
input_csv: /path/to/dataset.csv
output_dir: /path/to/output
window_size: 256
overlap: 0.0
class_remap:
  6: 4
  7: 5
blacklist:
  - masks_old
  - backup
n_workers: 8
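
The blacklist entries are described as directory segments. One way to picture the filter, assuming an entry must equal a whole path segment rather than a substring, is the hypothetical helper below; the tool's actual matching rule may differ.

```python
from pathlib import PurePosixPath


def is_blacklisted(path, blacklist):
    """True if any segment of the path exactly equals a blacklist entry."""
    return any(part in blacklist for part in PurePosixPath(path).parts)


blacklist = {"masks_old", "backup"}
skipped = is_blacklisted("/data/masks_old/tile_1.tif", blacklist)
kept = not is_blacklisted("/data/masks_older/tile_1.tif", blacklist)
```

Under this whole-segment assumption, masks_older is not skipped even though it contains the text masks_old.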

Python API

from pathlib import Path
from pytorch_segmentation_models_trainer.tools.dataset_builder.sliding_window_builder import (
    build_sliding_window_dataset,
)

df = build_sliding_window_dataset(
    input_csv=Path("dataset.csv"),
    output_dir=Path("patches/"),
    window_size=256,
    overlap=0.25,
    class_remap={6: 4, 7: 5},
)
print(f"{len(df)} patches saved")

Tutorial notebook

See Notebook Tutorials for an end-to-end Potsdam walkthrough that starts from image/mask folders, generates a CSV, and creates a small windowed training set.
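
For orientation, generating a CSV from image/mask folders might look like the sketch below. It is not taken from the notebook: write_pairs_csv is a hypothetical helper, the column names image_path and label_path are assumed to match what the sliding window builder expects, and images are paired with masks by identical file stem. Datasets whose image and mask names differ by a suffix need a different pairing key.

```python
import csv
import tempfile
from pathlib import Path


def write_pairs_csv(image_dir, mask_dir, csv_path, pattern="*.tif"):
    """Pair files by stem and write one CSV row per image/mask pair."""
    masks = {path.stem: path for path in Path(mask_dir).glob(pattern)}
    pairs = [
        (str(image), str(masks[image.stem]))
        for image in sorted(Path(image_dir).glob(pattern))
        if image.stem in masks  # images without a mask are skipped
    ]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image_path", "label_path"])  # assumed column names
        writer.writerows(pairs)
    return len(pairs)


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_images, demo_masks = Path(tmp, "images"), Path(tmp, "masks")
    demo_images.mkdir()
    demo_masks.mkdir()
    for name in ("a.tif", "b.tif"):
        (demo_images / name).touch()
    (demo_masks / "a.tif").touch()
    pair_count = write_pairs_csv(demo_images, demo_masks, Path(tmp, "dataset.csv"))
    with open(Path(tmp, "dataset.csv"), newline="") as handle:
        written_rows = list(csv.reader(handle))
```

In the demonstration b.tif has no mask, so the CSV holds the header plus a single pair.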