Windowed Image Datasets
These datasets are designed to extract patches from full-size rasters using a deterministic sliding-window (grid) approach. Unlike random-crop datasets, they allow you to process the entire area of your images in a fixed grid, which is particularly useful for validation, testing, and consistent performance monitoring.
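To make the grid idea concrete, the following is a minimal sketch of how window offsets can be enumerated from a raster size, a crop_size, and a stride. The function name iter_grid_windows is hypothetical, and dropping partial windows at the right and bottom edges is an assumption of this sketch; the library may handle edges differently.

```python
from typing import Iterator, Tuple


def iter_grid_windows(
    image_height: int,
    image_width: int,
    crop_size: Tuple[int, int],
    stride: int,
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (row_off, col_off, height, width) for each full window on a fixed grid.

    Assumption: windows that would extend past the raster edge are skipped.
    """
    crop_h, crop_w = crop_size
    for row_off in range(0, image_height - crop_h + 1, stride):
        for col_off in range(0, image_width - crop_w + 1, stride):
            yield (row_off, col_off, crop_h, crop_w)


# A 512 x 768 raster with 256 x 256 crops: stride 256 tiles without overlap,
# while stride 128 produces overlapping windows.
non_overlapping = list(iter_grid_windows(512, 768, (256, 256), 256))
overlapping = list(iter_grid_windows(512, 768, (256, 256), 128))
```

Because the offsets depend only on the raster size and the configuration, the same list is produced on every epoch, which is what makes metrics comparable between runs.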
WindowedImageDataset
WindowedImageDataset is a basic dataset that yields only the image patch and its path. It is useful for tasks like feature extraction, unsupervised learning, or simply extracting patches without requiring any target labels.
Configuration
val_dataset:
  _target_: pytorch_segmentation_models_trainer.dataset_loader.image_dataset.WindowedImageDataset
  image_dir: /path/to/images
  crop_size: [256, 256]
  stride: 256  # Non-overlapping grid
  image_dtype: "uint8"
WindowedImageAutoencoderDataset
WindowedImageAutoencoderDataset is specialized for Autoencoder tasks. It returns both image (input) and target (reconstruction label), where both are initially identical crops from the raster. This class supports optional corruption augmentations that apply only to the image key, allowing you to train or validate Denoising Autoencoders on a fixed grid.
Features
- Deterministic Grid: Guarantees that the same patches are seen in every epoch.
- Efficient Reading: Uses rasterio windowed reads to load only the required patch from disk.
- Optional Window Verification: Can validate every candidate patch during initialisation and index only readable windows.
- Window Index Cache: Stores the verified window index in JSON so later runs skip the validation pass when inputs and config are unchanged.
- Serialized Raster Reads: Can serialize window reads per source GeoTIFF when multiple DataLoader workers read the same compressed raster on shared storage.
- Corruption Support: Apply noise, blur, or other corruptions only to the input image.
- Synchronized Augmentations: Standard augmentations (like normalization or flips) are applied identically to both input and target.
Configuration
test_dataset:
  _target_: pytorch_segmentation_models_trainer.dataset_loader.image_dataset.WindowedImageAutoencoderDataset
  image_dir: /path/to/test_images
  crop_size: [256, 256]
  stride: 256
  verify_windows: true
  window_index_cache: /path/to/cache/test_window_index.json
  serialize_rasterio_reads: true
  rasterio_lock_dir: /tmp/psmt_rasterio_locks
  reopen_rasterio_on_read: true
  corruption_augmentation_list:
    - _target_: albumentations.GaussNoise
      p: 1.0
  augmentation_list:
    - _target_: albumentations.Normalize
    - _target_: albumentations.pytorch.ToTensorV2
Key Parameters
- image_dir: Root folder scanned recursively for rasters.
- crop_size: The size of the extracted patches as [height, width].
- stride: The distance between consecutive patches. If equal to crop_size, it produces a non-overlapping grid. If smaller, patches will overlap.
- image_extensions: (Optional) List of extensions to include (e.g., [.tif, .png]).
- image_dtype: Output data type (default is uint8). Use native to keep the raster's original type.
- selected_bands: (Optional) 1-based list of raster bands to read.
- verify_windows: (Optional, default false) When enabled, each candidate window is read once during dataset initialisation. Windows that fail to read or return an unexpected shape are excluded from len(dataset) and global indexing.
- window_index_cache: (Optional) JSON path used with verify_windows. The dataset saves the verified window list plus crop, stride, band selection, dtype, image paths, file sizes, and modification timestamps. If any metadata changes, the cache is rebuilt automatically.
- serialize_rasterio_reads: (Optional, default false) When enabled, each rasterio window read is protected by a per-file interprocess lock. Use this when num_workers > 0 causes decoder errors while several workers read the same large GeoTIFF.
- rasterio_lock_dir: (Optional) Directory for lock files. Defaults to /tmp/psmt_rasterio_locks.
- reopen_rasterio_on_read: (Optional, default false) Opens and closes the raster inside each locked read instead of reusing the per-worker rasterio handle cache. This is slower, but avoids persistent GDAL state when the lock alone is not enough.
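The per-file interprocess lock behind serialize_rasterio_reads can be pictured with the sketch below. It is not the library's implementation: the helper names lock_path_for and raster_read_lock are hypothetical, the lock-file naming scheme (a SHA-1 of the absolute raster path) is an assumption, and fcntl.flock is available on Unix-like systems only.

```python
import fcntl
import hashlib
import os
from contextlib import contextmanager


def lock_path_for(raster_path: str, lock_dir: str) -> str:
    """Map a raster path to one stable lock file (naming scheme is an assumption)."""
    digest = hashlib.sha1(os.path.abspath(raster_path).encode("utf-8")).hexdigest()
    return os.path.join(lock_dir, f"{digest}.lock")


@contextmanager
def raster_read_lock(raster_path: str, lock_dir: str):
    """Hold an exclusive advisory lock for one raster while the body runs."""
    os.makedirs(lock_dir, exist_ok=True)
    with open(lock_path_for(raster_path, lock_dir), "w") as lock_file:
        # Blocks until no other process holds the lock for this raster.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A worker would wrap each windowed read in raster_read_lock(path, lock_dir). Reads of different rasters use different lock files and therefore still run in parallel; only reads of the same file are serialized.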
Verifying and Caching Valid Windows
Use verify_windows when some rasters contain corrupt blocks, missing overviews, or other local read problems that should not appear during training or validation. The first run is slower because every candidate patch is read once. Set window_index_cache to persist the verified index:
val_dataset:
  _target_: pytorch_segmentation_models_trainer.dataset_loader.image_dataset.WindowedImageAutoencoderDataset
  image_dir: /data/validation/images
  crop_size: [224, 224]
  stride: 224
  verify_windows: true
  window_index_cache: /data/validation/cache/window_index.json
  serialize_rasterio_reads: true
  rasterio_lock_dir: /tmp/psmt_rasterio_locks
  reopen_rasterio_on_read: true
The cache stores only window coordinates, not image pixels. Training still reads pixels on demand with the normal rasterio LRU file-handle cache.
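The invalidation rule described for window_index_cache (rebuild whenever the configuration or any input file changes) can be sketched as follows. The JSON layout and the function names build_fingerprint, load_cached_windows, and save_windows are assumptions made for illustration, not the library's actual cache format.

```python
import json
import os


def build_fingerprint(image_paths, crop_size, stride, selected_bands, image_dtype):
    """Collect the settings and file metadata that decide whether a cache is valid."""
    return {
        "crop_size": list(crop_size),
        "stride": stride,
        "selected_bands": selected_bands,
        "image_dtype": image_dtype,
        "images": [
            {"path": p, "size": os.path.getsize(p), "mtime_ns": os.stat(p).st_mtime_ns}
            for p in sorted(image_paths)
        ],
    }


def load_cached_windows(cache_path, fingerprint):
    """Return the cached window list, or None if the cache is missing or stale."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path) as f:
        payload = json.load(f)
    if payload.get("fingerprint") != fingerprint:
        return None  # config or input files changed: caller must re-verify
    return payload["windows"]


def save_windows(cache_path, fingerprint, windows):
    """Persist window coordinates (never pixels) next to their fingerprint."""
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w") as f:
        json.dump({"fingerprint": fingerprint, "windows": windows}, f)
```

On start-up the dataset would call load_cached_windows first and fall back to the full verification pass, followed by save_windows, only when it returns None.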
Iterable Worker Sharding
Use IterableWindowedImageDataset or IterableWindowedImageAutoencoderDataset
when a DataLoader with multiple workers should avoid concurrent reads from the
same source raster. These variants assign whole images to workers:
val_dataset:
  _target_: pytorch_segmentation_models_trainer.dataset_loader.image_dataset.IterableWindowedImageAutoencoderDataset
  image_dir: /data/validation/images
  crop_size: [224, 224]
  stride: 224
  verify_windows: true
  window_index_cache: /data/validation/cache/window_index.json
  data_loader:
    shuffle: false
    num_workers: 4
    persistent_workers: false
    prefetch_factor: 1
shuffle must remain false for iterable datasets. If validation contains many
large rasters, this keeps parallelism across files without requiring per-window
locks. If validation contains only one raster, only one worker can own that file.
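Assigning whole images to workers can be illustrated with a small sketch. The round-robin rule and the function name images_for_worker are assumptions; the text above only states that each image is owned by a single worker. Inside a PyTorch IterableDataset, worker_id and num_workers would typically come from torch.utils.data.get_worker_info().

```python
from typing import List, Sequence


def images_for_worker(
    image_paths: Sequence[str], worker_id: int, num_workers: int
) -> List[str]:
    """Return the images owned by one worker (round-robin over the sorted paths)."""
    ordered = sorted(image_paths)  # every worker must see the same order
    return ordered[worker_id::num_workers]


paths = ["a.tif", "b.tif", "c.tif", "d.tif", "e.tif"]
shards = [images_for_worker(paths, worker_id, 4) for worker_id in range(4)]
single = [images_for_worker(["only.tif"], worker_id, 4) for worker_id in range(4)]
```

With five rasters and four workers, one worker owns two files and the rest own one each. With a single raster, three of the four shards are empty, which matches the note above that only one worker can own that file.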