Building Extraction with Frame Field Models
A complete example showing how to train a Frame Field segmentation model for precise building footprint extraction from aerial imagery, with post-processing to produce clean polygon vectors.
Use Case
Standard segmentation models produce raster masks that require post-processing (e.g., contour extraction) to become polygon vectors. Frame Field models simultaneously predict:
- Interior mask — which pixels belong to buildings
- Boundary mask — the outlines of buildings
- Crossfield — a tangent frame field that encodes the local orientation of building edges
The crossfield guides polygonization algorithms to produce axis-aligned, geometrically clean building footprints — significantly better than simple contour tracing.
The Frame Field approach is based on Polygonization by Frame Field Learning. The FrameFieldSegmentationPLModel and FrameFieldSegmentationDataset in this library implement a compatible training pipeline.
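In the frame field formulation, the two sign-invariant edge directions at a pixel are typically stored not as raw angles but as the coefficients c0, c2 of the complex polynomial f(z) = z^4 - c2*z^2 + c0, whose four roots are the direction vectors. A minimal NumPy sketch of that encoding (the function name is illustrative, not part of this library's API):

```python
import numpy as np

def crossfield_coeffs(theta1, theta2):
    """Encode two sign-invariant edge directions as the coefficients
    c0, c2 of f(z) = z^4 - c2*z^2 + c0, whose roots are {+-u, +-v}
    with u = e^(i*theta1), v = e^(i*theta2)."""
    u2 = np.exp(2j * theta1)  # u squared
    v2 = np.exp(2j * theta2)  # v squared
    c0 = u2 * v2
    c2 = u2 + v2
    return c0, c2

# A building corner where edges meet at 0 and 90 degrees:
c0, c2 = crossfield_coeffs(0.0, np.pi / 2)
# Roots of z^4 - c2*z^2 + c0 are then {1, -1, i, -i}: the two axes.
```

Because the representation is invariant to swapping the two directions and to flipping either one, the network can regress c0 and c2 directly without angle-wrapping ambiguities.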
Project Structure
frame_field_project/
├── data/
│ ├── train/
│ │ ├── images/ # RGB aerial imagery
│ │ ├── polygon_mask/ # Interior (building footprint) binary masks
│ │ ├── boundary_mask/ # Boundary (edge) binary masks
│ │ ├── vertex_mask/ # Vertex (corner) binary masks
│ │ ├── crossfield_mask/ # 2-channel crossfield angle images
│ │ ├── distance_mask/ # Distance transform masks
│ │ └── size_mask/ # Building size masks
│ └── val/
│ └── ... # Same structure as train
├── configs/
│ └── train.yaml
├── train.csv
├── val.csv
└── outputs/
Step 1: Generate Multi-Channel Masks
Frame Field training requires several mask types generated from polygon annotations (typically GeoJSON or shapefile). Use the project's built-in mask building tools with --mode build_mask.
pytorch-smt-mask-builder \
--annotations data/annotations/buildings_train.geojson \
--images-dir data/train/images \
--output-dir data/train \
--mode build_mask
This produces the following files per image tile:
- polygon_mask/ — binary mask: 1 = building interior, 0 = background
- boundary_mask/ — binary mask: 1 = building boundary pixels
- vertex_mask/ — binary mask: 1 = corner/vertex pixels
- crossfield_mask/ — 2-band float32 image encoding tangent frame angles
- distance_mask/ — distance transform from each interior pixel to the nearest edge
- size_mask/ — per-building normalized size map
The Mask Building user guide explains the mask generation process in detail, including how to work with GeoJSON, shapefiles, and tiled imagery.
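To clarify what the channels contain, here is a toy sketch of how the boundary and distance masks relate to the interior mask, using scipy on a synthetic square "building". This is illustrative only; the library's mask builder handles the real polygon rasterization:

```python
import numpy as np
from scipy import ndimage

# Stand-in for a rasterized building polygon: a 32x32 block of ones
interior = np.zeros((64, 64), dtype=bool)
interior[16:48, 16:48] = True

# Boundary mask: interior pixels that touch background (morphological edge)
eroded = ndimage.binary_erosion(interior)
boundary = interior & ~eroded

# Distance mask: Euclidean distance from each interior pixel to the nearest edge
distance = ndimage.distance_transform_edt(interior)
```

The distance mask is used to weight losses so that pixels deep inside large buildings do not dominate over thin structures and edges.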
Step 2: Create CSV Files
The CSV must include columns for every mask type loaded by FrameFieldSegmentationDataset.
Create train.csv:
image,polygon_mask,boundary_mask,vertex_mask,crossfield_mask,distance_mask,size_mask
data/train/images/tile_001.png,data/train/polygon_mask/tile_001.png,data/train/boundary_mask/tile_001.png,data/train/vertex_mask/tile_001.png,data/train/crossfield_mask/tile_001.tif,data/train/distance_mask/tile_001.tif,data/train/size_mask/tile_001.tif
data/train/images/tile_002.png,data/train/polygon_mask/tile_002.png,data/train/boundary_mask/tile_002.png,data/train/vertex_mask/tile_002.png,data/train/crossfield_mask/tile_002.tif,data/train/distance_mask/tile_002.tif,data/train/size_mask/tile_002.tif
The same CSVs can be generated programmatically:

import pandas as pd
from pathlib import Path

def create_frame_field_csv(base_dir, output_csv):
    images = sorted(Path(base_dir, "images").glob("*.png"))
    rows = []
    for img in images:
        stem = img.stem
        row = {
            "image": str(Path(base_dir, "images", f"{stem}.png")),
            "polygon_mask": str(Path(base_dir, "polygon_mask", f"{stem}.png")),
            "boundary_mask": str(Path(base_dir, "boundary_mask", f"{stem}.png")),
            "vertex_mask": str(Path(base_dir, "vertex_mask", f"{stem}.png")),
            "crossfield_mask": str(Path(base_dir, "crossfield_mask", f"{stem}.tif")),
            "distance_mask": str(Path(base_dir, "distance_mask", f"{stem}.tif")),
            "size_mask": str(Path(base_dir, "size_mask", f"{stem}.tif")),
        }
        # Only include the row if all mask files exist
        if all(Path(v).exists() for v in row.values()):
            rows.append(row)
    df = pd.DataFrame(rows)
    df.to_csv(output_csv, index=False)
    print(f"Created {output_csv} with {len(df)} samples")

create_frame_field_csv("data/train", "train.csv")
create_frame_field_csv("data/val", "val.csv")
Step 3: Training Configuration
Create configs/train.yaml:
# --- Backbone segmentation network ---
# The backbone produces the feature maps used by the frame field head.
backbone:
_target_: segmentation_models_pytorch.Unet
encoder_name: resnet34
encoder_weights: imagenet
in_channels: 3
classes: 3 # Interior + Boundary + Vertex channels
activation: sigmoid
# --- Frame Field model flags ---
compute_seg: true # Predict interior/boundary/vertex segmentation
compute_crossfield: true # Predict the tangent frame field
# --- Segmentation output flags ---
seg_params:
compute_interior: true
compute_edge: true
compute_vertex: true
# --- Training Dataset ---
train_dataset:
_target_: pytorch_segmentation_models_trainer.dataset_loader.dataset.FrameFieldSegmentationDataset
input_csv_path: train.csv
image_width: 224
image_height: 224
# Mask column keys (must match CSV column names)
mask_key: polygon_mask
boundary_mask_key: boundary_mask
vertex_mask_key: vertex_mask
crossfield_mask_key: crossfield_mask
distance_mask_key: distance_mask
size_mask_key: size_mask
# Which mask channels to load
return_boundary_mask: true
return_vertex_mask: true
return_crossfield_mask: true
return_distance_mask: true
return_size_mask: true
data_loader:
shuffle: true
num_workers: 4
pin_memory: true
drop_last: true
prefetch_factor: 2
augmentation_list:
- _target_: albumentations.RandomRotate90
p: 0.5
- _target_: albumentations.HorizontalFlip
p: 0.5
- _target_: albumentations.VerticalFlip
p: 0.5
- _target_: albumentations.RandomBrightnessContrast
brightness_limit: 0.2
contrast_limit: 0.2
p: 0.4
- _target_: albumentations.RandomCrop
height: 224
width: 224
always_apply: true
- _target_: albumentations.Normalize
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
p: 1.0
- _target_: albumentations.pytorch.transforms.ToTensorV2
always_apply: true
# --- Validation Dataset ---
val_dataset:
_target_: pytorch_segmentation_models_trainer.dataset_loader.dataset.FrameFieldSegmentationDataset
input_csv_path: val.csv
image_width: 224
image_height: 224
mask_key: polygon_mask
boundary_mask_key: boundary_mask
vertex_mask_key: vertex_mask
crossfield_mask_key: crossfield_mask
distance_mask_key: distance_mask
size_mask_key: size_mask
return_boundary_mask: true
return_vertex_mask: true
return_crossfield_mask: true
return_distance_mask: true
return_size_mask: true
data_loader:
shuffle: false
num_workers: 4
pin_memory: true
drop_last: false
prefetch_factor: 2
augmentation_list:
- _target_: albumentations.Resize
height: 224
width: 224
always_apply: true
- _target_: albumentations.Normalize
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
p: 1.0
- _target_: albumentations.pytorch.transforms.ToTensorV2
always_apply: true
# --- Compound Loss Configuration ---
# Frame Field training uses multiple loss terms that are normalized and summed.
# The compound_loss system handles all component losses automatically.
loss_params:
compound_loss:
normalize_losses: true
losses:
# Segmentation losses (applied to interior, boundary, and vertex channels)
- name: seg_interior_bce
_target_: torch.nn.BCEWithLogitsLoss
weight: 1.0
- name: seg_boundary_bce
_target_: torch.nn.BCEWithLogitsLoss
weight: 1.0
- name: seg_crossfield_align
# Coupling loss: aligns the crossfield tangents to building edges
_target_: pytorch_segmentation_models_trainer.custom_losses.frame_field_losses.CrossfieldAlignLoss
weight: 1.0
- name: seg_smoothness
# Crossfield smoothness regularization
_target_: pytorch_segmentation_models_trainer.custom_losses.frame_field_losses.CrossfieldSmoothLoss
weight: 0.1
# --- Optimizer ---
optimizer:
_target_: torch.optim.AdamW
lr: 0.001
weight_decay: 1.0e-4
eps: 1.0e-8
# --- Learning Rate Scheduler ---
scheduler_list:
- scheduler:
_target_: torch.optim.lr_scheduler.CosineAnnealingLR
T_max: 100
eta_min: 1.0e-7
interval: epoch
frequency: 1
name: cosine_lr
# --- Hyperparameters ---
hyperparameters:
batch_size: 8
epochs: 100
# --- PyTorch Lightning Trainer ---
pl_trainer:
max_epochs: ${hyperparameters.epochs}
accelerator: gpu
devices: 1
precision: 16-mixed
gradient_clip_val: 1.0
gradient_clip_algorithm: norm
check_val_every_n_epoch: 1
log_every_n_steps: 20
# --- Callbacks ---
callbacks:
- _target_: pytorch_lightning.callbacks.ModelCheckpoint
monitor: loss/val
mode: min
save_top_k: 3
save_last: true
filename: "best-{epoch:02d}-{loss/val:.4f}"
auto_insert_metric_name: false
- _target_: pytorch_lightning.callbacks.EarlyStopping
monitor: loss/val
mode: min
patience: 20
min_delta: 0.0005
- _target_: pytorch_lightning.callbacks.LearningRateMonitor
logging_interval: epoch
# --- Logger ---
logger:
_target_: pytorch_lightning.loggers.TensorBoardLogger
save_dir: ./logs
name: frame_field_seg
mode: train
# Device used for training and by the ComputeSegGrads preprocessor inside FrameFieldSegmentationPLModel
device: cuda
normalize_losses: true enables automatic normalization of each loss component so that no single term dominates training purely because of scale differences. The normalization values are estimated at the start of training from a subset of training batches.
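One way to picture what this normalization does (a simplified sketch, not the library's exact code): each raw loss is divided by a typical magnitude measured once on warm-up batches, so a term that naturally sits around 0.02 carries the same initial influence as one that sits around 0.7.

```python
# Typical magnitudes measured once on a few warm-up batches (illustrative values)
typical = {"interior_bce": 0.7, "crossfield_align": 0.02}

def normalized_total(raw_losses, weights):
    # Divide each term by its typical scale before applying weights and summing
    return sum(
        weights[name] * raw_losses[name] / typical[name] for name in raw_losses
    )

total = normalized_total(
    {"interior_bce": 0.35, "crossfield_align": 0.01},
    {"interior_bce": 1.0, "crossfield_align": 1.0},
)
# 0.35 / 0.7 + 0.01 / 0.02 = 0.5 + 0.5, roughly 1.0
```

After normalization, the per-term weights in the config express relative importance rather than having to compensate for raw scale.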
Step 4: Run Training
cd frame_field_project
pytorch-smt --config-dir ./configs --config-name train
The trainer logs individual component losses (e.g., losses/train_seg_interior_bce, losses/train_seg_crossfield_align) in addition to the total loss, making it easy to diagnose training issues per loss term.
Step 5: Inference with Polygonization
SingleImageFromFrameFieldProcessor runs tiled inference and merges both the segmentation (seg) and crossfield (crossfield) outputs across overlapping tiles.
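Conceptually, tiled inference slides a 224-pixel window with a 112-pixel step and averages predictions wherever tiles overlap. A simplified sketch of that merge, assuming NumPy (merge_tiles is illustrative, not the processor's actual API):

```python
import numpy as np

def merge_tiles(image_hw, tile, step, predict):
    """Average overlapping tile predictions into one full-size map."""
    H, W = image_hw
    out = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for y in range(0, H - tile + 1, step):
        for x in range(0, W - tile + 1, step):
            out[y:y + tile, x:x + tile] += predict(y, x)
            count[y:y + tile, x:x + tile] += 1
    return out / np.maximum(count, 1)

# With a constant predictor, the merged map stays constant regardless of overlap
merged = merge_tiles((448, 448), 224, 112, lambda y, x: np.full((224, 224), 0.5))
```

Averaging the overlaps suppresses the edge artifacts a network produces near tile borders, which matters for both the segmentation and crossfield outputs.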
Create configs/predict.yaml:
backbone:
_target_: segmentation_models_pytorch.Unet
encoder_name: resnet34
encoder_weights: null
in_channels: 3
classes: 3
activation: sigmoid
compute_seg: true
compute_crossfield: true
seg_params:
compute_interior: true
compute_edge: true
compute_vertex: true
mode: predict
device: cuda
checkpoint_path: ./logs/frame_field_seg/version_0/checkpoints/best-epoch=XX-loss_val=X.XXXX.ckpt
inference_image_reader:
_target_: pytorch_segmentation_models_trainer.tools.data_handlers.raster_reader.FolderImageReaderProcessor
folder_name: ./data/test/images
recursive: true
image_extension: png
inference_processor:
_target_: pytorch_segmentation_models_trainer.tools.inference.inference_processors.SingleImageFromFrameFieldProcessor
model_input_shape: [224, 224]
step_shape: [112, 112]
mask_bands: 3 # interior + boundary + vertex channels
# Polygonizer converts the segmentation + crossfield into polygon vectors
polygonizer:
_target_: pytorch_segmentation_models_trainer.tools.polygonization.polygonizer.TemplatePolygonizerProcessor
seg_threshold: 0.5
min_area: 10 # Minimum polygon area in pixels to keep
export_strategy:
_target_: pytorch_segmentation_models_trainer.tools.inference.export_inference.VectorExportInferenceStrategy
output_file_path: ./predictions/{input_name}_buildings.geojson
inference_threshold: 0.5
save_inference: true
pytorch-smt --config-dir ./configs --config-name predict
Understanding the Output
Inference produces two outputs per input image:
- Raster mask — a multi-band GeoTIFF containing the interior, boundary, and vertex probability maps.
- Vector polygons — a GeoJSON file with building footprint polygons derived via frame-field-guided polygonization.
The polygonization step uses the crossfield to snap polygon edges to dominant orientations, producing axis-aligned buildings typical of cartographic mapping.
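The snapping idea can be sketched in a few lines; this is a toy illustration of the principle, not the polygonizer's implementation. Directions are compared modulo pi because a tangent line has no sign:

```python
import numpy as np

def snap_angle(edge_theta, field_thetas):
    """Replace a polygon edge's angle with the closest crossfield direction."""
    def dist(a, b):
        # Angular distance between undirected lines (theta and theta + pi coincide)
        d = abs(a - b) % np.pi
        return min(d, np.pi - d)
    return min(field_thetas, key=lambda t: dist(edge_theta, t))

# An edge at 5 degrees near a crossfield aligned to 0/90 degrees snaps to 0
snapped = snap_angle(np.deg2rad(5), [0.0, np.pi / 2])
```

Applied along every edge, this is what straightens slightly jagged raster contours into the rectilinear footprints expected in cartographic products.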
Inspect Polygon Output
import geopandas as gpd
import matplotlib.pyplot as plt
gdf = gpd.read_file("predictions/tile_001_buildings.geojson")
print(f"Detected {len(gdf)} building polygons")
print(gdf.geometry.area.describe())
fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, facecolor="orange", edgecolor="red", linewidth=0.5, alpha=0.6)
ax.set_title("Extracted Building Footprints")
ax.axis("off")
plt.tight_layout()
plt.show()
Next Steps
- Experiment with encoder_name: resnet50 or efficientnet-b4 for higher-capacity feature extraction
- Increase image_width/image_height to capture larger buildings at higher resolution
- Tune min_area in the polygonizer to filter small false-positive detections
- See the Frame Field Dataset guide for more on the multi-channel mask format