Parquet Support & Caching
The framework supports Apache Parquet for dataset metadata, offering significantly faster loading times and lower memory consumption compared to standard CSV files.
Automatic CSV Caching
By default, when you provide a .csv file in input_csv_path, the framework will:
- Check if a .cache.parquet file exists in the same directory.
- Compare the modification time of the CSV and the Parquet cache.
- If the Parquet file is newer than the CSV, load it directly and skip CSV parsing entirely.
- If it doesn't exist or is outdated, it reads the CSV and updates the cache automatically.
This behavior is transparent and happens inside the AbstractDataset class.
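The steps above can be sketched roughly as follows. This is a minimal illustration, not the code inside AbstractDataset: the function names (cache_path_for, cache_is_fresh, load_metadata) are hypothetical, and the cache naming scheme (train.csv becomes train.cache.parquet) is an assumption based on the description above.

```python
from pathlib import Path


def cache_path_for(csv_path: Path) -> Path:
    # Assumed naming scheme: "train.csv" -> "train.cache.parquet", same directory.
    return csv_path.with_suffix(".cache.parquet")


def cache_is_fresh(csv_path: Path, cache_path: Path) -> bool:
    """Return True only if the cache exists and is newer than the CSV."""
    if not cache_path.exists():
        return False
    return cache_path.stat().st_mtime > csv_path.stat().st_mtime


def load_metadata(csv_path):
    """Hypothetical loader: use the Parquet cache when fresh, else rebuild it."""
    # Imported lazily; Parquet I/O also requires pyarrow or fastparquet.
    import pandas as pd

    csv_path = Path(csv_path)
    cache_path = cache_path_for(csv_path)
    if cache_is_fresh(csv_path, cache_path):
        return pd.read_parquet(cache_path)
    # Cache is missing or outdated: parse the CSV and refresh the cache.
    df = pd.read_csv(csv_path)
    df.to_parquet(cache_path, index=False)
    return df
```

Because the check relies on file modification times, editing the CSV is enough to invalidate the cache on the next load in a scheme like this one.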
Native Parquet Support
You can also provide a .parquet file directly in your configuration:
train_dataset:
  _target_: pytorch_segmentation_models_trainer.dataset_loader.dataset.SegmentationDataset
  input_csv_path: data/train_metadata.parquet
  root_dir: data/images
CLI Conversion Tool
If you want to convert your CSV files to Parquet manually to avoid the first-run overhead or for better data management, use the csv-to-parquet tool:
# Convert a single file
csv-to-parquet metadata.csv
# Convert all CSVs in a directory recursively
csv-to-parquet datasets/ -r
# Specify a different output path
csv-to-parquet metadata.csv -o processed_metadata.parquet
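If you prefer to run the conversion from a Python script, a rough equivalent can be written with pandas. This is only a sketch and not the implementation of csv-to-parquet: the helper names (find_csv_files, convert_csv_to_parquet) are hypothetical, and the default output name (same stem with a .parquet suffix) is an assumption.

```python
from pathlib import Path


def find_csv_files(target, recursive=False):
    """List CSV files for a path: the file itself, or the CSVs in a directory."""
    target = Path(target)
    if target.is_file():
        return [target]
    pattern = "**/*.csv" if recursive else "*.csv"
    return sorted(target.glob(pattern))


def convert_csv_to_parquet(csv_path, output_path=None):
    """Convert one CSV to Parquet and return the path that was written."""
    # Imported lazily; writing Parquet also requires pyarrow or fastparquet.
    import pandas as pd

    csv_path = Path(csv_path)
    # Assumed default: write next to the CSV with a .parquet suffix.
    if output_path is None:
        output_path = csv_path.with_suffix(".parquet")
    pd.read_csv(csv_path).to_parquet(output_path, index=False)
    return Path(output_path)
```

For example, `[convert_csv_to_parquet(p) for p in find_csv_files("datasets/", recursive=True)]` would convert every CSV under datasets/ in this sketch.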
Performance Benefits
- Faster Initialization: Reading a Parquet file with 1M rows is often 10x-50x faster than parsing a CSV.
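- Lower Memory: Parquet stores typed, columnar data, so Pandas can load values without the intermediate text parsing and type inference that CSV requires, which typically lowers peak memory during loading.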
- Strong Typing: Column types are preserved, avoiding "string vs float" inference issues common with CSVs.