Interior Scene Segmentation Pipeline
Fine-tuned Mask2Former and SAM segmentation models for real estate interior photos with production-grade boundary precision.
The problem
A real estate technology project needed high-precision segmentation of interior photos: walls, ceilings, floors, and architectural features had to be separated with tight, clean boundaries. Off-the-shelf segmentation models produced masks that were too rough for production use:
- Edge bleed — mask boundaries bled into adjacent surfaces
- Incomplete segmentation — small details and narrow features were missed entirely
- Inconsistent quality — performance varied significantly across different room types and lighting conditions
The goal was to train and refine segmentation models that could produce production-quality masks for real-world interior photos.
My role
I worked as the AI/ML engineer responsible for the full training pipeline — from dataset preparation through model training, evaluation, and refinement. This was hands-on model engineering, not just prompt engineering or API integration.
What I built
Dataset preparation pipeline
- Folder merging — unified multiple annotated datasets into a single training structure
- COCO JSON generation — converted annotations into COCO format for compatibility with modern segmentation training frameworks
- Train/val splitting — created reproducible train/validation splits with proper stratification
- Data quality validation — verified annotation integrity before training
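The merging and splitting steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function names, the extra `source` field written onto each image, and the choice to stratify by source dataset are all assumptions made for the example.

```python
import random
from collections import defaultdict


def merge_coco(datasets):
    """Merge {source_name: coco_dict} into one COCO dict with fresh, unique ids."""
    merged = {"images": [], "annotations": [], "categories": []}
    cat_by_name = {}  # category name -> id in the merged file
    for source, ds in sorted(datasets.items()):
        cat_map = {}  # this dataset's category id -> merged category id
        for cat in ds["categories"]:
            if cat["name"] not in cat_by_name:
                cat_by_name[cat["name"]] = len(cat_by_name) + 1
                merged["categories"].append(
                    {"id": cat_by_name[cat["name"]], "name": cat["name"]}
                )
            cat_map[cat["id"]] = cat_by_name[cat["name"]]
        img_map = {}  # this dataset's image id -> merged image id
        for img in ds["images"]:
            img_map[img["id"]] = len(merged["images"]) + 1
            # "source" is a hypothetical extra field, used only for stratifying.
            merged["images"].append(
                {**img, "id": img_map[img["id"]], "source": source}
            )
        for ann in ds["annotations"]:
            merged["annotations"].append({
                **ann,
                "id": len(merged["annotations"]) + 1,
                "image_id": img_map[ann["image_id"]],
                "category_id": cat_map[ann["category_id"]],
            })
    return merged


def stratified_split(images, key, val_fraction=0.2, seed=0):
    """Return (train_ids, val_ids), drawing val_fraction from each key group."""
    groups = defaultdict(list)
    for img in images:
        groups[key(img)].append(img["id"])
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for name in sorted(groups):
        ids = sorted(groups[name])
        rng.shuffle(ids)
        n_val = round(len(ids) * val_fraction)
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return sorted(train), sorted(val)
```

Remapping ids during the merge matters because separately annotated datasets usually reuse the same image and category ids. Sorting before the seeded shuffle makes the split independent of the order in which files were read.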
Training infrastructure
- SAM3 training patches — patched the Segment Anything Model v3 training code to handle edge cases:
  - Guarded backward pass on empty targets to prevent training crashes
  - Fixed RLE dictionary decoding before resize operations
  - Added fallback matcher handling for missing indices
  - Implemented unused parameter guards for the semantic head
- Hydra configuration — set up Hydra-based training configs for reproducible, parameterized training runs
- Local training pipeline — configured and ran training on RTX 3090 with proper checkpoint management, TensorBoard logging, and output organization
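To illustrate the RLE patch listed above, here is a minimal sketch of decoding a mask before resizing it. It assumes annotations stored as uncompressed COCO RLE dicts (`size` plus a list of run lengths). The function names are hypothetical, the resize is a simple nearest-neighbour stand-in, and the actual SAM3 patch is not reproduced here.

```python
def decode_rle(rle):
    """Decode an uncompressed COCO RLE dict into a row-major 2D list of 0/1."""
    height, width = rle["size"]
    counts = rle["counts"]
    if not isinstance(counts, list):
        # Compressed RLE stores counts as a string; that needs pycocotools.
        raise TypeError("compressed RLE: use pycocotools.mask.decode instead")
    if sum(counts) != height * width:
        raise ValueError("RLE run lengths do not cover the whole mask")
    flat = []
    value = 0  # runs alternate between 0s and 1s, starting with 0s
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    # COCO RLE is column-major: pixel (row, col) sits at col * height + row.
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]


def resize_nearest(mask, new_h, new_w):
    """Nearest-neighbour resize of a dense 2D mask."""
    old_h, old_w = len(mask), len(mask[0])
    return [
        [mask[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def mask_for_training(segmentation, new_h, new_w):
    """Decode first if the annotation is still an RLE dict, then resize."""
    if isinstance(segmentation, dict):
        segmentation = decode_rle(segmentation)
    return resize_nearest(segmentation, new_h, new_w)
```

The point of the ordering is that a resize only makes sense on a dense pixel grid. Passing the RLE dict straight into a resize step either fails or silently produces a wrong mask.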
Model training and refinement
I worked with several modern segmentation architectures:
- Mask2Former — transformer-based instance/semantic segmentation
- OneFormer — unified segmentation model
- SAM / SAM3 — Segment Anything Model variants
- YOLO-based segmentation — for faster inference alternatives
The refinement process focused on:
- Tighter boundaries — reducing edge bleed between adjacent surfaces (wall-to-ceiling, wall-to-floor transitions)
- Better small-object capture — improving detection of narrow features, trim, and architectural details
- Consistency across scenes — ensuring reliable performance across different room types, lighting conditions, and camera angles
- Production readiness — balancing mask quality against inference speed for real-world deployment
Output and checkpoint management
- Organized model checkpoints for version tracking
- Set up TensorBoard monitoring for training metrics
- Structured output directories for reproducible experiment tracking
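One way to organize outputs like this is sketched below. The directory names (`checkpoints`, `tensorboard`, `predictions`), the function names, and the keep-the-best-k pruning policy are assumptions for illustration only, not a description of the project's actual layout.

```python
from pathlib import Path


def make_run_dir(root, model, run_id):
    """Create root/model/run_id with one subfolder per kind of output."""
    path = Path(root) / model / run_id
    for sub in ("checkpoints", "tensorboard", "predictions"):
        (path / sub).mkdir(parents=True, exist_ok=True)
    return path


def prune_checkpoints(ckpt_dir, scores, keep=3):
    """Delete all but the `keep` best-scoring checkpoints; return kept names.

    `scores` maps checkpoint file name -> validation metric (higher is better).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    for name in ranked[keep:]:
        (Path(ckpt_dir) / name).unlink(missing_ok=True)
    return ranked[:keep]
```

Keeping each run in its own directory, with logs next to the checkpoints they describe, makes it possible to trace any saved model back to the training run and metrics that produced it.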
Architecture
The training pipeline runs as an iterative loop of five stages:
- Data prep — merge datasets, generate COCO annotations, create train/val splits
- Config — Hydra-based configuration defining model architecture, training hyperparameters, and data paths
- Train — local GPU training with patched model code, checkpoint saving, and metric logging
- Evaluate — boundary quality assessment, IoU metrics, visual inspection of edge cases
- Refine — iterate on training parameters, data augmentation, and model patches to improve weak areas
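The evaluation stage can be illustrated with a small comparison of plain mask IoU against a boundary-focused score. This is a simplified sketch, assuming binary masks stored as 2D lists of 0/1: it scores only a one-pixel-wide boundary band, which is a reduced variant of boundary-style metrics and not necessarily the measure used in this project. The function names are hypothetical.

```python
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def mask_iou(a, b):
    """Intersection over union of two equally sized binary masks."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0  # two empty masks count as a match


def boundary(mask):
    """Keep mask pixels that have a 4-neighbour outside the mask or off-image."""
    h, w = len(mask), len(mask[0])

    def inside(r, c):
        return 0 <= r < h and 0 <= c < w and mask[r][c] == 1

    return [
        [
            int(mask[r][c] == 1
                and not all(inside(r + dr, c + dc) for dr, dc in NEIGHBOURS))
            for c in range(w)
        ]
        for r in range(h)
    ]


def boundary_iou(a, b):
    """IoU computed only on the one-pixel boundary bands of both masks."""
    return mask_iou(boundary(a), boundary(b))
```

For a 4x4 square inside a 6x6 image and the same square shifted one pixel sideways, `mask_iou` gives 0.6 while `boundary_iou` drops to about 0.33. That gap is why a boundary-focused score is useful alongside plain IoU when edge bleed is the main concern.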
What this project proved
This project demonstrated ML engineering that goes beyond API integration:
- Precision-quality computer vision — not just "can the model identify the object" but "are the boundaries tight enough for production use"
- Fluency across the CV model ecosystem — working with Mask2Former, OneFormer, SAM, and YOLO-based segmentation rather than a single architecture
- Training pipeline engineering — dataset preparation, COCO conversion, config management, and reproducible training setups are real ML engineering work
- Model debugging and patching — adding backward pass guards, fixing RLE decoding, and adding fallback matcher handling required working directly with training internals
- Production orientation — the work weighed output quality against inference speed, model size, and real-world robustness
Outcome
The project delivered refined segmentation models that produced significantly tighter, more consistent masks for interior scene photos compared to off-the-shelf alternatives. The training pipeline — from dataset preparation through Hydra-configured training and checkpoint management — was structured for reproducibility and future iteration, demonstrating ML engineering maturity beyond single-model experimentation.