Computer Vision / ML · 2025–2026

Interior Scene Segmentation Pipeline

Fine-tuned Mask2Former and SAM segmentation models for real estate interior photos with production-grade boundary precision.

Role: AI / ML Engineer
Python · Mask2Former · SAM / SAM3 · Hydra · COCO Format · RTX 3090

The problem

A real estate technology project needed high-precision segmentation of interior scene photos: walls, ceilings, floors, and architectural features had to be segmented with tight, clean boundaries. Off-the-shelf segmentation models produced results that were too rough for production use:

  • Edge bleed — mask boundaries bled into adjacent surfaces
  • Incomplete segmentation — small details and narrow features were missed entirely
  • Inconsistent quality — performance varied significantly across different room types and lighting conditions

The goal was to train and refine segmentation models that could produce production-quality masks for real-world interior photos.

My role

I worked as the AI/ML engineer responsible for the full training pipeline — from dataset preparation through model training, evaluation, and refinement. This was hands-on model engineering, not just prompt engineering or API integration.

What I built

Dataset preparation pipeline

  • Folder merging — unified multiple annotated datasets into a single training structure
  • COCO JSON generation — converted annotations into COCO format for compatibility with modern segmentation training frameworks
  • Train/val splitting — created reproducible train/validation splits with proper stratification
  • Data quality validation — verified annotation integrity before training
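
The conversion and split steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the input record layout (`file_name`, `width`, `height`, `room_type`, `polygons`), the function names `build_coco` and `stratified_split`, the category list, and the use of room type as the stratification key are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Hypothetical category list; the real label set is not given in this write-up.
CATEGORIES = ["wall", "ceiling", "floor"]


def polygon_area(poly):
    """Shoelace area of a flat [x0, y0, x1, y1, ...] polygon."""
    xs, ys = poly[0::2], poly[1::2]
    n = len(xs)
    twice_area = sum(xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n))
    return abs(twice_area) / 2.0


def build_coco(records):
    """Convert simple per-image records into a COCO-style dict."""
    cat_ids = {name: i + 1 for i, name in enumerate(CATEGORIES)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": cid, "name": name} for name, cid in cat_ids.items()],
    }
    ann_id = 1
    for image_id, rec in enumerate(records, start=1):
        coco["images"].append({
            "id": image_id,
            "file_name": rec["file_name"],
            "width": rec["width"],
            "height": rec["height"],
        })
        for label, poly in rec["polygons"]:
            xs, ys = poly[0::2], poly[1::2]
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": cat_ids[label],
                "segmentation": [poly],
                # COCO boxes are [x, y, width, height]
                "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
                "area": polygon_area(poly),
                "iscrowd": 0,
            })
            ann_id += 1
    return coco


def stratified_split(records, key, val_fraction=0.2, seed=42):
    """Split records so each group (e.g. room type) keeps roughly the same val share."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, val = [], []
    for group_key in sorted(groups):
        # Sort before shuffling so the result does not depend on input order.
        items = sorted(groups[group_key], key=lambda r: r["file_name"])
        rng.shuffle(items)
        n_val = max(1, round(len(items) * val_fraction)) if len(items) > 1 else 0
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```

Calling `build_coco` separately on the train and validation lists gives two dicts that can be written out with `json.dump`. Sorting each group before the seeded shuffle is what keeps the split stable across runs and machines.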

Training infrastructure

  • SAM3 training patches — patched the Segment Anything Model v3 training code to handle edge cases:
    • Guarded backward pass on empty targets to prevent training crashes
    • Fixed RLE dictionary decoding before resize operations
    • Added fallback matcher handling for missing indices
    • Implemented unused parameter guards for the semantic head
  • Hydra configuration — set up Hydra-based training configs for reproducible, parameterized training runs
  • Local training pipeline — configured and ran training on RTX 3090 with proper checkpoint management, TensorBoard logging, and output organization
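
One of the patches above concerns RLE dictionaries reaching a resize step before being decoded. To illustrate why the order matters, the sketch below decodes an uncompressed COCO-style RLE (`{"size": [h, w], "counts": [...]}`, column-major, with runs alternating and starting on background) into a dense mask, and only then resizes it with nearest-neighbour sampling. It is plain Python for readability, not the patched training code: the helper names `decode_rle`, `resize_nearest`, and `to_dense_mask` are hypothetical, and compressed (string) counts are deliberately out of scope.

```python
def decode_rle(rle):
    """Decode an uncompressed COCO-style RLE dict into a row-major 2D list of 0/1."""
    height, width = rle["size"]
    counts = rle["counts"]
    if isinstance(counts, (str, bytes)):
        raise TypeError("compressed RLE counts are not handled in this sketch")
    if sum(counts) != height * width:
        raise ValueError("RLE counts do not cover the full mask")
    flat = []
    value = 0  # runs alternate, starting with background (0)
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    # COCO RLE is column-major: flat index = col * height + row
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]


def resize_nearest(mask, new_height, new_width):
    """Nearest-neighbour resize, which keeps the mask strictly binary."""
    height, width = len(mask), len(mask[0])
    return [
        [mask[r * height // new_height][c * width // new_width] for c in range(new_width)]
        for r in range(new_height)
    ]


def to_dense_mask(target, out_height, out_width):
    """Accept either an RLE dict or a dense mask; always decode before resizing."""
    mask = decode_rle(target) if isinstance(target, dict) else target
    return resize_nearest(mask, out_height, out_width)
```

The single entry point `to_dense_mask` reflects the design choice in this sketch: the resize step never sees a dictionary, so mixed annotation formats in one batch cannot reach it undecoded.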

Model training and refinement

Worked with modern segmentation architectures including:

  • Mask2Former — transformer-based instance/semantic segmentation
  • OneFormer — unified segmentation model
  • SAM / SAM3 — Segment Anything Model variants
  • YOLO-based segmentation — for faster inference alternatives

The refinement process focused on:

  • Tighter boundaries — reducing edge bleed between adjacent surfaces (wall-to-ceiling, wall-to-floor transitions)
  • Better small-object capture — improving detection of narrow features, trim, and architectural details
  • Consistency across scenes — ensuring reliable performance across different room types, lighting conditions, and camera angles
  • Production readiness — balancing mask quality against inference speed for real-world deployment

Output and checkpoint management

  • Organized model checkpoints for version tracking
  • Set up TensorBoard monitoring for training metrics
  • Structured output directories for reproducible experiment tracking

Architecture

The training pipeline follows a clear workflow:

  1. Data prep — merge datasets, generate COCO annotations, create train/val splits
  2. Config — Hydra-based configuration defining model architecture, training hyperparameters, and data paths
  3. Train — local GPU training with patched model code, checkpoint saving, and metric logging
  4. Evaluate — boundary quality assessment, IoU metrics, visual inspection of edge cases
  5. Refine — iterate on training parameters, data augmentation, and model patches to improve weak areas
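
For the evaluation step, plain mask IoU can hide edge bleed because interior pixels dominate the score, so a boundary-restricted score is a useful companion. The sketch below computes both on small binary masks given as nested lists. It is a simplified illustration, not the project's evaluation code: `mask_iou`, `boundary_pixels`, and `boundary_iou` are hypothetical names, and the boundary here is just the one-pixel inner contour under 4-connectivity, a rougher proxy than distance-based boundary metrics.

```python
def mask_iou(pred, target):
    """Intersection over union of two same-sized binary masks (nested lists)."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0  # two empty masks count as a match


def boundary_pixels(mask):
    """Foreground pixels with a 4-neighbour that is background or outside the image."""
    height, width = len(mask), len(mask[0])
    edge = set()
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if not (0 <= nr < height and 0 <= nc < width) or not mask[nr][nc]:
                    edge.add((r, c))
                    break
    return edge


def boundary_iou(pred, target):
    """IoU restricted to the one-pixel inner contours of both masks."""
    pred_edge, target_edge = boundary_pixels(pred), boundary_pixels(target)
    union = pred_edge | target_edge
    return len(pred_edge & target_edge) / len(union) if union else 1.0
```

When a predicted wall mask bleeds a column into the neighbouring surface, the boundary score will usually drop more sharply than the plain IoU, which is why tracking both helps when iterating on edge quality.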

What this project proved

This project demonstrated ML engineering that goes beyond calling pretrained models:

  • Precision-quality computer vision — not just "can the model identify the object" but "are the boundaries tight enough for production use"
  • Serious CV model ecosystem — working with Mask2Former, OneFormer, SAM, and YOLO-based segmentation shows fluency across modern architectures
  • Training pipeline engineering — dataset preparation, COCO conversion, config management, and reproducible training setups are real ML engineering work
  • Model debugging and patching — fixing backward pass guards, RLE decoding issues, and fallback matchers shows deeper understanding of training internals
  • Production orientation — the work cared about output quality, inference speed, model size, and real-world robustness

Outcome

The project delivered refined segmentation models that produced significantly tighter, more consistent masks for interior scene photos compared to off-the-shelf alternatives. The training pipeline — from dataset preparation through Hydra-configured training and checkpoint management — was structured for reproducibility and future iteration, demonstrating ML engineering maturity beyond single-model experimentation.