Computer Vision / ML · 2025–2026

Interior Scene Segmentation Pipeline

Fine-tuned Mask2Former and SAM segmentation models for real estate interior photos with production-grade boundary precision.

Role: AI / ML Engineer

Python · Mask2Former · SAM / SAM3 · Hydra · COCO Format · RTX 3090

The problem

A real estate technology project needed high-precision segmentation of interior scene photos: walls, ceilings, floors, and architectural features had to be separated with tight, clean boundaries. Off-the-shelf segmentation models produced results that were too rough for production use:

Edge bleed — mask boundaries bled into adjacent surfaces
Incomplete segmentation — small details and narrow features were missed entirely
Inconsistent quality — performance varied significantly across different room types and lighting conditions

The goal was to train and refine segmentation models that could produce production-quality masks for real-world interior photos.

My role

I worked as the AI/ML engineer responsible for the full training pipeline — from dataset preparation through model training, evaluation, and refinement. This was hands-on model engineering, not just prompt engineering or API integration.

What I built

Dataset preparation pipeline

Folder merging — unified multiple annotated datasets into a single training structure
COCO JSON generation — converted annotations into COCO format for compatibility with modern segmentation training frameworks
Train/val splitting — created reproducible train/validation splits with proper stratification
Data quality validation — verified annotation integrity before training
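
As an illustration of the merging and validation steps above, here is a minimal sketch and not the project's actual code. It assumes each source dataset is already a COCO-style dict, that categories can be unified by name, and that every annotation carries a bbox. The function names merge_coco and validate_coco, and the sample file names, are hypothetical.

```python
from collections import Counter


def merge_coco(datasets):
    """Merge COCO-style dicts, remapping ids so sources cannot collide.

    Assumption: categories with the same name are the same class.
    """
    merged = {"images": [], "annotations": [], "categories": []}
    cat_id_by_name = {}
    next_image_id = 1
    next_ann_id = 1

    for ds in datasets:
        # Map this dataset's category ids onto shared, name-keyed ids.
        local_to_global_cat = {}
        for cat in ds["categories"]:
            if cat["name"] not in cat_id_by_name:
                new_id = len(cat_id_by_name) + 1
                cat_id_by_name[cat["name"]] = new_id
                merged["categories"].append({"id": new_id, "name": cat["name"]})
            local_to_global_cat[cat["id"]] = cat_id_by_name[cat["name"]]

        # Give every image a fresh id and remember the mapping.
        local_to_global_img = {}
        for img in ds["images"]:
            local_to_global_img[img["id"]] = next_image_id
            merged["images"].append({**img, "id": next_image_id})
            next_image_id += 1

        # Rewrite annotation references to use the new ids.
        for ann in ds["annotations"]:
            merged["annotations"].append({
                **ann,
                "id": next_ann_id,
                "image_id": local_to_global_img[ann["image_id"]],
                "category_id": local_to_global_cat[ann["category_id"]],
            })
            next_ann_id += 1
    return merged


def validate_coco(coco):
    """Return a list of integrity problems; an empty list means none were found."""
    problems = []
    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    counts = Counter(ann["id"] for ann in coco["annotations"])
    duplicates = [ann_id for ann_id, n in counts.items() if n > 1]
    if duplicates:
        problems.append(f"duplicate annotation ids: {duplicates}")
    for ann in coco["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']} references a missing image")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']} references a missing category")
        _, _, width, height = ann["bbox"]
        if width <= 0 or height <= 0:
            problems.append(f"annotation {ann['id']} has a degenerate bbox")
    return problems


# Two tiny hypothetical sources whose ids and category numbering clash.
source_a = {
    "images": [{"id": 1, "file_name": "a/kitchen.jpg", "width": 640, "height": 480}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [0, 0, 640, 200]}],
    "categories": [{"id": 1, "name": "wall"}],
}
source_b = {
    "images": [{"id": 1, "file_name": "b/bedroom.jpg", "width": 640, "height": 480}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [0, 300, 640, 180]}],
    "categories": [{"id": 1, "name": "floor"}, {"id": 2, "name": "wall"}],
}
merged = merge_coco([source_a, source_b])
problems = validate_coco(merged)
```

Remapping ids during the merge, instead of trusting the source ids, is what keeps two datasets that both start numbering at 1 from silently overwriting each other. The validation pass then runs on the merged result before any training starts.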

Training infrastructure

SAM3 training patches — patched the Segment Anything Model v3 training code to handle edge cases:
- Guarded backward pass on empty targets to prevent training crashes
- Fixed RLE dictionary decoding before resize operations
- Added fallback matcher handling for missing indices
- Implemented unused parameter guards for the semantic head
Hydra configuration — set up Hydra-based training configs for reproducible, parameterized training runs
Local training pipeline — configured and ran training on RTX 3090 with proper checkpoint management, TensorBoard logging, and output organization
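
To make the RLE patch mentioned above more concrete, the following sketch shows the general idea of decoding an RLE dictionary into a pixel mask before resizing it. It is not the SAM3 code itself. It assumes COCO's uncompressed RLE layout (a "size" of [height, width] and a "counts" list of alternating background and foreground run lengths in column-major order), and the helper names are hypothetical. Compressed RLE strings would need a library such as pycocotools instead.

```python
def decode_uncompressed_rle(rle):
    """Decode a COCO uncompressed RLE dict into a row-major 2D list of 0/1."""
    height, width = rle["size"]
    flat = []
    value = 0  # COCO runs always start with a background (0) run
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    if len(flat) != height * width:
        raise ValueError("RLE counts do not match the declared size")
    # COCO stores pixels column by column, so index as col * height + row.
    return [[flat[col * height + row] for col in range(width)]
            for row in range(height)]


def to_mask(segmentation):
    """Accept either an RLE dict or an already decoded mask."""
    if isinstance(segmentation, dict):
        return decode_uncompressed_rle(segmentation)
    return segmentation


def resize_nearest(mask, new_height, new_width):
    """Nearest-neighbour resize of a 2D 0/1 mask."""
    height, width = len(mask), len(mask[0])
    return [
        [mask[r * height // new_height][c * width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]


# A 2x3 mask encoded as runs: 2 background, 3 foreground, 1 background.
rle = {"size": [2, 3], "counts": [2, 3, 1]}
mask = to_mask(rle)
resized = resize_nearest(mask, 4, 6)
```

Resizing only makes sense on pixel data, so normalizing every segmentation to a decoded mask first (the to_mask step here) avoids passing a dictionary into an image operation.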

Model training and refinement

Worked with modern segmentation architectures including:

Mask2Former — transformer-based instance/semantic segmentation
OneFormer — unified segmentation model
SAM / SAM3 — Segment Anything Model variants
YOLO-based segmentation — for faster inference alternatives

The refinement process focused on:

Tighter boundaries — reducing edge bleed between adjacent surfaces (wall-to-ceiling, wall-to-floor transitions)
Better small-object capture — improving detection of narrow features, trim, and architectural details
Consistency across scenes — ensuring reliable performance across different room types, lighting conditions, and camera angles
Production readiness — balancing mask quality against inference speed for real-world deployment
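
One way to quantify edge bleed is to score only the boundary pixels of a mask, since plain mask IoU is dominated by the large interior of a wall or floor. The sketch below is a simplified, one-pixel-wide boundary comparison written for illustration. It is an assumption about how such a check could look, not the metric used in the project, and it also treats pixels on the image edge as boundary.

```python
def boundary_pixels(mask):
    """Return (row, col) of foreground pixels that touch background or the image edge."""
    height, width = len(mask), len(mask[0])
    boundary = set()
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < height and 0 <= nc < width)
                if outside or not mask[nr][nc]:
                    boundary.add((r, c))
                    break
    return boundary


def mask_iou(pred, target):
    """Standard intersection over union on whole masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0


def boundary_iou(pred, target):
    """Intersection over union restricted to one-pixel boundaries."""
    pred_b, target_b = boundary_pixels(pred), boundary_pixels(target)
    union = pred_b | target_b
    return len(pred_b & target_b) / len(union) if union else 1.0


# Ground truth "wall" covers columns 0-2; the prediction bleeds into column 3.
target = [[1, 1, 1, 0, 0] for _ in range(5)]
pred = [[1, 1, 1, 1, 0] for _ in range(5)]
whole_score = mask_iou(pred, target)
edge_score = boundary_iou(pred, target)
```

In this toy case a one-column bleed still leaves the whole-mask score fairly high, while the boundary score drops much further, which is why a boundary-focused measure is a better signal when the goal is tight transitions between surfaces.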

Output and checkpoint management

Organized model checkpoints for version tracking
Set up TensorBoard monitoring for training metrics
Structured output directories for reproducible experiment tracking
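
As a rough sketch of what checkpoint organization can look like, the example below names checkpoints by epoch and validation score and keeps only the best few. The directory layout, file naming scheme, and function names are all hypothetical and are not taken from the project.

```python
import tempfile
from pathlib import Path


def checkpoint_path(run_dir, epoch, val_iou):
    """Build a sortable, self-describing checkpoint file name (hypothetical scheme)."""
    return Path(run_dir) / "checkpoints" / f"epoch{epoch:03d}_iou{val_iou:.4f}.pt"


def prune_checkpoints(run_dir, keep=3):
    """Delete all but the `keep` checkpoints with the highest validation IoU."""
    checkpoints = sorted(
        (Path(run_dir) / "checkpoints").glob("epoch*_iou*.pt"),
        key=lambda path: float(path.stem.split("_iou")[1]),
        reverse=True,
    )
    for stale in checkpoints[keep:]:
        stale.unlink()
    return checkpoints[:keep]


# Demo in a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as run_dir:
    for epoch, val_iou in [(1, 0.61), (2, 0.70), (3, 0.68), (4, 0.72)]:
        path = checkpoint_path(run_dir, epoch, val_iou)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"")  # stands in for real model weights
    kept_names = [p.name for p in prune_checkpoints(run_dir, keep=2)]
    remaining = len(list((Path(run_dir) / "checkpoints").glob("*.pt")))
```

Encoding the epoch and score in the file name keeps the directory readable without opening any file, and pruning by score bounds disk usage on a single local GPU machine.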

Architecture

The training pipeline follows a clear workflow:

Data prep — merge datasets, generate COCO annotations, create train/val splits
Config — Hydra-based configuration defining model architecture, training hyperparameters, and data paths
Train — local GPU training with patched model code, checkpoint saving, and metric logging
Evaluate — boundary quality assessment, IoU metrics, visual inspection of edge cases
Refine — iterate on training parameters, data augmentation, and model patches to improve weak areas
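
The train/val split in the data prep step can be made reproducible with a seeded, per-group shuffle. The sketch below is one possible approach under stated assumptions: each image carries a stratum label such as a room type (the source does not say what the project stratified by), and the function name, file names, and default values are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.2, seed=0):
    """Split (file_name, stratum) pairs so each stratum gets a similar val share."""
    by_stratum = defaultdict(list)
    for file_name, stratum in samples:
        by_stratum[stratum].append(file_name)

    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    train, val = [], []
    for stratum in sorted(by_stratum):
        # Sort before shuffling so the input order cannot change the split.
        names = sorted(by_stratum[stratum])
        rng.shuffle(names)
        if len(names) > 1:
            n_val = round(len(names) * val_fraction)
            n_val = min(max(n_val, 1), len(names) - 1)
        else:
            n_val = 0  # a single-image stratum stays in train
        val.extend(names[:n_val])
        train.extend(names[n_val:])
    return sorted(train), sorted(val)


# Hypothetical file list: 10 kitchens and 5 bathrooms.
samples = [(f"kitchen_{i:02d}.jpg", "kitchen") for i in range(10)]
samples += [(f"bathroom_{i:02d}.jpg", "bathroom") for i in range(5)]
train_files, val_files = stratified_split(samples, val_fraction=0.2, seed=42)
```

Because the shuffle is seeded and the inputs are sorted first, rerunning the split on the same file list gives the same result, so metrics from different training runs are compared on the same validation images.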

What this project proved

This project demonstrated ML engineering that goes beyond running pretrained models:

Precision-quality computer vision — not just "can the model identify the object" but "are the boundaries tight enough for production use"
Serious CV model ecosystem — working with Mask2Former, OneFormer, SAM, and YOLO-based segmentation shows fluency across modern architectures
Training pipeline engineering — dataset preparation, COCO conversion, config management, and reproducible training setups are real ML engineering work
Model debugging and patching — fixing backward pass guards, RLE decoding issues, and fallback matchers shows deeper understanding of training internals
Production orientation — the work cared about output quality, inference speed, model size, and real-world robustness

Outcome

The project delivered refined segmentation models that produced significantly tighter, more consistent masks for interior scene photos compared to off-the-shelf alternatives. The training pipeline — from dataset preparation through Hydra-configured training and checkpoint management — was structured for reproducibility and future iteration, demonstrating ML engineering maturity beyond single-model experimentation.