Data / Systems Engineering · 2026

Supplier Offer Ingestion & Normalization

Email-attachment ingestion via MS Graph, semantic field mapping, normalization engine, and exception reporting.

Role: Data & Systems Engineer
Python · Microsoft Graph · Pandas · Excel Parsing · Rule Engine

The problem

A business received supplier offers by email as Excel spreadsheet attachments. Each supplier used its own format, column names, tab structure, and pricing conventions. Manually processing these into a standardized format was slow, error-prone, and didn't scale.

The challenge was building an automated pipeline that could:

  1. Ingest email attachments from Outlook/Exchange
  2. Parse inconsistent spreadsheet formats
  3. Semantically map fields to a standard schema
  4. Normalize pricing, product data, and supplier metadata
  5. Report exceptions clearly when data didn't fit

My role

I designed and built the entire ingestion and normalization pipeline — from email API integration to structured output generation with exception reporting.

What I built

Email ingestion layer

  • Microsoft Graph API integration — connected to Outlook/Exchange to automatically fetch incoming emails with attachments
  • Attachment extraction — filtered for Excel files, downloaded and staged them for processing
  • Metadata logging — captured sender, timestamp, subject, and attachment details for audit and debugging
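
To make the attachment-extraction step concrete, here is an illustrative sketch (not the project's actual code). It assumes the JSON shape Microsoft Graph returns when listing a message's attachments, where file attachments carry a `name` and base64-encoded `contentBytes`. The function name `extract_excel_attachments`, the suffix list, and the sample payload are hypothetical, and the HTTP request and authentication are left out.

```python
import base64

# Assumed set of Excel file extensions to accept.
EXCEL_SUFFIXES = (".xlsx", ".xls", ".xlsm")


def extract_excel_attachments(payload: dict) -> list:
    """Return (filename, raw bytes) for each Excel file attachment in a Graph-style response."""
    staged = []
    for item in payload.get("value", []):
        # Item and reference attachments carry no file bytes, so skip them.
        if item.get("@odata.type") != "#microsoft.graph.fileAttachment":
            continue
        name = item.get("name", "")
        if not name.lower().endswith(EXCEL_SUFFIXES):
            continue
        staged.append((name, base64.b64decode(item["contentBytes"])))
    return staged


# Hypothetical response body, shaped like an attachment listing for one message.
sample = {
    "value": [
        {"@odata.type": "#microsoft.graph.fileAttachment",
         "name": "Offer_Q3.XLSX",
         "contentBytes": base64.b64encode(b"fake-xlsx-bytes").decode()},
        {"@odata.type": "#microsoft.graph.fileAttachment",
         "name": "logo.png",
         "contentBytes": base64.b64encode(b"png").decode()},
        {"@odata.type": "#microsoft.graph.itemAttachment",
         "name": "Forwarded mail"},
    ]
}

excel_files = extract_excel_attachments(sample)
```

Matching on the lowercased file name keeps `Offer_Q3.XLSX` while dropping the image and the embedded message; a stricter variant could also check the attachment's content type.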

Parsing engine

  • Multi-format Excel handling — parsed workbooks with multiple tabs, inconsistent headers, and varying structures
  • Column detection — identified key fields like EAN, product description, price, pieces per carton across different naming conventions
  • Tab-aware processing — handled workbooks where relevant data might be on the first tab, a named tab, or spread across multiple tabs
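
One way to cope with inconsistent headers is to scan the first rows of a sheet and pick the row that looks most like a header. The sketch below is a simplified illustration under assumptions: the keyword list `HEADER_HINTS`, the two-match threshold, and the function names are hypothetical, and the rows are plain Python lists such as a spreadsheet reader would yield.

```python
import re
from typing import Optional

# Hypothetical keywords that a supplier offer header is expected to contain.
HEADER_HINTS = ("ean", "description", "price", "carton", "brand")


def normalize_label(cell) -> str:
    """Lowercase a cell and collapse punctuation so 'Pcs/Carton' becomes 'pcs carton'."""
    return re.sub(r"[^a-z0-9]+", " ", str(cell or "").lower()).strip()


def find_header_row(rows, max_scan: int = 20) -> Optional[int]:
    """Return the index of the row matching the most hints, or None if no row matches at least two."""
    best_index, best_score = None, 1
    for index, row in enumerate(rows[:max_scan]):
        labels = [normalize_label(cell) for cell in row]
        score = sum(any(hint in label for label in labels) for hint in HEADER_HINTS)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Hypothetical sheet with a title block above the real header.
rows = [
    ["ACME Trading Ltd", None, None, None],
    ["Offer valid until 30.06.", None, None, None],
    [None, None, None, None],
    ["EAN Code", "Product Description", "Net Price (EUR)", "Pcs/Carton"],
    ["4006381333931", "Marker pen", 1.25, 12],
]

header_index = find_header_row(rows)
```

Substring matching is deliberately loose here; a production rule set would likely need word boundaries to avoid false hits such as "ean" inside "clean".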

Normalization rules

  • Semantic field mapping — built rules to map inconsistent column names to a standardized schema (e.g., "Retail Price", "RRP", "Price incl. VAT" all mapping to the correct field)
  • Price type detection — distinguished between net price, retail price, and promotional pricing across supplier formats
  • Brand detection priority — implemented rule-based brand identification with priority ordering for cases where brand appears in multiple fields
  • Unit normalization — standardized quantities, packaging units, and per-carton counts
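
The mapping and brand-priority rules above can be illustrated with a small rule table. This is a sketch under assumptions, not the actual rule set: the canonical field names, the regex aliases, the brand list, and the priority order are all hypothetical, and treating "Retail Price", "RRP", and "Price incl. VAT" as one retail-price field is an assumption made for the example.

```python
import re

# Hypothetical alias rules: canonical field -> regex patterns for supplier column names.
FIELD_RULES = {
    "ean": [r"\bean\b", r"\bbarcode\b", r"\bgtin\b"],
    "description": [r"\bdescription\b", r"\bproduct name\b"],
    "retail_price": [r"\bretail price\b", r"\brrp\b", r"\bprice incl vat\b"],
    "net_price": [r"\bnet price\b", r"\bprice excl vat\b"],
    "promo_price": [r"\bpromo(tional)? price\b", r"\boffer price\b"],
    "pieces_per_carton": [r"\b(pcs|pieces) (per )?carton\b"],
}

# Hypothetical brand list and the order in which fields are searched for a brand.
KNOWN_BRANDS = ("Acme", "Globex")
BRAND_SOURCE_PRIORITY = ("brand", "description", "sheet_name")


def normalize_label(name: str) -> str:
    """Lowercase and replace punctuation with spaces, e.g. 'Price incl. VAT' -> 'price incl vat'."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def map_columns(source_columns):
    """Return ({source column: canonical field}, [unmapped source columns])."""
    mapped, unmapped = {}, []
    for column in source_columns:
        label = normalize_label(column)
        target = next(
            (field for field, patterns in FIELD_RULES.items()
             if any(re.search(pattern, label) for pattern in patterns)),
            None,
        )
        if target is None:
            unmapped.append(column)
        else:
            mapped[column] = target
    return mapped, unmapped


def detect_brand(record: dict):
    """Return the first known brand found, checking fields in priority order."""
    for field in BRAND_SOURCE_PRIORITY:
        text = str(record.get(field) or "").lower()
        for brand in KNOWN_BRANDS:
            if brand.lower() in text:
                return brand
    return None


mapped, unmapped = map_columns(
    ["EAN Code", "Retail Price", "RRP", "Price incl. VAT",
     "Net Price (EUR)", "Pcs/Carton", "Remarks"]
)
brand = detect_brand(
    {"brand": "", "description": "Globex marker pen", "sheet_name": "Acme offers"}
)
```

In this example the empty brand column is skipped and the description wins over the sheet name, so `brand` is "Globex"; "Remarks" ends up in `unmapped`, which is the kind of column the exception report would flag.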

Exception reporting

  • Unmappable field detection — flagged columns that couldn't be automatically mapped to the standard schema
  • Data quality alerts — identified missing EANs, impossible prices, duplicate entries, and format anomalies
  • Structured exception output — generated clear reports showing what was processed, what was flagged, and what required manual review
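
As an illustration of the data quality checks, the sketch below flags missing, invalid, and duplicate EANs plus implausible prices. Several details are assumptions for the example: EANs are treated as EAN-13 codes with the standard check digit, an "impossible" price is taken to mean a non-numeric or non-positive value, and the field names `ean` and `net_price` and the issue labels are hypothetical.

```python
def is_valid_ean13(code: str) -> bool:
    """Check length, digits, and the EAN-13 check digit (weights 1 and 3 alternating)."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    weighted = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - weighted % 10) % 10 == digits[12]


def validate_rows(rows):
    """Return a list of exception dicts; an empty list means every row passed."""
    exceptions, seen = [], set()
    for index, row in enumerate(rows):
        ean = str(row.get("ean") or "").strip()
        price = row.get("net_price")
        if not ean:
            exceptions.append({"row": index, "issue": "missing_ean"})
        elif not is_valid_ean13(ean):
            exceptions.append({"row": index, "issue": "invalid_ean"})
        elif ean in seen:
            exceptions.append({"row": index, "issue": "duplicate_ean"})
        else:
            seen.add(ean)
        # Assumed definition of an impossible price: not a number, or zero/negative.
        if not isinstance(price, (int, float)) or price <= 0:
            exceptions.append({"row": index, "issue": "impossible_price"})
    return exceptions


# Hypothetical normalized rows.
rows = [
    {"ean": "4006381333931", "net_price": 1.25},
    {"ean": "4006381333931", "net_price": 1.30},   # same EAN again
    {"ean": "", "net_price": 2.00},                # EAN missing
    {"ean": "4006381333932", "net_price": -5},     # bad check digit, negative price
]

report = validate_rows(rows)
```

Every problem becomes a record rather than a dropped row, so the report can list exactly which rows need manual review and why.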

Standardized output

  • Clean normalized spreadsheets — output in a consistent format ready for downstream systems (inventory, pricing, analytics)
  • Processing summary — per-supplier statistics showing coverage, exception rates, and confidence levels
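
A per-supplier summary can be as simple as two ratios. The sketch below is illustrative only: the function name `summarize`, the metric names, and the definitions (coverage as mapped columns over all columns, exception rate as flagged rows over all rows) are assumptions, and the confidence levels mentioned above are not modeled.

```python
def summarize(supplier: str, total_columns: int, mapped_columns: int,
              total_rows: int, flagged_rows: int) -> dict:
    """Build a small statistics record for one supplier file."""
    return {
        "supplier": supplier,
        # Share of source columns that were mapped to the standard schema.
        "column_coverage": round(mapped_columns / total_columns, 3) if total_columns else 0.0,
        # Share of rows that raised at least one exception.
        "exception_rate": round(flagged_rows / total_rows, 3) if total_rows else 0.0,
    }


# Hypothetical figures for one supplier file.
summary = summarize("ACME Trading", total_columns=8, mapped_columns=6,
                    total_rows=200, flagged_rows=14)
```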

Architecture

The pipeline follows a staged approach:

  1. Fetch — MS Graph API polls for new emails, extracts attachments
  2. Parse — Excel files are opened, tabs identified, headers detected
  3. Map — Semantic rules match source columns to target schema
  4. Normalize — Values are cleaned, prices categorized, brands identified
  5. Validate — Data quality checks flag exceptions
  6. Output — Clean data and exception reports are generated

Each stage is independent — the parser doesn't need to know about the email layer, and the normalizer doesn't care about the source format. This makes the system extensible when new supplier formats appear.
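
The stage independence described above can be sketched as a chain of functions that each take and return a batch dictionary. This is a simplified illustration, not the actual architecture: the `run_pipeline` helper, the batch keys, the alias table, and the stub stage bodies are all hypothetical, and the fetch and output stages are omitted.

```python
from typing import Callable, Iterable

Stage = Callable[[dict], dict]


def run_pipeline(batch: dict, stages: Iterable[Stage]) -> dict:
    """Pass the batch through each stage in order; stages only share the batch dict."""
    for stage in stages:
        batch = stage(batch)
    return batch


def parse(batch: dict) -> dict:
    # Treat the first raw row as the header and turn the rest into records.
    header, *body = batch["raw_rows"]
    return {**batch, "rows": [dict(zip(header, row)) for row in body]}


def map_fields(batch: dict) -> dict:
    aliases = {"EAN Code": "ean", "RRP": "retail_price"}  # hypothetical rules
    rows = [{aliases.get(key, key): value for key, value in row.items()}
            for row in batch["rows"]]
    return {**batch, "rows": rows}


def normalize(batch: dict) -> dict:
    # Convert decimal-comma price strings such as "2,49" to floats.
    rows = [{**row, "retail_price": float(str(row["retail_price"]).replace(",", "."))}
            for row in batch["rows"]]
    return {**batch, "rows": rows}


def validate(batch: dict) -> dict:
    # Record the indexes of rows with a non-positive price.
    flagged = [i for i, row in enumerate(batch["rows"]) if row["retail_price"] <= 0]
    return {**batch, "exceptions": flagged}


result = run_pipeline(
    {"raw_rows": [["EAN Code", "RRP"],
                  ["4006381333931", "2,49"],
                  ["4006381333948", "0"]]},
    [parse, map_fields, normalize, validate],
)
```

Because each stage depends only on the keys it reads, a new supplier format could be handled by swapping or extending one function (for example the alias table in `map_fields`) without touching the others.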

Why this project matters

This is one of my most advanced data pipeline efforts because it goes far beyond simple scraping or parsing:

  • Email API integration — not just file processing but automated email-based ingestion
  • Semantic mapping — rules that understand intent, not just exact column names
  • Business domain complexity — pricing conventions, packaging standards, and brand hierarchies
  • Exception-first design — built to report what it can't handle, not silently drop data
  • Scalable architecture — each stage is modular and independently testable

Outcome

The pipeline automated what was previously a manual, multi-hour process per supplier. New supplier formats could be onboarded by adding mapping rules rather than rewriting code. Exception reporting gave the business confidence in data quality while reducing the risk of silent errors in pricing or inventory data.