Supplier Offer Ingestion & Normalization
Email-attachment ingestion via MS Graph, semantic field mapping, normalization engine, and exception reporting.
The problem
A business received supplier offers via email — as Excel spreadsheet attachments. Each supplier used different formats, column names, tab structures, and pricing conventions. Manually processing these into a standardized format was slow, error-prone, and didn't scale.
The challenge was building an automated pipeline that could:
- Ingest email attachments from Outlook/Exchange
- Parse inconsistent spreadsheet formats
- Semantically map fields to a standard schema
- Normalize pricing, product data, and supplier metadata
- Report exceptions clearly when data didn't fit
My role
I designed and built the entire ingestion and normalization pipeline — from email API integration to structured output generation with exception reporting.
What I built
Email ingestion layer
- Microsoft Graph API integration — connected to Outlook/Exchange to automatically fetch incoming emails with attachments
- Attachment extraction — filtered for Excel files, downloaded and staged them for processing
- Metadata logging — captured sender, timestamp, subject, and attachment details for audit and debugging
Parsing engine
- Multi-format Excel handling — parsed workbooks with multiple tabs, inconsistent headers, and varying structures
- Column detection — identified key fields like EAN, product description, price, pieces per carton across different naming conventions
- Tab-aware processing — handled workbooks where relevant data might be on the first tab, a named tab, or spread across multiple tabs
Normalization rules
- Semantic field mapping — built rules to map inconsistent column names to a standardized schema (e.g., "Retail Price", "RRP", "Price incl. VAT" all mapping to the correct field)
- Price type detection — distinguished between net price, retail price, and promotional pricing across supplier formats
- Brand detection priority — implemented rule-based brand identification with priority ordering for cases where brand appears in multiple fields
- Unit normalization — standardized quantities, packaging units, and per-carton counts
Exception reporting
- Unmappable field detection — flagged columns that couldn't be automatically mapped to the standard schema
- Data quality alerts — identified missing EANs, impossible prices, duplicate entries, and format anomalies
- Structured exception output — generated clear reports showing what was processed, what was flagged, and what required manual review
Standardized output
- Clean normalized spreadsheets — output in a consistent format ready for downstream systems (inventory, pricing, analytics)
- Processing summary — per-supplier statistics showing coverage, exception rates, and confidence levels
Architecture
The pipeline follows a staged approach:
- Fetch — MS Graph API polls for new emails, extracts attachments
- Parse — Excel files are opened, tabs identified, headers detected
- Map — Semantic rules match source columns to target schema
- Normalize — Values are cleaned, prices categorized, brands identified
- Validate — Data quality checks flag exceptions
- Output — Clean data and exception reports are generated
Each stage is independent — the parser doesn't need to know about the email layer, and the normalizer doesn't care about the source format. This makes the system extensible when new supplier formats appear.
Why this project matters
This is one of my most advanced data pipeline efforts because it goes far beyond simple scraping or parsing:
- Email API integration — not just file processing but automated email-based ingestion
- Semantic mapping — rules that understand intent, not just exact column names
- Business domain complexity — pricing conventions, packaging standards, and brand hierarchies
- Exception-first design — built to report what it can't handle, not silently drop data
- Scalable architecture — each stage is modular and independently testable
Outcome
The pipeline automated what was previously a manual, multi-hour process per supplier. New supplier formats could be onboarded by adding mapping rules rather than rewriting code. Exception reporting gave the business confidence in data quality while reducing the risk of silent errors in pricing or inventory data.