Data / Systems Engineering2026

Supplier Offer Ingestion & Normalization

Email-attachment ingestion via MS Graph, semantic field mapping, normalization engine, and exception reporting.

Role: Data & Systems Engineer

PythonMicrosoft GraphPandasExcel ParsingRule Engine

The problem

A business received supplier offers via email — as Excel spreadsheet attachments. Each supplier used different formats, column names, tab structures, and pricing conventions. Manually processing these into a standardized format was slow, error-prone, and didn't scale.

The challenge was building an automated pipeline that could:

Ingest email attachments from Outlook/Exchange
Parse inconsistent spreadsheet formats
Semantically map fields to a standard schema
Normalize pricing, product data, and supplier metadata
Report exceptions clearly when data didn't fit

My role

I designed and built the entire ingestion and normalization pipeline — from email API integration to structured output generation with exception reporting.

What I built

Email ingestion layer

Microsoft Graph API integration — connected to Outlook/Exchange to automatically fetch incoming emails with attachments
Attachment extraction — filtered for Excel files, downloaded and staged them for processing
Metadata logging — captured sender, timestamp, subject, and attachment details for audit and debugging

Parsing engine

Multi-format Excel handling — parsed workbooks with multiple tabs, inconsistent headers, and varying structures
Column detection — identified key fields like EAN, product description, price, pieces per carton across different naming conventions
Tab-aware processing — handled workbooks where relevant data might be on the first tab, a named tab, or spread across multiple tabs

Normalization rules

Semantic field mapping — built rules to map inconsistent column names to a standardized schema (e.g., "Retail Price", "RRP", "Price incl. VAT" all mapping to the correct field)
Price type detection — distinguished between net price, retail price, and promotional pricing across supplier formats
Brand detection priority — implemented rule-based brand identification with priority ordering for cases where brand appears in multiple fields
Unit normalization — standardized quantities, packaging units, and per-carton counts

Exception reporting

Unmappable field detection — flagged columns that couldn't be automatically mapped to the standard schema
Data quality alerts — identified missing EANs, impossible prices, duplicate entries, and format anomalies
Structured exception output — generated clear reports showing what was processed, what was flagged, and what required manual review

Standardized output

Clean normalized spreadsheets — output in a consistent format ready for downstream systems (inventory, pricing, analytics)
Processing summary — per-supplier statistics showing coverage, exception rates, and confidence levels

Architecture

The pipeline follows a staged approach:

Fetch — MS Graph API polls for new emails, extracts attachments
Parse — Excel files are opened, tabs identified, headers detected
Map — Semantic rules match source columns to target schema
Normalize — Values are cleaned, prices categorized, brands identified
Validate — Data quality checks flag exceptions
Output — Clean data and exception reports are generated

Each stage is independent — the parser doesn't need to know about the email layer, and the normalizer doesn't care about the source format. This makes the system extensible when new supplier formats appear.

Why this project matters

This is one of my most advanced data pipeline efforts because it goes far beyond simple scraping or parsing:

Email API integration — not just file processing but automated email-based ingestion
Semantic mapping — rules that understand intent, not just exact column names
Business domain complexity — pricing conventions, packaging standards, and brand hierarchies
Exception-first design — built to report what it can't handle, not silently drop data
Scalable architecture — each stage is modular and independently testable

Outcome

The pipeline automated what was previously a manual, multi-hour process per supplier. New supplier formats could be onboarded by adding mapping rules rather than rewriting code. Exception reporting gave the business confidence in data quality while reducing the risk of silent errors in pricing or inventory data.