Ledger Automation Pipeline

At a glance

Project overview

Status: Personal Project · Prototype v2.0
Tools / methods: Python · Pandas · OpenPyXL · RapidFuzz · Streamlit · Automation

The problem started on an ordinary afternoon of manual bookkeeping

There was no assignment brief and no instructor handing this out. The project began with an observation from hands-on accounting work: every ledger exported from the accounting system comes out hierarchical and is hard to read directly. Before it can be used for bookkeeping, tax reporting, or audit review, it goes through the same repetitive steps every time: cleaning and flattening the ledger, matching vendor names, looking up Tax IDs, filling in addresses, identifying the employee behind each payroll transaction row, and compiling everything into a workbook ready for review.

Looking at that workflow, the question wasn’t “how do I do this faster in Excel” — it was: could this entire pipeline be automated, while still keeping a human checkpoint at the points where the risk of getting it wrong is highest?

Role: solo, end to end

As a personal project, the entire lifecycle was handled by one person, from identifying the problem and designing the pipeline to writing the processing logic and building the interface and documentation:

Reframed a real business workflow into a structured software architecture
Wrote the logic to flatten hierarchical ledgers into clean transaction-level rows
Built vendor matching using fuzzy matching (RapidFuzz) to suggest matches against a vendor master
Designed employee identification from payroll descriptions, with dedicated handling for loan-out corporations
Designed a human-in-the-loop workflow: the system only suggests, a human always makes the final call
Built both a CLI and a Streamlit interface to run the pipeline
Designed a YAML-based config so the same pipeline could be reused across different ledger formats
Prepared a synthetic dataset so the project could be published publicly on GitHub without exposing real work data
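
To make the loan-out handling concrete, here is a minimal sketch of how a payroll description might be mapped to an employee. It is not the project's actual code: the names PAYROLL_MASTER, LOAN_OUT_ALIASES, NOISE, and suggest_employee are hypothetical, the data is synthetic, and exact token matching stands in for whatever matching the real module uses.

```python
import re

# Hypothetical payroll master: employee ID -> legal name (synthetic data).
PAYROLL_MASTER = {"E001": "Jordan Alvarez", "E002": "Priya Raman"}

# Hypothetical loan-out aliases: corporation name -> the employee it pays.
LOAN_OUT_ALIASES = {"Bright Lantern Productions Inc": "E002"}

# Assumed noise words that carry no identifying information.
NOISE = {"payroll", "salary", "wages", "week", "ending", "for", "inc", "llc"}


def tokens(text):
    """Return the lowercase alphabetic tokens of text, minus noise words."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in NOISE}


def suggest_employee(description):
    """Suggest an employee for a payroll description; never auto-approve."""
    found = tokens(description)
    # Check loan-out corporations first: the description names the company,
    # not the person, so a direct name match would miss it.
    for corp, emp_id in LOAN_OUT_ALIASES.items():
        corp_tokens = tokens(corp)
        if corp_tokens and corp_tokens <= found:
            return {"employee_id": emp_id, "matched_via": "loan_out_alias",
                    "approval_status": "Pending"}
    # Otherwise look for every token of an employee's name in the description.
    for emp_id, name in PAYROLL_MASTER.items():
        name_tokens = tokens(name)
        if name_tokens and name_tokens <= found:
            return {"employee_id": emp_id, "matched_via": "payroll_master",
                    "approval_status": "Pending"}
    # No match: leave the suggestion empty for the reviewer to fill in.
    return {"employee_id": None, "matched_via": None,
            "approval_status": "Pending"}
```

Every result starts as "Pending", including the ones that matched, which mirrors the rule that the system only suggests and a human makes the final call.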

The process: automating without over-automating

The hardest part wasn’t writing the matching logic — it was deciding where the system should decide on its own, and where a human absolutely has to sign off. In accounting, a wrong match can have real consequences, so the design kept coming back to one question: if the system isn’t confident, what should it do instead of guessing?

That answer became a pipeline with clearly defined stages:

Read either a raw hierarchical ledger or a pre-flattened ledger
Flatten the hierarchical ledger into standardized transaction-level rows following a unified schema
Use fuzzy matching to suggest the corresponding vendor from the vendor master, then export a Vendor Match Review workbook for human verification
Identify payroll rows, extract employee-identifying tokens from the description, and match against the payroll master or loan-out alias — also exported as an Employee Extraction Review workbook
The reviewer updates an Approval Status for each suggestion
Only approved suggestions get carried into the Final Enriched Workbook, which includes Tax IDs, addresses, employee data, and control sheets
Each run is recorded in a run log and a reconciliation summary, so results can be traced back during an audit
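
The flattening stage can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the raw layout (an account header row, indented transaction rows, then a "Total" row) is a guess at a typical hierarchical export, and FlatRow, RAW_ROWS, and flatten are hypothetical names. The project itself lists Pandas and OpenPyXL; plain lists are used here only to keep the sketch self-contained.

```python
from dataclasses import dataclass


@dataclass
class FlatRow:
    account: str
    date: str
    description: str
    amount: float


# Assumed export layout: [label, date, description, amount] (synthetic data).
RAW_ROWS = [
    ["6100 Office Expenses", "", "", ""],
    ["", "2024-03-01", "Acme Paper Co - toner", "120.50"],
    ["", "2024-03-04", "Northwind Supply", "75.00"],
    ["Total 6100 Office Expenses", "", "", "195.50"],
    ["", "2024-03-09", "Row outside any account", "10.00"],
    ["6200 Payroll", "", "", ""],
    ["", "2024-03-15", "Salary for Jordan Alvarez", "2000.00"],
    ["Total 6200 Payroll", "", "", "2000.00"],
]


def flatten(rows):
    """Turn account-grouped rows into one transaction-level row each.

    Returns (flat_rows, exceptions). Rows that cannot be attributed to an
    account are reported as exceptions instead of being guessed at.
    """
    flat, exceptions = [], []
    current_account = None
    for label, date, description, amount in rows:
        label = label.strip()
        if label.lower().startswith("total"):
            current_account = None  # subtotal row closes the account block
            continue
        if label:
            current_account = label  # header row opens a new account block
            continue
        if not date:
            continue  # fully blank spacer row
        if current_account is None:
            exceptions.append([label, date, description, amount])
            continue
        flat.append(FlatRow(current_account, date, description.strip(),
                            float(amount)))
    return flat, exceptions
```

Subtotal rows are dropped rather than carried over because they can be recomputed from the flattened rows, which also gives a natural reconciliation check.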

This design means the system never silently decides on ambiguous cases — it only removes the repetitive grunt work and surfaces well-grounded suggestions, while the final call always stays with the accountant.
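
A minimal sketch of the "suggest, do not decide" rule for vendor matching is shown below. The project uses RapidFuzz; this sketch deliberately swaps in difflib.SequenceMatcher from the standard library so it runs without extra installs, and its scores will differ from RapidFuzz scorers. VENDOR_MASTER, REVIEW_THRESHOLD, and suggest_vendor are hypothetical names, and the 0.6 cutoff is an arbitrary illustrative value.

```python
from difflib import SequenceMatcher

# Hypothetical vendor master (synthetic names).
VENDOR_MASTER = ["Acme Paper Company", "Northwind Supply LLC",
                 "Blue Harbor Logistics"]

# Assumed cutoff: below this similarity, no vendor is suggested at all.
REVIEW_THRESHOLD = 0.6


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for a RapidFuzz scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_vendor(raw_name, master=VENDOR_MASTER, threshold=REVIEW_THRESHOLD):
    """Return a review-sheet row: a suggestion plus a Pending status."""
    best = max(master, key=lambda vendor: similarity(raw_name, vendor))
    score = similarity(raw_name, best)
    return {
        "raw_name": raw_name,
        # Below the threshold the system declines to guess.
        "suggested_vendor": best if score >= threshold else None,
        "score": round(score, 2),
        # Even a strong match is only a suggestion until a human approves it.
        "approval_status": "Pending",
    }
```

The key design choice is that the threshold controls only whether a suggestion is shown, never whether it is accepted; acceptance is always the reviewer's Approval Status.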

Results

A complete end-to-end ledger processing pipeline, available through both a CLI and a Streamlit interface
Supports two input modes: raw hierarchical ledgers and pre-flattened ledgers
Generates the Vendor Match Review workbook, the Employee Extraction Review workbook, and the Final Enriched Workbook
Clear approval gating — no data gets enriched until a human has signed off
Run log, reconciliation summary, and exception sheets included for review and audit purposes
Repository reorganized into a clean structure, with a safe public demo using synthetic data for GitHub
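
The approval gate and reconciliation check could look roughly like this. It is a sketch under assumptions, not the project's code: the field names (txn_id, approval_status, suggested_vendor), the sample data, and the choice to route unapproved rows to an exception list are all hypothetical.

```python
from decimal import Decimal

# Hypothetical vendor master with the enrichment fields (synthetic data).
VENDOR_MASTER = {
    "Acme Paper Company": {"tax_id": "00-0000001", "address": "1 Example Way"},
}

SAMPLE_TRANSACTIONS = [
    {"txn_id": 1, "amount": "120.50"},
    {"txn_id": 2, "amount": "75.00"},
    {"txn_id": 3, "amount": "10.00"},
]

# Review sheet as filled in by a human; txn 3 was never reviewed.
SAMPLE_REVIEWS = {
    1: {"suggested_vendor": "Acme Paper Company", "approval_status": "Approved"},
    2: {"suggested_vendor": "Acme Paper Company", "approval_status": "Pending"},
}


def total(rows):
    """Sum amounts as Decimal to avoid float rounding in reconciliation."""
    return sum((Decimal(r["amount"]) for r in rows), Decimal("0"))


def build_final_rows(transactions, reviews, vendor_master=VENDOR_MASTER):
    """Enrich only rows whose suggestion a reviewer marked Approved."""
    enriched, exceptions = [], []
    for txn in transactions:
        review = reviews.get(txn["txn_id"], {})
        vendor = vendor_master.get(review.get("suggested_vendor"))
        if review.get("approval_status") == "Approved" and vendor is not None:
            enriched.append({**txn, "vendor": review["suggested_vendor"],
                             **vendor})
        else:
            exceptions.append({**txn,
                               "reason": "not approved or vendor not in master"})
    summary = {
        "input_rows": len(transactions),
        "enriched_rows": len(enriched),
        "exception_rows": len(exceptions),
        # Every input amount must land in exactly one of the two outputs.
        "balanced": total(transactions) == total(enriched) + total(exceptions),
    }
    return enriched, exceptions, summary
```

Calling build_final_rows(SAMPLE_TRANSACTIONS, SAMPLE_REVIEWS) enriches only the first transaction; the pending and unreviewed rows go to the exception list, and the summary confirms that no amount was lost along the way.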

The biggest takeaway from this project

The real value here isn’t in the matching code; it’s in the approach: starting not from a tutorial but from a genuine pain point at work. The project tells a clear story: noticing a repetitive, error-prone, manually driven process, then redesigning it into a structured system with built-in risk control and reusability.

It also clarified an important lesson: good automation isn’t about removing humans from the loop entirely — it’s about cutting out the repetitive parts so people can spend their time on the judgment calls that actually need it.

If I did it again

Write more comprehensive automated tests for the flattening, vendor matching, and employee extraction modules
Add more detailed schema validation for input workbooks
Separate the business logic into a dedicated service layer for easier extension
Add authentication if this were ever deployed in a real internal environment
Add an architecture diagram and a demo walkthrough video to make it easier for portfolio viewers to follow
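
As one possible shape for the schema validation item above, the sketch below checks each input row before it enters the pipeline. Nothing here exists in the project yet: the required columns, the ISO date format, and the validate_rows name are assumptions chosen for illustration.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Assumed unified schema for a flattened ledger row.
REQUIRED_COLUMNS = ("account", "date", "description", "amount")


def validate_rows(rows):
    """Return a list of (row_number, message) problems; empty means valid."""
    errors = []
    for number, row in enumerate(rows, start=1):
        missing = [c for c in REQUIRED_COLUMNS if row.get(c) in ("", None)]
        if missing:
            errors.append((number, "missing: " + ", ".join(missing)))
            continue  # type checks below need all fields present
        try:
            date.fromisoformat(str(row["date"]))
        except ValueError:
            errors.append((number, "date is not YYYY-MM-DD"))
        try:
            Decimal(str(row["amount"]))
        except InvalidOperation:
            errors.append((number, "amount is not a number"))
    return errors
```

Returning all problems at once, rather than raising on the first one, fits the review-oriented workflow: the errors can be written to an exception sheet for a human to resolve.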