A Three-Stage Hybrid Pipeline Designed Around the Known Limitations of Large Language Models
This project stemmed from the need to replace an outdated process in a B2B commercial organisation with a better, more scalable alternative.
The organisation maintains a large database of UK customer and prospect companies that must be sorted into trade channel categories — mobile phone specialists, tyre specialists, computer hardware resellers, stationers — to support sales, marketing, and planning for universe studies to estimate the market for each channel.
A wrong classification at step one flows into every decision built on top of it, making this task far more important than it first appears.
Data has historically been bought from commercial providers at around £3,000 per channel definition.
Because the bought data does not match internal definitions, analysts must manually research and reclassify companies on top of the purchased data.
Three questions guide the design, evaluation, and findings of this project.
Can a three-stage pipeline combining web retrieval, rule-based logic, and LLM synthesis classify UK businesses into trade channels at accuracy levels comparable to Experian and LDC, at a fraction of the cost?
What is the per-record cost and processing time, and how many analyst hours are displaced per 1,000 records, compared to manual research and commercial subscriptions?
Does the LLM add measurable value over rules alone? How well-calibrated are its confidence scores, and what are the failure conditions of each pipeline stage that should trigger a flag for human review?
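The calibration question above can be checked with a simple reliability table that compares stated confidence against observed accuracy per confidence bin. A minimal sketch, assuming each prediction is recorded as a (confidence, was-correct) pair; the record format and toy values are hypothetical:

```python
from collections import defaultdict

def calibration_table(records, bin_width=0.1):
    """Bucket predictions by confidence and report, for each bucket,
    the observed accuracy and the number of predictions in it."""
    bins = defaultdict(lambda: [0, 0])  # bin index -> [correct, total]
    top_bin = int(1 / bin_width) - 1
    for conf, correct in records:
        b = min(int(conf / bin_width), top_bin)  # clamp conf == 1.0 into the top bin
        bins[b][0] += int(correct)
        bins[b][1] += 1
    return {
        round(b * bin_width, 2): (hits / total, total)
        for b, (hits, total) in sorted(bins.items())
    }

# Toy records: (model confidence, was the label correct?)
records = [(0.95, True), (0.92, True), (0.55, False), (0.58, True)]
table = calibration_table(records)
```

A well-calibrated stage would show bucket accuracy close to bucket confidence; large gaps in a bucket mark the conditions that should trigger a flag for human review.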
What this project expects to find.
An application that keeps retrieval, rule-based classification, and LLM reasoning as three separate, focused stages is expected to match the accuracy of commercial providers such as Experian and LDC combined with human analyst review, at a much lower cost per record, while producing a clear explanation for every classification decision.
Each stage is constrained to the task it can perform dependably: retrieval gathers the evidence, rules handle the clear-cut cases, and the LLM is the reasoning layer.
Stage 2: the rule engine. Applies a set of clear, channel-specific rules to the retrieved text. Produces a best-guess channel label and a confidence score. Organises the evidence into a clean, compact package so the LLM in Stage 3 receives focused, well-structured input rather than a large volume of raw text.
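A minimal sketch of such a rule engine. The keyword patterns below are illustrative assumptions, not the organisation's actual channel definitions, and the confidence score is a naive hit ratio:

```python
import re

# Hypothetical channel-specific rules: channel -> keyword patterns.
# Real definitions would come from the organisation's own documentation.
CHANNEL_RULES = {
    "mobile phone specialist": [r"\bmobile phones?\b", r"\bhandsets?\b"],
    "tyre specialist": [r"\btyres?\b", r"\bwheel alignment\b"],
    "computer hardware reseller": [r"\blaptops?\b", r"\bcomponents?\b"],
    "stationer": [r"\bstationery\b", r"\boffice supplies\b"],
}

def rule_classify(text):
    """Apply keyword rules to retrieved text; return a best-guess label,
    a naive confidence score, and the matched patterns as evidence."""
    text_lower = text.lower()
    scores, evidence = {}, {}
    for channel, patterns in CHANNEL_RULES.items():
        hits = [p for p in patterns if re.search(p, text_lower)]
        if hits:
            scores[channel] = len(hits) / len(patterns)
            evidence[channel] = hits
    if not scores:
        return {"label": None, "confidence": 0.0, "evidence": {}}
    best = max(scores, key=scores.get)
    return {"label": best, "confidence": scores[best], "evidence": evidence}
```

The evidence dictionary doubles as the compact package handed to Stage 3, so every downstream decision can point back to the exact patterns that fired.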
Stage 3: LLM synthesis. Receives the full structured evidence package from Stages 1 and 2 combined. Reasons over all retrieved signals and the rule engine output together to confirm or adjust the final classification, producing a calibrated confidence score alongside the decision.
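The hand-off to Stage 3 can be sketched as a prompt builder. The wording, field names, and JSON reply contract below are assumptions for illustration, not the project's actual prompt:

```python
import json

def build_stage3_prompt(evidence_package, rule_output, channel_definitions):
    """Assemble the structured Stage 3 input: channel definitions to follow,
    the retrieved evidence, and the rule engine's suggestion."""
    return (
        "You are classifying a UK company into one trade channel.\n"
        "Channel definitions:\n"
        + "\n".join(f"- {name}: {defn}" for name, defn in channel_definitions.items())
        + "\n\nRetrieved evidence:\n"
        + json.dumps(evidence_package, indent=2)
        + "\n\nRule engine suggestion: "
        + f"{rule_output['label']} (confidence {rule_output['confidence']:.2f})\n"
        "Confirm or adjust the label, give a confidence between 0 and 1, "
        "and explain which evidence you relied on. Reply as JSON with keys "
        "'label', 'confidence', 'explanation'."
    )
```

Keeping the prompt a pure function of the two upstream outputs makes every LLM decision reproducible from logged inputs.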
How and why large language models generate confident but factually incorrect outputs, and the evidence that this is a structural property of how they work, not an occasional occurrence. This is the core reason retrieval and reasoning must be kept separate.
Research showing that LLM performance drops when key information is buried in the middle of a long input, even in models built for long contexts. This is why Stage 2 condenses the evidence before passing it to the LLM.
The principle that an AI system performs better when it is given retrieved facts to reason over, rather than being asked to retrieve and reason at the same time. This is the architectural foundation of the whole pipeline.
Evidence that using a set of fixed rules alongside an AI model works better than either on its own, especially for specific domains. Clear-cut cases are handled by the rules, and the LLM steps in for the ones that are harder to categorise.
LLMs can follow a completely new set of classification rules written directly into the prompt, with no retraining or extra data needed. This is how Stage 3 learns the organisation's specific channel definitions on the spot, without any prior preparation.
Research on when automating tasks genuinely helps knowledge workers versus simply adding new pressures. Automating high-volume, repetitive lookup work frees analysts for higher-value tasks.
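The rules-first division of labour described above can be sketched as a simple router; the 0.8 threshold is a hypothetical cut-off that would be tuned on the labelled evaluation set:

```python
CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off; tune on labelled data

def route(rule_output):
    """Rules-first routing: accept a confident rule label outright and
    escalate ambiguous or unmatched cases to the LLM stage."""
    confident = (
        rule_output["label"] is not None
        and rule_output["confidence"] >= CONFIDENCE_THRESHOLD
    )
    if confident:
        return ("rule", rule_output["label"])
    return ("llm", None)
```

This keeps the LLM off the clear-cut majority of records, which is where most of the per-record cost saving would come from.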
The pipeline is evaluated on 200 to 300 labelled UK companies, compared across three conditions and measured across seven dimensions.
Full pipeline: does the complete three-stage system work end to end?
Rules only: how accurate are the rules on their own, without the LLM?
Manual baseline: how much does it cost, in money and time, for an analyst to do it by hand?
Classification Accuracy — what percentage of labels are correct?
Per-Channel Scores — precision, recall, and F1 for each channel
Cost Per Record — API costs plus analyst time vs. Experian and LDC
Processing Speed — time to classify one record end-to-end
Analyst Hours Freed — per 1,000 records classified
Statistical Significance — is the accuracy gain real?
Failure Analysis — where and why did each stage go wrong?
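The per-channel precision, recall, and F1 figures can be computed from parallel lists of gold and predicted labels; a pure-stdlib sketch:

```python
from collections import Counter

def per_channel_scores(gold, pred):
    """Compute precision, recall, and F1 for each channel from parallel
    lists of gold and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted this channel wrongly
            fn[g] += 1  # missed this channel's record
    scores = {}
    for channel in set(gold) | set(pred):
        denom_p = tp[channel] + fp[channel]
        denom_r = tp[channel] + fn[channel]
        prec = tp[channel] / denom_p if denom_p else 0.0
        rec = tp[channel] / denom_r if denom_r else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[channel] = {"precision": prec, "recall": rec, "f1": f1}
    return scores
```

Running this separately per condition (full pipeline, rules only, manual baseline) gives the per-channel comparison the evaluation calls for.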
Kristina Hanxhara · 200052562
MSc Data Science and Artificial Intelligence · University of Liverpool
CSCK700 — Computer Science Capstone Project