Artificial Intelligence Breakthrough in Receipt Line Item Extraction


Since early 2016, Tabscanner has been working tirelessly to handle the virtually endless variety of POS paper receipt formats. Many OCR companies try to solve invoice data extraction through templates.

Whilst templates can work quite well with consistent, structured formats, the approach simply does not scale to a general, global POS receipt OCR technology. More advanced methods are required when it comes to POS receipts, and this is where Artificial Intelligence and Machine Learning come into play.

By early 2017 we had developed highly accurate and intelligent models that understood totals and subtotals, and this was the foundation of our technology. AI proved to be vastly superior to basic text parsing in this area, and this is when our technology really started to take off. Our accuracy rates on global POS receipt formats were over 92% and improving steadily with more data.

Line Item Extraction

The next and biggest challenge we took on was the accurate identification of line item information on receipts. This was also a highly complex area, as understanding where line items started and stopped was an extremely difficult task, sometimes even for humans.

We expanded our receipt testing platform (RTP) to around 5,000 receipts and developed a guidance system for our AI to learn from. This was tested and refined on data that came into our technology from a highly successful global marketing campaign. Our beta testers were sending receipts in from all over the world in a vast range of formats and languages.

One of the many early difficulties in solving this problem was that there were no clearly identifiable patterns in where retailers would place line items on receipts. Our machine learning algorithms had to become highly intelligent to solve this. Our RTP batch had clearly defined markers to identify the start and end of line items, and so we were able to measure our success rates accurately as our technology improved.
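As a rough illustration of how success rates can be measured against a batch with start and end markers, here is a minimal Python sketch. It is not Tabscanner's implementation: the `LabeledReceipt` type, the `boundary_accuracy` function and the regex-based `naive_predict` baseline are all hypothetical names invented for this example, and the exact-match scoring rule is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Span = Tuple[int, int]  # (index of first line item row, index of last), inclusive


@dataclass
class LabeledReceipt:
    """Hypothetical benchmark entry: OCR text lines plus hand-marked boundaries."""
    lines: Sequence[str]
    first_item: int
    last_item: int


def boundary_accuracy(
    receipts: Sequence[LabeledReceipt],
    predict: Callable[[Sequence[str]], Optional[Span]],
) -> float:
    """Fraction of receipts whose predicted (start, end) matches the markers exactly."""
    if not receipts:
        return 0.0
    hits = sum(
        1 for r in receipts if predict(r.lines) == (r.first_item, r.last_item)
    )
    return hits / len(receipts)


# Assumed toy baseline: a line ending in something price-like counts as an item
# row, and scanning stops at the first line that mentions "total".
PRICE_AT_END = re.compile(r"\d+[.,]\d{2}\s*$")


def naive_predict(lines: Sequence[str]) -> Optional[Span]:
    candidates = []
    for i, line in enumerate(lines):
        if "total" in line.lower():
            break
        if PRICE_AT_END.search(line):
            candidates.append(i)
    if not candidates:
        return None
    return (candidates[0], candidates[-1])
```

Scoring any predictor against the same labeled batch gives a single comparable number as the predictor changes. A header line that happens to end in a price is enough to fool the toy baseline, which hints at why simple text parsing falls short on real receipts.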

Advanced Multi-Language OCR

During this time we were making steady progress with our own proprietary OCR. We had begun using our OCR to train our models and were making advances in how clearly our general technology understood line item data.

By mid-2018 our RTP benchmark system had grown to over 40,000 receipts and was now identifying where line items started and ended with accuracy rates above 80%.

Whilst this was very impressive and continually improving, supplementary lines were still a huge challenge in solving and extracting POS receipt line items. How stores organize their line item data varies greatly, and we needed to develop a very large set of specific tools to complement what our AI had achieved in this area.

By the end of 2018, assisted greatly by the volume of data from our early partners, we had reached line item data isolation levels of over 85% on our RTP batch.

The AI Breakthrough


In 2019 our advanced multi-language OCR and highly trained models made a breakthrough, achieving extremely accurate rates of extraction. Combined with our own customization tools, we are now reaching over 95% accuracy on many POC batches provided by partners all over the world.

“95% accuracy on line item extraction”

These rates are vastly superior to those of any other technology we have tested against. Our Tokyo data science team have achieved amazing results through the power of AI and their own ingenious techniques and methods. They continue to hone and refine our technology daily, ensuring Tabscanner continues to be the world’s most advanced receipt OCR technology.