What is Document Understanding?

Introduction to UiPath’s new AI Tool

Cats & Docs
Learning What Things Are

Once upon a time, we learned our first language by bigger humans pointing out things to us and lovingly telling us the word for them. “That’s a cat.” Eventually, we learned to form the words ourselves and started pointing our tiny fingers at things we recognized, saying the word, and getting feedback. “Yes! That’s a cat! Good job!” If we made a mistake we were immediately corrected, “No, that’s a dog”. Eventually, with this constant training and refining, we learned to recognize that there are many objects called “cat”. They may have different colors, sizes, poses, shown as images, cartoons, or real life, but they are all called “cat”. We learned to recognize the pattern for cat.

Machine learning works in the same way. Shown enough examples and given feedback, a computer can be trained to recognize a pattern for practically anything. These learned patterns are called models. And now there’s plenty of data around to train the models. By 2025, the amount of data on the internet is expected to reach 175 zettabytes. (A zettabyte is a 1 followed by 21 zeroes!) The list of models used all around us is large and growing. There are fingerprint models, models that recognize faces at an airport or to unlock your phone, models that map your way to work, models that count suitcases on a ramp, and models that learn the best time to schedule a flight. Models now recognize different languages and proper sentence structure, both spoken and written. Document-based models include invoices, ID cards, purchase orders, free-flowing Amazon reviews, and recognizing text within images. UiPath Document Understanding (DU) leverages this rapidly expanding technology to allow automations to use documents as input and extract the desired information, just like a human would.

Document Understanding Overview

UiPath, confusingly, uses the name Document Understanding for two things. First, Document Understanding (Studio) is the general term for any automation created in Studio, by professional developers, that uses any of the AI activities from various packages to read, interpret, classify, and extract information from documents or images. The other, Document Understanding (Service), is UiPath's online version of the same thing. It simplifies the steps for non-professional developers and automates some of the interconnections, but it is more limited in its capabilities. Regardless of the tool used, the overall DU process steps remain the same.

DU has 5 primary stages: Taxonomy, Digitization, Classification, Extraction, and Export. As it analyzes each document, an ongoing human validation loop helps the computer learn the patterns better and better. The more data the computer trains on, the more refined and confident its model becomes.

The DU Process In a Nutshell
  1. What am I looking for? (Taxonomy)
  2. Is it readable? (Digitization)
    • No
      • OK, use OCR
      • Go back to #2
  3. Can I tell what kind of document it is? (Classification)
    • No
      • Hey, human, can you help me? (Validation)
      • OK, got it, I’ll learn my lesson for next time (Train)
      • Go back to #3
  4. Can I extract the desired data from it? (Extraction)
    • No
      • Hey, human, can you help me? (Validation)
      • OK, got it, I’ll learn my lesson for next time (Train)
      • Go back to #4
  5. Return Extracted Data (Export)
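The flow above can be sketched as a small Python program. Everything here is a hypothetical stand-in for illustration, not a real UiPath API:

```python
# Illustrative sketch of the DU loop. All function names are hypothetical
# stand-ins, not real UiPath APIs.

CONFIDENCE_THRESHOLD = 0.8  # minimum confidence to skip human validation

def digitize(doc):
    # Step 2: a native (typed-in) document already carries its text;
    # an image-based one would need an OCR pass first.
    return doc.get("text") or ocr(doc["image"])

def ocr(image):
    return image  # placeholder for a real OCR engine call

def classify(text):
    # Step 3: return (document_type, confidence) from simple keywords.
    lowered = text.lower()
    if "invoice" in lowered:
        return "Invoices", 0.95
    if "receipt" in lowered:
        return "Receipts", 0.90
    return None, 0.0

def ask_human(text):
    return "Receipts"  # stand-in for the human validation station

def extract(doc_type, text):
    # Steps 4-5: pull the taxonomy fields and return them (Export).
    return {"DocumentType": doc_type, "RawText": text}

def process(doc):
    text = digitize(doc)
    doc_type, confidence = classify(text)
    if confidence < CONFIDENCE_THRESHOLD:
        doc_type = ask_human(text)  # Validation / Train loop
    return extract(doc_type, text)

result = process({"text": "Invoice #123  Total Due: $50.00"})
```

The key shape to notice is the fallback: whenever the classifier's confidence drops below the threshold, the document is routed to a human instead of proceeding on a shaky guess.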

Let’s look briefly at each stage in a DU process to get an overall understanding of how it works.

Taxonomy

Before it can work on any document, DU needs to know what it is looking for. It receives instructions via something called a Taxonomy. Taxonomy is defined as the science of organizing and categorizing things into a structured system. In UiPath, a Taxonomy is simply a JSON file that tells DU the types of documents it will read, and what data to extract from each document. Think of the taxonomy as the recipe for DU to follow. A taxonomy organizes document types into a hierarchy up to two levels deep, and then defines the specific fields to look for within each type of document.

As an example, let’s say our DU process will be analyzing receipts and invoices. Our taxonomy structure might look something like this:

  • AccountsPayable (Group)
    • Invoices (Document Type)
      • InvoiceDate (Date)
      • AccountNumber (Text)
      • Vendor (Text)
      • TotalDue (Number)
    • Receipts (Document Type)
      • Vendor (Text)
      • Date (Date)
      • TotalPaid (Number)
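Under the hood, a taxonomy like this is stored as JSON. Here is a simplified, hypothetical version of what that file might contain (the actual taxonomy file UiPath generates has more metadata, but the shape is similar):

```python
import json

# A simplified, hypothetical taxonomy. The real file UiPath Studio
# generates contains more metadata, but the nesting is the same idea:
# Group -> Document Types -> Fields.
taxonomy_json = """
{
  "Groups": [
    {
      "Name": "AccountsPayable",
      "DocumentTypes": [
        {
          "Name": "Invoices",
          "Fields": [
            {"Name": "InvoiceDate", "Type": "Date"},
            {"Name": "AccountNumber", "Type": "Text"},
            {"Name": "Vendor", "Type": "Text"},
            {"Name": "TotalDue", "Type": "Number"}
          ]
        },
        {
          "Name": "Receipts",
          "Fields": [
            {"Name": "Vendor", "Type": "Text"},
            {"Name": "Date", "Type": "Date"},
            {"Name": "TotalPaid", "Type": "Number"}
          ]
        }
      ]
    }
  ]
}
"""

taxonomy = json.loads(taxonomy_json)
```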

Digitization

Once DU knows what it’s looking for, it can begin analyzing each document. If the document is in a native (typed-in) format, it can proceed directly. However, some documents are image-based. For those, DU must first use an Optical Character Recognition (OCR) engine to parse the image into text. Parsing an image into text is called Digitization.

Different OCR engines have different capabilities, such as recognizing handwritten text, checkboxes, signatures, barcodes, QR codes, etc. Developers may try different engines for different projects. OCR engines are typically hosted online and called through an API; the UiPath default OCR engines connect automatically through the UiPath internet connection.

This stage returns the text of the document as a string, and something called the Document Object Model (DOM). The DOM is a hierarchical tree-like structure with branches, nodes, and properties. This is the standard way documents today are structured in a computer-readable format.
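To make the DOM idea concrete, here is a loosely sketched, hypothetical structure (not UiPath's exact schema): a tree of pages containing words, where each word carries its position on the page.

```python
# Hypothetical sketch of a Document Object Model: a tree of pages and
# words, each word carrying position info. UiPath's actual DOM schema
# differs in detail, but the idea is the same.
dom = {
    "pages": [
        {
            "number": 1,
            "words": [
                {"text": "Invoice", "box": [72, 40, 120, 18]},  # x, y, w, h
                {"text": "Total",   "box": [72, 300, 60, 14]},
                {"text": "Due:",    "box": [140, 300, 50, 14]},
            ],
        }
    ]
}

# The flat text string DU also returns can be rebuilt from the DOM:
text = " ".join(w["text"] for p in dom["pages"] for w in p["words"])
```

Keeping position data alongside the text is what lets later stages reason about layout, e.g. "the number to the right of the words Total Due".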

Classification

Up to this point, we have told DU what to extract from each type of document (taxonomy), and have turned the document into a DOM that it can read (digitization). But we haven’t yet told DU how to marry the two. How does it know where in the taxonomy each document belongs? How does it know if it is a receipt or an invoice? We humans would naturally scan the document for words like “Invoice”, “Receipt”, “Total Due”, or “Total Paid” to tell what it is. We have to teach DU to do the same. This process of pinpointing the document to the exact document type in the taxonomy is called Classification. The job of classification is done by Classifiers.

What is a Classifier?

A Classifier is a piece of code that does what we humans would naturally do to figure out what kind of document this is. We humans know that an Invoice is a bill to be paid, and so we look for words like Invoice and Total Due, or some variation thereof. The computer, being dumb, first needs to learn keywords, and any variations of them, that might appear in a document to indicate it is an Invoice. Then, it scans incoming documents for these keywords and makes its choice: “I think this is an invoice.” It returns its best guess along with a confidence score between 0 and 1, where 0 is not confident at all and 1 is 100% confident.

How Does a Classifier Learn?

First and foremost, the Classifier needs a recipe of keywords to follow. This is created by the developer when setting up a particular Classifier. There are two basic ways to set one up: by manually defining the keywords in advance, or by letting AI do it for you. Let’s look at both of these ways.

Keyword Based Classifier
With this Classifier, the developer uses a wizard to set up combinations of keywords manually. This type of Classifier is useful if the expected documents will be in a structured, non-changing format. It gives the developer fine-grained control over the keyword combinations.
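The keyword-matching idea can be sketched in a few lines of Python. The keyword sets below are invented for illustration; a real Keyword Based Classifier is configured through UiPath's wizard, not written by hand:

```python
# Minimal sketch of a keyword-based classifier. The keyword sets are
# invented for illustration, not taken from UiPath.
KEYWORDS = {
    "Invoices": {"invoice", "total due", "account number"},
    "Receipts": {"receipt", "total paid"},
}

def classify(text):
    """Return (best_type, confidence), with confidence in [0, 1]."""
    lowered = text.lower()
    # Score each type by the fraction of its keywords found in the text.
    scores = {
        doc_type: sum(kw in lowered for kw in kws) / len(kws)
        for doc_type, kws in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

doc_type, confidence = classify("INVOICE #42  Total Due: $99.00")
```

Here "invoice" and "total due" match but "account number" does not, so the guess is Invoices with a confidence of about 0.67 — the kind of score a validation threshold would later act on.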

Intelligent Keyword Classifier
Here, the developer calls on AI to analyze multiple documents of the same type to extract combinations of keywords (called word vectors) that are common among them. The more documents we “train” the Classifier on, the more confident the Classifier will be in its guesses. This type of Classifier is also able to split merged documents into separate types.

Once the developer has set up one or more Classifiers, all this keyword training is saved into a special JSON file, called something like classification.json. The recipe must live in its own file so the Classifier can update it as it learns.

Continual Learning
So we’ve trained our Classifier on an initial set of keywords to be able to recognize what type of document it is. But it’s not done learning yet.

Now we can let the Classifier ask for help if it’s not confident. We set a confidence minimum, say 80%, for it to proceed with its best guess. If it doesn’t reach that minimum level of confidence, we launch something called a Classifier Trainer. Here the human is shown the full document (in text form, plus the original image if OCR was used) and asked what type of document it is. Once the human gives the answer, the Classifier updates the JSON classification file, learning the lesson for next time.
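The threshold check at the heart of this loop is simple. The sketch below is illustrative only: `ask_human` stands in for the Classifier Trainer screen, and an in-memory dict stands in for the classification file UiPath actually manages:

```python
CONFIDENCE_THRESHOLD = 0.80  # proceed only if at least 80% confident

# In-memory stand-in for the classification.json learning file.
learned_examples = {}

def ask_human(text):
    # Stand-in for the Classifier Trainer, where a person labels the doc.
    return "Receipts"

def maybe_train(doc_type, confidence, text):
    if confidence >= CONFIDENCE_THRESHOLD:
        return doc_type                  # confident: keep the best guess
    corrected = ask_human(text)          # Validation
    learned_examples.setdefault(corrected, []).append(text)  # Train
    return corrected

result = maybe_train("Invoices", 0.55, "Thanks!  Total Paid: $12.00")
```

A 55%-confident guess falls below the threshold, so the document goes to the human and the corrected label is recorded for next time.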

Extraction

Now that DU recognizes the type of file it is looking at, it also needs to know how to recognize each piece of data within that file. Just like there are different Classifiers, there are also different Extractors, which use different methods for finding the data. Let’s touch on them briefly.

RegEx Based Extractor
A Regular Expression is a specific sequence of symbols and characters used to quickly match text. It looks like gobbledygook to most of us, but computers like it. For example, the RegEx for finding a Date field might look like this:

\bDate:\s*([0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4})\b

Riiiight. RegEx is complicated. It takes a lot of time to get good at putting together RegEx patterns. To help us out, UiPath provides a RegEx wizard to experiment with until we get the right match for each piece of data we want to extract. Fortunately, the internet is full of accepted RegEx patterns for extracting all kinds of text, like emails, phone numbers, zip codes, etc. The RegEx Based Extractor does give developers fine-grained control over what it extracts, once they get the pattern right. This type of Extractor cannot be used in a continual learning loop; for that we need an AI-based extractor.
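To see the Date pattern above in action, here is how it behaves when applied with Python's standard `re` module (the sample receipt text is invented for illustration):

```python
import re

# The Date pattern from above, applied with Python's re module.
pattern = r"\bDate:\s*([0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4})\b"

# A made-up line of receipt text for illustration.
match = re.search(pattern, "Receipt  Date: 3/14/2023  Total Paid: $9.99")
date = match.group(1) if match else None
```

The capture group (the part in parentheses) pulls out just the date itself, without the "Date:" label.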

Machine Learning Extractor
coming soon

Export

The final stage of the DU process is to return the pieces of data from the analyzed document. All the data extracted is stored in a collection of Data Tables, and so can be manipulated and exported like any other UiPath Data Table.
