What is Document Parser and How It Works
Document Parser is a versatile document parsing engine for accurate, easy data extraction from PDF and scanned documents. Create and maintain extraction templates without coding!
It extracts data from invoices, statements, reports, paystubs, tables, and receipts. It supports both native electronic and scanned PDF files, as well as PNG, JPG, and TIFF images. It supports English, German, French, Spanish, and many other languages, including dual-language documents. It is available via Web API, Zapier, Make, Salesforce, and more.
Visual Document Parser Template Editor
The online version of the PDF.co Document Parser template editor is available here. Create, test, and maintain data extraction templates.
Template Objects
Template objects define what to extract from the input document. These can be:
- Field mapped from Virtual Grid - extracts a value from the virtual grid that the engine generates for the input document. This object is useful if you have documents with the same layout. It is similar to converting the document to a spreadsheet and then telling the engine to get the value from a virtual cell at (row, column).
- Field from Rectangle Selection - extracts data from a rectangle selection defined by coordinates. Use it when your documents always have objects in exactly the same place. You can optionally set an expression (with macros, see below) to additionally locate a matching value (for example, a date or currency) inside the rectangle selection.
- Field from auto Key-Value - searches the document for key-value pairs and generates output objects as key-value pairs. You should define an expression with macros.
- Table from Rectangle - reads a table from a rectangle.
- Field based on Text Search - runs a text search on the whole page for a predefined macro (for example, you can find a date, currency, SSN, or phone number) and returns the match as a value.
- Table based on Search - finds a table using AI or using search expressions defined by a JSON config that contains text search patterns for the beginning and end of the table. With this approach you can extract multipage tables. See more details below.
- Field with Static Value - returns a static value. Useful if you need to return predefined values, such as a template name or company name, along with other objects.
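To make the virtual grid idea more concrete, here is a minimal Python sketch of the spreadsheet analogy described above. It is only an illustration: the splitting rule (two or more spaces separate cells) and the helper names `build_virtual_grid` and `cell` are assumptions made for this example, not the way the engine actually builds its grid.

```python
import re

def build_virtual_grid(page_text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    grid = []
    for line in page_text.splitlines():
        if line.strip():
            grid.append(re.split(r"\s{2,}", line.strip()))
    return grid

def cell(grid, row, column):
    """Return the cell at (row, column), or None if it does not exist."""
    if 0 <= row < len(grid) and 0 <= column < len(grid[row]):
        return grid[row][column]
    return None

page = """Invoice Number:    1234567
Date Issued:       February 1, 2016
Total:             $50.00"""

grid = build_virtual_grid(page)
print(cell(grid, 0, 1))  # 1234567
print(cell(grid, 2, 1))  # $50.00
```

Because the lookup depends only on row and column indexes, it keeps working across documents as long as the layout stays the same, which is the situation this object type is meant for.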
Expression parameter (for field mapped from a rectangle, text-search-based field, field from auto key-value, and others)
The `Expression` parameter can contain:
- Macros (see the list below)
- Regular expressions (don't forget to enable the `regex` checkbox)
- Mixed macros and regular expressions (don't forget to enable the `regex` checkbox)
- Special Functions for AI-powered data extraction of specific values like a company name.
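One way to think about mixing macros with regular expressions is as a substitution step: each `{{Name}}` placeholder is replaced by a regular expression fragment before matching. The sketch below shows that idea in Python. The `MACROS` table and the `expand_macros` helper are hypothetical and exist only for this example; the patterns the engine really uses for each macro are not stated in this document.

```python
import re

# Hypothetical macro-to-regex table for illustration only; the patterns the
# real engine uses for these macros are an assumption of this sketch.
MACROS = {
    "Digits": r"\d+",
    "2Digits": r"\d{2}",
    "Space": r" ",
    "Spaces": r" +",
    "Dollar": r"\$",
    "Minus": r"-",
    "Colon": r":",
}

def expand_macros(expression):
    """Replace every {{Name}} placeholder with its regex fragment."""
    def substitute(match):
        name = match.group(1)
        if name not in MACROS:
            raise KeyError(f"unknown macro: {name}")
        return MACROS[name]
    return re.sub(r"\{\{(\w+)\}\}", substitute, expression)

pattern = expand_macros(r"Invoice Number:{{Spaces}}({{Digits}})")
match = re.search(pattern, "Invoice Number: 1234567")
print(pattern)         # Invoice Number: +(\d+)
print(match.group(1))  # 1234567
```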
SPECIAL MARKERS (FOR USE INSIDE EXPRESSIONS)
You can use special markers inside the `expression` parameter. A marker points to the specific part of the expression that becomes the output value or the field name (otherwise the whole expression match is used):
- `?<value>` marker points to the regular expression group that must be used as the final value for the object. Example: `Invoice (?<value>\d+)` will extract `12345` as the final field value from the string `Invoice 12345`.
- `?<key>` marker points to the regular expression group that must be used as the field name for the object. Important: multiple matches for this expression will auto-generate multiple objects in the output. Example: `(?<key>{{SentenceWithSingleSpaces}}): (?<value>{{SentenceWithSingleSpaces}})` will extract `key1: value, key2: value` as two separate objects named `key1` and `key2` respectively.
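The same named-group mechanics can be tried out with Python's `re` module, as sketched below. Two things here are assumptions of the sketch: Python spells named groups `(?P<name>...)` rather than `(?<name>...)`, and the character classes stand in for `{{SentenceWithSingleSpaces}}`, whose exact pattern is not given in this document.

```python
import re

# Python spells named groups (?P<name>...); the (?<name>...) form shown above
# is the syntax used in the template expressions.
value_pattern = re.compile(r"Invoice (?P<value>\d+)")
invoice_number = value_pattern.search("Invoice 12345").group("value")
print(invoice_number)  # 12345

# The character classes below are a stand-in for {{SentenceWithSingleSpaces}}.
pair_pattern = re.compile(r"(?P<key>[^:,\s][^:,]*): (?P<value>[^:,]+)")
text = "key1: value, key2: value"

# Every match becomes its own output object, named after the key group.
objects = [
    {"name": m.group("key"), "value": m.group("value").strip()}
    for m in pair_pattern.finditer(text)
]
print(objects)
```

Running it yields one object per match, which mirrors how multiple matches of a key-value expression produce multiple output objects.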
Search-based Table Object
This object is defined by a JSON-based configuration.
Table objects return the tabular data you need to extract. Table objects can be defined by:
- rectangle coordinates (use the `Table from Rectangle` object type)
- AI-powered automated table detection, which automatically finds tables on pages (use `Table from Auto Detection`)
- a set of rules that use text or regular expression search to define the table's start, end, and rows (use `Table from Search`)
You can also define multiple table types inside the JSON-based Tables section of this object's configuration.
Table parameters (`tableProperties` object):
- `name` - table name to distinguish different tables in the result.
- `autoDetection` object [optional] - enables AI-based table detection. IMPORTANT: when this section is set and `tableIndex` is not `-1`, other parameters such as `start`, `end`, and `row` are ignored because the table is detected automatically.
  - `pageIndex` [required] - index of the page to find the table on (starts at `0`).
  - `tableIndex` [required] - `-1` by default, which means auto detection is disabled. Set it to `0` or a higher index to detect the corresponding table on the given page, counting from top to bottom.
- `start` object - group of parameters that define the start of the table:
  - `expression` - macro expression to find the start of the table, or
  - `y` - the top coordinate of the table. You can find PDF point coordinates in your PDF file using the PDF.co PDF/Edit/Add Helper.
  - `pageIndex` - index of the page containing the `y` coordinate.
  - `regex` - indicates whether the `expression` parameter contains a regular expression.
- `end` object - group of parameters that define the end of the table:
  - `expression` - macro expression to find the end of the table, or
  - `y` - the bottom coordinate of the table. You can find PDF point coordinates in your PDF file using the PDF.co PDF/Edit/Add Helper.
  - `regex` - indicates whether the `expression` parameter contains a regular expression.
- `subItemStart` object - [optional] parameters that define the start of a table sub-item. Sub-items are used for tables with complex multiline rows:
  - `expression` - macro expression to find the start of the sub-item.
  - `regex` - indicates whether the `expression` parameter contains a regular expression.
- `subItemEnd` object - [optional] parameters that define the end of a table sub-item:
  - `expression` - macro expression to find the end of the sub-item.
  - `regex` - indicates whether the `expression` parameter contains a regular expression.
- `introduction` object - parameters to parse values from sub-headers. Values parsed from the introduction expression are repeated at the beginning of every row:
  - `expression` - macro expression to parse introduction items.
  - `regex` - indicates whether the `expression` parameter contains a regular expression.
- `row` object - [optional] group of parameters that define table rows:
  - `expression` - the main macro expression to find a row. Named groups in this expression go to the result table as columns. See the example below.
  - `regex` - indicates whether `expression` contains a regular expression.
  - `subExpression1`, `subExpression2`, `subExpression3`, `subExpression4`, `subExpression5` - additional expressions to parse the remaining parts of row data that the main expression cannot parse in one pass. Sub-expressions are executed after the main expression on the text chunks between matches of the main expression. They can be used to parse hanging rows (wrapped multiline cells).
- `columns` array - [optional] array that defines column properties. Column names should correspond to the names of the capturing groups in the row expression. Column properties:
  - `name` - defines the column name.
  - `x` - [optional] X coordinate of the left column edge in PDF points. You can find PDF point coordinates in your PDF file using the PDF.co PDF/Edit/Add Helper.
  - `type` - [optional] defines the column data type. Should be one of these values: `string`, `integer`, `date`, `decimal`. See also the type descriptions in the fields section.
  - `dateFormat` - [optional] see the `dateFormat` description in the fields section.
  - `outputDateFormat` - [optional] see the `outputDateFormat` description in the fields section.
  - `coalesceWith` - [optional] name of the column to merge the parsed value with.
- `rowMergingRule` string - [optional] for fields of rectangle type and table data type. Defines the rule for merging multiline data in table cells. Supported values:
  - `none` - default, no rule.
  - `byBorders` - combines lines within a table cell framed by border lines.
  - `hangingRows` - joins a table row that contains only a single cell to the previous row if there is no separating line between them. Useful for tables without borders between rows.
- `multipage` boolean - [optional] defines whether the table may continue on further pages.
- `horizontalSeparationOffset` - offset from the table `start` to the beginning of the first table row.
- `horizontalSeparationStep` - row height. Together with `horizontalSeparationOffset`, this parameter helps the parser distinguish rows in tables without horizontal separators. This works only for tables with a fixed row height.
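The arithmetic behind `horizontalSeparationOffset` and `horizontalSeparationStep` can be sketched as follows. This is an illustration under stated assumptions, not the parser's implementation: the function name `row_bounds` is made up for the example, y is assumed to grow downward from the top of the page, and all values are assumed to be in PDF points.

```python
def row_bounds(table_start_y, offset, step, table_end_y):
    """Return (top, bottom) pairs for each fixed-height row.

    Assumes y grows downward from the top of the page and that all values
    are in PDF points; both are assumptions of this sketch.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    bounds = []
    top = table_start_y + offset  # first row begins after the offset
    while top + step <= table_end_y:
        bounds.append((top, top + step))
        top += step  # every following row has the same height
    return bounds

print(row_bounds(table_start_y=100, offset=20, step=15, table_end_y=170))
# [(120, 135), (135, 150), (150, 165)]
```

The loop only produces whole rows, which is why this approach fits tables with a fixed row height and nothing else.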
Example of table parsing:
| Description | Interval | Quantity | Amount ($) |
| --- | --- | --- | --- |
| Basic Plan | Jan 1 - Jan 31 | 1 | 25.00 |
| Basic Plan | Feb 1 - Feb 28 | 1 | 25.00 |
| Total in USD: | | | 50.00 |
The table above can be parsed with macro expressions or with explicitly defined column coordinates.
1. Extracting the table using the AI-powered auto detector:
`autoDetectTableField` value:
```JSON
{
  "autoDetection": {
    "pageIndex": 0,
    "tableIndex": 0
  },
  "columns": [
    {
      "name": "description",
      "type": "string"
    },
    {
      "name": "interval",
      "type": "string"
    },
    {
      "name": "quantity",
      "type": "integer"
    },
    {
      "name": "amount",
      "type": "decimal"
    }
  ]
}
```
Full template:
```JSON
{
  "templateVersion": 4,
  "templatePriority": 0,
  "culture": "en-US",
  "templateName": "",
  "options": {
    "ocrMode": "auto",
    "ocrLanguage": "eng"
  },
  "objects": [
    {
      "name": "AutoDetectTable",
      "objectType": "table",
      "tableProperties": {
        "autoDetection": {
          "pageIndex": 0,
          "tableIndex": 0
        },
        "columns": [
          {
            "name": "description",
            "type": "string"
          },
          {
            "name": "interval",
            "type": "string"
          },
          {
            "name": "quantity",
            "type": "integer"
          },
          {
            "name": "amount",
            "type": "decimal"
          }
        ]
      }
    }
  ]
}
```
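If you generate templates from code rather than by hand, a small builder can keep the structure consistent. The following Python sketch assembles the same auto-detection template from a list of column definitions. The function name `auto_detect_table_template` and the template name passed to it are hypothetical; the keys are the ones shown in the template above.

```python
import json

def auto_detect_table_template(name, page_index, table_index, columns):
    """Build a template dict with one auto-detected table object.

    `columns` is a list of (name, type) tuples.
    """
    return {
        "templateVersion": 4,
        "templatePriority": 0,
        "culture": "en-US",
        "templateName": name,
        "options": {"ocrMode": "auto", "ocrLanguage": "eng"},
        "objects": [
            {
                "name": "AutoDetectTable",
                "objectType": "table",
                "tableProperties": {
                    "autoDetection": {
                        "pageIndex": page_index,
                        "tableIndex": table_index,
                    },
                    "columns": [
                        {"name": col, "type": kind} for col, kind in columns
                    ],
                },
            }
        ],
    }

template = auto_detect_table_template(
    "Basic Plan Invoice",  # hypothetical template name
    page_index=0,
    table_index=0,
    columns=[
        ("description", "string"),
        ("interval", "string"),
        ("quantity", "integer"),
        ("amount", "decimal"),
    ],
)
print(json.dumps(template, indent=2))
```

Building the dict first and serializing it with `json.dumps` avoids hand-editing mistakes such as repeated keys or trailing commas.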
2. Extracting the table using markers defined by macros:
`searchBasedTable` object properties:
```JSON
{
  "start": {
    "expression": "Amount{{Space}}{{OpeningParenthesis}}{{Dollar}}{{ClosingParenthesis}}"
  },
  "end": {
    "expression": "Total in USD"
  },
  "row": {
    "expression": "{{LineStart}}{{Spaces}}(?<description>{{SentenceWithSingleSpaces}})(?<interval>{{3Letters}}{{Space}}{{Digits}}{{Space}}{{Minus}}{{Space}}{{3Letters}}{{Space}}{{Digits}}){{Spaces}}(?<quantity>{{Digits}}){{Spaces}}(?<amount>{{Number}})",
    "regex": true
  },
  "columns": [
    {
      "name": "description",
      "type": "string"
    },
    {
      "name": "interval",
      "type": "string"
    },
    {
      "name": "quantity",
      "type": "integer"
    },
    {
      "name": "amount",
      "type": "decimal"
    }
  ]
}
```
Full template:
```JSON
{
  "templateVersion": 4,
  "templatePriority": 0,
  "culture": "en-US",
  "templateName": "",
  "options": {
    "ocrMode": "auto",
    "ocrLanguage": "eng"
  },
  "objects": [
    {
      "name": "searchBasedTable1",
      "objectType": "table",
      "tableProperties": {
        "start": {
          "expression": "Amount{{Space}}{{OpeningParenthesis}}{{Dollar}}{{ClosingParenthesis}}"
        },
        "end": {
          "expression": "Total in USD"
        },
        "row": {
          "expression": "{{LineStart}}{{Spaces}}(?<description>{{SentenceWithSingleSpaces}})(?<interval>{{3Letters}}{{Space}}{{Digits}}{{Space}}{{Minus}}{{Space}}{{3Letters}}{{Space}}{{Digits}}){{Spaces}}(?<quantity>{{Digits}}){{Spaces}}(?<amount>{{Number}})",
          "regex": true
        },
        "columns": [
          {
            "name": "description",
            "type": "string"
          },
          {
            "name": "interval",
            "type": "string"
          },
          {
            "name": "quantity",
            "type": "integer"
          },
          {
            "name": "amount",
            "type": "decimal"
          }
        ]
      }
    }
  ]
}
```
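To see how the `start`, `end`, and `row` expressions cooperate, the following Python sketch imitates the search-based approach on a plain-text rendering of the example table. It is a simplified model, not the engine's algorithm: the column spacing in `TEXT` and the hand-written regular expressions that stand in for the macros are assumptions made for this example.

```python
import re

# Assumed plain-text rendering of the example table.
TEXT = """\
Description   Interval          Quantity   Amount ($)
Basic Plan    Jan 1 - Jan 31    1          25.00
Basic Plan    Feb 1 - Feb 28    1          25.00
Total in USD:                              50.00
"""

# Hand-written regex stand-ins for the macros; these patterns are assumptions.
START = re.compile(r"Amount \(\$\)")
END = re.compile(r"Total in USD")
ROW = re.compile(
    r"^ *(?P<description>\S+(?: \S+)*) +"
    r"(?P<interval>[A-Za-z]{3} \d+ - [A-Za-z]{3} \d+) +"
    r"(?P<quantity>\d+) +"
    r"(?P<amount>\d+(?:\.\d+)?)",
    re.MULTILINE,
)

def extract_rows(text):
    """Return the rows found between the start and end markers."""
    start = START.search(text)
    if start is None:
        return []
    end = END.search(text, start.end())
    # Only the text between the two markers is scanned for rows.
    body = text[start.end(): end.start() if end else len(text)]
    return [
        {
            "description": m["description"],
            "interval": m["interval"],
            "quantity": int(m["quantity"]),  # "integer" column type
            "amount": float(m["amount"]),    # "decimal" column type
        }
        for m in ROW.finditer(body)
    ]

rows = extract_rows(TEXT)
for row in rows:
    print(row)
```

Each named group in the row pattern becomes a column, which is the same relationship the `columns` array relies on when it refers to capturing group names.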
Macros
Built-in macros:
Macro |
Description |
|
Tries to detect the date in the most common formats. |
|
Decimal number like the following: “12.34”, “-123,456.78”, “123.456”. Decimal separator and thousands separator are automatically taken from the template culture. |
|
Decimal number with currency symbol like the following: “USD 12.34”, “$123,456.78”, “123.45 €”. Decimal separator and thousands separator are automatically taken from the template culture. |
|
Tries to detect US phone number. |
|
Single space. |
|
One or more spaces. |
|
Two spaces. |
|
Three spaces. |
|
Four spaces. |
|
Five spaces. |
|
Six spaces. |
|
Seven spaces. |
|
Eight spaces. |
|
Nine spaces. |
|
Ten spaces. |
|
One digit. |
|
One or more digits. |
|
Two digits. |
|
Three digits. |
|
Four digits. |
|
Five digits. |
|
Six digits. |
|
Seven digits. |
|
Eight digits. |
|
Nine digits. |
|
Ten digits. |
|
One digit or symbol (“_-+=/”). |
|
One or more digits or symbols (“_-+=/”). |
|
Two digits or symbols (“_-+=/”). |
|
Three digits or symbols (“_-+=/”). |
|
Four digits or symbols (“_-+=/”). |
|
Five digits or symbols (“_-+=/”). |
|
Six digits or symbols (“_-+=/”). |
|
Seven digits or symbols (“_-+=/”). |
|
Eight digits or symbols (“_-+=/”). |
|
Nine digits or symbols (“_-+=/”). |
|
Ten digits or symbols (“_-+=/”). |
|
One letter from any language. |
|
One or more letters from any language. |
|
Two letters from any language. |
|
Three letters from any language. |
|
Four letters from any language. |
|
Five letters from any language. |
|
Six letters from any language. |
|
Seven letters from any language. |
|
Eight letters from any language. |
|
Nine letters from any language. |
|
Ten letters from any language. |
|
One uppercase letter from any language. |
|
One or more uppercase letters from any language. |
|
Two uppercase letters from any language. |
|
Three uppercase letters from any language. |
|
Four uppercase letters from any language. |
|
Five uppercase letters from any language. |
|
Six uppercase letters from any language. |
|
Seven uppercase letters from any language. |
|
Eight uppercase letters from any language. |
|
Nine uppercase letters from any language. |
|
Ten uppercase letters from any language. |
|
One letter or digit. |
|
One or more letters or digits. |
|
Two letters or digits. |
|
Three letters or digits. |
|
Four letters or digits. |
|
Five letters or digits. |
|
Six letters or digits. |
|
Seven letters or digits. |
|
Eight letters or digits. |
|
Nine letters or digits. |
|
Ten letters or digits. |
|
One uppercase letter or digit. |
|
One or more uppercase letters or digits. |
|
Two uppercase letters or digits. |
|
Three uppercase letters or digits. |
|
Four uppercase letters or digits. |
|
Five uppercase letters or digits. |
|
Six uppercase letters or digits. |
|
Seven uppercase letters or digits. |
|
Eight uppercase letters or digits. |
|
Nine uppercase letters or digits. |
|
Ten uppercase letters or digits. |
|
One letter, or digit, or symbol (“_-+=/”). |
|
One or more letters, or digits, or symbols (“_-+=/”). |
|
Two letters, or digits, or symbols (“_-+=/”). |
|
Three letters, or digits, or symbols (“_-+=/”). |
|
Four letters, or digits, or symbols (“_-+=/”). |
|
Five letters, or digits, or symbols (“_-+=/”). |
|
Six letters, or digits, or symbols (“_-+=/”). |
|
Seven letters, or digits, or symbols (“_-+=/”). |
|
Eight letters, or digits, or symbols (“_-+=/”). |
|
Nine letters, or digits, or symbols (“_-+=/”). |
|
Ten letters, or digits, or symbols (“_-+=/”). |
|
One uppercase letter, or digit, or symbol (“_-+=/”). |
|
One or more uppercase letters, or digits, or symbols (“_-+=/”). |
|
Two uppercase letters, or digits, or symbols (“_-+=/”). |
|
Three uppercase letters, or digits, or symbols (“_-+=/”). |
|
Four uppercase letters, or digits, or symbols (“_-+=/”). |
|
Five uppercase letters, or digits, or symbols (“_-+=/”). |
|
Six uppercase letters, or digits, or symbols (“_-+=/”). |
|
Seven uppercase letters, or digits, or symbols (“_-+=/”). |
|
Eight uppercase letters, or digits, or symbols (“_-+=/”). |
|
Nine uppercase letters, or digits, or symbols (“_-+=/”). |
|
Ten uppercase letters, or digits, or symbols (“_-+=/”). |
|
Dollar sign ($). |
|
Euro sign (€). |
|
Pound sign (£). |
|
Yen sign (¥). |
|
Yuan sign (¥). |
|
Any currency symbol ($, €, £, ¥, etc.) |
|
Single dot symbol (“.”). |
|
Single comma symbol (“,”). |
|
Single colon symbol (“:”). |
|
Single semicolon symbol (“;”). |
|
Single minus (dash, hyphen) symbol (“-”). |
|
Slash symbol (“/”). |
|
Backslash symbol (“\”). |
|
Percent symbol (“%”). |
|
Start of line (virtual symbol). |
|
End of line (virtual symbol). |
|
Single-space-separated sequence of words and symbols. Breaks on double space. |
|
Extended {{SentenceWithSingleSpaces}} macro allowing two spaces between words. Breaks on triple space. |
|
End of page or end of document. |
|
Start or end of word (virtual symbol). |
|
Opening curly brace symbol (“{”). |
|
Closing curly brace symbol (“}”). |
|
Opening parenthesis symbol (“(”). |
|
Closing parenthesis symbol (“)”). |
|
Opening square bracket symbol (“[”). |
|
Closing square bracket symbol (“]”). |
|
Opening angle bracket symbol (“<”). |
|
Closing angle bracket symbol (“>”). |
|
Date in format “01/01/19” (with leading zero). |
|
Date in format “1/1/19” (without leading zero). |
|
Date in format “01/01/2019” (with leading zero). |
|
Date in format “1/1/2019” (without leading zero). |
|
Date in format “01-01-19” (with leading zero). |
|
Date in format “1-1-19” (without leading zero). |
|
Date in format “01-01-2019” (with leading zero). |
|
Date in format “1-1-2019” (without leading zero). |
|
Date in format “01.01.19” (with leading zero). |
|
Date in format “1.1.19” (without leading zero). |
|
Date in format “01.01.2019” (with leading zero). |
|
Date in format “1.1.2019” (without leading zero). |
|
Date in format “01/01/19” (with leading zero). |
|
Date in format “1/1/19” (without leading zero). |
|
Date in format “01/01/2019” (with leading zero). |
|
Date in format “1/1/2019” (without leading zero). |
|
Date in format “01-01-19” (with leading zero). |
|
Date in format “1-1-19” (without leading zero). |
|
Date in format “01-01-2019” (with leading zero). |
|
Date in format “1-1-2019” (without leading zero). |
|
Date in format “01.01.19” (with leading zero). |
|
Date in format “1.1.19” (without leading zero). |
|
Date in format “01.01.2019” (with leading zero). |
|
Date in format “1.1.2019” (without leading zero). |
|
Date in format “20190101”. |
|
Date in format “2019/01/01” (with leading zero). |
|
Date in format “2019/1/1” (without leading zero). |
|
Date in format “2019-01-01” (with leading zero). |
|
Date in format “2019-1-1” (without leading zero). |
|
Any characters up to the next macro in the expression. |
|
Any characters up to the next macro in the expression or to the end of line. Greedy version. |
|
Enables or disables single-line mode. In single-line mode, {{Anything}} and {{AnythingGreedy}} macros do not stop at the end of the line and proceed to the next line of text. |
|
Enables or disables case-insensitive mode. |
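As a rough analogy for the difference between `{{Anything}}` and `{{AnythingGreedy}}`, and for what single-line mode changes, the sketch below uses Python's `re` module. It illustrates lazy versus greedy matching in general regular expressions; treating the macros as equivalent to `.*?` and `.*` is an assumption of this example, not a statement about how the parser implements them.

```python
import re

text = "Ref: A-1 end B-2 end\nC-3 end"

# Lazy: stop at the first place where the following token can match.
lazy = re.search(r"Ref: (.*?) end", text).group(1)

# Greedy: take as much as possible, but stay on the current line because
# "." does not match a newline by default.
greedy = re.search(r"Ref: (.*) end", text).group(1)

# DOTALL lets "." cross line breaks, similar in spirit to single-line mode.
across_lines = re.search(r"Ref: (.*) end", text, re.DOTALL).group(1)

print(repr(lazy))          # 'A-1'
print(repr(greedy))        # 'A-1 end B-2'
print(repr(across_lines))  # 'A-1 end B-2 end\nC-3'
```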
Special Functions
You can also insert a so-called special function, which looks like this: `$$functionName`. Special functions are designed for AI-powered value detection, such as finding a company name, the maximum number in a whole document, or the maximum date, or even finding and decoding a QR Code barcode value inside the document.
All special functions are listed here.
Sample templates
Sample document text:
DigitalOcean
101 Avenue of the Americas, 10th Floor
New York, NY 10013
Date Issued: February 1, 2016
Period: January 1 - 31, 2016
Invoice Number: 1234567
Description Hours Start End USD
Website-Dev (1GB) 744 01-01 00:00 01-31 23:59 $10.00
Website-Live (1GB) 744 01-01 00:00 01-31 23:59 $10.00
Database-Live (2GB) 744 01-01 00:00 01-31 23:59 $20.00
Tasks-Dev (1GB) 744 01-01 00:00 01-31 23:59 $10.00
Total: $50.00
Bill To:
Samee Sikka <admin@meee.org>
meee.org
Gouran
If you have a credit card on file it will be automatically charged within 24 hours.
Sample template (JSON):
```JSON
{
  "templateVersion": 4,
  "templatePriority": 0,
  "templateName": "DigitalOcean Invoice",
  "objects": [
    {
      "name": "companyName",
      "objectType": "field",
      "fieldProperties": {
        "fieldType": "static",
        "expression": "DigitalOcean"
      }
    },
    {
      "name": "invoiceId",
      "objectType": "field",
      "fieldProperties": {
        "fieldType": "macros",
        "expression": "Invoice Number: ({{Digits}})",
        "regex": true
      }
    },
    {
      "name": "dateIssued",
      "objectType": "field",
      "fieldProperties": {
        "fieldType": "macros",
        "expression": "Date Issued: ({{SmartDate}})",
        "dataType": "date",
        "dateFormat": "auto-mdy"
      }
    },
    {
      "name": "total",
      "objectType": "field",
      "fieldProperties": {
        "fieldType": "macros",
        "expression": "Total: {{Dollar}}({{Number}})",
        "dataType": "decimal"
      }
    },
    {
      "name": "currency",
      "objectType": "field",
      "fieldProperties": {
        "fieldType": "static",
        "expression": "USD"
      }
    },
    {
      "name": "table1",
      "objectType": "table",
      "tableProperties": {
        "start": {
          "expression": "Description{{Spaces}}Hours"
        },
        "end": {
          "expression": "Total:"
        },
        "row": {
          "expression": "{{LineStart}}{{Spaces}}(?<description>{{SentenceWithSingleSpaces}}){{Spaces}}(?<hours>{{Digits}}){{Spaces}}(?<start>{{2Digits}}{{Minus}}{{2Digits}}{{Space}}{{2Digits}}{{Colon}}{{2Digits}}){{Spaces}}(?<end>{{2Digits}}{{Minus}}{{2Digits}}{{Space}}{{2Digits}}{{Colon}}{{2Digits}}){{Spaces}}{{Dollar}}(?<unitPrice>{{Number}})",
          "regex": true
        },
        "columns": [
          {
            "name": "hours",
            "type": "integer"
          },
          {
            "name": "unitPrice",
            "type": "decimal"
          }
        ]
      }
    }
  ]
}
```
Result (JSON):
```JSON
{
  "templateName": "DigitalOcean Invoice",
  "templateVersion": "4",
  "objects": [
    {
      "name": "companyName",
      "objectType": "field",
      "value": "DigitalOcean"
    },
    {
      "name": "invoiceId",
      "objectType": "field",
      "value": "1234567",
      "pageIndex": 0
    },
    {
      "name": "dateIssued",
      "objectType": "field",
      "value": "2016-02-01T00:00:00",
      "pageIndex": 0
    },
    {
      "name": "total",
      "objectType": "field",
      "value": 50.00,
      "pageIndex": 0
    },
    {
      "name": "currency",
      "objectType": "field",
      "value": "USD"
    },
    {
      "name": "table1",
      "objectType": "table",
      "rows": [
        {
          "description": {
            "value": "Website-Dev (1GB)",
            "pageIndex": 0
          },
          "hours": {
            "value": 744,
            "pageIndex": 0
          },
          "start": {
            "value": "01-01 00:00",
            "pageIndex": 0
          },
          "end": {
            "value": "01-31 23:59",
            "pageIndex": 0
          },
          "unitPrice": {
            "value": 10.00,
            "pageIndex": 0
          }
        },
        {
          "description": {
            "value": "Website-Live (1GB)",
            "pageIndex": 0
          },
          "hours": {
            "value": 744,
            "pageIndex": 0
          },
          "start": {
            "value": "01-01 00:00",
            "pageIndex": 0
          },
          "end": {
            "value": "01-31 23:59",
            "pageIndex": 0
          },
          "unitPrice": {
            "value": 10.00,
            "pageIndex": 0
          }
        },
        {
          "description": {
            "value": "Database-Live (2GB)",
            "pageIndex": 0
          },
          "hours": {
            "value": 744,
            "pageIndex": 0
          },
          "start": {
            "value": "01-01 00:00",
            "pageIndex": 0
          },
          "end": {
            "value": "01-31 23:59",
            "pageIndex": 0
          },
          "unitPrice": {
            "value": 20.00,
            "pageIndex": 0
          }
        },
        {
          "description": {
            "value": "Tasks-Dev (1GB)",
            "pageIndex": 0
          },
          "hours": {
            "value": 744,
            "pageIndex": 0
          },
          "start": {
            "value": "01-01 00:00",
            "pageIndex": 0
          },
          "end": {
            "value": "01-31 23:59",
            "pageIndex": 0
          },
          "unitPrice": {
            "value": 10.00,
            "pageIndex": 0
          }
        }
      ]
    }
  ]
}
```
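Once you have a result in the shape shown above, consuming it is mostly a matter of walking the `objects` array. The Python sketch below shows one possible way to do that. The helper names `fields` and `table_rows` are made up for this example, and `RESULT` is a trimmed copy of the result that keeps only two fields and two table rows.

```python
import json

# Trimmed copy of the result above: two fields and two of the four table rows.
RESULT = """
{
  "objects": [
    {"name": "invoiceId", "objectType": "field", "value": "1234567"},
    {"name": "total", "objectType": "field", "value": 50.00},
    {"name": "table1", "objectType": "table", "rows": [
      {"description": {"value": "Website-Dev (1GB)", "pageIndex": 0},
       "unitPrice": {"value": 10.00, "pageIndex": 0}},
      {"description": {"value": "Database-Live (2GB)", "pageIndex": 0},
       "unitPrice": {"value": 20.00, "pageIndex": 0}}
    ]}
  ]
}
"""

def fields(result):
    """Map field names to their values."""
    return {
        obj["name"]: obj["value"]
        for obj in result["objects"]
        if obj["objectType"] == "field"
    }

def table_rows(result, table_name):
    """Flatten the cells of one table into plain dicts, dropping pageIndex."""
    for obj in result["objects"]:
        if obj["objectType"] == "table" and obj["name"] == table_name:
            return [
                {col: cell["value"] for col, cell in row.items()}
                for row in obj["rows"]
            ]
    return []

result = json.loads(RESULT)
rows = table_rows(result, "table1")
print(fields(result)["invoiceId"])            # 1234567
print(sum(row["unitPrice"] for row in rows))  # 30.0
```

Flattening each cell to its `value` makes the rows easy to pass to a CSV writer or a data frame, at the cost of losing the per-cell `pageIndex`.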
Copyright (c) 2018-2024 ByteScout, Inc.