How to use PDF Multitool OCR Analyzer to create and test OCR image-to-text configurations for the cloud and on-prem versions of PDF.co
If you are working with scanned PDFs and the extracted text (text, CSV, JSON, XML) is incomplete or inaccurate, consider using our desktop app, ByteScout PDF Multitool (compatible with Windows 7/10/11 and higher). This app emulates most of the major functions of the PDF.co API and, more importantly, allows you to create and test configurations for PDF extraction and image-to-text functions locally.
ByteScout PDF Multitool includes the OCR Analyzer tool, which helps you quickly find the best combination of OCR filters and parameters to enhance the quality of PDF text extraction results.
PDF Multitool and its OCR Analyzer provide JSON code for profiles that can be used with PDF.co. Simply pass this JSON config as the value of the profiles parameter of the PDF To Text/CSV/XML/JSON API methods.
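As a rough illustration of how a copied JSON config might be attached to a request, the sketch below builds (but does not send) a request body for a PDF to Text call. The endpoint path, the url field, the x-api-key header, and passing profiles as a JSON-encoded string are assumptions rather than details stated in this article, so verify them against the PDF.co API documentation. The profile content is a placeholder; replace it with the config copied from PDF Multitool.

```python
import json
import urllib.request

# Assumed endpoint for PDF to Text; check the PDF.co API docs for your method.
API_URL = "https://api.pdf.co/v1/pdf/convert/to/text"


def build_request(api_key, source_url, profile):
    """Build (without sending) a POST request that carries an OCR profile.

    `profile` is the dict form of the JSON config copied from PDF Multitool.
    It is serialized to a string because `profiles` is assumed to be a
    string-valued parameter.
    """
    body = {
        "url": source_url,                # assumed name of the input file field
        "profiles": json.dumps(profile),  # profile JSON embedded as a string
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder profile; paste the config copied from PDF Multitool instead.
example_profile = {"rotationAngle": 1}
request = build_request("YOUR_API_KEY", "https://example.com/scan.pdf", example_profile)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the profile as a dict until the last moment makes it easy to tweak individual settings in code before serializing.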
Step-by-step guide on how to start using the free PDF Multitool app:
- First, download the free version of PDF Multitool from here.
- Next, load your PDF/JPG/PNG document into PDF Multitool.
- Then, in the left navigation menu, select OCR Analyzer.
- Choose the OCR Language and OCR Resolution and click Go.
- Click the Copy To button and select Send to CSV.. (or a similar option) to copy this configuration into the appropriate extractor.
- This will open the PDF Extractor config for PDF to CSV/Text/XML/JSON accordingly.
- Try the new configuration by clicking Preview.
- If you're satisfied with the outcome, go to the Profile for PDF.co tab.
- Click on Copy as payload for PDF.co.
- Finally, paste this as the value of the profiles parameter in your script/code or in the Zapier/Make plugin accordingly.
- If you are not satisfied with the results, try adjusting the parameters and filters on the All Options tab (see Tips and Tricks below).
For a demo on how to use this tool, watch this video: https://youtu.be/NSyyohNNe6E
Tips and Tricks on Finding the Best OCR Settings Using PDF Multitool
- For fuzzy or blurred scans: try increasing OCR Resolution from the default 300 dpi (dots per inch) to 600 or even 800 or 1200 dpi and try again. Note: higher resolution means more time to process the document.
- For dark scans: try adding the Gamma Correction filter with a value of 1.4 (the default) or 1.5 and try again. Note: this filter will make dark images lighter automatically.
- To get text printed near borders or lines, try adding filters that remove lines before extraction. For tables with borders or lines, if the layout is reproduced incorrectly or some words/letters are lost, try adding the Horizontal Line Removal and Vertical Line Removal filters in the All Options - OCRImageProcessingFilters section. Make sure to put these filters first in the list (use the Up and Down buttons to move filters up and down in the list).
- For non-English documents, set the proper recognition language: set OCR Language to the language you see in the document. The default is eng (English). If you have a document in German, set it to deu (German). If you have multiple languages in the same document, select both languages (for example, eng and deu).
- If you don't need the whole page, try limiting extraction to a specific area of the page. This improves both the quality of text extraction and the processing speed. To set the extraction area, click the Select tool on the main toolbar in PDF Multitool and use the mouse to select the area with the source text. Then run extraction and preview again.
  - If the extracted text is missing some important text snippets, try setting an extraction area to extract from. Limiting extraction to a specific area on a page may dramatically increase the quality of the text recognition.
  - If extracting from the whole page produces broken results, try running a few extractions from the same page, each limited to a selected area. For example, extract from the top area, then from the middle area, then from the bottom area, and then combine the results into one file. This helps if the page has different layouts, fonts, or font sizes.
  - Setting the extraction area to exclude headers, footers, and/or side notes in the document may simplify text analysis greatly.
- Removing background noise: lowering Gamma (with values below 1.4) and raising Contrast can effectively remove background noise from images.
- Extracting text from color photos or scans: enhancing the gamma effect on color photos improves the extraction quality. Applying the Grayscale filter before Gamma may yield better gamma effects on color photos. Grayscale alone is generally less useful.
- Removing parasite dots and artifacts that produce small garbled text snippets: combining the Median filter with high-resolution rendering (600+ DPI) can help remove parasite dots from scanned images or fax rasterization artifacts. However, this approach may also remove punctuation symbols.
- Fixing etched/distorted letters: the Dilate filter can be used to repair etched or distorted letters in images.
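The tip about running several extractions over the top, middle, and bottom of a page comes down to simple rectangle arithmetic, sketched below. Both helpers are hypothetical and not part of PDF Multitool or PDF.co: they only compute the areas and join the per-area results. The x, y, width, height string format and the top-left origin are assumptions, and how each area is then applied to an extraction is left out.

```python
def split_into_bands(page_width, page_height, bands=3):
    """Split a page into equal horizontal bands, top to bottom.

    Returns a list of (x, y, width, height) tuples, with y measured from
    the top of the page (an assumption; some tools measure from the bottom).
    """
    if bands < 1:
        raise ValueError("bands must be at least 1")
    band_height = page_height / bands
    return [(0, i * band_height, page_width, band_height) for i in range(bands)]


def combine_results(texts):
    """Join per-band extraction results in reading order, skipping empty ones."""
    return "\n".join(t.strip() for t in texts if t.strip())


# US Letter page in points (612 x 792) split into top, middle, and bottom areas.
areas = split_into_bands(612, 792, bands=3)
area_strings = [", ".join(str(round(v, 1)) for v in area) for area in areas]
```

Equal bands are only a starting point; in practice the boundaries would follow the visible layout of the page, for example where the font size or column structure changes.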
List of OCR image preprocessing filters supported by PDF Multitool and the PDF.co API:
- Contrast - Adds the Contrast image filter, which enhances image quality for OCR by improving contrast. This filter is particularly helpful for images where the text color is gray or similar to the background color. Lowering gamma and raising contrast can effectively remove background noise from images.
- Deskew - Applies the Deskew image filter with a default angle threshold of 0.4 degrees (minimal admissible skew angle). This filter is useful for fixing slight rotation of scanned images. For scans rotated 90, 180, or 270 degrees, use the RotationAngle parameter in profiles instead, for example { 'rotationAngle': 1 }. The available RotationAngle values are:
  - 0 - no rotation (default)
  - 1 - 90 degrees
  - 2 - 180 degrees
  - 3 - 270 degrees
- Dilate - Adds the Dilate image filter, which improves image quality for OCR by thickening the letter strokes. The Dilate filter can be used to repair etched or distorted letters in images.
- Fit - Adds the Fit image filter with a specified size limit. The image is proportionally resized when its width or height exceeds the limit, which improves text extraction performance on large images.
- Gamma - Applies the Gamma Correction filter with a default value of 1.4. This filter enhances image quality for OCR by automatically lightening dark images.
- Grayscale - Applies the Grayscale image filter. Applying the Grayscale filter before Gamma may yield better gamma effects on color photos, although Grayscale alone is less useful.
- HorizontalLinesRemover - Adds the Horizontal Lines Remover image filter. This filter enhances OCR text recognition quality inside and near borders by removing horizontal lines before text recognition. IMPORTANT: this filter is added by default in PDF.co cloud and on-prem. If you don't need it, set profiles to { 'OCRImagePreprocessingFilters.Clear()': [] }
- VerticalLinesRemover - Adds the Vertical Lines Remover image filter. This filter enhances OCR text recognition quality inside and near borders by removing vertical lines before text recognition. IMPORTANT: this filter is added by default in PDF.co cloud and on-prem. If you don't need it, set profiles to { 'OCRImagePreprocessingFilters.Clear()': [] }
- Invert - Adds the Invert (negative) image filter. Sometimes scanned documents are inverted (white text on a black background). This filter fixes that by inverting all colors before extracting text.
- Median - Adds the Median image filter. Combining the Median filter with high-resolution rendering (600+ DPI) can help remove parasite dots from scanned images or fax rasterization artifacts. However, this approach may also remove punctuation symbols.
- Scale - Adds the Scale image filter with a specified scale factor. For example, 2.0 doubles the size of the input image, improving the recognition quality of small letters.
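The rotation codes and the note about the default line-remover filters can be combined into a small helper, sketched below. The function name and the degrees-to-code lookup are illustrative only, while the rotationAngle key and the OCRImagePreprocessingFilters.Clear() key are taken from the filter descriptions above. Merging both keys into a single profile is an assumption.

```python
import json

# Rotation codes as listed above: 0 = none, 1 = 90, 2 = 180, 3 = 270 degrees.
ROTATION_CODES = {0: 0, 90: 1, 180: 2, 270: 3}


def build_profile(rotation_degrees=0, clear_default_filters=False):
    """Return a profile dict for the profiles parameter (hypothetical helper)."""
    if rotation_degrees not in ROTATION_CODES:
        raise ValueError("rotation must be 0, 90, 180, or 270 degrees")
    profile = {"rotationAngle": ROTATION_CODES[rotation_degrees]}
    if clear_default_filters:
        # Clears the preprocessing filter list, which drops the line-remover
        # filters that are added by default.
        profile["OCRImagePreprocessingFilters.Clear()"] = []
    return profile


# Profile for a scan rotated by 90 degrees, with the default filters cleared.
profile_json = json.dumps(build_profile(rotation_degrees=90, clear_default_filters=True))
```

Rejecting angles other than 0, 90, 180, and 270 keeps slight skew out of this parameter, since small rotations are handled by the Deskew filter instead.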
Useful links:
- How to add profiles to PDF.co API request
- ByteScout PDF Multitool - more information at https://bytescout.com/products/pdfmultitool/index.html