PDF To XML and forcing OCR for text extraction from scanned images inside PDF
For Zapier, Integromat, and other plugins, insert the custom profile into the profiles field. For API calls, pass the profile as a string in the profiles parameter.
Extracting XML from a scanned PDF can fail when the file contains both scanned images and long text objects (“Generated by Foxit PDF Creator ….”). Optical Character Recognition (OCR) runs automatically only when a document contains no text, so the presence of those text objects keeps OCR from running. For such documents, OCR can be forced with a custom profile that sets the OCRMode option.
The following profile forces OCR:
{ "OCRMode": "TextFromImagesOnly" }
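For API calls, the profile above has to be sent as a string rather than as a nested object. A minimal Python sketch of that step follows. The base URL, the url parameter name, the x-api-key header, and the build_request helper are assumptions for illustration, not taken from this article. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumed base URL; this article lists only the endpoint paths.
BASE_URL = "https://api.pdf.co/v1"


def build_request(api_key, file_url, profile, endpoint="/pdf/convert/to/xml"):
    """Build (but do not send) a POST request with `profiles` as a JSON string."""
    payload = {
        # Assumed parameter name for the source PDF location.
        "url": file_url,
        # The profile is serialized to a string, as the article requires.
        "profiles": json.dumps(profile),
    }
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# Hypothetical key and file URL, for illustration only.
request = build_request(
    "YOUR_API_KEY",
    "https://example.com/scanned.pdf",
    {"OCRMode": "TextFromImagesOnly"},
)
```

Passing the result to urllib.request.urlopen(request) would send the call. Building the request separately makes it easy to inspect the serialized profiles value first.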
We can also combine profiles, as shown below.
{ "profiles": [ { "profile1": { "DetectNewColumnBySpacesRatio": "2.0" } }, { "profile2": { "OCRMode": "TextFromImagesOnly" } } ] }
Applies To:
/pdf/convert/to/csv
/pdf/convert/to/xml
/pdf/convert/to/json
/pdf/convert/to/xls
/pdf/convert/to/xlsx