PDF To XML and forcing OCR for text extraction from scanned images inside PDF

For Zapier, Integromat, and other plugins, insert the custom profile into the profiles field. For API calls, pass the profile as a string in the profiles parameter.

Extracting XML from a scanned PDF can fail in special cases where the file contains both scanned images and long text objects (for example, “Generated by Foxit PDF Creator ….”). Optical Character Recognition (OCR) runs automatically only when a document contains no text, so in such documents the text inside the scanned images is not recognized. You can force OCR for these documents with a custom profile that sets the OCRMode option.

The following profile forces OCR.

{ "OCRMode": "TextFromImagesOnly" }
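As a minimal sketch of the "set value as string" rule for API calls, the Python snippet below serializes the profile object into a string before placing it in the profiles parameter. The helper name build_request_body, the "url" field name, and the sample file URL are assumptions made for illustration, not taken from this article; no request is actually sent.

```python
import json

# Hypothetical helper: builds the JSON body for a conversion request.
# The "url" field name and the sample file URL are illustrative assumptions.
def build_request_body(source_url, profile):
    return {
        "url": source_url,
        # The profiles parameter must be a string, so serialize the object.
        "profiles": json.dumps(profile),
    }

force_ocr = {"OCRMode": "TextFromImagesOnly"}
body = build_request_body("https://example.com/scanned.pdf", force_ocr)
print(json.dumps(body, indent=2))
```

The point of the sketch is only that profiles travels as a JSON-encoded string inside the request body, not as a nested object.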

We can also combine profiles, as shown below.

private const string Profiles = "{ \"profiles\": [ { \"profile1\": { \"DetectNewColumnBySpacesRatio\": \"2.0\" } }, { \"profile2\": { \"OCRMode\": \"TextFromImagesOnly\" } } ] }";
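The combination can also be generated rather than typed by hand. The Python sketch below assumes the combined form is a "profiles" array whose entries are named profile1, profile2, and so on, which is what the "profile2" entry above suggests. The helper name combine_profiles is hypothetical, and the exact wrapper layout should be checked against the API documentation.

```python
import json

# Hypothetical helper: wraps each profile as profile1, profile2, ...
# inside a "profiles" array and returns the result as a JSON string.
# The wrapper layout is an assumption inferred from the "profile2" naming.
def combine_profiles(*profiles):
    entries = [{f"profile{i}": p} for i, p in enumerate(profiles, start=1)]
    return json.dumps({"profiles": entries})

combined = combine_profiles(
    {"DetectNewColumnBySpacesRatio": "2.0"},
    {"OCRMode": "TextFromImagesOnly"},
)
print(combined)
```

The returned value is already a string, so it can go directly into the profiles field or parameter.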

Applies To:

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx