PDF to JSON/PDF to Text - fixing malformed PDF or incorrectly embedded font
For Zapier, Integromat, and other plugins, insert the custom profile into the profiles field. For API calls, pass the value of the profiles parameter as a string.
Sometimes the PDF file is malformed: an embedded font used to draw the characters has a modified character table, so the character codes cannot be mapped to correct symbols in any relevant charset. To confirm this, we can open the document in Adobe Reader and copy and paste the text from it. If the pasted characters are garbled too, this might be some sort of extraction protection.
If we need to get the text from this kind of file at any cost, we can try a special mode that renders each document page and passes it to Optical Character Recognition (OCR). This makes it possible to “repair” the text. In the Web API, you can enable this mode with the profiles parameter, which lets you change advanced options of the underlying PDF Extractor engine.
{"OCRMode": "TextFromImagesAndVectorsAndRepairedFonts" }
or
{"OCRMode": 3}
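Because the profiles value must be sent as a string in API calls, the profile object has to be serialized before it goes into the request. The following is a minimal Python sketch of that step only. The function name build_request_body and the url input parameter are assumptions made for illustration; authentication and the actual HTTP call are left out, so check the API reference for the endpoint you use.

```python
import json

# OCR mode name taken from the article above.
OCR_PROFILE = {"OCRMode": "TextFromImagesAndVectorsAndRepairedFonts"}


def build_request_body(file_url: str, profile: dict = OCR_PROFILE) -> dict:
    """Return a request body whose `profiles` value is a JSON string.

    `url` is an assumed name for the input-file parameter (hypothetical here).
    """
    return {
        "url": file_url,                  # assumption: input file parameter
        "profiles": json.dumps(profile),  # serialized to a string, not a nested object
    }


body = build_request_body("https://example.com/sample.pdf")
print(json.dumps(body, indent=2))
```

Serializing with json.dumps also avoids hand-typed quoting mistakes, such as mixing single and double quotes inside the profile.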
If you are running pdf/convert/to/json, you can check the output JSON for the ocrWasPerformed property to see whether OCR was performed on a given page. If this property is set to true in the JSON response, the engine detected a malformed font on that page and ran the OCR engine to extract the correct text from it.
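The check described above can be sketched in Python as follows. The exact layout of the output JSON is not given here, so this sketch makes no assumption about where ocrWasPerformed sits and simply searches the whole parsed document for it. The helper names and the sample document are hypothetical and for illustration only.

```python
import json
from typing import Any, Iterator


def iter_ocr_flags(node: Any) -> Iterator[bool]:
    """Yield every `ocrWasPerformed` value found anywhere in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "ocrWasPerformed":
                # Accept a JSON boolean or a "true"/"false" string.
                yield value is True or str(value).strip().lower() == "true"
            else:
                yield from iter_ocr_flags(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_ocr_flags(item)


def ocr_was_performed(output_json: str) -> bool:
    """Return True if any `ocrWasPerformed` flag in the output is true."""
    return any(iter_ocr_flags(json.loads(output_json)))


# Hypothetical output shape, not the actual API schema.
sample = (
    '{"pages": [{"index": 0, "ocrWasPerformed": false},'
    ' {"index": 1, "ocrWasPerformed": true}]}'
)
print(ocr_was_performed(sample))
```

Searching recursively keeps the check working even if the property is nested differently from the sample shown.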
Applies To:
/pdf/convert/to/csv
/pdf/convert/to/xml
/pdf/convert/to/json
/pdf/convert/to/xls
/pdf/convert/to/xlsx