What is a cost-effective way to find text in documents?
The approach that works best for our customers is the following:
- run a batch job to extract the text from the PDF files
- store the extracted text in a database (either with page markers or as one record per page)
- enable full-text search on that database and run all text queries against it
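The steps above can be sketched roughly as follows. This is a minimal illustration, not a recommended implementation: it assumes the text has already been extracted into (document, page, text) tuples by whatever extraction tool you use, and it assumes SQLite with the FTS5 extension as the database (FTS5 is included in most Python builds, but not guaranteed). The table name `page_text` and the functions `build_index` and `search` are hypothetical names chosen for this example.

```python
import sqlite3


def build_index(conn, pages):
    """Store one record per page in a full-text-search table.

    `pages` is an iterable of (document_name, page_number, text) tuples
    produced by a separate batch extraction step (not shown here).
    """
    # Only `body` is indexed for search; document and page are stored
    # alongside it so that a hit can be traced back to its source page.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS page_text "
        "USING fts5(document UNINDEXED, page UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO page_text (document, page, body) VALUES (?, ?, ?)",
        pages,
    )
    conn.commit()


def search(conn, query, limit=10):
    """Return (document, page, snippet) rows for the best matches."""
    rows = conn.execute(
        "SELECT document, page, snippet(page_text, 2, '[', ']', '...', 8) "
        "FROM page_text WHERE page_text MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    # Sample data standing in for the output of the extraction batch.
    sample_pages = [
        ("a.pdf", 1, "Annual report for shareholders"),
        ("a.pdf", 2, "Invoice total due in thirty days"),
        ("b.pdf", 1, "Meeting notes from the planning session"),
    ]
    connection = sqlite3.connect(":memory:")
    build_index(connection, sample_pages)
    for document, page, snippet in search(connection, "invoice"):
        print(document, page, snippet)
```

The PDF files are read only once, in the batch step; every user query afterwards touches only the indexed text table.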
Searching every PDF file directly on each user query is slower, because PDF is designed mainly for printing, not for storing text.
A separate text database also keeps your costs low and lets you add features such as showing related documents, grouping documents by topic based on text analysis and classification, and supporting scanned documents. Text extraction from scanned documents is painfully slow because of the machine learning / AI processing involved, so it pays to run it once in the batch step rather than on every query.