What is a cost-effective way to find text in documents?

The best approach that our customers use is the following:

  • run a batch job to extract the text from the PDF files
  • store the extracted text in a database (with page markers, or as one record per page)
  • enable full-text search on that database and run all text searches against it instead of against the PDF files
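
The steps above can be sketched roughly as follows. This is a minimal illustration, not a description of any particular product: it assumes SQLite with the FTS5 extension (included in most Python builds, but not guaranteed) as the text database, and it assumes the PDF extraction step has already produced one record per page. The names build_index, search, and the sample page records are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple

# One record per page: (document name, 1-based page number, extracted text).
# Producing these records is the batch extraction step; any PDF text
# extraction tool can be used for it (assumption: it is run separately).
PageRecord = Tuple[str, int, str]


def build_index(conn: sqlite3.Connection, pages: Iterable[PageRecord]) -> None:
    """Store one row per page in an FTS5 full-text table."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
        "USING fts5(doc UNINDEXED, page UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO pages (doc, page, body) VALUES (?, ?, ?)", pages
    )
    conn.commit()


def search(
    conn: sqlite3.Connection, query: str, limit: int = 10
) -> List[Tuple[str, int, str]]:
    """Return (doc, page, snippet) for matching pages, best match first.

    The query uses FTS5 syntax, so raw user input containing quotes or
    operators may need escaping before it is passed in.
    """
    rows = conn.execute(
        "SELECT doc, page, snippet(pages, 2, '[', ']', '...', 8) "
        "FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [(doc, int(page), snip) for doc, page, snip in rows]


if __name__ == "__main__":
    # Hypothetical sample data standing in for real extraction output.
    sample_pages = [
        ("invoice.pdf", 1, "Invoice number 1042 issued to Acme Ltd"),
        ("invoice.pdf", 2, "Total amount due within thirty days"),
        ("manual.pdf", 1, "Installation guide for the scanner"),
    ]
    conn = sqlite3.connect(":memory:")
    build_index(conn, sample_pages)
    for doc, page, snip in search(conn, "amount"):
        print(doc, page, snip)
```

Storing one row per page means each search hit already carries its document name and page number, so the PDF file only needs to be opened when the user wants to view that page.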

Searching every PDF file again for each user query is slower, because PDF files are designed mainly for printing, not for storing text.

A separate text database also keeps your costs low and lets you implement features such as showing related documents, grouping documents by topic based on text analysis and classification, supporting scanned documents (text extraction from scanned documents is painfully slow because of the machine learning / AI involved), and so on.
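
As one illustration of the "related documents" idea, the stored text can be compared with TF-IDF weights and cosine similarity. This is only a sketch of one common technique, not necessarily how any given product does it; the function names and the sample documents are hypothetical, and a real system would likely use a proper tokenizer and stop-word list.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Map each document name to a {term: TF-IDF weight} vector."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_freq: Counter = Counter()
    for terms in counts.values():
        doc_freq.update(terms.keys())
    n = len(docs)
    return {
        name: {
            # Smoothed IDF so that terms found in every document keep a weight.
            term: tf * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, tf in terms.items()
        }
        for name, terms in counts.items()
    }


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related(docs: Dict[str, str], name: str, top: int = 3) -> List[Tuple[str, float]]:
    """Return the documents most similar to `name`, most similar first."""
    vectors = tfidf_vectors(docs)
    scores = [
        (other, cosine(vectors[name], vectors[other]))
        for other in docs
        if other != name
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top]


if __name__ == "__main__":
    # Hypothetical extracted text, keyed by document name.
    sample = {
        "invoice_a.pdf": "invoice payment total amount",
        "invoice_b.pdf": "payment invoice due amount",
        "manual.pdf": "scanner installation guide",
    }
    print(related(sample, "invoice_a.pdf"))
```

Because this works on the stored text only, it can run as another batch step without touching the PDF files again.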

1 Comment

  • Jfitchett56

    This is very interesting. However, I do not see any bytescount functions to find all text and save it with its location as described. How are bullets one and two accomplished?
