tika-dev mailing list archives

From "Randal Moss (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1790) Enhancement for extracting text from pdfs
Date Mon, 09 Nov 2015 16:42:11 GMT
Randal Moss created TIKA-1790:

             Summary: Enhancement for extracting text from pdfs
                 Key: TIKA-1790
                 URL: https://issues.apache.org/jira/browse/TIKA-1790
             Project: Tika
          Issue Type: Improvement
          Components: example, parser
            Reporter: Randal Moss
            Priority: Minor

This enhancement would attempt to extract more text from PDFs in two ways. For images with
multicolored backgrounds, it applies adaptive threshold binarization before running Tesseract
for OCR. For vector images, it first rasterizes them (using Ghostscript) and then runs
Tesseract on the flattened images. The final output would be a text file containing all of
the text extracted in these steps.
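
To illustrate the binarization step, here is a minimal sketch of mean-based adaptive
thresholding in plain Java. It is not the code from the linked repository, and the issue
does not say which adaptive method is used; the class name AdaptiveThreshold and the radius
and offset parameters are hypothetical. Each pixel is compared against the mean of its own
neighborhood rather than against one global cutoff, which is why text can survive a
background whose brightness changes across the image.

```java
/** Hypothetical helper: mean-based adaptive thresholding over a grayscale image. */
final class AdaptiveThreshold {

    private AdaptiveThreshold() {}

    /**
     * @param gray   grayscale pixels, gray[y][x] in 0..255 (rectangular, non-empty)
     * @param radius half the side of the square neighborhood, in pixels
     * @param offset how much darker than the local mean a pixel must be to count as ink
     * @return true where the pixel is classified as ink (foreground)
     */
    static boolean[][] binarize(int[][] gray, int radius, int offset) {
        int h = gray.length;
        int w = gray[0].length;

        // Integral image: integral[y][x] holds the sum of all pixels above and
        // to the left of (x, y), so any window sum costs four lookups.
        long[][] integral = new long[h + 1][w + 1];
        for (int y = 0; y < h; y++) {
            long rowSum = 0;
            for (int x = 0; x < w; x++) {
                rowSum += gray[y][x];
                integral[y + 1][x + 1] = integral[y][x + 1] + rowSum;
            }
        }

        boolean[][] ink = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            int y0 = Math.max(0, y - radius);
            int y1 = Math.min(h - 1, y + radius);
            for (int x = 0; x < w; x++) {
                int x0 = Math.max(0, x - radius);
                int x1 = Math.min(w - 1, x + radius);
                long sum = integral[y1 + 1][x1 + 1] - integral[y0][x1 + 1]
                         - integral[y1 + 1][x0] + integral[y0][x0];
                int count = (y1 - y0 + 1) * (x1 - x0 + 1);
                // Ink when gray < mean - offset, rearranged to avoid division.
                ink[y][x] = (long) (gray[y][x] + offset) * count < sum;
            }
        }
        return ink;
    }
}
```

The resulting mask would then be written out as a black-and-white image and passed to
Tesseract. Sharp background edges can still produce stray ink pixels with this approach,
so the radius and offset usually need tuning per document type.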

I would like to integrate this as a library separate from Tika, similar to how the
[GeoTopicParser|https://wiki.apache.org/tika/GeoTopicParser] is handled.

The code is still a work in progress and can be found [here|https://github.com/RandalMoss/pdf-search].
