tika-dev mailing list archives

From Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
Subject Re: apache tikka is not working for scanned image documents
Date Wed, 05 Apr 2017 06:23:36 GMT
Hi Vadivelhan,

As Chris mentioned, please visit https://wiki.apache.org/tika/TikaOCR and
install Tesseract on your machine. To check that Tesseract is available,
run this command (without the quotes) in a terminal: "tesseract test.jpg out".
It should OCR the image and write the recognized text to the file
out.txt.
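
If you would rather do that availability check from Java, something like
the following might work. It is only a minimal sketch: the class name
TesseractCheck and the method isAvailable are made up for illustration,
and it assumes the executable is called "tesseract", is on the PATH, and
that you are on Java 9 or later (for Redirect.DISCARD). It only tells you
whether the command starts and exits cleanly, not whether OCR quality is
good.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

final class TesseractCheck {

    // Hypothetical helper: returns true if "<command> --version" starts
    // and exits with status 0 within ten seconds.
    static boolean isAvailable(String command) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                // Took too long; treat as unavailable.
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Typically thrown when the executable is not on the PATH.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + isAvailable("tesseract"));
    }
}
```

If this prints false, Tika will most likely not be able to call Tesseract
either, so fix the installation or the PATH first.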

Here is a code snippet that runs OCR on a PDF. Give it a run.

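public void doOCR() throws Exception {

    // Name of the PDF, looked up as a classpath resource below.
    String resource = "yourPDF.pdf";

    TesseractOCRConfig config = new TesseractOCRConfig();

    // The wrapper collects the text of the PDF and of every embedded
    // document (including inline images) as separate Metadata entries.
    Parser parser = new RecursiveParserWrapper(new AutoDetectParser(),
            new BasicContentHandlerFactory(
                    BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

    // Inline images must be extracted so that they can be sent to Tesseract.
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);

    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(Parser.class, parser);
    parseContext.set(PDFParserConfig.class, pdfConfig);

    // TesseractOCRParserTest is a class from Tika's own tests; replace it
    // with a class from your project whose classpath contains the PDF.
    try (InputStream stream =
            TesseractOCRParserTest.class.getResourceAsStream(resource)) {
        parser.parse(stream, new DefaultHandler(), new Metadata(),
                parseContext);
    }
    List<Metadata> metadataList =
            ((RecursiveParserWrapper) parser).getMetadata();

    StringBuilder contents = new StringBuilder();
    for (Metadata m : metadataList) {
        // Entries without extracted text return null; skip them so the
        // literal string "null" is not appended.
        String content = m.get(RecursiveParserWrapper.TIKA_CONTENT);
        if (content != null) {
            contents.append(content);
        }
    }

    System.out.println(contents.toString());
}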


On Wed, Apr 5, 2017 at 9:07 AM, Vadivelhan <
vadivelcommunicationid@rediffmail.com> wrote:

> Hi ,
>
> I tested Apache Tika with the OCR configuration. It was not able to
> extract the text from the PDF document. I attached the same
> document. Please check and let me know the result. This is very urgent. It
> would be really appreciated.
>
>
> Best Regards,
> M.Vadivelhan
> Cell No:+91 7708435395
>
> On Tue, 04 Apr 2017 23:09:15 +0530 Chris Mattmann wrote
> > Hi,
> >
> > Have you checked out: http://wiki.apache.org/tika/TikaOCR
> >
> > What specifically isn’t working?
> >
> > Moving this to dev@t.a.o:
> >
> > Cheers,
> > Chris
> >
> > From: on behalf of Vadivelhan
> > Date: Tuesday, April 4, 2017 at 8:25 AM
> > To: "mattmann@apache.org"
> > Subject: apache tikka is not working for scanned image documents
> >
> > HI
>
> apache tikka is not working for scanned image documents. please suggest
> your help
>
> Regards,
> M.Vadivelhan
>
>
>
