tika-dev mailing list archives

From Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
Subject Re: apache tikka is not working for scanned image documents
Date Wed, 05 Apr 2017 06:25:41 GMT
btw it's not tikka. It's Tika :)

On Wed, Apr 5, 2017 at 11:53 AM, Thejan Wijesinghe <
thejan.k.wijesinghe@gmail.com> wrote:

> Hi Vadivelhan,
>
> As Chris mentioned, please visit https://wiki.apache.org/tika/TikaOCR and
> install Tesseract on your machine. To check that Tesseract is available,
> run the command "tesseract test.jpg out" (without the quotes) in a
> terminal and check that it can OCR an image and write the text to a
> file (out.txt).
>
> Here is a code snippet to OCR a PDF; give it a run.
>
> public void doOCR() throws Exception {
>
>     String resource = "yourPDF.pdf";
>
>     TesseractOCRConfig config = new TesseractOCRConfig();
>
>     Parser parser = new RecursiveParserWrapper(new AutoDetectParser(),
>             new BasicContentHandlerFactory(
>                     BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
>
>     PDFParserConfig pdfConfig = new PDFParserConfig();
>     pdfConfig.setExtractInlineImages(true);
>
>     ParseContext parseContext = new ParseContext();
>     parseContext.set(TesseractOCRConfig.class, config);
>     parseContext.set(Parser.class, parser);
>     parseContext.set(PDFParserConfig.class, pdfConfig);
>
>     try (InputStream stream = TesseractOCRParserTest.class.getResourceAsStream(resource)) {
>         parser.parse(stream, new DefaultHandler(), new Metadata(), parseContext);
>     }
>     List<Metadata> metadataList = ((RecursiveParserWrapper) parser).getMetadata();
>
>     StringBuilder contents = new StringBuilder();
>     for (Metadata m : metadataList) {
>         contents.append(m.get(RecursiveParserWrapper.TIKA_CONTENT));
>     }
>
>     System.out.println(contents.toString());
> }
>
>
> On Wed, Apr 5, 2017 at 9:07 AM, Vadivelhan <vadivelcommunicationid@
> rediffmail.com> wrote:
>
>> Hi,
>>
>> I tested Apache Tikka with the OCR configuration. It is not able to
>> extract text from the PDF document. I have attached the document.
>> Please check and update me with the result. This is very urgent. It
>> would be really appreciated.
>>
>>
>> Best Regards,
>> M.Vadivelhan
>> Cell No: +91 7708435395
>>
>> On Tue, 04 Apr 2017 23:09:15 +0530 Chris Mattmann wrote
>> > Hi,
>> >
>> > Have you checked out: http://wiki.apache.org/tika/TikaOCR
>> >
>> > What specifically isn’t working?
>> >
>> > Moving this to dev@t.a.o:
>> >
>> > Cheers,
>> > Chris
>>
>> From: on behalf of Vadivelhan
>> Date: Tuesday, April 4, 2017 at 8:25 AM
>> To: "mattmann@apache.org"
>> Subject: apache tikka is not working for scanned image documents
>>
>> Hi,
>>
>> apache tikka is not working for scanned image documents. Please suggest
>> how to fix this.
>>
>> Regards,
>> M.Vadivelhan
>>
>>
>>
>
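
The quoted reply suggests confirming from a terminal that Tesseract is installed before trying OCR through Tika. The same check can be sketched in Java. This is a minimal sketch, not part of Tika: the class name `TesseractCheck` and the method `isAvailable` are hypothetical, it assumes Java 9 or later (for `ProcessBuilder.Redirect.DISCARD`), and it assumes the executable is reachable as `tesseract` on the PATH.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Apache Tika.
public class TesseractCheck {

    // Returns true if the given command starts and exits with status 0
    // when asked for its version. Pass "tesseract" if it is on the PATH,
    // or a full path to the executable otherwise.
    static boolean isAvailable(String command) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // ignore the version banner
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                // The process hung; treat that as "not usable".
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // The executable could not be found or started.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Tesseract available: " + isAvailable("tesseract"));
    }
}
```

If this reports `false`, the OCR snippet in the quoted message is unlikely to return any text for scanned pages, so installing Tesseract as described on the wiki page is the first thing to check.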
