tika-dev mailing list archives

From Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
Subject Re: Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types
Date Fri, 03 Mar 2017 19:56:35 GMT
Hi Thamme,

I am happy to say that I have started working on your suggestion to create
a simpler Java version of TesseractOCRParser using Tess4J. I ran a few
tests with the existing TesseractOCRParser and found that I am getting
null for the stream of an image, although I can extract the content of the
image without a problem. The relevant code snippet is included below.
I'm not sure whether I'm missing something. It is important for me to
understand this because I also plan to extract metadata through the API
that I'm going to write using Tess4J.

public static void main(final String[] args) throws IOException,
SAXException, TikaException {

    // CLI implementation
    File imageFile = new File("/home/thejan/Desktop/test.jpg");
    FileInputStream stream = new FileInputStream(imageFile);
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    TesseractOCRParser tessParser = new TesseractOCRParser();
    tessParser.parse(stream, handler, metadata, context);
    stream.close();
    // The content gets printed correctly
    System.out.println(handler.toString());

    // But I get "X-Parsed-By : org.apache.tika.parser.EmptyParser" for metadata
    String[] metadataNames = metadata.names();

    for (String name : metadataNames) {
        System.out.println(name + " : " + metadata.get(name));
    }
}

On Fri, Mar 3, 2017 at 6:16 AM, Thamme Gowda <thammegowda@apache.org> wrote:

> Thejan,
>
> Yes, send your questions to us, and cc dev list.
> Looking forward to working with you!
>
> Best,
> TG
>
> --
> Thamme Gowda
> TG | @thammegowda
> ~Sent via somebody's IMAP server
>
> On Mar 2, 2017 11:50 AM, "Thejan Wijesinghe" <
> thejan.k.wijesinghe@gmail.com>
> wrote:
>
> > Dear Thamme and Chris,
> >
> > I have commented on the particular JIRA page and subscribed to the
> > dev-mailing list as Thamme suggested. I am really interested in looking
> > into the challenges that Thamme has provided. Thank you for guiding me
> > this way. If I get any issues while working on these problems, is it
> > alright to contact you this way (directly mailing to you two while CCing
> > the dev-mail)? or is there any other suitable way of doing that? Pardon
> > me for asking such a question, I am really concerned about the protocol
> > that mailing should happen.
> >
>
