tika-dev mailing list archives

From Thamme Gowda <thammego...@apache.org>
Subject Re: Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types
Date Sat, 04 Mar 2017 04:01:35 GMT
Hi Thejan,

I tried running your code snippet on my machine. It worked!

It looks to me like you either haven't set up tesseract or your setup is
incomplete.
You need to have tesseract and imagemagick installed and available in
$PATH for it to work.
You can verify with these commands:
$tesseract test.jpg stdout
$convert --help
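
If you would rather do the same check from Java, here is a rough sketch. It
only looks through the directories listed in $PATH for an executable file with
the given name. The class name PathCheck and the helper findOnPath are made up
for illustration and are not part of Tika. It also assumes Linux-style
executable names with no ".exe" suffix.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class PathCheck {

    // Hypothetical helper (not a Tika API): returns the first executable
    // regular file called `name` found in the PATH-style string `pathVar`.
    static Optional<Path> findOnPath(String name, String pathVar) {
        if (pathVar == null || pathVar.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathVar.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            Path candidate = Paths.get(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String pathVar = System.getenv("PATH");
        // The two tools mentioned above: tesseract and ImageMagick's convert.
        for (String tool : new String[] {"tesseract", "convert"}) {
            Optional<Path> hit = findOnPath(tool, pathVar);
            System.out.println(tool + " : "
                    + (hit.isPresent() ? hit.get().toString() : "NOT FOUND in PATH"));
        }
    }
}
```

This only tells you whether the binaries are visible to the JVM. Running the
commands above is still the better test that they actually work.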

I see from the path to your image that you're running it on Linux. If it
is Ubuntu, try installing tesseract and imagemagick using apt-get.

There is some documentation on the wiki [1] for setup on OSX. After you get
this working, please request edit permissions for the wiki and update
the OCR page accordingly. Please keep this in your TODO list for now :-)

Let me know if this solved your problem.

Best,
TG

[1] https://wiki.apache.org/tika/TikaOCR

*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Fri, Mar 3, 2017 at 12:36 PM, Thejan Wijesinghe <
thejan.k.wijesinghe@gmail.com> wrote:

> Update: not "getting null for a stream", it should be "getting nothing as
> metadata for the image"
>
> On 4 Mar 2017 01:26, "Thejan Wijesinghe" <thejan.k.wijesinghe@gmail.com>
> wrote:
>
>> Hi Thamme,
>>
>> I am happy to say that I have started working on your suggestion for creating
>> a simpler Java version of TesseractOCRParser using Tess4J. I ran a few
>> tests with the existing TesseractOCRParser and found out that I'm getting
>> null for the stream of an image although I could extract the content of the
>> image without a problem. That particular code snippet is attached below.
>> I'm not sure whether I'm missing something. It is important for me to
>> know this because I'm planning to extract metadata as well through the API
>> that I'm going to write using Tess4J.
>>
>> public static void main(final String[] args) throws IOException, SAXException, TikaException {
>>
>>     // CLI implementation
>>     File imageFile = new File("/home/thejan/Desktop/test.jpg");
>>     FileInputStream stream = new FileInputStream(imageFile);
>>     ContentHandler handler = new BodyContentHandler();
>>     Metadata metadata = new Metadata();
>>     ParseContext context = new ParseContext();
>>
>>     TesseractOCRParser tessParser = new TesseractOCRParser();
>>     tessParser.parse(stream, handler, metadata, context);
>>     stream.close();
>>     // The content gets printed correctly
>>     System.out.println(handler.toString());
>>
>>     // But I get "X-Parsed-By : org.apache.tika.parser.EmptyParser" for metadata
>>     String[] metadataNames = metadata.names();
>>
>>     for(String name : metadataNames) {
>>         System.out.println(name+ " : " + metadata.get(name));
>>     }
>> }
>>
>> On Fri, Mar 3, 2017 at 6:16 AM, Thamme Gowda <thammegowda@apache.org>
>> wrote:
>>
>>> Thejan,
>>>
>>> Yes, send your questions to us, and cc dev list.
>>> Looking forward to working with you!
>>>
>>> Best,
>>> TG
>>>
>>> --
>>> Thamme Gowda
>>> TG | @thammegowda
>>> ~Sent via somebody's IMAP server
>>>
>>> On Mar 2, 2017 11:50 AM, "Thejan Wijesinghe" <
>>> thejan.k.wijesinghe@gmail.com>
>>> wrote:
>>>
>>> > Dear Thamme and Chris,
>>> >
>>> > I have commented on the particular JIRA page and subscribed to the
>>> > dev-mailing list as Thamme suggested. I am really interested in looking
>>> > into the challenges that Thamme has provided. Thank you for guiding me
>>> > this way. If I run into any issues while working on these problems, is
>>> > it alright to contact you this way (mailing you two directly while
>>> > CCing the dev list)? Or is there another, more suitable way of doing
>>> > that? Pardon me for asking such a question; I just want to be sure I
>>> > follow the proper mailing protocol.
>>> >
>>>
>>
>>
>>
>>
