tika-dev mailing list archives

From keithrbennett <keithrbenn...@gmail.com>
Subject Re: Moving Functionality from CLI to ParseUtils
Date Sun, 12 Jul 2009 23:03:45 GMT

Jukka and All -

I think a Tika facade would be awesome.

I guess where I mentioned streams, I should have said readers
and writers instead.

BTW, how can I insert new text into quoted sections of a message
in Nabble?

A method that returns a Reader rather than taking a Writer may
well be better for Lucene, but for other use cases a Writer might
be more convenient (for writing to files, for example).  A method
that takes a Writer would, I think, be more useful than one that
returns a String because it could 1) support output larger than
memory capacity, 2) easily support output to files, and 3) still
support strings (by using a StringWriter).
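
To make the Writer argument concrete, here is a minimal sketch. The class WriterSketch and the methods extractText and extractToString are hypothetical names made up for illustration, not existing Tika API, and the "extraction" is just a character copy standing in for a real parser. It only shows that a Writer-taking method streams in fixed-size chunks, writes to files directly, and still yields a String through a StringWriter.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriterSketch {

    // Hypothetical stand-in for a text-extraction method (not a Tika API).
    // It copies characters in fixed-size chunks, so memory use stays
    // bounded no matter how large the input is.
    public static void extractText(Reader source, Writer out) throws IOException {
        char[] buffer = new char[8192];
        int n;
        while ((n = source.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.flush();
    }

    // The String-returning convenience falls out of the Writer version
    // by passing a StringWriter.
    public static String extractToString(Reader source) throws IOException {
        StringWriter sw = new StringWriter();
        extractText(source, sw);
        return sw.toString();
    }

    public static void main(String[] args) throws IOException {
        // Case 3: still supports strings.
        System.out.println(extractToString(new StringReader("hello, string")));

        // Case 2: output to a file, using the same method.
        Path file = Files.createTempFile("extracted", ".txt");
        try (Writer w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            extractText(new StringReader("hello, file"), w);
        }
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

The reverse does not hold: a method that only returns a String cannot be adapted to stream to a file without first holding the whole result in memory.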

Speaking of Lucene, I have never used Lucene directly, so I lack
the context to understand the Tika/Lucene integration.  All my
input is from the point of view of someone who just wants to
parse text from documents and do things other than text search.
So if I neglect to include Lucene in my outlook, rest assured
that it is just ignorance and nothing more. ;)

Regarding XHTML, we already support it on the command line. My
sense is that Excel spreadsheet parsing would be used more often
for structured data than for raw text (that's certainly true for
me), so I hope we can keep XHTML output.  I understand your
suggestion to use POI directly for more sophisticated document
handling, though.

Everything else sounded good to me.

Regards,
Keith

-- 
View this message in context: http://www.nabble.com/Moving-Functionality-from-CLI-to-ParseUtils-tp24337541p24453304.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.

