tika-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: Using URL's for Input Resource Specifiers: How can I help?
Date Thu, 20 Sep 2007 15:37:11 GMT
Hi Guys,

 I am going to work on getting this in today. I had wanted to wait on this
because, while doing the work for TIKA-6, I ran into a few issues that I
think we need to close the loop on w.r.t. Tika before we go too far
down any path. I wanted to summarize these ideas more coherently in a longer
email, but since there is some interest right now, I'll try my best to get
some of the ideas out there right now and then talk through them in more
detail as soon as I get a chance.

 Ok, so right now the biggest thing that I see is that the Parsing system
that we have is based on the code for Luis, which is great. It works and
has a lot of capability. However, Luis introduced some things that we
(as the "Tika" team) haven't yet thought through in greater detail.
Here's a practical example. We have code in Tika right now called
"LuisConfig". LuisConfig is being used right now as the main "configuration"
representation for Tika parsers. This is fine; however, it imposes a data
model, e.g., a "Parser" should have a "name", a "class", a "namespace", etc.
One of the issues with this is that, as a developer, I wasn't exactly sure
how to handle configuration while doing the mime database patch for TIKA-6.
The mime database requires configuration parameters (e.g., "where is the
location of the mime database XML files"); however, we don't really have a
place in Tika for an overall configuration right now. We have the
LuisConfig (which should probably be renamed to "TikaParserConfig" or
something like that); however, that is very parser specific as far as I can
tell, and not specific to the configuration of the Tika toolkit. So, the
question becomes the following:

 1. How do we configure Tika? This somewhat relates to the prior discussion
on the Tika parser interface, but it extends beyond that.
 2. What are the right data attributes to configure a parser? Could we get
some documentation on them? For instance, is the "mime" attribute in the
config.xml file something that defines the "acceptable" mime types that a
parser can parse? We need to do some simple data engineering/documentation
here, so that people know what they are doing.
 3. What are the entry points into Tika? As far as I can tell, there is a
ParserFactory that can be used to get a Parser for a particular file or URL,
etc. This implies that the ParserFactory performs some sort of mime type
resolution (which it does); however, mime type resolution (using the new
mime framework) requires Tika to have a configuration.
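
 To make questions 2 and 3 a bit more concrete, here is a rough sketch of
what a documented parser data model and a mime-based lookup could look like.
All of the names here (ParserConfig, ParserRegistry, forMimeType) are
hypothetical and are not the current LuisConfig or ParserFactory code, and it
assumes that the "mime" attribute means "mime types the parser accepts",
which is exactly the point that still needs to be confirmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical data model: the fields mirror the attributes mentioned above
// ("name", "class", "namespace", "mime"); this is not the LuisConfig code.
final class ParserConfig {
    final String name;
    final String className;
    final String namespace;
    // Assumption: "mime" lists the mime types this parser is able to parse.
    final List<String> mimeTypes;

    ParserConfig(String name, String className, String namespace, String... mimeTypes) {
        this.name = name;
        this.className = className;
        this.namespace = namespace;
        this.mimeTypes = Arrays.asList(mimeTypes);
    }

    boolean accepts(String mimeType) {
        return mimeTypes.contains(mimeType);
    }
}

// Hypothetical entry point: picks a parser configuration by resolved mime type.
final class ParserRegistry {
    private final List<ParserConfig> configs = new ArrayList<>();

    void register(ParserConfig config) {
        configs.add(config);
    }

    // Returns the first registered configuration that accepts the mime type,
    // or null when no parser is registered for it.
    ParserConfig forMimeType(String mimeType) {
        for (ParserConfig config : configs) {
            if (config.accepts(mimeType)) {
                return config;
            }
        }
        return null;
    }
}
```

 The lookup takes an already resolved mime type on purpose: resolution
itself would live in the mime framework, which is the part that needs a
toolkit-level configuration.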

 So, here's my proposition to address these issues:

 1. We define a TikaConfiguration object, and an XML file location/format
within CM for Tika configuration properties. I'm fine with using the
Configuration object class from Nutch/Hadoop, and their associated file
format. This was included with the patch from TIKA-6 that I attached;
however, it was incomplete and was probably stored in the wrong place
(org.apache.tika.utils). What do others think?

 2. We sit down and baseline a set of properties, including documentation on
them, for Tika parsers. We should also change everything in CM right now
that says "Luis" to "Tika". Though Luis is really cool and a neat project,
we are working on Tika here, not Luis.

 3. Solving #1 and #2 above will help to address #3. Then I can link the
mime system from TIKA-6 into the ParserFactory and we'll probably have
enough capability and functionality to really start testing the Tika
library, and maybe even be ready for a 0.1-alpha release.
 What do you guys think about this?
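
 As a starting point for discussion on #1, here is a minimal sketch of what
a TikaConfiguration could look like if we read the name/value property
layout used by the Nutch/Hadoop configuration files. The class shape, the
fromXml and get methods, and the property name
"tika.mime.database.location" are all assumptions for illustration; this is
not the code from the TIKA-6 patch and it does not use the Hadoop
Configuration class itself.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of a toolkit-level configuration object.
final class TikaConfiguration {
    private final Map<String, String> properties = new HashMap<>();

    // Parses <configuration><property><name>..</name><value>..</value></property>...
    // and collects each name/value pair. Malformed input is reported as an
    // IllegalArgumentException so callers do not need checked exceptions.
    static TikaConfiguration fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            TikaConfiguration conf = new TikaConfiguration();
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element property = (Element) props.item(i);
                String name = property.getElementsByTagName("name")
                        .item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value")
                        .item(0).getTextContent().trim();
                conf.properties.put(name, value);
            }
            return conf;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid configuration XML", e);
        }
    }

    // Returns the configured value, or defaultValue when the property is absent.
    String get(String name, String defaultValue) {
        return properties.getOrDefault(name, defaultValue);
    }
}
```

 With something like this, the mime framework could ask for
conf.get("tika.mime.database.location", someDefault) instead of each
component inventing its own way to find its files.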



P.S. More to come later, and sorry about the stream of consciousness style
writing! ;)

On 9/20/07 7:46 AM, "Bertrand Delacretaz" <bdelacretaz@apache.org> wrote:

> Hi Keith,
>> ...A few days ago I posted a patch that would have enabled the use of URL's
>> as
>> input specifiers in Tika (see TIKA-17).  However, given the code changes
>> since then, I expect that applying the patch would now fail....
> I reviewed your patch and it looks ok to me, with just minor comments
> (see JIRA).
> I haven't committed it because Chris has assigned the issue to
> himself. I don't know if he's working on it, but IMHO we could commit
> it as is.
>> ...Tika is
>> an extremely useful product, and it can be functional momentarily without a
>> great deal of change....
> Agreed, as long as we don't release it, there are no promises about
> API stability or whatever, so having more usable code is certainly a
> good thing.
> -Bertrand

Chris Mattmann, Ph.D.
Cognizant Development Engineer
Early Detection Research Network Project

Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
