tika-dev mailing list archives

From "Yves Zoundi" <yveszou...@gmail.com>
Subject Re: OSGI bundle for Tika
Date Tue, 20 May 2008 14:12:39 GMT
Hello Jukka,

  Yes, I am afraid of carrying too many Tika dependencies. For now, the
project is still in the incubator and I believe the code base will grow
significantly sooner or later. It might be difficult to extract the mime
detection library.  I think that the mime detection code is worth having
its own Maven project. I didn't see any dependency other than commons-codec.

I really like the idea of a tika-core containing the main interfaces.
Partitioning is good, but at this point I guess it would add extra
complexity and extra work. Once most interfaces are well defined, I think
it will be easy to know what to partition and how to do it without worrying
about the architecture.

 I will file a feature request for a mimetypes serializer. If I have time, I
might write the serializer and send it to you guys for evaluation.
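
To make the request concrete, here is a rough sketch of what such a
serializer could look like, using only the standard JAXP classes. Everything
in it is an assumption for illustration: the class name
MimeTypesSerializerSketch, the toXml method and the Map-based input are
hypothetical and not part of Tika, and the element names (mime-info,
mime-type, glob) are modeled on the freedesktop shared-mime-info layout
rather than checked against the Tika configuration file.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch only: this class is not part of Tika.
class MimeTypesSerializerSketch {

    // Serializes a map of media type name -> glob patterns to an XML string.
    // The element and attribute names are an assumed file layout.
    static String toXml(Map<String, List<String>> types) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("mime-info");
        doc.appendChild(root);

        for (Map.Entry<String, List<String>> entry : types.entrySet()) {
            Element type = doc.createElement("mime-type");
            type.setAttribute("type", entry.getKey());
            // One <glob> child per file name pattern of this media type.
            for (String pattern : entry.getValue()) {
                Element glob = doc.createElement("glob");
                glob.setAttribute("pattern", pattern);
                type.appendChild(glob);
            }
            root.appendChild(type);
        }

        // Write the DOM tree out as indented text.
        Transformer transformer =
                TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Entries added at runtime, then serialized back to XML.
        Map<String, List<String>> types = new LinkedHashMap<>();
        types.put("text/plain", Arrays.asList("*.txt"));
        types.put("application/xml", Arrays.asList("*.xml", "*.xsl"));
        System.out.println(toXml(types));
    }
}
```

A real implementation would presumably work from the media type registry
itself instead of a plain Map, and write to the location the configuration
was loaded from.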

Yves Zoundi

2008/5/20 Jukka Zitting <jukka.zitting@gmail.com>:

> Hi,
> On Mon, May 19, 2008 at 3:05 PM, Yves Zoundi <yveszoundi@gmail.com> wrote:
> > It would be nice to create sub-projects from the main Apache Tika Maven
> > project. The mime detection part is pretty useful and its code could be
> > in a separate project. That would allow people to use it without the
> > rest of Tika's code.
> I think we can do that. Are you more worried about the size of the
> tika jar or all the parser dependencies you don't need?
> We might want to split Tika into two parts, say tika-core and
> tika-parsers, where tika-core would contain all the core interfaces
> and classes with no dependencies to external libraries (except of
> course the standard Java 5 class libraries). We could go even further
> by partitioning the core library by function, but I'm not sure if that
> is worth the extra complexity.
> > I removed a few classes from the source code and created a jar with the
> > mime detection code. I needed to use Tika in an OSGi environment and it
> > was a bit painful to use Tika out of the box (without embedding it in an
> > OSGi bundle which would export Tika packages later).
> >
> > I had to create a manifest and as Tika's code is not huge, I was able to
> > export the packages quickly. I needed to import the javax.xml.parsers, sax
> > and dom packages as Tika uses them to load the mimetypes configuration file.
> It should be possible to add the OSGi bundle information automatically
> in the normal Maven build. You might want to file an improvement
> request for this.
> > The thing I didn't see in the mime detection code was a serializer to
> > save the mimetypes.
> Our use cases so far have had only manual modifications of the
> configuration files, but I don't see why we couldn't make it possible
> to programmatically modify the configuration. In fact I've already
> done some work towards making the media type registry easier to
> manage, and a serializer for the configuration file would be a nice
> addition. Could you file a feature request for that?
> > In a typical application, people usually:
> > - Want a mime type configuration file somewhere that they can load
> > - Want to be able to add/remove mimetypes
> > - Add file extension patterns to existing mime types
> > - Store the mime types back to their location.
> >
> > So my questions are:
> > - If I load the mimetypes from a file, and add some mimetype entries at
> > runtime, how can I save the file back without doing it manually with
> > dom, jdom or dom4j?
> Currently the only way is to modify the XML file directly, but as
> mentioned above a higher level serialization feature would be nice.
> > - Would it be possible to create an OSGi bundle for the mime detection
> > library?
> Certainly.
> BR,
> Jukka Zitting

Your attitude, not your aptitude, will determine your altitude
Zig Ziglar

You have to learn the rules of the game. And then you have to play better
than anyone else.
Albert Einstein

Act as if it was impossible to fail.
Dorothea Brande
