tika-dev mailing list archives

From Antoni Mylka <antoni.my...@gmail.com>
Subject Re: Appending Mime Types
Date Tue, 23 Aug 2011 13:40:30 GMT
On 2011-08-22 20:37, Tom Grant wrote:
> Here's the use case that I'm attempting to solve.  I have a customer with
> many legacy systems, some of which are completely custom.  These systems
> have data files that will never be seen outside of their environment.  For
> example, some are XML files with their own schemas.  Some are similar to the
> new office documents and are zip files containing xml and other goodies.
> Others are serialized-objects dumped to disk.  Some are similar to EDI with
> a header and data body with prescribed offsets. The choices of the past
> can't be undone and I'm stuck with about 30 or 40 different file types.  I
> want to use Tika as the standard API to exploit those old formats.  The
> customer's developers know the internals of the formats, I just need to give
> them an API to map them to instead of developing stovepipes to load each
> format.  The quantity of file types means that it's going to take a few
> months to complete and will happen a few at a time.  So I'd like to
> co-locate the mimetype definition with the parser code for maintainability.

FWIW, my use case is exactly the same: old XML formats, internal to a
given organization, with custom Parsers for them. The plain, generic XML
parser is insufficient (too much garbage, no metadata). We use a sort of
DSL to define the XML->RDF mapping. A single mapping file describes both
the transformation (for our transformer) and the detection rules (for
Tika MimeTypes).
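
To make the detection half concrete, here is a minimal sketch of
telling in-house XML formats apart by their root element. It uses only
the JDK's StAX reader and is not Tika's API or our actual DSL; the
class name RootElementDetector, the namespace urn:example:acme:invoice
and the media type application/x-acme-invoice+xml are all made up for
illustration. Tika's own mime-type definitions can express the same
kind of rule with a root-XML match on local name and namespace URI.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: maps the root element of an XML document to a media type.
class RootElementDetector {
    // Returned when the document is XML but no rule matches its root element.
    static final String FALLBACK = "application/xml";

    // One rule per custom format: (namespace URI, local name) -> media type.
    private final Map<QName, String> rules = new LinkedHashMap<>();

    void addRule(String namespaceUri, String localName, String mimeType) {
        rules.put(new QName(namespaceUri, localName), mimeType);
    }

    String detect(byte[] data) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Detection only needs the root element, so DTDs and external
        // entities are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(data));
            try {
                while (reader.hasNext()) {
                    // Stop at the first start tag: that is the root element.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // QName equality ignores the prefix, so any prefix matches.
                        return rules.getOrDefault(reader.getName(), FALLBACK);
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML: report it as opaque binary data.
            return "application/octet-stream";
        }
        return FALLBACK;
    }
}
```

Keeping one rule per format in a small table like this is what makes it
possible to ship the detection rule next to the parser for that format,
instead of editing one central list.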

Antoni Myłka
antoni.mylka@gmail.com
