tika-dev mailing list archives

From "Rida Benjelloun" <rida.benjell...@doculibre.com>
Subject Re: Tika discussions in Amsterdam
Date Thu, 03 May 2007 14:46:04 GMT
Lius is currently under the Apache license. If people are interested in it, we
can use it as a starting point for the development of Tika.
I think that Lius could quickly be adapted to meet all the needs that Jukka
has mentioned in his email.
Processing phases: Lius makes it possible to detect the MIME type of a document
and also to extract the contents and the metadata. However, we would need to develop
metadata extraction for Word, Excel, and PowerPoint; we could use
POI to do this. We could also use one of the MIME type detectors that Jukka has mentioned.
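For illustration, a MIME type detector of this kind might start from the leading
"magic" bytes of the stream. This is only a minimal sketch; the class and type
strings below are illustrative, not Lius or Tika API:

```java
import java.util.Arrays;

// Hypothetical sketch: detect a few common types by their "magic" header bytes.
public class MagicDetector {

    public static String detect(byte[] header) {
        if (startsWith(header, new byte[] { 0x25, 0x50, 0x44, 0x46 })) {   // "%PDF"
            return "application/pdf";
        }
        if (startsWith(header, new byte[] { 0x50, 0x4B, 0x03, 0x04 })) {   // "PK\3\4", zip
            return "application/zip";
        }
        if (startsWith(header, new byte[] { (byte) 0xD0, (byte) 0xCF,
                (byte) 0x11, (byte) 0xE0 })) {                             // OLE2 (Word/Excel/PPT)
            return "application/x-ole-storage";
        }
        return "application/octet-stream";                                 // unknown fallback
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    public static void main(String[] args) {
        System.out.println(detect("%PDF-1.4".getBytes()));  // application/pdf
    }
}
```

A real detector would also consult file-name extensions and any caller-supplied
type hints before falling back to the generic octet-stream type.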
Plugin framework: Lius doesn't have a plugin architecture; it would be
necessary to develop one. However, all the parsers can be configured using
an XML file.
Structured text: Lius uses JDOM, XPath, and namespaces for the extraction of
structured contents. SAX could be more efficient but does not offer XPath
for the extraction of contents. Lius supports the OpenDocument format.
Lius can also process ZIP files, MP3, and MPEG.
Currently, Lius is connected to Lucene, but it could easily be adapted for
Solr, UIMA, and others.
I think that Tika should also extract outlinks from documents.
If you have questions about Lius, do not hesitate to contact me.
The source code is available at http://sourceforge.net/projects/lius/
Best regards.

Rida Benjelloun

On 5/3/07, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> Hi,
> Quick summary of the Tika discussions from yesterday's text analysis
> BOF at the ApacheCon EU. It's the next morning now, so I'm probably
> missing a lot of stuff...
> * release early: The consensus seemed to be that we should do an early
> release using one of the existing codebases (most likely the Nutch
> parser framework or the Lius codebase) instead of trying to come up
> with the "perfect design" up front before doing the first release. We
> perhaps should do some initial work to avoid excessive API changes
> later on, but that's a secondary consideration to releasing working
> code early.
> * processing phases: Tika has basically two main phases of operation:
> content detection and content extraction. The content detection phase
> tries to detect the content type of a document given a binary stream
> and optional typing metadata. The main output from this phase is the
> content type of the given document, but it is also possible to output
> some easily accessible metadata (image size, etc.) already during this
> phase. The content extraction phase is given the binary stream and the
> detected typing information, and the expected output is the structured
> text content and any available metadata from the document.
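The two phases described above could be captured by a pair of small interfaces.
This is only a sketch of the shape of the contract; the names are hypothetical,
not an actual Tika API:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Collections;
import java.util.Map;
import java.util.Scanner;

// Hypothetical interfaces for the two processing phases; names are illustrative only.
interface Detector {
    /** Inspect the stream (and any caller-supplied hints) and return a content type. */
    String detect(InputStream stream, Map<String, String> hints);
}

interface Extractor {
    /** Given the stream and the detected type, produce text and metadata. */
    Result extract(InputStream stream, String contentType);
}

class Result {
    final String text;
    final Map<String, String> metadata;
    Result(String text, Map<String, String> metadata) {
        this.text = text;
        this.metadata = metadata;
    }
}

// A trivial extractor for plain text, just to show the shape of the contract.
class PlainTextExtractor implements Extractor {
    public Result extract(InputStream stream, String contentType) {
        String text = new Scanner(stream).useDelimiter("\\A").next();  // slurp the stream
        return new Result(text, Collections.singletonMap("Content-Type", contentType));
    }
}
```

The detection phase's output (the content type, plus any cheap metadata) becomes
the input to the extraction phase, which fits the pipeline idea in the next point.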
> * processing pipeline: There was a quick idea on possibly organizing
> the Tika framework as a pipeline of content detection and extraction
> components.
> * plugin framework: We should design Tika so that existing plugin
> frameworks can easily be used to assemble and configure Tika
> components.
> * structured text: We should optimally use XHTML Basic as a SAX event
> stream as the "structured text content" to be extracted from
> documents. Namespaces can be used to extend the format to contain
> extra metadata like PDF offsets for highlighting. Other options
> mentioned were using plain text, custom annotation or markup
> mechanisms, full document formats like ODF, or perhaps a DOM tree.
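Emitting the structured text as XHTML SAX events might look roughly like the
following, using only the JDK's org.xml.sax classes (the emitter class and
method are hypothetical):

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: a parser emitting its extracted text as XHTML SAX events.
public class XhtmlEmitter {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Emit one paragraph of extracted text as SAX events to the given handler.
    public static void emitParagraph(ContentHandler handler, String text) throws SAXException {
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startElement(XHTML, "p", "p", noAttrs);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement(XHTML, "p", "p");
    }

    public static void main(String[] args) throws SAXException {
        StringBuilder out = new StringBuilder();
        // A minimal handler that concatenates the character events it receives.
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        emitParagraph(collector, "Extracted text");
        System.out.println(out);  // Extracted text
    }
}
```

Because the output is an event stream rather than a built tree, a consumer can
index, serialize, or discard the content without buffering the whole document.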
> * additional processing components: It should be possible to embed
> additional content extraction tools like thumbnail generators, image
> index generators, etc. as plugins in the Tika framework. We might even
> want to add support for tools like virus and spam detectors.
> * container formats: It should be possible to use Tika recursively to
> process container formats like zip files or many of the video formats.
> The same mechanism could also be used to handle compressed or
> encrypted files.
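The recursive handling of container formats can be sketched with java.util.zip:
each entry is handed back to the same routine, which recurses when the entry is
itself a container (class and method names are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Sketch of recursive container handling: walk a zip stream and hand each
// entry back to the same processing routine, recursing into nested zips.
public class ZipWalker {

    public static void walk(InputStream in, String name, List<String> found) throws IOException {
        if (name.endsWith(".zip")) {
            ZipInputStream zip = new ZipInputStream(in);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                walk(zip, entry.getName(), found);  // the entry stream ends at the entry boundary
            }
        } else {
            found.add(name);  // a leaf document: hand it to the normal extraction path
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a zip containing another zip entirely in memory.
        ByteArrayOutputStream innerBytes = new ByteArrayOutputStream();
        try (ZipOutputStream inner = new ZipOutputStream(innerBytes)) {
            inner.putNextEntry(new ZipEntry("doc.txt"));
            inner.write("hello".getBytes());
        }
        ByteArrayOutputStream outerBytes = new ByteArrayOutputStream();
        try (ZipOutputStream outer = new ZipOutputStream(outerBytes)) {
            outer.putNextEntry(new ZipEntry("inner.zip"));
            outer.write(innerBytes.toByteArray());
        }
        List<String> found = new ArrayList<>();
        walk(new ByteArrayInputStream(outerBytes.toByteArray()), "outer.zip", found);
        System.out.println(found);  // [doc.txt]
    }
}
```

The same dispatch-by-type recursion would cover compressed or encrypted wrappers:
the wrapper handler decodes the stream and feeds the result back into the framework.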
> * integration: We should design the output format of Tika so that it
> is easy to map to whatever Lucene, Solr, UIMA, and other similar
> projects expect as input.
> * security: Tika should contain some safeguards against
> denial-of-service attacks that trick a parser library to spend excess
> memory or processing power on a single document.
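One simple safeguard of this kind is to cap how many bytes a parser may consume
from a single document before the framework aborts it. A minimal sketch (the
class name is hypothetical, not an actual Tika API):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of a DoS safeguard: abort once a parser has read more than a fixed budget.
public class BoundedInputStream extends FilterInputStream {
    private long remaining;

    public BoundedInputStream(InputStream in, long maxBytes) {
        super(in);
        this.remaining = maxBytes;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            throw new IOException("document exceeded the configured size limit");
        }
        int b = super.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            throw new IOException("document exceeded the configured size limit");
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n != -1) remaining -= n;
        return n;
    }
}
```

A processing-time budget (e.g. running each parser with a watchdog timer) would
be the complementary guard against CPU-exhaustion rather than memory-exhaustion.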
> * parser libraries: We might want to encourage external parser
> libraries to join us at the ASF, but Tika by itself should not try to
> reimplement or compete with existing parsers.
> BR,
> Jukka Zitting
