tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Using Tika/Nutch to analyze a website
Date Mon, 16 Apr 2007 07:46:25 GMT

On 4/16/07, Ian Holsman <lists@holsman.net> wrote:
> I was planning on using Nutch and UIMA to perform entity
> extraction, and noticed that you mention that Tika would be designed
> to do this.
> I was wondering how things were going with Tika, as it doesn't seem
> like there are any code/design plans checked in (except for the
> proposal).

Thanks for the interest! As you noticed, we're just getting started
and haven't yet achieved much.

> So I would like to spark the discussion.
> I would like to:
> - use Nutch to fetch the pages (HTML) from the site
> - use UIMA to analyze them and extract interesting information
> - use MySQL, or possibly HBase, to store versioned/historical output of
> this analysis, for possible further reporting (stats and page
> timelines)
> Is Tika going to be able to do this for me?

Certainly not all of it. In this scheme Tika would most naturally fit
as a component that UIMA uses to parse the HTML pages. The main benefit
of using Tika instead of a dedicated HTML parser in this case would be
that you could easily extend the application to analyze other
document types as well, such as PDFs.
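
To make the idea concrete, here is a rough sketch in Java of what
"easily extend" could look like from the application's side. Everything
in it is hypothetical and for illustration only: the TextExtractor
interface, the registry class, and the regex-based HTML stripping are
assumptions of mine, not Tika code or a proposed Tika API. The point is
only that the analysis step asks for text by content type, so support
for another format is one more registration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ParserRegistrySketch {

    /** Hypothetical interface: turns raw document bytes into plain text. */
    public interface TextExtractor {
        String extractText(byte[] content);
    }

    /** Naive HTML extractor that strips tags with a regex (illustration only). */
    public static class NaiveHtmlExtractor implements TextExtractor {
        @Override
        public String extractText(byte[] content) {
            String html = new String(content, StandardCharsets.UTF_8);
            // Replace tags with spaces, then collapse runs of whitespace.
            return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        }
    }

    // Maps a content type such as "text/html" to the extractor that handles it.
    private final Map<String, TextExtractor> extractors = new HashMap<>();

    public void register(String contentType, TextExtractor extractor) {
        extractors.put(contentType, extractor);
    }

    /** The analysis step calls this without knowing which format it is handling. */
    public String extractText(String contentType, byte[] content) {
        TextExtractor extractor = extractors.get(contentType);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for " + contentType);
        }
        return extractor.extractText(content);
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = new ParserRegistrySketch();
        registry.register("text/html", new NaiveHtmlExtractor());
        // A PDF extractor would be added the same way, e.g. under "application/pdf".

        byte[] page = "<html><body><h1>Hello</h1><p>Tika</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extractText("text/html", page)); // prints "Hello Tika"
    }
}
```

The UIMA side would then only ever see plain text, regardless of where
it came from.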


Jukka Zitting
