tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Using Tika/Nutch to analyze a website
Date Mon, 16 Apr 2007 07:46:25 GMT
Hi,

On 4/16/07, Ian Holsman <lists@holsman.net> wrote:
> I was planning on using Nutch and UIMA to perform entity
> extraction, and noticed that you mention that Tika would be designed
> to do this.
>
> I was wondering how things were going with Tika, as it doesn't seem
> like there is any code or design plans checked in (except for the
> proposal).

Thanks for the interest! As you noticed, we're just getting started
and haven't yet achieved much.

> So I would like to spark the discussion.
>
> I would like to:
> - use Nutch to fetch the pages (HTML) from the site
> - use UIMA to analyze them and extract interesting information
> - use MySQL, or possibly HBase, to store versioned/historical output
> of this analysis for possible further reporting (stats and page
> timelines)
>
> Is Tika going to be able to do this for me?

Certainly not all of it. In this scheme Tika would most naturally fit
as a component used by UIMA to parse the HTML pages. The main benefit
of using Tika instead of a native HTML parser in this case would be
that you could easily extend the application to also analyze other
types of documents, such as PDFs.
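
To make the idea concrete, here is a rough sketch of what such a
parsing layer could look like. This is purely illustrative and not
Tika code (nothing has been checked in yet): TextExtractor,
HtmlTextExtractor and ExtractorRegistry are hypothetical names, and
the regex-based tag stripping is only a stand-in for a real HTML
parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface, not an actual Tika API: one extractor per document type.
interface TextExtractor {
    String extractText(byte[] content);
}

// Naive HTML extractor for illustration only: strips tags with a regex.
class HtmlTextExtractor implements TextExtractor {
    public String extractText(byte[] content) {
        String html = new String(content, StandardCharsets.UTF_8);
        // Replace every tag with a space, then collapse runs of whitespace.
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}

// Picks an extractor by MIME type, so the caller never depends on a specific parser.
class ExtractorRegistry {
    private final Map<String, TextExtractor> byType = new HashMap<>();

    void register(String mimeType, TextExtractor extractor) {
        byType.put(mimeType, extractor);
    }

    String extract(String mimeType, byte[] content) {
        TextExtractor extractor = byType.get(mimeType);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for " + mimeType);
        }
        return extractor.extractText(content);
    }
}

public class ParsingLayerSketch {
    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("text/html", new HtmlTextExtractor());
        // Supporting PDFs later would only mean registering another extractor here.

        byte[] page = "<html><body><h1>Hello</h1><p>World</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("text/html", page)); // prints "Hello World"
    }
}
```

The analysis code only ever sees plain text keyed by content type, so
adding a new document format would not require touching it.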

BR,

Jukka Zitting
