Hi,
On 4/16/07, Ian Holsman <lists@holsman.net> wrote:
> I was planning on using Nutch and UIMA to perform entity
> extraction, and noticed that you mention that Tika would be designed
> to do this.
>
> I was wondering how things were going with Tika, as it doesn't seem
> like there is any code/design plans checked in (except for the
> proposal).
Thanks for the interest! As you noticed, we're just getting started
and haven't yet achieved much.
> So I would like to spark the discussion.
>
> I would like to:
> - use Nutch to fetch the pages (HTML) from the site
> - use UIMA to analyze them and extract interesting information
> - use MySQL, or possibly HBase, to store versioned/historical output of
> this analysis, for possible further reporting on (stats, and page
> timelines)
>
> Is Tika going to be able to do this for me?
Certainly not all of it. In this scheme Tika would most naturally fit
as a component used by UIMA to parse the HTML pages. The main benefit
of using Tika instead of a native HTML parser in this case would be
that you could easily extend the application to analyze other
document types as well, such as PDFs.
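
To make the idea a bit more concrete, here is a rough sketch of the kind
of arrangement I have in mind. Since Tika has no code yet, every name
below (TextExtractor, NaiveHtmlExtractor, ParserRegistry, ParserSketch)
is hypothetical and is not an actual Tika or UIMA API. The tag-stripping
regex is only a stand-in for a real HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: turns the raw bytes of one document format into plain text.
interface TextExtractor {
    String extractText(byte[] content);
}

// Illustration only: strips tags with a regex. A real extractor would use a proper HTML parser.
class NaiveHtmlExtractor implements TextExtractor {
    public String extractText(byte[] content) {
        String html = new String(content, StandardCharsets.UTF_8);
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}

// Hypothetical facade: the analysis side only ever calls parse(),
// so it does not need to know which formats are supported.
class ParserRegistry {
    private final Map<String, TextExtractor> extractors = new HashMap<>();

    void register(String mimeType, TextExtractor extractor) {
        extractors.put(mimeType, extractor);
    }

    String parse(String mimeType, byte[] content) {
        TextExtractor extractor = extractors.get(mimeType);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for " + mimeType);
        }
        return extractor.extractText(content);
    }
}

class ParserSketch {
    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/html", new NaiveHtmlExtractor());

        byte[] page = "<html><body><h1>Hello</h1><p>Tika</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // The plain text returned here is what would be handed on to the analysis step.
        System.out.println(registry.parse("text/html", page));
    }
}
```

With this shape, adding PDF support would mean registering one more
extractor under "application/pdf"; the code that feeds text into the
analysis would not need to change.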
BR,
Jukka Zitting