nutch-dev mailing list archives

From Albin Vigier <albinsc...@gmail.com>
Subject Re: Generic xsl parser plugin
Date Thu, 25 Sep 2014 08:24:51 GMT
Hello everybody,

I'm wondering whether it is possible to extract specific metadata
with an existing Nutch plugin.

Let's take an example.
I want to extract some metadata from "div" or "td" tags that have
specific ids in HTML pages, and name the resulting fields the way I
like (this is done at parse time).
Then, at index time, I would use index-metadata (a very good plugin)
to add my custom metadata.

Currently, from what I've seen on the wiki and from a quick look at
the plugins, I suppose I have to code my own plugin each time I have a
new site (with a new HTML structure). I've already done that by using
a node walker in a custom HtmlParseFilter, but writing the extraction
by hand each time is a little bit boring :)

So on my side I've coded a little plugin that lets me specify
XPaths in an XML file. But before adding more functionality, I'm
wondering whether I have missed an existing solution.
This work allowed me to explore some aspects of Nutch, but I don't
want to reinvent the wheel.
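
To make the idea concrete, here is a minimal sketch of XPath-driven
extraction using only the JDK's javax.xml.xpath API. It is not the
plugin's actual code and does not use any Nutch classes: the class
name XPathMetadataSketch, the field names and the sample page are all
hypothetical, and the rules are held in a Map where the plugin would
read them from an XML file. It also assumes well-formed markup; in a
real parse filter the DOM would come from Nutch's HTML parser instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

/** Hypothetical illustration: maps metadata names to XPath expressions. */
public class XPathMetadataSketch {

    /** Parses a well-formed (X)HTML string into a DOM document. */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    /**
     * Evaluates every rule (metadata name -> XPath) against the document
     * and returns the non-empty, trimmed text values keyed by metadata name.
     */
    public static Map<String, String> extract(Document doc, Map<String, String> rules)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // evaluate(String, Object) returns the string value of the first match
            String value = xpath.evaluate(rule.getValue(), doc).trim();
            if (!value.isEmpty()) {
                metadata.put(rule.getKey(), value);
            }
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page; a real site would need its own rules.
        Document doc = parse(
                "<html><body>"
                + "<div id=\"price\"> 42 EUR </div>"
                + "<table><tr><td id=\"ref\">AB-123</td></tr></table>"
                + "</body></html>");

        // Stand-in for the per-site XML rules file (names are assumptions).
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("product_price", "//div[@id='price']");
        rules.put("product_ref", "//td[@id='ref']");

        extract(doc, rules).forEach((name, value) ->
                System.out.println(name + " = " + value));
    }
}
```

With this shape, supporting a new site only means writing a new set
of rules, and the extracted names are the ones index-metadata would
then be configured to index.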

Albin
