nutch-dev mailing list archives

From Albin Vigier <albinsc...@gmail.com>
Subject Re: Generic xsl parser plugin
Date Thu, 25 Sep 2014 14:18:51 GMT
OK, perfect, so I didn't waste my time. I'm finishing a basic
implementation for my own needs, and I'll post it to Google Code or another
repository if the community is interested.
I'll also write some short documentation.
Thank you for your answer.

On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Albin,
>
> You don't need a separate plugin for each HTML structure you want
> to parse. You can have a single plugin with multiple HtmlParseFilters.
>
> Having a generic extractor with the extraction logic configured in an
> external file is definitely a good idea and would make a great contribution
> to the project. In a nutshell, you haven't missed anything and that wheel
> definitely needs inventing ;-)
>
> Best
>
> Julien
>
>
> On 25 September 2014 09:24, Albin Vigier <albinscode@gmail.com> wrote:
>
>> Hello everybody,
>>
>> I'm just wondering if it is possible to fetch specific metadata with
>> an existing nutch plugin.
>>
>> Let's take an example.
>> I want to extract some metadata from "div" or "td" tags that have
>> specific IDs in HTML pages, and name the resulting fields the way I
>> like (this is done at parse time).
>> Then, at indexing time, I would use index-metadata (a very good plugin)
>> to add my custom metadata to the index.
>>
>> From what I've seen on the wiki and from a quick look at the existing
>> plugins, I suppose I have to code my own plugin each time I've got a
>> new site (with a new HTML structure). I've already done that by using
>> a node walker in a custom HtmlParseFilter, but writing the extraction
>> can be a little bit tedious :)
>>
>> So on my side I've coded a little plugin that lets me specify
>> XPaths in an XML file. But before adding more functionality, I'm
>> just wondering whether I have missed something.
>> This work allowed me to explore some aspects of Nutch, but I don't want
>> to reinvent the wheel or overlook an existing solution.
>>
>> Albin
>>
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
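
The idea discussed in this thread (field names mapped to XPath expressions
in an external XML file, applied to the parsed page) could look roughly like
the sketch below. It is not Albin's plugin and it does not use the Nutch API:
the class name, the rules file layout (`rules`, `field`, `name`, `xpath`) and
the sample page are all assumptions made for illustration, and the page is
assumed to be well-formed XHTML so that the JDK parser can build a DOM from it.
Inside Nutch, the same XPath evaluation would presumably run against the DOM
passed to an HtmlParseFilter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XPathMetadataSketch {

    // Hypothetical rules file: one <field> per metadata name (assumed layout).
    public static final String RULES =
        "<rules>"
      + "<field name=\"price\" xpath=\"//td[@id='price']\"/>"
      + "<field name=\"author\" xpath=\"//div[@id='author']\"/>"
      + "</rules>";

    // Sample page, assumed to be well-formed XHTML.
    public static final String PAGE =
        "<html><body>"
      + "<div id=\"author\">Jane Doe</div>"
      + "<table><tr><td id=\"price\">42 EUR</td></tr></table>"
      + "</body></html>";

    // Parse an XML string into a DOM document.
    public static Document parseXml(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Read the rules into an ordered map: metadata name -> XPath expression.
    public static Map<String, String> loadRules(String rulesXml) throws Exception {
        Map<String, String> rules = new LinkedHashMap<>();
        NodeList fields = parseXml(rulesXml).getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            rules.put(field.getAttribute("name"), field.getAttribute("xpath"));
        }
        return rules;
    }

    // Evaluate every rule against the page; keep only non-empty results.
    public static Map<String, String> extract(Document page,
                                              Map<String, String> rules)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // The two-argument evaluate() returns the string value of the match.
            String value = xpath.evaluate(rule.getValue(), page).trim();
            if (!value.isEmpty()) {
                metadata.put(rule.getKey(), value);
            }
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = extract(parseXml(PAGE), loadRules(RULES));
        // Prints "price = 42 EUR" and then "author = Jane Doe".
        metadata.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

With a layout like this, supporting a new site would mean adding `field`
entries to the rules file rather than writing another plugin, which is the
point Julien makes about keeping the extraction logic in an external file.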
