nutch-dev mailing list archives

From "Mattmann, Chris A (3980)"
Subject Re: Generic xsl parser plugin
Date Fri, 03 Oct 2014 04:59:47 GMT
Agree with Sebastian, if we could make this part of Nutch it
would be great, as I think it would help us do page scraping
a lot better!

What do you think Albin?


Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA

-----Original Message-----
From: Sebastian Nagel
Date: Thursday, October 2, 2014 at 3:03 PM
Subject: Re: Generic xsl parser plugin

>Hi Albin,
>the plugin looks very nice!
>I like the clean and extensible way in which
>fields are filled by XPath statements.
>Using XSLT functions to do the cleansing
>of extracted text (you can hardly ever do without it!)
>is an excellent idea!
>I hope to find the time soon to look at it in more detail
>and give it a trial.
>Even more, I would like to see the plugin as part of Nutch.
>Are you willing to open a Jira for it and provide a patch?
>Thanks a lot,
>On 10/02/2014 10:26 AM, Albinscode wrote:
>> Hi all,
>> I've created two posts on my blog to describe and use the xsl plugin:
>> The source code is available on
>> I'll update the google code wiki to gather information from my blog.
>> If you have any comments, feel free to share them.
>> As I'm currently using it to crawl different web sites related to
>> searching friends I'll have lots of examples to provide.
>> Have a nice day!
>> Albin
>> 2014-09-25 16:18 GMT+02:00 Albin Vigier:
>>     Ok, perfect, so I didn't waste my time. I'm finishing my basic
>>     implementation for my own needs
>>     and I'll post it to google code or other repo if the community is
>>     I'll work on a small doc too.
>>     Thank you for your answer.
>>     On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche wrote:
>>         Hi Albin,
>>         You don't have to have a separate plugin for each html
>>         structure you want to parse. You can
>>         have a single plugin with multiple HTMLParseFilters.
>>         Having a generic extractor with the extraction logic configured
>>         in an external file is
>>         definitely a good idea and would make a great contribution to
>>         the project. In a nutshell,
>>         you haven't missed anything and that wheel definitely needs
>>         inventing ;-)
>>         Best
>>         Julien
>>         On 25 September 2014 09:24, Albin Vigier wrote:
>>             Hello everybody,
>>             I'm just wondering if it is possible to fetch specific
>>             metadata with an existing nutch plugin.
>>             Let's take an example.
>>             I want to extract some metadata from "div" or "td" tags
>>             from html pages that have specific ids and name them the
>>             way I like (this is done at parser time).
>>             Then, at indexer time, I would use index-metadata (a very
>>             good plugin) to add my custom metadata.
>>             Currently from what I've seen on the wiki and by quickly
>>             plugins I suppose I have to code my own plugin each time
>>             I've got a new site (with a new html structure). I've
>>             already done that by using a node walker in a custom
>>             htmlParseFilter but the extraction can be a
>>             little bit boring :)
>>             So on my side I've coded a little plugin that enables me to
>>             xpaths in an xml file. But before diving into more
>>             functionalities I'm just wondering if I have missed something.
>>             This work allowed me to explore some nutch aspects but I
>>             don't want to reinvent the wheel or miss something.
>>             Albin
>>         -- 
>>         Open Source Solutions for Text Engineering
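
To make the idea discussed in this thread concrete (fields filled by XPath expressions kept outside the code, with XSLT functions cleansing the extracted text), here is a minimal sketch that uses only the JDK's built-in XSLT 1.0 support. It is not the plugin's code and does not use any Nutch API. The class name, the rules stylesheet, the field names "title" and "price", and the sample page are all hypothetical. It also assumes the input is well-formed XHTML; real crawled HTML would first have to go through a tolerant HTML parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslFieldExtractionSketch {

    // Hypothetical extraction rules. In practice this would live in an
    // external file so a new site only needs a new stylesheet, not new code.
    // Each <field> is filled by one XPath expression; normalize-space() and
    // translate() are standard XPath 1.0 functions used here for cleansing.
    static final String RULES =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<fields>"
      + "<field name='title'>"
      + "<xsl:value-of select=\"normalize-space(//div[@id='title'])\"/>"
      + "</field>"
      + "<field name='price'>"
      + "<xsl:value-of select=\"translate(normalize-space(//td[@id='price']), '$', '')\"/>"
      + "</field>"
      + "</fields>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical page: a "div" and a "td" identified by their ids,
    // with the kind of stray whitespace that cleansing has to deal with.
    static final String SAMPLE_PAGE =
        "<html><body>"
      + "<div id='title'>  Hello \n  World </div>"
      + "<table><tr><td id='price'> $42 </td></tr></table>"
      + "</body></html>";

    // Applies the rules stylesheet to a well-formed XHTML string and
    // returns the resulting <fields> document as text.
    static String extract(String xhtml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(RULES)));
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(extract(SAMPLE_PAGE));
    }
}
```

In this sketch, supporting a new HTML structure only means writing a different stylesheet. The named fields in the output could then be copied into parse metadata by a single parse filter, which is the role the thread describes for the plugin.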
