nutch-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Generic xsl parser plugin
Date Fri, 03 Oct 2014 04:59:47 GMT
Agree with Sebastian, if we could make this part of Nutch it
would be great, as I think it would help us do page scraping
a lot better!

What do you think, Albin?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Sebastian Nagel <wastl.nagel@googlemail.com>
Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Date: Thursday, October 2, 2014 at 3:03 PM
To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: Re: Generic xsl parser plugin

>Hi Albin,
>
>the plugin looks very nice!
>I like the clean and extensible way
>fields are filled by XPath statements.
>Using XSLT functions to cleanse the
>extracted text (you can hardly ever do without it!)
>is an excellent idea!
>
>I hope to find the time soon to look at it in more detail
>and give it a trial.
>
>Even more, I would like to see the plugin become part of Nutch.
>Are you willing to open a Jira issue for it and provide a patch?
>
>Thanks a lot,
>Sebastian
>
>On 10/02/2014 10:26 AM, Albinscode wrote:
>> Hi all,
>> 
>> I've created two posts on my blog to describe and use the xsl plugin:
>> 
>> http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
>> http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/
>> 
>> The source code is available at
>> https://code.google.com/p/nutch-parse-xsl-plugin/.
>> I'll update the Google Code wiki to gather information from my blog.
>> 
>> If you have any comments, feel free to share them.
>> As I'm currently using it to crawl different web sites related to
>> searching friends, I'll have lots of examples to provide.
>> 
>> Have a nice day!
>> 
>> Albin
>> 
>> 2014-09-25 16:18 GMT+02:00 Albin Vigier <albinscode@gmail.com>:
>> 
>>     Ok, perfect, so I didn't waste my time. I'm finishing my basic
>>     implementation for my own needs and I'll post it to Google Code or
>>     another repo if the community is interested.
>>     I'll work on a small doc too.
>>     Thank you for your answer.
>> 
>>     On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche
>>     <lists.digitalpebble@gmail.com> wrote:
>> 
>>         Hi Albin,
>> 
>>         You don't have to have a separate plugin for each HTML structure
>>         you want to parse. You can have a single plugin with multiple
>>         HTMLParseFilters.
>> 
>>         Having a generic extractor with the extraction logic configured in
>>         an external file is definitely a good idea and would make a great
>>         contribution to the project. In a nutshell, you haven't missed
>>         anything and that wheel definitely needs inventing ;-)
>> 
>>         Best
>> 
>>         Julien
>> 
>> 
>>         On 25 September 2014 09:24, Albin Vigier <albinscode@gmail.com>
>>         wrote:
>> 
>>             Hello everybody,
>> 
>>             I'm just wondering if it is possible to fetch specific metadata
>>             with an existing Nutch plugin.
>> 
>>             Let's take an example.
>>             I want to extract some metadata from "div" or "td" tags with
>>             specific ids in HTML pages and name them the way I like (this
>>             is done at parser time).
>>             Then, at indexer time, I would use index-metadata (a very good
>>             plugin) to add my custom metadata.
>> 
>>             Currently, from what I've seen on the wiki and by quickly
>>             analyzing plugins, I suppose I have to code my own plugin each
>>             time I've got a new site (with a new HTML structure). I've
>>             already done that by using a node walker in a custom
>>             htmlParseFilter, but the extraction can be a little bit boring :)
>> 
>>             So on my side I've coded a little plugin that enables me to
>>             specify XPaths in an XML file. But before diving into more
>>             functionality, I'm just wondering whether I missed something.
>>             This work allowed me to explore some aspects of Nutch, but I
>>             don't want to reinvent the wheel or miss something.
>> 
>>             Albin
>> 
>> 
>> 
>> 
>>         -- 
>>         Open Source Solutions for Text Engineering
>> 
>>         http://digitalpebble.blogspot.com/
>>         http://www.digitalpebble.com
>>         http://twitter.com/digitalpebble
>> 
>> 
>> 
>
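
Note for readers of this archived thread: the idea discussed above (extraction
rules kept as XPath expressions in an external XML file, instead of one
hand-written plugin per site) can be sketched outside Nutch. What follows is
only an illustration under stated assumptions, not Albin's plugin and not
Nutch API code. The rule format (<field name="..." xpath="..."/>), the sample
page, and the function names load_rules and extract are all hypothetical. It
uses Python's standard xml.etree.ElementTree, which supports only a subset of
XPath and needs well-formed markup, and plain whitespace normalization stands
in for the XSLT-based cleansing Sebastian mentions.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file: each <field> maps a metadata name to an XPath.
# This format is invented for illustration; it is not the plugin's format.
RULES_XML = """
<rules>
  <field name="title" xpath=".//div[@id='title']"/>
  <field name="price" xpath=".//td[@id='price']"/>
  <field name="author" xpath=".//div[@id='author']"/>
</rules>
"""

# Sample page (assumed to be well-formed, since ElementTree is an XML parser).
PAGE = """
<html><body>
  <div id="title">  Blue   widget </div>
  <table><tr><td id="price"> 12.50 EUR </td></tr></table>
</body></html>
"""


def load_rules(rules_xml):
    """Read the rules document into a {field name: XPath} dictionary."""
    root = ET.fromstring(rules_xml.strip())
    return {f.get("name"): f.get("xpath") for f in root.findall("field")}


def extract(page_xml, rules):
    """Apply each XPath to the page and collect cleaned text per field."""
    doc = ET.fromstring(page_xml.strip())
    metadata = {}
    for name, xpath in rules.items():
        node = doc.find(xpath)  # first matching element, or None
        if node is None:
            continue  # the field is simply absent on this page
        text = "".join(node.itertext())
        # Collapse runs of whitespace; a stand-in for real cleansing logic.
        metadata[name] = " ".join(text.split())
    return metadata


metadata = extract(PAGE, load_rules(RULES_XML))
print(metadata)
```

Supporting a new site then means writing a new rules file rather than new
code, which is the point Julien makes about keeping the extraction logic in
an external file.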

