nutch-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Generic xsl parser plugin
Date Mon, 06 Oct 2014 07:27:06 GMT
Great work!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Albinscode <albinscode@gmail.com>
Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Date: Sunday, October 5, 2014 at 1:09 PM
To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: Re: Generic xsl parser plugin

>@Chris Thank you for your suggestion too.
>
>As requested I've created the
>https://issues.apache.org/jira/browse/NUTCH-1870 and provided a patch.
>
>Feel free to give me feedback. I'll continue working on my branch ;)
>
>2014-10-03 10:03 GMT+02:00 Albinscode <albinscode@gmail.com>:
>> Hello Sebastian,
>>
>> Thank you for having taken a look at the global mechanism.
>> I've tried to make it as simple as possible, to keep the focus on
>> "what to extract?".
>>
>> Currently I've got lots of needs (and so ideas). The code will
>> naturally evolve (support of XSLT 2.0) and I would be happy to fully
>> give this code to the community.
>>
>> Of course, I'll create a JIRA and prepare a patch. I'll take the time
>> to make it as clean as possible.
>>
>> Thank you for your interest.
>>
>> 2014-10-03 6:59 GMT+02:00 Mattmann, Chris A (3980)
>> <chris.a.mattmann@jpl.nasa.gov>:
>>> Agree with Sebastian, if we could make this part of Nutch it
>>> would be great, as I think it would help us do page scraping
>>> a lot better!
>>>
>>> What do you think Albin?
>>>
>>> Cheers,
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Sebastian Nagel <wastl.nagel@googlemail.com>
>>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Date: Thursday, October 2, 2014 at 3:03 PM
>>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Subject: Re: Generic xsl parser plugin
>>>
>>>>Hi Albin,
>>>>
>>>>the plugin looks very nice!
>>>>I like the clean and extensible way in which
>>>>fields are filled by XPath statements.
>>>>To use XSLT functions to do the cleansing
>>>>of extracted text (you hardly ever can do without!)
>>>>is an excellent idea!
>>>>
>>>>I hope to find the time soon to look at it in more detail
>>>>and give it a trial.
>>>>
>>>>Even more I would like to see the plugin as part of Nutch.
>>>>Are you willing to open a Jira for it and provide a patch?
>>>>
>>>>Thanks a lot,
>>>>Sebastian
>>>>
>>>>On 10/02/2014 10:26 AM, Albinscode wrote:
>>>>> Hi all,
>>>>>
>>>>> I've created two posts on my blog to describe and use the xsl plugin:
>>>>>
>>>>>http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
>>>>> 
>>>>>http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/
>>>>>
>>>>> The source code is available on
>>>>>https://code.google.com/p/nutch-parse-xsl-plugin/.
>>>>> I'll update the google code wiki to gather information from my blog.
>>>>>
>>>>> If you have any comments, feel free to share them.
>>>>> As I'm currently using it to crawl different web sites related to
>>>>> searching friends, I'll have lots of examples to provide.
>>>>>
>>>>> Have a nice day!
>>>>>
>>>>> Albin
>>>>>
>>>>> 2014-09-25 16:18 GMT+02:00 Albin Vigier <albinscode@gmail.com>:
>>>>>
>>>>>     Ok, perfect, so I didn't waste my time. I'm finishing my basic
>>>>>     implementation for my own needs and I'll post it to google code
>>>>>     or other repo if the community is interested.
>>>>>     I'll work on a small doc too.
>>>>>     Thank you for your answer.
>>>>>
>>>>>     On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche
>>>>>     <lists.digitalpebble@gmail.com> wrote:
>>>>>
>>>>>         Hi Albin,
>>>>>
>>>>>         You don't have to have a separate plugin for each html
>>>>>         structure you want to parse. You can have a single plugin
>>>>>         with multiple HTMLParseFilters.
>>>>>
>>>>>         Having a generic extractor with the extraction logic
>>>>>         configured in an external file is definitely a good idea and
>>>>>         would make a great contribution to the project. In a nutshell,
>>>>>         you haven't missed anything and that wheel definitely needs
>>>>>         inventing ;-)
>>>>>
>>>>>         Best
>>>>>
>>>>>         Julien
>>>>>
>>>>>
>>>>>         On 25 September 2014 09:24, Albin Vigier
>>>>>         <albinscode@gmail.com> wrote:
>>>>>
>>>>>             Hello everybody,
>>>>>
>>>>>             I'm just wondering if it is possible to fetch specific
>>>>>             metadata with an existing nutch plugin.
>>>>>
>>>>>             Let's take an example.
>>>>>             I want to extract some metadata from "div" or "td" tags
>>>>>             from html pages that have specific ids and name them the
>>>>>             way I like (this is done at parser time).
>>>>>             Then, at indexer time, I would use index-metadata (a very
>>>>>             good plugin) to add my custom metadata.
>>>>>
>>>>>             Currently from what I've seen on the wiki and by quickly
>>>>>             analyzing plugins I suppose I have to code my own plugin
>>>>>             each time I've got a new site (with a new html structure).
>>>>>             I've already done that by using a node walker in a custom
>>>>>             htmlParseFilter but the extraction can be a little bit
>>>>>             boring :)
>>>>>
>>>>>             So on my side I've coded a little plugin that enables me
>>>>>             to specify xpaths in an xml file. But before diving into
>>>>>             more functionalities I'm just wondering if I have missed
>>>>>             something.
>>>>>             This work allowed me to explore some nutch aspects but I
>>>>>             don't want to reinvent the wheel or miss something.
>>>>>
>>>>>             Albin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>         --
>>>>>         Open Source Solutions for Text Engineering
>>>>>
>>>>>         http://digitalpebble.blogspot.com/
>>>>>         http://www.digitalpebble.com
>>>>>         http://twitter.com/digitalpebble
>>>>>
>>>>>
>>>>>
>>>>
>>>
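
The idea that opens the thread (XPath expressions kept in an external XML file, each mapped to a metadata field name of the author's choosing) can be sketched roughly as below. This is not the code attached to NUTCH-1870 and does not use any Nutch API. The rules-file format, the sample page, and the names `load_rules` and `extract_metadata` are all hypothetical, chosen only for illustration. The sketch uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so a real crawler would need a tolerant HTML parser in front of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file: one <field> per metadata entry, naming the field
# and giving the XPath used to locate its value. The format is an assumption.
RULES_XML = """
<rules>
  <field name="author" xpath=".//div[@id='author']"/>
  <field name="price" xpath=".//td[@id='price']"/>
</rules>
"""

# Hypothetical, well-formed sample page with "div" and "td" tags carrying ids.
PAGE_XHTML = """
<html><body>
  <div id="author"> Jane Doe </div>
  <table><tr><td id="price">42 EUR</td></tr></table>
</body></html>
"""


def load_rules(rules_xml):
    """Return a dict mapping each field name to its XPath expression."""
    root = ET.fromstring(rules_xml)
    return {f.get("name"): f.get("xpath") for f in root.findall("field")}


def extract_metadata(page_xhtml, rules):
    """Apply every rule to the page; fields with no matching node are skipped."""
    doc = ET.fromstring(page_xhtml)
    metadata = {}
    for name, xpath in rules.items():
        node = doc.find(xpath)
        if node is not None:
            # Join all text below the node and trim surrounding whitespace.
            metadata[name] = "".join(node.itertext()).strip()
    return metadata


if __name__ == "__main__":
    print(extract_metadata(PAGE_XHTML, load_rules(RULES_XML)))
```

Supporting a new site with a different html structure then means editing only the rules file, which is the point Julien makes about keeping the extraction logic in an external file.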

