nutch-dev mailing list archives

From Albinscode <albinsc...@gmail.com>
Subject Re: Generic xsl parser plugin
Date Thu, 25 Sep 2014 19:43:02 GMT
Oh thanks Nima, I did find this topic last year but I thought the project
was dead. I think there is a brief reference to it in the nutch wiki too,
but I cannot find it now.

It looks like we have the same xsl approach, so it could be interesting to
share. I'll try to contact Emir while I continue documenting my small
plugin.

Thanks again for the valuable information!
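
To illustrate the approach discussed in this thread (extraction rules kept
in an external XML file rather than coded per site), here is a minimal
sketch. It is not the code of either plugin: the rules file layout, the
field names, the sample page and the function names are all assumptions
made up for the example, and it is written in Python with the standard
library instead of as a Nutch parse filter. It also assumes well-formed
markup; real pages would need a tolerant HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file: each <field> maps a metadata name to an XPath.
RULES_XML = """
<fields>
  <field name="author" xpath=".//div[@id='author']"/>
  <field name="price" xpath=".//td[@id='price']"/>
</fields>
"""

# Well-formed sample page, invented for this example.
PAGE = """
<html><body>
  <div id="author">Jane Doe</div>
  <table><tr><td id="price">42 EUR</td><td id="other">n/a</td></tr></table>
</body></html>
"""


def load_rules(rules_xml):
    """Read the rules file into a {metadata name: xpath} dictionary."""
    root = ET.fromstring(rules_xml)
    return {f.get("name"): f.get("xpath") for f in root.findall("field")}


def extract_metadata(page_xml, rules):
    """Apply every rule to the page and keep the text of the first match."""
    doc = ET.fromstring(page_xml)
    metadata = {}
    for name, xpath in rules.items():
        node = doc.find(xpath)
        if node is not None:
            # Join nested text so inline tags inside the node are not lost.
            metadata[name] = "".join(node.itertext()).strip()
    return metadata


if __name__ == "__main__":
    print(extract_metadata(PAGE, load_rules(RULES_XML)))
```

With this layout, supporting a new site only means writing a new rules
file; the extraction code stays the same. The resulting names could then be
passed on as parse metadata for index-metadata to pick up at indexer time.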

2014-09-25 19:19 GMT+02:00 Nima Falaki <nfalaki@popsugar.com>:

> And the reason why I think this is because of this ticket (Look at the
> conversation at the bottom between Emmanuel and Lewis John)
>
> https://issues.apache.org/jira/browse/NUTCH-978
>
> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nfalaki@popsugar.com> wrote:
>
>> Hi Julien:
>>
>> I was under the impression that the nutch community was going to use a
>> generic xsl parser? This one:
>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
>> the nutch community going to use this?
>>
>>
>>
>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>> lists.digitalpebble@gmail.com> wrote:
>>
>>> Hi Albin,
>>>
>>> You don't have to have a separate plugin for each html structure you
>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>
>>> Having a generic extractor with the extraction logic configured in an
>>> external file is definitely a good idea and would make a great contribution
>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>> definitely needs inventing ;-)
>>>
>>> Best
>>>
>>> Julien
>>>
>>>
>>> On 25 September 2014 09:24, Albin Vigier <albinscode@gmail.com> wrote:
>>>
>>>> Hello everybody,
>>>>
>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>> an existing nutch plugin.
>>>>
>>>> Let's take an example.
>>>> I want to extract some metadata from "div" or "td" tags from html
>>>> pages that have specific ids and name them the way I like (this is
>>>> done at parser time).
>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>> to add my custom metadata.
>>>>
>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>> new site (with a new html structure). I've already done that by using
>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>> little bit boring :)
>>>>
>>>> So on my side I've coded a little plugin that enables me to specify
>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>> just wondering if I have missed something.
>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>> reinvent the wheel or miss something.
>>>>
>>>> Albin
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>>
>>
>>
>>
>> --
>>
>>
>>
>> Nima Falaki
>> Software Engineer
>> nfalaki@popsugar.com
>>
>>
>
>
> --
>
>
>
> Nima Falaki
> Software Engineer
> nfalaki@popsugar.com
>
>
