nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors
Date Wed, 07 Jun 2017 11:33:18 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040736#comment-16040736 ]

ASF GitHub Bot commented on NUTCH-2389:
---------------------------------------

kaidul opened a new pull request #192: NUTCH-2389 Precise data extractor implemented for 2.x
URL: https://github.com/apache/nutch/pull/192
 
 
   A per-page precise data extractor based on the jsoup CSS-selector API and configurable through
an XML file. This PR implements a parse filter and a complementary indexing filter plugin, plus
support for defining custom normalizers for specific extracted fields. I have tested this module
successfully on a large project of my own, and unit tests are included.
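
As a rough illustration of the per-field normalizer idea mentioned above, here is a minimal sketch in plain Java. It is not taken from the pull request: the class name FieldNormalizerSketch, the register/normalize methods, and the sample field names are all hypothetical, and the real plugin may wire normalizers quite differently (for example through its XML configuration).

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from the parse-jsoup plugin.
public class FieldNormalizerSketch {

    // Field name -> normalizer applied to that field's raw extracted value.
    private final Map<String, UnaryOperator<String>> normalizers = new HashMap<>();

    /** Registers a normalizer for one field and returns this for chaining. */
    public FieldNormalizerSketch register(String field, UnaryOperator<String> normalizer) {
        normalizers.put(field, normalizer);
        return this;
    }

    /** Returns a copy of the extracted fields with registered normalizers applied. */
    public Map<String, String> normalize(Map<String, String> extracted) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            // Fields without a registered normalizer pass through unchanged.
            UnaryOperator<String> normalizer =
                normalizers.getOrDefault(entry.getKey(), UnaryOperator.identity());
            result.put(entry.getKey(), normalizer.apply(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        FieldNormalizerSketch sketch = new FieldNormalizerSketch()
            .register("price", raw -> raw.replaceAll("[^0-9.]", "")) // keep digits and dots only
            .register("title", String::trim);

        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("title", "  Sample product ");
        extracted.put("price", "USD 1,299.50");
        extracted.put("sku", "AB-12");

        System.out.println(sketch.normalize(extracted));
    }
}
```

Keeping normalizers in a map keyed by field name means a field with no normalizer needs no special handling, which seems to fit the "custom normalizers on specific extracted data" wording.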
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>
>                 Key: NUTCH-2389
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2389
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 2.3
>            Reporter: Kaidul Islam
>            Assignee: Kaidul Islam
>             Fix For: 2.4
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> As far as I know, neither Nutch 1.x nor 2.x currently has a feature to extract/parse exact
> content for specific websites. I've developed a plugin, {{parse-jsoup}}, using Jsoup for my
> current project. It extracts precise content for site-specific crawling, driven by a detailed
> XML configuration (field name, CSS selector, attribute, extraction rules, data type, default value, etc.).
> Please let me know if this feature seems relevant and is not already present in Nutch.
> I also plan to port it to Nutch 1.x.
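
To make the selector-driven extraction described in the issue more concrete, the following is a minimal sketch of how a rule (field name, CSS selector, optional attribute, default value) could be evaluated with jsoup. It assumes jsoup is on the classpath. The FieldRule class, the extract method, and the sample HTML are hypothetical and are not the plugin's actual code or configuration format. Only Jsoup.parse, select, first, text and attr are real jsoup calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical sketch: FieldRule and extract() are illustrative, not the plugin's API.
public class SelectorExtractionSketch {

    /** One output field, located by a CSS selector. */
    public static final class FieldRule {
        final String name;
        final String cssSelector;
        final String attribute;    // null means "use the element's text"
        final String defaultValue; // used when nothing matches or the value is empty

        public FieldRule(String name, String cssSelector, String attribute, String defaultValue) {
            this.name = name;
            this.cssSelector = cssSelector;
            this.attribute = attribute;
            this.defaultValue = defaultValue;
        }
    }

    /** Parses the HTML once and evaluates every rule against the first matching element. */
    public static Map<String, String> extract(String html, FieldRule... rules) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        for (FieldRule rule : rules) {
            Element match = doc.select(rule.cssSelector).first(); // null when nothing matches
            String value = null;
            if (match != null) {
                value = rule.attribute == null ? match.text() : match.attr(rule.attribute);
            }
            boolean missing = value == null || value.isEmpty();
            fields.put(rule.name, missing ? rule.defaultValue : value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1 class=\"title\">Sample product</h1>"
            + "<a id=\"buy\" href=\"/buy/1\">Buy</a></body></html>";
        Map<String, String> fields = extract(html,
            new FieldRule("title", "h1.title", null, ""),
            new FieldRule("buyLink", "a#buy", "href", ""),
            new FieldRule("price", "span.price", null, "n/a"));
        System.out.println(fields);
    }
}
```

In a real deployment the rules would presumably be loaded from the XML configuration rather than constructed in code; that loading step is left out here because the configuration schema is not shown in this message.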



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
