nutch-dev mailing list archives

From "Kaidul Islam (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors
Date Fri, 02 Jun 2017 19:20:04 GMT


Kaidul Islam commented on NUTCH-2389:

Hi [~lewismc], I've changed my design and re-implemented some parts extensively. I will send
the pull request soon and will also provide unit tests :)

> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>                 Key: NUTCH-2389
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 2.3
>            Reporter: Kaidul Islam
>            Assignee: Kaidul Islam
>             Fix For: 2.4
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
> As far as I know, Nutch 1.x and 2.x currently have no feature to extract/parse exact
> content for specific websites. I've developed a plugin {{parse-jsoup}} using Jsoup for my
> current project to extract precise content for site-specific crawling using a detailed XML
> configuration (field name, CSS selector, attribute, extraction rules, data type, default value, etc.).
> Please let me know if this feature seems relevant and is not already present in Nutch.
> I also plan to port it to Nutch 1.x.
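
For illustration only, here is a minimal sketch of the kind of rule-driven extraction described above, assuming each configured field maps to one Jsoup CSS selector. The class SelectorExtractionSketch, the FieldRule type and its members are hypothetical names invented for this example; they are not taken from the parse-jsoup plugin, whose XML schema and code are not shown in this message. Only the Jsoup calls (Jsoup.parse, Document.select, Elements.first, Element.text, Element.attr) are real library API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorExtractionSketch {

    /** Hypothetical rule: one output field located by a CSS selector. */
    static final class FieldRule {
        final String name;         // output field name
        final String cssSelector;  // Jsoup CSS query
        final String attribute;    // attribute to read, or null for the element text
        final String defaultValue; // used when nothing matches or the value is empty

        FieldRule(String name, String cssSelector, String attribute, String defaultValue) {
            this.name = name;
            this.cssSelector = cssSelector;
            this.attribute = attribute;
            this.defaultValue = defaultValue;
        }
    }

    /** Applies every rule to the page and returns field name -> extracted value. */
    static Map<String, String> extract(String html, List<FieldRule> rules) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        for (FieldRule rule : rules) {
            // first() returns null when the selector matches no element
            Element el = doc.select(rule.cssSelector).first();
            String value = null;
            if (el != null) {
                value = (rule.attribute == null) ? el.text() : el.attr(rule.attribute);
            }
            boolean missing = (value == null || value.isEmpty());
            fields.put(rule.name, missing ? rule.defaultValue : value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1 class=\"title\">Hello</h1>"
                + "<a id=\"src\" href=\"http://example.com/a\">link</a></body></html>";
        List<FieldRule> rules = Arrays.asList(
                new FieldRule("title", "h1.title", null, ""),
                new FieldRule("source", "a#src", "href", ""),
                new FieldRule("author", "span.author", null, "unknown"));
        System.out.println(extract(html, rules));
    }
}
```

In this sketch a rule with a null attribute reads the element text, and the default value covers both a selector that matches nothing and an attribute that is absent. Data-type conversion and more elaborate extraction rules, which the description also mentions, are left out.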

This message was sent by Atlassian JIRA
