nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1980) Jexl expressions for CrawlDbReader
Date Thu, 02 Apr 2015 09:16:53 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1980:
---------------------------------
    Attachment: NUTCH-1980-1.9.patch

The new patch should be more efficient because it reuses the Expression object instead of creating it again for every record.
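
To illustrate the "compile once, evaluate per record" idea, here is a minimal sketch in plain Java. It is not the code from the patch and it does not use JEXL: the class ExpressionReuseSketch, the compile() helper, and the tiny "<field> > <number>" grammar are hypothetical stand-ins, and the example URLs and scores are made up. The point is only that the expression is parsed a single time and the resulting object is reused for each record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class ExpressionReuseSketch {

    // Hypothetical stand-in for an expression compiler (NOT the JEXL API):
    // parses "<field> > <number>" once and returns a reusable predicate
    // over a record's metadata map.
    static Predicate<Map<String, Double>> compile(String expr) {
        String[] parts = expr.trim().split("\\s+");
        if (parts.length != 3 || !parts[1].equals(">")) {
            throw new IllegalArgumentException("expected: <field> > <number>");
        }
        String field = parts[0];
        double threshold = Double.parseDouble(parts[2]);
        return metadata -> {
            Double value = metadata.get(field);
            return value != null && value > threshold;
        };
    }

    // Returns the keys (URLs) of all records whose metadata matches the expression.
    static List<String> filter(Map<String, Map<String, Double>> records, String expr) {
        // Compile the expression once, outside the loop ...
        Predicate<Map<String, Double>> compiled = compile(expr);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> record : records.entrySet()) {
            // ... and reuse the same compiled object for every record.
            if (compiled.test(record.getValue())) {
                matches.add(record.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up records: URL -> metadata scores.
        Map<String, Map<String, Double>> records = new LinkedHashMap<>();
        records.put("http://example.org/a", Map.of("score", 0.9));
        records.put("http://example.org/b", Map.of("score", 0.2));
        records.put("http://example.org/c", Map.of());
        System.out.println(filter(records, "score > 0.5")); // prints [http://example.org/a]
    }
}
```

With a real expression engine, compile() would presumably be replaced by the engine's own expression-creation call and test() by its evaluation call against a per-record context. Parsing is typically the expensive step, which is why hoisting it out of the per-record loop helps.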

> Jexl expressions for CrawlDbReader
> ----------------------------------
>
>                 Key: NUTCH-1980
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1980
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1980-1.9.patch, NUTCH-1980-1.9.patch
>
>
> We are already using Jexl expressions to filter records from HostDb dumps and it is really
> helpful when your CrawlDb is stuffed with metadata generated by parser filters, in our case
> mostly scores generated by classification plugins that run on text or structure.
> In the case of the HostDb, it operates on hosts only, so it is easy to collect a set
> of sites that host mostly a specific language, pornographic content, or just host topics that
> your classifiers are trained for.
> By adding this magic to the CrawlDbReader, you can get lists of actual records that contain
> the stuff you are looking for.
> Most work is already in the HostDb patch so it is easy to translate to individual records.
> Patch tomorrow, probably...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
