nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2058) Indexer plugin that allows RegEx replacements on the NutchDocument field values
Date Sat, 04 Jul 2015 10:37:04 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613677#comment-14613677 ]

ASF GitHub Bot commented on NUTCH-2058:
---------------------------------------

GitHub user PeterCiuffetti opened a pull request:

    https://github.com/apache/nutch/pull/44

    Nutch 2058 - New index-replace plugin that allows regexp field value replacements

    Modifies the NutchDocument during the IndexingFilter phase to do regexp replacements on
specified fields.
    
    See https://issues.apache.org/jira/browse/NUTCH-2058

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/PeterCiuffetti/nutch NUTCH-2058

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/44.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #44
    
----
commit dc32ce6dd66b4e712b1e9693a4e726febbc8171e
Author: PeterCiuffetti <pciuffetti@astreetpress.com>
Date:   2015-07-01T13:31:03Z

    Initial checkin got parse-replace

commit 2eebd285232bd0595bf321add1d35ae1a60e7d07
Author: PeterCiuffetti <pciuffetti@astreetpress.com>
Date:   2015-07-01T13:31:11Z

    Merge branch 'trunk' of github.com:apache/nutch into parse-replace

commit a2c1851e096bfd528b722778671490d4fd610a4b
Author: PeterCiuffetti <pciuffetti@astreetpress.com>
Date:   2015-07-02T14:27:19Z

    Refactored from a parse filter to an index filter

commit 57748e0de2e7fc60d349462144c3ed7703ac0957
Author: PeterCiuffetti <pciuffetti@astreetpress.com>
Date:   2015-07-04T09:22:02Z

    Updated tests. Feature set complete

commit e80e7b1e59a0025a1e5ed266e06546e97b7c2770
Author: PeterCiuffetti <pciuffetti@astreetpress.com>
Date:   2015-07-04T09:23:23Z

    Merge branch 'trunk' of github.com:apache/nutch into NUTCH-2058

commit 81368fe08193a365a6ca6f2179eb46e96ef0f7c5
Author: PeterCiuffetti <pciuffetti@astreetpress.com>
Date:   2015-07-04T09:34:18Z

    README doc change

commit d2d534c1a9a48dd7a29147453f4c4e1fc78f11fb
Author: PeterCiuffetti <pciuffetti@astreetpress.com>
Date:   2015-07-04T10:17:27Z

    Updated documentation

commit 0455d9119b694ccb9274a43dba392b76771a9da1
Author: PeterCiuffetti <pciuffetti@astreetpress.com>
Date:   2015-07-04T10:23:19Z

    Undoing build.xml change

----


> Indexer plugin that allows RegEx replacements on the NutchDocument field values
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-2058
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2058
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Peter Ciuffetti
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> This is the description of an IndexingFilter plugin I'm developing that allows regex replacements
on field values before they are indexed into your search engine.
> *Plugin name*: index-replace
> *Property name*: index.replace.regexp
> *Use case example:*
> I'm indexing Nutch-created documents to a pre-existing Solr core.  In this case I need
to coerce the documents into the schema and field formats expected by the existing core. 
The features of index-static and solrindex-mapping.xml get me most of the way.  Among other
things, I need to generate identifiers from the web URLs.  So I need to do something like
a regex replace on the id provided and then (with solrindex-mapping.xml) move this to the
field name defined by the existing core.
> Another use case might be to refactor all URLs stored in the document so they route through
a redirector gateway.
> The following is from the draft description in nutch-default.xml
> *Description:*
> Allows indexing-time regexp replace manipulation of metadata fields. The format of the
property is a list of regexp replacements, one line per field being modified.  To use this
property, add index-replace to your list of activated plugins.
>     
> *Example:*
> {code:xml}
> <property>
>   <name>index.replace.regexp</name>
>   <value>
>         fldname1=/regexp/replacement/flags
>         fldname2=/regexp/replacement/flags
>   </value>
> </property>
> {code}
> Field names would be one of those from https://wiki.apache.org/nutch/IndexStructure.
The replacements will happen in the order listed. If a field needs multiple replacement operations,
it may be listed more than once.
> The *field name* precedes the equal sign.  The first character after the equal sign signifies
the delimiter for the regexp, the replacement value and the flags.
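
The line format described above can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: the class name `ReplaceRule` and its methods are hypothetical, and the sketch assumes the regexp has already been unescaped by whatever reads the config file.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): parses one line of the form
//   fldname=/regexp/replacement/flags
// where the first character after '=' is the delimiter.
final class ReplaceRule {
    final String field;
    final Pattern pattern;
    final String replacement;

    ReplaceRule(String field, Pattern pattern, String replacement) {
        this.field = field;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    static ReplaceRule parse(String line) {
        String trimmed = line.trim();
        int eq = trimmed.indexOf('=');
        if (eq < 1 || eq + 1 >= trimmed.length()) {
            throw new IllegalArgumentException("Malformed rule: " + line);
        }
        String field = trimmed.substring(0, eq).trim();
        // The character right after '=' is the delimiter for the remaining parts.
        String delimiter = trimmed.substring(eq + 1, eq + 2);
        // Limit -1 keeps trailing empty parts (empty replacement, omitted flags).
        String[] parts = trimmed.substring(eq + 2).split(Pattern.quote(delimiter), -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Malformed rule: " + line);
        }
        // Flags are optional; a missing or empty value is treated as 0 here.
        int flags = (parts.length > 2 && !parts[2].trim().isEmpty())
                ? Integer.parseInt(parts[2].trim())
                : 0;
        return new ReplaceRule(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    String apply(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }
}
```

Because the delimiter is taken from the line itself, a rule whose regexp contains slashes can pick another character, for example `title=#foo/bar#baz#2`.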
> The *regexp* and the optional *flags* should correspond to Pattern.compile(String regexp,
int flags) defined here: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile%28java.lang.String,%20int%29
> The *flags* value is the integer sum of the flag values defined in http://docs.oracle.com/javase/7/docs/api/constant-values.html
(Sec: java.util.regex.Pattern)
> For efficiency, patterns are compiled once, when the plugin is initialized.
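
As a small illustration of the flags arithmetic (a sketch only; the class name `FlagSum` and the sample text are made up for this example): `Pattern.CASE_INSENSITIVE` is 2 and `Pattern.MULTILINE` is 8, so a rule that wants both would carry the flags value 10, as in `title=/^nutch/Lucene/10`.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the plugin.
final class FlagSum {
    // CASE_INSENSITIVE (2) + MULTILINE (8) = 10
    static int caseInsensitiveMultiline() {
        return Pattern.CASE_INSENSITIVE + Pattern.MULTILINE;
    }

    // Compiles "^nutch" with the summed flags and replaces every match.
    static String demo(String input) {
        Pattern p = Pattern.compile("^nutch", caseInsensitiveMultiline());
        return p.matcher(input).replaceAll("Lucene");
    }

    public static void main(String[] args) {
        System.out.println(caseInsensitiveMultiline()); // prints 10
        // MULTILINE lets ^ match at the start of the second line,
        // CASE_INSENSITIVE lets "nutch" match "NUTCH".
        System.out.println(demo("Apache\nNUTCH crawler"));
    }
}
```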
> *Escaping*: since the regexp is read from a config file, any escaped values must
be double escaped.  E.g.: {code}
>   id=/\\s+//
> {code} will cause the escaped \s+ match pattern to be used.
> The *replacement* value should follow the rules of Java's Pattern.matcher(CharSequence input).replaceAll(String
replacement):  http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll%28java.lang.String%29
>     
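
Since the replacement follows `Matcher.replaceAll` semantics, `$1` in the replacement refers to the first capture group, and a literal `$` or backslash has to be escaped. The sketch below illustrates this using the redirector use case mentioned earlier; the class name and the gateway URL `https://gateway.example.org/r?u=` are assumptions made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the plugin.
final class ReplacementDemo {
    // Routes a URL through an assumed redirector gateway by reusing
    // capture group 1 in the replacement string.
    static String viaGateway(String url) {
        return Pattern.compile("^(https?://.+)$")
                .matcher(url)
                .replaceAll("https://gateway.example.org/r?u=$1");
    }

    // A literal '$' in the replacement must be escaped; quoteReplacement does that.
    static String literalDollar(String text) {
        return Pattern.compile("USD")
                .matcher(text)
                .replaceAll(Matcher.quoteReplacement("$"));
    }

    public static void main(String[] args) {
        System.out.println(viaGateway("http://nutch.apache.org/"));
        System.out.println(literalDollar("5 USD"));
    }
}
```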
> *Multi-valued Fields*
> If a field has multiple values, the replacement will be applied to each value in turn.
> *Non-string Datatypes*
> Replacement is possible only on String field datatypes.  If the field you name in the
property is not a String datatype, it will be silently ignored.
> *Host and URL specific replacements*
> If the replacements should apply only to specific pages, then add a sequence like
> {code}
>     hostmatch=hostmatchpattern
>     fld1=/regexp/replace/flags
>     fld2=/regexp/replace/flags
> {code}
>     or
> {code}
>     urlmatch=urlmatchpattern
>     fld1=/regexp/replace/flags
>     fld2=/regexp/replace/flags
> {code}
> When using Host and URL replacements, all replacements preceding the first hostmatch
or urlmatch will apply to all Nutch documents.  Replacements following a hostmatch or urlmatch
will be applied to Nutch documents that match the host or url field (up to the next hostmatch
or urlmatch line).  hostmatch and urlmatch patterns must be unique in this property.
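
The scoping rules above could be modelled along these lines. This is a rough sketch under several assumptions, not the plugin's implementation: the document is represented as a plain `Map<String, List<String>>` instead of a NutchDocument, the class name `ScopedReplacer` is hypothetical, the host and url fields are assumed to be named `host` and `url`, and the hostmatch/urlmatch patterns are tested with `find()` because the description does not say whether a full match is required.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: global rules first, then rules scoped by hostmatch/urlmatch.
final class ScopedReplacer {
    private static final class Rule {
        final String field;
        final Pattern pattern;
        final String replacement;

        Rule(String line) {
            int eq = line.indexOf('=');
            field = line.substring(0, eq).trim();
            String delimiter = line.substring(eq + 1, eq + 2);
            String[] p = line.substring(eq + 2).split(Pattern.quote(delimiter), -1);
            if (p.length < 2) throw new IllegalArgumentException("Malformed rule: " + line);
            int flags = (p.length > 2 && !p[2].trim().isEmpty()) ? Integer.parseInt(p[2].trim()) : 0;
            pattern = Pattern.compile(p[0], flags);
            replacement = p[1];
        }
    }

    private static final class Section {
        final String scopeField;   // "host", "url", or null for the global section
        final Pattern scope;
        final List<Rule> rules = new ArrayList<>();

        Section(String scopeField, Pattern scope) {
            this.scopeField = scopeField;
            this.scope = scope;
        }

        boolean appliesTo(Map<String, List<String>> doc) {
            if (scopeField == null) return true;
            List<String> values = doc.get(scopeField);
            if (values == null) return false;
            for (String v : values) {
                if (scope.matcher(v).find()) return true;
            }
            return false;
        }
    }

    private final List<Section> sections = new ArrayList<>();

    ScopedReplacer(String propertyValue) {
        // Rules before the first hostmatch/urlmatch land in the global section.
        Section current = new Section(null, null);
        sections.add(current);
        for (String raw : propertyValue.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith("hostmatch=") || line.startsWith("urlmatch=")) {
                String scopeField = line.startsWith("hostmatch=") ? "host" : "url";
                String scopePattern = line.substring(line.indexOf('=') + 1);
                current = new Section(scopeField, Pattern.compile(scopePattern));
                sections.add(current);
            } else {
                current.rules.add(new Rule(line));
            }
        }
    }

    // Applies each matching section's rules, in order, to every value of the field.
    // Note: in this sketch a scope is checked against the document as already
    // modified by earlier sections.
    void apply(Map<String, List<String>> doc) {
        for (Section s : sections) {
            if (!s.appliesTo(doc)) continue;
            for (Rule r : s.rules) {
                List<String> values = doc.get(r.field);
                if (values == null) continue;
                for (int i = 0; i < values.size(); i++) {
                    values.set(i, r.pattern.matcher(values.get(i)).replaceAll(r.replacement));
                }
            }
        }
    }
}
```

The inner loop over `values` also reflects the multi-valued behaviour described earlier: each value of a field is rewritten in turn, and fields absent from the document are skipped.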



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
