nutch-dev mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Re: Add Field to crawled content for indexing
Date Wed, 02 Apr 2014 20:30:09 GMT
Hi Yann,

> In Parse type, we don't have "getData()" so we can't add new metadata.
...
> So what is the new way to add a custom field to the index? Maybe I'm
> missing something ...

In 2.x, data for custom fields can be added to the WebPage's metadata
in a ParseFilter via
 page.putToMetadata(Utf8 key, ByteBuffer value)
It can then be read back in an IndexingFilter with
 page.getFromMetadata(Utf8 key)
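
As a rough illustration of that round trip: the value has to be encoded into a ByteBuffer on the parse side and decoded again on the indexing side. The sketch below uses only the JDK so it can be run on its own. A plain HashMap stands in for the WebPage metadata, and the class name, the helper names, and the field name "myfield" are all hypothetical. In a real plugin, the put and get lines would presumably be the page.putToMetadata / page.getFromMetadata calls named above, with the key wrapped in a Utf8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, JDK-only sketch. The Map is a stand-in for WebPage metadata.
class CustomFieldSketch {

    // Hypothetical metadata key / index field name.
    static final String KEY = "myfield";

    // Encode a String as UTF-8 bytes wrapped in a ByteBuffer.
    static ByteBuffer encode(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    // Decode a ByteBuffer back into a String, or return null if absent.
    static String decode(ByteBuffer raw) {
        if (raw == null) {
            return null;
        }
        // duplicate() gives an independent position, so decoding does not
        // consume the stored buffer and it can be read again later.
        return StandardCharsets.UTF_8.decode(raw.duplicate()).toString();
    }

    // Parse side: in a ParseFilter this would be roughly
    //   page.putToMetadata(new Utf8(KEY), encode(value));
    static void parseStep(Map<String, ByteBuffer> metadata, String value) {
        metadata.put(KEY, encode(value));
    }

    // Indexing side: in an IndexingFilter this would be roughly
    //   decode(page.getFromMetadata(new Utf8(KEY)))
    // followed by adding the decoded value to the NutchDocument.
    static String indexStep(Map<String, ByteBuffer> metadata) {
        return decode(metadata.get(KEY));
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> metadata = new HashMap<>();
        parseStep(metadata, "hello");
        System.out.println(KEY + " = " + indexStep(metadata));
    }
}
```

Decoding from a duplicate of the buffer is a defensive choice: reading a ByteBuffer directly advances its position, so a second reader of the same stored value would otherwise see it as empty.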

Sebastian

On 04/02/2014 05:42 PM, Yann Levreau wrote:
> Hello,
> 
> Maybe this is the wrong place to post a request so forgive me, but I really
> need some help (Nutch 2.2.1) :
> 
> I need to add a new field to be indexed by ElasticSearch.
> 
> In 1.7, we had:
> The HtmlParseFilter extension with:
> ParseResult filter(Content content, ParseResult parseResult,
>     HTMLMetaTags metaTags, DocumentFragment doc)
> <http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html#filter%28org.apache.nutch.protocol.Content,%20org.apache.nutch.parse.ParseResult,%20org.apache.nutch.parse.HTMLMetaTags,%20org.w3c.dom.DocumentFragment%29>
> 
> The IndexingFilter extension with:
> NutchDocument filter(NutchDocument doc, Parse parse,
>     org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks)
> <http://nutch.apache.org/apidocs-1.7/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20org.apache.nutch.parse.Parse,%20org.apache.hadoop.io.Text,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.crawl.Inlinks%29>
> 
> With these, adding a field worked fine.
> 
> In 2.2.1 we have:
> The ParseFilter extension:
> Parse filter(String url, WebPage page, Parse parse,
>     HTMLMetaTags metaTags, DocumentFragment doc)
> <http://nutch.apache.org/apidocs-2.2/org/apache/nutch/parse/ParseFilter.html#filter%28java.lang.String,%20org.apache.nutch.storage.WebPage,%20org.apache.nutch.parse.Parse,%20org.apache.nutch.parse.HTMLMetaTags,%20org.w3c.dom.DocumentFragment%29>
> In Parse type, we don't have "getData()" so we can't add new metadata.
> 
> The IndexingFilter extension:
> NutchDocument filter(NutchDocument doc, String url, WebPage page)
> <http://nutch.apache.org/apidocs-2.2/org/apache/nutch/indexer/IndexingFilter.html#filter%28org.apache.nutch.indexer.NutchDocument,%20java.lang.String,%20org.apache.nutch.storage.WebPage%29>
> There is no Parse parameter here, so we can't use it to add a field to
> the NutchDocument.
> 
> So what is the new way to add a custom field to the index? Maybe I'm
> missing something ...
> Thank you very much !
> 

