lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Is deduplication possible during Tika extract?
Date Tue, 18 Jan 2011 00:42:39 GMT
In my opinion it should work for every update handler. If you're really sure 
your configuration is fine and it still doesn't work, you might have to file an 
issue.

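Your configuration looks alright, but don't forget you've configured 
overwriteDupes=false! With that setting the signature is still computed and 
stored, but documents that share a signature are not overwritten.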
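
If you do want duplicates to be replaced, a minimal sketch of the chain could 
look like the one below. It is your quoted chain with only overwriteDupes 
changed to true; the field names "signature" and "text" come from your config, 
and I'm assuming "signature" is defined as an indexed field in schema.xml. I 
have not tested this against the extract handler, so treat it as a starting 
point rather than a confirmed fix.

```xml
<!-- Sketch only: same chain as in the quoted solrconfig.xml, with
     overwriteDupes switched to true. Assumes the "signature" field
     is declared as an indexed field in schema.xml. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- true: a new document replaces indexed documents with the same signature -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The signature processor sits before RunUpdateProcessorFactory so that the 
signature is already on the document when it is actually indexed.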

> Hello,
> 
> here is an excerpt of my solrconfig.xml:
> 
> <requestHandler name="/update/extract"
> class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> startup="lazy">
> <lst name="defaults">
> 
> <str name="update.processor">dedupe</str>
> 
> <!-- All the main content goes into "text"... if you need to return
>             the extracted text or do highlighting, use a stored field. -->
> <str name="fmap.content">text</str>
> <str name="lowernames">true</str>
> <str name="uprefix">ignored_</str>
> 
> <!-- capture link hrefs but ignore div attributes -->
> <str name="captureAttr">true</str>
> <str name="fmap.a">links</str>
> <str name="fmap.div">ignored_</str>
> </lst>
> </requestHandler>
> 
> and
> 
> <updateRequestProcessorChain name="dedupe">
> <processor
> class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
> <bool name="enabled">true</bool>
> <str name="signatureField">signature</str>
> <bool name="overwriteDupes">false</bool>
> <str name="fields">text</str>
> <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
> </processor>
> <processor class="solr.LogUpdateProcessorFactory" />
> <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> 
> Deduplication works when I use only "/update", but not when Solr does an
> extract with Tika!
> Is deduplication possible during Tika extract?
> 
> Thanks in advance,
> Arno
