lucene-solr-user mailing list archives

From Charlie Hubbard <charlie.hubb...@gmail.com>
Subject Problem with multi-valued field using Solr Cell
Date Tue, 04 Apr 2017 16:13:02 GMT
I'm trying to index documents using Solr Cell and Tika on Solr 5.4.1.
I'm using the default configuration, but when I import my documents I get
this error:

125973 INFO  (qtp840863278-17) [   x:fusearchiver] o.a.s.c.PluginBag Going
to create a new requestHandler with {type = requestHandler,name =
/update/extract,class = solr.extraction.ExtractingRequestHandler,args =
{defaults={lowernames=true,uprefix=ignored_,captureAttr=true,fmap.a=links,fmap.div=ignored_}}}

127134 INFO  (qtp840863278-17) [   x:fusearchiver]
o.a.s.u.p.LogUpdateProcessorFactory [fusearchiver] webapp=/solr
path=/update/extract
params={literal.archiveDate_dt=Mon+Apr+03+21:16:48+EDT+2017&literal._accountId=2&literal.categories=taxes&literal.categories=5498&
literal.id=b5701a36-0dec-4746-bb5d-3c307a557cd7&literal._batchId=25&literal._type=document&literal._filename=2016-0664-Form-5498.pdf&literal._employeeNumber=1411&wt=javabin&literal._employeeFuseId=1&literal.effectiveDate_dt=Sat+Dec+31+00:00:00+EST+2016&literal._json={"accountId":2,"archiveDate":1491268608431,"batchId":25,"categories":["taxes","5498"],"effectiveDate":1483160400000,"employeeFuseId":1,"employeeNumber":"1411","fileName":"2016-0664-Form-5498.pdf","id":"b5701a36-0dec-4746-bb5d-3c307a557cd7","imageUrl":null,"path":"2016-0664-Form-5498.pdf","uploadedBy":null,"url":null}&version=2}
{} 0 1161

127135 ERROR (qtp840863278-17) [   x:fusearchiver]
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR:
[doc=b5701a36-0dec-4746-bb5d-3c307a557cd7] multiple values encountered for
non multiValued field meta: [dcterms:modified, 2017-03-16T23:14:41Z,
meta:creation-date, 2017-03-16T23:14:41Z, meta:save-date,
2017-03-16T23:14:41Z, pdf:PDFVersion, 1.4, dcterms:created,
2017-03-16T23:14:41Z, Last-Modified, 2017-03-16T23:14:41Z, date,
2017-03-16T23:14:41Z, X-Parsed-By, org.apache.tika.parser.DefaultParser,
X-Parsed-By, org.apache.tika.parser.pdf.PDFParser, modified,
2017-03-16T23:14:41Z, xmpTPg:NPages, 2, Creation-Date,
2017-03-16T23:14:41Z, pdf:encrypted, false, created, Thu Mar 16 23:14:41
UTC 2017, stream_size, null, dc:format, application/pdf; version=1.4,
producer, Ricoh Americas Corporation, AFP2PDF, Content-Type,
application/pdf, xmp:CreatorTool, Ricoh Americas Corporation, AFP2PDF Plus
Version: 1.014.10, Last-Save-Date, 2017-03-16T23:14:41Z]

    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:92)
    at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:83)
    at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:273)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:207)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:49)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:924)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1079)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:702)
    at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:126)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:131)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:237)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:70)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)

Here is the extract handler section of my solrconfig.xml:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

I expected this configuration to map every extracted field that isn't
declared in my schema to an ignored_ field, so meta shouldn't be indexed.
I've searched through my Solr schema and there is no meta field declared,
so I assumed Solr Cell would discard it.
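
As a rough illustration of that expectation (not Solr's actual code), the
mapping can be modeled in plain Java. The class name, the SCHEMA_FIELDS set
and the resolve() helper are all hypothetical, and the sketch ignores
dynamic fields and fmap.* rules, which a real schema lookup would also
consult:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class UprefixSketch {
    // Hypothetical stand-in for the fields declared in the schema.
    static final Set<String> SCHEMA_FIELDS =
            new HashSet<>(Arrays.asList("id", "links", "categories"));

    // Simplified model of lowernames=true: lowercase the name and
    // replace every character that is not a-z or 0-9 with '_'.
    static String lowerName(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "_");
    }

    // Expected behavior of uprefix: a name the schema knows is kept,
    // anything else gets the prefix prepended.
    static String resolve(String extractedName, String uprefix) {
        String name = lowerName(extractedName);
        return SCHEMA_FIELDS.contains(name) ? name : uprefix + name;
    }

    public static void main(String[] args) {
        System.out.println(resolve("meta", "ignored_"));         // ignored_meta
        System.out.println(resolve("Content-Type", "ignored_")); // ignored_content_type
        System.out.println(resolve("links", "ignored_"));        // links
    }
}
```

Under this model, meta would end up as ignored_meta, which is why the error
naming the field meta itself is surprising.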

I'm using SolrJ to import the documents, and I'm adding a lot of literals
to each one.  The literals I'm providing are visible in the request
parameters logged above.

Why am I seeing this error?

To work around this issue, can I have it only extract the content, which I
would then put in a text field myself, and have it process HTML in the
same manner?

TIA
Charlie
