lucene-solr-user mailing list archives

From <j...@ece.ubc.ca>
Subject UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?
Date Fri, 02 Jan 2015 20:43:49 GMT
Happy New Year Everyone :)

I am trying to automatically generate a document id when indexing a csv
file in which each line is one document. The desired case: if the csv
file contains 2 lines, then the index should contain 2 documents.

What I observed: if the csv file contains 2 lines, then the index
contains 3 documents, because the 1st document appears twice. Example
output:
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId1</str>
</doc>
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId2</str>
</doc>
<doc>
<str name="col1">doc2</str>
<str name="col2">rank2</str>
<str name="id">randomlyGeneratedId3</str>
</doc>

And if the csv file contains 3 lines, then the index contains 6 documents,
because document 1 appears 3 times and document 2 appears twice, as
follows:
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId1</str>
</doc>
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId2</str>
</doc>
<doc>
<str name="col1">doc2</str>
<str name="col2">rank2</str>
<str name="id">randomlyGeneratedId3</str>
</doc>
<doc>
<str name="col1">doc1</str>
<str name="col2">rank1</str>
<str name="id">randomlyGeneratedId4</str>
</doc>
<doc>
<str name="col1">doc2</str>
<str name="col2">rank2</str>
<str name="id">randomlyGeneratedId5</str>
</doc>
<doc>
<str name="col1">doc3</str>
<str name="col2">rank3</str>
<str name="id">randomlyGeneratedId6</str>
</doc>

Here's what I have done:
1. In my solrconfig.xml:
<updateRequestProcessorChain name="autoGenId">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">doc_key</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">autoGenId</str>
  </lst>
</requestHandler>
2. In schema.xml:
<field name="doc_key" type="string" indexed="true" stored="true"
       required="true" multiValued="false"/>
<field name="col1" type="string" indexed="true" stored="true"
       required="true" multiValued="false"/>
<field name="col2" type="string" indexed="true" stored="true"
       required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>

This problem does not occur when I assign the id field myself instead of
using UUIDUpdateProcessorFactory, so I assume the problem lies there. It
looks like the csv file is processed one line at a time and every
intermediate state ends up in the index, so each earlier line shows up
again in the output. Is there a way to index only the final result rather
than these repeated earlier lines, so that the total number of indexed
documents matches the number of lines in the csv file?
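
To make the pattern concrete, here is a minimal Python sketch of what I
think is happening. It is only a model that reproduces the output shown
above, not Solr's actual code; the function name simulate_index and the
use of uuid.uuid4() for the generated ids are my own illustration.

```python
import uuid


def simulate_index(rows):
    """Model the observed behaviour (an assumption, not Solr internals):
    after reading row i, rows 1..i are all added again, each with a fresh id."""
    index = []
    for i in range(1, len(rows) + 1):
        for col1, col2 in rows[:i]:
            index.append({"col1": col1, "col2": col2, "id": str(uuid.uuid4())})
    return index


rows = [("doc1", "rank1"), ("doc2", "rank2"), ("doc3", "rank3")]
print(len(simulate_index(rows[:2])))  # prints 3
print(len(simulate_index(rows)))      # prints 6
```

Under this model a csv file with n lines produces n * (n + 1) / 2
documents, which matches the 3 and 6 documents I see for 2 and 3 lines.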

Many thanks,
Jia
