lucene-solr-user mailing list archives

From "Mike L." <javaone...@yahoo.com>
Subject Re: FileDataSource vs JdbcDataSource (speed) Solr 3.5
Date Wed, 03 Jul 2013 21:05:05 GMT
Hey Shawn / Solr User Group,
 
This makes perfect sense to me. Thanks for the thorough answer.  
     "The CSV update handler works at a lower level than the DataImport handler, and doesn't

have "clean" or "full-import" options, which defaults to clean=true. The DIH is like a full
application embedded inside Solr, one that uses 
an update handler -- it is not itself an update handler.  When clean=true or using full-import
without a clean option, DIH itself sends 
a "delete all documents" update request."
 
And similarly, my assumption is that in the event of a non-syntactical failure or interruption
(such as a server crash) during a CSV update, a rollback (stream.body=<rollback/>) would also
need to be requested manually (or automated, but outside of Solr), whereas the DIH automates
this request on my behalf as well...? Is there any way to detect such a failure or interruption?
A real example: I was in the process of indexing data via the CSV update handler and somebody
bounced the server before it completed. No actual errors were produced, but it appeared that the
CSV update process stopped at the point of the reboot. My assumption is that if I had sent a
rollback, I'd get back the previously indexed data, given that I didn't request a delete
beforehand (I haven't tested this yet). But I'm wondering how I could detect this automatically.
This, I guess, is where DIH starts gaining some merit.
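
For what it's worth, the manual rollback I have in mind would be something like this (server and
core names are placeholders as in my earlier example, with <rollback/> URL-encoded):

curl 'http://server:port/appname/solrcore/update?stream.body=%3Crollback/%3E'
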
Also, the response that the DIH produces when the indexing process is complete appears to be a
lot more mature, in that it explicitly indicates that the index completed and that the
information can be re-queried. It would be nice if the CSV update handler provided a similar
response; my assumption is that it would first need to know how many lines exist in the file in
order to know whether or not the job actually completed...
 
Also, outside of Solr initiating a delete because it encounters the same UniqueKey, is there
anything else that could cause a delete to be initiated by Solr?
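
Related to that: if I'm reading the CSV handler parameters correctly, that UniqueKey-based
deletion happens because the overwrite parameter defaults to true, so I assume a request like
the following would skip it entirely (same placeholders as my earlier example). Please correct
me if I have that wrong:

http://server:port/appname/solrcore/update/csv?overwrite=false&commit=true&separator=%09&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields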

Lastly, is there any concern with running multiple CSV update requests on different data files
containing different data?

Thanks in advance. This was very helpful.

Mike
 

________________________________
From: Shawn Heisey <solr@elyograg.org>
To: solr-user@lucene.apache.org 
Sent: Monday, July 1, 2013 2:30 PM
Subject: Re: FileDataSource vs JdbcDataSource (speed) Solr 3.5


On 7/1/2013 12:56 PM, Mike L. wrote:
>  Hey Ahmet / Solr User Group,
>
>    I tried using the built-in UpdateCSV handler and it runs A LOT faster than a FileDataSource
> DIH, as illustrated below. However, I am a bit confused about the numDocs/maxDoc values when
> doing an import this way. Here's my GET command against a tab-delimited file (I removed server
> info and additional fields; everything else is the same):
>
> http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields
>
>
> My response from solr
>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
> </response>
>
> I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to see if I can
> get this to run correctly before loading my entire collection of data. I initially loaded the
> first 1000 records into an empty core and that seemed to work. However, when running the above
> with a csv file that has 10 records, I would like to see only 10 active records in my core.
> What I get instead, when looking at my stats page:
>
> numDocs 1000
> maxDoc 1010
>
> If I run the same URL above with 'optimize=true' appended, I get:
>
> numDocs 1000,
> maxDoc 1000.

A discrepancy between numDocs and maxDoc indicates that there are 
deleted documents in your index.  You might already know this, so here's 
an answer to what I think might be your actual question:

If you want to delete the 1000 existing documents before adding the 10 
documents, then you have to actually do that deletion.  The CSV update 
handler works at a lower level than the DataImport handler, and doesn't 
have "clean" or "full-import" options (in DIH, clean defaults to true). 
The DIH is like a full application embedded inside Solr, one that uses 
an update handler -- it is not itself an update handler.  When 
clean=true, or when using full-import without a clean option, DIH 
itself sends a "delete all documents" update request.

If you didn't already know the bit about the deleted documents, then 
read this:

It can be normal for indexing "new" documents to cause deleted 
documents.  This happens when you have the same value in your UniqueKey 
field as documents that are already in your index.  Solr knows by the 
config you gave it that they are the same document, so it deletes the 
old one before adding the new one.  Solr has no way to know whether the 
document it already had or the document you are adding is more current, 
so it assumes you know what you are doing and takes care of the deletion 
for you.
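
The config in question is the uniqueKey declaration in schema.xml, 
something like this (the field name here is just an example):

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <uniqueKey>id</uniqueKey>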

When you optimize your index, deleted documents are purged, which is why 
the numbers match there.
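
You can also trigger an optimize on its own, without indexing anything, 
by sending a request like this:

curl 'http://server:port/solr/corename/update?optimize=true'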

Thanks,
Shawn