lucene-solr-user mailing list archives

From "Mike L." <>
Subject Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
Date Mon, 01 Jul 2013 18:56:38 GMT
 Hey Ahmet / Solr User Group,
   I tried using the built-in UpdateCSV and it runs A LOT faster than a FileDataSource DIH,
as illustrated below. However, I am a bit confused about the numDocs/maxDoc values when doing
an import this way. Here's my GET command against a tab-delimited file (I removed server info
and additional fields; everything else is the same):


My response from Solr:

<?xml version="1.0" encoding="UTF-8"?>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
I am experimenting with 2 CSV files (1 with 10 records, the other with 1000) to see if I
can get this to run correctly before running my entire collection of data. I initially loaded
the first 1000 records into an empty core and that seemed to work. However, when running
the above with a CSV file that has 10 records, I would like to see only 10 active records
in my core. What I get instead, when looking at my stats page:

numDocs 1000 
maxDoc 1010

If I run the same url above while appending an 'optimize=true', I get:

numDocs 1000, 
maxDoc 1000.
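
A note on reading these numbers: in Lucene/Solr, numDocs counts live documents, while maxDoc also counts documents that are marked deleted but have not yet been merged out of the index. A minimal sketch of that arithmetic follows; the function name is made up for illustration and is not part of any Solr API.

```python
def deleted_doc_count(num_docs, max_doc):
    """Return how many deleted-but-not-yet-purged documents the stats imply.

    num_docs: live documents (numDocs on the stats page)
    max_doc:  live plus deleted documents still held in segments (maxDoc)
    """
    if max_doc < num_docs:
        raise ValueError("maxDoc should never be smaller than numDocs")
    return max_doc - num_docs


# Stats reported before the optimize
print(deleted_doc_count(1000, 1010))
# Stats reported after appending optimize=true
print(deleted_doc_count(1000, 1000))
```

The difference of 10 before the optimize would be consistent with the 10-record file overwriting 10 existing documents that share the same uniqueKey values, and with the optimize then purging the old copies. That is an inference from the stats shown, not something confirmed in the thread.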

Perhaps the commit=true is not doing what it's supposed to, or am I missing something? I also
tried passing a commit afterward like this:
http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E (it didn't seem to do
anything either)
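
For readers following along, a request of the kind described above might be assembled roughly as below. The actual URL was removed from the post, so this is only a sketch: the base URL, core name and file path are placeholders, and the parameter set (stream.file, separator, stream.contentType, commit, optimize) is an assumption based on Solr's CSV update handler, not a reconstruction of what was really sent. Using stream.file also assumes remote streaming is enabled on the server.

```python
from urllib.parse import urlencode


def build_csv_update_url(base_url, csv_path, commit=True, optimize=False):
    """Build a GET URL for Solr's CSV update handler (/update/csv).

    base_url and csv_path are hypothetical placeholders supplied by the caller.
    """
    params = {
        # Server-side file to load; requires remote streaming to be enabled
        "stream.file": csv_path,
        # Tab separator; urlencode turns this into %09
        "separator": "\t",
        "stream.contentType": "text/plain;charset=utf-8",
        "commit": "true" if commit else "false",
    }
    if optimize:
        params["optimize"] = "true"
    return base_url.rstrip("/") + "/update/csv?" + urlencode(params)


# Placeholder host, core and path, mirroring the anonymized URL style in the post
url = build_csv_update_url("http://server:8983/appname/solrcore", "/tmp/sample.tsv")
print(url)
```

Note that a load like this adds or overwrites documents by uniqueKey; it does not remove documents that are absent from the file, so clearing the core first would need a separate delete request.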

From: Ahmet Arslan <>
To: "" <>; Mike L. <>

Sent: Saturday, June 29, 2013 7:20 AM
Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

Hi Mike,

You could try 

And make sure you commit at the very end.

From: Mike L. <>
To: "" <> 
Sent: Saturday, June 29, 2013 3:15 AM
Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

I've been working on improving index time with a JdbcDataSource DIH-based config and found
it not to be as performant as I'd hoped, for various reasons not specifically due to
Solr. With that said, I decided to switch gears a bit and test out a FileDataSource setup.
I assumed that by eliminating network latency I should see drastic improvements in import
time, but I'm a bit surprised that this process seems to run much slower, at least the way
I've initially coded it (below).
The below is a bare-bones file import that I wrote which consumes a tab-delimited file. Nothing
fancy here. The regex just separates out the fields. Is there a faster approach to doing
this? If so, what is it?
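
To illustrate what "the regex just separates out the fields" amounts to, here is a small sketch outside of DIH. The config in this message is truncated, so the three-column layout, the pattern and the function names are all hypothetical; the point is only that per-line regex extraction of a tab-delimited record is equivalent to a plain split on tabs.

```python
import re

# Hypothetical pattern for a three-column tab-delimited line
LINE_PATTERN = re.compile(r"^([^\t]*)\t([^\t]*)\t([^\t]*)$")


def split_with_regex(raw_line):
    """Extract fields with a capturing regex, one group per column."""
    match = LINE_PATTERN.match(raw_line.rstrip("\n"))
    return list(match.groups()) if match else None


def split_on_tabs(raw_line):
    """Extract the same fields with a plain split on the tab character."""
    return raw_line.rstrip("\n").split("\t")


line = "42\tsome title\tsome body\n"
print(split_with_regex(line))
print(split_on_tabs(line))
```

Both functions return the same fields for a well-formed line, which is why a dedicated delimited-file loader can be a simpler alternative to regex-based line parsing.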
Also, what is the "recommended" approach for indexing/importing data? I know that may
come across as a vague question, as there are various options available, but which one would
be considered the "standard" approach within a production enterprise environment?
(below has been cleansed)
     <dataSource name="file" type="FileDataSource" />
         <entity name="entity1"
 <field column="rawLine"
Thanks in advance,
