lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
Date Sat, 29 Jun 2013 12:20:27 GMT
Hi Mike,


You could try http://wiki.apache.org/solr/UpdateCSV 

And make sure you commit at the very end.
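
As a rough sketch of what that could look like for a tab-delimited file: the script below builds a request to the CSV update handler with `separator`, `header`, `fieldnames` and `commit` parameters and streams the file as the request body. The base URL `http://localhost:8983/solr`, the `/update/csv` path (as registered in the example solrconfig.xml), the file path, the field names `field1`..`field22` and both function names are assumptions for illustration, so adjust them to your setup.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_csv_update_url(base_url, field_names, commit=True):
    """Build the URL for Solr's CSV update handler for a headerless, tab-delimited file."""
    params = {
        "separator": "\t",                     # urlencode turns the tab into %09
        "header": "false",                     # the file has no header row
        "fieldnames": ",".join(field_names),   # column order of the file
        "commit": "true" if commit else "false",  # one commit, after the whole file is loaded
    }
    return base_url.rstrip("/") + "/update/csv?" + urlencode(params)


def post_file(url, path):
    """Stream the file as the POST body and return the HTTP status code."""
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        # Explicit length so the body is sent as-is rather than chunked.
        "Content-Length": str(os.path.getsize(path)),
    }
    with open(path, "rb") as fh:
        req = Request(url, data=fh, headers=headers, method="POST")
        with urlopen(req) as resp:
            return resp.status


if __name__ == "__main__":
    # Hypothetical values: replace with your own Solr URL, fields and file.
    fields = ["field%d" % i for i in range(1, 23)]
    url = build_csv_update_url("http://localhost:8983/solr", fields)
    print(url)
    # print(post_file(url, "/path/to/file.csv"))  # uncomment against a running Solr
```

If the load is split across several files, pass `commit=False` for all but the last request so that only one commit happens.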




________________________________
 From: Mike L. <javaone123@yahoo.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org> 
Sent: Saturday, June 29, 2013 3:15 AM
Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
 

 
I've been working on improving index time with a JdbcDataSource-based DIH config and found
it not to be as performant as I'd hoped, for various reasons not specifically due to
Solr. With that said, I decided to switch gears a bit and test a FileDataSource setup.
I assumed that by eliminating network latency I would see drastic improvements in import
time, but I'm a bit surprised that this process seems to run much slower, at least the way
I've initially coded it (below).
 
Below is a barebones file import that I wrote, which consumes a tab-delimited file. Nothing
fancy here. The regex just separates out the fields. Is there a faster approach to doing
this? If so, what is it?
 
Also, what is the "recommended" approach for indexing/importing data? I know that may
come across as a vague question, as there are various options available, but which one would
be considered the "standard" approach within a production enterprise environment?
 
 
(below has been cleansed)
 
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22"
      />
    </entity>
  </document>
</dataConfig>
 
Thanks in advance,
Mike