lucene-solr-user mailing list archives

From Shalin Shekhar Mangar <shalinman...@gmail.com>
Subject Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
Date Thu, 04 Jul 2013 04:43:48 GMT
The split/group implementation in RegexTransformer is not as efficient
as CSVLoader. Perhaps we need a specialized csv loader in DIH.
SOLR-2549 aims to add this support. I'll take a look.

On Tue, Jul 2, 2013 at 12:26 AM, Mike L. <javaone123@yahoo.com> wrote:
>  Hey Ahmet / Solr User Group,
>
>    I tried using the built-in UpdateCSV and it runs A LOT faster than a FileDataSource DIH, as illustrated below. However, I am a bit confused about the numDocs/maxDoc values when doing an import this way. Here's my GET command against a tab-delimited file (I removed server info and additional fields; everything else is the same):
>
> http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields
>
>
> My response from solr
>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
> </response>
>
> I am experimenting with 2 csv files (one with 10 records, the other with 1000) to see if I can get this to run correctly before loading my entire collection of data. I initially loaded the first 1000 records into an empty core and that seemed to work. However, when running the above with the csv file that has 10 records, I would like to see only 10 active records in my core. What I get instead, looking at my stats page:
>
> numDocs 1000
> maxDoc 1010
>
> If I run the same url above while appending an 'optimize=true', I get:
>
> numDocs 1000,
> maxDoc 1000.
>
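[Editor's note: the numDocs/maxDoc gap above is expected Lucene behavior rather than a commit problem. Posting a document whose uniqueKey already exists overwrites it, which marks the old copy as deleted; deleted copies still count toward maxDoc until segments are merged (optimize). A minimal toy sketch of that bookkeeping, assuming the 10 ids in the small file already exist among the original 1000:]

```python
# Toy model of Lucene's numDocs/maxDoc bookkeeping (not real Solr code).
# Assumption: the 10-record file reuses ids already present in the core.

index = {}      # live documents keyed by uniqueKey
max_doc = 0     # every add consumes a doc slot, even an overwrite

def add(doc_id, value):
    """Adding an existing id overwrites it: the old copy becomes a deleted doc."""
    global max_doc
    index[doc_id] = value
    max_doc += 1

for i in range(1000):        # initial load of 1000 records
    add(i, "v1")
for i in range(10):          # re-post 10 records with existing ids
    add(i, "v2")

num_docs = len(index)        # live (non-deleted) docs only
print(num_docs, max_doc)     # 1000 1010, matching the stats page

# Optimize merges segments and purges deleted slots: maxDoc collapses to numDocs.
max_doc_after_optimize = num_docs
print(num_docs, max_doc_after_optimize)   # 1000 1000
```

To actually end up with only 10 documents, the old records would need an explicit delete first (e.g. a delete-by-query), since UpdateCSV only adds or overwrites.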
> Perhaps the commit=true is not doing what it's supposed to, or am I missing something? I also tried passing a commit afterward like this:
> http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E (didn't seem to do anything either)
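[Editor's note: the GET above mixes literal characters (`\`) with percent-encoded ones (`%09` for the tab separator), which is easy to get wrong in a browser or script. A small sketch that builds the same URL with `urllib.parse` so every parameter is encoded unambiguously; host, core, and field names are placeholders from the original message:]

```python
from urllib.parse import urlencode

# Placeholder host/core from the thread; substitute real values.
base = "http://server:port/appname/solrcore/update/csv"
params = {
    "commit": "true",      # hard commit when the stream finishes
    "header": "false",     # the file has no header row
    "separator": "\t",     # tab-delimited; encodes as %09
    "escape": "\\",        # backslash as the escape character
    "stream.file": "/location/of/file/on/server/file.csv",
    "fieldnames": "id,otherfields",
}
url = base + "?" + urlencode(params)
print(url)
```

Building the query string this way guarantees the tab and backslash survive as `%09` and `%5C` rather than being mangled by the shell or browser.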
>
>
> From: Ahmet Arslan <iorixxx@yahoo.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>; Mike L. <javaone123@yahoo.com>
> Sent: Saturday, June 29, 2013 7:20 AM
> Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
>
>
> Hi Mike,
>
>
> You could try http://wiki.apache.org/solr/UpdateCSV
>
> And make sure you commit at the very end.
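[Editor's note: a commit can also be issued as its own request after the load, instead of `commit=true` on the load URL. A hedged sketch building that request, with the same placeholder host/core as above:]

```python
# Sketch only: build a standalone commit request for a (placeholder) Solr core.
from urllib.parse import urlencode

url = "http://server:port/appname/solrcore/update?" + urlencode({"stream.body": "<commit/>"})
print(url)   # stream.body is sent as %3Ccommit%2F%3E
```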
>
>
>
>
> ________________________________
> From: Mike L. <javaone123@yahoo.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Saturday, June 29, 2013 3:15 AM
> Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
>
>
>
> I've been working on improving index time with a JdbcDataSource DIH based config and found it not to be as performant as I'd hoped, for various reasons not specifically due to Solr. With that said, I decided to switch gears a bit and test out a FileDataSource setup... I assumed that by eliminating network latency I should see drastic improvements in import time... but I'm a bit surprised that this process seems to run much slower, at least the way I've initially coded it (below).
>
> The below is a barebones file import that I wrote which consumes a tab-delimited file. Nothing fancy here. The regex just separates out the fields... Is there a faster approach to doing this? If so, what is it?
>
> Also, what is the "recommended" approach in terms of indexing/importing data? I know that may come across as a vague question, as there are various options available, but which one would be considered the "standard" approach within a production enterprise environment?
>
>
> (below has been cleansed)
>
> <dataConfig>
>      <dataSource name="file" type="FileDataSource" />
>    <document>
>          <entity name="entity1"
>                  processor="LineEntityProcessor"
>                  url="[location_of_file]/file.csv"
>                  dataSource="file"
>                  transformer="RegexTransformer,TemplateTransformer">
>  <field column="rawLine"
>         regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
>         groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22" />
>          </entity>
>    </document>
> </dataConfig>
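[Editor's note: on the speed question, one likely reason the 22-group regex lags behind a CSV-style loader is that a plain tab split does the same job with far less work per line, while the chain of greedy `(.*)` groups forces the regex engine to backtrack. A standalone Python sketch (not DIH code) showing the two approaches yield identical fields for a tab-free 22-column row:]

```python
import re

# A sample 22-column tab-delimited row, like the file in the config above.
line = "\t".join(f"val{i}" for i in range(1, 23))

# The DIH config's approach: one capture group per column.
pattern = re.compile("^" + r"\t".join(["(.*)"] * 22) + "$")
regex_fields = list(pattern.match(line).groups())

# What a CSV-style loader effectively does: a single split per line.
split_fields = line.split("\t")

print(regex_fields == split_fields)   # True: same fields, far cheaper to compute
```

Since both produce the same columns, switching to UpdateCSV (or anything split-based) loses nothing but the per-line regex overhead.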
>
> Thanks in advance,
> Mike
>



-- 
Regards,
Shalin Shekhar Mangar.
