lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: Build index from Oracle, adding fields
Date Thu, 26 Mar 2015 23:42:05 GMT
On 3/26/2015 5:19 PM, Julian Perry wrote:
> I have an index with, say, 10 fields.
>
> I load that index directly from Oracle - data-config.xml using
> JDBC.  I can load 10 million rows very quickly.  This direct
> way of loading from Oracle straight into SOLR is fantastic -
> really efficient and saves writing loads of import/export code
> (e.g. via a CSV file).
>
> Of those 10 fields - two of them (set to multiValued) come from
> a separate table and there are anything from 1 to 10 rows per
> row from the main table.
>
> I can use a nested entity to extract the child rows for each of
> the 10m rows in the main table - but then SOLR generates 10m
> separate SQL calls - and the load time goes from a few minutes
> to several days.
>
> On smaller tables - just a few thousand rows - I can use a
> second nested entity with a JDBC call - but not for very large
> tables.
>
> Could I load the data in two steps:
> 1)  load the main 10m rows
> 2)  load into the existing index by adding the data from a
>     second SQL call into fields for each existing row (i.e.
>     an UPDATE instead of an INSERT).
>
> I don't know what syntax/option might achieve that.  There
> is incremental loading - but I think that replaces whole rows
> rather than updating individual fields.  Or maybe it does
> do both?

If those child tables do not have a large number of entries, you can
configure caching on the inner entities so that the information doesn't
need to actually be requested from the database server.  If there are a
large number of entries, then that may not be possible due to memory
constraints.

https://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
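
As a rough sketch of what that looks like in data-config.xml (entity, table,
and column names here are made up for illustration): the child query runs once,
its results are held in memory keyed on the join column, and each parent row
does a cache lookup instead of a fresh SQL call:

```xml
<entity name="item" query="SELECT id, name FROM main_table">
  <!-- Entire child table is fetched once and cached in memory,
       keyed on item_id; per-row lookups never hit Oracle again. -->
  <entity name="tags"
          processor="CachedSqlEntityProcessor"
          query="SELECT item_id, tag FROM tag_table"
          cacheKey="item_id"
          cacheLookup="item.id"/>
</entity>
```

The memory cost is roughly the size of the whole child result set, which is
why this only works when the child tables are modest.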

If that's not practical, then the only real option you have is to drop
back to one entity, and build a single SELECT statement (using JOIN and
some form of CONCAT) that will gather all the information from all the
tables at the same time, and combine multiple values together into one
SQL result field with some kind of delimiter.  Then you can use the
RegexTransformer's "splitBy" functionality to turn the concatenated data
back into multiple values for your multi-valued field.  Database servers
tend to be REALLY good at JOIN operations, so the database would be
doing the heavy lifting.

https://wiki.apache.org/solr/DataImportHandler#RegexTransformer
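
Sketching that single-entity approach (schema names are hypothetical, and the
concatenation function is Oracle's LISTAGG, available in 11g and later - other
databases have GROUP_CONCAT or similar): the JOIN plus GROUP BY collapses the
child rows into one delimited string per parent row, and splitBy breaks it
back apart at index time:

```xml
<entity name="item" transformer="RegexTransformer"
        query="SELECT m.id, m.name,
                      LISTAGG(t.tag, '|') WITHIN GROUP (ORDER BY t.tag) AS tags
                 FROM main_table m
                 LEFT JOIN tag_table t ON t.item_id = m.id
                GROUP BY m.id, m.name">
  <!-- splitBy is a regex, so the pipe delimiter must be escaped -->
  <field column="tags" splitBy="\|"/>
</entity>
```

Pick a delimiter that cannot occur inside the child values, or the split will
produce spurious entries.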

Solr does have an equivalent concept to SQL's UPDATE, but there are
enough caveats to using it that it may not be a good option:

https://wiki.apache.org/solr/Atomic_Updates
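
For completeness, an atomic update in Solr's XML update format looks something
like this (field names invented for the example). The big caveats: every field
in the schema must be stored, because Solr rebuilds the whole document behind
the scenes, and the updateLog must be enabled in solrconfig.xml:

```xml
<add>
  <doc>
    <field name="id">12345</field>
    <!-- "set" replaces the field's value(s); "add" appends to a
         multiValued field -->
    <field name="tags" update="set">red</field>
    <field name="tags" update="add">blue</field>
  </doc>
</add>
```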

Thanks,
Shawn

