lucene-dev mailing list archives

From "Amrit Sarkar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-12854) Document steps to improve delta import via DataImportHandler
Date Thu, 11 Oct 2018 17:48:00 GMT

     [ https://issues.apache.org/jira/browse/SOLR-12854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Amrit Sarkar updated SOLR-12854:
--------------------------------
    Issue Type: Improvement  (was: Bug)

> Document steps to improve delta import via DataImportHandler 
> -------------------------------------------------------------
>
>                 Key: SOLR-12854
>                 URL: https://issues.apache.org/jira/browse/SOLR-12854
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - DataImportHandler
>    Affects Versions: 7.5
>            Reporter: Amrit Sarkar
>            Priority: Major
>
> Delta imports in DataImportHandler are sometimes slower than full imports, because a delta
import issues many more queries than a full import and therefore takes longer. This is described
in: https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
> On the mailing list (http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-td4338162.html),
a Solr user noted a workaround that improves delta import
performance: specify ${dataimporter.last_index_time} in deltaImportQuery
rather than in deltaQuery.
> {code}
> I found a hacky way to limit the number of 
> times deltaImportQuery was executed.
> As designed, Solr executes deltaQuery to get a list of ids that need to be indexed. For
each of those, it executes deltaImportQuery, which is typically very similar to the full query.
> I constructed a deltaQuery to purposely return only 1 row. E.g.
> deltaQuery = "SELECT id FROM table WHERE rownum=1" // written for
> Oracle, likely requires a different syntax for other DBs. Also, it occurred
> to me that you could probably include the date >= '${dataimporter.last_index_time}'
> filter here so this returns 0 rows if no data has changed.
> Since deltaImportQuery now only gets called once, I needed to add the filter logic to
deltaImportQuery to only select the changed rows (that logic is normally in deltaQuery).
E.g.
> deltaImportQuery = [normal import query] WHERE date >= 
> '${dataimporter.last_index_time}'
> {code}
> A number of other users have adopted this strategy and seen DIH delta import performance
improve, so documenting it as a TIP will help other users too.
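
For illustration only, the workaround quoted above might look roughly like this in a DIH data-config.xml. This is a sketch under stated assumptions, not a tested configuration: the table name (item), the columns (id, name, last_modified), and the Oracle TO_DATE conversion are hypothetical and do not come from the original post. The dataSource element is omitted.

```xml
<dataConfig>
  <!-- dataSource element omitted; table and column names below are hypothetical -->
  <document>
    <!--
      deltaQuery returns at most one row (Oracle "rownum" syntax, as in the
      quoted post), so deltaImportQuery runs at most once per delta import.
      The date filter makes deltaQuery return 0 rows when nothing has changed.
      deltaImportQuery carries the change filter itself instead of being
      looked up per id.
    -->
    <entity name="item" pk="id"
            query="SELECT id, name, last_modified FROM item"
            deltaQuery="SELECT id FROM item
                        WHERE last_modified &gt;= TO_DATE('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')
                        AND rownum = 1"
            deltaImportQuery="SELECT id, name, last_modified FROM item
                              WHERE last_modified &gt;= TO_DATE('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>
```

With a configuration along these lines, a delta import would issue two queries in total (one deltaQuery, one deltaImportQuery) regardless of how many rows changed. Other databases need a different single-row clause and date comparison.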



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

