lucene-solr-user mailing list archives

From "Reitzel, Charles" <>
Subject RE: Validate data Indexed and versioning
Date Mon, 02 Mar 2015 18:11:22 GMT
First, I would put most of the effort into developing good test cases and a good test
harness for your ETL software itself. If validation in production does encounter errors,
that should be considered a bug in your code! So be sure to always add those cases to your
test harness.

Also, the row-level validation can and should be driven by metadata. I'm assuming you have
a mapping between RDBMS table names and Solr entity types? And, for any given entity type,
a table that maps Solr field names and datatypes to their RDBMS equivalents? My assumption
would be that the ETL process itself uses such metadata. The same metadata could be used for
production data validation. My inclination would be to integrate granular / row-level validation
into the ETL job itself.
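
As a rough illustration of metadata-driven row-level validation, here is a minimal Python
sketch. Everything named in it is an assumption made for the example: the FieldMapping
structure, the PERSON_MAPPING table, and the column and field names (person_id, name_s,
age_i, and so on) are hypothetical and would come from your own ETL metadata.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldMapping:
    """One metadata row: a Solr field, its source column, and a normalizer."""
    solr_field: str
    db_column: str
    normalize: Callable[[Any], Any]


# Hypothetical mapping for a single entity type; in practice this would be
# loaded from the same metadata the ETL job already uses.
PERSON_MAPPING = [
    FieldMapping("id", "person_id", str),
    FieldMapping("name_s", "full_name", lambda v: str(v).strip()),
    FieldMapping("age_i", "age", int),
]


def _normalized(value: Any, normalize: Callable[[Any], Any]) -> Optional[Any]:
    # Missing values stay None so that "absent" and "present" never compare equal.
    return None if value is None else normalize(value)


def validate_row(db_row: dict, solr_doc: dict, mapping: list) -> list:
    """Return (solr_field, db_value, solr_value) for every mismatching field."""
    mismatches = []
    for m in mapping:
        expected = _normalized(db_row.get(m.db_column), m.normalize)
        actual = _normalized(solr_doc.get(m.solr_field), m.normalize)
        if expected != actual:
            mismatches.append((m.solr_field, expected, actual))
    return mismatches
```

Called from inside the ETL job, validate_row(db_row, solr_doc, PERSON_MAPPING) returns an
empty list when the document matches the row, and any non-empty result can be logged and
added to the test harness as a new case.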

For summary validation, if re-indexing from scratch every time, just run some facet queries
and compare to the equivalent summaries for the SQL input data (assuming you are familiar
with SQL "group by" and "having" clauses).    If using incremental loads, make sure you can
associate the loaded data with the ETL job that loaded it (timestamp, batch ID, etc.).   Then
simply scope the facet queries by the batch in question and compare to the SQL summary.
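
A minimal Python sketch of the summary comparison, under stated assumptions: the staging
table, its entity_type and batch_id columns, and the Solr fields of the same names are
hypothetical. The Solr side is assumed to be a field-facet query scoped to the batch, for
example q=*:*&fq=batch_id:7&rows=0&facet=true&facet.field=entity_type, whose default JSON
response lists facet values and counts as one flat, alternating list.

```python
import sqlite3


def facet_list_to_counts(flat: list) -> dict:
    """Turn Solr's flat facet list [value, count, value, count, ...] into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def sql_counts(conn: sqlite3.Connection, batch_id: int) -> dict:
    """Per-entity-type row counts for one batch (hypothetical staging schema)."""
    rows = conn.execute(
        "SELECT entity_type, COUNT(*) FROM staging "
        "WHERE batch_id = ? GROUP BY entity_type",
        (batch_id,),
    )
    return dict(rows.fetchall())


def compare_counts(expected: dict, actual: dict) -> dict:
    """Return {key: (expected, actual)} for every key whose counts differ."""
    diffs = {}
    for key in sorted(set(expected) | set(actual)):
        e, a = expected.get(key, 0), actual.get(key, 0)
        if e != a:
            diffs[key] = (e, a)
    return diffs


def demo() -> dict:
    # In-memory stand-in for the SQL input data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (id INTEGER, entity_type TEXT, batch_id INTEGER)")
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?, ?)",
        [(1, "A", 7), (2, "A", 7), (3, "B", 7), (4, "B", 8)],
    )
    # Stand-in for response["facet_counts"]["facet_fields"]["entity_type"];
    # one type "A" document is missing from the index in this sample.
    solr_facets = ["A", 1, "B", 1]
    return compare_counts(sql_counts(conn, 7), facet_list_to_counts(solr_facets))
```

Here demo() reports the one discrepancy for batch 7 as a mapping from entity type to
(SQL count, Solr count). Keep in mind that facet.limit caps the number of facet values
returned, so it may need raising for high-cardinality fields.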

-----Original Message-----
From: marotosg [] 
Sent: Monday, March 02, 2015 6:32 AM
Subject: Validate data Indexed and versioning


I am trying to define a way of validating whether my index has the same content as my database.
I am indexing a very complex denormalized version of the database with many items and nested
documents. I have an indexing service which pulls records from a staging table (created by
an ETL process) and transforms them into XML, which is then posted to Solr.

Is there a general approach to checking whether an indexed document matches its database row?

One option I see is to create an additional service that runs against both Solr and the database
and validates that they hold the same data, but this is going to be very intensive.
I was leaning more towards having Solr report which record was indexed and what it contains,
such as the number of nested docs of type A, B, etc.

Any suggestions would help.