lucene-dev mailing list archives

From "Timothy Potter (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-445) Update Handlers abort with bad documents
Date Thu, 24 Mar 2016 19:28:25 GMT

    [ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210827#comment-15210827 ]

Timothy Potter commented on SOLR-445:
-------------------------------------

LGTM +1 Nice test coverage of all this!  This will be very useful for streaming applications
(such as from Spark and Storm) where retrying individual docs after an error is less than
ideal. Now we'll be able to pinpoint exactly which docs had issues!

I'd prefer this to be baked into the default chain, but I can understand the rationale for leaving
it out for now, so long as we put up an example of how to enable it using the Config API
in the ref guide.
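
As a rough starting point for such an example, here is a minimal sketch that builds the request body for the Config API's `add-updateprocessor` command. The processor name `tolerant` is hypothetical, and whether this factory can be enabled this way (rather than only through a chain in solrconfig.xml) is an assumption that has not been tested here.

```python
import json

# Hypothetical name for the registered processor; not taken from the issue.
PROCESSOR_NAME = "tolerant"


def build_add_processor_command(max_errors: int) -> str:
    """Return a JSON body for a Config API `add-updateprocessor` command.

    Assumes the factory accepts `maxErrors` as a plain init arg, mirroring
    the <int name="maxErrors"> element in the XML config from the issue.
    """
    command = {
        "add-updateprocessor": {
            "name": PROCESSOR_NAME,
            "class": "solr.TolerantUpdateProcessorFactory",
            "maxErrors": max_errors,
        }
    }
    return json.dumps(command, indent=2)


if __name__ == "__main__":
    # Print the body that would be POSTed to the collection's /config endpoint.
    print(build_add_processor_command(10))
```

The resulting body would be POSTed with `Content-Type: application/json` to the collection's `/config` endpoint (for example `/solr/techproducts/config`), and update requests would then presumably select it with `processor=tolerant`.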



> Update Handlers abort with bad documents
> ----------------------------------------
>
>                 Key: SOLR-445
>                 URL: https://issues.apache.org/jira/browse/SOLR-445
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>            Reporter: Will Johnson
>            Assignee: Hoss Man
>             Fix For: master, 6.1
>
>         Attachments: SOLR-445-3_x.patch, SOLR-445-alternative.patch, SOLR-445-alternative.patch,
> SOLR-445-alternative.patch, SOLR-445-alternative.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch,
> SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch,
> SOLR-445_3x.patch, solr-445.xml
>
>
> This issue adds a new {{TolerantUpdateProcessorFactory}} making it possible to configure
> Solr updates so that they are "tolerant" of individual errors in an update request...
> {code}
>   <processor class="solr.TolerantUpdateProcessorFactory">
>     <int name="maxErrors">10</int>
>   </processor>
> {code}
> When a chain with this processor is used, but maxErrors isn't exceeded, here's what the
> response looks like...
> {code}
> $ curl 'http://localhost:8983/solr/techproducts/update?update.chain=tolerant-chain&wt=json&indent=true&maxErrors=-1' \
>     -H "Content-Type: application/json" \
>     --data-binary '{"add" : { "doc":{"id":"1","foo_i":"bogus"}}, "delete": {"query":"malformed:["}}'
> {
>   "responseHeader":{
>     "errors":[{
>         "type":"ADD",
>         "id":"1",
>         "message":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
>       {
>         "type":"DELQ",
>         "id":"malformed:[",
>         "message":"org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    "}],
>     "maxErrors":-1,
>     "status":0,
>     "QTime":1}}
> {code}
> Note in the above example that:
> * maxErrors can be overridden on a per-request basis
> * an effective {{maxErrors==-1}} (either from config, or request param) means "unlimited"
> (under the covers it's using {{Integer.MAX_VALUE}})
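
The two notes above can be sketched as follows. This is an illustration of the described behavior, not the processor's actual code; the function and constant names are hypothetical.

```python
from typing import Optional

# Numeric value of Java's Integer.MAX_VALUE, which the issue says stands in
# for "unlimited" under the covers.
JAVA_INTEGER_MAX_VALUE = 2**31 - 1


def effective_max_errors(configured: int, request_param: Optional[int] = None) -> int:
    """Resolve the maxErrors value that applies to one update request.

    A request parameter, when present, overrides the configured value, and
    -1 from either source is treated as "unlimited".
    """
    raw = configured if request_param is None else request_param
    return JAVA_INTEGER_MAX_VALUE if raw == -1 else raw


if __name__ == "__main__":
    print(effective_max_errors(10))       # config only
    print(effective_max_errors(10, -1))   # request param asks for "unlimited"
    print(effective_max_errors(10, 1))    # request param tightens the limit
```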
> If/when maxErrors is reached for a request, then the _first_ exception that the processor
> caught is propagated back to the user, and metadata is set on that exception with all of the
> same details about all the tolerated errors.
> This next example is the same as the previous except that instead of {{maxErrors=-1}}
> the request param is now {{maxErrors=1}}...
> {code}
> $ curl 'http://localhost:8983/solr/techproducts/update?update.chain=tolerant-chain&wt=json&indent=true&maxErrors=1' \
>     -H "Content-Type: application/json" \
>     --data-binary '{"add" : { "doc":{"id":"1","foo_i":"bogus"}}, "delete": {"query":"malformed:["}}'
> {
>   "responseHeader":{
>     "errors":[{
>         "type":"ADD",
>         "id":"1",
>         "message":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
>       {
>         "type":"DELQ",
>         "id":"malformed:[",
>         "message":"org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    "}],
>     "maxErrors":1,
>     "status":400,
>     "QTime":1},
>   "error":{
>     "metadata":[
>       "org.apache.solr.common.ToleratedUpdateError--ADD:1","ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
>       "org.apache.solr.common.ToleratedUpdateError--DELQ:malformed:[","org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    ",
>       "error-class","org.apache.solr.common.SolrException",
>       "root-error-class","java.lang.NumberFormatException"],
>     "msg":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
>     "code":400}}
> {code}
> ...the added exception metadata ensures that even in client code like the various SolrJ
> SolrClient implementations, which throw a (client side) exception on non-200 responses, the
> end user can access info on all the tolerated errors that were ignored before the maxErrors
> threshold was reached.
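
For a client that only has the raw JSON error body, recovering the tolerated errors from that metadata might look like the sketch below. The key layout (`ToleratedUpdateError--TYPE:ID` in a flat key/value list) is inferred solely from the example response in this issue, and the function name is hypothetical.

```python
from typing import Dict, List, Sequence

# Key prefix as it appears in the example "metadata" list in the issue.
PREFIX = "org.apache.solr.common.ToleratedUpdateError--"


def tolerated_errors(metadata: Sequence[str]) -> List[Dict[str, str]]:
    """Extract tolerated errors from a flat [key, value, key, value, ...] list."""
    errors = []
    for key, value in zip(metadata[0::2], metadata[1::2]):
        if not key.startswith(PREFIX):
            continue  # skip entries such as "error-class"
        # Split on the first colon only, since ids like "malformed:[" contain colons.
        op_type, _, op_id = key[len(PREFIX):].partition(":")
        errors.append({"type": op_type, "id": op_id, "message": value})
    return errors


if __name__ == "__main__":
    # Metadata copied from the maxErrors=1 example response.
    sample = [
        "org.apache.solr.common.ToleratedUpdateError--ADD:1",
        "ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
        "org.apache.solr.common.ToleratedUpdateError--DELQ:malformed:[",
        "org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered \"<EOF>\" "
        "at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    ",
        "error-class", "org.apache.solr.common.SolrException",
        "root-error-class", "java.lang.NumberFormatException",
    ]
    for err in tolerated_errors(sample):
        print(err["type"], err["id"])
```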
> ----
> {panel:title=Original Jira Request}
> Has anyone run into the problem of handling bad documents / failures mid-batch?  E.g.:
> <add>
>   <doc>
>     <field name="id">1</field>
>   </doc>
>   <doc>
>     <field name="id">2</field>
>     <field name="myDateField">I_AM_A_BAD_DATE</field>
>   </doc>
>   <doc>
>     <field name="id">3</field>
>   </doc>
> </add>
> Right now Solr adds the first doc and then aborts.  It would seem like it should either
> fail the entire batch or log a message/return a code and then continue on to add doc 3.  Option
> 1 would seem to be much harder to accomplish and possibly require more memory, while Option
> 2 would require more information to come back from the API.  I'm about to dig into this but
> I thought I'd ask to see if anyone had any suggestions, thoughts or comments.
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

