lucene-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-445) Update Handlers abort with bad documents
Date Fri, 25 Mar 2016 21:09:26 GMT

    [ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212415#comment-15212415 ]

ASF subversion and git services commented on SOLR-445:
------------------------------------------------------

Commit 5b6eacb80bca5815059cd50a1646fa4ecb146e43 in lucene-solr's branch refs/heads/branch_6x
from [~hossman_lucene@fucit.org]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5b6eacb ]

SOLR-445: New TolerantUpdateProcessorFactory to support skipping update commands that cause failures when sending multiple updates in a single request.

SOLR-8890: New static method in DistributedUpdateProcessorFactory to allow UpdateProcessorFactories
to indicate request params that should be forwarded when DUP distributes updates.

This commit is a squashed merge from the jira/SOLR-445 branch (as of b08c284b26b1779d03693a45e219db89839461d0)
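
For context, a rough Java sketch of how an UpdateRequestProcessorFactory might use that SOLR-8890 hook. The static method name used here ({{addParamToDistributedRequestWhitelist}}) and the factory class are assumptions for illustration, not necessarily the committed code:

{code}
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.DistributedUpdateProcessorFactory;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hypothetical factory: asks DUP to forward a custom request param ("maxErrors")
// when it distributes updates to other nodes.
public class ForwardingAwareFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    // without a hook like this, custom params would be lost when DUP forwards the update
    DistributedUpdateProcessorFactory.addParamToDistributedRequestWhitelist(req, "maxErrors");
    return next; // a real factory would wrap `next` in its own processor
  }
}
{code}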


> Update Handlers abort with bad documents
> ----------------------------------------
>
>                 Key: SOLR-445
>                 URL: https://issues.apache.org/jira/browse/SOLR-445
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>            Reporter: Will Johnson
>            Assignee: Hoss Man
>             Fix For: master, 6.1
>
>         Attachments: SOLR-445-3_x.patch, SOLR-445-alternative.patch, SOLR-445-alternative.patch, SOLR-445-alternative.patch, SOLR-445-alternative.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445_3x.patch, solr-445.xml
>
>
> This issue adds a new {{TolerantUpdateProcessorFactory}} making it possible to configure Solr updates so that they are "tolerant" of individual errors in an update request...
> {code}
>   <processor class="solr.TolerantUpdateProcessorFactory">
>     <int name="maxErrors">10</int>
>   </processor>
> {code}
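> For completeness, a minimal sketch of how such a chain might be declared in solrconfig.xml. The chain name matches the {{update.chain=tolerant-chain}} param used in the examples below; the surrounding processors shown are just the usual defaults and are assumptions here, not mandated by this feature:
> {code}
> <updateRequestProcessorChain name="tolerant-chain">
>   <processor class="solr.TolerantUpdateProcessorFactory">
>     <int name="maxErrors">10</int>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory"/>
>   <processor class="solr.DistributedUpdateProcessorFactory"/>
>   <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
> {code}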
> When a chain with this processor is used, but maxErrors isn't exceeded, here's what the response looks like...
> {code}
> $ curl 'http://localhost:8983/solr/techproducts/update?update.chain=tolerant-chain&wt=json&indent=true&maxErrors=-1' -H "Content-Type: application/json" --data-binary '{"add" : { "doc":{"id":"1","foo_i":"bogus"}}, "delete": {"query":"malformed:["}}'
> {
>   "responseHeader":{
>     "errors":[{
>         "type":"ADD",
>         "id":"1",
>         "message":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string:
\"bogus\""},
>       {
>         "type":"DELQ",
>         "id":"malformed:[",
>         "message":"org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered
\"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n
   <RANGE_GOOP> ...\n    "}],
>     "maxErrors":-1,
>     "status":0,
>     "QTime":1}}
> {code}
> Note in the above example that:
> * maxErrors can be overridden on a per-request basis
> * an effective {{maxErrors==-1}} (either from config, or request param) means "unlimited" (under the covers it's using {{Integer.MAX_VALUE}}, as sketched below)
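> As a rough sketch of that normalization (hypothetical variable names, not the actual implementation):
> {code}
> // request param overrides the configured default; -1 means "unlimited"
> int requested = req.getParams().getInt("maxErrors", configuredMaxErrors);
> int effectiveMaxErrors = (requested == -1) ? Integer.MAX_VALUE : requested;
> {code}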
> If/When maxErrors is reached for a request, then the _first_ exception that the processor caught is propagated back to the user, and metadata is set on that exception with all of the same details about all the tolerated errors.
> This next example is the same as the previous except that instead of {{maxErrors=-1}} the request param is now {{maxErrors=1}}...
> {code}
> $ curl 'http://localhost:8983/solr/techproducts/update?update.chain=tolerant-chain&wt=json&indent=true&maxErrors=1' -H "Content-Type: application/json" --data-binary '{"add" : { "doc":{"id":"1","foo_i":"bogus"}}, "delete": {"query":"malformed:["}}'
> {
>   "responseHeader":{
>     "errors":[{
>         "type":"ADD",
>         "id":"1",
>         "message":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string:
\"bogus\""},
>       {
>         "type":"DELQ",
>         "id":"malformed:[",
>         "message":"org.apache.solr.search.SyntaxError: Cannot parse 'malformed:[': Encountered
\"<EOF>\" at line 1, column 11.\nWas expecting one of:\n    <RANGE_QUOTED> ...\n
   <RANGE_GOOP> ...\n    "}],
>     "maxErrors":1,
>     "status":400,
>     "QTime":1},
>   "error":{
>     "metadata":[
>       "org.apache.solr.common.ToleratedUpdateError--ADD:1","ERROR: [doc=1] Error adding
field 'foo_i'='bogus' msg=For input string: \"bogus\"",
>       "org.apache.solr.common.ToleratedUpdateError--DELQ:malformed:[","org.apache.solr.search.SyntaxError:
Cannot parse 'malformed:[': Encountered \"<EOF>\" at line 1, column 11.\nWas expecting
one of:\n    <RANGE_QUOTED> ...\n    <RANGE_GOOP> ...\n    ",
>       "error-class","org.apache.solr.common.SolrException",
>       "root-error-class","java.lang.NumberFormatException"],
>     "msg":"ERROR: [doc=1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
>     "code":400}}
> {code}
> ...the added exception metadata ensures that even in client code like the various SolrJ SolrClient implementations, which throw a (client side) exception on non-200 responses, the end user can access info on all the tolerated errors that were ignored before the maxErrors threshold was reached.
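> A hedged SolrJ sketch of what that client-side access might look like; catching {{SolrException}} and reading its {{getMetadata()}} follows the metadata keys shown in the output above, but the exact parsing here is an illustration, not a committed API:
> {code}
> import java.util.Map;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> import org.apache.solr.client.solrj.request.UpdateRequest;
> import org.apache.solr.common.SolrException;
> import org.apache.solr.common.SolrInputDocument;
> import org.apache.solr.common.util.NamedList;
>
> public class ToleratedErrorsDemo {
>   public static void main(String[] args) throws Exception {
>     try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts")) {
>       SolrInputDocument doc = new SolrInputDocument();
>       doc.addField("id", "1");
>       doc.addField("foo_i", "bogus"); // bad int value, as in the curl examples
>       UpdateRequest update = new UpdateRequest();
>       update.setParam("update.chain", "tolerant-chain");
>       update.setParam("maxErrors", "1");
>       update.add(doc);
>       update.process(client);
>     } catch (SolrException e) { // SolrJ throws client side on the non-200 response
>       NamedList<String> meta = e.getMetadata();
>       if (meta != null) {
>         for (Map.Entry<String, String> entry : meta) {
>           // keys look like "org.apache.solr.common.ToleratedUpdateError--ADD:1"
>           if (entry.getKey().startsWith("org.apache.solr.common.ToleratedUpdateError--")) {
>             System.out.println("tolerated: " + entry.getKey() + " => " + entry.getValue());
>           }
>         }
>       }
>     }
>   }
> }
> {code}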
> ----
> {panel:title=Original Jira Request}
> Has anyone run into the problem of handling bad documents / failures mid-batch? i.e.:
> {code}
> <add>
>   <doc>
>     <field name="id">1</field>
>   </doc>
>   <doc>
>     <field name="id">2</field>
>     <field name="myDateField">I_AM_A_BAD_DATE</field>
>   </doc>
>   <doc>
>     <field name="id">3</field>
>   </doc>
> </add>
> {code}
> Right now Solr adds the first doc and then aborts. It would seem like it should either fail the entire batch, or log a message/return a code and then continue on to add doc 3. Option 1 would seem to be much harder to accomplish and possibly require more memory, while Option 2 would require more information to come back from the API. I'm about to dig into this, but I thought I'd ask to see if anyone had any suggestions, thoughts or comments.
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
