lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-5944) Support updates of numeric DocValues
Date Thu, 01 Dec 2016 22:21:59 GMT

     [ https://issues.apache.org/jira/browse/SOLR-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hoss Man updated SOLR-5944:
---------------------------
    Attachment: SOLR-5944.patch


Ok -- i've been reviewing Ishan's latest patch.

In a conversation I had with him offline, he mentioned that he hadn't been doing any further
work on this locally since his last update, so w/o any risk of collision I went ahead and
made a bunch of small updates directly...

{panel:title=Small changes I made directly in latest patch}
* misc javadoc tweaks & improvements

* some logging cleanup & template improvements
** Anytime we log anything that includes a complex object (that we expect to be toString()-ified)
is a situation where we should _definitely_ be using templated log messages to avoid potentially
expensive calls to toString() in the case that the log level results in the method being a
No-Op.
** I probably missed some, but i tried to catch as many as i could

* some simple method renames:
** UpdateLog.addUpdateCommandFromTlog -> convertTlogEntryToAddUpdateCommand
*** older name was ambigiuous because of dual meaning of "add", now verb is clear.
** AtomicUpdateDocumentMerger.getSearcherNonStoredDVs -> getNonStoredDocValueFieldNamesFromSearcher

* refactored some duplicate code into new helper methods:
** DirectUpdateHandler2.updateDocOrDocValues

* refactor some loops where methods were being called unneccessarily in every iteration
** DocumentBuilder.toDocument
** AtomicUpdateDocumentMerger.isInPlaceUpdate (IndexSchema.getUniqueKeyField and IndexWriter.getFieldNames())

* added some comments containing important details/reasoning that had only been captured in
jira comments so far
** AtomicUpdateDocumentMerger.doInPlaceUpdateMerge

* increased usage of expectThrows, beefed up asserts related to expected Exceptions

* added some nocommit comments regarding some questionable lines of code (either new questions,
or things I've provided feedback on before that keep slipping through the cracks)

* updated to master/98f75723f3bc6a718f1a7b47a50b820c4fb408f6

{panel}


More in-depth/concrete responses/comments to the previous patch are listed below -- anything
mentioned below that still seems problematic to me should also have nocommit comments in the
code associated with them, but i wanted to elaborate a bit here in the jira comments.

(The reverse is not neccessarily true however: I added nocommits for some small things that
I didn't bother to add jira comments aboute)


{panel:title=schema.xml}
FWIW: I was initially concerned that change all these _existing_ dynamicFields from indexed=true
to indexed=false might be weakening other tests, but I did a quick grep for their usage in
other test methods (prior to this jira) and the only impacts i found were in AtomicUpdatesTest,
where the only concern was docValues=true and stored=false.  So changing these to indexed=false
doesn't impact that test at all (they are never queried against)

Based on the "dvo" naming, (which i think a reasonable person would read as "doc values only")
making these fields indexed=false seems like a good idea

{panel}


{panel:title=DistributedUpdateProcessor}

* In general, i think there's a gap in the DUP logic related to how the results of fetchFullUpdateFromLeader
are used when replicas never get dependent updates
** see detailed nocommit comments in the body of fetchFullUpdateFromLeader for details
*** they apply to both call stacks where the method is used, but i wnated ot write it once
in one place

* versionAdd
** {{if (fetchedFromLeader == null)}}
*** I'm a little concerned/confused by the "else" block of this condition
**** grep code for "nocommit: OUTER IF", "nocommit: OUTER ELSE" & "nocommit: INNER ELSE"
*** at this point, we've replaced our "in-place" update with a regular update using the full
SolrInputDocument from the leader (by overwritting cmd.solrDoc, cmd.prevVersion, cmd.setVersion(...))
*** but this if/else is itself wrapped in a bigger {{if (cmd.isInPlaceUpdate())}} whose else
clause does some sanity checking / processing logic for "// non inplace update, i.e. full
document update"
*** shouldn't the logic in that outer else clause also be applied to our "no longer an in-place"
update?

* fetchFullUpdateFromLeader
** this method still calls {{forceUpdateCollection}} ... note shalin's comments back in August...{quote}
Under no circumstances should we we calling `forceUpdateCollection` in the indexing code path.
It is just too dangerous in the face of high indexing rates. Instead we should do what the
DUP is already doing i.e. calling getLeaderRetry to figure out the current leader. If the
current replica is partitioned from the leader then we have other mechanisms for taking care
of it and the replica has no business trying to determine this.
{quote}

{panel}


{panel:title=DirectUpdateHandler2}

* I previously asked about LUCENE-7344...{quote}
* so what's our Solr solution / workaround?  do we have any?
{color:green}Our workaround is to reopen an RT searcher during every DBQ (in the DUH2), as
per Yonik's suggestion.{color}
{quote}
** I gather, reading this patch and knowing the context of this issue, that the DUH2 line
i added a "nocommit: LUCENE-7344" to is the line in question?
** in general, any code that exists purely to work around some other bug should have a have
a big, bold, loud, comment explaining the purpose for it's existence and a link to the known
bug
** specifically: the fact that {{grep LUCENE-7344 SOLR-5944.patch}} didn't match *ANY* lines
in the last patch was a big red flag to me
*** the code should have a comment making it clear where/how/why/what the work around is
*** the test(s) written to ensure the workaround works should also have comments refering
to the issue at hand (offhand, i personally couldn't identify them)

{panel}


{panel:title=DocumentBuilder}

* addField
** previous feedback/response...{quote}
* if {{true==forInPlaceUpdate}} and {{createFields(...)}} returns anything that is *not* a
NumericDocValuesField (or returns more then one) shouldn't that trip an assert or something?
(ie: doesn't that mean this SchemaField isn't valid for using with an in place update, and
the code shouldn't have gotten this far?)
** {color:green}This is fine, since createFields() generates both NDV and non NDV fields for
an indexable field, and the intent is to drop the non NDV one. Added a comment to this effect{color}
{quote}
** I couldn't make sense of this until i realized what you're describing is evidently a bug
in TrieField: SOLR-9809
** I've updated addField with a comment to make it clear we're working around that bug
** unless i'm missunderstanding something, fixing that bug could theoretically allow us to
throw away most of the new DocumentBuilder logic in this patch (that's conditional on forInPlaceUpdate)
*** notable exception: optimizing away the default field construction
*** also: leaving the new code in will be handy for the the purposes of asserting that we
only have the expected NumericDocValuesField instances

{panel}


{panel:title=UpdateLog}

* applyPartialUpdates
** in reply to some previous questions/feedback...{quote}
{color:green}...It is my understanding is that due to the incref/decref being inside the synchronized(ulog)
block, no other thread's action would affect whatever happens to these tlog once we've entered
the block....{color}
{quote}
*** the problem in previous patches was that the incref/decref were *NOT* inside the sync
block -- which is why i asked the questions.  In this patch you moved the {{getEntryFromTLog}}
call inside the same sync block where the {{List<TransactionLog>}} is created, so _now_
as {{getEntryFromTLog}} loops over the TransactionLog, we're safe -- but that wasn't happening
before, hence the question.
**** I _think_ the way the code is writen right now is safe/valid, but i'd feel better about
it if we could get some more eyeballs to review the sync usage.
*** Even if it is currently correct/valid, I still think it would be good to make the {{getEntryFromTLog}}
method itself declared as {{synchronized}} -- it should have no impact on performance/correctness
since the only existing usage is the {{synchronized (this)}} block, but it will protect us
against future dev mistakes since hte _only_ valid usage of that method is when you have a
lock on {{this}} (otherwise some other thread might be decrefing/closing the TransactionLog
instances used in that method.
** some (different) feedback about this method from before...{quote}
"in general i'm wondering if lookupLogs should be created outside of the while loop, so that
there is a consistent set of "logs" for the duration of the method call ... what happens right
now if some other thread changes tlog/prevMapLog/prevMapLog2 in between iterations of the
while loop?" {color:green}<-- I added them inside the while block so that if there have
been multiple commits during the course of iterating the while loop, those tlogs wouldn't
have grown stale. Do you think we should pull it out outside the while loop?{color}
{quote}
*** The concern i have is that AFAICT rebuilding the {{Arrays.asList(tlog, prevMapLog, prevMapLog2)}}
in every iteration of the while loop seems to be at cross purposes with the reason why the
while loop might have more then one iteration.
*** as you said: if commits happen in other threads while the code in the loop is evaluating
an entry to apply partial updates, then the _next_ iteration of the loop (if any) will get
the updated tlog with new data
**** but the _reason_ there might be a "next" iteration of the loop is if the {{prevPointer}}
of the our current {{IN_PLACE}} update identified an early (ancestor) {{IN_PLACE}} update
-- having _newer_ commits on the "next" iteration of the loop doesn't help us. (does it?)
*** in other words: the purpose of the while loop is to be able to iterate backwards through
time in the tlogs when many {{IN_PLACE}} updates are applied on top of eachother, while the
purpose you listed for defining {{lookupLogs}} inside the loop is to support moving forward
in time as new tlogs are created -- which actually _decreases_ the likelihood that we'll find
some ancestor entry we have a cascading dependency on.
*** does that make sense?
**** Am i missunderstanding/missing some possible benefit of having the "newer" tlogs available
as the while loop iterates back in time over older and older prevPointers?
** assuming i'm correct about both of the above points...
*** {{List<TransactionLog> lookupLogs}} should be created once, outside of any looping,
and never redefined to ensure that even if subsequent commits occur in concurrent threads,
we can still trace the chain of dependent updates back as far as possible using the oldest
known tlogs available to us when the method was called.
*** {{List<TransactionLog> lookupLogs}} and any calls to {{getEntryFromTLog}} that use
that List *MUST* happen in the same sync block.
*** ergo the sync block must wrap the while loop ... ie: we might as well declare the entire
method syncrhonized.

{panel}


{panel:AtomicUpdateDocumentMerger}

* isInPlaceUpdate
** regarding some previous feedback...{quote}
couldn't this list of fields be created by the caller and re-used at least for the entire
request (ie: when adding multiple docs) ? {color:green}The set returned is precomputed upon
the opening of a searcher. The only cost I see is to create a new unmodifiableSet every time.
I'd prefer to take up this optimization later, if needed.{color}{quote}
*** the method you're talking about isn't even used in this patch -- as you noted in the comment
you made immediately after that one, you changed it to use {{IndexWriter.getFieldNames()}}
instead.
*** {{IndexWriter.getFieldNames()}} delegates to a syncronized method, which means there is
definitely overhead in calling it over and over for every field name in every document in
the request.
*** even if you don't think it's worth refactoring this method so that {{IndexWriter.getFieldNames()}}
only needs to be called once per SolrQueryRequest, it seems irresponsible not to at least
call it *once* in this method (ie: once per doc) before looping over all the fields ... so
i made that change as well.
** regarding some other previous feedback...{quote}
* {{if (indexFields.contains(fieldName) == false && schema.isDynamicField(fieldName))}}
** why does it matter one way or the other if it's a dynamicField? {color:green}Changed the
logic to check in the IW for presence of field. Added a comment: "// if dynamic field and
this field doesn't exist, DV update can't work"{color}
{quote}
*** that doesn't really answer my question at all: *why* does it matter if it's a dynamic
field?
*** Doesn't the same problematic situation happen if it's an explicitly defined {{<field
.. />}} in the schema.xml, but just doesn't happen to exist yet in any docs? -- ie: if
the very first update to the index is:{code}
[{"id":"HOSS","price":{"set":42.10}}]
{code}???
*** even if i'm wrong: consider this code (and the comment you added) from the point of view
of someone completely unfamiliar with this issue.  if they come along, and see this code jumping
through some hoops related to dynamicFields, and the only comment is (essential) "doesn't
work for dynamic fields" that comment isn't going to help them make any sense of the code
-- it would be like if some code had a comment that said {{// set the value of the variable
will_not_work to true}}, it doesn't explain _why_ that line of code exists. 
**** I've been working on this issue with you for months, I'm intimately familiar with this
patch, but reading this code now I can't recall any particular reason *why* "DV update can't
work" specifically in the case of "*dynamic field* _and this field doesn't exist_"
**** i thought the only known problem like this was just "this field doesn't yet exist in
the index" in general? (and that we had a check for that problem at a higher level then this
method)
*** if i am wrong, and there is an important distinction between dynamicFields and "static"
fields when dealing with the situation of in-place updates being attempted against an index
that doesn't yet contain them, then that makes me very concerned that most of the existing
testing depends solely on dynamicFields (or in some cases solely on static fields)
**** an easy solution would be to replace {{schema-inplace-updates.xml}} with 2 variants:
one that _only_ has the {{<dynamicField/>}} declarations needed to match all the field
names used in the tests, and one that has an explicit {{<field />}} declaration for
every field name used in the tests.  and then all the tests could randomize between {{schema-inplace-updates-dynamicfields.xml}}
and {{schema-inplace-updates-staticfields.xml}} in their {{@BeforeClass}}
** explicitly throwing {{new RuntimeException}} is a big red flag.
*** if we're going to catch {{IOException}} and wrap it in a {{RuntimeException}} we should
at least use {{SolrException}} instead and use a message that provides context
*** since the only caller of this method ({{DUP.getUpdatedDocument}}) already declares that
it throws {{IOException}}, why not just let this method propogate the {{IOException}} as is?

* doInPlaceUpdateMerge
** a previous question...{quote}
why are we copying all supported DVs over? {color:green}If we update dv1 in one update, and
then update dv2, and again update dv1 (without commits in between), the last update would
fetch from the tlog the partial doc for dv2 update. If that doc doesn't copy over the previous
updates to dv1, then a full resolution (by following previous pointers) would need to be done
to calculate the dv1 value. Hence, I decided to copy over older DV updates.{color}
{quote}
*** that makes perfect sense -- that explanation should be part of the code comment so a future
dev doesn't have the same mistaken impression i did that it was copying DVs that aren't needed
-- so i added it to the code as a new java comment.

{panel}

{panel:title=schema-inplace-updates.xml}

* in general there was a lot of unneccessary stuff in this schema copy/pasted from schema-minimal-atomic-stress.xml
** even some comments that refered to test classes that were using schema-minimal-atomic-stress.xml
and had nothing to do with this config
* my previous suggestion was "add a new, small, purpose built & well commented schema-inplace-atomic-updates.xml
file just for the new InPlace tests – containing only the fields/dynamicFields needed" ...
it's not "small and purpose build" if it's got useless stuff copy/pasted from other files.
** leaving unused fields/types here defeats the entire point of my suggestion: "having smaller
single purpose test configs makes it a lot easier for people reviewing tests to understand
what the minimum expectations are for the test to function"
** i went ahead and cleaned up everything that wasn't needed for the new tests, and added
a few comments as appropriate to make it clear to future readers why some stuff is the way
it is.

* regarding the copyField testing & schema comment you had...
** i had to re-read the schema and the test involved at least 10 times before i could make
sense of it
** the crux of my confusion was that in the second copyField:
*** the src field named {{copyfield_not_updateable_src}} *did* support in place updates (in
spite of it's name)
*** the dest field named {{copyfield_updatable_target_i}} *did not* support in place updates
(in spite of it's name)
** i renamed the fields used by the test to try and be more clear what was going on

{panel}

{panel:title=TestRecovery}
* previous feedback...{quote}
* why is {{// assert that in-place update is retained}} at teh end of the test, instead of
when the (first) log replay happens and \*:\* -> numFound=3 is asserted? (ie: before "A2"
is added) {color:green}FIXED{color}
* what about mixing in-place updates with deletes (both by id and query?) {color:green}FIXED{color}
{quote}
* neither of those responses made any sense to me.
** RE: {{// assert that in-place update is retained}}
*** the latest patch included an additional copy of the same assert -- but it was being checked
after several other updates
*** I couldn't see any reason why that same assert (that the in-place update is retained)
shouldn't happen as soon as recovery is finished, before doc "A2" was added?
**** So I added this -- and everything seems to be working fine.
** RE: mixing in-place updates with deletes
*** i don't understand how this is "FIXED" ... there's been no additions to this test since
the last patch that relate to deletes at all ???
*** as far as i can tell, there is still no tests actually verifying that log replay will
do the correct when deletes are intermixed with in-place updates, ie...
**** DBQs against a field that had in-place updates
**** a delete by id of a document inbetween two "atomic updates" of that document (to ensure
the replay recognizes that the second update can't be done in-place)
**** etc...

{panel}

{panel:title=UpdateLogTest}

* lifecycle of requests
** in my last feedback, i mentioned that there are a lot of LocalSolrQueryRequest that were
created and never closed, and said...{quote}
the convinience methods that build up UpdateCommands (ulogCommit & getDeleteUpdate &
getAddUpdate) sould be refactored to take in a SolrQueryRequest which should be closed when
the logical bit of code related to testing that UpdateCommand is finished. something like...
{code}
try (SolrQueryRequest req = req()) {
  ulog.add(getAddUpdate(req, ...));
  // TODO: anything else that needs to be part of this same request for test purposes?
}
// TODO ... assert stuff about ulog
...
{code}
{quote}
** in your updated patch, the SolrQueryRequest's are getting auto-closed -- but that's happening
in the static helper methods (ie: {{getAddUpdate}}, {{getDeleteUpdate}}) that then return
an UpdateCommand which still refrences & uses the SolrRequest.  So the lifecycle of the
request is still violated, just in a diff way
*** now instead of not closing them, we're closing them before we're done using them
** any logic that's being tested that would normally be pat of the request -- like adding
an UpdateCommand to the ulog -- needs to happen before the SolrQueryRequest is closed, or
the test isn't valid -- it could be hiding bugs that exist in the normal lifecycle (ie: related
to re-opened searchers, etc...)
** based on the current code, the easiest fix maybe...
*** rename {{getAddUpdate}}, {{getDeleteUpdate}} --> {{applyAddUpdate}}, {{applyDeleteUpdate}}
*** make both methods take in an {{UpdateLog}} param and call the appropirate {{ulog.foo(...)}}
method inside their try blocks
*** change {{applyDeleteUpdate}} to return void
*** change {{applyAddUpdate}} to return {{cmd.getIndexedId()}} so it can be used in subsequent
{{ulog.lookup(...)}} calls for assertions about the state of the ulog
** but i'm not certain, because i think these static methods are also used in other tests?
*** so perhaps my original suggestion is still the only valid one: caller needs to pass in
the SolrQueryRequest ?
** grep test for "nocommit: req lifecycle bug" 

* regarding using expectThrows:{quote}
{color:green}Couldn't use expectThrows as suggested, since it required variables like prevPointer,
prevVersion, partialDoc etc. to be final variables, and changing them to such would make things
ugly. Maybe I missed something?{color}
{quote}
** i'm not sure what you tried, but variables used in lambda bodies must be *effectively*
final, and in this case all of the variables you're refering to are effectively final at the
point where we need to be using expectThrows.
*** as a general rule of thumb: if you want to use {{expectThrows}} and the variables you
need to assert things about aren't effectively final, that's a code smell that you should
consider breaking up your unit test into more discreete test methods.
*** perhaps this is exactly the situation you ran into? you tried using expectThrows before
splitting up the method and had difficulties, but now that the test methods are more discrete
I was able to switch it to use expectThrows() no problem.

* regarding DBQ testing: {quote}
what about testing DBQs in the ulog and how they affect partial updates? {color:green}Any
DBQ causes ulog to be cleared after it (irrespective of in-place updates, irrespective of
whether any documents were deleted or not). Only that effect can be tested, and I've added
a test for it.{color}{quote}
** i ment doing DBQs against a field/value that has in-place updates already in the tlog,
and verifing that subsequent ulog.lookup calls do/don't match on that docId depending on wether
the DBQ matched the old/new value after the in-place update.
*** the test you added (testApplyPartialUpdatesWithDBQ) just does a DBQ against the uniqueKey
field -- in spite of it's name, there isn't a single partial update anywhere in the test!
** if there's really nothing we can do with an UpdateLog after adding *any* DBQ to it (ie:
all lookups will fail), then that's not at all obvious from this test -- instead:
*** if the behavior of a DBQ is completley independent of any in-place updates the method
should be renamed
*** the asserts should make it clear the ulog has been cleared -- add more docs first and
then check that all lookups fail, verify directly that a new RT searcher really has been opened
(compare the currently used RT searcher to an instance from before the DBQ?), etc...

{panel}



{panel:title=TestInPlaceUpdatesStandalone}

* some previous feedback i gave on one particular test...{quote}
please use expectThrows here, and actually validate that the Exception we get is a SolrException
with code=409 {color:green}FIXED: checked message contains "conflict"{color}
{quote}
** that's a really weak assertion considering it's more correct, less brittle, and less code,
to just use something like {{assertEquals(409, exception.code())}} like i suggested.
** I went ahead and audited the all the new test code to ensure all of the expectThrows usage
includes an assert that it has the expected error code (400, 409, 500, etc...) in addition
to what ever existing message assertions make sense.


* getFieldValueRTG
** this method seems to have a lot of complexity, and requires a lot of complexity in all
of the caller code, to work around the fact that it's using really low level access and dealing
directly with the StoredField/NumericDocValuesField/LegacyField type objects that exist in
the SolrInputDocuments that exist in the tlog
** it seems like it would be _MASSIVELY_ simpler to just initialize a static SolrCLient in
the \@BeforeClass method along the lines of: {{solrClient = new EmbeddedSolrServer(h.getCoreContainer(),
h.coreName);}} and then just use {{SolrClient.getById}} when we need values from an RTG
*** grep other solr/core tests for {{EmbeddedSolrServer}} to see examples of this

* testReplay
** i previously gave the following feedback...{quote}
testReplay3, testReplay4, testReplay6, testReplay7
* these should have better names {color:green}Fixed{color}
{quote}
** instead of giving the methods descriptive names making it clear what they tested, you seem
to have combined them into one big generic {{testReplay}} method
** that's pretty much the opposite of what i suggested
** if {{testReplay}} failed, it won't be particularly clear what aspect of the functinality
encoutered a problem w/o reading through all of the steps, and if it fails early in the test
(ie: the first {{checkReplay}} call) we won't have any idea if only that particularly sequence
failed, or if the other sequences futher down in the test would also fail
** that's the main value add in having discrete, independent, tests: being able to see when
more then one of them fail as a result of some change.
*** {{testReplay}} as written was weaker then the 4 previous individual tests -- even if they
didn't have particularly distinctive names.
*** at least before, if something caused testReplay3 and testReplay6 to fail, but testReplay4
and testReplay7 still passed, that would have given us a starting point for investigating
what those methods had in common, and how they differed.
** I went ahead and re-split this into 4 distinct test methods

* previous comment...{quote}
I'd still like to see some 100% randomized testing using checkReplay.  It's a really powerful
helper method -- we should be taking full advantage of it. {color:green}TODO: Can be added
later{color}
{quote}
** considering how many weird edge case bugs we've found (both in the new solr code and in
the existing IndexWriter level DB updating code) over the lifespan of this jira, that required
very particular and specific circumnstances to manifest, I'm in no way comfortable committing
this patch as is with the level of testing it has currently
** that's why i keep saying we need more (solid) randomized & reproducible (ie: not cloud
based where node routing might introduce subtle non-reproducibility) testing, and {{checkReplay}}
is the starting point of doing that -- it already takes care of the hard work, we just need
some scaffolding to generate a random sequence of updates to pass to it, and a loop to do
lots of iterations with diff random sequences.

{panel}

{panel:title=TestStressInPlaceUpdates}

* some previously feedback & responses...
** {quote}"operations" comment is bogus (it's not just for queries) {color:green}FIXED{color}
{quote}
*** how was this fixed? the comment on the operations variable in the last patch still had
the exact same comment indicating it's a number of queries -- i corrected it
** {quote}
I'm not convinced the "{{synchronize \{...\}; commit stuff; syncrhonize \{ ... \};}}" sequence
is actually thread safe...
* T-W1: commit sync block 1: newCommittedModel = copy(model), version = snapshotCount++;
* T-W2: updates a doc and adds it to model
* T-W1: commit
* T-W1: commit sync block 2: committedModel = newCommittedModel
* T-R3: read sync block: get info from committedModel
* T-R3: query for doc
* ...
{color:green}Copied the logic straight from TestStressReorder. Maybe we can revisit both together,
later, in light of your concerns?{color}
{quote}
*** broken code is broken code.  It's bad enough it exists at all, it shouldn't be copied
if we know it's broken.
*** I've described in detail a specific example of how the synchronization approach in the
code is invalid acording to my understanding of the code and the docs for the relevant methods
-- if you can explain to me some way(s) in which my example is inaccurate, or involves some
missunderstanding of the relevant documentation then great! -- we can leave the code the way
it is.  Otherwise it's broken and needs to be fixed.
*** there's no point in adding (more!) tests to Solr that we know for a fact are unreliable
as designed.
** {quote}
having at least one pass over the model checking every doc at the end of the test seems like
a good idea no matter what {color:green}FIXED{color}
{quote}
*** how is this fixed?
*** the very last thing in the {{stressTest}} method is still the loop that joins on all the
threads.
** {quote}I'm certain the existing "synchronized (model)" block is not thread safe relative
to the synchronized blocks that copy the model into commitedModel, because the "model.put(...)"
calls can change the iterator and trigger a ConcurrentModificationException {color:green}Never
saw a CME occur in thousands of rounds of this test. I think if a CME were to be thrown ever,
it would've been quite frequent.{color}
{quote}
*** FWIW: Just because you never saw a CME in your testing doesn't make the test valid --
it could be more likely to trigger on diff OS/architectures
**** You can't *disprove* a concurrency bug in code just by saying "i ran it a bunch and never
saw a problem"
**** Hell: I could write a test that does {{assertTrue(0 != random().nextInt()}} and run it
for thousands of rounds w/o ever getting a failure -- but that doesn't make the test correct.
*** That said: I reviewed this particular concern again just to be sure and I'm pretty sure
my initial concern was invalid: I'm guessing i didn't notice before that {{model}} is a {{ConcurrentHashMap}}
so {{HashMap}}'s copy constructor will get a weakly consistent iterator and never throw CME.
**** I added a comment making this obvious to the next person to make the same mistake i did.

{panel}


{panel:title=TestInPlaceUpdatesDistrib}

* starting with a random index to help test edge cases
** a key piece of feedback i gave about this test before was:{quote}
* even if we're only going to test -one- _a few_ doc, we should ensure there are a random
num docs in the index (some before the doc we're editing, and some after) {color:green}FIXED{color}
* _2 docs before/after is not a random number ... random means random: we need to test edge
cases of first docid in index, last docid in index, first/last docid in segment, etc..._
{quote}
** I mentioned this about multiple test methods, including outOfOrderUpdatesIndividualReplicaTest
** this is how outOfOrderUpdatesIndividualReplicaTest started in the last patch...{code}
  private void outOfOrderUpdatesIndividualReplicaTest() throws Exception {
    del("*:*");
    commit();

    index("id", 0, "title_s", "title0", "id_i", 0);
    commit();

    float ratings = 1;

    // update doc, set
    index("id", 0, "ratings", map("set", ratings));
...
{code}
*** the index contained exactly one document: id=0
*** that document started w/o a value in the field we want to update in place
*** then we do an inplace update the document to set a value on it, and then "..." more testing
of id=0
** this is what the start of that method looks like in the current patch...{code}
  private void outOfOrderUpdatesIndividualReplicaTest() throws Exception {
    del("*:*");
    commit();

    index("id", 0, "title_s", "title0", "id_i", 0);
    commit();

    float inplace_updatable_float = 1;

    // Adding random number of docs before adding the doc to be tested
    int numDocsBefore = random().nextInt(1000);
    for (int i=0; i<numDocsBefore; i++) {
      index("id", 1000 + i, "title_s", "title" + (1000+i), "id_i", 1000 + i);
    }

    // update doc, set
    index("id", 0, "inplace_updatable_float", map("set", inplace_updatable_float));

    // Adding random number of docs after adding the doc to be tested
    int numDocsAfter = random().nextInt(1000);
    for (int i=0; i<numDocsAfter; i++) {
      index("id", 5000 + i, "title_s", "title" + (5000+i), "id_i", 5000 + i);
    }
...
{code}
*** the test still starts out by adding doc id=0 first before any other docs
*** that means that we're still only realy testing a in-place update of the first document
in the index
*** we're only getting marginal benefit out of the "randomzied" index -- namely that the size
of the index is random. but the doc getting updated in-place is always the *first* doc in
the index.  we're never testing an in-place update of a doc in the second segment, or the
last doc in a segment, or the last doc in an index, etc...
** as written, this basically defeats the entire point of why we should use a random index
** that said: at least it gives us *some* randomization
** most other test methods in this class still don't even have that
** ensureRtgWorksWithPartialUpdatesTest does seem to have good initial initial index setup
code:
*** adding a random number of docs
*** then add a "special" doc to be tested (w/o the field to be updated inplace)
*** then add some more random documents
** why not refactor that index creation code from ensureRtgWorksWithPartialUpdatesTest into
a helper method and use it in all of the other test methods?
*** these tests all still uses an index containing exactly 1 doc and should also have a random
index:
**** outOfOrderDBQsTest 
**** outOfOrderDeleteUpdatesIndividualReplicaTest
**** reorderedDBQsWithInPlaceUpdatesShouldNotThrowReplicaInLIRTest
**** delayedReorderingFetchesMissingUpdateFromLeaderTest
*** docValuesUpdateTest could use the same index building helper method even if doesn't follow
the same pattern of only testing a single "special" doc

* ExecutorService
** I feel like i've mentioned this every time i've given feedback: {quote}
* if we are going to use an ExecutorService, then the result of awaitTermination has to be
checked {color:green}FIXED{color}
* ... and shutdown & awaitTermination have to be called in that order
{quote}
** the latest patch had 2 places where threadpools were shutdown & awaitTermination correctly
-- but there were still 4 more that were broken.
** I went ahead and fixed them.


* docValuesUpdateTest
** knowing when all the updates have propogated...
*** I had previously (twice) made the following suggestion: {quote}
* In general, this seems like a convoluted way to try and kill two birds with one stone: 1)
make sure all replicas have opened searchers with the new docs; 2) give us something to compare
to later to ensure the update was truely in place
** i really think the first problem should be addressed the way i suggested previously:{color:green}Done{color}
bq. * if initially no docs have a rating value, then make the (first) test query be for {{rating:\[\*
TO \*\]}} and execute it in a rety loop until the numFound matches numDocs.{color:green}Done{color}
bq. * likewise if we ensure all ratings have a value such that abs(ratings) < X, then the
second update can use an increment such that abs(inc) > X\*3 and we can use {{-ratings:\[-X
TO X\]}} as the query in a retry loop {color:green}Done{color}
** the second problem should be solved by either using the segments API, or by checking the
docids on _every_ replica (w/o any retries) ... _after_ independently verifying the searcher
has been re-opened.
{quote}
** in the current patch you updated the {{// Keep querying until the commit is distributed
and expected number of results are obtained}} code to use a range query like i suggested,
but: that code is useless because it assumes that those docs won't match the query until the
updates propogate, which isn't true the way the test is currently written...
*** the index starts off with every document having a value in the field being queried against,
so even if none of the in-place updates happen, the query will still succeed.
*** either all of the docs need to have no-values in the field before the in-place updates
are attempted, or the test query needs to be tightened to only match the docs if they contain
the updated values
**** example: currently all docs start with a value of {{101.0}} and the in-place updates
set every doc to a value <= {{5.0}}, so we can make the verification of the first update
be {{"inplace_updatable_float:\[\* TO 5.0\]}}
** The verification of the second update looks  seems mostly ok, but it seems a little more
complicated then it needs to be and i think there is at least one subtle bug...
*** {{Math.max(Collections.max(valuesList), Math.abs(Collections.min(valuesList)));}} ?
**** {{Collections.max(valuesList)}} might also be negative, we need to take the {{abs}} of
both before values before deciding which is bigger.
*** in general it seems overly complicated to worry about determining X dynamically...
**** the code that does the original update ({{valuesList.add(r.nextFloat()*5.0f);}}) ensures
that the values are all between 0 and 5.0 -- let's assume we tweak that to randomly use -5.0
sometimes so our range is -5.0 to 5.0
**** the code that does the second update can just use {{inc = atLeast(10) * (r.nextBoolean()
? 1 : -1);}}
**** then the code verifying the second update propogates can just be {{"+inplace_updatable_float:\[\*
TO \*\] -inplace_updatable_float:\[-15 TO 15\]"}}
** in both cases however, only executing the query against {{clientToTest}} is a bad idea.
 We need to be running the verification against *ALL* the replicas, otherwise we can get false
positives from the subsequent validation (ie: the thread scheduling might result in {{clientToTest}}
being completely finished with all the updates and the commits, while some other replica is
still working on it and hasn't committed yet when we do our validation queries.
*** there's also the problem that the {{clients}} list includes the LEADER, so 1/3 of the
time these queries aren't even hitting a replica and are useless.
** the second part of my suggestion doesn't seem to have been implemented, nor did it get
any comment:{quote}
* the second problem should be solved by either using the segments API, or by checking the
docids on _every_ replica (w/o any retries) ... _after_ independently verifying the searcher
has been re-opened.
{quote}
*** ditto the problem of picking a random {{clients}} and 1/3 of the time geting hte leader
making the {{matchResults()}} call useless.
*** even if we switch the randomization to use {{NONLEADERS}} there's no good reason to be
comparing with only one replica -- looping over {{NONLEADERS}} is at worst "not harder" then
picking a random replica, and makes the test twice as likely to catch any potential bugs that
might cause a replica to get out of sync

* getInternalDocIds
** this looks like it exists because of this prior feedback for ensureRtgWorksWithPartialUpdatesTest...{quote}
checking {{fl=\[docid\]}} in cloud RTG is currently broken (SOLR-9289) but we can/should be
checking via the segments API anyway (if we have a general helper method for comparing the
segments API responses of multiple replicas betwen multiple calls, it could be re-used in
every test in this class){color:green}Done by opening a new RT searcher and getting the docid
for it{color}
{quote}
** but SOLR-9289 has been fixed since July ... there's no reason to jump through hoops like
this instead of just doing an RTG request with {{fl=\[docid\]}} ... it just makes the test
more brittle to refactoring (into a true cloud test) later.

* ensureRtgWorksWithPartialUpdatesTest
** as mentioned breviously: comparing results from {{LEADER}} with results from a random {{clients}}
is useless 1/3 of the time since that list contains the {{LEADER}}
*** the test will be a lot stronger and no longer if instead it loops over the clients and
does a getById & asserting the results inside the loop.
** see comments above regarding getInternalDocIds - rather then fixing that method it might
be easier to just do direct {{getById}} calls here
*** already doing {{LEADER.getById("100");}} -- just add {{fl=\[docid\]}} to those and add
some new similar {{getById}} calls to the replicas to compare with the existing {{getById}}
validation calls to them as well 
*** although: shouldn't *ALL* the {{getById}} calls in this method be replaced with {{getReplicaValue}}
or {{assertReplicaValue}} calls ... isn't that what those methods are for? don't they ensure
that RTG calls to replicas aren't internally forwarded to the LEADER?


* addDocAndGetVersion...{quote}
{{synchronized (cloudClient)}}
* why are we synchronized on cloud client but updating via LEADER?
* why are we synchronized at all?
{color:green}Fixed{color}
{quote}
** Nothing was fixed about this -- it's still synchronizing on cloudClient .. does it need
sync at all?????

{panel}


I plan to keep working on this issue for at least a few more days -- starting with some of
the ranodmized test improvements I mentioned (above) being really concerned with.



> Support updates of numeric DocValues
> ------------------------------------
>
>                 Key: SOLR-5944
>                 URL: https://issues.apache.org/jira/browse/SOLR-5944
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Shalin Shekhar Mangar
>         Attachments: DUP.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch, SOLR-5944.patch,
SOLR-5944.patch, SOLR-5944.patch, TestStressInPlaceUpdates.eb044ac71.beast-167-failure.stdout.txt,
TestStressInPlaceUpdates.eb044ac71.beast-587-failure.stdout.txt, TestStressInPlaceUpdates.eb044ac71.failures.tar.gz,
defensive-checks.log.gz, hoss.62D328FA1DEA57FD.fail.txt, hoss.62D328FA1DEA57FD.fail2.txt,
hoss.62D328FA1DEA57FD.fail3.txt, hoss.D768DD9443A98DC.fail.txt, hoss.D768DD9443A98DC.pass.txt
>
>
> LUCENE-5189 introduced support for updates to numeric docvalues. It would be really nice
to have Solr support this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message