lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: TestUTF32ToUTF8.testRandomRegexes fails
Date Tue, 27 Jul 2010 08:53:26 GMT
As reported on the issue, the patch solves the problem.

However, I was wondering whether that doesn't expose a bug in
CharacterRunAutomaton -- it handles characters that the JVM ignores when
dealing w/ the string (at least when converting them to bytes). Is that ok?
Shouldn't we check somewhere that that character should be handled at all?
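
For illustration, such a check could look like the sketch below. The class
and method names are hypothetical and not part of Lucene; the sketch only
assumes the standard java.lang.Character surrogate tests. It uses
StandardCharsets (Java 7+) rather than getBytes("UTF-8") to avoid the
checked exception.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: detects unpaired UTF-16 surrogates.
public class UnpairedSurrogateCheck {

    /**
     * Returns true if s contains a high surrogate that is not followed by a
     * low surrogate, or a low surrogate that is not preceded by a high one.
     */
    public static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the low surrogate
                } else {
                    return true; // high surrogate with no partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The string from the thread: 'l', unpaired high surrogate U+D9FF, 'E'.
        String s = "\u006C\uD9FF\u0045";
        System.out.println("unpaired surrogate: " + hasUnpairedSurrogate(s));
        // The encoded bytes for the unpaired surrogate differ between JVMs,
        // as reported in this thread, so no expected output is given here.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A test could call hasUnpairedSurrogate on each generated string and skip or
reject the ones that are not valid UTF-16, instead of relying on how a given
JVM encodes them.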

Shai

On Tue, Jul 27, 2010 at 12:41 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Shai can you try the patch on LUCENE-2568?  Thanks.
>
> Mike
>
> On Mon, Jul 26, 2010 at 4:25 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
> > OK I think likely this is a bug in RAS.  And we are just seeing the
> > difference in how Oracle's & IBM's JREs handle an unpaired
> > surrogate...
> >
> > Lemme work out a patch...
> >
> > Mike
> >
> > On Mon, Jul 26, 2010 at 4:13 PM, Michael McCandless
> > <lucene@mikemccandless.com> wrote:
> >> Yeah that char is a high surrogate which is unpaired, which is no good
> >> -- it's invalid.  Cool, though, that Google puts us first when you
> >> search on this character :)
> >>
> >> Can you figure out how that bad string was created?  That "if
> >> (random.nextBoolean())" either creates the string randomly (which
> >> should never return unpaired surrogate), or, calls
> >> RandomAcceptedString.getRandomAcceptedString... maybe the bug is in
> >> RAS.
> >>
> >> Mike
> >>
> >> On Mon, Jul 26, 2010 at 3:41 PM, Shai Erera <serera@gmail.com> wrote:
> >>> From here: http://www.fileformat.info/info/unicode/char/d9ff/index.htm
> >>>
> >>> Looks like that character is not a valid Unicode character, and perhaps
> >>> IBM's JVM behaves correctly? Robert - you're the Unicode expert :).
> >>>
> >>> Shai
> >>>
> >>> On Mon, Jul 26, 2010 at 10:40 PM, Shai Erera <serera@gmail.com> wrote:
> >>>>
> >>>> I don't know what was the thing w/ the strings generated before, but now I
> >>>> ran the test again w/ the same seed and it generates the same strings. So at
> >>>> least it seems there are no problems w/ the Random class :).
> >>>>
> >>>> However, the string l.E fails w/ the IBM JVM and succeeds w/ SUN's. Any
> >>>> ideas why? What does the test check anyway?
> >>>>
> >>>> I ran TRR2, and set the regexp to always be "l.E" and the test passes. The
> >>>> failure comes from
> >>>>
> >>>> junit.framework.AssertionFailedError: expected:<true> but was:<false>
> >>>>     at org.apache.lucene.util.automaton.TestUTF32ToUTF8.assertAutomaton(TestUTF32ToUTF8.java:199)
> >>>>     at org.apache.lucene.util.automaton.TestUTF32ToUTF8.testRandomRegexes(TestUTF32ToUTF8.java:171)
> >>>>
> >>>> I've set regexp to "l.E", and also 'string' inside assertAutomaton to
> >>>> "\u006C\uD9FF\u0045". The byte[] returned from string.getBytes("UTF-8") is
> >>>> [108, 69]. It just ignores the middle character. Perhaps that's why the test
> >>>> fails?
> >>>>
> >>>> When I run this w/ SUN's JVM, the bytes returned are [108, 63, 69].
> >>>>
> >>>> If I manually set the bytes, using IBM's, to [108, 63, 69], then the test
> >>>> passes.
> >>>>
> >>>> Interestingly, Googling for \uD9FF brings back LUCENE-2019 as the first
> >>>> result :). I'll dig some more into this character, and why the IBM and SUN
> >>>> JVMs return different byte[] representation for the same sequence of
> >>>> characters. If you already spot the problem, please let me know.
> >>>>
> >>>> BTW, the test calls _TestUtil.getRandomMultiplier on every iteration loop,
> >>>> which goes and checks a system property. Perhaps we can extract it to a
> >>>> variable, or include a static constant in LuceneTestCase(J4) or something?
> >>>>
> >>>> Shai
> >>>>
> >>>>> On Mon, Jul 26, 2010 at 9:22 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >>>>>
> >>>>> maybe there is a bug in ibm's random generator :)
> >>>>>
> >>>>> On Mon, Jul 26, 2010 at 11:50 AM, Michael McCandless
> >>>>> <lucene@mikemccandless.com> wrote:
> >>>>>>
> >>>>>> That's VERY spooky that w/ a fixed seed you see different random
> >>>>>> regexps being made.
> >>>>>>
> >>>>>> Mike
> >>>>>>
> >>>>>> On Mon, Jul 26, 2010 at 11:40 AM, Shai Erera <serera@gmail.com> wrote:
> >>>>>> > Ok I've dug deeper into the test. I set the random seed to
> >>>>>> > -9029631602016965389L in setUp(), and discovered that on the 4th
> >>>>>> > iteration it breaks. For some reason though,
> >>>>>> > AutomatonTestUtil.randomRegex generates different strings every time
> >>>>>> > I run the test, even though it uses the same Random object w/ the
> >>>>>> > same seed ...
> >>>>>> >
> >>>>>> > Anyway, one of the regex that failed was this "l.E" (w/o the quotes)
> >>>>>> > and I think it's a lowercase L, '.' (dot) and 'E' (uppercase). Hope
> >>>>>> > this helps.
> >>>>>> >
> >>>>>> > Shai
> >>>>>> >
> >>>>>> > On Mon, Jul 26, 2010 at 6:23 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >>>>>> >>
> >>>>>> >> sounds nasty... it's good you are running the tests with this
> >>>>>> >> different jvm...
> >>>>>> >>
> >>>>>> >> On Mon, Jul 26, 2010 at 11:21 AM, Shai Erera <serera@gmail.com>
> >>>>>> >> wrote:
> >>>>>> >>>
> >>>>>> >>> Tried to run it w/ SUN JRE6 and it succeeds ! I've tried several
> >>>>>> >>> times and it succeeds every time. However, when I revert back to
> >>>>>> >>> IBM's, it fails immediately.
> >>>>>> >>>
> >>>>>> >>> I can help w/ the debug, if you give me a hint where to look :).
> >>>>>> >>>
> >>>>>> >>> Shai
> >>>>>> >>>
> >>>>>> >>> On Mon, Jul 26, 2010 at 5:57 PM, Shai Erera <serera@gmail.com>
> >>>>>> >>> wrote:
> >>>>>> >>>>
> >>>>>> >>>> Sorry for the delayed response.
> >>>>>> >>>>
> >>>>>> >>>> I ran it a couple more times, from Eclipse and Ant, and each time
> >>>>>> >>>> it fails (amazing !), w/ different seeds. More seeds that fail:
> >>>>>> >>>> NOTE: random seed of testcase 'testRandomRegexes' was: -4244174191361080127
> >>>>>> >>>> NOTE: random seed of testcase 'testRandomRegexes' was: -7059086272401721644
> >>>>>> >>>> NOTE: random seed of testcase 'testRandomRegexes' was: -1314734215611104147
> >>>>>> >>>>
> >>>>>> >>>> I use IBM JVM, tried w/ both 1.5 and 1.6 ...
> >>>>>> >>>>
> >>>>>> >>>> Mike, can we use LUCENE-2565 to track this, or would you prefer
> >>>>>> >>>> that I open a separate one?
> >>>>>> >>>>
> >>>>>> >>>> Shai
> >>>>>> >>>>
> >>>>>> >>>> On Mon, Jul 26, 2010 at 3:26 PM, Michael McCandless
> >>>>>> >>>> <lucene@mikemccandless.com> wrote:
> >>>>>> >>>>>
> >>>>>> >>>>> On a more general note...
> >>>>>> >>>>>
> >>>>>> >>>>> Any time any of you out there hit an "odd" test failure, please
> >>>>>> >>>>> please please do just what Shai did: take it to the dev list!
> >>>>>> >>>>>
> >>>>>> >>>>> Think of Lucene's unit tests like SETI :)  We are desperately
> >>>>>> >>>>> seeking bugs, and you and your machine may just be lucky enough to
> >>>>>> >>>>> find one... go forth and buy expensive new power hungry computers
> >>>>>> >>>>> just so you can run the random tests over and over, seeking the
> >>>>>> >>>>> bugs!
> >>>>>> >>>>>
> >>>>>> >>>>> But be sure to include that random seed when you do hit a
> >>>>>> >>>>> failure...
> >>>>>> >>>>>
> >>>>>> >>>>> Mike
> >>>>>> >>>>>
> >>>>>> >>>>> On Mon, Jul 26, 2010 at 8:23 AM, Robert Muir <rcmuir@gmail.com>
> >>>>>> >>>>> wrote:
> >>>>>> >>>>> > I agree, Shai can you open a bug? I cannot reproduce, did you
> >>>>>> >>>>> > use an IBM JVM or another environment that might help us figure
> >>>>>> >>>>> > it out?
> >>>>>> >>>>> >
> >>>>>> >>>>> > On Mon, Jul 26, 2010 at 6:29 AM, Michael McCandless
> >>>>>> >>>>> > <lucene@mikemccandless.com> wrote:
> >>>>>> >>>>> >>
> >>>>>> >>>>> >> Hmmm this means a bug is lurking.  This is the power of random
> >>>>>> >>>>> >> testing (that every time we all run tests, we're testing
> >>>>>> >>>>> >> different "paths" through the code)....
> >>>>>> >>>>> >>
> >>>>>> >>>>> >> It seems exceptionally unlikely that LUCENE-2537's changes
> >>>>>> >>>>> >> would cause this!
> >>>>>> >>>>> >>
> >>>>>> >>>>> >> But, unfortunately, when I plug that seed in I don't see it
> >>>>>> >>>>> >> fail, which is odd.  I'll run a stress test to see if I can
> >>>>>> >>>>> >> tickle the bug... can you open a Jira issue so we don't lose
> >>>>>> >>>>> >> track?
> >>>>>> >>>>> >>
> >>>>>> >>>>> >> Mike
> >>>>>> >>>>> >>
> >>>>>> >>>>> >> On Mon, Jul 26, 2010 at 2:57 AM, Shai Erera <serera@gmail.com>
> >>>>>> >>>>> >> wrote:
> >>>>>> >>>>> >> > Hi
> >>>>>> >>>>> >> >
> >>>>>> >>>>> >> > I was running tests on trunk (after merging the changes from
> >>>>>> >>>>> >> > LUCENE-2537) and received this error message:
> >>>>>> >>>>> >> >
> >>>>>> >>>>> >> > expected:<true> but was:<false>
> >>>>>> >>>>> >> >
> >>>>>> >>>>> >> > junit.framework.AssertionFailedError: expected:<true> but was:<false>
> >>>>>> >>>>> >> > at org.apache.lucene.util.automaton.TestUTF32ToUTF8.assertAutomaton(TestUTF32ToUTF8.java:197)
> >>>>>> >>>>> >> > at org.apache.lucene.util.automaton.TestUTF32ToUTF8.testRandomRegexes(TestUTF32ToUTF8.java:170)
> >>>>>> >>>>> >> > at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:285)
> >>>>>> >>>>> >> >
> >>>>>> >>>>> >> > NOTE: random seed of testcase 'testRandomRegexes' was:
> >>>>>> >>>>> >> > 3510820306304573866
> >>>>>> >>>>> >> >
> >>>>>> >>>>> >> > I'm sure it's related to my changes. Has anyone else seen
> >>>>>> >>>>> >> > this before?
> >>>>>> >>>>> >> >
> >>>>>> >>>>> >> > Shai
> >>>>>> >>>>> >> >
> >>>>>> >>>>> >>
> >>>>>> >>>>> >>
> >>>>>> >>>>> >>
> >>>>>> >>>>> >>
> >>>>>> >>>>> >> ---------------------------------------------------------------------
> >>>>>> >>>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>> >>>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>> >>>>> >>
> >>>>>> >>>>> >
> >>>>>> >>>>> >
> >>>>>> >>>>> >
> >>>>>> >>>>> > --
> >>>>>> >>>>> > Robert Muir
> >>>>>> >>>>> > rcmuir@gmail.com
> >>>>>> >>>>> >
> >>>>>> >>>>>
> >>>>>> >>>>>
> >>>>>> >>>>
> >>>>>> >>>
> >>>>>> >>
> >>>>>> >>
> >>>>>> >>
> >>>>>> >> --
> >>>>>> >> Robert Muir
> >>>>>> >> rcmuir@gmail.com
> >>>>>> >
> >>>>>> >
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Robert Muir
> >>>>> rcmuir@gmail.com
> >>>>
> >>>
> >>>
> >>
> >
>
>
>
