Doug Cutting wrote:
> Andrzej Bialecki wrote:
>
>> Doug Cutting wrote:
>>
>>> The graph just shows that they differ, not how much better or worse
>>> they are, since the baseline is not perfect. When the top-10 is 50%
>>> different, are those 5 different hits markedly worse matches to your
>>> eye than the five they've displaced, or are they comparable? That's
>>> what really matters.
>>
>>
>> Hmm. I'm not sure I agree with this. Your reasoning would be true if
>> we were changing the ranking formula. But the goal IMHO with these
>> patches is to return equally complete results, using the same ranking
>> formula.
>
>
> But we should not assume that the ranking formula is perfect. Imagine
> a case where the high-order bits of the score are correct and the
> low-order bits are random. Then an optimization which changes local
> orderings does not actually affect result quality.
Yes, that's true, I could accept that. In these tests the score delta
was something like 20 (between hit #1 and hit #100), and the scores
fell off rapidly after the first 10 or 20 results. Now, the problem is
that many results _within_ this range (i.e. still within the area with
large score deltas) were missing. And this suggests that the
differences were also in the high-order bits.

Please re-run the script on your index, using typical queries, and
check the results. It's possible that I made a mistake somewhere; it
would be good to confirm at least the trends in the raw results.
>
>> I specifically avoided using normalized scores, instead using the
>> absolute scores in TopDocs. And the absolute scores in both cases are
>> exactly the same, for those results that are present.
>>
>> What is wrong is that some results that should be there (judging by
>> the ranking) are simply missing. So, it's about the recall, and the
>> baseline index gives the best estimate.
>
>
> Yes, this optimization, by definition, hurts recall. The only
> question is does it substantially hurt relevance at, e.g., 10 hits.
> If the top-10 are identical then the answer is easy: no, it does not.
> But if they differ, we can only answer this by looking at results.
> Chances are they're worse, but how much? Radically? Slightly?
> Noticeably?
The paper by Suel et al. that you referred to claims top-100 overlap as
high as 98% after optimizations. What I observed were values between 0
and 60%, and pushing the overlap above that level caused a heavy
performance loss.
>
>>> What part of Nutch are you trying to avoid? Perhaps you could try
>>> measuring your Lucene-only benchmark against a Nutch-based one. If
>>> they don't differ markedly then you can simply use Nutch, which
>>> makes it a stronger benchmark. If they differ, then we should
>>> figure out why.
>>
>>
>> Again, I don't see it this way. Nutch results will always be worse
>> than pure Lucene, because of the added layers. If I can't improve the
>> performance in the Lucene code (which takes > 85% of the time for every
>> query), then no matter how well optimized the Nutch code is, it won't
>> get far.
>
>
> But we're mostly modifying Nutch's use of Lucene, not modifying
> Lucene. So measuring Lucene alone won't tell you everything, and
> you'll keep having to port Nutch stuff. If you want to, e.g., replay
> a large query log to measure average performance, then you'll need
> things like auto-filterization, n-grams, query plugins, etc., no?
Perhaps we misunderstood each other - I'm using an index built by Nutch,
there's no substitute for that, I agree. It was just more convenient for
me to skip all the Nutch classes for _querying_ alone, because it was
easier to control the exact final form of the Lucene query - especially
if you want to experiment quickly with a lot of variables that are not
(yet) parametrized through the config files. In the end you get a plain
Lucene query anyway, only then you don't know exactly how much time was
spent on translating it, building filters, etc. *shrug* You can do it
either way, I agree.
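Just to show what the querying part boils down to, a stripped-down
sketch (newer Lucene API; the index path, default field, and analyzer
are placeholders - the real Nutch index has its own fields and
analyzers):

  import java.nio.file.Paths;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.FSDirectory;

  // Sketch: run an already-translated Lucene query straight against the
  // Nutch-built index, so only the raw Lucene search time is measured.
  public class RawQuery {
    public static void main(String[] args) throws Exception {
      String indexDir = args[0];     // path to the Nutch-built Lucene index
      String queryString = args[1];  // the final, already-translated query

      DirectoryReader reader =
          DirectoryReader.open(FSDirectory.open(Paths.get(indexDir)));
      IndexSearcher searcher = new IndexSearcher(reader);

      // "content" as the default field is just an example.
      Query q = new QueryParser("content", new StandardAnalyzer()).parse(queryString);

      long start = System.currentTimeMillis();
      TopDocs top = searcher.search(q, 100);
      long elapsed = System.currentTimeMillis() - start;

      System.out.println(top.totalHits + " total, " + elapsed + " ms, query: " + q);
      reader.close();
    }
  }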
>
>>>> In several installations I use smaller values of slop (around
>>>> 20-40). But this is motivated by better quality matches, not by
>>>> performance, so I didn't test for this...
>>>
>>>
>>> But that's a great reason to test for it! If lower slop can improve
>>> result quality, then we should certainly see if it also makes
>>> optimizations easier.
>>
>>
>> I forgot to mention this - the tests I ran already used the smaller
>> values: the slop was set to 20.
>
>
> Are they different if the slop is Integer.MAX_VALUE? It would be
> really good to determine what causes results to diverge, whether it is
> multiple terms (probably not), phrases (probably), and/or slop
> (perhaps). Chances are that the divergence is bad, that results are
> adversely affected, and that we need to try to fix it. But to do so
> we'll need to understand it.
Agreed. I'll try to re-run the tests with queries that set a different
slop value, or omit the phrases completely (and it's quite easy to do
this with my approach, just use a different translated query on the
cmd-line ;-) ).
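For illustration, the phrase part of the translated query is just a
sloppy PhraseQuery, so varying the slop or dropping the phrase entirely
is a one-liner (sketch with the newer builder API; the field name is
only an example):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.PhraseQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;

  // Sketch: same terms, but the sloppy phrase clause is optional and its
  // slop is a parameter - Integer.MAX_VALUE, 20, or no phrase at all.
  public class QueryVariants {

    static Query build(String field, String[] terms, int slop, boolean usePhrase) {
      BooleanQuery.Builder bq = new BooleanQuery.Builder();
      for (String t : terms) {
        bq.add(new TermQuery(new Term(field, t)), BooleanClause.Occur.MUST);
      }
      if (usePhrase) {
        PhraseQuery.Builder pq = new PhraseQuery.Builder();
        pq.setSlop(slop);                           // e.g. 20 vs. Integer.MAX_VALUE
        for (String t : terms) {
          pq.add(new Term(field, t));
        }
        bq.add(pq.build(), BooleanClause.Occur.SHOULD);  // phrase only boosts the score
      }
      return bq.build();
    }
  }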
>
>> That's another advantage of using Lucene directly in this script -
>> you can provide any query structure on the command-line without
>> changing the code in Nutch.
>
>
> But that just means that we should set the SLOP constant in
> BasicQueryFilter.java from a configuration property, and permit the
> setting of configuration properties from the command line, no?
Well, if you want to quickly experiment with radically different query
translation, then no.
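For the simple cases it would help, of course - something along these
lines, where the property name is purely hypothetical and
Integer.getInteger just stands in for a Nutch config lookup:

  // Sketch of the config-driven variant: instead of a hard-coded SLOP
  // constant, read it from a property, e.g. -Dquery.phrase.slop=20.
  public class SlopConfig {
    static final int SLOP = Integer.getInteger("query.phrase.slop", Integer.MAX_VALUE);
  }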
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com