But then again i think i misunderstood the problem the second time...
On Mon, Apr 25, 2011 at 2:59 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> oh never mind.
>
> that's trivial. As Ted mentioned, i perhaps by mistake assumed the
> problem is to find most frequently used queries.
>
> if he just needs top N with the highest similarity score...well...
> that's kind of a problem i am solving for LSI over hbase right now...
> I don't want to disclose exactly how, or Ted will say that's not the
> way :) But, there are definitely ways to organize the vector space
> model to find N closest without scanning the entire vector set.
>
> d
>
>
>
> On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
>> On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>>> if you "forget" some of the items you saw along with the counters,
>>> how'd you add them back to the heap?
>>>
>>
>> What do you mean? This is the common search problem: you're iterating over
>> a list of int, double pairs (int is "docId", double is the "score"), and you
>> want to
>> keep the topK out of all N of them. Keep a sizeK heap of docId/score
>> pairs,
>> and as you iterate, check to see if your current one beats your current K'th
>> best
>> result  if not, you know it will *never* be in the overall topK, and it can
>> be discarded
>> (O(1) operation, this happens most of the time if K << n, and the list is
>> randomly ordered). If it beats the bottom of the heap, insert it (O(log(K)
>> operation),
>> dropping the old bottom of the heap.
>>
>> This is O(n + k*log(k)) for a randomly sorted list.
>>
>> jake
>>
>
