lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <>
Subject Re: get wordno, lineno, pageno for term/phrase
Date Wed, 04 Aug 2010 13:25:48 GMT
It depends (TM). Yes, it would bloat the index. But nothing in the original
post indicates
that this is a concern. The index could be 10M or 100G, in one case it
matters a lot
and in the other it doesn't. It's also unclear whether query response time
at all or whether this is some sort of batch process that can run overnight
(or whatever).

One could also store a very special field per document that contained all
the meta-data
one could care about. For instance, the offset of each line, page,
paragraph, etc. That,
combined with the offset data for the word, which is available via the span
could be what's needed.

re-scanning the input stream has it's own costs as well, but perhaps they
are the
preferable ones, it all depends on the use-case.

It seems like it's always a space/speed tradeoff......


On Wed, Aug 4, 2010 at 4:41 AM, Itamar Syn-Hershko <>wrote:

> Storing all that info per-token as payloads will bloat the index. Wouldn't
> it be wiser to use a special token to mark page feed and end of paragraph
> (numbers of which could be then stored as payloads), and scan the token
> stream per document to retrieve them back? some extra operations for
> retrieval, but much smaller index...
> Itamar.
> On 3/8/2010 11:54 PM, Erick Erickson wrote:
>> No, you can't do this with any existing analyzers I know of. Part
>> of the problem here is that there's no good generic way to KNOW
>> what a page and line are.
>> Have you investigated payloads? I'm not sure that's a good fit for
>> this particular problem, but it might be worth investigating.
>> Best
>> Erick
>> On Tue, Aug 3, 2010 at 10:58 AM, arun r<>  wrote:
>>> hi all,
>>>            I am new to Lucene. I am trying to use Lucene to generate
>>> data for a document classifier. I need to generate wordno, lineno,
>>> pageno for each term/phrase. I was able to use SpanQuery/SpanNearQuery
>>> to get the wordno (span.start()) for the term/phrase. To get pageno
>>> and lineno, a custom Analyzer needs to be written ? Can the Analyzer
>>> be made to recognize and newline and page feed characters and keep
>>> track of lineno and pageno for the tokens ?
>>> Is it possible with existing Lucene Analyzer ?
>>> Thanks,
>>> Arun
>>> --
>>> Where there is a will, there is a way !
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message