lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Belskis <>
Subject RE: best practice for indexing multiple equiv fieldnames
Date Mon, 06 May 2002 08:47:42 GMT
Dude, Landon-

How are you doing?  To the novice question I have what might be a novice 
answer... but hope it helps.

I don't think that the "Lucene documents" you create and add to the index 
need to have the same structure as the "XML documents" you read.  Instead 
of creating one Lucene document for each XML document, perhaps things will 
be easier for you if you create multiple Lucene documents for each XML file 
you parse (one Lucene document for each block).


Alexander Belskis

SchlumbergerSema - International Telematics Applications
Biotechnologies & Healthcare
c/Albasanz, 12 - 28037 Madrid (Spain)
Tel. (+34) 91 440 8800 (Ext. 7629)

-----Mensaje original-----
De:	Landon Cox []
Enviado el:	miercoles, 01 de mayo de 2002 1:52
Para:	Lucene Users List
Asunto:	best practice for indexing multiple equiv fieldnames

>I'm planning to use Lucene to index scads of XML files whose data model
>includes replicated blocks of tags.  Translation: a novice question 
>My files have a common XML pattern (for illustrative purposes):
>   <block id="123">some text 1</block>
>   <block id="456">some text 2</block>
>   <block id="789">some text 3</block>
>Each block has a unique id, but the tagname is identical.  The actual data
>model has nested tags within these blocks - ie: metadata with the same
>tagnames within each block.  So, in the real data model, there are 
>identical tagnames that are associated with a specific parent.  Something
>more like this:
>   <block id="123">
>      <author>Joe Blow</author>
>      <job>hack</job>
>   </block>
>   <block id="456">
>      <author>Jane Doe</author>
>      <job>President</job>
>   </block>
>In latter case, I need to be able to search by author or job, for example,
>and get the tag's text contents as well as the parent block id.
>Adding a field name of "block" or "author" or "job" multiple times to the
>same Lucene Document, according to the Lucene javadoc, has the effect of
>appending the text for search purposes.  I take that to mean, in order to
>use a 'hit' I would need to somehow uniquely identify the field from which
>the content came even though the content was appended for search purposes.
>If I searched an 'author' field name and got a hit, I would not be able to
>disambiguate which block id the actual hit belonged to.  Or if I searched 
>"job", how would I know a hit belonged to block id 456 instead of block id
>123 parent?
>What is the Lucene approach for indexing a single document that has the 
>field name appearing in multiple places and then using the hit to find the
>exact association of block id in the above example?
>Hope this question makes sense.  I'm sure I'm missing something
>obvious/simple in how the API would work in this case.  Thanks,
>Landon Cox
This email is confidential and intended solely for the use of the individual to whom it is
addressed. Any views or opinions presented are solely those of the author and do not necessarily
represent those of SchlumbergerSema. 
If you are not the intended recipient, be advised that you have received this email in error
and that any use, dissemination, forwarding, printing, or copying of this email is strictly

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message