lucene-java-user mailing list archives

From Thomas Scheffler <thomas.scheff...@uni-jena.de>
Subject Re: merged search of document
Date Wed, 07 Jan 2004 20:16:16 GMT
On Wed, 07.01.2004 at 20:10, Dror Matalon wrote:
> On Wed, Jan 07, 2004 at 07:58:52PM +0100, Thomas Scheffler wrote:
> > On Wed, 07.01.2004 at 19:00, Dror Matalon wrote:
> > > The solution is simple, but you need to think of it conceptually in a
> > > different way. Instead of "all documents with the same DocID are the same
> > > document" think "fetch all the documents where DocID is XYZ."
> > > 
> > > Assuming the contents are in a field called contents
> > > you do 
> > > +(DocID:XYZ) (contents:foo) (contents:bar)
> > 
> > I was already on that track, but think of a search like (foo -bar). With
> > your solution it will produce a hit, because page 345 (to keep my
> > example) contains the word "foo" and no "bar". Of course, with my model I
> > want the book not to be a hit for that query. You see how hard this is to
> > handle, don't you?
> 
> I think I'm starting to understand. So you want to treat several
> documents as one, and if the match fails for one of the documents, it
> should fail for all the documents with the same id. OK. This raises the
> question: why don't you make all these documents with the same id one
> document, and index them together?

This would be a functional but not a nice solution. The "pages" are sent
to my Java class. I cannot change this because it is an API-related
restriction. To index 1000 pages I have to index the first one; when I
get the second one I need to fetch the first page again, bind both together
and send them to the IndexWriter. I must keep track of every single page
the "book" contains. This procedure is repeated for every page and gets
uglier as the book grows. Furthermore, my "book" allows single pages
to be deleted or updated. Every time such an atomic task
(adding/deleting) is performed, the index for the whole "book" must be
rebuilt. The mechanism that converts a "page" into a Lucene document is
very time-consuming, so I want to do that as rarely as possible. As you
can see, it would be great if Lucene could somehow treat a
"logical document" (consisting of several Lucene documents) like a normal
Lucene document.
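
The "logical document" semantics described above can be sketched in plain
Java, outside of Lucene. This is only an illustration under stated
assumptions, not a Lucene feature: the class LogicalDocumentSearch, its Page
type and the search and setOf methods are hypothetical names, and the terms
of each page are assumed to be available as a set. The idea is to merge the
terms of all pages that share a DocID and to evaluate required and
prohibited terms against the merged set, using the four example documents
from the original question.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: evaluates a boolean term query per DocID group, not per page. */
final class LogicalDocumentSearch {

    /** One indexed page: the shared DocID plus the terms that page contains. */
    static final class Page {
        final String docId;
        final Set<String> terms;

        Page(String docId, String... terms) {
            this.docId = docId;
            this.terms = setOf(terms);
        }
    }

    /** Small helper that builds a mutable set from the given terms. */
    static Set<String> setOf(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    /** Merges the term sets of all pages that share a DocID. */
    static Map<String, Set<String>> mergeByDocId(List<Page> pages) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Page page : pages) {
            merged.computeIfAbsent(page.docId, id -> new HashSet<>()).addAll(page.terms);
        }
        return merged;
    }

    /**
     * Returns the DocIDs (sorted) whose merged pages contain every required
     * term and none of the prohibited terms.
     */
    static Set<String> search(List<Page> pages, Set<String> required, Set<String> prohibited) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : mergeByDocId(pages).entrySet()) {
            Set<String> terms = entry.getValue();
            if (terms.containsAll(required) && Collections.disjoint(terms, prohibited)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The four example documents from the original question.
        List<Page> pages = Arrays.asList(
                new Page("XYZ", "foo"),
                new Page("XYZ", "bar"),
                new Page("ABC", "foo", "bar"),
                new Page("GHJ", "foo"));

        // "+foo +bar": XYZ matches although foo and bar are on different pages.
        System.out.println(search(pages, setOf("foo", "bar"), setOf())); // prints [ABC, XYZ]

        // "foo -bar": XYZ is rejected because another of its pages contains bar.
        System.out.println(search(pages, setOf("foo"), setOf("bar"))); // prints [GHJ]
    }
}
```

With the example data, a search requiring both foo and bar matches XYZ and
ABC but not GHJ, and (foo -bar) matches only GHJ. A real implementation
would presumably have to collect the per-page Lucene hits and combine them
per DocID, which is likely to cost more than this in-memory sketch suggests.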

> 
> > 
> > > 
> > > For that matter, you can use a standard analyzer on the query and use a
> > > boolean to tie it to the specific document set.
> > > 
> > > This is how we do searching on a specific channel at fastbuzz.com.
> > > 
> > > Dror
> > > 
> > > 
> > > On Wed, Jan 07, 2004 at 05:21:43PM +0100, Thomas Scheffler wrote:
> > > > 
> > > > Jamie Stallwood said:
> > > > > +(DocID:XYZ DocID:ABC) +(foo bar)
> > > > >
> > > > > > will find a document that (MUST have (xyz OR abc)) AND (MUST have
> > > > > > (foo OR bar)).
> > > > 
> > > > This only solves that particular example; in the real world my
> > > > documents don't actually contain "foo" or "bar". What I meant was: make
> > > > Lucene think that all Documents with the same DocID are ONE Document.
> > > > Imagine you have a big book, say 1000 pages. Instead of putting the whole
> > > > book in the index, you split it up into single pages and index them. Now,
> > > > if a page changes or is deleted, it's faster to update your index than
> > > > to re-index all 1000 pages over and over again. The problem starts
> > > > when you search the book. You search for (foo bar); foo is on
> > > > page 345 while bar is on page 435. You want to get a hit for the book.
> > > > So I need a solution that matches this more generic example.
> > > > 
> > > > >
> > > > > -----Original Message-----
> > > > > From: Thomas Scheffler [mailto:thomas.scheffler@uni-jena.de]
> > > > > Sent: 07 January 2004 11:23
> > > > > To: lucene-user@jakarta.apache.org
> > > > > Subject: merged search of document
> > > > >
> > > > > Hi,
> > > > >
> > > > > > I need a tip for implementation. I have several documents, all of them
> > > > > > with a field named DocID. DocID identifies not a single Lucene Document
> > > > > > but a collection of them. When I start a search, it should be handled
> > > > > > as if these Lucene documents were one.
> > > > >
> > > > > example:
> > > > >
> > > > > Document 1: DocID:XYZ
> > > > >
> > > > > containing: foo
> > > > >
> > > > > Document 2: DocID:XYZ
> > > > >
> > > > > containing: bar
> > > > >
> > > > > Document 3: DocID:ABC
> > > > >
> > > > > containing: foo bar
> > > > >
> > > > > > Document 4: DocID:GHJ
> > > > >
> > > > > containing: foo
> > > > >
> > > > > > As you have already guessed, when I search for "+foo +bar" I want the
> > > > > > hits to contain Document 1, Document 2 and Document 3, but not Document 4.
> > > > > > Is it clear what I want? How do I implement such a monster? Is that
> > > > > > possible with Lucene? The content is not stored within Lucene; it's just
> > > > > > tokenized and indexed.
> > > > >
> > > > > Any help?
> > > > >
> > > > > Thanks in advance!
> > > > >
> > > > > Thomas Scheffler
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > > > >
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > > > >
> > > > >
> > > > 
> > > > 
> > --
> > Computer science terms, simply explained
> > ========================================
> > No. 37 -- Fault-tolerant:
> > 
> > The program does not accept any user input.
> > 
--
Computer science terms, simply explained
========================================
No. 385 -- integrates into existing structures:

Microsoft Passport account required (Henryk Plötz)

