lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Untokenized URL
Date Mon, 07 Jul 2008 06:53:30 GMT
As Shai said earlier, you should index the field twice: once tokenized, for
searching, and once untokenized under a different name (e.g.
"field-untokenized"). Use the untokenized field in your TermEnum code and the
tokenized field for normal search queries.
If you want to retrieve the field contents with Document.get() & Co. instead
of TermEnum, you can add the field just once, flagged as both tokenized and
stored. But that does not work with your TermEnum solution, because the term
dictionary then contains only the individual tokens.
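
To illustrate why the second field is needed, here is a minimal plain-Java
sketch of a per-field term dictionary. It is a model only, not Lucene code:
the class TermDictionarySketch, its methods, the field name "url-untokenized"
and the split-on-punctuation tokenizer are all assumptions made for
illustration, and a real analyzer may split URLs differently.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model of a per-field term dictionary; this is NOT Lucene code.
class TermDictionarySketch {

    // field name -> sorted set of terms, loosely mirroring what a term
    // enumeration walks over
    final Map<String, SortedSet<String>> termsByField = new TreeMap<>();

    // Index the URL twice: tokenized under "url", whole under "url-untokenized".
    void addUrl(String url) {
        // Assumed tokenizer: lowercase, then split on non-alphanumeric runs.
        for (String token : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms("url").add(token);
            }
        }
        // The untokenized copy becomes exactly one term: the original string.
        terms("url-untokenized").add(url);
    }

    // Returns the term set for a field, creating an empty one on demand.
    SortedSet<String> terms(String field) {
        return termsByField.computeIfAbsent(field, f -> new TreeSet<>());
    }

    public static void main(String[] args) {
        TermDictionarySketch index = new TermDictionarySketch();
        index.addUrl("http://www.cnn.com");
        System.out.println(index.terms("url"));             // [cnn, com, http, www]
        System.out.println(index.terms("url-untokenized")); // [http://www.cnn.com]
    }
}
```

Enumerating the tokenized field yields only fragments, and nothing in the
dictionary records how to join them again; the untokenized field yields the
URL as a single term.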

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: blazingwolf7 [mailto:blazingwolf7@gmail.com]
> Sent: Monday, July 07, 2008 7:39 AM
> To: java-dev@lucene.apache.org
> Subject: Re: Untokenized URL
> 
> 
> I am trying to retrieve the URL and use it as a filter. The main problem is
> that I don't want to use a reader to retrieve the URL again and again for
> each document found.
> 
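> TermDocs termDocs = reader.termDocs();
> TermEnum termEnum = reader.terms(new Term(field, ""));
> do {
>    Term term = termEnum.term();
>    // The enumeration continues into later fields, so stop at the boundary.
>    if (term == null || !term.field().equals(field)) break;
> } while (termEnum.next());
> termEnum.close();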
> 
> I am using this code to retrieve the field containing the URL, but it is
> tokenized. Is there any way to untokenize it, or is there a better way to
> do this?
> 
> 
> Shai Erera wrote:
> >
> > I think that the simplest solution will be to index the URL field twice,
> > once as TOKENIZED and once as UN_TOKENIZED. Then you can look up the
> > un_tokenized term.
> > If you have a document in hand and only want to fetch its URL, then add
> > the URL twice, once as Store.NO, Index.TOKENIZED and once as Store.YES /
> > COMPRESS and Index.NO.
> >
> > Perhaps I don't understand the entire scenario. When do you need to fetch
> > the contentLength and URL? To what purpose?
> >
> > On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <blazingwolf7@gmail.com>
> > wrote:
> >
> >>
> >> No, I didn't store the contentLength; I only added it to the index. I am
> >> still scratching my head over it, as I can't think of another way to
> >> retrieve it without continuously using the reader.
> >>
> >> As for the URL, I use doc.add(new Field("url", url, Store.NO,
> >> Index.TOKENIZED)). I would like to keep it this way, with the URL
> >> tokenized, but I am looking for a way to untokenize it on retrieval. I
> >> retrieve it using a method that fetches the entire field and then
> >> extracts the information in it. The problem is that the URLs are broken
> >> down into tokens. I am looking for a way to reconstruct them in their
> >> original form. Can it be done?
> >>
> >>
> >> Shai Erera wrote:
> >> >
> >> > Hi
> >> >
> >> > Regarding the contentLength, when you add it to the document, do you
> >> > *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
> >> >
> >> > Regarding the URL, how do you add it to the document? For example, if
> >> > you do
> >> > doc.add(new Field("url", "http://www.cnn.com", Store.NO,
> >> > Index.UN_TOKENIZED)), it would create a token like
> >> > "url: http://www.cnn.com" without breaking it into its parts. Is that
> >> > what you're looking for?
> >> >
> >> > Shai
> >> >
> >> > On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <blazingwolf7@gmail.com>
> >> > wrote:
> >> >
> >> >>
> >> >> Hi,
> >> >>
> >> >> I am currently working on retrieving the URL and contentLength of each
> >> >> document found during the search. I want to retrieve them during score
> >> >> calculation so that I can influence the score in some other way.
> >> >>
> >> >> I used the methods from TermDocs and TermEnum to get the information.
> >> >> However, the URL I retrieve is, as most of you know, tokenized. It is
> >> >> broken down into several parts and I would have to rejoin them. Can
> >> >> anyone help me with this? I am stuck wondering how to get back the
> >> >> whole URL without using a Reader.
> >> >>
> >> >> Also, I try to retrieve the contentLength, but the results returned
> >> >> are null. Why is that? I opened the index using Luke and the
> >> >> contentLength is there, but when I try to get it this way, the result
> >> >> is null.
> >> >>
> >> >> Can anyone help me with both of these problems? Any help will be
> >> >> appreciated. Thanks
> >> >> --
> >> >> View this message in context:
> >> >> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
> >> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >>
> >> >>
> >> >
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Shai Erera
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Untokenized-URL-tp18275048p18298055.html
> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Shai Erera
> >
> >
> 
> --
> View this message in context:
> http://www.nabble.com/Untokenized-URL-tp18275048p18310348.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> 
> 


