lucene-dev mailing list archives

From "Shai Erera" <ser...@gmail.com>
Subject Re: Untokenized URL
Date Sun, 06 Jul 2008 05:50:18 GMT
I think the simplest solution is to index the URL field twice, once as
TOKENIZED and once as UN_TOKENIZED. Then you can look up the un-tokenized
term, which holds the whole URL.
If you already have a document in hand and only want to fetch its URL, then
add the URL twice: once as Store.NO, Index.TOKENIZED and once as Store.YES
(or Store.COMPRESS), Index.NO.
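
To make the idea concrete, here is a toy model in plain Java. It is not Lucene
code: the class DualUrlFieldSketch, its methods, and the simple tokenizer are
all hypothetical names made up for illustration. It only sketches why keeping
a whole-value copy next to the tokenized copy avoids having to rejoin tokens.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these names are Lucene classes or methods.
class DualUrlFieldSketch {
    // "Tokenized" copy: each URL fragment maps to the documents containing it.
    private final Map<String, Set<Integer>> tokenizedTerms = new HashMap<>();
    // "Un-tokenized" copy: the whole URL is kept as one single term.
    private final Map<String, Set<Integer>> exactTerms = new HashMap<>();
    // "Stored" copy: the original value, retrievable by document id.
    private final List<String> storedUrls = new ArrayList<>();

    // Assumed tokenizer: lower-case, then split on anything non-alphanumeric.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String part : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Adds the URL under all three representations and returns its doc id.
    int add(String url) {
        int docId = storedUrls.size();
        storedUrls.add(url);
        for (String token : tokenize(url)) {
            tokenizedTerms.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        exactTerms.computeIfAbsent(url, k -> new TreeSet<>()).add(docId);
        return docId;
    }

    // Search by fragment, as a tokenized field allows.
    List<Integer> searchToken(String token) {
        return new ArrayList<>(
                tokenizedTerms.getOrDefault(token, Collections.<Integer>emptySet()));
    }

    // Look up documents by the complete URL, as an un-tokenized field allows.
    List<Integer> lookupExact(String url) {
        return new ArrayList<>(
                exactTerms.getOrDefault(url, Collections.<Integer>emptySet()));
    }

    // Fetch the original URL of a document you already have in hand.
    String storedUrl(int docId) {
        return storedUrls.get(docId);
    }

    public static void main(String[] args) {
        DualUrlFieldSketch index = new DualUrlFieldSketch();
        index.add("http://www.cnn.com");
        int world = index.add("http://www.cnn.com/world");
        System.out.println(tokenize("http://www.cnn.com"));
        System.out.println(index.searchToken("cnn"));
        System.out.println(index.lookupExact("http://www.cnn.com"));
        System.out.println(index.storedUrl(world));
    }
}
```

The tokens alone have lost the "://", "." and "/" separators, so they cannot
be rejoined reliably; the whole-value copy sidesteps that problem.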

Perhaps I don't understand the entire scenario. When do you need to fetch
the contentLength and URL? To what purpose?

On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <blazingwolf7@gmail.com> wrote:

>
> No, I didn't store the contentLength; I just added it to the index. I am
> still scratching my head, as I can't think of another way to retrieve it
> without continuously using the reader.
>
> As for the url, I use doc.add(new Field("url", url, Store.NO,
> Index.TOKENIZED)). I would like to keep it this way, with the url
> tokenized. I am looking for a way to un-tokenize it: I retrieve it using a
> method that reads the entire field and then extracts the information in
> it. But the problem is that the url is broken down into parts. I am seeking
> a way to reconstruct it in its original format. Can it be done?
>
>
> Shai Erera wrote:
> >
> > Hi
> >
> > Regarding the contentLength, when you add it to the document, do you
> > *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
> >
> > Regarding the URL, how do you add it to the document? For example, if you
> > do doc.add(new Field("url", "http://www.cnn.com", Store.NO,
> > Index.UN_TOKENIZED)), it would create a token like
> > "url:http://www.cnn.com" without breaking it into its parts. Is that what
> > you're looking for?
> >
> > Shai
> >
> > On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <blazingwolf7@gmail.com> wrote:
> >
> >>
> >> Hi,
> >>
> >> I am currently working on retrieving the url and contentLength of each
> >> document found during a search. I want to retrieve them while the score
> >> is being calculated so that I can influence the score in some other way.
> >>
> >> I used the methods from TermDocs and TermEnum to get the information.
> >> However, the url I retrieve is, as most people know, tokenized. It is
> >> broken down into several parts and I have to rejoin them. Can anyone help
> >> me with this? I am stuck wondering how to get back the whole url without
> >> using a Reader.
> >>
> >> Also, I try to retrieve the contentLength, but the results returned are
> >> null. Why is that? I opened the index using Luke and the contentLength is
> >> there, but when I try to get it this way, the result is null.
> >>
> >> Can anyone help me with both of these problems? Any help will be
> >> appreciated. Thanks
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Shai Erera
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Untokenized-URL-tp18275048p18298055.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


-- 
Regards,

Shai Erera
