lucene-dev mailing list archives

From blazingwolf7 <blazingwo...@gmail.com>
Subject Re: Untokenized URL
Date Sun, 06 Jul 2008 01:26:02 GMT

No, I didn't store the contentLength; I only added it to the index. I am
still scratching my head over this, as I can't think of another way to
retrieve it without continuously using the reader.

As for the url, I use doc.add(new Field("url", url, Store.NO,
Index.TOKENIZED)). I would like to keep it this way, with the url tokenized.
What I am looking for is a way to un-tokenize it on retrieval: I use a method
that retrieves the entire field and then extracts the information from it.
The problem is that the url comes back broken into parts. I am seeking a way
to reconstruct it in its original format. Can it be done?
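
To illustrate why rebuilding the url from its tokens is a problem, here is a minimal sketch in plain Java. It does not use Lucene: the tokenize helper is a hypothetical stand-in for an analyzer and simply lowercases the text and splits it on anything that is not a letter or digit. Real analyzers differ in detail, but the effect shown here (separators are dropped) is the same kind of loss.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class UrlTokenDemo {
    // Hypothetical stand-in for an analyzer (NOT Lucene's tokenizer):
    // lowercases the input and splits on any run of non-alphanumeric characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two different urls that differ only in one separator character.
        String a = "http://www.cnn.com/world";
        String b = "http://www.cnn.com?world";

        System.out.println(tokenize(a));
        System.out.println(tokenize(b));

        // Both urls yield the same token list, so the tokens alone
        // cannot tell us which original string they came from.
        System.out.println(tokenize(a).equals(tokenize(b)));
    }
}
```

In this sketch both urls produce the same token list, so the separators cannot be recovered from the tokens alone. Under that assumption, the original string has to be kept somewhere else, for example in an additional field added with Store.YES, alongside the tokenized one.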


Shai Erera wrote:
> 
> Hi
> 
> Regarding the contentLength, when you add it to the document, do you
> *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
> 
> Regarding the URL, how do you add it to the document? For example, if you
> do doc.add(new Field("url", "http://www.cnn.com", Store.NO,
> Index.UN_TOKENIZED)), it would create a token like "url:http://www.cnn.com"
> without breaking it into its parts. Is that what you're looking for?
> 
> Shai
> 
> On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <blazingwolf7@gmail.com>
> wrote:
> 
>>
>> Hi,
>>
>> I am currently working on retrieving url and contentLength of each
>> document
>> found during the search. I want to retrieve it during the calculation of
>> score so that I can influence the score in some other way.
>>
>> I used the methods from TermDocs and TermEnum to get the information.
>> However, the url I retrieve is, as most will know, tokenized. It is
>> broken down into several parts and I will have to rejoin them. Can anyone
>> help me with this? I am stuck here wondering how to get back the whole
>> url without using a Reader.
>>
>> Also, I tried to retrieve the contentLength, but the results returned are
>> null. Why is that? I opened the index using Luke and the contentLength is
>> there, but when I try to get it this way, the result is null.
>>
>> Can anyone help me with both of these problems? Any help will be
>> appreciated. Thanks
>> --
>> View this message in context:
>> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
> 
> 
> -- 
> Regards,
> 
> Shai Erera
> 
> 

-- 
View this message in context: http://www.nabble.com/Untokenized-URL-tp18275048p18298055.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

