lucene-dev mailing list archives

From blazingwolf7 <blazingwo...@gmail.com>
Subject RE: Untokenized URL
Date Mon, 07 Jul 2008 07:14:53 GMT

Well, I am open to suggestions, except for using a reader. How do
Document.get() & Co. work?
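
To make the question concrete, here is a minimal sketch of the difference between stored values and indexed terms. It is a toy model in plain Java and not Lucene's API: the class ToyIndex, its methods, and the regex used as a stand-in for an analyzer are all assumptions made for illustration. It only shows why enumerating the terms of a tokenized field yields fragments, why an untokenized copy yields the whole URL as a single term, and why a lookup of a field that was indexed but not stored comes back null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model only: this class is hypothetical and is not part of Lucene. */
public class ToyIndex {
    // Inverted index: field -> sorted term text -> ids of documents containing it.
    private final Map<String, TreeMap<String, TreeSet<Integer>>> terms = new HashMap<>();
    // Stored values: one map of field -> original string per document id.
    private final List<Map<String, String>> stored = new ArrayList<>();

    /** Starts a new document and returns its id. */
    public int newDoc() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    /** Adds a field; 'store' and 'tokenize' loosely mirror the Store/Index flags. */
    public void add(int doc, String field, String value, boolean store, boolean tokenize) {
        if (store) {
            stored.get(doc).put(field, value);
        }
        // Crude stand-in for an analyzer: split on anything that is not a letter or digit.
        String[] tokens = tokenize
                ? value.toLowerCase().split("[^a-z0-9]+")
                : new String[] { value };
        for (String token : tokens) {
            if (!token.isEmpty()) {
                terms.computeIfAbsent(field, f -> new TreeMap<>())
                     .computeIfAbsent(token, t -> new TreeSet<>())
                     .add(doc);
            }
        }
    }

    /** What a term enumeration over one field would see, in sorted order. */
    public List<String> termsOf(String field) {
        return new ArrayList<>(terms.getOrDefault(field, new TreeMap<>()).keySet());
    }

    /** What a stored-field lookup would see; null if the field was not stored. */
    public String get(int doc, String field) {
        return stored.get(doc).get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int doc = index.newDoc();
        String url = "http://www.cnn.com";
        index.add(doc, "url", url, true, true);                // tokenized and stored
        index.add(doc, "url-untokenized", url, false, false);  // single term, not stored

        System.out.println(index.termsOf("url"));              // fragments of the URL
        System.out.println(index.termsOf("url-untokenized"));  // the whole URL as one term
        System.out.println(index.get(doc, "url"));             // original stored value
        System.out.println(index.get(doc, "url-untokenized")); // null: nothing was stored
    }
}
```

In this model the stored value and the indexed terms live in separate structures, so the two retrieval paths never see each other's data.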


Uwe Schindler wrote:
> 
> As Shai said before, you should store the field twice: once as a tokenized
> field for your search, and once under a different name (e.g.
> "field-untokenized") as an untokenized field. Use the untokenized field in
> your TermEnum code and the tokenized field for normal search queries.
> If you want to retrieve the field contents with Document.get() & Co.
> instead of TermEnum, you can add the field once with the flags Tokenized
> and Stored. But this does not work with your TermEnum solution.
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
>> -----Original Message-----
>> From: blazingwolf7 [mailto:blazingwolf7@gmail.com]
>> Sent: Monday, July 07, 2008 7:39 AM
>> To: java-dev@lucene.apache.org
>> Subject: Re: Untokenized URL
>> 
>> 
>> I am trying to retrieve the url and use it as a filter. The main problem is
>> that I don't want to use a reader to continuously retrieve the url for each
>> document located.
>> 
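>> TermDocs termDocs = reader.termDocs();
>> TermEnum termEnum = reader.terms(new Term(field, ""));
>> do {
>>    Term term = termEnum.term();
>>    // Stop when the enumeration is exhausted or moves past this field.
>>    if (term == null || !field.equals(term.field())) break;
>>    String text = term.text();  // one token of the url if the field is tokenized
>> } while (termEnum.next());
>> termEnum.close();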
>> 
>> I am using this code to retrieve the field containing the url, but it is
>> tokenized. Is there any way to untokenize it, or is there a better way to
>> do this?
>> 
>> 
>> Shai Erera wrote:
>> >
>> > I think that the simplest solution will be to index the URL field twice,
>> > once as TOKENIZED and once as UN_TOKENIZED. Then you can look up the
>> > un_tokenized term.
>> > If you have a document in hand and only want to fetch its URL, then add
>> > the URL twice, once as Store.NO, Index.TOKENIZED and once as Store.YES /
>> > COMPRESS and Index.NO.
>> >
>> > Perhaps I don't understand the entire scenario. When do you need to fetch
>> > the contentLength and URL? To what purpose?
>> >
>> > On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <blazingwolf7@gmail.com>
>> > wrote:
>> >
>> >>
>> >> No, I didn't store the contentLength; I only added it to the index.
>> >> I am still scratching my head, as I can't think of another way to
>> >> retrieve it without continuously using the reader.
>> >>
>> >> As for the url, I use doc.add(new Field("url", Store.NO,
>> >> Index.TOKENIZED)). I would like to keep it this way, with the url
>> >> tokenized. I am looking for a way to untokenize it: I retrieved it using
>> >> a method that fetches the entire field and then extracts the information
>> >> in it, but the problem is that the url is broken down. I am seeking a way
>> >> to reconstruct it in its original format. Can it be done?
>> >>
>> >>
>> >> Shai Erera wrote:
>> >> >
>> >> > Hi
>> >> >
>> >> > Regarding the contentLength, when you add it to the document, do you
>> >> > *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
>> >> >
>> >> > Regarding the URL, how do you add it to the document? For example, if
>> >> > you do doc.add(new Field("url", "http://www.cnn.com", Store.NO,
>> >> > Index.UN_TOKENIZED)), it would create a token like
>> >> > "url: http://www.cnn.com" without breaking it into its parts. Is that
>> >> > what you're looking for?
>> >> >
>> >> > Shai
>> >> >
>> >> > On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <blazingwolf7@gmail.com>
>> >> > wrote:
>> >> >
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I am currently working on retrieving the url and contentLength of each
>> >> >> document found during the search. I want to retrieve them during the
>> >> >> calculation of the score so that I can influence the score in some
>> >> >> other way.
>> >> >>
>> >> >> I used the methods from TermDocs and TermEnum to get the information.
>> >> >> However, the url I retrieve is, as most of you know, tokenized. It is
>> >> >> broken down into several parts and I will have to rejoin them. Can
>> >> >> anyone help me with this? I am stuck here wondering how to get back
>> >> >> the whole url without using a Reader.
>> >> >>
>> >> >> Also, I tried to retrieve the contentLength, but the results returned
>> >> >> are null. Why is that? I opened the index using Luke and the
>> >> >> contentLength is there, but when I try to get it this way, the result
>> >> >> is null.
>> >> >>
>> >> >> Can anyone help me with both of these problems? Any help will be
>> >> >> appreciated. Thanks
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
>> >> >> Sent from the Lucene - Java Developer mailing list archive at
>> >> >> Nabble.com.
>> >> >>
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Regards,
>> >> >
>> >> > Shai Erera
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Untokenized-URL-tp18275048p18298055.html
>> >> Sent from the Lucene - Java Developer mailing list archive at
>> >> Nabble.com.
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > Regards,
>> >
>> > Shai Erera
>> >
>> >
>> 
>> --
>> View this message in context:
>> http://www.nabble.com/Untokenized-URL-tp18275048p18310348.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> 
>> 
> 
> 
> 
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Untokenized-URL-tp18275048p18311247.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



