lucene-dev mailing list archives

From blazingwolf7 <blazingwo...@gmail.com>
Subject RE: Untokenized URL
Date Mon, 07 Jul 2008 08:12:50 GMT

Thanks for the help


Uwe Schindler wrote:
> 
> Hi,
> 
> Read here: http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> And I think that this type of question is more for the Lucene Users mailing
> list
> (http://lucene.apache.org/java/docs/mailinglists.html#Java%20User%20List).
> This list is for developers of Lucene itself, not for users asking for help
> with how to implement something specific with Lucene.
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
>> -----Original Message-----
>> From: blazingwolf7 [mailto:blazingwolf7@gmail.com]
>> Sent: Monday, July 07, 2008 9:15 AM
>> To: java-dev@lucene.apache.org
>> Subject: RE: Untokenized URL
>> 
>> 
>> Well, I am open to suggestions, except for using a reader. Document.get()
>> & Co., how does it work?
>> 
>> 
>> Uwe Schindler wrote:
>> >
>> > As Shai said before, you should store the field twice: as a tokenized
>> > field for your search, and untokenized under a different name (e.g.
>> > "field-untokenized"). For your TermEnum code you may use the untokenized
>> > field, and for normal search queries the tokenized one.
>> > If you want to retrieve the field contents with Document.get() & Co.
>> > instead of TermEnum, you may store the field once with the flags
>> > Tokenized & Stored.
>> > But this does not work with your TermEnum solution.
>> >
>> > -----
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: uwe@thetaphi.de
>> >
>> >> -----Original Message-----
>> >> From: blazingwolf7 [mailto:blazingwolf7@gmail.com]
>> >> Sent: Monday, July 07, 2008 7:39 AM
>> >> To: java-dev@lucene.apache.org
>> >> Subject: Re: Untokenized URL
>> >>
>> >>
>> >> I am trying to retrieve the url and use it as a filter. The main problem
>> >> is that I don't want to use a reader to continuously retrieve the url
>> >> for each document located.
>> >>
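>> >> TermDocs termDocs = reader.termDocs();
>> >> // Position the enumeration at the first term of the field.
>> >> TermEnum termEnum = reader.terms(new Term(field, ""));
>> >> do {
>> >>    Term term = termEnum.term();
>> >>    // Stop once the enumeration runs past the requested field.
>> >>    if (term == null || !term.field().equals(field)) break;
>> >> } while (termEnum.next());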
>> >>
>> >> I am using this code to retrieve the field containing the url, but it is
>> >> tokenized. Is there any way to untokenize it, or is there a better way
>> >> to do this?
>> >>
>> >>
>> >> Shai Erera wrote:
>> >> >
>> >> > I think that the simplest solution will be to index the URL field twice,
>> >> > once as TOKENIZED and once as UN_TOKENIZED. Then you can look up the
>> >> > un_tokenized term.
>> >> > If you have a document in hand and only want to fetch its URL, then add
>> >> > the URL twice, once as Store.NO, Index.TOKENIZED and once as Store.YES /
>> >> > COMPRESS and Index.NO.
>> >> >
>> >> > Perhaps I don't understand the entire scenario. When do you need to
>> >> > fetch the contentLength and URL? To what purpose?
>> >> >
>> >> > On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <blazingwolf7@gmail.com>
>> >> > wrote:
>> >> >
>> >> >>
>> >> >> No, I didn't store the contentLength; I only added it to the index.
>> >> >> I am still scratching my head, as I can't think of another way to
>> >> >> retrieve it without continuously using the reader.
>> >> >>
>> >> >> As for the url, I use
>> >> >> doc.add(new Field("url", Store.NO, Index.TOKENIZED)).
>> >> >> I would like to keep it this way, with the url tokenized. I am looking
>> >> >> for a way to untokenize it. I retrieved it using a method that
>> >> >> retrieves the entire field and then extracts the information in it.
>> >> >> But the problem is that the url is broken down. I am seeking a way to
>> >> >> reconstruct it in its original format. Can it be done?
>> >> >>
>> >> >>
>> >> >> Shai Erera wrote:
>> >> >> >
>> >> >> > Hi
>> >> >> >
>> >> >> > Regarding the contentLength, when you add it to the document, do you
>> >> >> > *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
>> >> >> >
>> >> >> > Regarding the URL, how do you add it to the document? For example, if
>> >> >> > you do
>> >> >> > doc.add(new Field("url", "http://www.cnn.com", Store.NO,
>> >> >> > Index.UN_TOKENIZED)), it would create a token like
>> >> >> > "url: http://www.cnn.com" without breaking it into its parts. Is that
>> >> >> > what you're looking for?
>> >> >> >
>> >> >> > Shai
>> >> >> >
>> >> >> > On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <blazingwolf7@gmail.com>
>> >> >> > wrote:
>> >> >> >
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> I am currently working on retrieving the url and contentLength of
>> >> >> >> each document found during the search. I want to retrieve them
>> >> >> >> during the calculation of the score so that I can influence the
>> >> >> >> score in some other way.
>> >> >> >>
>> >> >> >> I used the methods from TermDocs and TermEnum to get the
>> >> >> >> information. However, the url I retrieve, as is known by most, is
>> >> >> >> tokenized. It is broken down into several parts and I will have to
>> >> >> >> rejoin them. Can anyone help me with this? I am stuck here
>> >> >> >> wondering how to get back the whole url without using a Reader.
>> >> >> >>
>> >> >> >> Also, I try to retrieve the contentLength, but the results returned
>> >> >> >> are null. Why is that? I opened the index using Luke and the
>> >> >> >> contentLength is there, but when I try to get it this way, the
>> >> >> >> result is null.
>> >> >> >>
>> >> >> >> Can anyone help me with both of these problems? Any help will be
>> >> >> >> appreciated. Thanks
>> >> >> >> --
>> >> >> >> View this message in context:
>> >> >> >> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
>> >> >> >> Sent from the Lucene - Java Developer mailing list archive at
>> >> >> >> Nabble.com.
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> ---------------------------------------------------------------------
>> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Regards,
>> >> >> >
>> >> >> > Shai Erera
>> >> >> >
>> >> >> >
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://www.nabble.com/Untokenized-URL-tp18275048p18298055.html
>> >> >> Sent from the Lucene - Java Developer mailing list archive at
>> >> >> Nabble.com.
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Regards,
>> >> >
>> >> > Shai Erera
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Untokenized-URL-tp18275048p18310348.html
>> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>> >
>> >
>> >
>> >
>> 
>> --
>> View this message in context:
>> http://www.nabble.com/Untokenized-URL-tp18275048p18311247.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> 
>> 
> 
> 
> 
> 
> 
> 
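
As a rough illustration of the advice quoted above (index the URL twice, once tokenized for searching and once untokenized for term enumeration), here is a minimal plain-Java sketch. It is not Lucene code: the class DualFieldSketch, its methods, and the crude splitting rule standing in for an analyzer are all assumptions made for illustration. Only the field name "url", the sample value "http://www.cnn.com", and the TOKENIZED / UN_TOKENIZED distinction come from the thread; the second field name "url-untokenized" follows the naming pattern suggested there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy term dictionary (field name -> sorted terms). A model only, not Lucene's API. */
public class DualFieldSketch {
    private final Map<String, SortedSet<String>> terms = new TreeMap<>();

    /** Adds a value as several terms (tokenized) or as one intact term (untokenized). */
    public void add(String field, String value, boolean tokenized) {
        SortedSet<String> fieldTerms = terms.computeIfAbsent(field, f -> new TreeSet<>());
        if (tokenized) {
            // Hypothetical stand-in for an analyzer: lowercase, split on non-alphanumerics.
            for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    fieldTerms.add(token);
                }
            }
        } else {
            // The whole value becomes a single term, so it can be read back unchanged.
            fieldTerms.add(value);
        }
    }

    /** Lists the terms of one field in sorted order, similar in spirit to walking a term enumeration. */
    public List<String> termsOf(String field) {
        return new ArrayList<>(terms.getOrDefault(field, new TreeSet<>()));
    }

    public static void main(String[] args) {
        DualFieldSketch index = new DualFieldSketch();
        String url = "http://www.cnn.com";
        index.add("url", url, true);               // fragments for normal search queries
        index.add("url-untokenized", url, false);  // intact value for term enumeration
        System.out.println(index.termsOf("url"));
        System.out.println(index.termsOf("url-untokenized"));
    }
}
```

In this model, enumerating the "url" field yields only fragments (cnn, com, http, www), which is the situation described in the original question, while the "url-untokenized" field yields the URL as one term. In Lucene itself the equivalent choice would be made through the Index.TOKENIZED and Index.UN_TOKENIZED flags mentioned in the thread.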

-- 
View this message in context: http://www.nabble.com/Untokenized-URL-tp18275048p18311983.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



