lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Untokenized URL
Date Mon, 07 Jul 2008 07:25:30 GMT
Hi,

Read here: http://wiki.apache.org/lucene-java/LuceneFAQ

I think this type of question is better suited to the Lucene Users
mailing list
(http://lucene.apache.org/java/docs/mailinglists.html#Java%20User%20List).
This list is for developers of Lucene itself, not for users asking for help
with implementing something specific with Lucene.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: blazingwolf7 [mailto:blazingwolf7@gmail.com]
> Sent: Monday, July 07, 2008 9:15 AM
> To: java-dev@lucene.apache.org
> Subject: RE: Untokenized URL
> 
> 
> Well, I am open to suggestions, except for using a reader. How do
> Document.get() & Co. work?
> 
> 
> Uwe Schindler wrote:
> >
> > As Shai said before, you should store the field twice: as a tokenized
> > field for your search, and as an untokenized field under a different
> > name (e.g. "field-untokenized"). For your TermEnum code you may use the
> > untokenized field, and for normal search queries the tokenized one.
> > If you want to retrieve the field contents with Document.get() & Co.
> > instead of TermEnum, you may store the field once with the flags
> > Tokenized & Stored. But this does not work with your TermEnum solution.
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >> -----Original Message-----
> >> From: blazingwolf7 [mailto:blazingwolf7@gmail.com]
> >> Sent: Monday, July 07, 2008 7:39 AM
> >> To: java-dev@lucene.apache.org
> >> Subject: Re: Untokenized URL
> >>
> >>
> >> I am trying to retrieve the url and use it as a filter. The main
> >> problem is that I don't want to use a reader to continuously retrieve
> >> the url for each document located.
> >>
> >> TermDocs termDocs = reader.termDocs();
> >> TermEnum termEnum = reader.terms (new Term (field, ""));
> >> do{
> >>    Term term = termEnum.term();
> >> }while(termEnum.next());
> >>
> >> I am using this code to retrieve the field containing the url, but it
> >> is tokenized. Is there any way to untokenize it, or is there a better
> >> way to do this?
> >>
> >>
> >> Shai Erera wrote:
> >> >
> >> > I think that the simplest solution will be to index the URL field
> >> > twice, once as TOKENIZED and once as UN_TOKENIZED. Then you can look
> >> > up the un_tokenized term.
> >> > If you have a document in hand and only want to fetch its URL, then
> >> > add the URL twice, once as Store.NO, Index.TOKENIZED and once as
> >> > Store.YES / COMPRESS and Index.NO.
> >> >
> >> > Perhaps I don't understand the entire scenario. When do you need to
> >> > fetch the contentLength and URL? To what purpose?
> >> >
> >> > On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <blazingwolf7@gmail.com>
> >> > wrote:
> >> >
> >> >>
> >> >> No, I didn't store the contentLength; I just added it to the index.
> >> >> I am still scratching my head over this, as I can't think of
> >> >> another way to retrieve it without continuously using the reader.
> >> >>
> >> >> As for the url, I use
> >> >> doc.add(new Field("url", url, Store.NO, Index.TOKENIZED)).
> >> >> I would like to keep it this way, with the url tokenized. I am
> >> >> looking for a way to untokenize it: I retrieved it using a method
> >> >> that retrieves the entire field and then extracts the information
> >> >> in it. But the problem is that the url is broken down. I am seeking
> >> >> a way to reconstruct it in its original format. Can it be done?
> >> >>
> >> >>
> >> >> Shai Erera wrote:
> >> >> >
> >> >> > Hi
> >> >> >
> >> >> > Regarding the contentLength, when you add it to the document, do
> >> >> > you *store* it as well (i.e., passing Store.YES or
> >> >> > Store.COMPRESS)?
> >> >> >
> >> >> > Regarding the URL, how do you add it to the document? For example,
> >> >> > if you do
> >> >> > doc.add(new Field("url", "http://www.cnn.com", Store.NO,
> >> >> > Index.UN_TOKENIZED)), it would create a token like
> >> >> > "url:http://www.cnn.com" without breaking it into its parts. Is
> >> >> > that what you're looking for?
> >> >> >
> >> >> > Shai
> >> >> >
> >> >> > On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7
> >> >> > <blazingwolf7@gmail.com> wrote:
> >> >> >
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> I am currently working on retrieving the url and contentLength
> >> >> >> of each document found during the search. I want to retrieve
> >> >> >> them during the calculation of the score so that I can influence
> >> >> >> the score in some other way.
> >> >> >>
> >> >> >> I used the methods from TermDocs and TermEnum to get the
> >> >> >> information. However, the url I retrieve is, as most people
> >> >> >> know, tokenized. It is broken down into several parts and I will
> >> >> >> have to rejoin them. Can anyone help me with this? I am stuck
> >> >> >> here wondering how to get back the whole url without using a
> >> >> >> Reader.
> >> >> >>
> >> >> >> Also, I try to retrieve the contentLength, but the results
> >> >> >> returned are null. Why is that? I opened the index using Luke
> >> >> >> and the contentLength is there, but when I try to get it this
> >> >> >> way, the result is null.
> >> >> >>
> >> >> >> Can anyone help me with both of these problems? Any help will be
> >> >> >> appreciated. Thanks
> >> >> >> --
> >> >> >> View this message in context:
> >> >> >> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
> >> >> >> Sent from the Lucene - Java Developer mailing list archive at
> >> >> >> Nabble.com.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Regards,
> >> >> >
> >> >> > Shai Erera
> >> >> >
> >> >> >
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://www.nabble.com/Untokenized-URL-tp18275048p18298055.html
> >> >> Sent from the Lucene - Java Developer mailing list archive at
> >> >> Nabble.com.
> >> >>
> >> >>
> >> >
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Shai Erera
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Untokenized-URL-tp18275048p18310348.html
> >> Sent from the Lucene - Java Developer mailing list archive at
> >> Nabble.com.
> >
> >
> 
> --
> View this message in context:
> http://www.nabble.com/Untokenized-URL-tp18275048p18311247.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
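
The core point of the thread, that terms of a tokenized field cannot be
joined back into the original URL while a second, untokenized field keeps
it intact, can be illustrated with a small sketch. This is a toy model in
plain Java and not Lucene code: the class UntokenizedFieldSketch, its
methods, and the simple split-on-punctuation tokenizer are all assumptions
made for illustration, and a real analyzer may split the URL differently.
Only the field names "url" and "url-untokenized" and the example URL are
modeled on the messages above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy term dictionary (hypothetical, NOT Lucene's implementation). */
public class UntokenizedFieldSketch {

    // field name -> sorted terms, loosely like what a term enumeration walks over
    private final Map<String, SortedSet<String>> terms = new TreeMap<>();

    /** Lower-cases and splits on non-alphanumerics, as a simple analyzer might. */
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Models a tokenized field: every token becomes its own term. */
    void addTokenized(String field, String value) {
        terms.computeIfAbsent(field, k -> new TreeSet<>()).addAll(tokenize(value));
    }

    /** Models an untokenized field: the whole value is a single term. */
    void addUntokenized(String field, String value) {
        terms.computeIfAbsent(field, k -> new TreeSet<>()).add(value);
    }

    SortedSet<String> termsOf(String field) {
        return terms.getOrDefault(field, new TreeSet<>());
    }

    public static void main(String[] args) {
        UntokenizedFieldSketch index = new UntokenizedFieldSketch();
        String url = "http://www.cnn.com";

        // Same value indexed twice under two field names, as suggested in the thread.
        index.addTokenized("url", url);
        index.addUntokenized("url-untokenized", url);

        // Separators and original order are gone, so the URL cannot be rebuilt.
        System.out.println(index.termsOf("url"));             // [cnn, com, http, www]
        // The untokenized field still holds the URL as one term.
        System.out.println(index.termsOf("url-untokenized")); // [http://www.cnn.com]
    }
}
```

In this model the tokenized field serves word-level lookups (a search for
"cnn" would match), while the untokenized field is the one to enumerate
when the complete URL is needed.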

