lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Extracting URLs while indexing
Date Wed, 20 Jan 2010 17:53:03 GMT
That's really hard to say without seeing your configuration <G>...

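If your field has WordDelimiterFilterFactory with the relevant catenate
options (catenateWords, catenateNumbers, catenateAll) set to 1, that'd do
it: the delimiters are dropped and the parts are joined back into one token.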
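
For illustration only, a field type along these lines would behave that
way. This is a sketch under assumptions, not anyone's actual schema: the
name "text_url" and the choice of WhitespaceTokenizerFactory are made up
for the example. The filter itself is solr.WordDelimiterFilterFactory,
which splits tokens on characters such as "/" and ":" and, with the
catenate options on, also emits the pieces joined together.

```xml
<!-- Hypothetical field type; "text_url" is an illustrative name only. -->
<fieldType name="text_url" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: split on whitespace so a URL reaches the filter whole. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenate* = "1" re-joins the sub-words after the delimiters are removed. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Setting the catenate options to "0", or dropping the filter for that
field, should stop the joined tokens from being produced; the Analysis
page in the Solr admin UI is the quickest way to check.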

Can you post the relevant parts of your schema?

Erick

On Wed, Jan 20, 2010 at 12:46 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:

> I am not absolutely sure about what I am saying but I think after
> tokenization I get the URLs as single tokens but with all the "interesting
> symbols" :) like "/",":" removed from the token.
> Is it normal? Is there a chance I misconfigured something?
>
> Best regards,
> Bogdan
>
> On Wed, Jan 20, 2010 at 7:03 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> > I guess it depends on what you mean by "extract". There's
> > nothing that I know of that, say, stores them to a file or
> > separate field, or even does anything special with them.
> >
> > I think StandardTokenizerFactory tries to keep URLs
> > together as a token in the field, but it's just another
> > token... You should check though....
> >
> > FWIW
> > Erick
> >
> > On Wed, Jan 20, 2010 at 9:52 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
> >
> > > Sorry, I meant completely server-side. What's more, I want this at
> > > indexing time (I do not care about query time, since I read the whole
> > > index later anyway).
> > >
> > > On Wed, Jan 20, 2010 at 2:40 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> > >
> > > > Do you mean you want the URLs to be extracted on the client?
> > > > If so, no. Filters/analyzers reside on the server, not the client.
> > > > You'll have to do it with custom code....
> > > >
> > > > Erick
> > > >
> > > > On Tue, Jan 19, 2010 at 5:48 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I want to extract URLs (http://..., as well as file://... or even
> > > > > //.....) while pushing documents into Solr.
> > > > > Is it possible with the Filters/Analyzers available nowadays?
> > > > > I looked into the doc but could not find anything related to it.
> > > > >
> > > > > Best regards,
> > > > > Bogdan
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Bogdan
> > >
> >
>
