lucene-solr-user mailing list archives

From "Teague James" <teag...@insystechinc.com>
Subject RE: Indexing URLs for Binaries
Date Fri, 03 Jan 2014 19:26:43 GMT
Thanks, Mark. I checked there, but pdf files are not listed. There are some
file types in there that I might need in the future, so I appreciate the
info. Any other ideas?

-----Original Message-----
From: Reyes, Mark 
Sent: Friday, January 03, 2014 1:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing URLs for Binaries

Check suffix-urlfilter.txt in your conf directory for Nutch. You might be
prohibiting those filetypes from the crawl.

- Mark

On 1/3/14, 10:29 AM, "Teague James" <teaguej@insystechinc.com> wrote:

>I am using Nutch 1.7 with Solr 4.6.0 to index websites that have links
>to binary files such as Word and PDF documents. The crawler crawls the
>site, but I am not getting the URLs of the links to the binary files, no
>matter how deep I set the crawl depth for the site. I see the labels for
>the links in the content, but not the URLs. Any ideas on how I could
>get those URLs back in my crawl?
>

