manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawler does not follow the robots meta tag rules
Date Wed, 02 Feb 2011 17:04:11 GMT
I created a ticket, CONNECTORS-157, to cover the path-resolution issue.
Karl
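
For reference, the retry idea from the quoted message below (put a "/" in front when the base has no path) can be sketched roughly as follows. This is only an illustration, not the connector's code and not necessarily the fix that went into CONNECTORS-157: the class name RelativeLinkResolver and the method resolveLink are hypothetical, and the sketch assumes the seed is an absolute URL with an authority but an empty path, such as http://ridder.uio.no. Rather than catching the exception and retrying, it gives the base a "/" path before resolving, which is how RFC 3986 section 5.2.3 treats a base with an authority and an empty path.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not part of ManifoldCF.
class RelativeLinkResolver {

    // Resolves rawURL against parentIdentifier. If the parent is absolute,
    // has an authority, and has an empty path, the parent is first rebuilt
    // with "/" as its path so that relative links get a rooted path.
    static URI resolveLink(String parentIdentifier, String rawURL)
            throws URISyntaxException {
        URI parent = new URI(parentIdentifier);
        String path = parent.getRawPath();
        boolean emptyPath = (path == null || path.isEmpty());
        if (parent.isAbsolute() && parent.getRawAuthority() != null && emptyPath) {
            // Rebuild from the raw (still encoded) components to avoid
            // re-encoding anything that was already percent-escaped.
            StringBuilder sb = new StringBuilder();
            sb.append(parent.getScheme()).append("://")
              .append(parent.getRawAuthority()).append('/');
            if (parent.getRawQuery() != null) {
                sb.append('?').append(parent.getRawQuery());
            }
            if (parent.getRawFragment() != null) {
                sb.append('#').append(parent.getRawFragment());
            }
            parent = new URI(sb.toString());
        }
        // Parse the link with the checked constructor so a malformed link
        // surfaces as URISyntaxException instead of IllegalArgumentException.
        return parent.resolve(new URI(rawURL));
    }

    public static void main(String[] args) throws URISyntaxException {
        // prints "http://ridder.uio.no/dokument.pdf"
        System.out.println(resolveLink("http://ridder.uio.no", "dokument.pdf"));
    }
}
```

A parent that already has a path, such as http://ridder.uio.no/test_closed/, is left untouched and resolves the usual way.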

On Wed, Feb 2, 2011 at 11:34 AM, Karl Wright <daddywri@gmail.com> wrote:
> Turns out Java doesn't like the form of those URLs; it doesn't think they're proper:
>
> WEB: Can't use url 'dokument.pdf' because it is badly formed: Relative
> path in absolute URI: http://ridder.uio.nodokument.pdf
> WEB: In html document 'http://ridder.uio.no', found an unincluded URL
> 'dokument.pdf'
>
> This is the java.net.URI class:
>
>        java.net.URI parentURL = new java.net.URI(parentIdentifier);
>        url = parentURL.resolve(rawURL);
>
> ... and this is throwing a java.net.URISyntaxException.
>
> I'm going to have to go look at the standards to figure out what we
> should do here.  Perhaps the right approach is to note the exception
> and retry with a "/" glommed on the front if we get it.
>
> But clearly you must have modified the web connector in order to get
> it to crawl your stuff in the first place.
>
> Karl
>
> On Wed, Feb 2, 2011 at 11:08 AM, Karl Wright <daddywri@gmail.com> wrote:
>> Hmm.  I get 701 bytes from your seed, but no parseable links.  Investigating...
>> Karl
>>
>> On Wed, Feb 2, 2011 at 10:45 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>> On 28.01.11 14.32, Karl Wright wrote:
>>>>
>>>> Thanks.  I tested my changes enough so that I was confident in
>>>> committing the patch, so the changes are in trunk.
>>>
>>> I'm afraid that it doesn't work properly. I downloaded the latest version
>>> from trunk and started the crawler.
>>>
>>> Try to use the following address in your seed list and the following rule in
>>> the includes list:
>>> ^http://ridder.uio.no/.*
>>>
>>> The following document was fetched and sent to Solr for indexing even though
>>> it includes a robots noindex rule:
>>> http://ridder.uio.no/test_closed/
>>>
>>> Here's the line from the history telling me that Solr should index it:
>>> 02-02-2011 16:12:33.283    document ingest (Solr)    http://ridder.uio.no/test_closed/    200
>>>
>>> I can try to modify the code you have added in order to get around this
>>> tomorrow. I guess I can find the relevant check somewhere in the following
>>> folder?
>>> mcf-trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>
>>
>
