manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Illegal Seed URL
Date Wed, 06 May 2020 12:02:34 GMT
The "?" in your URL is probably being interpreted as the regular-expression
quantifier "?" in your include list.  You need to escape it properly there.

Karl
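
A minimal sketch of the effect described above, assuming the include list
contained the seed URL verbatim as a regular expression. The class name
SeedPatternDemo and the helper matches() are hypothetical and not part of
ManifoldCF; the helper only mirrors the m.find() check from the quoted
isDocumentLegal() excerpt.

```java
import java.util.regex.Pattern;

class SeedPatternDemo {
    // Hypothetical helper: same Pattern/Matcher.find() check as the quoted excerpt.
    static boolean matches(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String seed = "https://www.abc.com/societybusiness/entrepreneurship/?lang=en";

        // Unescaped: "/?" means "an optional slash", so the literal "?" in the URL never matches.
        System.out.println(matches(seed, seed));

        // Escaped by hand: "\\?" matches a literal question mark.
        String escaped = "https://www\\.abc\\.com/societybusiness/entrepreneurship/\\?lang=en";
        System.out.println(matches(escaped, seed));

        // Pattern.quote() treats the whole string as a literal.
        System.out.println(matches(Pattern.quote(seed), seed));
    }
}
```

Under that assumption, the first check fails while the two escaped variants
succeed, which would be consistent with the URL without "?lang=en" passing.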


On Wed, May 6, 2020 at 2:54 AM ritika jain <ritikajain5263@gmail.com> wrote:

> Hi Michael,
>
> Yes, I am testing this in debug mode, and I tested one more scenario.
> Whenever the seed URL is something like this:
> https://www.abc.com/societybusiness/entrepreneurship/?lang=en
> <https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>,
> our web connector Java code returns null in this function when m.find()
> is executed, hence giving a null DocumentIdentifier and thus the
> "Illegal seed URL" error.
>
>     /** Check if the document identifier is legal.
>     */
>     public boolean isDocumentLegal(String url)
>     {
>       // First, verify that the url matches one of the patterns in the include list.
>       int i = 0;
>       while (i < includePatterns.size())
>       {
>         Pattern p = includePatterns.get(i);
>         Matcher m = p.matcher(url);
>         if (m.find())
>           break;
>         i++;
>       }
>
> Whereas when the seed URL is something like this:
> https://www.abc.com/societybusiness/entrepreneurship/ , this code
> passes without failing.
> Can anybody help me understand why the same code behaves differently?
>
> Thanks
> Ritika
>
> On Tue, May 5, 2020 at 6:09 PM Michael Cizmar <michael.cizmar@mcplusa.com>
> wrote:
>
>> Hi Ritika,
>>
>>
>>
>> There are several reasons that you could get that.  Have you started
>> ManifoldCF in debug mode?  If so, what’s the output just before that
>> statement in the logs?
>>
>>
>>
>> --
>>
>> Michael Cizmar
>>
>>
>>
>> *From: *ritika jain <ritikajain5263@gmail.com>
>> *Reply-To: *"user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> *Date: *Tuesday, May 5, 2020 at 4:34 AM
>> *To: *"user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
>> *Subject: *Illegal Seed URL
>>
>>
>>
>> Hi All,
>>
>>
>>
>> I am using ManifoldCF 2.14 with the Web crawler as the repository and
>> Elasticsearch as the output. I have specified a seed URL that is valid,
>> as it opens successfully in a browser.
>>
>> Say the URL is https://www.abc.com/societybusiness/entrepreneurship/?lang=en
>> <https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>
>> .
>>
>>
>>
>> It has a "?" query string in the URL.
>>
>> Am I doing anything wrong here?
>>
>>
>>
>> Thanks
>>
>> Ritika
>>
>>
>>
>>
>>
>
