manifoldcf-user mailing list archives

From ritika jain <ritikajain5...@gmail.com>
Subject Re: Illegal Seed URL
Date Wed, 06 May 2020 07:54:51 GMT
Hi Michael,

Yes, I tested this in debug mode and also tried one more scenario.
Whenever the seed URL is something like this:
https://www.abc.com/societybusiness/entrepreneurship/?lang=en
<https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>,
our web connector Java code returns null in this function when m.find()
is executed, which gives a null document identifier and thus the
"Illegal seed URL" error:

    /** Check if the document identifier is legal.
    */
    public boolean isDocumentLegal(String url)
    {
      // First, verify that the url matches one of the patterns in the include list.
      int i = 0;
      while (i < includePatterns.size())
      {
        Pattern p = includePatterns.get(i);
        Matcher m = p.matcher(url);
        if (m.find())
          break;
        i++;
      }
      // (rest of the method not shown)

Whereas when the seed URL is something like this:
https://www.abc.com/societybusiness/entrepreneurship/ , this code
passes without failing.
Can anybody help me understand why the same code behaves differently?
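
To narrow this down, the include-pattern loop can be reproduced outside ManifoldCF. The following is only a minimal sketch: the class name IncludePatternCheck, the helper matchesAnyInclude, and both example patterns are assumptions for illustration, not the connector's actual code or the patterns configured in my job. It relies on the fact that Matcher.find() searches for a match anywhere in the string, so a trailing query string should only change the result if a pattern is anchored (for example with $).

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludePatternCheck {
  // Hypothetical helper mirroring the loop in isDocumentLegal():
  // returns true as soon as any include pattern is found in the URL.
  static boolean matchesAnyInclude(List<Pattern> includePatterns, String url) {
    for (Pattern p : includePatterns) {
      Matcher m = p.matcher(url);
      if (m.find()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String plain = "https://www.abc.com/societybusiness/entrepreneurship/";
    String withQuery = plain + "?lang=en";

    // Assumed pattern 1: unanchored, so find() succeeds on any URL containing it.
    List<Pattern> unanchored =
        List.of(Pattern.compile("https://www\\.abc\\.com/societybusiness/"));
    // Assumed pattern 2: anchored with $, so the URL must end right after the slash.
    List<Pattern> anchored = List.of(Pattern.compile("entrepreneurship/$"));

    System.out.println("unanchored, plain:      " + matchesAnyInclude(unanchored, plain));
    System.out.println("unanchored, with query: " + matchesAnyInclude(unanchored, withQuery));
    System.out.println("anchored, plain:        " + matchesAnyInclude(anchored, plain));
    System.out.println("anchored, with query:   " + matchesAnyInclude(anchored, withQuery));
  }
}
```

If the real include patterns behave like the unanchored case for both URLs, the difference is probably not coming from m.find() itself but from something applied to the URL before this check.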

Thanks
Ritika

On Tue, May 5, 2020 at 6:09 PM Michael Cizmar <michael.cizmar@mcplusa.com>
wrote:

> Hi Ritika,
>
> There are several reasons that you could get that.  Have you started
> manifoldcf in debug mode?  If so, what’s the output just before that
> statement in the logs?
>
> --
> Michael Cizmar
>
> *From: *ritika jain <ritikajain5263@gmail.com>
> *Reply-To: *"user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
> *Date: *Tuesday, May 5, 2020 at 4:34 AM
> *To: *"user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
> *Subject: *Illegal Seed URL
>
> Hi All,
>
> I am using ManifoldCF 2.14 with the Web crawler as the repository and
> Elasticsearch as the output. I have specified a seed URL that is valid,
> since it opens successfully in a browser.
>
> Say the URL is https://www.abc.com/societybusiness/entrepreneurship/?lang=en
> <https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>.
>
> It has a ? query string in the URL.
>
> Am I doing anything wrong here?
>
> Thanks
> Ritika
