manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: basic question of Web crawler setting of "Include in index"
Date Sat, 28 Jun 2014 04:29:33 GMT
Hi Karl.

Thanks a lot for your help!

I now understand how the setting works and this solved the problem!


Again, thanks a lot.

Best regards.

Shigeki


2014-06-27 20:46 GMT+09:00 Karl Wright <daddywri@gmail.com>:

> Hi Shigeki,
>
> The code doesn't care about the query string.  It uses "find()" anyway,
> which means you don't have to have the leading ".*" and trailing ".*":
>
> >>>>>>
>       // First, verify that the url matches one of the patterns in the include list.
>       int i = 0;
>       while (i < includeIndexPatterns.size())
>       {
>         Pattern p = includeIndexPatterns.get(i);
>         Matcher m = p.matcher(url);
>         if (m.find())
>           break;
>         i++;
>       }
>       if (i == includeIndexPatterns.size())
>       {
>         if (Logging.connectors.isDebugEnabled())
>           Logging.connectors.debug("WEB: Url '"+url+"' is not indexable because no include patterns match it");
>         return false;
>       }
>
>       // Now make sure it's not in the exclude list.
>       i = 0;
>       while (i < excludeIndexPatterns.size())
>       {
>         Pattern p = excludeIndexPatterns.get(i);
>         Matcher m = p.matcher(url);
>         if (m.find())
>         {
>           if (Logging.connectors.isDebugEnabled())
>             Logging.connectors.debug("WEB: Url '"+url+"' is not indexable because exclude pattern '"+p.toString()+"' matched it");
>           return false;
>         }
>         i++;
>       }
>
>       return true;
> <<<<<<
>
> If you turn on connector debugging, you may see more reasons why the url
> is being rejected in the log.
>
> Thanks,
> Karl
>
>
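
For readers who want to try the find() behaviour outside ManifoldCF, here is a minimal standalone sketch of the same include/exclude logic. It is not the connector's actual code: the class name IndexFilterSketch, the isIndexable helper, and the example URL and patterns are all made up for illustration. It only shows that a bare pattern such as "/docs/" is enough with find(), whereas matches() would need the leading and trailing ".*".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical standalone sketch; names are illustrative, not ManifoldCF API.
class IndexFilterSketch {

  // Returns true if at least one include pattern is found somewhere in the
  // url and no exclude pattern is found anywhere in it.
  static boolean isIndexable(String url, List<Pattern> includes, List<Pattern> excludes) {
    boolean included = false;
    for (Pattern p : includes) {
      // find() succeeds on a partial match, so no ".*" wrapping is needed.
      if (p.matcher(url).find()) {
        included = true;
        break;
      }
    }
    if (!included)
      return false; // an empty include list also rejects everything

    for (Pattern p : excludes) {
      if (p.matcher(url).find())
        return false;
    }
    return true;
  }

  public static void main(String[] args) {
    String url = "http://example.com/docs/page.html?id=1"; // assumed example URL
    Pattern bare = Pattern.compile("/docs/");

    // find() looks for the pattern anywhere in the string.
    System.out.println(bare.matcher(url).find());    // true
    // matches() requires the whole string to match the pattern.
    System.out.println(bare.matcher(url).matches()); // false

    List<Pattern> includes = Arrays.asList(bare);
    List<Pattern> noExcludes = Collections.emptyList();
    List<Pattern> excludePage = Arrays.asList(Pattern.compile("page"));

    System.out.println(isIndexable(url, includes, noExcludes));  // true
    System.out.println(isIndexable(url, includes, excludePage)); // false
  }
}
```

Note that, as in the quoted code, an exclude match always wins over an include match, and a url is rejected when no include pattern is found in it.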
