manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: basic question of Web crawler setting of "Include in index"
Date Fri, 27 Jun 2014 11:46:59 GMT
Hi Shigeki,

The code doesn't care about the query string.  It uses "find()" anyway,
which matches a pattern anywhere in the URL, so you don't need the leading
".*" and trailing ".*":

>>>>>>
      // First, verify that the url matches one of the patterns in the include list.
      int i = 0;
      while (i < includeIndexPatterns.size())
      {
        Pattern p = includeIndexPatterns.get(i);
        Matcher m = p.matcher(url);
        if (m.find())
          break;
        i++;
      }
      if (i == includeIndexPatterns.size())
      {
        if (Logging.connectors.isDebugEnabled())
          Logging.connectors.debug("WEB: Url '"+url+"' is not indexable because no include patterns match it");
        return false;
      }

      // Now make sure it's not in the exclude list.
      i = 0;
      while (i < excludeIndexPatterns.size())
      {
        Pattern p = excludeIndexPatterns.get(i);
        Matcher m = p.matcher(url);
        if (m.find())
        {
          if (Logging.connectors.isDebugEnabled())
            Logging.connectors.debug("WEB: Url '"+url+"' is not indexable because exclude pattern '"+p.toString()+"' matched it");
          return false;
        }
        i++;
      }

      return true;
<<<<<<
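
To illustrate the difference, here is a minimal standalone sketch (not the connector's actual code) that contrasts find() with matches() and repeats the include/exclude check from the excerpt above. The class name, the isIndexable() helper, and the example URL and patterns are all made up for illustration; only java.util.regex is used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of ManifoldCF.
class IndexPatternSketch {

    // Assumed helper mirroring the quoted logic: a URL is indexable when at
    // least one include pattern is found in it and no exclude pattern is.
    static boolean isIndexable(String url, List<Pattern> includes, List<Pattern> excludes) {
        boolean included = false;
        for (Pattern p : includes) {
            if (p.matcher(url).find()) {   // substring-style match, as in the excerpt
                included = true;
                break;
            }
        }
        if (!included) {
            return false;
        }
        for (Pattern p : excludes) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String url = "http://example.com/docs/page.html";   // made-up URL
        Pattern bare = Pattern.compile("/docs/");
        Pattern padded = Pattern.compile(".*/docs/.*");

        // find() succeeds if the pattern occurs anywhere in the input.
        System.out.println(bare.matcher(url).find());      // true
        // matches() requires the pattern to cover the entire input.
        System.out.println(bare.matcher(url).matches());   // false
        System.out.println(padded.matcher(url).matches()); // true

        List<Pattern> includes = Arrays.asList(bare);
        List<Pattern> excludes = Arrays.asList(Pattern.compile("\\.pdf$"));
        System.out.println(isIndexable(url, includes, excludes)); // true
    }
}
```

With find(), the bare pattern "/docs/" behaves the same as ".*/docs/.*" would under matches(), which is why the padding is unnecessary here.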

If you turn on connector debugging, you may see more reasons why the url is
being rejected in the log.

Thanks,
Karl
