manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Crawling behind an ISA proxy (iis 7.5)
Date Fri, 11 May 2012 14:26:41 GMT
Hi Rene,

Crawling through a proxy is usually easy, but crawling a session-based
site is always a challenge.

ISA proxies usually authenticate with NTLM, so you will want to set
up your web connection with NTLM authentication in order to even be
able to reach the pages.  The 401 you report suggests that this may
not be right yet, since a 401 is what you get back when it is wrong.
Getting this right is a prerequisite; you won't be able to proceed
until it is correct.  To verify it, try a very limited crawl
that fetches ONLY the login page (or some other un-session-protected
content).  If you get a 401 you'll need to figure out what's not right
before proceeding.
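
As a rough illustration of how to read the result of that one-page
test, here is a minimal Python sketch.  It is not part of ManifoldCF;
the function names and return strings are made up for this example,
and it assumes each challenge header carries a single scheme name at
the start of its value.

```python
def auth_schemes(headers, name):
    """Return the scheme named at the start of each `name` header value."""
    wanted = name.lower()
    return [value.split()[0].upper()
            for key, value in headers
            if key.lower() == wanted and value.strip()]


def diagnose(status, headers):
    """Classify the response from a single-page test fetch.

    `headers` is a list of (name, value) pairs, because a server may
    send several WWW-Authenticate headers, one per offered scheme.
    """
    if status in (401, 407):
        # 401 is challenged via WWW-Authenticate, 407 via Proxy-Authenticate.
        header = "WWW-Authenticate" if status == 401 else "Proxy-Authenticate"
        schemes = auth_schemes(headers, header)
        if "NTLM" in schemes:
            return "auth-failed: NTLM offered; recheck user, password, domain"
        return "auth-failed: offered " + (", ".join(schemes) or "no scheme")
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    return "other"


# Hypothetical response from the proxy: two challenge headers.
print(diagnose(401, [("WWW-Authenticate", "Negotiate"),
                     ("WWW-Authenticate", "NTLM")]))
```

Only once the test page comes back as "ok" is it worth moving on to
the session-protected pages.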

It sounds like the site may also be secured using session-based
authentication.  If a cookie is involved then you need to configure
session auth in order to get to any session-protected pages.  The
trick is that, for session-based auth, you need to fully understand
the sequence of pages and forms a user passes through when visiting
the site and being granted the cookie(s): the login process, which
content URLs are protected, which URLs are part of the login
sequence, etc.
The end-user documentation describes this in some detail.  It can be a
challenge to get it all set up right.
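
One way to get that understanding is to write the sequence down
before touching the connector configuration.  The Python sketch below
is only a way of organizing those notes, not ManifoldCF's
configuration format; every URL, pattern, field name, and class name
in it is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoginStep:
    url_pattern: str                   # regex for a URL in the login sequence
    kind: str                          # "redirect" or "form"
    form_fields: Optional[dict] = None  # values to submit for "form" steps


# Hypothetical login sequence, in the order a browser would see it.
LOGIN_SEQUENCE = [
    LoginStep(r"^https://portal\.example\.edu/$", "redirect"),
    LoginStep(r"^https://portal\.example\.edu/login\.asp", "form",
              {"username": "crawler", "password": "<secret>"}),
]

# Hypothetical pattern for content that needs the session cookie.
PROTECTED = r"^https://portal\.example\.edu/courses/"


def classify(url):
    """Say whether a URL is a login step, protected content, or open."""
    for index, step in enumerate(LOGIN_SEQUENCE):
        if re.search(step.url_pattern, url):
            return ("login", index)
    if re.search(PROTECTED, url):
        return ("protected", None)
    return ("open", None)


for url in ("https://portal.example.edu/",
            "https://portal.example.edu/login.asp?next=/courses/",
            "https://portal.example.edu/courses/101"):
    print(url, classify(url))
```

If every URL you see in a browser trace of the login falls cleanly
into one of these categories, you have most of what the session auth
setup will ask you for.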

Finally, for SharePoint sites, if you are intending to index
documents, you might well find the SharePoint Connector a better
choice than trying to crawl the site with the web connector.

Thanks,
Karl

On Fri, May 11, 2012 at 10:13 AM, Rene Nederhand <rene@nederhand.net> wrote:
> Hi,
>
> I am trying to get ManifoldCF to crawl our electronic learning
> environment (Blackboard). To enable single sign-on, our institution
> has placed an ISA server as a proxy in front of Blackboard.
> This is giving me a lot of problems.
>
> I've managed to get past the ISA server using session-based
> authentication, but then I am stuck at a 401 error message. According
> to our architect, ISA is responsible for the communication with
> Blackboard and will set a cookie so Blackboard will know that a
> legitimate user is accessing its service. I think ManifoldCF is not
> able to handle this cookie and hence is not able to access Blackboard.
> Am I right? If so, is there a possibility to get Blackboard indexed?
>
> By the way, the same authentication is used for our SharePoint. I
> would like to index this as well.
>
> Any help on solving this problem is appreciated.
>
> Cheers,
>
> René
