manifoldcf-user mailing list archives

From Rene Nederhand <r...@nederhand.net>
Subject Re: Crawling behind an ISA proxy (iis 7.5)
Date Tue, 15 May 2012 14:57:55 GMT
Hi Karl,

Thank you so much for your detailed explanation. I have been trying each
step you pointed out. Unfortunately, I cannot get it to work.
Hopefully you can help me if I give you more detailed information.

The sequence of steps is (when accessing https://bb.helo.hanze.nl):

1. https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
This does indeed give me NTLM authentication. When I create a crawler that
only crawls the above page, I get a 200 response. So this works, no
401.

2. When I submit my username and password, the following request is sent
to the server. This is also the only form I ever see:

https://bb.helo.hanze.nl/CookieAuth.dll?Logon (302)
Request:
curl	Z2F
flags	0
forcedownlevel	0
formdir	3
trusted	0
username	loginname
password	mypassword
SubmitCreds	Log On
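
The form submission in step 2 can be reproduced outside a browser. Below is a minimal Python sketch, assuming the field names are exactly those captured above and that the form is sent as a URL-encoded POST (the capture shows neither the method nor the Content-Type, so both are assumptions). `build_logon_request` is a hypothetical helper name, and the request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

LOGON_URL = "https://bb.helo.hanze.nl/CookieAuth.dll?Logon"


def build_logon_request(username, password):
    """Build (but do not send) the POST for the ISA logon form."""
    # Field names and fixed values are taken from the captured request;
    # POST with URL-encoding is an assumption.
    fields = [
        ("curl", "Z2F"),
        ("flags", "0"),
        ("forcedownlevel", "0"),
        ("formdir", "3"),
        ("trusted", "0"),
        ("username", username),
        ("password", password),
        ("SubmitCreds", "Log On"),
    ]
    body = urlencode(fields).encode("ascii")
    return Request(LOGON_URL, data=body, method="POST")


req = build_logon_request("loginname", "mypassword")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Keeping the fields in a list of pairs preserves the order seen in the capture, which makes the encoded body easy to compare against a browser trace.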

3. The response sets a cookie and redirects to the first URL
(but now with the cookie set)

Response:
	HTTP/1.1 302 Moved Temporarily
Location	https://bb.helo.hanze.nl/
Set-Cookie	noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"; HttpOnly; Domain=.hanze.nl; secure; path=/
Content-Length	0
Connection	close

Request:
	GET / HTTP/1.1
Host	bb.helo.hanze.nl
User-Agent	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:12.0) Gecko/20100101 Firefox/12.0
Accept	text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language	en-us,en;q=0.5
Accept-Encoding	gzip, deflate
Connection	keep-alive
Referer	https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
Cookie	noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"
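
In the exchange above, the client has to take the name and value from the Set-Cookie header and echo them back in a Cookie header on the next request. A minimal Python sketch of that step follows. It assumes the cookie value contains no semicolon, and it uses a shortened, made-up token in place of the real one. Both function names are hypothetical.

```python
def cookie_pair_from_set_cookie(set_cookie_value):
    """Return (name, value) from a Set-Cookie header value, ignoring attributes."""
    # Attributes such as HttpOnly, Domain, secure and path follow the first ';'.
    # Assumption: the value itself does not contain ';'.
    pair = set_cookie_value.split(";", 1)[0].strip()
    name, _, value = pair.partition("=")
    return name.strip(), value.strip()


def cookie_header(name, value):
    """Format the pair the way it appears in the follow-up request."""
    return f"{name}={value}"


# Shortened, made-up value standing in for the long token in the capture.
set_cookie = 'noname="EXAMPLE-TOKEN/abc+def"; HttpOnly; Domain=.hanze.nl; secure; path=/'
name, value = cookie_pair_from_set_cookie(set_cookie)
print("Cookie:", cookie_header(name, value))
```

The surrounding double quotes are kept as part of the value, since the captured follow-up request sends them back unchanged.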

4. Lastly, a redirect is made to the Blackboard site (JavaScript check
for cookie and redirect)

Response:
<HTML dir='ltr'><HEAD>
<META HTTP-EQUIV="Pragma" CONTENT="no-cache"><META
HTTP-EQUIV="Cache-Control" CONTENT="no-cache">
<script language="Javascript">
  cookie_name = "cookies_enabled";
  document.cookie=cookie_name+"=yes";
  if (!document.cookie) {
    document.location.href="/nocookies.html";
  }
  document.cookie=cookie_name+"yes;expires=Thu, 01-Jan-1970 00:00:01 GMT";
</script>
<SCRIPT language="Javascript"><!--
document.location.replace('https://bb.helo.hanze.nl/webapps/portal/frameset.jsp');
//--></SCRIPT></HEAD>
<BODY BGCOLOR='#FFFFFF' LINK='#000000' ALINK='#000000'>
<br><br><br><br><div style="text-align: center;"><hr width='350'
height='5'><br>
<strong>You are being redirected to another page</strong>
<p><strong>Please Wait...</strong><br><br><hr width='350'
height='5'>
<br><A HREF='https://bb.helo.hanze.nl/webapps/portal/frameset.jsp'><strong>Click
here to access the page to which you are being
forwarded.</strong></A></div>
</BODY></HTML>
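
The redirect in this response happens in JavaScript, so a client that does not execute scripts has to find the target URL some other way. The page also carries the same URL in a plain `<A HREF>` fallback link. As a minimal Python sketch, assuming the redirect always uses the `document.location.replace('...')` form shown above, the target can be pulled out with a regular expression. `find_js_redirect` is a hypothetical helper name.

```python
import re

# Matches document.location.replace('<url>') with single or double quotes.
REDIRECT_RE = re.compile(
    r"document\.location\.replace\(\s*['\"]([^'\"]+)['\"]\s*\)"
)


def find_js_redirect(html):
    """Return the URL passed to document.location.replace(), or None."""
    match = REDIRECT_RE.search(html)
    return match.group(1) if match else None


# Excerpt of the response body shown above.
page = """<SCRIPT language="Javascript"><!--
document.location.replace('https://bb.helo.hanze.nl/webapps/portal/frameset.jsp');
//--></SCRIPT>"""

print(find_js_redirect(page))
```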

Although the first form used NTLM authentication, this doesn't work
out. Therefore, I would think that session-based auth would work
better, as I can define each step myself. I still don't have a clue how
to approach this. What do I fill in those boxes?

Thanks for helping me.

Cheers,
René




On Fri, May 11, 2012 at 4:26 PM, Karl Wright <daddywri@gmail.com> wrote:
> Hi Rene,
>
> Crawling through a proxy is usually easy, but crawling a session-based
> site is always a challenge.
>
> ISA proxies usually authenticate with NTLM.  So you will want to set
> up your web connection with NTLM authentication in order to even be
> able to reach the pages.  It's not clear that you've got that right
> yet, because if you don't have it right you will get 401 errors back.
> Getting this right is a prerequisite; you won't be able to proceed
> until it is correct.  To see that you do, try a very limited crawl
> that fetches ONLY the login page (or some other un-session-protected
> content).  If you get a 401 you'll need to figure out what's not right
> before proceeding.
>
> It sounds like the site may also be secured using session-based
> authentication.  If a cookie is involved then you need to configure
> session auth in order to get to any session-protected pages.  The
> trick is that, for session-based auth, you need to fully understand
> the sequence of pages and forms that happen when a user visits the
> site and is granted the cookie(s) - the login process, what content
> URLs are protected, what URLs are part of the login sequence, etc.
> The end-user documentation describes this in some detail.  It can be a
> challenge to get it all set up right.
>
> Finally, for SharePoint sites, if you are intending to index
> documents, you might well find the SharePoint Connector a better
> choice than trying to crawl the site with the web connector.
>
> Thanks,
> Karl
>
> On Fri, May 11, 2012 at 10:13 AM, Rene Nederhand <rene@nederhand.net> wrote:
>> Hi,
>>
>> I am trying to get ManifoldCF crawl our electronic learning
>> environment (Blackboard). To enable single sign-on, our institution
>> has placed an ISA server as proxy before Blackboard.
>> This is giving me a lot of problems.
>>
>> I've managed to get past the ISA server using session-based
>> authentication, but then I am stuck at a 401 error message. According
>> to our architect, ISA is responsible for the communication with
>> Blackboard and will set a cookie so Blackboard will know that a
>> legitimate user is accessing its service. I think ManifoldCF is not
>> able to handle this cookie and hence is not able to access Blackboard.
>> Am I right? If so, is there a possibility to get Blackboard indexed?
>>
>> By the way, the same authentication is used for our SharePoint. I
>> would like to index this as well....
>>
>> Any help on solving this problem is appreciated.
>>
>> Cheers,
>>
>> René
