manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Crawling behind an ISA proxy (iis 7.5)
Date Tue, 15 May 2012 15:59:37 GMT
Hi Rene,

You will need both NTLM auth (page auth, which you have already set
up), and Session auth (which you haven't yet set up).

In order to set up session-based auth, you should first identify the
set of pages that you want access to that are protected by a cookie
requirement.  You will need to write a regular expression that matches
these pages and ONLY these pages.  This regular expression gets entered
as the "URL regular expression" in the Session-based Access Credentials
part of the Access Credentials tab.  Then, click the Add button.
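
As a rough illustration of "these pages and ONLY these pages", the
sketch below checks a candidate expression against a few sample URLs.
The pattern itself and the use of Python's re.search are assumptions
made for the example; ManifoldCF evaluates the expression with Java
regular expressions, so the exact matching behaviour should be checked
against the connector documentation.

```python
import re

# Hypothetical pattern: anything on the host except the ISA logon DLL.
# This is an assumption for illustration, not a recommended setting.
PROTECTED = re.compile(r"^https://bb\.helo\.hanze\.nl/(?!CookieAuth\.dll)")


def is_protected(url: str) -> bool:
    """Return True if the URL would be treated as cookie-protected content."""
    return PROTECTED.search(url) is not None


samples = [
    "https://bb.helo.hanze.nl/webapps/portal/frameset.jsp",
    "https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3",
]
for url in samples:
    print(is_protected(url), url)
```

Trying the expression against both content URLs and logon URLs before
entering it makes it easier to spot a pattern that is too broad.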

The next thing you will need is to specify how the connector
recognizes pages that belong to the logon sequence.  The actual
sequence you need to understand is what happens in the browser when
you try to access a specific protected URL and you don't have the
right cookie.  You did not actually specify that; I think you are
presuming that you'd be entering directly through the logon page, but
that is not how it works.  The crawler will have a URL in mind and
will need access to the content of that URL.  It will fetch the URL,
and if the actual content is NOT fetched, we need to detect that
situation and consider it part of the logon sequence.

So let's pretend that what happens when the cookie is not present is
that you get a redirection to the logon page, instead of the actual
page content.  In that case, you would create a login sequence page
description consisting of the same URL regular expression that
describes the protected content pages, plus the "redirection" radio
button, plus a target URL regular expression that would match
"bb.helo.hanze.nl/CookieAuth.dll?GetLogon".  You then click the Add
button for login pages to add that description to the set of login
pages.
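
The redirection case described above can be sketched as a small check:
the fetched URL must match the content expression, the response must be
a redirect, and the redirect target must match the logon expression.
The function name, the two patterns and the list of status codes are
assumptions for illustration; this is not the connector's actual code.

```python
import re
from typing import Optional

# Hypothetical expressions mirroring the description above.
CONTENT_URL = re.compile(r"^https://bb\.helo\.hanze\.nl/(?!CookieAuth\.dll)")
LOGON_TARGET = re.compile(r"bb\.helo\.hanze\.nl/CookieAuth\.dll\?GetLogon")

REDIRECT_CODES = (301, 302, 303, 307)


def is_login_redirect(url: str, status: int, location: Optional[str]) -> bool:
    """True when a content fetch was answered with a redirect to the logon page."""
    if not CONTENT_URL.search(url):
        return False  # not a page we consider session-protected
    if status not in REDIRECT_CODES or location is None:
        return False  # real content (or an error) came back, not a redirect
    return LOGON_TARGET.search(location) is not None


page = "https://bb.helo.hanze.nl/webapps/portal/frameset.jsp"
logon = "https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3"
print(is_login_redirect(page, 302, logon))
print(is_login_redirect(page, 200, None))
```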

Next, the GetLogon page itself needs to be added as a login sequence
page.  The regular expression should match only
"bb.helo.hanze.nl/CookieAuth.dll?GetLogon".  The type of the page is
"form" because you said this was a form where you could fill in your
login credentials.  If there is only one form on the page you can
leave the regexp that matches the form name blank since that will
match everything.  Once you click "Add" for this page, you will have
the opportunity to fill in form names and values to post when the form
gets posted.
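
To make the "form names and values" concrete, here is a minimal sketch
that encodes the fields from the Logon request you quoted into a POST
body.  The field names and values come from your message; the helper
name is hypothetical, and the sketch does not settle which fields must
be entered by hand and which are picked up from the form itself.

```python
from urllib.parse import urlencode

# Field names and placeholder values as listed in the quoted Logon request.
LOGON_FORM_FIELDS = {
    "curl": "Z2F",
    "flags": "0",
    "forcedownlevel": "0",
    "formdir": "3",
    "trusted": "0",
    "username": "loginname",
    "password": "mypassword",
    "SubmitCreds": "Log On",
}


def build_post_body(fields: dict) -> str:
    """Encode form fields as application/x-www-form-urlencoded."""
    return urlencode(fields)


body = build_post_body(LOGON_FORM_FIELDS)
print(body)
```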

It was not clear from your description, once again, what happens after
the Logon page is posted.  If there is a special target page, you need
to include that also in the login sequence so that its content is not
taken.  If there is a redirection back to the original content page,
you'd include that redirection.

Hopefully this is beginning to make a bit of sense to you, but keep in
mind that this is the general picture and is not tied closely to your
actual site.  For example, the Javascript redirection you mentioned will
not be processed by ManifoldCF, but that does not matter: once the login
sequence has been chased to its end, ManifoldCF automatically goes back
to the original URL.  So all you need to do is make sure that all pages
that are part of that sequence are specified.
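
The overall flow can be pictured with a toy simulation.  Everything in
it is an assumption for illustration: the fake site, the cookie name
"session" and the control flow are simplified stand-ins, not how the
connector is implemented.  It only shows the idea that the crawler
starts from a content URL, follows the login sequence, and then
re-fetches the original URL.

```python
from typing import Dict, List, Optional, Tuple

LOGON_URL = "https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon"
POST_URL = "https://bb.helo.hanze.nl/CookieAuth.dll?Logon"


def fake_fetch(url: str, cookies: Dict[str, str],
               post: bool = False) -> Tuple[str, Optional[str], Dict[str, str]]:
    """Toy site: returns (kind, target, cookies_set_by_response)."""
    if url == POST_URL and post:
        # Posting credentials sets the cookie and redirects to the site root.
        return ("redirect", "https://bb.helo.hanze.nl/", {"session": "placeholder"})
    if url == LOGON_URL:
        return ("form", POST_URL, {})
    if "session" not in cookies:
        return ("redirect", LOGON_URL, {})
    return ("content", None, {})


def crawl_with_login(original_url: str, max_steps: int = 10) -> List[str]:
    """Follow the login sequence, then go back to the original URL."""
    cookies: Dict[str, str] = {}
    trace: List[str] = []
    url, post = original_url, False
    for _ in range(max_steps):
        kind, target, new_cookies = fake_fetch(url, cookies, post)
        cookies.update(new_cookies)
        trace.append(f"{'POST' if post else 'GET'} {url} -> {kind}")
        if kind == "content":
            return trace
        if kind == "form":
            url, post = target, True           # submit the logon form
        elif url == POST_URL:
            url, post = original_url, False    # sequence done: retry original
        else:
            url, post = target, False          # follow the redirect
    raise RuntimeError("login sequence did not terminate")


for step in crawl_with_login("https://bb.helo.hanze.nl/webapps/portal/frameset.jsp"):
    print(step)
```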

On the other hand, it's not clear that the code you have "protecting"
the site sets cookies any other way than through Javascript.  The
cookie that this Javascript actually sets is a really stupid
non-specific cookie, but unless it is set by the standard response
header method, I don't think it's going to wind up being set at all.
Can you confirm that this is the only way the cookie gets set?
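
To illustrate the distinction, the sketch below collects cookies the
way a plain HTTP client would: only from Set-Cookie response headers,
never by running script in the page body.  The helper name and the
shortened cookie value are placeholders; the attribute list follows the
response header in your message.

```python
from http.cookies import SimpleCookie
from typing import Dict, List


def cookies_seen_by_http_client(set_cookie_headers: List[str],
                                body: str) -> Dict[str, str]:
    """Collect cookies from Set-Cookie headers only.

    The body is deliberately ignored: a client that does not execute
    Javascript never sees cookies assigned through document.cookie.
    """
    jar = SimpleCookie()
    for header in set_cookie_headers:
        jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder value; the real cookie value is much longer.
header = 'noname="placeholder-value"; HttpOnly; Domain=.hanze.nl; secure; path=/'
script_body = '<script>document.cookie="cookies_enabled=yes";</script>'

seen = cookies_seen_by_http_client([header], script_body)
print(seen)
```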

Karl

On Tue, May 15, 2012 at 10:57 AM, Rene Nederhand <rene@nederhand.net> wrote:
> Hi Karl,
>
> Thank you so much for your detailed explanation. I am trying each
> step you've pointed out. Unfortunately, I cannot get this thing going.
> Hopefully, you can help me if I give you more detailed information.
>
> The sequence of steps is (when accessing https://bb.helo.hanze.nl):
>
> 1. https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
> This gives me indeed NTLM authentication. When I create a crawler that
> only crawls the above page I get a 200 response. So this works, no
> 401.
>
> 2. When I submit my username and password, this request is sent to the
> server. This is also the only form I'll ever see:
>
> https://bb.helo.hanze.nl/CookieAuth.dll?Logon (302)
> Request:
> curl    Z2F
> flags   0
> forcedownlevel  0
> formdir 3
> trusted 0
> username        loginname
> password        mypassword
> SubmitCreds     Log On
>
> 3. The response is a cookie being set with a redirect to the first url
> (but now with the cookie set)
>
> Response:
>        HTTP/1.1 302 Moved Temporarily
> Location        https://bb.helo.hanze.nl/
> Set-Cookie      noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9";
> HttpOnly; Domain=.hanze.nl; secure; path=/
> Content-Length  0
> Connection      close
>
> Request:
>        GET / HTTP/1.1
> Host    bb.helo.hanze.nl
> User-Agent      Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:12.0)
> Gecko/20100101 Firefox/12.0
> Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language en-us,en;q=0.5
> Accept-Encoding gzip, deflate
> Connection      keep-alive
> Referer https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
> Cookie  noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"
>
> 4. Lastly, a redirect is made to the Blackboard site (javascript check
> for cookie and redirect)
>
> Response:
> <HTML dir='ltr'><HEAD>
> <META HTTP-EQUIV="Pragma" CONTENT="no-cache"><META
> HTTP-EQUIV="Cache-Control" CONTENT="no-cache">
> <script language="Javascript">
>  cookie_name = "cookies_enabled";
>  document.cookie=cookie_name+"=yes";
>  if (!document.cookie) {
>    document.location.href="/nocookies.html";
>  }
>  document.cookie=cookie_name+"yes;expires=Thu, 01-Jan-1970 00:00:01 GMT";
> </script>
> <SCRIPT language="Javascript"><!--
> document.location.replace('https://bb.helo.hanze.nl/webapps/portal/frameset.jsp');
> //--></SCRIPT></HEAD>
> <BODY BGCOLOR='#FFFFFF' LINK='#000000' ALINK='#000000'>
> <br><br><br><br><div style="text-align: center;"><hr width='350' height='5'><br>
> <strong>You are being redirected to another page</strong>
> <p><strong>Please Wait...</strong><br><br><hr width='350' height='5'>
> <br><A HREF='https://bb.helo.hanze.nl/webapps/portal/frameset.jsp'><strong>Click
> here to access the page to which you are being
> forwarded.</strong></A></div>
> </BODY></HTML>
>
> Although the first form used NTLM authentication, this doesn't work
> out. Therefore, I would think that session based auth would work
> better as I can create each step myself. I still haven't a clue how to
> approach this. What do I fill in those boxes?
>
> Thanks for helping me.
>
> Cheers,
> René
>
>
>
>
> On Fri, May 11, 2012 at 4:26 PM, Karl Wright <daddywri@gmail.com> wrote:
>> Hi Rene,
>>
>> Crawling through a proxy is usually easy, but crawling a session-based
>> site is always a challenge.
>>
>> ISA proxies usually authenticate with NTLM.  So you will want to set
>> up your web connection with NTLM authentication in order to even be
>> able to reach the pages.  It's not clear that you've got that right
>> yet, because if you don't have it right you will get 401 errors back.
>> Getting this right is a prerequisite; you won't be able to proceed
>> until it is correct.  To see that you do, try a very limited crawl
>> that fetches ONLY the login page (or some other un-session-protected
>> content).  If you get a 401 you'll need to figure out what's not right
>> before proceeding.
>>
>> It sounds like the site may also be secured using session-based
>> authentication.  If a cookie is involved then you need to configure
>> session auth in order to get to any session-protected pages.  The
>> trick is that, for session-based auth, you need to fully understand
>> the sequence of pages and forms that happen when a user visits the
>> site and is granted the cookie(s) - the login process, what content
>> URLs are protected, what URLs are part of the login sequence, etc.
>> The end-user documentation describes this in some detail.  It can be a
>> challenge to get it all set up right.
>>
>> Finally, for SharePoint sites, if you are intending to index
>> documents, you might well find the SharePoint Connector a better
>> choice than trying to crawl the site with the web connector.
>>
>> Thanks,
>> Karl
>>
>> On Fri, May 11, 2012 at 10:13 AM, Rene Nederhand <rene@nederhand.net> wrote:
>>> Hi,
>>>
>>> I am trying to get ManifoldCF to crawl our electronic learning
>>> environment (Blackboard). To enable single sign-on, our institution
>>> has placed an ISA server as proxy before Blackboard.
>>> This is giving me a lot of problems.
>>>
>>> I've managed to get past the ISA server using session based
>>> authentication, but then I am stuck at a 401 error message. According
>>> to our architect, ISA is responsible for the communication with
>>> Blackboard and will set a cookie so Blackboard will know that a
>>> legitimate user is accessing its service. I think, ManifoldCF is not
>>> able to handle this cookie and hence is not able to access Blackboard.
>>> Am I right? If so, is there a possibility to get Blackboard indexed?
>>>
>>> By the way, the same authentication is used for our Sharepoint. I
>>> would like to index this as well....
>>>
>>> Any help on solving this problem is appreciated.
>>>
>>> Cheers,
>>>
>>> René
