manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF and Kerberos/Basic Authentication
Date Fri, 07 Jun 2013 17:29:00 GMT
Fix checked into trunk.
Karl


On Fri, Jun 7, 2013 at 12:42 PM, Karl Wright <daddywri@gmail.com> wrote:

> I created the ticket: CONNECTORS-707.
>
>
>
> On Fri, Jun 7, 2013 at 12:16 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> I looked at the ElasticSearch connector, and it's going to treat these
>> extensions as being "" (empty string).  So your list of allowed extensions
>> will have to include "" if such documents are to be ingested.
>>
>> Checking now to see if in fact you can just add a blank line to the list
>> of extensions to get this to happen... it looks like you can't:
>>
>> >>>>>>
>>       while ((line = br.readLine()) != null)
>>       {
>>         line = line.trim();
>>         if (line.length() > 0)
>>           set.add(line);
>>       }
>> <<<<<<
>>
>> So, the ElasticSearch connector in its infinite wisdom excludes all
>> documents that have no extension.  Hmm.
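One possible shape for a fix (a sketch only, not necessarily how CONNECTORS-707 was actually resolved) would be to treat a designated token in the extension list, say a lone ".", as standing for the empty extension:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class ExtensionListParser {
  /** Parse an extension list, treating a line consisting of a lone "."
      as the empty extension, so documents with no extension can be
      admitted. Hypothetical fix; the "." convention is an assumption. */
  public static Set<String> parse(String config) {
    Set<String> set = new HashSet<String>();
    try (BufferedReader br = new BufferedReader(new StringReader(config))) {
      String line;
      while ((line = br.readLine()) != null) {
        line = line.trim();
        if (line.equals("."))
          set.add("");                 // admit documents with no extension
        else if (line.length() > 0)
          set.add(line);
      }
    } catch (IOException e) {
      throw new RuntimeException(e);   // cannot happen for in-memory input
    }
    return set;
  }
}
```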
>>
>> Can you open a ticket for this problem?  I'm not quite sure yet how to
>> address it, but clearly this needs to be fixed.
>>
>> Karl
>>
>>
>>
>> On Fri, Jun 7, 2013 at 12:07 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> The extension of a document comes from the URL.  The URLs listed
>>> in your previous mail don't appear to have any extension at all.
>>>
>>> The code here from the web connector rejects documents for
>>> various reasons, but does not log which check failed:
>>>
>>> >>>>>>
>>>     if (cache.getResponseCode(documentIdentifier) != 200)
>>>       return false;
>>>
>>>     if
>>> (activities.checkLengthIndexable(cache.getDataLength(documentIdentifier))
>>> == false)
>>>       return false;
>>>
>>>     if (activities.checkURLIndexable(documentIdentifier) == false)
>>>       return false;
>>>
>>>     if (filter.isDocumentIndexable(documentIdentifier) == false)
>>>       return false;
>>>
>>> <<<<<<
>>>
>>> All you would see if any one of these conditions failed would be:
>>>
>>>           if (Logging.connectors.isDebugEnabled())
>>>             Logging.connectors.debug("WEB: Decided not to ingest
>>> '"+documentIdentifier+"' because it did not match ingestability criteria");
>>>
>>> Do you see that in the log?
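To pinpoint which of the four conditions fired, one could imagine a diagnostic helper along these lines; this is a sketch, not code from the connector, and the check names are illustrative:

```java
public class IngestCheckLogger {
  /** Given parallel arrays of check names and their boolean results,
      return a debug message naming the first failed ingestability check,
      or null if all passed. Sketch only; the real web connector logs a
      single catch-all "did not match ingestability criteria" message. */
  public static String firstFailure(String documentIdentifier,
                                    String[] names, boolean[] results) {
    for (int i = 0; i < names.length; i++) {
      if (!results[i])
        return "WEB: Decided not to ingest '" + documentIdentifier
             + "' because check failed: " + names[i];
    }
    return null;  // all checks passed
  }
}
```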
>>>
>>> Also, bear in mind that since the crawler is incremental, you may need
>>> to kick it to make it retry all this so you get debugging output.  You can
>>> click the "reingest all" link on your output connection to make that
>>> happen...
>>>
>>> Karl
>>>
>>>
>>> On Fri, Jun 7, 2013 at 11:52 AM, TC Tobin-Campbell <TC@epic.com> wrote:
>>>
>>>>  I took a look at the output connection, and didn’t see anything in
>>>> there that looked like it would cause any issues. I’m including all of the
>>>> default MIME types and file extensions. This should just be HTML, I would think.
>>>>
>>>> Here’s what I’m seeing in the DEBUG output. It seems like we are
>>>> starting the extraction, but then just aren’t doing anything with it?
>>>> Seems weird.
>>>>
>>>> DEBUG 2013-06-07 10:40:27,888 (Worker thread '24') - WEB: Waiting to
>>>> start getting a connection to http://10.8.159.161:80
>>>> DEBUG 2013-06-07 10:40:27,888 (Worker thread '24') - WEB: Attempting to
>>>> get connection to http://10.8.159.161:80 (0 ms)
>>>> DEBUG 2013-06-07 10:40:27,888 (Worker thread '24') - WEB: Successfully
>>>> got connection to http://10.8.159.161:80 (0 ms)
>>>> DEBUG 2013-06-07 10:40:27,889 (Worker thread '20') - WEB: Waiting to
>>>> start getting a connection to http://10.8.159.161:80
>>>> DEBUG 2013-06-07 10:40:27,889 (Worker thread '20') - WEB: Attempting to
>>>> get connection to http://10.8.159.161:80 (0 ms)
>>>> DEBUG 2013-06-07 10:40:27,889 (Worker thread '20') - WEB: Successfully
>>>> got connection to http://10.8.159.161:80 (0 ms)
>>>> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: Waiting for
>>>> an HttpClient object
>>>> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: For
>>>> http://wiki/main/EpicSearch/Test, discovered matching authentication
>>>> credentials
>>>> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: For
>>>> http://wiki/main/EpicSearch/Test, setting virtual host to wiki
>>>> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: Got an
>>>> HttpClient object after 0 ms.
>>>> DEBUG 2013-06-07 10:40:27,893 (Worker thread '20') - WEB: Get method
>>>> for '/main/EpicSearch/Test'
>>>> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: Waiting for
>>>> an HttpClient object
>>>> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: For
>>>> http://wiki.epic.com/main/EpicSearch/Test, discovered matching
>>>> authentication credentials
>>>> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: For
>>>> http://wiki.epic.com/main/EpicSearch/Test, setting virtual host to
>>>> wiki.epic.com
>>>> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: Got an
>>>> HttpClient object after 0 ms.
>>>> DEBUG 2013-06-07 10:40:27,896 (Worker thread '24') - WEB: Get method
>>>> for '/main/EpicSearch/Test'
>>>> WARN 2013-06-07 10:40:27,900 (Thread-2185) - NEGOTIATE authentication
>>>> error: Invalid name provided (Mechanism level: Could not load configuration
>>>> file C:\Windows\krb5.ini (The system cannot find the file specified))
>>>> WARN 2013-06-07 10:40:27,900 (Thread-2188) - NEGOTIATE authentication
>>>> error: Invalid name provided (Mechanism level: Could not load configuration
>>>> file C:\Windows\krb5.ini (The system cannot find the file specified))
>>>> DEBUG 2013-06-07 10:40:28,378 (Thread-2185) - WEB: Performing a read
>>>> wait on bin 'wiki' of 128 ms.
>>>> DEBUG 2013-06-07 10:40:28,506 (Thread-2185) - WEB: Performing a read
>>>> wait on bin 'wiki' of 50 ms.
>>>> DEBUG 2013-06-07 10:40:28,556 (Thread-2185) - WEB: Performing a read
>>>> wait on bin 'wiki' of 64 ms.
>>>> DEBUG 2013-06-07 10:40:28,613 (Thread-2188) - WEB: Performing a read
>>>> wait on bin 'wiki.epic.com' of 126 ms.
>>>> DEBUG 2013-06-07 10:40:28,620 (Thread-2185) - WEB: Performing a read
>>>> wait on bin 'wiki' of 47 ms.
>>>> INFO 2013-06-07 10:40:28,682 (Worker thread '20') - WEB: FETCH URL|
>>>> http://wiki/main/EpicSearch/Test|1370619627893+787|200|14438|
>>>> DEBUG 2013-06-07 10:40:28,682 (Worker thread '20') - WEB: Document '
>>>> http://wiki/main/EpicSearch/Test' is text, with encoding 'utf-8'; link
>>>> extraction starting
>>>>
>>>>
>>>> *Followed by lots of these, which seems appropriate:*
>>>>
>>>> DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: Url '
>>>> http://wiki/mediawiki/main/index.php?action=edit&title=EpicSearch/Test'
>>>> is illegal because no include patterns match it
>>>> DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: In html
>>>> document 'http://wiki/main/EpicSearch/Test', found an unincluded URL
>>>> '/mediawiki/main/index.php?title=EpicSearch/Test&action=edit'
>>>> DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: Url '
>>>> http://wiki/mediawiki/main/index.php?action=edit&title=EpicSearch/Test'
>>>> is illegal because no include patterns match it
>>>> DEBUG 2013-06-07 10:40:28,683 (Worker thread '20') - WEB: In html
>>>> document 'http://wiki/main/EpicSearch/Test', found an unincluded URL
>>>> '/mediawiki/main/index.php?title=EpicSearch/Test&action=edit'
>>>>
>>>>
>>>> *TC Tobin-Campbell* | Technical Services | Willow | *Epic* | (608)
>>>> 271-9000
>>>>
>>>>
>>>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>>>> *Sent:* Friday, June 07, 2013 9:49 AM
>>>>
>>>> *To:* user@manifoldcf.apache.org
>>>> *Subject:* Re: ManifoldCF and Kerberos/Basic Authentication
>>>>
>>>>
>>>> Hi TC,
>>>>
>>>> The fact that the fetch is successful means that the URL is included
>>>> (and not excluded).  The fact that it doesn't mention a robots exclusion
>>>> means that robots.txt is happy with it.  But it could well be that:
>>>>
>>>> (a) the MIME type is one that your ElasticSearch connection is excluding;
>>>>
>>>> (b) the extension is one that your ElasticSearch connection is excluding.
>>>>
>>>> I would check your output connection, and if that doesn't help, turn on
>>>> connector debugging (in properties.xml, set the property
>>>> "org.apache.manifoldcf.connectors" to "DEBUG").  Then you will see output
>>>> that describes the consideration process the web connector goes through
>>>> for each document.
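For reference, that property would look something like this in properties.xml (assuming the standard ManifoldCF `<property>` syntax):

```xml
<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
```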
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, Jun 7, 2013 at 10:43 AM, TC Tobin-Campbell <TC@epic.com> wrote:
>>>>
>>>> Apologies for the delay here, Karl. I was able to get this up and
>>>> running, and the authentication is working. Thanks for getting that in so
>>>> quickly!
>>>>
>>>> I do have a new issue, though. I have an output connection to
>>>> Elasticsearch set up for this job.
>>>>
>>>> I can see that the crawler is in fact crawling the wiki, and the
>>>> fetches are all working great. However, it doesn’t seem to be attempting to
>>>> send the pages to the index.
>>>>
>>>>
>>>> I’m not seeing anything in the Elasticsearch logs, so it appears we’re
>>>> just not sending anything to Elasticsearch. Could this be related to the
>>>> change you made? Or is this a completely separate problem?
>>>>
>>>>
>>>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>>>> *Sent:* Friday, May 24, 2013 12:50 PM
>>>> *To:* user@manifoldcf.apache.org
>>>> *Subject:* Re: ManifoldCF and Kerberos/Basic Authentication
>>>>
>>>>
>>>> I had a second so I finished this.  Trunk now has support for basic
>>>> auth.  You enter the credentials on the server tab underneath the API
>>>> credentials.  Please give it a try and let me know if it works for you.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, May 24, 2013 at 11:28 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>> CONNECTORS-692.  I will probably look at this over the weekend.
>>>>
>>>> Karl
>>>>
>>>> On Fri, May 24, 2013 at 11:26 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>> Hi TC,
>>>>
>>>> Unless I'm very much mistaken, there are no Apache kerberos session
>>>> cookies being used on your site, so it should be a straightforward matter
>>>> to supply basic auth credentials to your Apache mod-auth-kerb module for
>>>> all pages during crawling.
>>>>
>>>> I'll create a ticket for this.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, May 24, 2013 at 11:14 AM, TC Tobin-Campbell <TC@epic.com>
>>>> wrote:
>>>>
>>>> Hi Karl,
>>>>
>>>> Here’s what I know so far.
>>>>
>>>> Our module is configured to use two auth methods: Negotiate and Basic.
>>>> In most cases, we use Negotiate, but I’m guessing you’d prefer Basic.
>>>>
>>>> Here’s an example header.
>>>>
>>>>
>>>> GET / HTTP/1.1
>>>> Host: wiki.epic.com
>>>> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101
>>>> Firefox/20.0
>>>> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>>>> Accept-Language: en-US,en;q=0.5
>>>> Accept-Encoding: gzip, deflate
>>>> Cookie: wooTracker=QOMVLXDIC6OGOUXMGST1O54HYW573NNC;
>>>> .EPICASPXAUTHQA=FA94C945F613DACB9341384EBB1C28C52CFC52558E606FC2F880DD5BA811BE7E94301C7A0A1990FAC2E119AABB8591EC975059A2B8169BEA9FC525D0577F3C0EC56DC29C28880D23E0790AD890024FB57A338981606774259656B6971556645B095778115ADFE6B9B434970869C4B546A59A61B2CDEF0C0A5B23E80BB1D1E3D3D567E4C113D9E7B32D137FDEE65E51AC7B3DF5A04F9767FA7C8723140AC274E2695D939C716D9B49CCF0F1D79967CE902781BC8CB5A253E3FB39896021ABB4F2FCA01D0E138E00A8176EB2ECE5B0204597C21969C8F501A9EDE4D27694E699777BB179CD329748B3341A4BBF3085C447E2B55BE97E27D23E415C23F1A53A33A15551D9AE6B5CF255C3B8ECE038A481B8291A8EC46F0EA8730C3658DABC5BE7557C6659321677D8F4586CA79D6D5CCCB1C5687F9077A6CD96487EAEF417A1411C2F62BE6FF57DD1F515B16406CF4B0B9460EFB9BCB46F8F7E47FCB8E8CE4FAE2EB92F20DECEF2BBF1D95C80597BE935A031CD158593EFA2E446FA6FAFDD2B4E691CD8569B7D60DAD4378EBD6A138E23F0F616FD01443647D9A6F852AEF773A69580390496748241739C0DDF2791B1C2143B7E9E976754056B70EB846DAE1D7018CC40026F862ABF613D89C8D31B2C468B81D0C18C37697E8BA5D415F8DFCA37AF2935AAD0238ED6F652E24062849EC8E0C4651C4FB8BB9DD11BE4F8639AD690C791868B8E94ADB626C9B1BD8E334F675E664A03DC;
>>>> wiki_pensieve_session=j1pcf1746js1442m7p92hag9g1; wiki_pensieveUserID=5;
>>>> wiki_pensieveUserName=Lziobro;
>>>> wiki_pensieveToken=********************be3a3a990a8a
>>>>
>>>> Connection: keep-alive
>>>> Authorization: Basic bHppb**************xMjM0   <- I've censored this
>>>> line so you cannot get my password
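Censoring that line matters because the Basic scheme is just "username:password" Base64-encoded; it is encoding, not encryption. A minimal illustration with hypothetical credentials:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {
  /** Build a Basic Authorization header value. Base64 is trivially
      reversible, so a real header pasted uncensored into a public list
      leaks the password. Credentials here are placeholders. */
  public static String value(String user, String password) {
    byte[] pair = (user + ":" + password).getBytes(StandardCharsets.UTF_8);
    return "Basic " + Base64.getEncoder().encodeToString(pair);
  }
}
```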
>>>>
>>>> If I’m understanding you correctly, there’s no way to accomplish this
>>>> currently? Or is there some workaround we could implement?
>>>>
>>>>
>>>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>>>> *Sent:* Thursday, May 16, 2013 12:05 PM
>>>> *To:* user@manifoldcf.apache.org
>>>> *Subject:* Re: ManifoldCF and Kerberos/Basic Authentication
>>>>
>>>>
>>>> Hi TC,
>>>>
>>>> Apparently mod-auth-kerb can be configured in a number of different
>>>> ways.  But if yours will work with basic auth, we can just transmit the
>>>> credentials each time.  It will be relatively slow because mod-auth-kerb
>>>> will then need to talk to the kdc on each page fetch, but it should work.
>>>> Better yet would be if Apache set a browser cookie containing your tickets,
>>>> which it knew how to interpret if returned - but I don't see any Google
>>>> evidence that mod-auth-kerb is capable of that.  But either of these two
>>>> approaches we could readily implement.
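As a sketch of what "transmit the credentials each time" means in practice, here is a preemptive-auth request built with the JDK's java.net.http client (chosen for brevity; it is not the HttpClient library the connector actually uses, and the host and credentials are placeholders):

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Base64;

public class PreemptiveBasicFetch {
  /** Build a GET request that carries Basic credentials preemptively, so
      mod-auth-kerb can validate them (via the KDC) on every page fetch. */
  public static HttpRequest build(String url, String user, String password) {
    String token = Base64.getEncoder()
        .encodeToString((user + ":" + password).getBytes());
    return HttpRequest.newBuilder(URI.create(url))
        .header("Authorization", "Basic " + token)
        .GET()
        .build();
  }
}
```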
>>>>
>>>> FWIW, the standard way to work with kerberos is for you to actually
>>>> have tickets already kinit'd and installed on your machine.  Your browser
>>>> then picks up those tickets and transmits them to the Wiki server (I
>>>> presume in a header that mod-auth-kerb knows about), and the kdc does not
>>>> need to be involved.  But initializing that kind of ticket store, and
>>>> managing the associated kinit requests when necessary, are beyond the scope
>>>> of any connector we've so far done, so if we had to go that way, that would
>>>> effectively make this proposal a Research Project.
>>>>
>>>> What would be great to know in advance is how exactly your browser
>>>> interacts with your Apache server.  Are you familiar with the process of
>>>> getting a packet dump?  You'd use a tool like tcpdump (Unix) or wireshark
>>>> (windows) in order to capture the packet traffic between a browser session
>>>> and your Apache server, to see exactly what is happening.  Start by
>>>> shutting down all your browser windows, so there is no in-memory state, and
>>>> then start the capture and browse to a part of the wiki that is secured by
>>>> mod-auth-kerb.  We'd want to see if cookies get set, or if any special
>>>> headers get transmitted by your browser (other than the standard Basic Auth
>>>> "Authorization" header).  If the exchange is protected by SSL, then
>>>> you'll have to use Firefox with a plugin like Live HTTP Headers to see what
>>>> is going on instead.
>>>>
>>>> Please let me know what you find.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, May 16, 2013 at 12:37 PM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>> Hi TC,
>>>>
>>>> Thanks, this is a big help in understanding your setup.
>>>>
>>>> I don't know enough about exactly *how* mod-auth-kerb uses Basic Auth
>>>> to communicate with the browser, and whether it expects the browser to
>>>> cache the resulting tickets (in cookies?).  I will have to do some research
>>>> and get back to you on that.
>>>>
>>>> Basically, security for a Wiki is usually handled by the Wiki, but
>>>> since you've added auth in front of it by going through mod-auth-kerb,
>>>> it's something that the Wiki connector would have to understand (and
>>>> emulate your browser) in order to implement.  So it likely does not support
>>>> this right now.  It may be relatively easy to do or it may be a challenge -
>>>> we'll see.  I would also be somewhat concerned that it may not be possible to
>>>> actually reach the API URLs through Apache; that would make everything moot
>>>> if it were true.  Could you confirm that you can visit API URLs through
>>>> your Apache setup?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, May 16, 2013 at 12:21 PM, TC Tobin-Campbell <TC@epic.com>
>>>> wrote:
>>>>
>>>> Hi there,
>>>>
>>>> I'm trying to connect ManifoldCF to an internal wiki at my company. The
>>>> ManifoldCF wiki connector supplies a username and password field for the
>>>> wiki API. However, at my company, a username and password are required to
>>>> connect to the Apache server running the wiki site; after that
>>>> authentication takes place, those credentials are passed on to the wiki API.
>>>>
>>>>
>>>> So, essentially, I need a way to have ManifoldCF pass my Windows
>>>> credentials on when trying to make its connection. Using the API login
>>>> fields does not work.
>>>>
>>>>
>>>> We use the Kerberos Module for Apache
>>>> <http://modauthkerb.sourceforge.net/index.html> (AuthType Kerberos).  My
>>>> understanding, based on that linked documentation, is that this module does
>>>> use Basic Auth to communicate with the browser.
>>>>
>>>> Is there anything we can do to make ManifoldCF authenticate in this
>>>> scenario?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>
>>>
>>
>
