manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Crawling all of a SharePoint site
Date Tue, 19 Nov 2013 01:53:56 GMT
Ok, patch attached.

One of two things will happen with this patch:
(1) It will work
(2) It will crawl to completion but not get any list rows

If it is the latter, it means that SharePoint operating in this mode
REPLACES the list items with some funky cache URL, rather than augmenting
them.  So please send me the log output if that happens.

Thanks,
Karl



On Mon, Nov 18, 2013 at 8:45 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hah.  Exactly the kind of configuration difference I was expecting.
> Whatever it is, it's showing up as a list.
>
> I'll open a ticket, and propose a patch; let's see if that gets us past
> this.
>
> The ticket is CONNECTORS-812.  I should have a patch in a few minutes,
> attached to the ticket.
>
> Karl
>
>
>
>
> On Mon, Nov 18, 2013 at 8:41 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>
>> Seems to be a SP-internal thing.
>>
>> http://msdn.microsoft.com/en-us/library/aa661294.ASPX
>>
>> Mark
>>
>>
>> On Mon, Nov 18, 2013 at 5:39 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Mark,
>>>
>>> Is "Cache Profiles" a list in your SharePoint?  If not, what is it?
>>>
>>> Karl
>>>
>>>
>>>
>>> On Mon, Nov 18, 2013 at 8:37 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> It's not the first problem you mentioned. I don't have a site specified
>>>> in my SP connection. But it could well be the misconfigured IIS issue...
>>>>
>>>> Here's what I get with your modified log message:
>>>>
>>>> ERROR 2013-11-18 20:35:47,440 (Worker thread '7') - Exception tossed:
>>>> Expected path to start with /Lists/, saw: '/Cache Profiles/1_.000'
>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected
>>>> path to start with /Lists/, saw: '/Cache Profiles/1_.000'
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>>
>>>>
>>>>
>>>> On Mon, Nov 18, 2013 at 5:29 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> The exception is very helpful.
>>>>>
>>>>> I've seen this before.  I know of two ways it can happen.
>>>>>
>>>>> First way: your Repository Connection is not actually pointing at the
>>>>> SharePoint root, but rather a subsite of the root.  That usually messes
>>>>> things up pretty well - and it's not easy to detect in the connector
>>>>> properly either.  You must point at the actual root, not a subsite, and
>>>>> use the criteria to limit what you include.
>>>>>
>>>>> Second way: your SharePoint instance has a misconfigured IIS, which is
>>>>> mapping paths in ways that are unexpected.
>>>>>
>>>>> There may be other ways that this can happen; SharePoint has a myriad
>>>>> different configuration options and it is possible your instance has
>>>>> one that is not something we've ever seen before.  If you think that is
>>>>> what is happening, change this line:
>>>>>
>>>>>             throw new ManifoldCFException("Expected path to start with /Lists/");
>>>>>
>>>>> to:
>>>>>
>>>>>             throw new ManifoldCFException("Expected path to start with /Lists/, saw: '"+relPath+"'");
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 18, 2013 at 8:20 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>>>>
>>>>>> Screen shot attached. Using 4.1, SharePoint 2010.
>>>>>>
>>>>>> Throws this exception:
>>>>>>
>>>>>> ERROR 2013-11-18 20:12:58,058 (Worker thread '13') - Exception
>>>>>> tossed: Expected path to start with /Lists/
>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected
>>>>>> path to start with /Lists/
>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository$ListItemStream.addFile(SharePointRepository.java:2255)
>>>>>>
>>>>>> I added a debug log message to the SharePoint crawler so the line
>>>>>> number may be off by 1 or 2...
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 18, 2013 at 4:59 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> First, what version of ManifoldCF are you using?  1.3 has some
>>>>>>> bugs where lists are concerned.
>>>>>>>
>>>>>>> Second, I've recently and repeatedly run exactly this crawl against
>>>>>>> a site that one of our ManifoldCF users set up in Amazon, so I know
>>>>>>> it works properly.  So now the question is to determine exactly what
>>>>>>> you are doing that is not correct.
>>>>>>>
>>>>>>> If you want to crawl just lists, you will nevertheless need to enter
>>>>>>> both a Site match and a List match.  Otherwise you will get nothing,
>>>>>>> because no sites can be crawled.
>>>>>>>
>>>>>>> To enter ANY of the rules I specified above, type a "*" in the
>>>>>>> type-in box, then select "Add Text".  Then, select one of
>>>>>>> "File", "Site", "List", or "Library" from the pulldown, and then click
>>>>>>> the "Add new Rule" button.  The Metadata tab works similarly.
>>>>>>>
>>>>>>> If you want me to verify you have done this correctly, please
>>>>>>> include a screen shot of the job's View page.
>>>>>>>
>>>>>>> If this still isn't helping you, please include a screen shot of the
>>>>>>> Simple History report after you have run a crawl.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 18, 2013 at 7:49 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>>>>>>
>>>>>>>> I've seen this issue come up before, but I'd like to hear more
>>>>>>>> about it (Karl), if there is more to say about it...
>>>>>>>>
>>>>>>>> Why isn't there an option to crawl an entire SharePoint site?  I
>>>>>>>> mean it's awesome that the UI gives us the option of drilling down
>>>>>>>> dynamically and specifying exactly which parts we want crawled, but
>>>>>>>> isn't the default case for most users to just crawl the whole thing?
>>>>>>>>
>>>>>>>> So, why exactly is this not an option, and would adding that
>>>>>>>> functionality (I would be volunteering to try this) be feasible?
>>>>>>>>
>>>>>>>> On a more specific level, Karl wrote this in an earlier thread:
>>>>>>>>
>>>>>>>> <quote>
>>>>>>>> For SharePoint, if you want to crawl everything beneath your root
>>>>>>>> site, the simplest way is to define 4 rules:
>>>>>>>> (1) SITE rule "/*"
>>>>>>>> (2) LIST rule "/*"
>>>>>>>> (3) LIBRARY rule "/*"
>>>>>>>> (4) FILE rule "/*"
>>>>>>>> </quote>
>>>>>>>>
>>>>>>>> I haven't been able to get this to work. It only seems to get files.
>>>>>>>>
>>>>>>>> Limiting the scope to just Lists, when I use "/*" and specify List,
>>>>>>>> I get nothing crawled. Also tried "/Lists/*". Still nothing.
>>>>>>>>
>>>>>>>> Maybe I'm not specifying the Metadata correctly? Could you expand
>>>>>>>> on this Karl? What exactly needs to be specified to crawl all
>>>>>>>> Lists? If I can get that to work I can probably figure out the rest
>>>>>>>> of it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
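
For anyone following the thread, the diagnostic change quoted above (putting the offending path into the exception message) can be sketched on its own roughly as follows. This is only an illustration and not the connector's actual code: the class ListPathCheck, the method stripListsPrefix, and the stand-in PathFormatException are hypothetical names invented for this sketch, and returning the remainder of the path after "/Lists/" is an assumption about what the caller wants. The real connector throws org.apache.manifoldcf.core.interfaces.ManifoldCFException from SharePointRepository.

```java
// Illustrative sketch only; these names do not exist in ManifoldCF.
final class ListPathCheck {

    /** Hypothetical stand-in for ManifoldCFException so the sketch compiles alone. */
    static final class PathFormatException extends Exception {
        PathFormatException(String message) {
            super(message);
        }
    }

    private static final String LISTS_PREFIX = "/Lists/";

    /**
     * Returns the part of relPath that follows "/Lists/". If the path does not
     * start with that prefix, throws an exception whose message quotes the path
     * that was actually seen, which is what makes the log entry useful.
     */
    static String stripListsPrefix(String relPath) throws PathFormatException {
        if (!relPath.startsWith(LISTS_PREFIX)) {
            throw new PathFormatException(
                "Expected path to start with /Lists/, saw: '" + relPath + "'");
        }
        return relPath.substring(LISTS_PREFIX.length());
    }

    public static void main(String[] args) {
        try {
            // The path reported in the thread for the "Cache Profiles" entry.
            stripListsPrefix("/Cache Profiles/1_.000");
        } catch (PathFormatException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Quoting the value in the message costs nothing at runtime and turns a bare "Expected path to start with /Lists/" into something that identifies the unexpected item directly from the log.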
