manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Crawling all of a SharePoint site
Date Tue, 19 Nov 2013 01:29:22 GMT
Hi Mark,

The exception is very helpful.

I've seen this before.  I know of two ways it can happen.

First way: your Repository Connection is not actually pointing at the
SharePoint root, but rather at a subsite of the root.  That usually breaks
things pretty badly, and it is not easy for the connector to detect
either.  You must point at the actual root, not a subsite, and use the
inclusion criteria to limit what gets crawled.

Second way: your SharePoint instance has a misconfigured IIS, which is
mapping paths in ways that are unexpected.

There may be other ways that this can happen; SharePoint has a myriad of
configuration options, and it is possible your instance uses one that we
have never seen before.  If you think that is what is happening, change
this line in SharePointRepository.java (the one named in your stack trace)
so that the exception reports the path it actually saw:

            throw new ManifoldCFException("Expected path to start with /Lists/");

to:

            throw new ManifoldCFException("Expected path to start with /Lists/, saw: '"+relPath+"'");
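
For reference, here is a rough standalone sketch of the kind of prefix check
involved and what the more verbose message would look like.  This is not the
connector's actual code: the class name, the stand-in exception, the helper
method, and the sample paths are all made up for illustration, and I am
assuming relPath is a plain String.

```java
// Illustrative sketch only; names here are hypothetical, not ManifoldCF's.
final class PathCheckSketch {

    // Hypothetical stand-in for ManifoldCFException so the sketch compiles alone.
    static class PathException extends Exception {
        PathException(String message) {
            super(message);
        }
    }

    // Returns the part of relPath after "/Lists/", or throws an exception
    // whose message includes the path that was actually seen.
    static String stripListsPrefix(String relPath) throws PathException {
        final String prefix = "/Lists/";
        if (!relPath.startsWith(prefix)) {
            throw new PathException(
                "Expected path to start with /Lists/, saw: '" + relPath + "'");
        }
        return relPath.substring(prefix.length());
    }

    public static void main(String[] args) {
        // Made-up sample paths: one well-formed, one as it might look if the
        // connection were rooted at a subsite.
        String[] samples = {
            "/Lists/Announcements/1_.000",
            "/subsite/Lists/Announcements/1_.000"
        };
        for (String sample : samples) {
            try {
                System.out.println("ok: " + stripListsPrefix(sample));
            } catch (PathException e) {
                System.out.println("error: " + e.getMessage());
            }
        }
    }
}
```

With the path included in the message, the log should make it much easier to
tell a subsite-rooted connection from an unusual IIS path mapping.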

Karl




On Mon, Nov 18, 2013 at 8:20 PM, Mark Libucha <mlibucha@gmail.com> wrote:

> Screen shot attached. Using 4.1, SharePoint 2010.
>
> Throws this exception:
>
> ERROR 2013-11-18 20:12:58,058 (Worker thread '13') - Exception tossed:
> Expected path to start with /Lists/
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected path
> to start with /Lists/
>     at
> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository$ListItemStream.addFile(SharePointRepository.java:2255)
>
> I added a debug log message to the SharePoint crawler so the line number
> may be off by 1 or 2...
>
> Thanks,
>
> Mark
>
>
>
> On Mon, Nov 18, 2013 at 4:59 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Mark,
>>
>> First, what version of ManifoldCF are you using?  1.3 has some bugs where
>> lists are concerned.
>>
>> Second, I've recently and repeatedly run exactly this crawl against a
>> site that one of our ManifoldCF users set up in Amazon, so I know it works
>> properly.  So now the question is to determine exactly what you are doing
>> that is not correct.
>>
>> If you want to crawl just lists, you will nevertheless need to enter both
>> a Site match and a List match.  Otherwise you will get nothing, because no
>> sites can be crawled.
>>
>> To enter ANY of the rules I specified above, type a "*" in the type-in
>> box, then select "Add Text".  Then, select one of "File", "Site", "List", or
>> "Library" from the pulldown, and then click the "Add new Rule" button.  The
>> Metadata tab works similarly.
>>
>> If you want me to verify you have done this correctly, please include a
>> screen shot of the job's View page.
>>
>> If this still isn't helping you, please include a screen shot of the
>> Simple History report after you have run a crawl.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Mon, Nov 18, 2013 at 7:49 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>
>>> I've seen this issue come up before, but I'd like to hear more about it
>>> (Karl), if there is more to say about it...
>>>
>>> Why isn't there an option to crawl an entire SharePoint site? I mean
>>> it's awesome that the UI gives us the option of drilling down dynamically
>>> and specifying exactly which parts we want crawled, but isn't the default
>>> case for most users to just crawl the whole thing?
>>>
>>> So, why exactly is this not an option, and would adding that
>>> functionality (I would volunteer to try this) be feasible?
>>>
>>> On a more specific level, Karl wrote this in an earlier thread:
>>>
>>> <quote>
>>> For SharePoint, if you want to crawl everything beneath your root site,
>>> the simplest way is to define 4 rules:
>>> (1) SITE rule "/*"
>>> (2) LIST rule "/*"
>>> (3) LIBRARY rule "/*"
>>> (4) FILE rule "/*"
>>> </quote>
>>>
>>> I haven't been able to get this to work. It only seems to get files.
>>>
>>> Limiting the scope to just Lists, when I use "/*" and specify List, I
>>> get nothing crawled. Also tried "/Lists/*". Still nothing.
>>>
>>> Maybe I'm not specifying the Metadata correctly? Could you expand on
>>> this Karl? What exactly needs to be specified to crawl all Lists? If I can
>>> get that to work I can probably figure out the rest of it.
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>>
>>
>
