manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Wiki connector stuck crawling namespaces other than default
Date Wed, 01 Oct 2014 14:20:37 GMT
Hi Kambiz,

I looked deeper into the log and found that the job is looping while trying
to seed.  It loops because the wiki server you are crawling is not honoring
the "apfrom" parameter when a namespace is specified.  Please see the
following request, and the response that comes back for it:

DEBUG 2014-10-01 08:34:22,470 (Thread-618) - http-outgoing-7 >> "GET
/wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATraining&aplimit=500
HTTP/1.1[\r][\n]"

This response is *supposed* to start with Africa:Training and go on from
there.  Instead, it seems to be starting from the beginning of the
namespace:

>>>>>>
<?xml version="1.0"?><api><query>
<allpages>
<p pageid="10171" ns="404" title="Africa:Arcgis" />
<p pageid="9977" ns="404" title="Africa:Atlas" />
<p pageid="9979" ns="404" title="Africa:CTMargins" />
<p pageid="9727" ns="404" title="Africa:Conferences" />
<p pageid="9386" ns="404" title="Africa:Conferences2010" />
<p pageid="9833" ns="404" title="Africa:Countryprojects" />
<p pageid="9823" ns="404" title="Africa:Databases" />
<p pageid="9976" ns="404" title="Africa:EasternMed" />
<p pageid="10277" ns="404" title="Africa:Farmin" />
<p pageid="9388" ns="404" title="Africa:FieldtripGuides" />
<p pageid="9834" ns="404" title="Africa:Gabon2010" />
<p pageid="9975" ns="404" title="Africa:InteriorRifs" />
<p pageid="10762" ns="404" title="Africa:Kenya2011" />
<p pageid="15660" ns="404" title="Africa:Kenya2012" />
<p pageid="14945" ns="404" title="Africa:Madagascar2012" />
<p pageid="9973" ns="404" title="Africa:Mozambique2011" />
<p pageid="9385" ns="404" title="Africa:New Ventures Africa" />
<p pageid="9812" ns="404" title="Africa:New Ventures Africa Map" />
<p pageid="9969" ns="404" title="Africa:Newsletter" />
<p pageid="19985" ns="404" title="Africa:Project Abyss" />
<p pageid="19986" ns="404" title="Africa:Project Geronimo" />
<p pageid="20079" ns="404" title="Africa:Project Inlet" />
<p pageid="9832" ns="404" title="Africa:Regionalprojects" />
<p pageid="9974" ns="404" title="Africa:Seychelles2011" />
<p pageid="9978" ns="404" title="Africa:TetianCarbonates" />
<p pageid="9822" ns="404" title="Africa:Training" />
</allpages>
</query></api>
<<<<<<
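
As a rough illustration of the check described above, the sketch below parses
an allpages response and tests whether any returned title sorts before the
requested "apfrom" value.  The function names are hypothetical, and plain
string comparison is only an assumed stand-in for the server's own title
ordering; this is not code from the wiki connector.

```python
import xml.etree.ElementTree as ET


def allpages_titles(xml_text):
    """Return the page titles from an allpages XML response, in order."""
    root = ET.fromstring(xml_text)
    return [p.get("title") for p in root.iter("p")]


def honors_apfrom(xml_text, apfrom):
    """True if no returned title sorts before the requested start title.

    Assumption: plain string comparison approximates the server's ordering.
    """
    return all(title >= apfrom for title in allpages_titles(xml_text))


# Two rows taken from the response quoted above.
sample = """<?xml version="1.0"?><api><query>
<allpages>
<p pageid="10171" ns="404" title="Africa:Arcgis" />
<p pageid="9822" ns="404" title="Africa:Training" />
</allpages>
</query></api>"""

print(honors_apfrom(sample, "Africa:Training"))  # False
```

A result of False for the request shown above means the listing restarted
from the top of the namespace instead of resuming at the requested title.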

Which version of the wiki software are you crawling here?  Perhaps something
has changed in the spec, or maybe you are crawling a wiki that is too old to
support this feature?
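
To show why an ignored "apfrom" matters for batch seeding, here is a minimal
sketch of a loop that resumes each request from the last title fetched.  It
is not the connector's actual implementation: fetch_batch is a hypothetical
callable standing in for the HTTP request to api.php, and the two fake
servers exist only for the demonstration.  The loop stops as soon as a batch
brings no new titles, and it reports whether the server appeared to ignore
the start title.

```python
def seed_titles(fetch_batch, limit=500):
    """Collect titles in batches, resuming from the last title fetched.

    fetch_batch(apfrom, limit) is a hypothetical callable returning a list
    of titles.  Returns (titles, apfrom_ignored).
    """
    titles, seen = [], set()
    apfrom = ""
    apfrom_ignored = False
    while True:
        batch = fetch_batch(apfrom, limit)
        # A title sorting before the start title suggests apfrom was ignored
        # (assumes plain string ordering matches the server's ordering).
        if any(t < apfrom for t in batch):
            apfrom_ignored = True
        fresh = [t for t in batch if t not in seen]
        if not fresh:
            # No progress: either the listing is exhausted or it is repeating.
            return titles, apfrom_ignored
        titles.extend(fresh)
        seen.update(fresh)
        apfrom = batch[-1]


PAGES = ["Africa:Arcgis", "Africa:Atlas", "Africa:Training"]


def good_server(apfrom, limit):
    """Fake server that resumes the listing at apfrom."""
    return [t for t in PAGES if t >= apfrom][:limit]


def broken_server(apfrom, limit):
    """Fake server that always restarts from the top of the namespace."""
    return PAGES[:limit]


print(seed_titles(good_server, limit=2))
print(seed_titles(broken_server, limit=2))
```

Against the fake server that ignores the start title, the sketch returns
after the second request with the flag set, rather than requesting the same
batch indefinitely.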

Karl


On Wed, Oct 1, 2014 at 9:57 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Kambiz,
>
> In the log you sent, I did not see any activity at all other than
> seeding.  Was the log complete?
>
> You can get a better sense of what is happening by obtaining a simple
> history report for this connection, and a document status report for the
> job.  If there are only 27 documents, it should be very clear what is
> happening by looking at these. Can you include them please?
>
> Karl
>
>
> On Wed, Oct 1, 2014 at 9:50 AM, Kambiz Niktabar <niktabar@yahoo.com>
> wrote:
>
>> Hi Karl,
>>
>> Snapshot of the job view page is attached. By the way, it seems the
>> number of pages under that namespace is only 27 and they are not being
>> processed even after some minutes (see the second snapshot)
>>
>> Regards
>> Kambiz
>>
>>   ------------------------------
>>  *From:* Karl Wright <daddywri@gmail.com>
>> *To:* "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>; Kambiz
>> Niktabar <niktabar@yahoo.com>
>> *Sent:* Wednesday, October 1, 2014 2:05 PM
>> *Subject:* Re: Wiki connector stuck crawling namespaces other than
>> default
>>
>> Hi Kambiz,
>>
>> The debugging output indicates that your namespace name is "404".  That
>> doesn't sound correct to me.
>>
>> >>>>>>
>> GET
>> /wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATetianCarbonates&aplimit=500
>> HTTP/1.1
>> <<<<<<
>>
>> I've gone back and looked at the code and can find no way that the
>> namespace would be corrupted.  But maybe this is actually correct.  Can you
>> send along a screen shot of the view page for the job?
>>
>> Also, the wiki connector seeds documents in batches of 500 at a time.  It
>> uses the last title fetched in order to be able to find the next batch of
>> 500.  So if there are a lot of documents, it will take a while to seed them
>> all.  In your log I see signs that this is what is happening.  Have a look
>> at all the GET requests and note the apfrom parameter.
>>
>>
>>
>>
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>>
>
