manifoldcf-user mailing list archives

From Kambiz Niktabar <nikta...@yahoo.com>
Subject Re: Wiki connector stuck crawling namespaces other than default
Date Wed, 01 Oct 2014 15:42:28 GMT
Hi Karl,

Thanks for the info. I will check with the people maintaining the Wiki site to see if there
is any specific configuration that causes this.

Regards
Kambiz


________________________________
 From: Karl Wright <daddywri@gmail.com>
To: Kambiz Niktabar <niktabar@yahoo.com> 
Cc: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org> 
Sent: Wednesday, October 1, 2014 4:34 PM
Subject: Re: Wiki connector stuck crawling namespaces other than default
 


The standard mediawiki api for this operation is listed here:

http://www.mediawiki.org/wiki/API:Allpages

Karl






On Wed, Oct 1, 2014 at 10:20 AM, Karl Wright <daddywri@gmail.com> wrote:

>Hi Kambiz,
>
>I looked deeper into the log, and found that it is looping on trying to seed.  It is looping
>because the wiki server you are crawling is not honoring the "apfrom" parameter when the
>namespace is specified.  Please see the following query, and the response that is coming
>back from it:
>
>DEBUG 2014-10-01 08:34:22,470 (Thread-618) - http-outgoing-7 >> "GET /wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATraining&aplimit=500 HTTP/1.1[\r][\n]"
>
>This response is *supposed* to start with Africa:Training and go on from there.  Instead,
>it seems to be starting from the beginning of the namespace:
>
>
>>>>>>>
><?xml version="1.0"?><api><query>
><allpages>
><p pageid="10171" ns="404" title="Africa:Arcgis" />
><p pageid="9977" ns="404" title="Africa:Atlas" />
><p pageid="9979" ns="404" title="Africa:CTMargins" />
><p pageid="9727" ns="404" title="Africa:Conferences" />
><p pageid="9386" ns="404" title="Africa:Conferences2010" />
><p pageid="9833" ns="404" title="Africa:Countryprojects" />
><p pageid="9823" ns="404" title="Africa:Databases" />
><p pageid="9976" ns="404" title="Africa:EasternMed" />
><p pageid="10277" ns="404" title="Africa:Farmin" />
><p pageid="9388" ns="404" title="Africa:FieldtripGuides" />
><p pageid="9834" ns="404" title="Africa:Gabon2010" />
><p pageid="9975" ns="404" title="Africa:InteriorRifs" />
><p pageid="10762" ns="404" title="Africa:Kenya2011" />
><p pageid="15660" ns="404" title="Africa:Kenya2012" />
><p pageid="14945" ns="404" title="Africa:Madagascar2012" />
><p pageid="9973" ns="404" title="Africa:Mozambique2011" />
><p pageid="9385" ns="404" title="Africa:New Ventures Africa" />
><p pageid="9812" ns="404" title="Africa:New Ventures Africa Map" />
><p pageid="9969" ns="404" title="Africa:Newsletter" />
><p pageid="19985" ns="404" title="Africa:Project Abyss" />
><p pageid="19986" ns="404" title="Africa:Project Geronimo" />
><p pageid="20079" ns="404" title="Africa:Project Inlet" />
><p pageid="9832" ns="404" title="Africa:Regionalprojects" />
><p pageid="9974" ns="404" title="Africa:Seychelles2011" />
><p pageid="9978" ns="404" title="Africa:TetianCarbonates" />
><p pageid="9822" ns="404" title="Africa:Training" />
></allpages>
></query></api>
><<<<<<
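A check like the one described above can be scripted. The following is only a rough sketch, not the connector's code: it parses an allpages response and reports whether the first title returned sorts at or after the requested apfrom value. The function names are made up for illustration, the sample is an abbreviated copy of the response quoted above, and plain string comparison is an assumption that may not match the server's real sort order.

```python
import xml.etree.ElementTree as ET


def first_and_last_title(xml_text):
    """Return (first, last) page titles from an allpages XML response.

    Returns (None, None) when the response lists no pages.
    """
    root = ET.fromstring(xml_text)
    titles = [p.get("title") for p in root.iter("p")]
    if not titles:
        return None, None
    return titles[0], titles[-1]


def starts_at_or_after(xml_text, apfrom):
    """True if the first returned title does not sort before apfrom.

    Assumption: plain string comparison is close enough to the wiki's
    title ordering for a quick sanity check.
    """
    first, _ = first_and_last_title(xml_text)
    return first is not None and first >= apfrom


# Abbreviated copy of the response quoted in this thread.
SAMPLE = """<?xml version="1.0"?><api><query>
<allpages>
<p pageid="10171" ns="404" title="Africa:Arcgis" />
<p pageid="9977" ns="404" title="Africa:Atlas" />
<p pageid="9822" ns="404" title="Africa:Training" />
</allpages>
</query></api>"""

if __name__ == "__main__":
    # The request asked for apfrom=Africa:Training.
    print(starts_at_or_after(SAMPLE, "Africa:Training"))
```

If this prints False for a response, the server started earlier than the title it was asked to resume from, which is the symptom described in this message.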
>
>
>What version of Wiki are you crawling here?  Perhaps something has changed in the spec,
>or maybe you are crawling a wiki that is too old to support this feature?
>
>
>Karl
>
>
>
>
>On Wed, Oct 1, 2014 at 9:57 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>>Hi Kambiz,
>>
>>In the log you sent, I did not see any activity at all other than seeding.  Was the
>>log complete?
>>
>>You can get a better sense of what is happening by obtaining a simple history report
>>for this connection, and a document status report for the job.  If there are only 27
>>documents, it should be very clear what is happening by looking at these. Can you include
>>them please?
>>
>>Karl
>>
>>
>>
>>
>>On Wed, Oct 1, 2014 at 9:50 AM, Kambiz Niktabar <niktabar@yahoo.com> wrote:
>>
>>>Hi Karl,
>>>
>>>
>>>A snapshot of the job view page is attached. By the way, it seems the number of pages
>>>under that namespace is only 27, and they are not being processed even after some minutes
>>>(see the second snapshot).
>>>
>>>
>>>Regards
>>>Kambiz
>>>
>>>
>>>
>>>________________________________
>>> From: Karl Wright <daddywri@gmail.com>
>>>To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>; Kambiz Niktabar <niktabar@yahoo.com>
>>>Sent: Wednesday, October 1, 2014 2:05 PM
>>>Subject: Re: Wiki connector stuck crawling namespaces other than default
>>> 
>>>
>>>
>>>Hi Kambiz,
>>>
>>>The debugging output indicates that your namespace name is "404".  That doesn't
>>>sound correct to me.
>>>
>>>>>>>>>
>>>GET /wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATetianCarbonates&aplimit=500 HTTP/1.1
>>><<<<<<
>>>
>>>I've gone back and looked at the code and can find no way that the namespace would
>>>be corrupted.  But maybe this is actually correct.  Can you send along a screen shot of
>>>the view page for the job?
>>>
>>>
>>>Also, the wiki connector seeds documents in batches of 500 at a time.  It uses
>>>the last title fetched in order to be able to find the next batch of 500.  So if there
>>>are a lot of documents, it will take a while to seed them all.  In your log I see signs
>>>that this is what is happening.  Have a look at all the GET requests and note the apfrom
>>>parameter.
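The batching scheme described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the connector's actual implementation: fetch_batch is a hypothetical callable standing in for the allpages request, and the "no new titles" guard is an addition for illustration that stops the loop when a server keeps returning pages that were already seen.

```python
def seed_titles(fetch_batch, limit=500, max_batches=1000):
    """Collect titles batch by batch, resuming from the last title fetched.

    fetch_batch(apfrom, limit) is a hypothetical callable that returns a list
    of titles starting at apfrom (or at the beginning when apfrom is None).
    """
    seen = []
    seen_set = set()
    apfrom = None
    for _ in range(max_batches):
        batch = fetch_batch(apfrom, limit)
        # The resume title comes back again at the head of the next batch,
        # so only titles not seen before count as progress.
        new = [t for t in batch if t not in seen_set]
        if not new:
            # No progress: either everything has been seeded, or the server
            # ignored apfrom and replayed an earlier batch.
            break
        seen.extend(new)
        seen_set.update(new)
        if len(batch) < limit:
            # A short batch means there is nothing left to fetch.
            break
        # Resume the next request from the last title of this batch.
        apfrom = batch[-1]
    return seen
```

With a server that honors apfrom, each request picks up where the previous one ended. With one that ignores it, the second batch contains nothing new and the loop ends instead of repeating the same request.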
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>Thanks,
>>>Karl
>>>
>>>
>>>
>>>
>>>
>>>
>>
>