manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Wiki connector stuck crawling namespaces other than default
Date Wed, 01 Oct 2014 14:34:38 GMT
The standard mediawiki api for this operation is listed here:

http://www.mediawiki.org/wiki/API:Allpages
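
One quick way to check whether a server honors apfrom is to compare the first title in an allpages response against the apfrom value that was sent. The sketch below is only an illustration, not the connector's code. The function names are made up, the sample XML is an abbreviated copy of the response quoted further down, and it assumes titles are ordered by plain code-point comparison, which holds for the ASCII titles in that response.

```python
import xml.etree.ElementTree as ET


def allpages_titles(xml_text):
    """Return the page titles from an allpages XML response, in document order."""
    root = ET.fromstring(xml_text)
    return [p.get("title") for p in root.iter("p")]


def honors_apfrom(xml_text, apfrom):
    """True if the batch starts at or after apfrom (assumes code-point ordering)."""
    titles = allpages_titles(xml_text)
    # An empty batch cannot contradict apfrom, so treat it as honored.
    return not titles or titles[0] >= apfrom


# Abbreviated copy of the response in the log: first and last entries only.
sample = """<?xml version="1.0"?><api><query>
<allpages>
<p pageid="10171" ns="404" title="Africa:Arcgis" />
<p pageid="9822" ns="404" title="Africa:Training" />
</allpages>
</query></api>"""

# The request asked for apfrom=Africa:Training, but the batch starts earlier.
print(honors_apfrom(sample, "Africa:Training"))  # prints False
```

If a check like this returns False for a batch, using the last title as the next apfrom just repeats the same request, which matches the seeding loop seen in the log.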

Karl


On Wed, Oct 1, 2014 at 10:20 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Kambiz,
>
> I looked deeper into the log and found that it is looping while trying to
> seed.  It is looping because the wiki server you are crawling is not
> honoring the "apfrom" parameter when the namespace is specified.  Please
> see the following request and the response that comes back from it:
>
> DEBUG 2014-10-01 08:34:22,470 (Thread-618) - http-outgoing-7 >> "GET
> /wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATraining&aplimit=500
> HTTP/1.1[\r][\n]"
>
> This response is *supposed* to start with Africa:Training and go on from
> there.  Instead, it seems to be starting from the beginning of the
> namespace:
>
> >>>>>>
> <?xml version="1.0"?><api><query>
> <allpages>
> <p pageid="10171" ns="404" title="Africa:Arcgis" />
> <p pageid="9977" ns="404" title="Africa:Atlas" />
> <p pageid="9979" ns="404" title="Africa:CTMargins" />
> <p pageid="9727" ns="404" title="Africa:Conferences" />
> <p pageid="9386" ns="404" title="Africa:Conferences2010" />
> <p pageid="9833" ns="404" title="Africa:Countryprojects" />
> <p pageid="9823" ns="404" title="Africa:Databases" />
> <p pageid="9976" ns="404" title="Africa:EasternMed" />
> <p pageid="10277" ns="404" title="Africa:Farmin" />
> <p pageid="9388" ns="404" title="Africa:FieldtripGuides" />
> <p pageid="9834" ns="404" title="Africa:Gabon2010" />
> <p pageid="9975" ns="404" title="Africa:InteriorRifs" />
> <p pageid="10762" ns="404" title="Africa:Kenya2011" />
> <p pageid="15660" ns="404" title="Africa:Kenya2012" />
> <p pageid="14945" ns="404" title="Africa:Madagascar2012" />
> <p pageid="9973" ns="404" title="Africa:Mozambique2011" />
> <p pageid="9385" ns="404" title="Africa:New Ventures Africa" />
> <p pageid="9812" ns="404" title="Africa:New Ventures Africa Map" />
> <p pageid="9969" ns="404" title="Africa:Newsletter" />
> <p pageid="19985" ns="404" title="Africa:Project Abyss" />
> <p pageid="19986" ns="404" title="Africa:Project Geronimo" />
> <p pageid="20079" ns="404" title="Africa:Project Inlet" />
> <p pageid="9832" ns="404" title="Africa:Regionalprojects" />
> <p pageid="9974" ns="404" title="Africa:Seychelles2011" />
> <p pageid="9978" ns="404" title="Africa:TetianCarbonates" />
> <p pageid="9822" ns="404" title="Africa:Training" />
> </allpages>
> </query></api>
> <<<<<<
>
> What version of MediaWiki are you crawling here?  Perhaps something has
> changed in the spec, or maybe you are crawling a wiki that is too old to
> support this feature?
>
> Karl
>
>
> On Wed, Oct 1, 2014 at 9:57 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Kambiz,
>>
>> In the log you sent, I did not see any activity at all other than
>> seeding.  Was the log complete?
>>
>> You can get a better sense of what is happening by obtaining a simple
>> history report for this connection, and a document status report for the
>> job.  If there are only 27 documents, it should be very clear what is
>> happening by looking at these. Can you include them please?
>>
>> Karl
>>
>>
>> On Wed, Oct 1, 2014 at 9:50 AM, Kambiz Niktabar <niktabar@yahoo.com>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> A snapshot of the job view page is attached. By the way, it seems there
>>> are only 27 pages under that namespace, and they are still not being
>>> processed after several minutes (see the second snapshot).
>>>
>>> Regards
>>> Kambiz
>>>
>>>   ------------------------------
>>>  *From:* Karl Wright <daddywri@gmail.com>
>>> *To:* "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>; Kambiz
>>> Niktabar <niktabar@yahoo.com>
>>> *Sent:* Wednesday, October 1, 2014 2:05 PM
>>> *Subject:* Re: Wiki connector stuck crawling namespaces other than
>>> default
>>>
>>> Hi Kambiz,
>>>
>>> The debugging output indicates that your namespace name is "404".  That
>>> doesn't sound correct to me.
>>>
>>> >>>>>>
>>> GET
>>> /wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATetianCarbonates&aplimit=500
>>> HTTP/1.1
>>> <<<<<<
>>>
>>> I've gone back and looked at the code and can find no way that the
>>> namespace could be corrupted.  But maybe this is actually correct.  Can you
>>> send along a screenshot of the view page for the job?
>>>
>>> Also, the wiki connector seeds documents in batches of 500 at a time.
>>> It uses the last title fetched in order to be able to find the next batch
>>> of 500.  So if there are a lot of documents, it will take a while to seed
>>> them all.  In your log I see signs that this is what is happening.  Have a
>>> look at all the GET requests and note the apfrom parameter.
>>>
>>> Thanks,
>>> Karl
>>>
>>
>
