manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Export crawled URLs
Date Mon, 05 Dec 2011 10:22:33 GMT
If you've updated the DBInterfaceMySQL driver, any chance you would be
willing to contribute it back to the project?

Karl


On Sun, Dec 4, 2011 at 11:13 PM, Hitoshi Ozawa
<Ozawa_Hitoshi@ogis-ri.co.jp> wrote:
> "The interpretation of this field will differ from connector to connector".
> From the above description, seems the content of entityid is dependent of
> which connector is
> being used to crawl the web pages.
> You're right about the second point on entityid column datatype. In MySQL,
> which I'm using
> with ManifoldCF, the datatype of entityid is LONGTEXT. I was just using it
> figurably even though
> I just found out that I can actually execute the sql statement. :-)
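For reference, a minimal sketch of that MySQL behaviour, using the
repohistory/entityid names from this thread (the prefix length below is
arbitrary):

    -- Filtering on a LONGTEXT column works fine in MySQL:
    SELECT entityid
    FROM repohistory
    WHERE entityid LIKE 'http://example.com/%';

    -- Indexing it, though, requires an explicit prefix length:
    CREATE INDEX idx_repohistory_entityid ON repohistory (entityid(255));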
>
> Cheers,
> H.Ozawa
>
>
> (2011/12/05 10:29), Karl Wright wrote:
>>
>> Well, the history comes from the repohistory table, yes - but you may
>> not be able to construct a query with entityid=jobs.id. First of all,
>> that join is incorrect (what the entityid field contains depends on
>> the activity type), and secondly, that column is potentially long and
>> only some kinds of queries can be run against it. Specifically, it
>> cannot be built into an index on PostgreSQL.
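To make that concrete, here is a hedged sketch of a query that stays within
those constraints; the activity-type value 'fetch' and the exact column
names are assumptions to check against the actual schema:

    -- URLs fetched through one repository connection, newest first.
    -- 'MyWebConnection' is a placeholder connection name.
    SELECT entityid, resultcode, starttime
    FROM repohistory
    WHERE owner = 'MyWebConnection'
      AND activitytype = 'fetch'
    ORDER BY starttime DESC;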
>>
>> Karl
>>
>> On Sun, Dec 4, 2011 at 7:50 PM, Hitoshi Ozawa
>> <Ozawa_Hitoshi@ogis-ri.co.jp> wrote:
>>
>>>
>>> Is "history" just entries in the "repohistory" table with entitityid =
>>> jobs.id?
>>>
>>> H.Ozawa
>>>
>>> (2011/12/03 1:43), Karl Wright wrote:
>>>
>>>>
>>>> The best place to get this from is the simple history.  A command-line
>>>> utility to dump this information to a text file should be possible
>>>> with the currently available interface primitives.  If that is how you
>>>> want to go, you will need to run ManifoldCF in multiprocess mode.
>>>> Alternatively, you might want to request the info from the API, but
>>>> that's problematic because nobody has implemented report support in
>>>> the API yet.
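Lacking such a utility, one rough alternative is to pull the list straight
from the database; a MySQL sketch (table and column names assumed as in the
discussion above; INTO OUTFILE needs the FILE privilege and writes on the
server host):

    -- Dump the distinct fetched URLs from the simple history to a file.
    SELECT DISTINCT entityid
    FROM repohistory
    WHERE activitytype = 'fetch'
    INTO OUTFILE '/tmp/crawled-urls.txt';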
>>>>
>>>> A final alternative is to get this from the log.  There is an
>>>> [INFO]-level line from the web connector for every fetch, I seem to
>>>> recall, and you might be able to use that.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Fri, Dec 2, 2011 at 11:18 AM, M Kelleher <mj.kelleher@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>>
>>>>> Is it possible to export / download the list of URLs visited during a
>>>>> crawl job?
>>>>>
>>>>> Sent from my iPad
>>>>>
