incubator-droids-dev mailing list archives

From Thorsten Scherler <scher...@gmail.com>
Subject Re: Link interface for crawler
Date Wed, 30 Jan 2013 13:30:29 GMT
On 01/30/2013 01:55 PM, Thorsten Scherler wrote:
> On 01/30/2013 12:31 PM, Tobias Rübner wrote:
>> Hi Thorsten,
>>
>> I would propose to extend the ContentEntity and add the needed fields there.
>>
>> The Task should only contain data relevant for executing the task.
>> All other "meta" data should be stored in the ContentEntity.
>> The getTo information can already be stored via ContentEntity.setLinks, and
>> getFrom is a reverse search on the same field.
>>
>> What do you think of this approach?
> I prefer a well defined interface since the ContentEntity is in the end
> a simple HashMap where we store information.

Another problem I see at the moment is that we declare
public abstract class CrawlingDroid extends AbstractDroid<LinkTask> {
but LinkTask is not an interface, so I cannot provide my own implementation.
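
To illustrate, here is a minimal sketch of how a Link interface plus a bounded type parameter could let callers plug in their own implementation. All names here (Task, Link, CrawlingDroidSketch, SimpleLink, accept, offer, pending) are hypothetical and only for illustration; they are not the actual Droids types, and the String return types for getFrom and getTo are an assumption.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical sketch; these are not the actual Droids types.
interface Task {
    String getId();
}

// Assumed Link contract: a Task that exposes its crawl metadata directly.
interface Link extends Task {
    String getFrom();            // URI of the page that contained the link
    String getTo();              // URI the link points to
    String getAnchorText();
    Date getLastModifiedDate();
}

// Bounding T by the interface lets callers supply their own Link implementation.
abstract class CrawlingDroidSketch<T extends Link> {
    private final List<T> queue = new ArrayList<>();

    // Example filter: keep links modified at or after the cutoff, or with an unknown date.
    protected boolean accept(T link, Date cutoff) {
        Date modified = link.getLastModifiedDate();
        return modified == null || !modified.before(cutoff);
    }

    // Queue the link only if it passes the filter.
    public void offer(T link, Date cutoff) {
        if (accept(link, cutoff)) {
            queue.add(link);
        }
    }

    public int pending() {
        return queue.size();
    }
}

// One possible user-supplied implementation of the assumed interface.
final class SimpleLink implements Link {
    private final String from;
    private final String to;
    private final String anchorText;
    private final Date lastModified;

    SimpleLink(String from, String to, String anchorText, Date lastModified) {
        this.from = from;
        this.to = to;
        this.anchorText = anchorText;
        this.lastModified = lastModified;
    }

    @Override public String getId() { return to; }
    @Override public String getFrom() { return from; }
    @Override public String getTo() { return to; }
    @Override public String getAnchorText() { return anchorText; }
    @Override public Date getLastModifiedDate() { return lastModified; }
}
```

With something like this, the filter reads link.getLastModifiedDate() directly, and a droid can be declared against any class that implements Link rather than against one concrete task class.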

salu2

> We have a couple of
> developments that actively use link.getLastModifiedDate() in the
> filtering state and would now need to switch to
> link.getContentEntity().get("lastModifiedDate").
>
> The lastModified value is important for the execution of the task in some
> use cases, where you can filter on it. Further, IMO not every ContentEntity
> provides links (a list of new tasks).
>
> Regarding getTo and getFrom it is a bit different. Let me explain by
> example. A page may contain links, and each link creates a new Task whose
> getFrom is the page that contained the link and whose getTo is the linked page.
> Both can be used for filtering, so I would like to have them exposed
> directly in the link and not go via the contentEntity.
>
> In general, if I understand you correctly, you propose to move the
> "meta" data down to the contentEntity, but for me that meta data belongs to the task.
>
> salu2
>> Tobias
>>
>>
>> On Wed, Jan 30, 2013 at 12:05 PM, Thorsten Scherler <scherler@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Tobias, I saw that you dropped the Link interface and moved the links to
>>> the contentEntity. The problem I see is that a URL needs things like
>>> getAnchorText to be useful for the crawler. The same is true for
>>> getFrom and getTo, which are needed to implement mapping rules.
>>>
>>> Can I bring back the Link interface?
>>>
>>> salu2
>>>
>>> --
>>> Thorsten Scherler <scherler.at.gmail.com>
>>> codeBusters S.L. - web based systems
>>> <consulting, training and solutions>
>>>
>>> http://www.codebusters.es/
>>>
>>>
>


-- 
Thorsten Scherler <scherler.at.gmail.com>
codeBusters S.L. - web based systems
<consulting, training and solutions>

http://www.codebusters.es/

