incubator-droids-dev mailing list archives

From Thorsten Scherler <>
Subject Re: Link interface for crawler
Date Wed, 30 Jan 2013 13:30:29 GMT
On 01/30/2013 01:55 PM, Thorsten Scherler wrote:
> On 01/30/2013 12:31 PM, Tobias Rübner wrote:
>> Hi Thorsten,
>> I would propose to extend the ContentEntity and add the needed fields there.
>> The Task should only contain data relevant for executing the task.
>> All other "meta" data should be stored in the ContentEntity.
>> The getTo information can already be stored in ContentEntity.setLinks and
>> getFrom is a reverse search on the same field.
>> What do you think of this approach?
> I prefer a well defined interface since the ContentEntity is in the end
> a simple HashMap where we store information.

Another problem I see at the moment is that we declare
public abstract class CrawlingDroid extends AbstractDroid<LinkTask> {
but LinkTask is not an interface, so I cannot provide my own implementation.
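
A rough sketch of what binding the droid to an interface could look like. Everything here is an assumption for illustration: Link, SketchDroid, SketchCrawlingDroid and SimpleLink are hypothetical names, not the actual Droids classes, and the getters simply mirror the ones mentioned in this thread.

```java
import java.net.URI;
import java.util.Date;

// Hypothetical contract; method names follow the getters discussed in this thread.
interface Link {
    URI getFrom();              // page on which the link was found
    URI getTo();                // page the link points to
    String getAnchorText();
    Date getLastModifiedDate();
}

// Stand-in for the real AbstractDroid, reduced to its type parameter (assumption).
abstract class SketchDroid<T extends Link> {
    // Filtering reads the task metadata directly from the link.
    boolean modifiedSince(T link, Date threshold) {
        Date modified = link.getLastModifiedDate();
        return modified != null && modified.after(threshold);
    }
}

// Bound to the interface, so callers can plug in their own Link implementation.
class SketchCrawlingDroid<T extends Link> extends SketchDroid<T> {
}

// One possible user-provided implementation.
final class SimpleLink implements Link {
    private final URI from;
    private final URI to;
    private final String anchorText;
    private final Date lastModified;

    SimpleLink(URI from, URI to, String anchorText, Date lastModified) {
        this.from = from;
        this.to = to;
        this.anchorText = anchorText;
        this.lastModified = lastModified;
    }

    @Override public URI getFrom() { return from; }
    @Override public URI getTo() { return to; }
    @Override public String getAnchorText() { return anchorText; }
    @Override public Date getLastModifiedDate() { return lastModified; }
}
```

With a bound like this, a filter can call link.getLastModifiedDate() on any implementation without knowing the concrete task class.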


> We have a couple of
> developments that actively use link.getLastModifiedDate() in the
> filtering stage; those calls would now need to become
> link.getContentEntity().get("lastModifiedDate").
> The lastModified is important for the execution of the task in some
> use cases, where you can filter on it. Furthermore, IMO not all ContentEntity
> instances provide Links (a list of new tasks).
> Regarding getTo and getFrom it is a bit different. I will try to explain
> by example. A page may contain links, and each link creates a new Task where
> getFrom is the page that contained the link and getTo is the linked page.
> Both can be used for filtering, so I would like to have them exposed
> directly in the link and not go via the contentEntity.
> In general, if I understand you correctly, you propose to move the
> "meta" data down to the contentEntity, but for me that meta data belongs to the task.
> salu2
>> Tobias
>> On Wed, Jan 30, 2013 at 12:05 PM, Thorsten Scherler <>wrote:
>>> Hi all,
>>> Tobias, I saw that you dropped the Link interface and moved the links to
>>> the contentEntity. The problem I see is that a URL needs things like
>>> getAnchorText to be useful for the crawler. The same is true for
>>> getFrom and getTo, which are needed to implement mapping rules.
>>> Can I bring back the Link interface?
>>> salu2
>>> --
>>> Thorsten Scherler <>
>>> codeBusters S.L. - web based systems
>>> <consulting, training and solutions>

Thorsten Scherler <>
codeBusters S.L. - web based systems
<consulting, training and solutions>
