incubator-droids-dev mailing list archives

From Tobias Rübner <t...@apache.org>
Subject Re: Link interface for crawler
Date Wed, 30 Jan 2013 14:48:42 GMT
Hi Thorsten,

so the question is: to whom does the meta data belong?

To clarify my approach:
I wanted to define a set of meta data fields in the ContentEntity
that can be retrieved directly, e.g. getLinks or getContent.
This could also be extended to getLastModifiedDate, getContentSize,
getAuthor and so on.
All other (unknown/special) meta data could be stored in the ContentEntity
using the put / getValue methods.
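
A minimal sketch of this idea, not the actual Droids ContentEntity: the
class name SketchContentEntity, the field types and the setters are
assumptions made for illustration. Well-known meta data gets typed
accessors, everything else goes through put / getValue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and types are assumptions, not the Droids API.
class SketchContentEntity {
    // Well-known meta data with direct, typed access.
    private final List<String> links = new ArrayList<>();
    private Date lastModifiedDate;      // null means "unknown"
    private long contentSize = -1;      // -1 means "unknown"

    // Fallback store for unknown/special meta data.
    private final Map<String, Object> values = new HashMap<>();

    List<String> getLinks() {
        return Collections.unmodifiableList(links);
    }

    void addLink(String uri) {
        links.add(uri);
    }

    Date getLastModifiedDate() {
        return lastModifiedDate;
    }

    void setLastModifiedDate(Date lastModifiedDate) {
        this.lastModifiedDate = lastModifiedDate;
    }

    long getContentSize() {
        return contentSize;
    }

    void setContentSize(long contentSize) {
        this.contentSize = contentSize;
    }

    void put(String key, Object value) {
        values.put(key, value);
    }

    Object getValue(String key) {
        return values.get(key);
    }
}
```

With this shape a filter can call getContentSize() directly, while a
special field would be read with getValue("someKey").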

Here is an example header of an HTTP response:
ETag: "2a06f-681d-4d470723a1f80"
Vary: Accept-Encoding
Date: Wed, 30 Jan 2013 10:05:07 GMT
Content-Length: 26138
Last-Modified: Tue, 29 Jan 2013 17:08:30 GMT

IMO all of these fields are properties of the content.
Why should Last-Modified be treated differently from Content-Length?
Yes, it can be used for filtering, but I could also filter by author or
content length.
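
To illustrate that both fields can be filtered on in the same way, here
is a rough sketch. The class HeaderFilterSketch and its methods are
hypothetical, the headers are held in a plain Map, and java.time
(Java 8 or later) is assumed for parsing the HTTP date.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not part of Droids.
class HeaderFilterSketch {

    // Parses an HTTP date such as "Tue, 29 Jan 2013 17:08:30 GMT".
    static ZonedDateTime parseHttpDate(String value) {
        return ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME);
    }

    // True if Last-Modified is present and later than the threshold.
    static boolean modifiedSince(Map<String, String> headers, ZonedDateTime threshold) {
        String raw = headers.get("Last-Modified");
        return raw != null && parseHttpDate(raw).isAfter(threshold);
    }

    // True if Content-Length is present and below maxBytes.
    static boolean smallerThan(Map<String, String> headers, long maxBytes) {
        String raw = headers.get("Content-Length");
        return raw != null && Long.parseLong(raw) < maxBytes;
    }

    // Two of the header fields from the example response.
    static Map<String, String> exampleHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Length", "26138");
        headers.put("Last-Modified", "Tue, 29 Jan 2013 17:08:30 GMT");
        return headers;
    }
}
```

Both filters read a property of the content and neither needs anything
from the task.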

Adding this field to the task could also lead to confusion.
It could be read as the date the task itself was last modified,
e.g. when some data was added to the content entity, or maybe the date
the task was aborted.

So we have to think about what really is data of the task or necessary for
the execution of the task.


Regarding getTo/getFrom, I agree that you can move these to a separate
interface.
I would propose a LinkedTask interface.
For example, I can imagine that you want to abort a task and all its subtasks.
But then I would remove setLinks/getLinks from the ContentEntity.
Otherwise we have duplicate data.

So I would propose to rename LinkTask to CrawlerTask, which implements
LinkedTask.
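
A rough sketch of how such an interface could look. Only the names
LinkedTask, CrawlerTask, getFrom and getTo come from this thread; the
signatures, the abort / isAborted methods and the constructor are
assumptions for illustration, not existing Droids code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical interface; signatures are assumptions.
interface LinkedTask {
    LinkedTask getFrom();       // task this one was discovered from, null for a root
    List<LinkedTask> getTo();   // tasks discovered from this one
    void abort();               // aborts this task and all its subtasks
    boolean isAborted();
}

// Hypothetical implementation of the proposed CrawlerTask.
class CrawlerTask implements LinkedTask {
    private final String uri;
    private final CrawlerTask from;
    private final List<LinkedTask> to = new ArrayList<>();
    private boolean aborted;

    CrawlerTask(String uri, CrawlerTask from) {
        this.uri = uri;
        this.from = from;
        if (from != null) {
            from.to.add(this);  // register as subtask of the parent
        }
    }

    String getUri() {
        return uri;
    }

    @Override
    public LinkedTask getFrom() {
        return from;
    }

    @Override
    public List<LinkedTask> getTo() {
        return Collections.unmodifiableList(to);
    }

    @Override
    public void abort() {
        if (aborted) {
            return;             // already aborted, nothing to do
        }
        aborted = true;
        for (LinkedTask sub : to) {
            sub.abort();        // propagate to all subtasks
        }
    }

    @Override
    public boolean isAborted() {
        return aborted;
    }
}
```

Aborting a task then propagates downwards through getTo, while the
parent found via getFrom stays untouched.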

Tobias


On Wed, Jan 30, 2013 at 2:30 PM, Thorsten Scherler <scherler@gmail.com> wrote:

> On 01/30/2013 01:55 PM, Thorsten Scherler wrote:
> > On 01/30/2013 12:31 PM, Tobias Rübner wrote:
> >> Hi Thorsten,
> >>
> >> I would propose to extend the ContentEntity and add the needed fields
> >> there.
> >>
> >> The Task should only contain data relevant for executing the task.
> >> All other "meta" data should be stored in the ContentEntity.
> >> The getTo information can already be stored in ContentEntity.setLinks and
> >> getFrom is a reverse search on the same field.
> >>
> >> What do you think of this approach?
> > I prefer a well defined interface since the ContentEntity is in the end
> > a simple HashMap where we store information.
>
> Another problem I see ATM is that we declare
> public abstract class CrawlingDroid extends AbstractDroid<LinkTask> {
> but LinkTask is not an interface, so I cannot provide my own implementation.
>
> salu2
>
> > We have a couple of
> > developments that actively use link.getLastModifiedDate() in the
> > filtering state and would now need to switch to
> > link.getContentEntity().get("lastModifiedDate").
> >
> > The lastModified is important for the execution of the task in some
> > usecases, where you can filter on it. Further IMO not all ContentEntity
> > are providing Links (list of new tasks).
> >
> > Regarding getTo and getFrom it is a bit different. I will try to explain
> > by example. A page may have links, so it creates a new Task where
> > getFrom is the page which contained the new page as a link (stored in
> > getTo). Both can be used for filtering, so I would like to have them
> > exposed directly in the link and not go via the contentEntity.
> >
> > In general, if I understand you correctly, you propose to move the
> > "meta" data down to the contentEntity, but for me that meta is meta of
> > the task.
> >
> > salu2
> >> Tobias
> >>
> >>
> >> On Wed, Jan 30, 2013 at 12:05 PM, Thorsten Scherler <scherler@gmail.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Tobias I saw that you dropped the link interface but moved the links to
> >>> the contentEntity. The problem I see is that an URL needs stuff like
> >>> getAnchorText if it is useful for the crawler. This is as well true for
> >>> the getFrom and getTo stuff to implement mapping rules.
> >>>
> >>> Can I bring back the Link interface?
> >>>
> >>> salu2
> >>>
> >>> --
> >>> Thorsten Scherler <scherler.at.gmail.com>
> >>> codeBusters S.L. - web based systems
> >>> <consulting, training and solutions>
> >>>
> >>> http://www.codebusters.es/
> >>>
> >>>
> >
>
>
> --
> Thorsten Scherler <scherler.at.gmail.com>
> codeBusters S.L. - web based systems
> <consulting, training and solutions>
>
> http://www.codebusters.es/
>
>
