incubator-droids-dev mailing list archives

From "Mingfai Ma (JIRA)" <>
Subject [jira] Updated: (DROIDS-54) Make LinkTask supports arbitrary data by extends HashMap, and consider to refactor Task, Link, and LinkTask
Date Thu, 18 Jun 2009 10:51:07 GMT


Mingfai Ma updated DROIDS-54:


Attached is a sample implementation for review.

 - We can still make a LinkTask extend this base Link class, or just add more methods to this
class (and optionally rename it to LinkTask).
 - It stores the url as a String, but the constructor always calls new URI() to ensure the url
string is valid at construction time.
 - Methods like toString, equals and hashCode may be deleted in the final implementation, or
changed to follow this project's standard.
 - A few convenience methods are added, such as getHost(), getURI() and resolve(String).
resolve is added because URI itself has a resolve method. Using a LinkResolver with
the same base URI could be slightly more efficient.
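
A minimal sketch of the kind of base class described in the list above, for readers without the attachment. This is not the attached patch: the class name BasicLink, the method bodies, and the choice to rethrow URISyntaxException as an unchecked IllegalArgumentException are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical base class; names and behaviour are illustrative only.
class BasicLink {
    // The url is kept as the original String supplied by the caller.
    private final String url;

    BasicLink(String url) {
        try {
            // Parse once only to validate; the URI object is not stored.
            new URI(url);
        } catch (URISyntaxException e) {
            // Assumption: surface invalid input as an unchecked exception.
            throw new IllegalArgumentException("Invalid url: " + url, e);
        }
        this.url = url;
    }

    String getUrl() {
        return url;
    }

    // Safe to use URI.create here because the string was validated above.
    URI getURI() {
        return URI.create(url);
    }

    String getHost() {
        return getURI().getHost();
    }

    // Mirrors URI.resolve(String): resolves a reference against this link.
    URI resolve(String reference) {
        return getURI().resolve(reference);
    }
}
```

Re-parsing the String in getURI() on every call is the cost of not holding a URI field, which is presumably why a shared LinkResolver could be slightly cheaper.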

For my part, I am using a crawler derived from Droids, and I declare every use of Link as <T
extends Link>, e.g. LinkQueue<T extends Link> extends PriorityBlockingQueue<T>.
This could also be considered.
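
A rough sketch of what such a generic queue could look like. The Link class below is a hypothetical stand-in, not the Droids interface, and ordering by depth is an assumption chosen only because PriorityBlockingQueue needs either Comparable elements or a Comparator.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical minimal stand-in for the Link type discussed here.
class Link {
    private final String url;
    private final int depth;

    Link(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }

    String getUrl() {
        return url;
    }

    int getDepth() {
        return depth;
    }
}

// Generic queue: callers can plug in their own Link subclass as T.
class LinkQueue<T extends Link> extends PriorityBlockingQueue<T> {
    private static final long serialVersionUID = 1L;

    LinkQueue() {
        // Assumption: shallower links are fetched first.
        super(11, Comparator.<Link>comparingInt(Link::getDepth));
    }
}
```

With this shape, a derived crawler can declare LinkQueue<MyLink> and get MyLink back from poll() without casting.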

> Make LinkTask supports arbitrary data by extends HashMap, and consider to refactor Task,
Link, and LinkTask
> -----------------------------------------------------------------------------------------------------------
>                 Key: DROIDS-54
>                 URL:
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments:
> refer to the initial idea at:
> The current implementation of LinkTask
> {code}
> public class LinkTask implements Link, Serializable
> {
>   private Date started;
>   private final int depth;
>   private final URI uri;
>   private final Link from;
>   private Date lastModifedDate;
>   private Collection<URI> linksTo;
>   private String anchorText;
>   private int weight;
> {code}
> Suggested change:
> {code}
> public class LinkTask extends HashMap<String, Serializable> 
> or
> public class LinkTask extends HashMap<String, Serializable> implements Link
> {code}
> The minimum required attributes are:
>  - final ? id
>    - mainly to provide a minimal-size value to use as a hash key and to store in memory or a data grid for
lookup, e.g. as a history to avoid duplicate fetching. Refer to DROIDS-53.
>  - final String url
>    - the original String representation of the URL (preferred), or the encoded
representation (which seems worse).
>    - the url is the original one provided by the user at construction. Two different urls may
refer to the same resource, e.g. and, it's up to the
user to decide if they should be normalized. (and they could use the URL/LinkNormalizer in
> The other fields are basically optional.
>   - started/taskDate: if the queue uses it for sorting, then it's useful; otherwise, it's
just for logging.
>   - "weight" is another example of a field that not every implementation needs.
>   - "linksTo", a.k.a. outLinks, is also optional to attach to the LinkTask. An implementation
may extract the outlinks and put them in the queue directly without storing the outlinks in the
>   - "from", a.k.a. referrer, should not store the Link reference as it will affect GC.

> btw, should we also simplify Link, Task and LinkTask?  if we use a Map, it's very generic
already. Link and Task could be different concepts if we need to use them separately.
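
To make the suggestion in the quoted description concrete, here is a minimal sketch of a map-backed task. The class name MapLinkTask, the key names, and the use of String for the id (the description leaves its type open) are all assumptions. The referrer is stored as an id String rather than a Link reference, following the GC remark above.

```java
import java.io.Serializable;
import java.util.HashMap;

// Hypothetical map-backed task; key names and id type are assumptions.
class MapLinkTask extends HashMap<String, Serializable> {
    private static final long serialVersionUID = 1L;

    static final String ID = "id";
    static final String URL = "url";
    static final String FROM = "from";

    MapLinkTask(String id, String url) {
        // The two minimum required attributes are set at construction.
        put(ID, id);
        put(URL, url);
    }

    String getId() {
        return (String) get(ID);
    }

    String getUrl() {
        return (String) get(URL);
    }

    // Store the referrer's id, not the referring object, so that a chain
    // of tasks does not keep every ancestor reachable.
    void setFrom(String referrerId) {
        put(FROM, referrerId);
    }

    String getFrom() {
        return (String) get(FROM);
    }
}
```

Optional data such as weight is then just task.put("weight", 5). One trade-off of this sketch: because entries live in a plain HashMap, the "final" intent for id and url is not enforced, since any caller can overwrite those keys.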

