incubator-droids-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mingfai Ma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DROIDS-53) Implement a unique hash function for Task ID
Date Mon, 08 Jun 2009 08:50:08 GMT

    [ https://issues.apache.org/jira/browse/DROIDS-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717201#action_12717201
] 

Mingfai Ma commented on DROIDS-53:
----------------------------------

hashCode in integer is signed (with negative number). it might be better to use unsigned long
to represent signed integer hashcode. 

{code}
public long unsignInteger(int hashCode){
        long id = hashCode;
        id = (id << 32) >>> 32;
        return id;
}
{code}

> Implement a unique hash function for Task ID
> --------------------------------------------
>
>                 Key: DROIDS-53
>                 URL: https://issues.apache.org/jira/browse/DROIDS-53
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>            Reporter: Mingfai Ma
>
> For the SimpleTaskQueueWithHistory.previous, CrawlingWorker "// TODO -- make the hashvalue
for Outlink...", 
> after some research, it seems to smaller hash that we could get easy is the hashCode()
of URL String. It takes 16 bytes only. Java native hash code should be unique, otherwise,
any HashTable or HashMap will be wrong.
> if in case we need to represent it in String, we may compress the hashCode in base62,
such as with the following function:
> {code}
>   public static final char[] baseChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();
>   public static String base62(int hashCode) {
>     StringBuilder out = new StringBuilder(hashCode == 0 ? "0" : "");
>     while(hashCode!=0){
>       out.append(baseChars[hashCode % baseChars.size()])
>       hashCode = hashCode / baseChars.size()
>     }
>     return out.reverse().toString();
>   }
> {code}
> e.g.
> {code}
> assertEquals("c55Ow", base62("http://incubator.apache.org/droids/".hashCode()) );
> {code}
> for 5 characters, it's 48 bytes.
> It is important to get a short, unique hash function for URL. For any crawler that do
not want duplication, we unavoidably have to store a URL hash in memory or data grid. (unless
it is stored on disk or in a database at the cost of higher look up time) With a hashCode
that takes 16bytes, it's around 15M heap size for 1M URLs. 
> This task is related to DROIDS-52

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message