incubator-droids-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mingfai Ma (JIRA)" <j...@apache.org>
Subject [jira] Created: (DROIDS-53) Implement a unique hash function for Task ID
Date Fri, 29 May 2009 07:56:45 GMT
Implement a unique hash function for Task ID
--------------------------------------------

                 Key: DROIDS-53
                 URL: https://issues.apache.org/jira/browse/DROIDS-53
             Project: Droids
          Issue Type: New Feature
          Components: core
            Reporter: Mingfai Ma


For the SimpleTaskQueueWithHistory.previous, CrawlingWorker "// TODO -- make the hashvalue
for Outlink...", 
after some research, it seems to smaller hash that we could get easy is the hashCode() of
URL String. It takes 16 bytes only. Java native hash code should be unique, otherwise, any
HashTable or HashMap will be wrong.

if in case we need to represent it in String, we may compress the hashCode in base62, such
as with the following function:

{code}
  public static final char[] baseChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();
  public static String base62(int hashCode) {
    StringBuilder out = new StringBuilder(hashCode == 0 ? "0" : "");
    while(hashCode!=0){
      out.append(baseChars[hashCode % baseChars.size()])
      hashCode = hashCode / baseChars.size()
    }
    return out.reverse().toString();
  }
{code}

e.g.
{code}
assertEquals("c55Ow", base62("http://incubator.apache.org/droids/".hashCode()) );
{code}

for 5 characters, it's 48 bytes.

It is important to get a short, unique hash function for URL. For any crawler that do not
want duplication, we unavoidably have to store a URL hash in memory or data grid. (unless
it is stored on disk or in a database at the cost of higher look up time) With a hashCode
that takes 16bytes, it's around 15M heap size for 1M URLs. 

This task is related to DROIDS-52

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message