incubator-droids-dev mailing list archives

From "Mingfai Ma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DROIDS-52) Optimize memory usage of TaskQueue and History
Date Thu, 28 May 2009 07:07:45 GMT

    [ https://issues.apache.org/jira/browse/DROIDS-52?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713895#action_12713895 ]

Mingfai Ma commented on DROIDS-52:
----------------------------------

One more set of figures, for the URL "http://www.apache.org/12345678":
 - as a URI, it takes 224 bytes
 - as a String, it takes 96 bytes
 - as a byte[], it takes 48 bytes

Just as an example, in the LinkTask we could store the URI as a byte[].

Storing data as byte[] is quite extreme, though. In crawling, we should probably consider which of CPU, memory,
or bandwidth is the most costly.
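
To make the byte[] idea concrete, here is a minimal sketch of a task that keeps its URI as bytes and rebuilds the URI object only when asked. The class name CompactLinkTask and its methods are hypothetical and are not part of Droids; the sketch assumes the URI round-trips through its ASCII form.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration only; not part of Droids.
final class CompactLinkTask {
    private final byte[] uriBytes; // the URI is held as raw bytes, not as a URI object
    private final int depth;

    CompactLinkTask(URI uri, int depth) {
        // toASCIIString() escapes non-ASCII characters, so US-ASCII encoding loses nothing.
        this.uriBytes = uri.toASCIIString().getBytes(StandardCharsets.US_ASCII);
        this.depth = depth;
    }

    // Rebuilds the URI on demand, trading CPU time for memory.
    URI getUri() {
        return URI.create(new String(uriBytes, StandardCharsets.US_ASCII));
    }

    int getDepth() {
        return depth;
    }

    // Payload length only; the array header adds a few bytes on top of this.
    int storedLength() {
        return uriBytes.length;
    }

    public static void main(String[] args) {
        CompactLinkTask task = new CompactLinkTask(URI.create("http://www.apache.org/12345678"), 1);
        System.out.println(task.getUri() + " held as " + task.storedLength() + " payload bytes");
    }
}
```

The cost of this layout is that every call to getUri() parses the URI again, which is the CPU side of the trade-off mentioned above.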

In Droids, the use of URI, String, and Link is not very standardized:
 - URLFilter: String filter(String urlString); 
 - parser: Parse parse(ContentEntity entity, Link link) throws DroidsException, IOException;
 - handler: void handle(URI uri, ContentEntity entity)
 - LinkTask: public LinkTask( Link from, URI uri, int depth )
 
On a modern CPU, constructing a URI is quite cheap. In a quick test on my PC, the following
piece of code takes 5s to run:
{code}
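        // Note: new URI(String) throws URISyntaxException, so the enclosing
        // method must declare or catch it.
        int max = 1000000;
        String url = "http://www.apache.org/";
        byte[] bytes = url.getBytes();

        // Rebuild a String and a URI from the raw bytes on every iteration.
        long beginTime = System.currentTimeMillis();
        for (int i = 0; i < max; i++) {
            new URI(new String(bytes));
        }
        System.out.println("elapsed time: " + (System.currentTimeMillis() - beginTime) + "ms");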
{code}

My initial thought is, we should standardize the interface to use either URI or String. 
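
As a rough sketch of what standardizing on URI could look like, the adapter below wraps a String-based filter behind a URI-based one. StringUrlFilter is a local stand-in that only mirrors the filter(String) signature listed above; UriFilter and StringFilterAdapter are hypothetical names. It also assumes that a filter returns null to reject a URL, which the signature alone does not confirm.

```java
import java.net.URI;

// Local stand-in mirroring the String-based signature; not the Droids interface itself.
interface StringUrlFilter {
    String filter(String urlString);
}

// Hypothetical URI-based variant of the same contract.
interface UriFilter {
    URI filter(URI uri);
}

// Lets existing String-based filters be used where a URI-based filter is expected.
final class StringFilterAdapter implements UriFilter {
    private final StringUrlFilter delegate;

    StringFilterAdapter(StringUrlFilter delegate) {
        this.delegate = delegate;
    }

    @Override
    public URI filter(URI uri) {
        String result = delegate.filter(uri.toString());
        // Assumption: a null result means the URL was rejected.
        return result == null ? null : URI.create(result);
    }
}
```

An adapter like this would allow the interfaces to migrate one at a time, at the price of one extra String/URI conversion per call.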

> Optimize memory usage of TaskQueue and History
> ----------------------------------------------
>
>                 Key: DROIDS-52
>                 URL: https://issues.apache.org/jira/browse/DROIDS-52
>             Project: Droids
>          Issue Type: Wish
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>            Priority: Minor
>         Attachments: TaskQueueMemoryTest.java
>
>
> Tasks in TaskQueue and History are items that have to be "persisted" for a single crawl "session"/run. They do not consume too much memory right now; this issue was created to track some optimization ideas.
> The following are some sample memory usage figures from a 32-bit Windows Vista environment (refer to the attached test case):
>  - Measured with javamex Classmexer, 1M LinkTask objects in a queue consume 280M of memory.
>  - For history, if the MD5 is stored as a String, each URL takes only 104 bytes, so 1M URLs take roughly 100M. (reference: http://www.javamex.com/tutorials/memory/string_memory_usage.shtml) Note that MD5 is not guaranteed to be unique, but it should be OK for general cases.
>  - To reduce the memory footprint further, we may store the MD5 as byte[], which takes exactly 32 bytes and consumes 32M of memory for 1M records.
> Previously, I ran a job in which I tried to reduce the memory usage of the TaskQueue by simulating a queue with JBossCache, which supports eviction and passivation. JBossCache's passivation mechanism basically serializes an item into a database (or other device) and unloads it from memory. It can effectively reduce memory usage. For a queue with lots of items, there is no need to keep them all in memory, as they won't be processed at the same time anyway. If it is necessary to keep a reference, we may passivate the LinkTask and just keep a hash (MD5, or even hashCode()).
> Another option is to store the tasks in an embedded database such as H2Database, which stores the data on disk.
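
The quoted idea of keeping history entries as MD5 bytes could be sketched as follows. Md5History is a hypothetical class, not Droids code. Because byte[] compares by identity, the sketch wraps each digest in a ByteBuffer so the set compares by content; that wrapper costs extra memory per entry, so a footprint-focused version would need a more compact layout.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class for illustration only; not part of Droids.
final class Md5History {
    // ByteBuffer implements equals/hashCode over its content, unlike a bare byte[].
    private final Set<ByteBuffer> seen = new HashSet<>();

    // Returns true if the URL had not been recorded before.
    boolean add(String url) {
        return seen.add(ByteBuffer.wrap(md5(url)));
    }

    boolean contains(String url) {
        return seen.contains(ByteBuffer.wrap(md5(url)));
    }

    // An MD5 digest is 16 bytes of payload; array overhead comes on top of that.
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Md5History history = new Md5History();
        System.out.println(history.add("http://www.apache.org/"));      // first visit
        System.out.println(history.add("http://www.apache.org/"));      // repeat visit
        System.out.println(history.contains("http://www.apache.org/x")); // never added
    }
}
```

As the description notes, two different URLs can in principle share an MD5, in which case the second one would be treated as already visited.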

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

