hama-user mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: [ANNOUNCEMENT] A query system for BSP processing
Date Thu, 30 Aug 2012 14:02:26 GMT
Shall we work together?

On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <fegaras@cse.uta.edu> wrote:
> Thank you very much for your interest and for testing my system.
> It seems that my release was premature: It worked for some random data but
> didn't for some others. It's a minor logical error that I will try to fix in
> the next few days. The problem is with the stopping condition of the repeat
> expression that calculates the new pagerank from the old. It must stop if
> ALL peers reach the specified precision. This is done by having those peers
> that need to continue send a message to others to continue. It seems that
> now when all peers agree at the same time, the program works fine. But if
> one finishes sooner, instead of continuing the repeat loop, it runs away to
> the next BSP step that follows the repeat, then exits prematurely and the
> system hangs. The casting errors are due to the run-away peers executing the
> wrong BSP steps reading wrong messages. Queries without repeat though are
> OK.
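
The stopping rule described above (the loop ends only when ALL peers have reached the precision, with unfinished peers telling everyone else to continue) can be sketched as a small single-process simulation. This is only an illustration under stated assumptions, not MRQL's or Hama's actual code: the function name, the `shrink` factor and the per-peer error values are hypothetical, and the barrier sync is modelled by filling every inbox before any peer makes its decision.

```python
def run_until_all_converged(errors, precision, shrink=0.5, max_supersteps=100):
    """Simulate BSP peers that repeat a step until ALL peers are within `precision`.

    `errors` holds one hypothetical error value per peer (a stand-in for the
    change between the old and the new pagerank on that peer).
    """
    errors = list(errors)
    n = len(errors)
    supersteps = 0
    while supersteps < max_supersteps:
        # Local phase: every peer performs one round of work, including
        # peers that are already within the precision.
        errors = [e * shrink for e in errors]

        # Communication phase: each peer that still needs to continue
        # sends a "continue" message to every peer, itself included.
        inbox = [0] * n
        for sender in range(n):
            if errors[sender] > precision:
                for receiver in range(n):
                    inbox[receiver] += 1
        supersteps += 1

        # After the barrier every peer sees the same message count, so all
        # peers take the same decision and none leaves the loop early.
        if all(count == 0 for count in inbox):
            break
    return supersteps, errors


steps, final_errors = run_until_all_converged([1.0, 8.0], precision=0.6)
print(steps, final_errors)
```

In this sketch the first peer is within the precision after one round but keeps iterating until the second peer is done, which is the behaviour the fix has to guarantee.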
> By the way, I had a problem exchanging large amounts of data during sync (I
> discussed this with Thomas). My solution was to break a BSP superstep
> into multiple substeps so that each substep can handle a max number of
> messages. Of course my program has to collect all messages in a vector in
> memory. When the vector is too big, it is spilled in a local file. This
> moved the problem from the Hama side to my side and allowed me to handle
> larger data, especially in joins. I think this problem of exchanging large
> amounts of data during a superstep is currently a weakness of Hama.
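
The workaround described above (capping the number of messages per substep and spilling the in-memory vector to a local file when it grows too large) might look roughly like the following sketch. All names here (`substeps`, `SpillingBuffer`, `max_in_memory`) are hypothetical, and the use of `pickle` with a temporary file is an assumption made for illustration; the original system's serialization and file handling are not described in this thread.

```python
import os
import pickle
import tempfile


def substeps(messages, max_per_substep):
    """Split one superstep's messages into substeps of bounded size."""
    for start in range(0, len(messages), max_per_substep):
        yield messages[start:start + max_per_substep]


class SpillingBuffer:
    """Collect messages in memory and spill them to a local temporary file
    whenever the in-memory list reaches `max_in_memory` entries."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_file = None
        self.spilled = 0  # number of messages written to the file so far

    def add(self, message):
        self.memory.append(message)
        if len(self.memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        for message in self.memory:
            pickle.dump(message, self.spill_file)
        self.spilled += len(self.memory)
        self.memory = []

    def __iter__(self):
        # Replay spilled messages first, then the ones still in memory,
        # so messages come back in arrival order.
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled):
                yield pickle.load(self.spill_file)
            # Move back to the end so later spills append correctly.
            self.spill_file.seek(0, os.SEEK_END)
        yield from self.memory


outgoing = list(range(10))
received = SpillingBuffer(max_in_memory=3)
for batch in substeps(outgoing, max_per_substep=4):
    # In a real BSP job each batch would be sent and followed by a sync.
    for message in batch:
        received.add(message)
print(received.spilled, list(received))
```

The design trade-off is the one named in the message: the sender bounds how much data crosses a single barrier, and the receiver pays for it with local disk I/O instead of memory.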
> Leonidas
> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
>> BTW, should we feature this on our website?
>> 2012/8/24 Thomas Jungblut <thomas.jungblut@gmail.com>
>>> Hi Leonidas!
>>> I have to admit that I have known what is going on (and had to keep
>>> silent), but I have to say: Thank you very much!
>>> This will help many people write BSPs in an easier way.
>>> Of course this is not as fast as native BSP code; Hive and Pig suffer
>>> from the same problem in MR.
>>> But it gives people the opportunity to develop faster and get their code
>>> in production with just a minor time expense.
>>> And I think that we will gladly help you improve the BSP part of
>>> your framework. At least I would ;)
>>> Thanks!
>>> 2012/8/24 Edward J. Yoon <edwardyoon@apache.org>
>>>> Here are a few of my test results on Oracle BDA (40G/s infiniband network).
>>>> It seems slower than our PageRank example.
>>>> P.S., There are some errors, so I couldn't test at large scale.
>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
>>>> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
>>>> sequence ..., etc.)
>>>> == 100K nodes and 1M edges ==
>>>> *** Using 10 BSP tasks (out of a max 10). Each task will handle about
>>>> 2383611 bytes of input data.
>>>> Run time: 30.384 secs
>>>> *** Using 20 BSP tasks (out of a max 20). Each task will handle about
>>>> 1191805 bytes of input data.
>>>> Run time: 24.412 secs
>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon <edwardyoon@apache.org>
>>>> wrote:
>>>>> Wow, very interesting. I'm going to install and test on my large
>>>>> cluster.
>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras <fegaras@cse.uta.edu>
>>>>> wrote:
>>>>>> Dear Hama users,
>>>>>> I am pleased to announce that the MRQL query processing system can
>>>>>> evaluate SQL-like queries on a Hama cluster. MRQL is available at:
>>>>>> http://lambda.uta.edu/mrql/
>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like query language for
>>>>>> large-scale, distributed data analysis. MRQL is powerful enough to
>>>>>> express most common data analysis tasks over many different kinds of
>>>>>> raw data, including hierarchical data and nested collections, such as
>>>>>> XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
>>>>>> Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using Apache
>>>>>> Hama. Both modes use Apache's HDFS to read and write their data.
>>>>>> Note that the BSP mode is currently experimental (not fine-tuned)
>>>>>> and lacks any fault-tolerance (if an error occurs, the entire job must
>>>>>> be restarted). Due to our limited resources, MRQL has only been tested
>>>>>> on a small cluster (7-nodes/28-cores). We compared the BSP mode with
>>>>>> the MR mode by evaluating a pagerank query over a small graph (100K
>>>>>> nodes, 1M edges) and found that BSP mode is about 4.5 times faster
>>>>>> than the MR mode. Please let me know if you'd like to contribute to
>>>>>> this project by testing MRQL on a larger cluster.
>>>>>> Best regards,
>>>>>> Leonidas Fegaras
>>>>>> University of Texas at Arlington
>>>>> --
>>>>> Best Regards, Edward J. Yoon
>>>>> @eddieyoon
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon

Best Regards, Edward J. Yoon
