Hi Johannes,
You would likely not avoid it if you went with the approach of multiple DAGs. For most iterative
programs, you do need to checkpoint at some point. The checkpoint would likely need to be
reliable to reduce the amount of recomputation needed if the checkpointed data is lost.
An option would be to use something like the HDFS in-memory storage tier (which lazily persists
to disk ) to reduce the perf overhead. Also, in terms of loop unrolling, a single DAG could
be preconstructed to run multiple iterations using multiple vertices and then use the final
vertex of the DAG as a checkpointing mechanism after N iterations/vertices.
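
To make the unrolling idea concrete, here is a minimal sketch in plain Python. It does not use the Tez API; the "DAG" is just an ordered chain of functions, and every name in it (gradient_step, build_unrolled_dag, run_dag) is hypothetical. It only shows the shape: N iteration vertices followed by one final vertex that writes the checkpoint.

```python
import json


def gradient_step(state):
    """One iteration vertex: a gradient step toward the minimum of (x - 3)^2."""
    x = state["x"]
    return {"x": x - 0.1 * 2 * (x - 3.0), "iteration": state["iteration"] + 1}


def build_unrolled_dag(n_iterations, checkpoint_path):
    """Build a linear chain: N iteration vertices, then a checkpoint vertex.

    Assumption: a local JSON file stands in for a reliable store such as HDFS.
    """
    def checkpoint(state):
        # Final vertex: persist the state reached after N iterations.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
        return state

    vertices = [(f"iter_{i}", gradient_step) for i in range(n_iterations)]
    vertices.append(("checkpoint", checkpoint))
    return vertices


def run_dag(vertices, state):
    """Run the vertices in order, passing each output to the next vertex."""
    for _name, fn in vertices:
        state = fn(state)
    return state
```

With this shape, intermediate iterations hand data straight to the next vertex and only the last vertex pays the cost of a reliable write.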
Also, depending on the amount of data being written out, the overhead of writing to HDFS may
not be too high. Furthermore, with Tez sessions, there is no real overhead of launching a
new DAG (if some containers are retained) as compared to trying to do the same with multiple
MR jobs.
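
For the multiple-DAG variant, a rough sketch of the driver loop could look like the following. Again this is plain Python rather than Tez code: FakeSession is a hypothetical stand-in for a long-lived session, a local JSON file stands in for HDFS, and the update rule is a toy. The point is the control flow: run one DAG per iteration, checkpoint its output, and let a convergence check decide whether to submit another.

```python
import json
import os


class FakeSession:
    """Hypothetical stand-in for a session that is reused across DAG submissions."""

    def __init__(self):
        self.dags_run = 0

    def run_dag(self, dag, state):
        self.dags_run += 1
        return dag(state)


def one_iteration_dag(state):
    """Toy DAG: one gradient step on (x - 3)^2; also reports how far x moved."""
    x = state["x"]
    new_x = x - 0.25 * 2 * (x - 3.0)
    return {"x": new_x, "delta": abs(new_x - x)}


def load_checkpoint(path, default):
    """Return the last checkpointed state, or the default if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return default


def save_checkpoint(path, state):
    """Write to a temp file and rename, so a crash never leaves a partial checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def iterate(session, checkpoint_path, tolerance=1e-6, max_dags=100):
    """Submit DAGs in a loop until the state converges or max_dags is reached."""
    state = load_checkpoint(checkpoint_path, {"x": 0.0, "delta": float("inf")})
    while state["delta"] > tolerance and session.dags_run < max_dags:
        state = session.run_dag(one_iteration_dag, state)
        save_checkpoint(checkpoint_path, state)
    return state
```

If the driver dies, calling iterate again with the same checkpoint path resumes from the last saved state instead of starting over.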
— Hitesh
On Mar 25, 2015, at 2:02 AM, Johannes Zillmann <jzillmann@googlemail.com> wrote:
> Hey Gopal,
>
>> On 25 Mar 2015, at 05:26, Gopal Vijayaraghavan <gopalv@apache.org> wrote:
>>
>> Hi,
>>
>> Iterative algorithms are expressed as DAGs in a loop.
>>
>> The acyclic nature of DAGs, whether in Tez or Spark (since you mention the
>> paper) makes that the natural way to implement the repeated application
>> of the same operation over the same data, with a decision condition
>> determining whether to stay in the loop or not.
>
> Can you point to a piece of code which implements this approach?
> If each loop operation is a single DAG, how would that avoid the HDFS barrier?
>
> Johannes
>
>>
>> You might want to look at last year's Hadoop Summit presentations for a
>> direct example of Iterative algorithms with Tez.
>>
>> http://www.slideshare.net/Hadoop_Summit/pigontezlowlatencyetlwithbig
>> data/25
>>
>>
>> Logistic regression needs you to use a library which implements that
>> specific algorithm [1].
>>
>> On that note, something which needs incremental iteration can probably be
>> even more efficient in Tez than these approaches if you unroll the
>> iteration as 1-1 edges, with all of the final tasks ending up generating outputs.
>>
>> Cheers,
>> Gopal
>> [1]  https://github.com/myui/hivemall#regression
>>
>>
>> On 3/24/15, 8:43 PM, "Chang Chen" <baibaichen@gmail.com> wrote:
>>
>>> Hi
>>>
>>> According to the PhD dissertation
>>> <http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS201412.pdf> of Matei
>>> Zaharia, there are four computation models in large-scale clusters:
>>>
>>>
>>> 1. *Iterative algorithms*, such as graph processing and machine learning
>>> algorithms
>>> 2. *Relational query*
>>> 3. *MapReduce*, a general parallel computation model
>>> 4. *Stream processing*
>>>
>>> Obviously, Tez supports #2 and #3, but for #1 and #4, I don't see any
>>> examples.
>>>
>>> As for streaming, I guess if we implement an appropriate input, there is no
>>> reason that Tez can't support it in theory.
>>>
>>> But for machine learning, how do we use vertices and edges to express
>>> *Logistic Regression*?
>>>
>>> Thanks
>>> Chang
>>
>>
>
