spark-dev mailing list archives

From "Reynold Xin" <r...@databricks.com>
Subject Re: Spark DAG scheduler
Date Fri, 17 Apr 2020 00:07:33 GMT
If you are talking about a tree, then the RDDs are nodes, and the dependencies are the edges.

If you are talking about a DAG, then the partitions in the RDDs are the nodes, and the dependencies
between the partitions are the edges.
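
To make the two views concrete, here is a minimal sketch in plain Python. It does not use Spark: `ToyRDD`, `lineage`, `rdd_edges`, and `partition_edges` are hypothetical names invented for illustration, and the "narrow" case is simplified to a one-to-one partition mapping (as with a map or filter), which assumes parent and child have the same number of partitions.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRDD:
    """Hypothetical stand-in for an RDD; not part of the Spark API."""
    name: str
    num_partitions: int
    # Each entry is (parent, kind), with kind "narrow" or "shuffle".
    parents: list = field(default_factory=list)


def lineage(rdd):
    """Return rdd and all of its ancestors, each listed once."""
    seen, order, stack = set(), [], [rdd]
    while stack:
        node = stack.pop()
        if node.name in seen:
            continue
        seen.add(node.name)
        order.append(node)
        stack.extend(parent for parent, _ in node.parents)
    return order


def rdd_edges(rdd):
    """RDD-level view: nodes are RDDs, edges are dependencies."""
    return {(child.name, parent.name)
            for child in lineage(rdd)
            for parent, _ in child.parents}


def partition_edges(rdd):
    """Partition-level view: nodes are (RDD name, partition index)."""
    edges = set()
    for child in lineage(rdd):
        for parent, kind in child.parents:
            for i in range(child.num_partitions):
                if kind == "narrow":
                    # Simplification: one-to-one, as with map or filter.
                    sources = [i]
                else:
                    # Shuffle: every parent partition may contribute.
                    sources = range(parent.num_partitions)
                for j in sources:
                    edges.add(((child.name, i), (parent.name, j)))
    return edges


source = ToyRDD("source", 2)
mapped = ToyRDD("mapped", 2, [(source, "narrow")])
reduced = ToyRDD("reduced", 3, [(mapped, "shuffle")])

print(sorted(rdd_edges(reduced)))
print(len(partition_edges(reduced)))
```

In Spark itself, the RDD-level lineage can be inspected through `RDD.dependencies` and `RDD.toDebugString`; the toy model above only mirrors the shape of that information.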

On Thu, Apr 16, 2020 at 4:02 PM, Mania Abdi <abdi.ma@husky.neu.edu> wrote:

> 
> Is it correct to say that the nodes in the DAG are RDDs and the edges are
> computations?
> 
> 
> On Thu, Apr 16, 2020 at 6:21 PM Reynold Xin <rxin@databricks.com> wrote:
> 
> 
>> The RDD is the DAG.
>> 
>> 
>> 
>> On Thu, Apr 16, 2020 at 3:16 PM, Mania Abdi <abdi.ma@husky.neu.edu> wrote:
>> 
>>> Hello everyone,
>>> 
>>> I am implementing a caching mechanism for analytic workloads running on
>>> top of Spark, and I need to retrieve the Spark DAG right after it is
>>> generated, as well as the DAG scheduler. I would appreciate it if you
>>> could give me some hints or point me to some documents about where the
>>> DAG is generated and where inputs are assigned to it. I found the
>>> DAGScheduler class (
>>> https://github.com/apache/spark/blob/55dea9be62019d64d5d76619e1551956c8bb64d0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
>>> ), but I am not sure if it is a good starting point.
>>> 
>>> 
>>> 
>>> Regards
>>> Mania