spark-user mailing list archives

From Haoyuan Li
Subject Re: Spark or Tachyon: capture data lineage
Date Fri, 02 Jan 2015 20:32:06 GMT

Great question. Spark and Tachyon capture lineage information at different
granularities. We are working on an integration between Spark and Tachyon
for this, and hope to have it ready for release soon.



On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam wrote:

> Hi Spark developers,
> I was thinking it would be nice to extract the data lineage information
> from a data processing pipeline. I assume that Spark/Tachyon keeps this
> information somewhere. For instance, a data processing pipeline uses
> data sources A and B to produce C. C is then used by another process to
> produce D and E. Assuming A, B, C, D, and E are stored on disk, it would
> be very useful to have a way to capture this information when we are using
> Spark/Tachyon, and then to query the data lineage. For example: give me
> the datasets that produce E. It should return a graph like (A and B)->C->E.
> Is this already possible with Spark/Tachyon? If not, do you
> think it is feasible? Would anyone mind sharing their experience in
> capturing data lineage in a data processing pipeline?
> Best Regards,
> Jerry
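
For illustration only, the query described above ("give me datasets that
produce E") can be sketched as a walk over a small dependency graph. This is
not how Spark or Tachyon store lineage; the PRODUCED_FROM mapping and the
upstream() and lineage_edges() helpers are hypothetical names, and the sketch
assumes the lineage has already been recorded as "output -> inputs" pairs.

```python
from collections import deque

# Hypothetical lineage records (an assumption, not a Spark/Tachyon format):
# each output dataset maps to the input datasets that produced it.
PRODUCED_FROM = {
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["C"],
}


def upstream(dataset, produced_from):
    """Return every dataset that directly or indirectly feeds `dataset`."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in produced_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def lineage_edges(dataset, produced_from):
    """Return the (input, output) edges of the subgraph leading to `dataset`."""
    nodes = upstream(dataset, produced_from) | {dataset}
    return sorted(
        (parent, child)
        for child in nodes
        for parent in produced_from.get(child, [])
    )


print(sorted(upstream("E", PRODUCED_FROM)))  # ['A', 'B', 'C']
print(lineage_edges("E", PRODUCED_FROM))     # [('A', 'C'), ('B', 'C'), ('C', 'E')]
```

The edge list for E corresponds to the (A and B)->C->E graph from the
question; D is left out because it does not feed E.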

Haoyuan Li
AMPLab, EECS, UC Berkeley
