spark-user mailing list archives

From ☼ R Nair (रविशंकर नायर) <>
Subject Dataframe caching
Date Fri, 20 Jan 2017 14:27:42 GMT
Dear all,

Here is a requirement I am thinking of implementing in Spark core. Please
let me know if this is possible, and kindly provide your thoughts.

A user executes a query that fetches 1 million records from, let's say, a
database. We let the user store the result as a DataFrame, partitioned
across the cluster.

Another user then executes the same query from another session. Is there
any way to let the second user reuse the DataFrame created by the first
user?

Can we have a master DataFrame (or RDD) that stores information about the
DataFrames currently loaded and matches it against queries coming from
other users?

In this way, we would have a wonderful system that never executes the same
query again or reloads its result into cluster memory.

Best, Ravion
