spark-user mailing list archives

From "Evo Eftimov" <evo.efti...@isecc.com>
Subject RE: How to share large resources like dictionaries while processing data with Spark?
Date Fri, 05 Jun 2015 11:19:58 GMT
Spark can use Tachyon for RDD storage – in Spark 1.x an RDD persisted with the OFF_HEAP
storage level is kept there in serialized form – so if you have a batch RDD persisted that
way, you are using Tachyon implicitly. The difference when you use Tachyon explicitly, i.e.
as a distributed, in-memory file system, is that you can share data between Jobs, whereas an
RDD is only ever visible to Jobs using the same SparkContext
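
For illustration, a minimal Spark 1.x sketch of the two modes; the data set path and the
Tachyon master address are made up, and OFF_HEAP assumes Tachyon is already configured for
the cluster:

    import org.apache.spark.storage.StorageLevel

    // Implicit use: the OFF_HEAP storage level keeps the serialized RDD
    // blocks in Tachyon rather than on the JVM heap.
    val batchRdd = sc.textFile("hdfs:///data/dictionary.tsv")
      .persist(StorageLevel.OFF_HEAP)

    // Explicit use: Tachyon as a distributed in-memory file system. Data
    // written this way outlives the SparkContext and can be shared between Jobs.
    batchRdd.saveAsTextFile("tachyon://tachyon-master:19998/shared/dictionary")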

 

From: Charles Earl [mailto:charles.cearl@gmail.com] 
Sent: Friday, June 5, 2015 12:10 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like dictionaries while processing data with Spark?

 

Would Tachyon be appropriate here?

On Friday, June 5, 2015, Evo Eftimov <evo.eftimov@isecc.com> wrote:

Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark batch jobs (besides,
anyone can put something like that together in 5 minutes), while I am under the impression that
Dmitry is working on a Spark Streaming app 

 

Besides, the Job Server is essentially for sharing the SparkContext between multiple threads

 

Re Dmitry's initial question – you can load large data sets as a batch (static) RDD from any
Spark Streaming app and then join DStream RDDs against them to emulate "lookups"; you can
also try the "Lookup RDD" – there is a GitHub project for it
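
A minimal sketch of that batch-RDD join pattern, assuming sc is the SparkContext, stream is
an existing DStream, and extractKey and the HDFS path are made up for the example:

    // Load the large dictionary once as a static (batch) RDD of (key, value)
    // pairs and cache it so every micro-batch reuses the in-memory copy.
    val dictionary = sc.textFile("hdfs:///data/dictionary.tsv")
      .map { line => val Array(k, v) = line.split("\t"); (k, v) }
      .cache()

    // Key each micro-batch, then join it against the static RDD
    // inside transform() to emulate a lookup.
    val enriched = stream
      .map(record => (extractKey(record), record))
      .transform(rdd => rdd.join(dictionary))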

 

From: Dmitry Goldenberg [mailto:dgoldenberg123@gmail.com] 
Sent: Friday, June 5, 2015 12:12 AM
To: Yiannis Gkoufas
Cc: Olivier Girardot; user@spark.apache.org

Subject: Re: How to share large resources like dictionaries while processing data with Spark?

 

Thanks so much, Yiannis, Olivier, Huang!

 

On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas <johngouf85@gmail.com> wrote:

Hi there,

 

I would recommend checking out https://github.com/spark-jobserver/spark-jobserver, which I
think provides the functionality you are looking for.

I haven't tested it though.

 

BR

 

On 5 June 2015 at 01:35, Olivier Girardot <ssaboum@gmail.com> wrote:

You can use it as a broadcast variable, but if it's "too" large (more than 1 GB, I'd guess),
you may need to share it by joining it to the other RDDs using some kind of key.

But this is the kind of thing broadcast variables were designed for.
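
A minimal broadcast sketch, assuming sc is the SparkContext, records is an existing RDD with
a hypothetical key field, and loadDictionary() stands in for however the dictionary is built
on the driver:

    // Broadcast once from the driver; Spark ships a single read-only copy to
    // each executor instead of serializing the map into every task closure.
    val dict: Map[String, String] = loadDictionary()
    val dictBc = sc.broadcast(dict)

    val resolved = records.map { r =>
      // Executors read the shared copy through .value.
      dictBc.value.getOrElse(r.key, "UNKNOWN")
    }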

 

Regards, 

 

Olivier.

 

On Thu, Jun 4, 2015 at 11:50 PM, dgoldenberg <dgoldenberg123@gmail.com> wrote:

We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.

What would be the best strategy for sharing such resources among the
transformations/actions within a consumer? Can they be shared somehow
across the RDDs?

I'm looking for a way to load such a resource once into the cluster memory
and have it be available throughout the lifecycle of a consumer...

Thanks.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


 

 



-- 
- Charles

