spark-user mailing list archives

From Charles Earl <charles.ce...@gmail.com>
Subject Re: How to share large resources like dictionaries while processing data with Spark ?
Date Fri, 05 Jun 2015 11:09:52 GMT
Would Tachyon be appropriate here?

On Friday, June 5, 2015, Evo Eftimov <evo.eftimov@isecc.com> wrote:

> Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark
> batch jobs (besides, anyone can put something like that together in 5 min),
> while I am under the impression that Dmitry is working on a Spark Streaming app.
>
>
>
> Besides, the Job Server is essentially for sharing the Spark Context
> between multiple threads.
>
>
>
> Re Dmitry's initial question – you can load large data sets as a batch
> (static) RDD from any Spark Streaming app and then join DStream RDDs
> against it to emulate “lookups”. You can also try the “Lookup RDD”;
> there is a GitHub project.
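
The static-RDD join suggested above might look roughly like this in PySpark. This is only a sketch under assumptions: it uses the legacy DStream API, the function names (`enrich_stream`, `run_streaming_lookup`) and the sample data are made up for illustration, and a `queueStream` stands in for whatever real source the consumer reads from.

```python
def enrich_stream(events, lookup_rdd):
    """Join every micro-batch of a keyed DStream against a static keyed RDD."""
    # transform() exposes each batch as a plain RDD, so an ordinary
    # RDD-to-RDD join can be used for the "lookup".
    return events.transform(lambda batch: batch.join(lookup_rdd))


def run_streaming_lookup():
    # Imported here so enrich_stream can be read and reused on its own.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "static-rdd-lookup-sketch")
    ssc = StreamingContext(sc, 1)  # 1-second batches

    # The static "dictionary", loaded once and cached (hypothetical entries).
    lookup_rdd = sc.parallelize([("spark", "engine"), ("rdd", "dataset")]).cache()

    # A queue of pre-built RDDs stands in for a real streaming source.
    events = ssc.queueStream([sc.parallelize([("spark", 1), ("flink", 2)])])

    enrich_stream(events, lookup_rdd).pprint()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5)
    ssc.stop()  # by default this also stops the SparkContext
```

Note that `join` is an inner join, so stream records whose key is missing from the static RDD (here `"flink"`) are dropped; `leftOuterJoin` would keep them.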
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenberg123@gmail.com]
> *Sent:* Friday, June 5, 2015 12:12 AM
> *To:* Yiannis Gkoufas
> *Cc:* Olivier Girardot; user@spark.apache.org
> *Subject:* Re: How to share large resources like dictionaries while
> processing data with Spark ?
>
>
>
> Thanks so much, Yiannis, Olivier, Huang!
>
>
>
> On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas <johngouf85@gmail.com> wrote:
>
> Hi there,
>
>
>
> I would recommend checking out
> https://github.com/spark-jobserver/spark-jobserver which I think gives
> the functionality you are looking for.
>
> I haven't tested it though.
>
>
>
> BR
>
>
>
> On 5 June 2015 at 01:35, Olivier Girardot <ssaboum@gmail.com> wrote:
>
> You can use it as a broadcast variable, but if it's "too" large (more than
> 1 GB, I guess), you may need to share it instead by joining it to the
> other RDDs on some kind of key.
>
> But this is the kind of thing broadcast variables were designed for.
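
For reference, a minimal PySpark sketch of the broadcast-variable approach described above. The helper names (`load_dictionary`, `tag_tokens`, `run_with_broadcast`) and the dictionary entries are hypothetical; only `SparkContext.broadcast`, `parallelize`, `mapPartitions` and `collect` are actual Spark calls.

```python
def load_dictionary():
    """Stand-in for loading a large dictionary; the entries are made up."""
    return {"spark": "engine", "rdd": "dataset"}


def tag_tokens(tokens, dictionary):
    """Look up each token, falling back to a marker for unknown words."""
    return [(token, dictionary.get(token, "UNKNOWN")) for token in tokens]


def run_with_broadcast(words):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-dictionary-sketch")
    try:
        # The dictionary is shipped to each executor once, not once per task.
        shared = sc.broadcast(load_dictionary())
        return (
            sc.parallelize(words)
            # shared.value is read on the executor, inside the task.
            .mapPartitions(lambda part: tag_tokens(part, shared.value))
            .collect()
        )
    finally:
        sc.stop()
```

Calling `run_with_broadcast(["spark", "flink"])` on a machine with PySpark installed would run the lookup on a local two-thread master. The broadcast value is read-only, so this fits a dictionary that does not change during the job.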
>
>
>
> Regards,
>
>
>
> Olivier.
>
>
>
> On Thu, Jun 4, 2015 at 23:50, dgoldenberg <dgoldenberg123@gmail.com> wrote:
>
> We have some pipelines defined where sometimes we need to load potentially
> large resources such as dictionaries.
>
> What would be the best strategy for sharing such resources among the
> transformations/actions within a consumer?  Can they be shared somehow
> across the RDDs?
>
> I'm looking for a way to load such a resource once into the cluster memory
> and have it be available throughout the lifecycle of a consumer...
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


-- 
- Charles
