spark-user mailing list archives

From Charles Earl <charles.ce...@gmail.com>
Subject Re: How to share large resources like dictionaries while processing data with Spark ?
Date Fri, 05 Jun 2015 14:27:03 GMT
Would the IndexedRDD feature provide what the Lookup RDD does?
I've been using a broadcast variable map for a similar kind of thing -- it's
probably within 1 GB, but I'm interested to know whether the Lookup (or Indexed)
RDD might be better.
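Roughly what that looks like in my case (the dictionary contents and names below are made up for illustration):

```scala
// Sketch: broadcast a lookup Map once, then reference it inside closures.
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-lookup").setMaster("local[*]"))

    // Loaded once on the driver; in practice this would come from a file.
    val dict = Map("espresso" -> "coffee", "stout" -> "beer")
    val dictBc = sc.broadcast(dict)

    val words = sc.parallelize(Seq("espresso", "stout", "kale"))
    // Executors read the broadcast copy; the map is not re-shipped per task.
    val tagged = words.map(w => (w, dictBc.value.getOrElse(w, "unknown")))
    tagged.collect().foreach(println)

    sc.stop()
  }
}
```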
C

On Friday, June 5, 2015, Dmitry Goldenberg <dgoldenberg123@gmail.com> wrote:

> Thanks everyone. Evo, could you provide a link to the Lookup RDD project?
> I can't seem to locate it exactly on Github. (Yes, to your point, our
> project is Spark streaming based). Thank you.
>
> On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov <evo.eftimov@isecc.com> wrote:
>
>> Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for
>> Spark batch jobs (besides, anyone can put something like that together in
>> 5 minutes), while I am under the impression that Dmitry is working on a
>> Spark Streaming app
>>
>>
>>
>> Besides the Job Server is essentially for sharing the Spark Context
>> between multiple threads
>>
>>
>>
>> Re Dmitry's initial question – you can load large data sets as a batch
>> (static) RDD from any Spark Streaming app and then join DStream RDDs
>> against them to emulate “lookups”; you can also try the “Lookup RDD” –
>> there is a GitHub project for it
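A rough sketch of that static-RDD-to-DStream join pattern (the stream source, file path, and key extraction here are invented for the example):

```scala
// Sketch: enrich each micro-batch by joining against a static "dictionary" RDD.
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc  = new SparkContext("local[2]", "stream-lookup")
val ssc = new StreamingContext(sc, Seconds(5))

// The large resource, loaded once as a keyed batch RDD and cached.
val dictRdd = sc.textFile("hdfs:///dicts/terms.tsv")   // hypothetical path
  .map { line => val Array(k, v) = line.split('\t'); (k, v) }
  .cache()

val lines = ssc.socketTextStream("localhost", 9999)
// transform() exposes each batch's underlying RDD, so a plain RDD join works.
val enriched = lines.transform { batch =>
  batch.map(line => (line.split(' ')(0), line)).join(dictRdd)
}
enriched.print()

ssc.start()
ssc.awaitTermination()
```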
>>
>>
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenberg123@gmail.com]
>> *Sent:* Friday, June 5, 2015 12:12 AM
>> *To:* Yiannis Gkoufas
>> *Cc:* Olivier Girardot; user@spark.apache.org
>> *Subject:* Re: How to share large resources like dictionaries while
>> processing data with Spark ?
>>
>>
>>
>> Thanks so much, Yiannis, Olivier, Huang!
>>
>>
>>
>> On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas <johngouf85@gmail.com> wrote:
>>
>> Hi there,
>>
>>
>>
>> I would recommend checking out
>> https://github.com/spark-jobserver/spark-jobserver which I think gives
>> the functionality you are looking for.
>>
>> I haven't tested it though.
>>
>>
>>
>> BR
>>
>>
>>
>> On 5 June 2015 at 01:35, Olivier Girardot <ssaboum@gmail.com> wrote:
>>
>> You can use it as a broadcast variable, but if it's "too" large (more
>> than 1 GB, I guess), you may need to share it by joining it, via some kind
>> of key, to the other RDDs.
>>
>> But this is the kind of thing broadcast variables were designed for.
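If it does cross that size, the join-based alternative might look roughly like this (assumes an existing SparkContext `sc`; the paths and key extraction are invented for the example):

```scala
// Sketch: dictionary too big to broadcast -> keep it as a keyed RDD and join.
val bigDict = sc.textFile("hdfs:///dicts/big.tsv")     // hypothetical location
  .map { line => val Array(k, v) = line.split('\t'); (k, v) }
  .persist()                                           // load once, reuse across jobs

val records = sc.textFile("hdfs:///input/records")     // hypothetical input
val enriched = records
  .keyBy(r => r.split(',')(0))                         // invented key extraction
  .join(bigDict)                                       // shuffles keys instead of shipping the whole map
```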
>>
>>
>>
>> Regards,
>>
>>
>>
>> Olivier.
>>
>>
>>
>> On Thu, Jun 4, 2015 at 11:50 PM, dgoldenberg <dgoldenberg123@gmail.com> wrote:
>>
>> We have some pipelines defined where sometimes we need to load potentially
>> large resources such as dictionaries.
>>
>> What would be the best strategy for sharing such resources among the
>> transformations/actions within a consumer?  Can they be shared somehow
>> across the RDDs?
>>
>> I'm looking for a way to load such a resource once into the cluster memory
>> and have it be available throughout the lifecycle of a consumer...
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>>
>>
>>
>
>

-- 
- Charles
