spark-dev mailing list archives

From Christopher Nguyen <...@adatao.com>
Subject Re: Large DataStructure to Broadcast
Date Thu, 26 Dec 2013 05:11:15 GMT
Purav, depending on the access pattern, you should also consider the
trade-offs of setting up a lookup service (using, e.g., memcached, egad!),
which may end up being more efficient overall.

The general point is not to restrict yourself to only Spark APIs when
considering the overall architecture.
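A minimal sketch of what that lookup-service approach could look like, assuming
a memcached instance reachable from the workers and the spymemcached client on
the classpath; the host name, port, and keys below are placeholders, not
anything Spark itself provides:

  import java.net.InetSocketAddress
  import net.spy.memcached.MemcachedClient
  import org.apache.spark.SparkContext

  object MemcachedLookupSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "memcached-lookup-sketch")
      val keys = sc.parallelize(Seq("key1", "key2", "key3"))

      val resolved = keys.mapPartitions { iter =>
        // One client per partition, reused for every key in that partition,
        // so connection cost is amortised over the partition's records.
        val client = new MemcachedClient(new InetSocketAddress("lookup-host", 11211))
        val out = iter.map(k => (k, Option(client.get(k)))).toList  // null => None
        client.shutdown()
        out.iterator
      }

      resolved.collect().foreach(println)
      sc.stop()
    }
  }

Nothing here holds the 5 GB table in executor memory; each record costs a
network round trip to the lookup service instead.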
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Wed, Dec 25, 2013 at 7:32 PM, purav aggarwal
<puravaggarwal123@gmail.com> wrote:

> Hi all,
>
> I have a large file (> 5 GB) that I need to look up. Since each slave
> needs to perform the search operation on the hashmap (built out of the file)
> in parallel, I need to broadcast the file. I was wondering if broadcasting
> such a huge file is really a good idea. Do we have any benchmarks for
> broadcast variables? I am on a standalone cluster, and machine configuration
> is not a problem at the moment.
> Has anyone exploited broadcast to such an extent?
>
> Thanks,
> Purav
>
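For reference, a minimal sketch of the broadcast approach being asked about,
assuming the file can be parsed into a driver-side Map and the driver JVM has
enough heap to hold it; the file path and tab-separated key/value format are
placeholders:

  import scala.io.Source
  import org.apache.spark.SparkContext

  object BroadcastLookupSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "broadcast-lookup-sketch")

      // Driver side: parse "key<TAB>value" lines into an in-memory map;
      // the driver JVM needs enough heap to hold the whole table.
      val table: Map[String, String] =
        Source.fromFile("/path/to/lookup.tsv").getLines().map { line =>
          val Array(k, v) = line.split("\t", 2)
          (k, v)
        }.toMap

      val lookup = sc.broadcast(table)

      // Worker side: every task reads the broadcast value locally,
      // so lookups need no per-record network round trip.
      val keys = sc.parallelize(Seq("key1", "key2"))
      val resolved = keys.map(k => (k, lookup.value.get(k)))

      resolved.collect().foreach(println)
      sc.stop()
    }
  }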
