spark-user mailing list archives

From Colin Kincaid Williams <disc...@uw.edu>
Subject Re: Improving performance of a kafka spark streaming app
Date Mon, 02 May 2016 18:19:11 GMT
Hi David,

 My current concern is that I'm using a Spark HBase bulk put driver
written for Spark 1.2, matched to the version of CDH my Spark / YARN
cluster is running on. Even if I were to run on another Spark cluster,
I'm concerned that I might have issues making the put requests into
HBase. Still, I should give it a shot if I abandon Spark 1.2 and my
current environment.
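
Roughly, the bulk put driver batches puts per partition rather than
per record, along these lines (a sketch only -- the table name, column
family, and stream types here are placeholders, not my actual code):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One HTable handle per partition, not per record
    val table = new HTable(HBaseConfiguration.create(), "events")
    table.setAutoFlush(false, false) // buffer puts client-side
    records.foreach { case (key: String, value: Array[Byte]) =>
      val put = new Put(Bytes.toBytes(key))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("payload"), value)
      table.put(put)
    }
    table.flushCommits() // one round trip per buffer, not per put
    table.close()
  }
}
```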

Thanks,

Colin Williams

On Mon, May 2, 2016 at 6:06 PM, Krieg, David
<David.Krieg@earlywarning.com> wrote:
> Spark 1.2 is a little old and busted. Most of the advice you'll get is to
> use at least Spark 1.3, which introduced a new Spark Streaming mode. The
> 1.2 receiver-based implementation had a number of shortcomings; 1.3 is
> where the "direct streaming" interface was introduced, which is what we
> use. You'll get more joy the more you upgrade Spark, at least to some
> extent.
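>
> For reference, the 1.3 direct interface looks roughly like this (broker
> and topic names below are placeholders, not our config):
>
> ```scala
> import kafka.serializer.StringDecoder
> import org.apache.spark.streaming.kafka.KafkaUtils
>
> val kafkaParams = Map[String, String](
>   "metadata.broker.list" -> "broker1:9092,broker2:9092")
> val topics = Set("my-topic")
>
> // No receiver: each Kafka partition maps directly to an RDD partition,
> // so ingest parallelism scales with the topic's partition count.
> val stream = KafkaUtils.createDirectStream[
>   String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
> ```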
>
> David Krieg | Enterprise Software Engineer
> Early Warning
> Direct: 480.426.2171 | Fax: 480.483.4628 | Mobile: 859.227.6173
>
>
> -----Original Message-----
> From: Colin Kincaid Williams [mailto:discord@uw.edu]
> Sent: Monday, May 02, 2016 10:55 AM
> To: user@spark.apache.org
> Subject: Improving performance of a kafka spark streaming app
>
> I've written an application that reads content from a Kafka topic with 1.7
> billion entries, deserializes the protobuf entries, and inserts them into
> HBase. The environment I'm currently running in is Spark 1.2.
>
> With 8 executors, 2 cores each, and 2 jobs, I'm only getting between
> 0 and 2,500 writes / second. At that rate it will take much too long to
> consume the entries.
>
> I currently believe that the Spark Kafka receiver is the bottleneck.
> I've tried both 1.2 receivers, with and without the WAL, and didn't notice
> any large performance difference. I've tried many different Spark
> configuration options, but can't seem to get better performance.
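>
> A common 1.2-era workaround I'm aware of is to run several receivers and
> union them to raise ingest parallelism, roughly like this (ZooKeeper
> quorum, group, topic, and counts below are placeholders, not my actual
> config):
>
> ```scala
> import org.apache.spark.streaming.kafka.KafkaUtils
>
> // Several receivers, each backing its own input DStream
> val numReceivers = 4
> val streams = (1 to numReceivers).map { _ =>
>   KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 2))
> }
> // Union into one stream, then repartition to spread downstream work
> val unified = ssc.union(streams).repartition(16)
> ```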
>
> I saw 80,000 requests / second inserting these records into HBase in a
> bulk fashion, using YARN / HBase / protobuf / Kafka.
>
> While hbase inserts might not deliver the same throughput, I'd like to at
> least get 10%.
>
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> This is my first spark application. I'd appreciate any assistance.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

