spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <>
Subject Re: sparkSQL thread safe?
Date Thu, 10 Jul 2014 23:50:56 GMT
Hey Ian,

Thanks for bringing these up!  Responses in-line:

Just wondering if right now spark sql is expected to be thread safe on
> master?
> doing a simple hadoop file -> RDD -> schema RDD -> write parquet
> will fail in reflection code if i run these in a thread pool.

You are probably hitting SPARK-2178
<> which is caused by
SI-6240 <>.  We have a plan to
fix this by moving the schema introspection to compile time, using macros.

> The SparkSqlSerializer, seems to create a new Kryo instance each time it
> wants to serialize anything. I got a huge speedup when I had any
> non-primitive type in my SchemaRDD using the ResourcePool's from Chill for
> providing the KryoSerializer to it. (I can open an RB if there is some
> reason not to re-use them?)

Sounds like SPARK-2102 <>.
 There is no reason AFAIK to not reuse the instance. A PR would be greatly

> With the Distinct Count operator there is no map-side operations, and a
> test to check for this. Is there any reason not to do a map side combine
> into a set and then merge the sets later? (similar to the approximate
> distinct count operator)

Thats just not an optimization that we had implemented yet... but I've just
done it here <> and it'll be in
master soon :)

> Another thing while i'm mailing.. the 1.0.1 docs have a section like:
> "
> // Note: Case classes in Scala 2.10 can support only up to 22 fields. To
> work around this limit, // you can use custom classes that implement the
> Product interface.
> "
> Which sounds great, we have lots of data in thrift.. so via scrooge (
>, we end up with ultimately instances
> of
> traits which implement product. Though the reflection code appears to look
> for the constructor of the class and base the types based on those
> parameters?

Yeah, thats true that we only look in the constructor at the moment, but I
don't think there is a really good reason for that (other than I guess we
will need to add code to make sure we skip builtin object methods).  If you
want to open a JIRA, we can try fixing this.


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message