spark-dev mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: enum-like types in Spark
Date Mon, 16 Mar 2015 18:38:03 GMT
In MLlib, we use strings for enum-like types in Python APIs, which is
quite common in Python and easy for py4j. On the JVM side, we
implement `fromString` to convert them back to enums. -Xiangrui
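
For illustration, the JVM-side conversion could look roughly like the sketch below. The `Impurity` type, its values, and the error message are made up for this example and are not actual MLlib code; only the `fromString` name comes from the message above.

```scala
// Hypothetical enum-like type; names are illustrative, not actual MLlib code.
sealed abstract class Impurity(val name: String)

object Impurity {
  case object Gini extends Impurity("gini")
  case object Entropy extends Impurity("entropy")

  // All known values, used for lookup and for the error message.
  val values: Seq[Impurity] = Seq(Gini, Entropy)

  // Converts a string passed in from Python (via py4j) back to the
  // enum-like value. Matching is case-insensitive; unknown names fail fast.
  def fromString(name: String): Impurity =
    values.find(_.name.equalsIgnoreCase(name)).getOrElse {
      throw new IllegalArgumentException(
        s"Unknown impurity '$name'; supported values: " +
          values.map(_.name).mkString(", "))
    }
}
```

With this shape the Python caller only passes a plain string such as "gini", so py4j has to marshal a string rather than a JVM object.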

On Wed, Mar 11, 2015 at 12:56 PM, RJ Nowling <rnowling@gmail.com> wrote:
> How do these proposals affect PySpark?  I think compatibility with PySpark
> through Py4J should be considered.
>
> On Mon, Mar 9, 2015 at 8:39 PM, Patrick Wendell <pwendell@gmail.com> wrote:
>
>> Does this matter for our own internal types in Spark? I don't think
>> any of these types are designed to be used in RDD records, for
>> instance.
>>
>> On Mon, Mar 9, 2015 at 6:25 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
>> > Perhaps the problem with Java enums that was brought up was actually that
>> > their hashCode is not stable across JVMs, as it depends on the memory
>> > location of the enum itself.
>> >
>> > On Mon, Mar 9, 2015 at 6:15 PM, Imran Rashid <irashid@cloudera.com> wrote:
>> >
>> >> Can you expand on the serde issues w/ Java enums at all?  I haven't heard
>> >> of any problems specific to enums.  The Java object serialization rules
>> >> seem very clear, and it doesn't seem like different JVMs should have a
>> >> choice on what they do:
>> >>
>> >> http://docs.oracle.com/javase/6/docs/platform/serialization/spec/serial-arch.html#6469
>> >>
>> >> (in a nutshell, serialization must use enum.name())
>> >>
>> >> Of course there are plenty of ways the user could screw this up (e.g.,
>> >> rename the enums, or change their meaning, or remove them).  But then again,
>> >> all of Java serialization has issues the user has to be aware of.  E.g.,
>> >> if we go with case objects, then Java serialization blows up if
>> >> you add another helper method, even if that helper method is completely
>> >> compatible.
>> >>
>> >> Some prior debate in the Scala community:
>> >>
>> >> https://groups.google.com/d/msg/scala-internals/8RWkccSRBxQ/AN5F_ZbdKIsJ
>> >>
>> >> SO post on which version to use in Scala:
>> >>
>> >> http://stackoverflow.com/questions/1321745/how-to-model-type-safe-enum-types
>> >>
>> >> SO post about the macro-craziness people try to add to Scala to make them
>> >> almost as good as a simple Java enum:
>> >> (NB: the accepted answer doesn't actually work in all cases ...)
>> >>
>> >> http://stackoverflow.com/questions/20089920/custom-scala-enum-most-elegant-version-searched
>> >>
>> >> Another proposal to add better enums built into Scala ... but seems to be
>> >> dormant:
>> >>
>> >> https://groups.google.com/forum/#!topic/scala-sips/Bf82LxK02Kk
>> >>
>> >>
>> >>
>> >> On Thu, Mar 5, 2015 at 10:49 PM, Mridul Muralidharan <mridul@gmail.com>
>> >> wrote:
>> >>
>> >> >   I have a strong dislike for Java enums due to the fact that they
>> >> > are not stable across JVMs - if one undergoes serde, you end up with
>> >> > unpredictable results at times [1].
>> >> > This is one of the reasons why we prevent enums from being keys, though
>> >> > it is highly possible users might depend on them internally and shoot
>> >> > themselves in the foot.
>> >> >
>> >> > Would be better to keep away from them in general and use something
>> >> > more stable.
>> >> >
>> >> > Regards,
>> >> > Mridul
>> >> >
>> >> > [1] Having had to debug this issue for 2 weeks - I really really hate it.
>> >> >
>> >> >
>> >> > On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid <irashid@cloudera.com> wrote:
>> >> > > I have a very strong dislike for #1 (Scala enumerations).   I'm ok with #4
>> >> > > (with Xiangrui's final suggestion, especially making it sealed & available
>> >> > > in Java), but I really think #2, Java enums, are the best option.
>> >> > >
>> >> > > Java enums actually have some very real advantages over the other
>> >> > > approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There has
>> >> > > been endless debate in the Scala community about the problems with the
>> >> > > approaches in Scala.  Very smart, level-headed Scala gurus have complained
>> >> > > about their shortcomings (Rex Kerr's name is coming to mind, though I'm
>> >> > > not positive about that); there have been numerous well-thought-out
>> >> > > proposals to give Scala a better enum.  But the powers-that-be in Scala
>> >> > > always reject them.  IIRC the explanation for rejecting is basically that
>> >> > > (a) enums aren't important enough for introducing some new special feature,
>> >> > > Scala's got bigger things to work on and (b) if you really need a good
>> >> > > enum, just use Java's enum.
>> >> > >
>> >> > > I doubt it really matters that much for Spark internals, which is why I
>> >> > > think #4 is fine.  But I figured I'd give my spiel, because every developer
>> >> > > loves language wars :)
>> >> > >
>> >> > > Imran
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>> >> > >
>> >> > >> `case object` inside an `object` doesn't show up in Java.  This is the
>> >> > >> minimal code I found to make everything show up correctly in both
>> >> > >> Scala and Java:
>> >> > >>
>> >> > >> sealed abstract class StorageLevel // cannot be a trait
>> >> > >>
>> >> > >> object StorageLevel {
>> >> > >>   private[this] case object _MemoryOnly extends StorageLevel
>> >> > >>   final val MemoryOnly: StorageLevel = _MemoryOnly
>> >> > >>
>> >> > >>   private[this] case object _DiskOnly extends StorageLevel
>> >> > >>   final val DiskOnly: StorageLevel = _DiskOnly
>> >> > >> }
>> >> > >>
>> >> > >> On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell <pwendell@gmail.com> wrote:
>> >> > >> > I like #4 as well and agree with Aaron's suggestion.
>> >> > >> >
>> >> > >> > - Patrick
>> >> > >> >
>> >> > >> > On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
>> >> > >> >> I'm cool with #4 as well, but make sure we dictate that the values should
>> >> > >> >> be defined within an object with the same name as the enumeration (like we
>> >> > >> >> do for StorageLevel). Otherwise we may pollute a higher namespace.
>> >> > >> >>
>> >> > >> >> e.g. we SHOULD do:
>> >> > >> >>
>> >> > >> >> trait StorageLevel
>> >> > >> >> object StorageLevel {
>> >> > >> >>   case object MemoryOnly extends StorageLevel
>> >> > >> >>   case object DiskOnly extends StorageLevel
>> >> > >> >> }
>> >> > >> >>
>> >> > >> >> On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust <michael@databricks.com> wrote:
>> >> > >> >>
>> >> > >> >>> #4 with a preference for CamelCaseEnums
>> >> > >> >>>
>> >> > >> >>> On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley <joseph@databricks.com> wrote:
>> >> > >> >>>
>> >> > >> >>> > another vote for #4
>> >> > >> >>> > People are already used to adding "()" in Java.
>> >> > >> >>> >
>> >> > >> >>> >
>> >> > >> >>> > On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch <javadba@gmail.com> wrote:
>> >> > >> >>> >
>> >> > >> >>> > > #4 but with MemoryOnly (more scala-like)
>> >> > >> >>> > >
>> >> > >> >>> > > http://docs.scala-lang.org/style/naming-conventions.html
>> >> > >> >>> > >
>> >> > >> >>> > > Constants, Values, Variable and Methods
>> >> > >> >>> > >
>> >> > >> >>> > > Constant names should be in upper camel case. That is, if the member is
>> >> > >> >>> > > final, immutable and it belongs to a package object or an object, it may be
>> >> > >> >>> > > considered a constant (similar to Java's static final members):
>> >> > >> >>> > >
>> >> > >> >>> > >
>> >> > >> >>> > >    object Container {
>> >> > >> >>> > >        val MyConstant = ...
>> >> > >> >>> > >    }
>> >> > >> >>> > >
>> >> > >> >>> > >
>> >> > >> >>> > > 2015-03-04 17:11 GMT-08:00 Xiangrui Meng <mengxr@gmail.com>:
>> >> > >> >>> > >
>> >> > >> >>> > > > Hi all,
>> >> > >> >>> > > >
>> >> > >> >>> > > > There are many places where we use enum-like types in Spark, but in
>> >> > >> >>> > > > different ways. Every approach has both pros and cons. I wonder
>> >> > >> >>> > > > whether there should be an "official" approach for enum-like types in
>> >> > >> >>> > > > Spark.
>> >> > >> >>> > > >
>> >> > >> >>> > > > 1. Scala's Enumeration (e.g., SchedulingMode, WorkerState, etc.)
>> >> > >> >>> > > >
>> >> > >> >>> > > > * All types show up as Enumeration.Value in Java.
>> >> > >> >>> > > >
>> >> > >> >>> > > > http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
>> >> > >> >>> > > >
>> >> > >> >>> > > > 2. Java's Enum (e.g., SaveMode, IOMode)
>> >> > >> >>> > > >
>> >> > >> >>> > > > * Implementation must be in a Java file.
>> >> > >> >>> > > > * Values don't show up in the ScalaDoc:
>> >> > >> >>> > > >
>> >> > >> >>> > > > http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
>> >> > >> >>> > > >
>> >> > >> >>> > > > 3. Static fields in Java (e.g., TripletFields)
>> >> > >> >>> > > >
>> >> > >> >>> > > > * Implementation must be in a Java file.
>> >> > >> >>> > > > * Doesn't need "()" in Java code.
>> >> > >> >>> > > > * Values don't show up in the ScalaDoc:
>> >> > >> >>> > > >
>> >> > >> >>> > > > http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
>> >> > >> >>> > > >
>> >> > >> >>> > > > 4. Objects in Scala (e.g., StorageLevel)
>> >> > >> >>> > > >
>> >> > >> >>> > > > * Needs "()" in Java code.
>> >> > >> >>> > > > * Values show up in both ScalaDoc and JavaDoc:
>> >> > >> >>> > > >
>> >> > >> >>> > > > http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
>> >> > >> >>> > > >
>> >> > >> >>> > > > http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
>> >> > >> >>> > > >
>> >> > >> >>> > > > It would be great if we have an "official" approach for this as well
>> >> > >> >>> > > > as the naming convention for enum-like values ("MEMORY_ONLY" or
>> >> > >> >>> > > > "MemoryOnly"). Personally, I like 4) with "MEMORY_ONLY". Any thoughts?
>> >> > >> >>> > > >
>> >> > >> >>> > > > Best,
>> >> > >> >>> > > > Xiangrui
>> >> > >> >>> > > >
>> >> > >> >>> > > >
>> >> > >>
>> ---------------------------------------------------------------------
>> >> > >> >>> > > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >> > >> >>> > > > For additional commands, e-mail:
>> dev-help@spark.apache.org
