spark-dev mailing list archives

From Patrick Wendell <pwend...@gmail.com>
Subject Re: Calling external classes added by sc.addJar needs to be through reflection
Date Wed, 21 May 2014 22:15:28 GMT
Of these two solutions I'd definitely prefer 2 in the short term. I'd
imagine the fix is very straightforward (it would mostly just be
removing code), and we'd be making this more consistent with
standalone mode, which makes things way easier to reason about.

In the long term we'll definitely want to exploit the distributed
cache more, but at this point it's premature optimization at a high
complexity cost. Writing stuff to HDFS is so slow anyway that I'd
guess serving it directly from the driver is still faster in most
cases (though for very large jar sizes or very large clusters, yes,
we'll need the distributed cache).

- Patrick

On Wed, May 21, 2014 at 2:41 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
> That's a good example. If we really want to cover that case, there are
> two solutions:
>
> 1. Follow DB's patch, adding jars to the system classloader. Then we
> cannot put a user class in front of an existing class.
> 2. Do not send the primary jar and secondary jars to executors'
> distributed cache. Instead, add them to "spark.jars" in SparkSubmit
> and serve them via HTTP by calling sc.addJar in SparkContext.
>
> What is your preference?
>
> On Wed, May 21, 2014 at 2:27 PM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
>> Is that an assumption we can make?  I think we'd run into an issue in this
>> situation:
>>
>> *In primary jar:*
>> def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()
>>
>> *In app code:*
>> sc.addJar("dynamicjar.jar")
>> ...
>> rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))
>>
>> It might be fair to say that the user should make sure to use the context
>> classloader when instantiating dynamic classes, but I think it's weird that
>> this code would work on Spark standalone but not on YARN.
>>
>> -Sandy
>>
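A minimal Java sketch of the "use the context classloader" idea from the message above, assuming the executor thread's context classloader is the one that can see the dynamically added jar. `DynamicObjects` and its method are illustrative names, not Spark APIs, and a JDK class stands in for "some.class.from.DynamicJar" so the snippet runs on its own.

```java
// Hypothetical helper (not a Spark API): resolve the class through the calling
// thread's context classloader instead of the loader that defined this helper.
class DynamicObjects {
    static Object makeDynamicObject(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Assumption: fall back to the system classloader if no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            // The three-argument forName lets the caller pick the loader; the
            // one-argument form would use the loader of the calling class.
            Class<?> clazz = Class.forName(className, true, loader);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in class name; in the thread's example this would live in dynamicjar.jar.
        Object obj = makeDynamicObject("java.util.ArrayList");
        System.out.println(obj.getClass().getName());
    }
}
```

Whether this helps in practice depends on which loader the framework installs as the context classloader on the task thread, which is the point under discussion in this thread.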
>>
>> On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>
>>> I think adding jars dynamically should work as long as the primary jar
>>> and the secondary jars do not depend on dynamically added jars, which
>>> should be the correct logic. -Xiangrui
>>>
>>> On Wed, May 21, 2014 at 1:40 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>>> > That is a separate story.
>>> >
>>> > In the YARN deployment, as Sandy said, app.jar is always in the system
>>> > classloader, which means any object instantiated in app.jar has the
>>> > system classloader as its loader instead of the custom one. As a result,
>>> > the custom classloader on YARN will never work without explicitly using
>>> > reflection.
>>> >
>>> > One solution is to not use the system classloader in the classloader
>>> > hierarchy, and to add all the resources in the system one to the custom
>>> > one. This is the approach Tomcat takes.
>>> >
>>> > Or we can directly modify the system classloader by calling the
>>> > protected method `addURL`, which will not work and will throw an
>>> > exception if the code runs under a security manager.
>>> >
>>> >
>>> > Sincerely,
>>> >
>>> > DB Tsai
>>> > -------------------------------------------------------
>>> > My Blog: https://www.dbtsai.com
>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >
>>> >
>>> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sandy.ryza@cloudera.com>
>>> wrote:
>>> >
>>> >> This will solve the issue for jars added upon application submission,
>>> >> but, on top of this, we need to make sure that anything dynamically
>>> >> added through sc.addJar works as well.
>>> >>
>>> >> To do so, we need to make sure that any jars retrieved via the driver's
>>> >> HTTP server are loaded by the same classloader that loads the jars
>>> >> given on app submission.  To achieve this, we need to either use the
>>> >> same classloader for both system jars and user jars, or make sure that
>>> >> the user jars given on app submission are under the same classloader
>>> >> used for dynamically added jars.
>>> >>
>>> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <mengxr@gmail.com>
>>> wrote:
>>> >>
>>> >> > Talked with Sandy and DB offline. I think the best solution is
>>> >> > sending the secondary jars to the distributed cache of all containers
>>> >> > rather than just the master, and setting the classpath to include the
>>> >> > Spark jar, primary app jar, and secondary jars before the executor
>>> >> > starts. This way, the user only needs to specify secondary jars via
>>> >> > --jars instead of calling sc.addJar inside the code. It also solves
>>> >> > the scalability problem of serving all the jars via HTTP.
>>> >> >
>>> >> > If this solution sounds good, I can try to make a patch.
>>> >> >
>>> >> > Best,
>>> >> > Xiangrui
>>> >> >
>>> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <dbtsai@stanford.edu>
>>> wrote:
>>> >> > > In 1.0, there is a new option for users to choose which classloader
>>> >> > > has higher priority via spark.files.userClassPathFirst; I decided
>>> >> > > to submit the PR for 0.9 first. We use this patch in our lab, and we
>>> >> > > can use the jars added by sc.addJar without reflection.
>>> >> > >
>>> >> > > https://github.com/apache/spark/pull/834
>>> >> > >
>>> >> > > Can anyone comment if it's a good approach?
>>> >> > >
>>> >> > > Thanks.
>>> >> > >
>>> >> > >
>>> >> > > Sincerely,
>>> >> > >
>>> >> > > DB Tsai
>>> >> > >
>>> >> > >
>>> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <dbtsai@stanford.edu>
>>> wrote:
>>> >> > >
>>> >> > >> Good summary! We fixed it in branch 0.9 since our production is
>>> >> > >> still on 0.9. I'm porting it to 1.0 now, and hopefully will submit
>>> >> > >> a PR for 1.0 tonight.
>>> >> > >>
>>> >> > >>
>>> >> > >> Sincerely,
>>> >> > >>
>>> >> > >> DB Tsai
>>> >> > >>
>>> >> > >>
>>> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <
>>> sandy.ryza@cloudera.com
>>> >> > >wrote:
>>> >> > >>
>>> >> > >>> It just hit me why this problem is showing up on YARN and not on
>>> >> > >>> standalone.
>>> >> > >>>
>>> >> > >>> The relevant difference between YARN and standalone is that, on YARN,
>>> >> > >>> the app jar is loaded by the system classloader instead of Spark's
>>> >> > >>> custom URL classloader.
>>> >> > >>>
>>> >> > >>> On YARN, the system classloader knows about [the classes in the spark
>>> >> > >>> jars, the classes in the primary app jar].  The custom classloader
>>> >> > >>> knows about [the classes in secondary app jars] and has the system
>>> >> > >>> classloader as its parent.
>>> >> > >>>
>>> >> > >>> A few relevant facts (mostly redundant with what Sean pointed out):
>>> >> > >>> * Every class has a classloader that loaded it.
>>> >> > >>> * When an object of class B is instantiated inside of class A, the
>>> >> > >>> classloader used for loading B is the classloader that was used for
>>> >> > >>> loading A.
>>> >> > >>> * When a classloader fails to load a class, it lets its parent
>>> >> > >>> classloader try.  If its parent succeeds, its parent becomes the
>>> >> > >>> "classloader that loaded it".
>>> >> > >>>
>>> >> > >>> So suppose class B is in a secondary app jar and class A is in the
>>> >> > >>> primary app jar:
>>> >> > >>> 1. The custom classloader will try to load class A.
>>> >> > >>> 2. It will fail, because it only knows about the secondary jars.
>>> >> > >>> 3. It will delegate to its parent, the system classloader.
>>> >> > >>> 4. The system classloader will succeed, because it knows about the
>>> >> > >>> primary app jar.
>>> >> > >>> 5. A's classloader will be the system classloader.
>>> >> > >>> 6. A tries to instantiate an instance of class B.
>>> >> > >>> 7. B will be loaded with A's classloader, which is the system
>>> >> > >>> classloader.
>>> >> > >>> 8. Loading B will fail, because A's classloader, which is the system
>>> >> > >>> classloader, doesn't know about the secondary app jars.
>>> >> > >>>
>>> >> > >>> In Spark standalone, A and B are both loaded by the custom
>>> >> > >>> classloader, so this issue doesn't come up.
>>> >> > >>>
>>> >> > >>> -Sandy
>>> >> > >>>
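Steps 1-5 above can be checked with a small Java sketch. It assumes a plain `URLClassLoader` with no URLs of its own as a stand-in for the custom classloader; `ParentFirstDemo` is an illustrative name. Note that the JDK's default `loadClass` consults the parent before the child rather than after, so the order of steps 1-4 differs slightly, but the outcome in step 5 is the same: the class ends up defined by the parent.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;

class ParentFirstDemo {
    // Requests this very class through a child loader and reports which loader
    // actually defined the class that comes back.
    static ClassLoader definingLoaderViaChild() {
        ClassLoader parent = ParentFirstDemo.class.getClassLoader();
        // The child has no URLs, so it cannot define anything itself.
        try (URLClassLoader child = new URLClassLoader(new URL[0], parent)) {
            Class<?> a = child.loadClass(ParentFirstDemo.class.getName());
            return a.getClassLoader();
        } catch (ClassNotFoundException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ClassLoader defining = definingLoaderViaChild();
        // The class was requested through the child, but its defining loader is the parent.
        System.out.println(defining == ParentFirstDemo.class.getClassLoader());
    }
}
```

Because symbolic references inside a class are resolved through its defining loader, anything A references is then looked up from the parent, which is why steps 6-8 fail when B is only visible to the child.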
>>> >> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <
>>> pwendell@gmail.com
>>> >> >
>>> >> > >>> wrote:
>>> >> > >>>
>>> >> > >>> > Having a user define a custom class inside of an added jar and
>>> >> > >>> > instantiate it directly inside of an executor is definitely
>>> >> > >>> > supported in Spark and has been for a really long time (several
>>> >> > >>> > years). This is something we do all the time in Spark.
>>> >> > >>> >
>>> >> > >>> > DB - I'd hold off on a re-architecting of this until we identify
>>> >> > >>> > exactly what is causing the bug you are running into.
>>> >> > >>> >
>>> >> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
>>> >> > >>> > executor, it will ask the driver for the class over HTTP using a
>>> >> > >>> > custom classloader. Something in that pipeline is breaking here,
>>> >> > >>> > possibly related to the YARN deployment stuff.
>>> >> > >>> >
>>> >> > >>> >
>>> >> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <sowen@cloudera.com
>>> >
>>> >> > wrote:
>>> >> > >>> > > I don't think a custom classloader is necessary.
>>> >> > >>> > >
>>> >> > >>> > > Well, it occurs to me that this is no new problem. Hadoop,
>>> >> > >>> > > Tomcat, etc. all run custom user code that creates new user
>>> >> > >>> > > objects without reflection. I should go see how that's done.
>>> >> > >>> > > Maybe it's totally valid to set the thread's context classloader
>>> >> > >>> > > for just this purpose, and I am not thinking clearly.
>>> >> > >>> > >
>>> >> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew
Ash <
>>> >> andrew@andrewash.com>
>>> >> > >>> > wrote:
>>> >> > >>> > >> Sounds like the problem is that classloaders always look in
>>> >> > >>> > >> their parents before themselves, and Spark users want executors
>>> >> > >>> > >> to pick up classes from their custom code before the ones in
>>> >> > >>> > >> Spark plus its dependencies.
>>> >> > >>> > >>
>>> >> > >>> > >> Would a custom classloader that delegates to the parent after
>>> >> > >>> > >> first checking itself fix this up?
>>> >> > >>> > >>
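A rough Java sketch of the "check itself first, then delegate to the parent" loader suggested above. `ChildFirstURLClassLoader` is a hypothetical name and this is not the classloader Spark ships; it only illustrates the reversed delegation order, and it assumes the loader's URLs never contain JDK classes (a real implementation would likely exempt JDK and framework packages).

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical child-first loader: look in our own URLs before asking the parent.
class ChildFirstURLClassLoader extends URLClassLoader {
    ChildFirstURLClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Reuse a class this loader has already defined.
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Child first: search this loader's own URLs.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Not in our URLs: fall back to the standard parent-first path.
                    return super.loadClass(name, resolve);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs, every lookup falls through to the parent.
        ChildFirstURLClassLoader loader =
            new ChildFirstURLClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        System.out.println(loader.loadClass("java.lang.String") == String.class);
    }
}
```

This changes which copy of a class wins when both loaders have one, but it does not by itself change which loader defined classes that the parent already loaded.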
>>> >> > >>> > >>
>>> >> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB
Tsai <
>>> dbtsai@stanford.edu>
>>> >> > >>> wrote:
>>> >> > >>> > >>
>>> >> > >>> > >>> Hi Sean,
>>> >> > >>> > >>>
>>> >> > >>> > >>> It's true that the issue here is the classloader, and due to
>>> >> > >>> > >>> the classloader delegation model, users have to use reflection
>>> >> > >>> > >>> in the executors to pick up the classloader in order to use the
>>> >> > >>> > >>> classes added by the sc.addJar API. However, that's very
>>> >> > >>> > >>> inconvenient for users, and not documented in Spark.
>>> >> > >>> > >>>
>>> >> > >>> > >>> I'm working on a patch to solve it by calling the protected
>>> >> > >>> > >>> method addURL in URLClassLoader to update the current default
>>> >> > >>> > >>> classloader, so there is no customClassLoader anymore. I wonder
>>> >> > >>> > >>> if this is a good way to go.
>>> >> > >>> > >>>
>>> >> > >>> > >>>   import java.io.IOException
>>> >> > >>> > >>>   import java.lang.reflect.Method
>>> >> > >>> > >>>   import java.net.{URL, URLClassLoader}
>>> >> > >>> > >>>
>>> >> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader): Unit = {
>>> >> > >>> > >>>     try {
>>> >> > >>> > >>>       // addURL is protected, so look it up reflectively and open it.
>>> >> > >>> > >>>       val method: Method =
>>> >> > >>> > >>>         classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>> >> > >>> > >>>       method.setAccessible(true)
>>> >> > >>> > >>>       method.invoke(loader, url)
>>> >> > >>> > >>>     } catch {
>>> >> > >>> > >>>       case t: Throwable =>
>>> >> > >>> > >>>         // Keep the original failure as the cause.
>>> >> > >>> > >>>         throw new IOException(
>>> >> > >>> > >>>           "Error, could not add URL to system classloader", t)
>>> >> > >>> > >>>     }
>>> >> > >>> > >>>   }
>>> >> > >>> > >>>
>>> >> > >>> > >>>
>>> >> > >>> > >>>
>>> >> > >>> > >>> Sincerely,
>>> >> > >>> > >>>
>>> >> > >>> > >>> DB Tsai
>>> >> > >>> > >>>
>>> >> > >>> > >>>
>>> >> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM,
Sean Owen <
>>> >> sowen@cloudera.com>
>>> >> > >>> > wrote:
>>> >> > >>> > >>>
>>> >> > >>> > >>> > I might be stating the obvious for everyone, but the issue
>>> >> > >>> > >>> > here is not reflection or the source of the JAR, but the
>>> >> > >>> > >>> > ClassLoader. The basic rules are this.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
>>> >> > >>> > >>> > usually the ClassLoader that loaded whatever it is that first
>>> >> > >>> > >>> > referenced Foo and caused it to be loaded -- usually the
>>> >> > >>> > >>> > ClassLoader holding your other app classes.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > ClassLoaders can have a parent-child relationship.
>>> >> > >>> > >>> > ClassLoaders always look in their parent before themselves.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your
>>> >> > >>> > >>> > app is loaded in a child ClassLoader, and you reference a
>>> >> > >>> > >>> > class that Hadoop or Tomcat also has (like a lib class) you
>>> >> > >>> > >>> > will get the container's version!)
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > When you load an external JAR it has a separate ClassLoader
>>> >> > >>> > >>> > which does not necessarily bear any relation to the one
>>> >> > >>> > >>> > containing your app classes, so yeah it is not generally
>>> >> > >>> > >>> > going to make "new Foo" work.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > I would not call setContextClassLoader.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > On Mon, May 19, 2014 at 12:00
AM, Sandy Ryza <
>>> >> > >>> > sandy.ryza@cloudera.com>
>>> >> > >>> > >>> > wrote:
>>> >> > >>> > >>> > > I spoke with DB offline about this a little while ago and he
>>> >> > >>> > >>> > > confirmed that he was able to access the jar from the driver.
>>> >> > >>> > >>> > >
>>> >> > >>> > >>> > > The issue appears to be a general Java issue: you can't
>>> >> > >>> > >>> > > directly instantiate a class from a dynamically loaded jar.
>>> >> > >>> > >>> > >
>>> >> > >>> > >>> > > I reproduced it locally outside of Spark with:
>>> >> > >>> > >>> > > ---
>>> >> > >>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(
>>> >> > >>> > >>> > >         new URL[] { new File("myotherjar.jar").toURI().toURL() }, null);
>>> >> > >>> > >>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
>>> >> > >>> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>>> >> > >>> > >>> > > ---
>>> >> > >>> > >>> > >
>>> >> > >>> > >>> > > I was able to load the class with reflection.
>>> >> > >>> > >>> >
>>> >> > >>> > >>>
>>> >> > >>> >
>>> >> > >>>
>>> >> > >>
>>> >> > >>
>>> >> >
>>> >>
>>>
