spark-dev mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: Calling external classes added by sc.addJar needs to be through reflection
Date Wed, 21 May 2014 21:41:15 GMT
That's a good example. If we really want to cover that case, there are
two solutions:

1. Follow DB's patch, adding jars to the system classloader. Then we
cannot put a user class in front of an existing class.
2. Do not send the primary jar and secondary jars to the executors'
distributed cache. Instead, add them to "spark.jars" in SparkSubmit
and serve them via HTTP by calling sc.addJar in SparkContext (a rough
sketch follows below).

What is your preference?
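
For concreteness, a minimal sketch of what option 2 would look like from the
application side (the jar paths and app name below are illustrative, not taken
from this thread):

  import org.apache.spark.{SparkConf, SparkContext}

  // Under option 2, jars listed in "spark.jars" would be served over the
  // driver's HTTP server, the same path sc.addJar uses for jars added at runtime.
  val conf = new SparkConf()
    .setAppName("addJarExample")                    // illustrative name
    .set("spark.jars", "/path/to/secondary.jar")    // assumed path
  val sc = new SparkContext(conf)
  sc.addJar("/path/to/another-secondary.jar")       // assumed path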

On Wed, May 21, 2014 at 2:27 PM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
> Is that an assumption we can make?  I think we'd run into an issue in this
> situation:
>
> *In primary jar:*
> def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()
>
> *In app code:*
> sc.addJar("dynamicjar.jar")
> ...
> rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))
>
> It might be fair to say that the user should make sure to use the context
> classloader when instantiating dynamic classes, but I think it's weird that
> this code would work on Spark standalone but not on YARN.
>
> -Sandy
>
>
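A minimal sketch of the context-classloader approach Sandy mentions above,
reusing his hypothetical makeDynamicObject helper (the names come from his
example; the claim about what the executor's context classloader points at is
an assumption):

  // Resolve the class through the thread's context classloader, which the
  // executor is expected to point at the loader that knows about jars added
  // via sc.addJar, instead of the loader that defined the calling class.
  def makeDynamicObject(clazz: String): Any =
    Class.forName(clazz, true, Thread.currentThread().getContextClassLoader)
      .newInstance()

  // Usage inside a task, mirroring the example above:
  // rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))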
> On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>
>> I think adding jars dynamically should work as long as the primary jar
>> and the secondary jars do not depend on dynamically added jars, which
>> should be the correct logic. -Xiangrui
>>
>> On Wed, May 21, 2014 at 1:40 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>> > This will be another separate story.
>> >
>> > Since in the YARN deployment, as Sandy said, the app.jar will always be
>> > loaded by the system classloader, which means any object instantiated from
>> > app.jar will have the system classloader as its parent loader instead of
>> > the custom one. As a result, the custom classloader in YARN will never
>> > work without explicitly using reflection.
>> >
>> > The solution would be to not use the system classloader in the classloader
>> > hierarchy, and instead add all the resources from the system one into the
>> > custom one. This is the approach Tomcat takes.
>> >
>> > Or we can directly modify the system classloader by calling its protected
>> > method `addURL`, which will not work and will throw an exception if the
>> > code runs under a security manager.
>> >
>> >
>> > Sincerely,
>> >
>> > DB Tsai
>> > -------------------------------------------------------
>> > My Blog: https://www.dbtsai.com
>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>> >
>> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
>> >>
>> >> This will solve the issue for jars added upon application submission, but,
>> >> on top of this, we need to make sure that anything dynamically added
>> >> through sc.addJar works as well.
>> >>
>> >> To do so, we need to make sure that any jars retrieved via the driver's
>> >> HTTP server are loaded by the same classloader that loads the jars given
>> >> on app submission. To achieve this, we need to either use the same
>> >> classloader for both system jars and user jars, or make sure that the
>> >> user jars given on app submission are under the same classloader used for
>> >> dynamically added jars.
>> >>
>> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>> >>
>> >> > Talked with Sandy and DB offline. I think the best solution is sending
>> >> > the secondary jars to the distributed cache of all containers rather
>> >> > than just the master, and setting the classpath to include the Spark
>> >> > jar, the primary app jar, and the secondary jars before the executor
>> >> > starts. In this way, the user only needs to specify secondary jars via
>> >> > --jars instead of calling sc.addJar inside the code. It also solves the
>> >> > scalability problem of serving all the jars via HTTP.
>> >> >
>> >> > If this solution sounds good, I can try to make a patch.
>> >> >
>> >> > Best,
>> >> > Xiangrui
>> >> >
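If this lands, the user-facing side stays the existing --jars flag; a
hypothetical invocation (paths, class name, and master are illustrative):

  spark-submit --class com.example.MyApp \
    --master yarn-cluster \
    --jars /path/to/secondary1.jar,/path/to/secondary2.jar \
    /path/to/primary.jar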
>> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>> >> > > In 1.0, there is a new option for users to choose which classloader has
>> >> > > higher priority via spark.files.userClassPathFirst, but I decided to
>> >> > > submit the PR for 0.9 first. We use this patch in our lab, and we can
>> >> > > use those jars added by sc.addJar without reflection.
>> >> > >
>> >> > > https://github.com/apache/spark/pull/834
>> >> > >
>> >> > > Can anyone comment if it's a good approach?
>> >> > >
>> >> > > Thanks.
>> >> > >
>> >> > >
>> >> > > Sincerely,
>> >> > >
>> >> > > DB Tsai
>> >> > > -------------------------------------------------------
>> >> > > My Blog: https://www.dbtsai.com
>> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> > >
>> >> > >
>> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>> >> > >
>> >> > >> Good summary! We fixed it in branch 0.9 since our production is still
>> >> > >> on 0.9. I'm porting it to 1.0 now, and will hopefully submit a PR for
>> >> > >> 1.0 tonight.
>> >> > >>
>> >> > >>
>> >> > >> Sincerely,
>> >> > >>
>> >> > >> DB Tsai
>> >> > >> -------------------------------------------------------
>> >> > >> My Blog: https://www.dbtsai.com
>> >> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> > >>
>> >> > >>
>> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
>> >> > >>
>> >> > >>> It just hit me why this problem is showing up on YARN and not on
>> >> > >>> standalone.
>> >> > >>>
>> >> > >>> The relevant difference between YARN and standalone is that, on YARN,
>> >> > >>> the app jar is loaded by the system classloader instead of Spark's
>> >> > >>> custom URL classloader.
>> >> > >>>
>> >> > >>> On YARN, the system classloader knows about [the classes in the Spark
>> >> > >>> jars, the classes in the primary app jar]. The custom classloader
>> >> > >>> knows about [the classes in secondary app jars] and has the system
>> >> > >>> classloader as its parent.
>> >> > >>>
>> >> > >>> A few relevant facts (mostly redundant with what Sean pointed out):
>> >> > >>> * Every class has a classloader that loaded it.
>> >> > >>> * When an object of class B is instantiated inside of class A, the
>> >> > >>> classloader used for loading B is the classloader that was used for
>> >> > >>> loading A.
>> >> > >>> * When a classloader fails to load a class, it lets its parent
>> >> > >>> classloader try. If its parent succeeds, its parent becomes the
>> >> > >>> "classloader that loaded it".
>> >> > >>>
>> >> > >>> So suppose class B is in a secondary app jar and class A is in the
>> >> > >>> primary app jar:
>> >> > >>> 1. The custom classloader will try to load class A.
>> >> > >>> 2. It will fail, because it only knows about the secondary jars.
>> >> > >>> 3. It will delegate to its parent, the system classloader.
>> >> > >>> 4. The system classloader will succeed, because it knows about the
>> >> > >>> primary app jar.
>> >> > >>> 5. A's classloader will be the system classloader.
>> >> > >>> 6. A tries to instantiate an instance of class B.
>> >> > >>> 7. B will be loaded with A's classloader, which is the system
>> >> > >>> classloader.
>> >> > >>> 8. Loading B will fail, because A's classloader, which is the system
>> >> > >>> classloader, doesn't know about the secondary app jars.
>> >> > >>>
>> >> > >>> In Spark standalone, A and B are both loaded by the custom
>> >> > >>> classloader, so this issue doesn't come up.
>> >> > >>>
>> >> > >>> -Sandy
>> >> > >>>
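A hypothetical illustration of steps 5-8 above (A and B stand in for the
primary-jar and secondary-jar classes in Sandy's walkthrough; the class names
are not real):

  // "new B()" compiled into A's code is resolved through A's defining
  // classloader, not through the thread's context classloader:
  val loaderOfA = Class.forName("A").getClassLoader  // the system classloader on YARN
  val b = loaderOfA.loadClass("B").newInstance()     // fails: that loader can't see secondary jars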
>> >> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pwendell@gmail.com> wrote:
>> >> > >>>
>> >> > >>> > Having a user define a custom class inside of an added jar and
>> >> > >>> > instantiate it directly inside of an executor is definitely
>> >> > >>> > supported in Spark and has been for a really long time (several
>> >> > >>> > years). This is something we do all the time in Spark.
>> >> > >>> >
>> >> > >>> > DB - I'd hold off on re-architecting this until we identify
>> >> > >>> > exactly what is causing the bug you are running into.
>> >> > >>> >
>> >> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
>> >> > >>> > executor, it will ask the driver for the class over HTTP using a
>> >> > >>> > custom classloader. Something in that pipeline is breaking here,
>> >> > >>> > possibly related to the YARN deployment stuff.
>> >> > >>> >
>> >> > >>> >
>> >> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <sowen@cloudera.com> wrote:
>> >> > >>> > > I don't think a custom classloader is necessary.
>> >> > >>> > >
>> >> > >>> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat,
>> >> > >>> > > etc. all run custom user code that creates new user objects without
>> >> > >>> > > reflection. I should go see how that's done. Maybe it's totally
>> >> > >>> > > valid to set the thread's context classloader for just this
>> >> > >>> > > purpose, and I am not thinking clearly.
>> >> > >>> > >
>> >> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <andrew@andrewash.com> wrote:
>> >> > >>> > >> Sounds like the problem is that classloaders always look in their
>> >> > >>> > >> parents before themselves, and Spark users want executors to pick
>> >> > >>> > >> up classes from their custom code before the ones in Spark plus
>> >> > >>> > >> its dependencies.
>> >> > >>> > >>
>> >> > >>> > >> Would a custom classloader that delegates to the parent after
>> >> > >>> > >> first checking itself fix this up?
>> >> > >>> > >>
>> >> > >>> > >>
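A minimal sketch of the child-first classloader Andrew describes, as an
illustration only (this is not Spark's actual implementation):

  import java.net.{URL, URLClassLoader}

  class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
      extends URLClassLoader(urls, parent) {

    override def loadClass(name: String, resolve: Boolean): Class[_] = {
      // Reuse an already-defined class if we have one.
      var c = findLoadedClass(name)
      if (c == null) {
        c = try {
          findClass(name)                        // look in our own URLs first
        } catch {
          case _: ClassNotFoundException =>
            super.loadClass(name, resolve)       // then fall back to the parent
        }
      }
      if (resolve) resolveClass(c)
      c
    }
  }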
>> >> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <dbtsai@stanford.edu> wrote:
>> >> > >>> > >>
>> >> > >>> > >>> Hi Sean,
>> >> > >>> > >>>
>> >> > >>> > >>> It's true that the issue here is the classloader, and due to the
>> >> > >>> > >>> classloader delegation model, users have to use reflection in the
>> >> > >>> > >>> executors to pick up the classloader in order to use those classes
>> >> > >>> > >>> added by the sc.addJar API. However, it's very inconvenient for
>> >> > >>> > >>> users, and not documented in Spark.
>> >> > >>> > >>>
>> >> > >>> > >>> I'm working on a patch to solve it by calling the protected method
>> >> > >>> > >>> addURL in URLClassLoader to update the current default classloader,
>> >> > >>> > >>> so no custom classloader is needed anymore. I wonder if this is a
>> >> > >>> > >>> good way to go.
>> >> > >>> > >>>
>> >> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader) {
>> >> > >>> > >>>     try {
>> >> > >>> > >>>       val method: Method =
>> >> > >>> > >>>         classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>> >> > >>> > >>>       method.setAccessible(true)
>> >> > >>> > >>>       method.invoke(loader, url)
>> >> > >>> > >>>     } catch {
>> >> > >>> > >>>       case t: Throwable =>
>> >> > >>> > >>>         throw new IOException("Error, could not add URL to system classloader")
>> >> > >>> > >>>     }
>> >> > >>> > >>>   }
>> >> > >>> > >>>
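A hedged usage sketch for the helper above: on Java 7/8 the system classloader
is a URLClassLoader, so it can be widened this way (this is also exactly the
part a SecurityManager can veto, as noted above; the jar name is illustrative):

  import java.io.File
  import java.net.URLClassLoader

  val sysLoader = ClassLoader.getSystemClassLoader.asInstanceOf[URLClassLoader]
  addURL(new File("dynamicjar.jar").toURI.toURL, sysLoader)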
>> >> > >>> > >>>
>> >> > >>> > >>>
>> >> > >>> > >>> Sincerely,
>> >> > >>> > >>>
>> >> > >>> > >>> DB Tsai
>> >> > >>> > >>> -------------------------------------------------------
>> >> > >>> > >>> My Blog: https://www.dbtsai.com
>> >> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> > >>> > >>>
>> >> > >>> > >>>
>> >> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <sowen@cloudera.com> wrote:
>> >> > >>> > >>>
>> >> > >>> > >>> > I might be stating the obvious for everyone, but the issue
>> >> > >>> > >>> > here is not reflection or the source of the JAR, but the
>> >> > >>> > >>> > ClassLoader. The basic rules are these.
>> >> > >>> > >>> >
>> >> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
>> >> > >>> > >>> > usually the ClassLoader that loaded whatever it is that first
>> >> > >>> > >>> > referenced Foo and caused it to be loaded -- usually the
>> >> > >>> > >>> > ClassLoader holding your other app classes.
>> >> > >>> > >>> >
>> >> > >>> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
>> >> > >>> > >>> > always look in their parent before themselves.
>> >> > >>> > >>> >
>> >> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your
>> >> > >>> > >>> > app is loaded in a child ClassLoader, and you reference a class
>> >> > >>> > >>> > that Hadoop or Tomcat also has (like a lib class), you will get
>> >> > >>> > >>> > the container's version!)
>> >> > >>> > >>> >
>> >> > >>> > >>> > When you load an external JAR it has a separate ClassLoader
>> >> > >>> > >>> > which does not necessarily bear any relation to the one
>> >> > >>> > >>> > containing your app classes, so yeah, it is not generally going
>> >> > >>> > >>> > to make "new Foo" work.
>> >> > >>> > >>> >
>> >> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
>> >> > >>> > >>> >
>> >> > >>> > >>> > I would not call setContextClassLoader.
>> >> > >>> > >>> >
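A small sketch of the parent-first rule described above, using a child loader
with no URLs of its own (the class looked up is chosen only to show delegation):

  import java.net.{URL, URLClassLoader}

  val parent = getClass.getClassLoader                 // loader holding the app classes
  val child  = new URLClassLoader(Array.empty[URL], parent)

  // The child has nothing on its own classpath, yet this succeeds, because
  // loadClass consults the parent before the child's own URLs.
  val cls = child.loadClass("scala.Option")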
>> >> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
>> >> > >>> > >>> > > I spoke with DB offline about this a little while ago and he
>> >> > >>> > >>> > > confirmed that he was able to access the jar from the driver.
>> >> > >>> > >>> > >
>> >> > >>> > >>> > > The issue appears to be a general Java issue: you can't directly
>> >> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
>> >> > >>> > >>> > >
>> >> > >>> > >>> > > I reproduced it locally outside of Spark with:
>> >> > >>> > >>> > > ---
>> >> > >>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(
>> >> > >>> > >>> > >         new URL[] { new File("myotherjar.jar").toURI().toURL() }, null);
>> >> > >>> > >>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
>> >> > >>> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>> >> > >>> > >>> > > ---
>> >> > >>> > >>> > >
>> >> > >>> > >>> > > I was able to load the class with reflection.
>> >> > >>> > >>> >
>> >> > >>> > >>>
>> >> > >>> >
>> >> > >>>
>> >> > >>
>> >> > >>
>> >> >
>> >>
>>
