spark-dev mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: Calling external classes added by sc.addJar needs to be through reflection
Date Wed, 21 May 2014 00:59:16 GMT
Talked with Sandy and DB offline. I think the best solution is to send
the secondary jars to the distributed cache of all containers rather
than just the master, and to set the classpath to include the Spark jar,
the primary app jar, and the secondary jars before the executor starts.
That way, users only need to specify secondary jars via --jars instead
of calling sc.addJar inside the code. It also solves the scalability
problem of serving all the jars via HTTP.
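From the user's side, the proposed flow would look roughly like the following spark-submit invocation (the application class, jar names, and master value are placeholders for illustration):

```shell
# Ship the secondary jars to every container and put them on the
# executor classpath up front, instead of adding them at runtime
# with sc.addJar.
spark-submit \
  --master yarn-client \
  --jars mylib1.jar,mylib2.jar \
  --class com.example.MyApp \
  myapp.jar
```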

If this solution sounds good, I can try to make a patch.

Best,
Xiangrui

On Mon, May 19, 2014 at 10:04 PM, DB Tsai <dbtsai@stanford.edu> wrote:
> In 1.0 there is a new option, spark.files.userClassPathFirst, that lets
> users choose which classloader has higher priority, but I decided to
> submit the PR for 0.9 first. We use this patch in our lab, and we can use
> the jars added by sc.addJar without reflection.
>
> https://github.com/apache/spark/pull/834
>
> Can anyone comment if it's a good approach?
>
> Thanks.
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Mon, May 19, 2014 at 7:42 PM, DB Tsai <dbtsai@stanford.edu> wrote:
>
>> Good summary! We fixed it in branch 0.9 since our production is still on
>> 0.9. I'm porting it to 1.0 now, and hopefully will submit a PR for 1.0
>> tonight.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
>>
>>> It just hit me why this problem is showing up on YARN and not on
>>> standalone.
>>>
>>> The relevant difference between YARN and standalone is that, on YARN, the
>>> app jar is loaded by the system classloader instead of Spark's custom URL
>>> classloader.
>>>
>>> On YARN, the system classloader knows about [the classes in the Spark
>>> jars, the classes in the primary app jar]. The custom classloader knows
>>> about [the classes in the secondary app jars] and has the system
>>> classloader as its parent.
>>>
>>> A few relevant facts (mostly redundant with what Sean pointed out):
>>> * Every class has a classloader that loaded it.
>>> * When an object of class B is instantiated inside of class A, the
>>> classloader used for loading B is the classloader that was used for
>>> loading A.
>>> * When a classloader fails to load a class, it lets its parent classloader
>>> try.  If its parent succeeds, its parent becomes the "classloader that
>>> loaded it".
>>>
>>> So suppose class B is in a secondary app jar and class A is in the primary
>>> app jar:
>>> 1. The custom classloader will try to load class A.
>>> 2. It will fail, because it only knows about the secondary jars.
>>> 3. It will delegate to its parent, the system classloader.
>>> 4. The system classloader will succeed, because it knows about the primary
>>> app jar.
>>> 5. A's classloader will be the system classloader.
>>> 6. A tries to instantiate an instance of class B.
>>> 7. B will be loaded with A's classloader, which is the system classloader.
>>> 8. Loading B will fail, because A's classloader, which is the system
>>> classloader, doesn't know about the secondary app jars.
>>>
>>> In Spark standalone, A and B are both loaded by the custom classloader, so
>>> this issue doesn't come up.
>>>
>>> -Sandy
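Sandy's walkthrough can be reproduced outside of Spark with a tiny standalone demo. The class name DelegationDemo is made up for illustration; the point is that a child URLClassLoader with no jars of its own hands lookups to its parent, which then becomes the defining loader:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class DelegationDemo {
    public static void main(String[] args) throws Exception {
        // A child loader that knows about no jars of its own, with the
        // system classloader (which loaded this class) as its parent.
        URLClassLoader child =
                new URLClassLoader(new URL[0], ClassLoader.getSystemClassLoader());

        // Per the delegation model, the child asks its parent first; the
        // parent finds the class, so the parent is its defining loader.
        Class<?> c = child.loadClass("DelegationDemo");
        System.out.println(c.getClassLoader() == ClassLoader.getSystemClassLoader());
        // prints "true": anything this class instantiates with `new` is
        // resolved through the system loader, not through `child`.
    }
}
```

This is exactly step 7 in the scenario above: B is looked up through A's (system) classloader, so the secondary jars are invisible.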
>>>
>>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pwendell@gmail.com>
>>> wrote:
>>>
>>> > Having a user define a custom class inside an added jar and
>>> > instantiate it directly inside an executor is definitely supported
>>> > in Spark and has been for a really long time (several years). This is
>>> > something we do all the time in Spark.
>>> >
>>> > DB - I'd hold off on a re-architecting of this until we identify
>>> > exactly what is causing the bug you are running into.
>>> >
>>> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
>>> > it will ask the driver for the class over HTTP using a custom
>>> > classloader. Something in that pipeline is breaking here, possibly
>>> > related to the YARN deployment stuff.
>>> >
>>> >
>>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <sowen@cloudera.com> wrote:
>>> > > I don't think a custom classloader is necessary.
>>> > >
>>> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat,
>>> > > etc. all run custom user code that creates new user objects without
>>> > > reflection. I should go see how that's done. Maybe it's totally valid
>>> > > to set the thread's context classloader for just this purpose, and I
>>> > > am not thinking clearly.
>>> > >
>>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <andrew@andrewash.com>
>>> > wrote:
>>> > >> Sounds like the problem is that classloaders always look in their
>>> > >> parents before themselves, and Spark users want executors to pick up
>>> > >> classes from their custom code before the ones in Spark plus its
>>> > >> dependencies.
>>> > >>
>>> > >> Would a custom classloader that delegates to the parent only after
>>> > >> first checking itself fix this up?
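A minimal sketch of the loader Andrew describes, assuming we subclass URLClassLoader and reverse the usual lookup order (the name ChildFirstClassLoader is invented here; this is not claiming to be Spark's actual implementation):

```java
import java.net.URL;
import java.net.URLClassLoader;

// Child-first delegation: try our own URLs before asking the parent,
// the reverse of the default parent-first order.
public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected synchronized Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        // Already defined by this loader?
        Class<?> c = findLoadedClass(name);
        if (c == null) {
            try {
                // First, look in our own URLs...
                c = findClass(name);
            } catch (ClassNotFoundException e) {
                // ...and only then fall back to the parent.
                c = super.loadClass(name, resolve);
            }
        }
        if (resolve) {
            resolveClass(c);
        }
        return c;
    }
}
```

The trade-off is that a child-first loader can shadow classes the framework itself depends on, which is presumably why Spark gates it behind an option like spark.files.userClassPathFirst rather than making it the default.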
>>> > >>
>>> > >>
>>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <dbtsai@stanford.edu>
>>> wrote:
>>> > >>
>>> > >>> Hi Sean,
>>> > >>>
>>> > >>> It's true that the issue here is the classloader. Due to the
>>> > >>> classloader delegation model, users have to use reflection in the
>>> > >>> executors to pick up the right classloader in order to use the
>>> > >>> classes added by the sc.addJar API. However, this is very
>>> > >>> inconvenient for users, and it is not documented in Spark.
>>> > >>>
>>> > >>> I'm working on a patch that solves it by calling the protected
>>> > >>> method addURL in URLClassLoader to update the current default
>>> > >>> classloader, so no custom classloader is needed anymore. I wonder
>>> > >>> if this is a good way to go.
>>> > >>>
>>> > >>>   private def addURL(url: URL, loader: URLClassLoader) {
>>> > >>>     try {
>>> > >>>       val method: Method =
>>> > >>>         classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>> > >>>       method.setAccessible(true)
>>> > >>>       method.invoke(loader, url)
>>> > >>>     } catch {
>>> > >>>       case t: Throwable =>
>>> > >>>         throw new IOException("Error, could not add URL to system classloader")
>>> > >>>     }
>>> > >>>   }
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> Sincerely,
>>> > >>>
>>> > >>> DB Tsai
>>> > >>> -------------------------------------------------------
>>> > >>> My Blog: https://www.dbtsai.com
>>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>> > >>>
>>> > >>>
>>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <sowen@cloudera.com>
>>> > wrote:
>>> > >>>
>>> > >>> > I might be stating the obvious for everyone, but the issue here
>>> > >>> > is not reflection or the source of the JAR, but the ClassLoader.
>>> > >>> > The basic rules are this.
>>> > >>> >
>>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
>>> > >>> > usually the ClassLoader that loaded whatever it is that first
>>> > >>> > referenced Foo and caused it to be loaded -- usually the
>>> > >>> > ClassLoader holding your other app classes.
>>> > >>> >
>>> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
>>> > >>> > always look in their parent before themselves.
>>> > >>> >
>>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your app
>>> > >>> > is loaded in a child ClassLoader, and you reference a class that
>>> > >>> > Hadoop or Tomcat also has (like a lib class), you will get the
>>> > >>> > container's version!)
>>> > >>> >
>>> > >>> > When you load an external JAR it has a separate ClassLoader which
>>> > >>> > does not necessarily bear any relation to the one containing your
>>> > >>> > app classes, so yeah, it is not generally going to make "new Foo"
>>> > >>> > work.
>>> > >>> >
>>> > >>> > Reflection lets you pick the ClassLoader, yes.
>>> > >>> >
>>> > >>> > I would not call setContextClassLoader.
>>> > >>> >
>>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza
>>> > >>> > <sandy.ryza@cloudera.com> wrote:
>>> > >>> > > I spoke with DB offline about this a little while ago and he
>>> > >>> > > confirmed that he was able to access the jar from the driver.
>>> > >>> > >
>>> > >>> > > The issue appears to be a general Java issue: you can't
>>> > >>> > > directly instantiate a class from a dynamically loaded jar.
>>> > >>> > >
>>> > >>> > > I reproduced it locally outside of Spark with:
>>> > >>> > > ---
>>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(
>>> > >>> > >         new URL[] { new File("myotherjar.jar").toURI().toURL() },
>>> > >>> > >         null);
>>> > >>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
>>> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>>> > >>> > > ---
>>> > >>> > >
>>> > >>> > > I was able to load the class with reflection.
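The reflective route Sandy means looks roughly like the following. java.util.ArrayList stands in for MyClassFromMyOtherJar so the sketch is self-contained; in the real case the URL array would point at myotherjar.jar:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ReflectiveLoad {
    public static void main(String[] args) throws Exception {
        // In Sandy's repro this array would hold
        // new File("myotherjar.jar").toURI().toURL().
        URLClassLoader loader =
                new URLClassLoader(new URL[0], ClassLoader.getSystemClassLoader());

        // Class.forName with an explicit loader lets us pick the loader
        // ourselves, sidestepping the "defining loader of the caller"
        // rule that makes plain `new` fail.
        Class<?> clazz = Class.forName("java.util.ArrayList", true, loader);
        Object obj = clazz.getDeclaredConstructor().newInstance();
        System.out.println(obj.getClass().getName());
        // prints "java.util.ArrayList"
    }
}
```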
>>> > >>> >
>>> > >>>
>>> >
>>>
>>
>>
