spark-dev mailing list archives

From DB Tsai <dbt...@stanford.edu>
Subject Re: Calling external classes added by sc.addJar needs to be through reflection
Date Tue, 20 May 2014 02:42:53 GMT
Good summary! We fixed it in branch 0.9 since our production is still on
0.9. I'm porting it to 1.0 now, and hopefully will submit a PR for 1.0
tonight.


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:

> It just hit me why this problem is showing up on YARN and not on
> standalone.
>
> The relevant difference between YARN and standalone is that, on YARN, the
> app jar is loaded by the system classloader instead of Spark's custom URL
> classloader.
>
> On YARN, the system classloader knows about [the classes in the spark jars,
> the classes in the primary app jar].   The custom classloader knows about
> [the classes in secondary app jars] and has the system classloader as its
> parent.
>
> A few relevant facts (mostly redundant with what Sean pointed out):
> * Every class has a classloader that loaded it.
> * When an object of class B is instantiated inside of class A, the
> classloader used for loading B is the classloader that was used for loading
> A.
> * When a classloader is asked to load a class, by default it lets its parent
> classloader try first and looks in its own jars only if the parent fails.
> Whichever classloader actually finds the class becomes the "classloader
> that loaded it".
>
> So suppose class B is in a secondary app jar and class A is in the primary
> app jar:
> 1. The custom classloader is asked to load class A.
> 2. It delegates to its parent, the system classloader, first.
> 3. The system classloader succeeds, because it knows about the primary
> app jar.
> 4. A's classloader will be the system classloader.
> 5. A tries to instantiate an instance of class B.
> 6. B will be loaded with A's classloader, which is the system classloader.
> 7. Loading B will fail, because A's classloader, which is the system
> classloader, doesn't know about the secondary app jars.
>
> In Spark standalone, A and B are both loaded by the custom classloader, so
> this issue doesn't come up.
>
> -Sandy
>
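
The outcome of the walk-through above (class A ends up owned by the parent
loader even though it was requested through the child) can be checked with a
small stand-alone sketch. This is only an illustration under assumptions:
DelegationDemo and RecordingLoader are hypothetical names, the child loader has
no jars at all, and nothing here is Spark or YARN code.

```java
import java.net.URL;
import java.net.URLClassLoader;

class DelegationDemo {
    // Hypothetical stand-in for a custom loader: it has no jars of its own
    // and records whether it was ever asked to find a class itself.
    static class RecordingLoader extends URLClassLoader {
        boolean askedToFind = false;

        RecordingLoader(ClassLoader parent) {
            super(new URL[0], parent);
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            askedToFind = true;            // reached only if the parent could not load it
            return super.findClass(name);  // no URLs, so this always throws
        }
    }

    // Requests this demo class through a child loader and returns the loader
    // that actually defined it.
    static ClassLoader definingLoaderViaChild() throws ClassNotFoundException {
        ClassLoader parent = DelegationDemo.class.getClassLoader();
        RecordingLoader child = new RecordingLoader(parent);
        Class<?> a = child.loadClass(DelegationDemo.class.getName());
        return a.getClassLoader();
    }

    // True if the child was consulted for a class that its parent cannot find.
    static boolean childConsultedForUnknownClass() {
        RecordingLoader child = new RecordingLoader(DelegationDemo.class.getClassLoader());
        try {
            child.loadClass("no.such.ClassInAnyJar");
        } catch (ClassNotFoundException expected) {
            // neither the parent nor the child knows this class
        }
        return child.askedToFind;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(definingLoaderViaChild() == DelegationDemo.class.getClassLoader());
        System.out.println(childConsultedForUnknownClass());
    }
}
```

Because the class comes back with the parent as its classloader, any "new B()"
inside it is resolved through the parent, which is the failure described above
when B lives only in a jar known to the child.
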
> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pwendell@gmail.com>
> wrote:
>
> > Having a user define a custom class inside of an added jar and
> > instantiate it directly inside of an executor is definitely supported
> > in Spark and has been for a really long time (several years). This is
> > something we do all the time in Spark.
> >
> > DB - I'd hold off on a re-architecting of this until we identify
> > exactly what is causing the bug you are running into.
> >
> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
> > it will ask the driver for the class over HTTP using a custom
> > classloader. Something in that pipeline is breaking here, possibly
> > related to the YARN deployment stuff.
> >
> >
> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <sowen@cloudera.com> wrote:
> > > I don't think a custom classloader is necessary.
> > >
> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
> > > all run custom user code that creates new user objects without
> > > reflection. I should go see how that's done. Maybe it's totally valid
> > > to set the thread's context classloader for just this purpose, and I
> > > am not thinking clearly.
> > >
> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <andrew@andrewash.com> wrote:
> > >> Sounds like the problem is that classloaders always look in their parents
> > >> before themselves, and Spark users want executors to pick up classes from
> > >> their custom code before the ones in Spark plus its dependencies.
> > >>
> > >> Would a custom classloader that delegates to the parent after first
> > >> checking itself fix this up?
> > >>
> > >>
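
For reference, the kind of loader Andrew is asking about could be sketched
roughly as follows. This is a minimal illustration, not code from Spark:
ChildFirstClassLoader is a hypothetical name, and a production version would
normally refuse to load java.* and other platform classes from its own URLs.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical class for illustration only.
class ChildFirstClassLoader extends URLClassLoader {
    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);       // already defined by this loader?
            if (c == null) {
                try {
                    c = findClass(name);              // look in our own URLs first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, false); // then fall back to the default lookup
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        ChildFirstClassLoader loader =
            new ChildFirstClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        // No URLs are configured here, so every lookup falls through to the parent.
        System.out.println(loader.loadClass("java.lang.String") == String.class);
    }
}
```

Note that reversing the lookup order only changes which copy of a class wins
when both loaders have it. It does not by itself help a class that was defined
by the parent see classes that exist only in the child.
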
> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <dbtsai@stanford.edu> wrote:
> > >>
> > >>> Hi Sean,
> > >>>
> > >>> It's true that the issue here is the classloader, and due to the
> > >>> classloader delegation model, users have to use reflection in the
> > >>> executors to pick up the classloader in order to use those classes
> > >>> added by sc.addJars APIs. However, it's very inconvenient for users,
> > >>> and not documented in Spark.
> > >>>
> > >>> I'm working on a patch to solve it by calling the protected method addURL
> > >>> in URLClassLoader to update the current default classloader, so no
> > >>> customClassLoader anymore. I wonder if this is a good way to go.
> > >>>
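> > >>>   import java.io.IOException
> > >>>   import java.lang.reflect.Method
> > >>>   import java.net.{URL, URLClassLoader}
> > >>>   import scala.util.control.NonFatal
> > >>>
> > >>>   // Reflectively call the protected URLClassLoader.addURL on `loader`.
> > >>>   private def addURL(url: URL, loader: URLClassLoader): Unit = {
> > >>>     try {
> > >>>       val method: Method =
> > >>>         classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
> > >>>       method.setAccessible(true)
> > >>>       method.invoke(loader, url)
> > >>>     } catch {
> > >>>       case NonFatal(e) =>
> > >>>         // Keep the underlying exception as the cause.
> > >>>         throw new IOException(
> > >>>           "Error, could not add URL to system classloader", e)
> > >>>     }
> > >>>   }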
> > >>>
> > >>>
> > >>>
> > >>> Sincerely,
> > >>>
> > >>> DB Tsai
> > >>> -------------------------------------------------------
> > >>> My Blog: https://www.dbtsai.com
> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> > >>>
> > >>>
> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <sowen@cloudera.com> wrote:
> > >>>
> > >>> > I might be stating the obvious for everyone, but the issue here is not
> > >>> > reflection or the source of the JAR, but the ClassLoader. The basic
> > >>> > rules are this.
> > >>> >
> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
> > >>> > the ClassLoader that loaded whatever it is that first referenced Foo
> > >>> > and caused it to be loaded -- usually the ClassLoader holding your
> > >>> > other app classes.
> > >>> >
> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders always
> > >>> > look in their parent before themselves.
> > >>> >
> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
> > >>> > loaded in a child ClassLoader, and you reference a class that Hadoop
> > >>> > or Tomcat also has (like a lib class) you will get the container's
> > >>> > version!)
> > >>> >
> > >>> > When you load an external JAR it has a separate ClassLoader which does
> > >>> > not necessarily bear any relation to the one containing your app
> > >>> > classes, so yeah it is not generally going to make "new Foo" work.
> > >>> >
> > >>> > Reflection lets you pick the ClassLoader, yes.
> > >>> >
> > >>> > I would not call setContextClassLoader.
> > >>> >
> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sandy.ryza@cloudera.com>
> > >>> > wrote:
> > >>> > > I spoke with DB offline about this a little while ago and he confirmed
> > >>> > > that he was able to access the jar from the driver.
> > >>> > >
> > >>> > > The issue appears to be a general Java issue: you can't directly
> > >>> > > instantiate a class from a dynamically loaded jar.
> > >>> > >
> > >>> > > I reproduced it locally outside of Spark with:
> > >>> > > ---
> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(
> > >>> > >         new URL[] { new File("myotherjar.jar").toURI().toURL() }, null);
> > >>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > >>> > > ---
> > >>> > >
> > >>> > > I was able to load the class with reflection.
> > >>> >
> > >>>
> >
>
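
The reflection workaround mentioned at the bottom of the quoted thread would
look roughly like the sketch below. It is an assumption about the general
shape, not the code Sandy actually ran: ReflectiveLoad and instantiate are
hypothetical names, and the demo loads java.util.ArrayList so that it runs
without any extra jar.

```java
import java.net.URL;
import java.net.URLClassLoader;

class ReflectiveLoad {
    // Load className through the given loader and call its no-arg constructor.
    static Object instantiate(ClassLoader loader, String className)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className, true, loader);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // With a null parent the loader sees only its own URLs plus the JDK classes.
        URLClassLoader isolated = new URLClassLoader(new URL[0], null);
        Object list = instantiate(isolated, "java.util.ArrayList");
        System.out.println(list.getClass().getName());
    }
}
```

With the loader from the quoted reproduction, the call would be
instantiate(urlClassLoader, "MyClassFromMyOtherJar"). The result has to be
handled as Object, or through an interface the calling code's own classloader
can see, because the calling code cannot name the concrete type directly.
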
