tez-dev mailing list archives

From Hitesh Shah <hit...@apache.org>
Subject Re: ClassNotFoundException with custom InputFormat.
Date Thu, 18 Jun 2015 20:49:27 GMT
Hi Andre, 

You can use the PATTERN approach for both the AM and the tasks. The only issue, as Sid pointed
out, is how the classpath is manipulated for the AM versus the tasks. Task handling is a bit
different, as there are rules in place with respect to container re-use to correctly handle
setting up the necessary environment/classpath, etc. A new container can always be launched to
handle a task with different requirements, but that is not really an option for the AM.

There are a couple of improvements that could be made:
   - enhance tez.aux.uris to be able to support archive patterns. tez.aux.uris is applied
     to the AM as well as all tasks/containers so this would be the simplest to set up when
     there is a need to universally add local resources. Adding archives should be trivial
     but pattern will be a bit tricky.
   - make it simple to specify/modify class path. The TEZ_CLUSTER_ADDITIONAL_CLASSPATH_PREFIX
     is also universally applied and could be an option if tez.aux.uris is enhanced.
   - the other approach as Sid mentioned was to create helper functions that make it easy
     to specify both local resources as well as implicitly changing class paths as needed
     via the helper.
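
To make the helper idea a bit more concrete, here is a minimal sketch of what such a
helper could look like. It is only an illustration and not part of the Tez API: the class
name LocalResourceClasspathHelper, the Kind enum and the rule that a FILE entry goes on the
classpath as-is are all assumptions. The only thing taken from this thread is the prefix
format for archives ($PWD/<key>/* and $PWD/<key>/lib/*, each followed by a colon). The sketch
just derives the classpath prefix from the keys of the registered resources; it does not
localize anything.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (an assumption for illustration, not a Tez class). */
final class LocalResourceClasspathHelper {

    /** Stand-in for the FILE/ARCHIVE distinction; not YARN's LocalResourceType. */
    enum Kind { FILE, ARCHIVE }

    // Insertion-ordered so that the generated prefix is deterministic.
    private final Map<String, Kind> resources = new LinkedHashMap<>();

    /** Registers a resource under the key it is localized as in the working directory. */
    LocalResourceClasspathHelper add(String key, Kind kind) {
        if (key == null || key.isEmpty()) {
            throw new IllegalArgumentException("key must be non-empty");
        }
        resources.put(key, kind);
        return this;
    }

    /** Returns the classpath entries implied by the registered resources. */
    List<String> classpathEntries() {
        List<String> entries = new ArrayList<>();
        for (Map.Entry<String, Kind> e : resources.entrySet()) {
            String base = "$PWD/" + e.getKey();
            if (e.getValue() == Kind.ARCHIVE) {
                // Same two entries per archive as in the prefix quoted in this thread.
                entries.add(base + "/*");
                entries.add(base + "/lib/*");
            } else {
                // Assumption: a plain FILE is added to the classpath directly.
                entries.add(base);
            }
        }
        return entries;
    }

    /** Joins the entries with ':' and keeps a trailing ':' so it can be used as a prefix. */
    String classpathPrefix() {
        StringBuilder sb = new StringBuilder();
        for (String entry : classpathEntries()) {
            sb.append(entry).append(':');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String prefix = new LocalResourceClasspathHelper()
            .add("foo", Kind.ARCHIVE)
            .add("bar", Kind.ARCHIVE)
            .classpathPrefix();
        System.out.println(prefix);
    }
}
```

Run as-is, this prints $PWD/foo/*:$PWD/foo/lib/*:$PWD/bar/*:$PWD/bar/lib/*: which is the
prefix from the foo/bar example quoted further down. A real helper would also have to create
the matching local resources and apply the prefix for both the AM and the tasks.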
  
Would you be willing to take a crack at one of the options above?

thanks
— Hitesh

On Jun 18, 2015, at 1:29 PM, Siddharth Seth <sseth@apache.org> wrote:

> Tasks can set up local resources and change the environment (specifically
> the classpath in this case). That's missing for AMs, where only
> LocalResources can be specified.
> An API to add a file to the classpath (including localization) that
> works for both the AM and tasks would be useful. There's a jira for this,
> but it hasn't been worked on yet.
> 
> On Thu, Jun 18, 2015 at 1:06 PM, Andre Kelpe <akelpe@concurrentinc.com>
> wrote:
> 
>> Hi,
>> 
>> so I have tried ARCHIVE and added it to
>> TEZ_CLUSTER_ADDITIONAL_CLASSPATH_PREFIX as you suggested. That seems to get
>> me further. The problem now is that the same jar should be used in the
>> containers for the DAGs, but that seems to work in a completely different
>> way.
>> 
>> We were using PATTERN for those before + a custom environment:
>> 
>> https://github.com/Cascading/cascading/blob/3.0/cascading-hadoop2-tez/src/main/java/cascading/flow/tez/util/TezUtil.java#L276-L311
>> This works, however I don't want to add the same jar twice, once as an
>> archive and once as a PATTERN.
>> 
>> I am a bit lost why there are two different ways of doing this for the
>> various JVMs at various stages.
>> 
>> - André
>> 
>> 
>> On Thu, Jun 18, 2015 at 9:57 AM, Hitesh Shah <hitesh@apache.org> wrote:
>> 
>>> Hi Andre
>>> 
>>> Are you using Local Resource type ARCHIVE? Using FILE may not help in
>>> your scenario.
>>> 
>>> If you are using ARCHIVE, you can then use the classpath config (
>>> TEZ_CLUSTER_ADDITIONAL_CLASSPATH_PREFIX ) to modify the classpath.
>>> 
>>> For example, assume foo.jar and bar.jar (in the structure that you
>>> called out) are added to the map of local resources using keys foo and
>>> bar:
>>>      - classpath prefix would be
>>> "$PWD/foo/*:$PWD/foo/lib/*:$PWD/bar/*:$PWD/bar/lib/*:"
>>> 
>>> As mentioned on the jira, the launch_container.sh from your cluster would
>>> help. Also, if you upload an example jar to the jira, I can help provide
>>> a working example.
>>> 
>>> thanks
>>> — Hitesh
>>> 
>>> 
>>> On Jun 18, 2015, at 9:40 AM, Andre Kelpe <akelpe@concurrentinc.com>
>>> wrote:
>>> 
>>>> On Wed, Jun 17, 2015 at 4:58 PM, Bikas Saha <bikas@hortonworks.com>
>>>> wrote:
>>>> 
>>>>> If I understand this right, there is a jar with user code in it. The
>>>>> jar needs to be available during split creation but it is not available.
>>>>> 
>>>>> Is split creation happening on the client or on the AM? If it's
>>>>> happening on the AM, and the AM is not getting the jars, then how are you
>>>>> specifying the jars to be sent to the AM? There are different ways to do it.
>>>>> 
>>>> 
>>>> In our case the AM is doing the split calculation. We are sending the
>>>> jar over as LocalResources given in the TezClient#create method
>>>> 
>>>> 
>>>>> 1) Set tez.aux.uris in tez-site.xml to an HDFS location and copy
>>>>>    user jars there
>>>>> 
>>>>> 2) Upload the user jar to HDFS and create a YARN local resource for
>>>>>    it. Then use either of the following to add the local resource to the
>>>>>    AM/DAG that needs it.
>>>>> 
>>>>>    a. TezClient#addAppMasterLocalFiles(…)
>>>>> 
>>>>>    b. DAG#addTaskLocalFiles(…)
>>>>> 
>>>>> 
>>>>> 
>>>>> Not sure what is meant by classic Hadoop style jars?
>>>>> 
>>>> 
>>>> Hadoop-style jars are jar files that contain the user code plus all
>>>> required libs in a sub-directory within the jar. This is the layout that
>>>> RunJar has always understood.
>>>> 
>>>> The thing is that we can't find a way to put the jars in the lib folder
>>>> of the job-jar on the classpath of the AM.
>>>> 
>>>> - André
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Bikas
>>>>> 
>>>>> 
>>>>> 
>>>>> *From:* Chris K Wensel [mailto:chris@wensel.net]
>>>>> *Sent:* Wednesday, June 17, 2015 4:41 PM
>>>>> *To:* dev@tez.apache.org
>>>>> *Cc:* user@tez.apache.org
>>>>> *Subject:* Re: ClassNotFoundException with custom InputFormat.
>>>>> 
>>>>> 
>>>>> 
>>>>> cross posting down to dev… should continue the discussion there I
>>>>> believe.
>>>>> 
>>>>> 
>>>>> 
>>>>> as I understand it, all Cascading users familiar with packaging a
>>>>> Hadoop job jar with a lib folder, in which the packaged custom
>>>>> InputFormat is placed (pulled from maven etc.), will have this issue.
>>>>> 
>>>>> 
>>>>> 
>>>>> this also expands to projects on top of Cascading including Scalding
>>>>> and Cascalog.
>>>>> 
>>>>> 
>>>>> 
>>>>> oddly the org.apache.tez.client.AMConfiguration has a
>>>>> 
>>>>> 
>>>>> 
>>>>> private Map<String, String> env;
>>>>> 
>>>>> 
>>>>> 
>>>>> but is unused.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Jun 17, 2015, at 4:32 PM, Andre Kelpe <akelpe@concurrentinc.com>
>>>>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> we are currently running into a problem when a user of Cascading uses
>>>>> a custom InputFormat with Tez. The ApplicationMaster is running into a
>>>>> ClassNotFoundException when calculating the splits, since we are
>>>>> unable to control the environment/classpath visible to the
>>>>> ApplicationMaster. We have a work-around, where the users have to supply
>>>>> a fat-jar to make it work, but we need to be able to support other ways
>>>>> as well.
>>>>> 
>>>>> When interacting with the DAG, we are able to pass along a custom
>>>>> environment/classpath, but that API is missing on the TezClient,
>>>>> causing the AppMaster to fail when the user is using classic Hadoop-style
>>>>> jars (embedded lib directory).
>>>>> 
>>>>> In order to get Lingual, our SQL layer on top of Cascading, to work
>>>>> correctly, we need a way to supply the environment in a more dynamic
>>>>> way than one fat-jar, so it would be great if the API could be extended
>>>>> to do that.
>>>>> 
>>>>> I have opened https://issues.apache.org/jira/browse/TEZ-2563
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> 
>>>>> 
>>>>> - André
>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> André Kelpe
>>>>> andre@concurrentinc.com
>>>>> http://concurrentinc.com
>>>>> 
>>>>> 
>>>>> 
>>>>> —
>>>>> 
>>>>> Chris K Wensel
>>>>> 
>>>>> chris@wensel.net
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> André Kelpe
>>>> andre@concurrentinc.com
>>>> http://concurrentinc.com
>>> 
>>> 
>> 
>> 
>> --
>> André Kelpe
>> andre@concurrentinc.com
>> http://concurrentinc.com
>> 

