tez-dev mailing list archives

From Siddharth Seth <ss...@apache.org>
Subject Re: OOM with Hive on Tez
Date Fri, 07 Nov 2014 08:22:11 GMT
Kostas,
The additional information would definitely be useful - especially w.r.t
what the Processor's memory requirements are, and at what stage the OOM
occurred.

Ideally, we'd like DAG writers to configure the Input / Output memory
requirements based on data size estimates (if those are available), as well as
the amount of memory estimated to be used by the Processor itself. The
memory distributor is an attempt to rationalize requests that would
otherwise amount to a misconfiguration (e.g. two OrderedGroupedOutputs
asking for 400MB each on a JVM with Xmx set to 600MB).
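
As a rough illustration of what that rationalization might look like for the
example above, here is a minimal Python sketch. It assumes simple proportional
scaling of requests into the heap left after the Processor's reserve (the 30%
default mentioned in this thread, controlled by
tez.task.scale.memory.reserve-fraction). The function name scale_requests and
the request labels are hypothetical, and this is not Tez's actual distributor
code.

```python
def scale_requests(xmx_mb, requests_mb, reserve_fraction=0.3):
    """Scale memory requests so they fit in the heap left after the reserve.

    Assumption: requests are scaled down proportionally. The real Tez
    distributor may weight Inputs and Outputs differently.
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")

    # Heap that remains for Inputs / Outputs once the Processor's share is set aside.
    available_mb = xmx_mb * (1.0 - reserve_fraction)
    total_requested_mb = sum(requests_mb.values())

    # Requests that already fit are granted unchanged.
    if total_requested_mb <= available_mb:
        return dict(requests_mb)

    # Otherwise shrink every request by the same ratio.
    ratio = available_mb / total_requested_mb
    return {name: mb * ratio for name, mb in requests_mb.items()}


if __name__ == "__main__":
    # Two outputs asking for 400MB each on a 600MB heap (labels are hypothetical).
    requests = {"output_a": 400, "output_b": 400}
    for fraction in (0.3, 0.8):
        grants = scale_requests(600, requests, reserve_fraction=fraction)
        print(fraction, {k: round(v) for k, v in grants.items()})
```

Under this model, the default reserve of 0.3 would scale the two 400MB
requests down to about 210MB each, and a reserve of 0.8 would leave about 60MB
each, with the remainder of the heap kept for the Processor.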

On Thu, Nov 6, 2014 at 11:46 PM, Kostas Tzoumas <ktzoumas@apache.org> wrote:

> Hey folks, increasing tez.task.scale.memory.reserve-fraction to 0.8 worked
> for small jobs.
>
> I will come back with a more detailed breakdown to make sure I'm doing
> things properly.
>
> Thanks for the quick responses!
>
> Kostas
>
> On Thu, Nov 6, 2014 at 10:15 PM, Siddharth Seth <sseth@apache.org> wrote:
>
> > - hive-dev, +tez-dev
> >
> > Do you know at what stage of the processing the OOM occurs? What other
> > processing has happened so far? Ideally, if this was just part of the
> > Inputs being initialized, you should not have seen an OOM.
> > In all likelihood, the Processor started using some memory (which by
> > default is counted as 30% of the JVM heap). You could try modifying this
> > setting (tez.task.scale.memory.reserve-fraction could be set higher than
> > 0.3 (30%) for starters).
> >
> > The logs will definitely help in figuring out what is happening. A heap
> > dump would be even better.
> >
> > On Thu, Nov 6, 2014 at 1:01 PM, Gopal V <gopalv@apache.org> wrote:
> >
> > > On 11/6/14, 11:09 AM, Kostas Tzoumas wrote:
> > >
> > >> I am running into the same error [1] with plain Tez (not Hive):
> > >>
> > >> Any advice on what configuration parameters I should start looking at?
> > >>
> > >
> > > Both issues are related to the Tez memory distributor
> > > (InitialMemoryAllocator) impl used.
> > >
> > > http://tez.apache.org/releases/0.5.1/tez-runtime-library-javadocs/org/apache/tez/runtime/library/resources/WeightedScalingMemoryDistributor.html
> > >
> > > This divides memory up between different inputs and outputs, so that
> > > the overall memory usage is capped without hitting GC issues.
> > >
> > > Suma's issue was probably that tez-0.4 (ergo, hive-13) didn't have a
> > > memory distributor implementation.
> > >
> > >
> > > http://people.apache.org/~gopalv/tpch-plans/q8_national_market_share.svg
> > >
> > > This means that Reducer_4 in that plan can divvy up memory between
> > > these buffers.
> > >
> > > OrderedGroupedKVInputConfig::setShuffleBufferFraction() allows this
> > > particular tuning per input edge.
> > >
> > > For a shuffle JOIN, you can tune the left and right hand side of the
> > > buffers, as well as make reservations for the actual map-join in
> > > memory, so that the plan's cost information can help the memory
> > > scheduling Tez has.
> > >
> > > Cheers,
> > > Gopal
> > >
> > >
> > >
> > > >> [1] java.lang.OutOfMemoryError: Java heap space
> > > >> at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
> > > >> at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
> > > >> at org.apache.tez.runtime.library.common.shuffle.MemoryFetchedInput.<init>(MemoryFetchedInput.java:38)
> > > >> at org.apache.tez.runtime.library.common.shuffle.impl.SimpleFetchedInputAllocator.allocate(SimpleFetchedInputAllocator.java:139)
> > > >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:713)
> > > >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:485)
> > > >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:394)
> > > >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.call(Fetcher.java:189)
> > > >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.call(Fetcher.java:71)
> > > >> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > > >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > > >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > >> at java.lang.Thread.run(Thread.java:745)
> > >>
> > >> On Tue, Aug 26, 2014 at 4:26 PM, Suma Shivaprasad <
> > >> sumasai.shivaprasad@gmail.com> wrote:
> > >>
> > > >>> I am using Tez 0.4.0, and the counters for the query run are as below:
> > >>>
> > >>> 2014-08-26 14:06:41,203 INFO  [Thread-13]: exec.Task
> > > >>> (TezTask.java:execute(171)) - org.apache.tez.common.counters.DAGCounter:
> > >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    NUM_FAILED_TASKS: 67
> > >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    NUM_KILLED_TASKS: 312
> > >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    TOTAL_LAUNCHED_TASKS: 259
> > >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    DATA_LOCAL_TASKS: 59
> > >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    RACK_LOCAL_TASKS: 27
> > >>> 2014-08-26 14:06:41,207 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(171)) - File System Counters:
> > >>> 2014-08-26 14:06:41,208 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    FILE: BYTES_READ: 0
> > >>> 2014-08-26 14:06:41,208 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    FILE: BYTES_WRITTEN: 3201156949
> > >>> 2014-08-26 14:06:41,208 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    FILE: READ_OPS: 0
> > >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    FILE: LARGE_READ_OPS: 0
> > >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    FILE: WRITE_OPS: 0
> > >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    HDFS: BYTES_READ: 30052072845
> > >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    HDFS: BYTES_WRITTEN: 0
> > >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    HDFS: READ_OPS: 768
> > >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    HDFS: LARGE_READ_OPS: 0
> > >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    HDFS: WRITE_OPS: 0
> > >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> > > >>> (TezTask.java:execute(171)) - org.apache.tez.common.counters.TaskCounter:
> > >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    GC_TIME_MILLIS: 148639
> > >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    CPU_MILLISECONDS: 1420020
> > >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    PHYSICAL_MEMORY_BYTES: 304725393408
> > >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    VIRTUAL_MEMORY_BYTES: 440084279296
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    COMMITTED_HEAP_BYTES: 337806557184
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    INPUT_RECORDS_PROCESSED: 722420718
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    OUTPUT_RECORDS: 144488481
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    OUTPUT_BYTES: 6876509984
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > > >>> (TezTask.java:execute(173)) -    OUTPUT_BYTES_WITH_OVERHEAD: 7165487118
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    OUTPUT_BYTES_PHYSICAL: 3201154197
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > > >>> (TezTask.java:execute(171)) - org.apache.hadoop.hive.ql.exec.FilterOperator$Counter:
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    FILTERED: 863123081
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    PASSED: 215782564
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > > >>> (TezTask.java:execute(171)) - org.apache.hadoop.hive.ql.exec.MapOperator$Counter:
> > >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> > >>> (TezTask.java:execute(173)) -    DESERIALIZE_ERRORS: 0
> > >>>
> > >>> Thanks
> > >>> Suma
> > >>>
> > >>>
> > >>> On Tue, Aug 26, 2014 at 7:47 PM, Suma Shivaprasad <
> > >>> sumasai.shivaprasad@gmail.com> wrote:
> > >>>
> > >>> > Trying to run a query on Tez with the following configurations
> > >>> >
> > >>> >
> > > >>> > set hive.tez.container.size=5120
> > > >>> > set mapreduce.map.child.java.opts=-Xmx5120M
> > > >>> > set hive.tez.java.opts=-Xmx4096M
> > > >>> > set hive.auto.convert.join.noconditionaltask.size=805306000
> > > >>> > set tez.am.resource.memory.mb=5120
> > > >>> > set tez.am.java.opts=-Xmx4096M
> > >>> >
> > > >>> > The above config settings were set after running
> > > >>> > https://github.com/hortonworks/hdp-configuration-utils/blob/master/2.1/hdp-configuration-utils.py
> > > >>> > to get the right memory configs.
> > >>> >
> > >>> > Tried with both
> > >>> >
> > >>> > set tez.runtime.io.sort.mb=512
> > >>> > set mapreduce.task.io.sort.mb=512
> > >>> >
> > >>> > and
> > >>> >
> > >>> > set tez.runtime.io.sort.mb=2048
> > >>> > set mapreduce.task.io.sort.mb=2048
> > >>> >
> > >>> >
> > > >>> > The query I am trying to run is
> > >>> >
> > > >>> > select sum(tab1.m1),sum(tab1.m2)
> > > >>> >  from tab1 join tab2 dm on tab1.col1=tab2.col1
> > > >>> >  where tab1.dt = '2014-06-01'
> > > >>> >  and tab2.col2 = '..'
> > > >>> >  and tab2.col3 IN ('..')
> > > >>> >  group by TAB1.col1
> > > >>> >
> > > >>> > where TAB1.col1 has high cardinality (around 700-800 million).
> > >>> >
> > > >>> > And it's going OOM during the shuffle phase.
> > >>> >
> > > >>> > errorMessage=Fetch failed
> > > >>> > Container released by application,
> > > >>> > AttemptID:attempt_1407396011310_1577_1_01_000000_4 Info:Error:
> > > >>> > exceptionThrown=java.lang.OutOfMemoryError: Java heap space
> > > >>> > at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
> > > >>> > at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
> > > >>> > at org.apache.tez.runtime.library.shuffle.common.MemoryFetchedInput.<init>(MemoryFetchedInput.java:38)
> > > >>> > at org.apache.tez.runtime.library.shuffle.common.impl.SimpleFetchedInputAllocator.allocate(SimpleFetchedInputAllocator.java:137)
> > > >>> > at org.apache.tez.runtime.library.shuffle.common.Fetcher.fetchInputs(Fetcher.java:252)
> > > >>> > at org.apache.tez.runtime.library.shuffle.common.Fetcher.call(Fetcher.java:184)
> > > >>> > at org.apache.tez.runtime.library.shuffle.common.Fetcher.call(Fetcher.java:59)
> > > >>> > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >>> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >>> > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > > >>> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > > >>> > at java.lang.Thread.run(Thread.java:662)
> > >>> >
> > >>> >
> > > >>> > Please advise whether the configurations look OK. Do I need to
> > > >>> > change anything?
> > >>> >
> > >>> >
> > >>> >
> > >>> > Thanks
> > >>> > Suma
> > >>> >
> > >>> >
> > >>> >
> > >>>
> > >>>
> > >>
> > >>
> > >
> >
>
