giraph-dev mailing list archives

From "Vitaly Tsvetkov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-1026) New Out-of-core mechanism does not work
Date Tue, 25 Aug 2015 06:17:45 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710674#comment-14710674 ]

Vitaly Tsvetkov commented on GIRAPH-1026:
-----------------------------------------

We have found one more OOC bug.
We implemented an LPA (label propagation) algorithm for community detection (see the code in the attachment).
{noformat}
hadoop jar graph-with-dependencies.jar \
 org.apache.giraph.GiraphRunner \
 -Dgiraph.yarn.task.heap.mb=58880 \
 -Dgiraph.useOutOfCoreGraph=true \
 -Dgiraph.enableFlowControlInput=true \
 -Dgiraph.userPartitionCount=45 \
 -Dgiraph.useBigDataIOForMessages=true \
 -Dgiraph.waitForPerWorkerRequests=true \
 -Dgiraph.maxNumberOfUnsentRequests=1000 \
 -Dgiraph.lowFreeMemoryFraction=0.32 \
 -Dgiraph.midFreeMemoryFraction=0.4 \
 -Dgiraph.fairFreeMemoryFraction=0.425 \
 -Dgiraph.highFreeMemoryFraction=0.45 \
 ru.isys.LPAComputation \
 -vif ru.isys.CrawlerTextMessagesInputFormat -vip /tmp/bigdata \
 -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /tmp/giraph \
 -w 3 \
 -yj graph-with-dependencies.jar
{noformat}
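For completeness, the same options can also be set in code instead of -D flags. This is only a minimal sketch using the plain Hadoop Configuration setters that GiraphConfiguration inherits; the option names are copied from the command line above:
{noformat}
import org.apache.giraph.conf.GiraphConfiguration;

public class OocOptions {
  // Illustrative only: apply the same out-of-core tuning as the -D flags above.
  public static void apply(GiraphConfiguration conf) {
    conf.setBoolean("giraph.useOutOfCoreGraph", true);
    conf.setBoolean("giraph.enableFlowControlInput", true);
    conf.setBoolean("giraph.useBigDataIOForMessages", true);
    conf.setBoolean("giraph.waitForPerWorkerRequests", true);
    conf.setInt("giraph.userPartitionCount", 45);
    conf.setInt("giraph.maxNumberOfUnsentRequests", 1000);
    conf.setFloat("giraph.lowFreeMemoryFraction", 0.32f);
    conf.setFloat("giraph.midFreeMemoryFraction", 0.4f);
    conf.setFloat("giraph.fairFreeMemoryFraction", 0.425f);
    conf.setFloat("giraph.highFreeMemoryFraction", 0.45f);
  }
}
{noformat}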
During superstep 1 on a moderate graph (about 10-20M vertices at input) we got the following NPE:
{noformat}
INFO graph.GraphTaskManager: execute: 15 partitions to process with 1 compute thread(s), originally
1 thread(s) on superstep 1
INFO ooc.DiskBackedPartitionStore: startIteration: with 15 partitions in memory, there can
be maximum 15 partitions in memory out of 15 that belongs to this worker.
INFO ooc.DiskBackedPartitionStore: startIteration: done preparing the iteration
...
INFO graph.ComputeCallable: call: Completed 13 partitions, 15 remaining Memory (free/total/max)
= 17364.40M / 53541.00M / 53541.00M
INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 37 open
requests, 1000 unsent requests, MBytes/sec received = 0.0007, MBytesReceived = 0.1843, ave
received req MBytes = 0, secs waited = 280.045
MBytes/sec sent = 26.6224, MBytesSent = 7455.4833, ave sent req MBytes = 0.5002, secs waited
= 280.045
INFO netty.NettyClient: logInfoAboutOpenRequests: 20 requests for taskId=1, 17 requests for
taskId=2, 0 requests for taskId=-1, 
INFO ooc.CheckMemoryCallable: call: Memory is very limited now. Calling GC manually. freeMemory
= 16725.01MB
INFO ooc.CheckMemoryCallable: call: GC is done. GC time = 16.15 sec, and freeMemory = 21875.48MB
INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 20 open
requests, 1000 unsent requests, MBytes/sec received = 0.0006, MBytesReceived = 0.1846, ave
received req MBytes = 0, secs waited = 298.181
MBytes/sec sent = 25.0031, MBytesSent = 7455.4833, ave sent req MBytes = 0.5002, secs waited
= 298.181
INFO netty.NettyClient: logInfoAboutOpenRequests: 20 requests for taskId=1, 0 requests for
taskId=-1, 0 requests for taskId=2, 
INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 16.8038,
MBytesReceived = 523.2381, ave received req MBytes = 0.0234, secs waited = 31.138
INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@403f0a22
INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@503ecb24
INFO ooc.CheckMemoryCallable: call: Memory is very limited now. Calling GC manually. freeMemory
= 16866.85MB
INFO ooc.CheckMemoryCallable: call: GC is done. GC time = 17.28 sec, and freeMemory = 21043.31MB
INFO ooc.CheckMemoryCallable: call: signal out-of-core processor threads to start offloading.
These threads will spill vertex buffers of 0 partitions, edge buffers of 0 partitions, pending
message buffers of 0 partitions, and 1 whole partitions
INFO ooc.CheckMemoryCallable: call: waiting on OOC processors to finish offloading data to
disk
INFO ooc.OutOfCoreProcessorCallable: call: start offloading a partition
INFO ooc.DiskBackedPartitionStore: swapOnePartitionToDisk: decided to swap partition 0 to
disk
INFO ooc.DiskBackedPartitionStore: swapOnePartitionToDisk: start swapping partition 0 to disk.
INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 12.2145,
MBytesReceived = 394.357, ave received req MBytes = 0.0214, secs waited = 32.285
INFO ooc.AdaptiveOutOfCoreEngine: tetherClient: sending back-off signal to worker 2
INFO ooc.AdaptiveOutOfCoreEngine: tetherClient: sending back-off signal to worker 1
INFO graph.ComputeCallable: call: Completed 14 partitions, 15 remaining Memory (free/total/max)
= 20123.64M / 53855.00M / 53855.00M
ERROR utils.LogStacktraceCallable: Execution of callable failed
java.lang.NullPointerException
	at org.apache.giraph.comm.messages.SimpleMessageStore.hasMessagesForPartition(SimpleMessageStore.java:220)
	at org.apache.giraph.ooc.DiskBackedPartitionStore.offloadPartition(DiskBackedPartitionStore.java:1813)
	at org.apache.giraph.ooc.DiskBackedPartitionStore.swapOnePartitionToDisk(DiskBackedPartitionStore.java:500)
	at org.apache.giraph.ooc.DiskBackedPartitionStore.spillOnePartition(DiskBackedPartitionStore.java:537)
	at org.apache.giraph.ooc.OutOfCoreProcessorCallable.call(OutOfCoreProcessorCallable.java:144)
	at org.apache.giraph.ooc.OutOfCoreProcessorCallable.call(OutOfCoreProcessorCallable.java:43)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
INFO graph.ComputeCallable: call: Computation took 345.63174 secs for 15 partitions on superstep
1.  Flushing started
INFO worker.BspServiceWorker: finishSuperstep: Waiting on all requests, superstep 1 Memory
(free/total/max) = 14995.59M / 53855.00M / 53855.00M
INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 40 open
requests, 999 unsent requests, MBytes/sec received = 0.0006, MBytesReceived = 0.2137, ave
received req MBytes = 0, secs waited = 344.92
MBytes/sec sent = 25.0503, MBytesSent = 8640.3787, ave sent req MBytes = 0.5001, secs waited
= 344.92
INFO netty.NettyClient: logInfoAboutOpenRequests: 20 requests for taskId=1, 20 requests for
taskId=2, 0 requests for taskId=-1, 
INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 40 open
requests, 946 unsent requests, MBytes/sec received = 0.0006, MBytesReceived = 0.2144, ave
received req MBytes = 0, secs waited = 360.787
MBytes/sec sent = 24.0235, MBytesSent = 8667.3877, ave sent req MBytes = 0.5001, secs waited
= 360.787
INFO netty.NettyClient: logInfoAboutOpenRequests: 20 requests for taskId=1, 20 requests for
taskId=2, 0 requests for taskId=-1, 
INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 14.2219,
MBytesReceived = 482.4782, ave received req MBytes = 0.0234, secs waited = 33.924
INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 20 open
requests, 51 unsent requests, MBytes/sec received = 0.0006, MBytesReceived = 0.2257, ave received
req MBytes = 0, secs waited = 375.798
MBytes/sec sent = 24.2513, MBytesSent = 9113.6003, ave sent req MBytes = 0.5001, secs waited
= 375.798
INFO netty.NettyClient: logInfoAboutOpenRequests: 20 requests for taskId=1, 0 requests for
taskId=-1, 0 requests for taskId=2, 
INFO netty.NettyClient: waitAllRequests: Finished all requests. MBytes/sec received = 0.0006,
MBytesReceived = 0.2266, ave received req MBytes = 0, secs waited = 377.231
MBytes/sec sent = 24.2276, MBytesSent = 9139.4425, ave sent req MBytes = 0.5001, secs waited
= 377.232
INFO worker.WorkerAggregatorHandler: finishSuperstep: Start gathering aggregators, workers
will send their aggregated values once they are done with superstep computation
INFO utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 2 more tasks to send
their aggregator data, task ids: [1, 2]
INFO utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 2 more tasks to send
their aggregator data, task ids: [1, 2]
INFO utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 2 more tasks to send
their aggregator data, task ids: [1, 2]
{noformat}
and the job freezes in this state forever.
Note that:
1) we use text vertex ids;
2) we follow the recommendations for reaching the halt state eventually (compound vertex values stored as <prevLabel \t currentLabel>) and a deterministic strategy to "break the tie" (String#compareTo in _getMaxFreqLabel()_);
3) after the input superstep the graph is still not complete, because some vertices are created "on the fly" (they are not present among the input vertices, only as edge endpoints). At superstep 0 we send empty messages to add the missing vertices, at superstep 1 the values are initialized, and from superstep 2 on the label propagation itself runs (a rough sketch of this structure is shown below). This could possibly affect the behavior.
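To make the superstep structure in 3) concrete, here is a rough, simplified sketch of how the compute() is organized; the real code is in the attached LPAComputation, so everything except the getMaxFreqLabel name and the compareTo tie-breaking is illustrative:
{noformat}
import java.util.HashMap;
import java.util.Map;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

/** Rough sketch of the superstep structure described above (illustrative only). */
public class LpaSketch extends BasicComputation<Text, Text, NullWritable, Text> {
  @Override
  public void compute(Vertex<Text, Text, NullWritable> vertex, Iterable<Text> messages) {
    if (getSuperstep() == 0) {
      // Superstep 0: empty messages so that vertices that only appear as edge
      // endpoints get created before the algorithm starts.
      sendMessageToAllEdges(vertex, new Text(""));
    } else if (getSuperstep() == 1) {
      // Superstep 1: initialize the compound value "<prevLabel \t currentLabel>".
      String id = vertex.getId().toString();
      vertex.setValue(new Text(id + "\t" + id));
      sendMessageToAllEdges(vertex, new Text(id));
    } else {
      // Superstep 2+: label propagation; halt once the label stops changing.
      String current = vertex.getValue().toString().split("\t")[1];
      String next = getMaxFreqLabel(messages, current);
      vertex.setValue(new Text(current + "\t" + next));
      if (next.equals(current)) {
        vertex.voteToHalt();
      } else {
        sendMessageToAllEdges(vertex, new Text(next));
      }
    }
  }

  /** Most frequent label among the messages, ties broken with String#compareTo. */
  private String getMaxFreqLabel(Iterable<Text> messages, String fallback) {
    Map<String, Integer> freq = new HashMap<String, Integer>();
    for (Text m : messages) {
      if (m.getLength() == 0) {
        continue;
      }
      Integer c = freq.get(m.toString());
      freq.put(m.toString(), c == null ? 1 : c + 1);
    }
    String best = fallback;
    int bestCount = 0;
    for (Map.Entry<String, Integer> e : freq.entrySet()) {
      if (e.getValue() > bestCount
          || (e.getValue() == bestCount && e.getKey().compareTo(best) < 0)) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }
}
{noformat}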

What could be the reason for this problem?
Many thanks in advance for any help!

> New Out-of-core mechanism does not work
> ---------------------------------------
>
>                 Key: GIRAPH-1026
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-1026
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 1.2.0-SNAPSHOT
>            Reporter: Max Garmash
>            Assignee: Hassan Eslami
>         Attachments: LPAComputation.txt
>
>
> After the new OOC mechanism was released, we tried to test it on our data and it failed.
> Our environment:
> 4x (CPU: 6 cores / 12 threads, RAM: 64 GB)
> We can successfully process about 75 million vertices.
> With 100-120M vertices it fails like this:
> {noformat}
> 2015-08-04 12:35:21,000 INFO  [AMRM Callback Handler Thread] yarn.GiraphApplicationMaster
(GiraphApplicationMaster.java:onContainersCompleted(574)) - Got container status for containerID=container_1438068521412_0193_01_000005,
state=COMPLETE, exitStatus=-104, diagnostics=Container [pid=6700,containerID=container_1438068521412_0193_01_000005]
is running beyond physical memory limits. Current usage: 20.3 GB of 20 GB physical memory
used; 22.4 GB of 42 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1438068521412_0193_01_000005 :
> 	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES)
RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> 	|- 6704 6700 6700 6700 (java) 78760 20733 24033841152 5317812 java -Xmx20480M -Xms20480M
-cp .:${CLASSPATH}:./*:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH:./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*::./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*::./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*:
org.apache.giraph.yarn.GiraphYarnTask 1438068521412 193 5 1 
> 	|- 6700 6698 6700 6700 (bash) 0 0 14376960 433 /bin/bash -c java -Xmx20480M -Xms20480M
-cp .:${CLASSPATH}:./*:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH:./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*:
org.apache.giraph.yarn.GiraphYarnTask 1438068521412 193 5 1 1>/var/log/hadoop-yarn/container/application_1438068521412_0193/container_1438068521412_0193_01_000005/task-5-stdout.log
2>/var/log/hadoop-yarn/container/application_1438068521412_0193/container_1438068521412_0193_01_000005/task-5-stderr.log
 
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {noformat}
> Logs from container
> {noformat}
> 2015-08-04 12:34:51,258 INFO  [netty-server-worker-4] handler.RequestDecoder (RequestDecoder.java:channelRead(74))
- decode: Server window metrics MBytes/sec received = 12.5315, MBytesReceived = 380.217, ave
received req MBytes = 0.007, secs waited = 30.34
> 2015-08-04 12:35:16,258 INFO  [check-memory] ooc.CheckMemoryCallable (CheckMemoryCallable.java:call(221))
- call: Memory is very limited now. Calling GC manually. freeMemory = 924.27MB
> {noformat}
> We are running our job like this:
> {noformat}
> hadoop jar giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.6.0-cdh5.4.4-jar-with-dependencies.jar \
>  org.apache.giraph.GiraphRunner \
>  -Dgiraph.yarn.task.heap.mb=20480 \
>  -Dgiraph.isStaticGraph=true \
>  -Dgiraph.useOutOfCoreGraph=true \
>  -Dgiraph.logLevel=info \
>  -Dgiraph.weightedPageRank.superstepCount=5 \
>  ru.isys.WeightedPageRankComputation \
>  -vif ru.isys.CrawlerInputFormat \
>  -vip /tmp/bigdata/input \
>  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>  -op /tmp/giraph \
>  -w 6 \
>  -yj giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.6.0-cdh5.4.4-jar-with-dependencies.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
