spark-user mailing list archives

From Andrew Ash <>
Subject Re: Log hdfs blocks sending
Date Sat, 27 Sep 2014 09:03:49 GMT
Hi Alexey,

You're looking in the right place with the first log, the one from the
driver.  Specifically, the locality level is logged by TaskSetManager at the
INFO log level and looks like this:

14/09/26 16:57:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0
(TID 10,, ANY, 1341 bytes)

The ANY there means you're not getting locality.  The big red flag for me is
that you have an IP address for the host in that line as well.  Do you have
Spark configured to use hostnames instead of IP addresses?  You need to
check the Spark master web UI and the Hadoop NameNode UI to make sure that
hosts appear exactly the same in both.  Most likely, you want both to show
the FQDN of each host.
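
To get an overview rather than reading line by line, something like the
following could tally the locality levels in a driver log and flag hosts
that show up as raw IP addresses.  This is only a rough sketch: the regex
assumes the "Starting task ... (TID n, host, LOCALITY, n bytes)" layout of
the line quoted above, which may differ in other Spark versions, and the
function name, hostname, and IP address in the sample are made up for
illustration.

```python
import re
from collections import Counter

# Assumed layout, based on the driver log line quoted in this thread:
#   INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 10, <host>, ANY, 1341 bytes)
TASK_START = re.compile(
    r"TaskSetManager: Starting task \S+ in stage \S+ "
    r"\(TID \d+,\s*(?P<host>[^,]*),\s*(?P<locality>[A-Z_]+),\s*\d+ bytes\)"
)
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def summarize_locality(lines):
    """Count tasks per locality level and collect hosts logged as IPv4 addresses."""
    levels = Counter()
    ip_hosts = set()
    for line in lines:
        match = TASK_START.search(line)
        if match is None:
            continue  # not a task-start line
        levels[match.group("locality")] += 1
        host = match.group("host").strip()
        if IPV4.match(host):
            ip_hosts.add(host)
    return levels, ip_hosts


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, pass an open driver log file instead.
    sample = [
        "14/09/26 16:57:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0 "
        "(TID 10, 192.0.2.10, ANY, 1341 bytes)",
        "14/09/26 16:57:31 INFO TaskSetManager: Starting task 3.0 in stage 1.0 "
        "(TID 4, node1.example.com, NODE_LOCAL, 1341 bytes)",
    ]
    levels, ip_hosts = summarize_locality(sample)
    print(dict(levels))
    print(sorted(ip_hosts))
```

If nearly every task is counted under ANY, or the set of IP-address hosts
is non-empty, that points at the hostname mismatch described above.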


On Fri, Sep 26, 2014 at 3:14 AM, Alexey Romanchuk <> wrote:

> Hello Andrew!
> Thanks for the reply. Which logs should I check, and at what level? Driver,
> master or worker?
> I found this on the master node, but there is only the ANY locality level.
> Here is the driver (spark sql) log -
> and the log from one of the workers
> -
> Do you have any idea where to look?
> Thanks!
> On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash <> wrote:
>> Hi Alexey,
>> You should see in the logs a locality measure like NODE_LOCAL,
>> PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS data node
>> on them and you're reading out of HDFS, then you should be seeing almost
>> all NODE_LOCAL accesses.  One cause I've seen for mismatches is if Spark
>> uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
>> think the data is local and does remote reads which really kills
>> performance.
>> Hope that helps!
>> Andrew
>> On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <> wrote:
>>> Hello again spark users and developers!
>>> I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
>>> cluster consists of 4 datanodes and the replication factor of the files is 3.
>>> I use the thrift server to access Spark SQL and have 1 table with 30+
>>> partitions. When I run a query on the whole table (something simple like select
>>> count(*) from t), Spark produces a lot of network activity, filling all of the
>>> available 1gb link. It looks like Spark sends data over the network instead of
>>> reading it locally.
>>> Is there any way to log which blocks were accessed locally and which were
>>> not?
>>> Thanks!
