spark-user mailing list archives

From Alexey Romanchuk <>
Subject Re: Log hdfs blocks sending
Date Fri, 26 Sep 2014 10:14:19 GMT
Hello Andrew!

Thanks for the reply. Which logs should I check, and at what level? Driver,
master, or worker?

I found this on the master node, but it shows only the ANY locality level.
Here is the driver (Spark SQL) log - and one of the workers'
logs -

Do you have any idea where to look?
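For anyone searching the archives later: one rough way to see the local/remote split is to count the locality levels in the driver log's task-launch lines. A sketch (the log path and the sample lines below are made up for illustration; Spark 1.x's TaskSetManager prints the locality level in each "Starting task" line):

```shell
# Hypothetical driver log with the locality level after the hostname:
cat > /tmp/driver.log <<'EOF'
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, worker1, NODE_LOCAL, 1234 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, worker2, ANY, 1234 bytes)
EOF

# Count task launches per locality level; a large ANY count means
# many reads were not node-local:
grep -o 'PROCESS_LOCAL\|NODE_LOCAL\|RACK_LOCAL\|ANY' /tmp/driver.log | sort | uniq -c
```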


On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash <> wrote:

> Hi Alexey,
> You should see in the logs a locality measure like NODE_LOCAL,
> PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS data node
> on them and you're reading out of HDFS, then you should be seeing almost
> all NODE_LOCAL accesses.  One cause I've seen for mismatches is if Spark
> uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
> think the data is local and does remote reads which really kills
> performance.
> Hope that helps!
> Andrew
> On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <
>> wrote:
>> Hello again spark users and developers!
>> I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
>> cluster consists of 4 datanodes, and the replication factor of the files is 3.
>> I use the thrift server to access Spark SQL and have 1 table with 30+
>> partitions. When I run a query over the whole table (something simple like
>> select count(*) from t), Spark produces a lot of network activity, saturating
>> the available 1 Gb link. It looks like Spark sends data over the network
>> instead of reading it locally.
>> Is there any way to log which blocks were accessed locally and which were not?
>> Thanks!
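Following up on Andrew's hostname point: a quick way to check for the short-name vs FQDN mismatch is to compare what each worker calls itself with what HDFS reports for its datanodes. A sketch (run on a worker node; the example hostnames in the comments are illustrative):

```shell
# If `hostname` prints a short name but HDFS reports FQDNs (or vice
# versa), Spark will not recognize the data as local and will read
# remotely even when a datanode is on the same machine.
hostname        # e.g. worker1
hostname -f     # e.g. worker1.example.com

# Datanode hostnames as HDFS sees them; guarded so this is a no-op
# on machines without the hdfs CLI on the PATH:
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfsadmin -report | grep -i 'hostname'
fi
```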
