spark-user mailing list archives

From Jinfeng Li <liji...@gmail.com>
Subject Re: Loading Files from HDFS Incurs Network Communication
Date Mon, 26 Oct 2015 12:08:31 GMT
Hi, yes, it should be the same issue, but the solution doesn't apply in our
situation. Anyway, thanks a lot for your replies.

On Mon, Oct 26, 2015 at 7:44 PM Sean Owen <sowen@cloudera.com> wrote:

> Hm, now I wonder if it's the same issue here:
> https://issues.apache.org/jira/browse/SPARK-10149
>
> Does the setting described there help?
>
> On Mon, Oct 26, 2015 at 11:39 AM, Jinfeng Li <lijinf8@gmail.com> wrote:
>
>> Hi, I have already tried the same code with Spark 1.3.1, and there is no
>> such problem. The configuration files are all copied directly from Spark
>> 1.5.1. I think it is a bug in Spark 1.5.1.
>>
>> Thanks a lot for your response.
>>
>> On Mon, Oct 26, 2015 at 7:21 PM Sean Owen <sowen@cloudera.com> wrote:
>>
>>> Yeah, are these stats actually reflecting data read locally, like
>>> through the loopback interface? I'm also no expert on the internals here
>>> but this may be measuring effectively local reads. Or are you sure it's not?
>>>
>>> On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran <stevel@hortonworks.com> wrote:
>>>
>>>>
>>>> > On 26 Oct 2015, at 09:28, Jinfeng Li <lijinf8@gmail.com> wrote:
>>>> >
>>>> > Replication factor is 3 and we have 18 data nodes. We checked the HDFS
>>>> > web UI, and the data is evenly distributed among the 18 machines.
>>>> >
>>>>
>>>>
>>>> With a replication factor of 3, every block in HDFS (usually 64, 128, or
>>>> 256 MB) is stored on three machines, meaning 3 machines have it local and
>>>> 15 have it remote.
>>>>
>>>> For data locality to work properly, you need the executors to be reading
>>>> the blocks of data local to them, not data from other parts of the files.
>>>> Spark does try to schedule for locality, but if there's only a limited set
>>>> of executors, then more of the workload is remote rather than local.
>>>>
>>>> I don't know of an obvious way to get metrics on local vs remote reads
>>>> here; I don't see the HDFS client library tracking that, though it should
>>>> be the place to collect stats on local/remote/domain-socket-direct IO. Does
>>>> anyone know of anything in the Spark metrics that tracks placement
>>>> locality? If not, both layers could have some more metrics added.
>>>
>>>
>>>
>
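
A rough illustration of the locality point quoted in this thread (replication
factor 3, 18 data nodes): the sketch below estimates what fraction of blocks
can have a local replica at all when executors run on only some of the hosts.
It assumes each block's replicas sit on distinct nodes chosen uniformly at
random, which is a simplification (real HDFS placement is rack-aware), and the
function name fraction_local is made up for this example. It gives an upper
bound on achievable locality, not a measurement of what Spark actually does.

```python
from math import comb


def fraction_local(num_datanodes: int, replication: int, executor_hosts: int) -> float:
    """Fraction of blocks with at least one replica on an executor host.

    Assumption (not from the thread): replicas of a block are placed on
    distinct data nodes chosen uniformly at random, and every executor
    host is also a data node.
    """
    if not 1 <= replication <= num_datanodes:
        raise ValueError("replication must be between 1 and num_datanodes")
    if not 0 <= executor_hosts <= num_datanodes:
        raise ValueError("executor_hosts must be between 0 and num_datanodes")

    # Probability that all replicas land on hosts WITHOUT an executor.
    # comb() returns 0 when fewer than `replication` such hosts exist.
    no_local = comb(num_datanodes - executor_hosts, replication) / comb(
        num_datanodes, replication
    )
    return 1.0 - no_local


if __name__ == "__main__":
    # Cluster shape taken from the thread: 18 data nodes, replication factor 3.
    for hosts in (1, 3, 6, 18):
        print(f"{hosts:2d} executor hosts: {fraction_local(18, 3, hosts):.3f}")
```

With a single executor host the estimate is 3/18, matching the "3 machines
have it local, 15 have it remote" reasoning; with executors on all 18 hosts
every block has a local replica somewhere, and the rest is up to the scheduler.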
