hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13065) add per-operation stats to FileSystem.Statistics
Date Wed, 27 Apr 2016 22:48:13 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261118#comment-15261118 ]

Colin Patrick McCabe commented on HADOOP-13065:

Thanks, [~liuml07].

Based on the discussion today, it sounds like we would like to have both global statistics
per FS class and per-instance statistics for an individual FS or FC instance.  The rationale
is that in some cases we might want to differentiate between, say, the stats for one S3
bucket and the stats for another.  Similarly, we might want to separate the stats for one
HDFS FS from those for another (if we are using federation, or just multiple HDFS instances).
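
To make the two levels concrete, here is a minimal sketch of how one counter update could
feed both a per-instance view and a global view shared by every instance of the same kind.
All names here (OpCounters, InstanceStats, the scheme string) are hypothetical and are not
part of any existing Hadoop API; this only illustrates the idea, not the patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical set of named counters; not an existing Hadoop class. */
final class OpCounters {
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

  void increment(String op) {
    counts.computeIfAbsent(op, k -> new LongAdder()).increment();
  }

  long get(String op) {
    LongAdder adder = counts.get(op);
    return adder == null ? 0L : adder.sum();
  }
}

/** Hypothetical stats holder for one FS or FC instance (one bucket, one namespace). */
final class InstanceStats {
  // Assumption: global stats are keyed by scheme, one entry per FS class.
  private static final Map<String, OpCounters> GLOBAL = new ConcurrentHashMap<>();

  private final OpCounters global;
  private final OpCounters local = new OpCounters();

  InstanceStats(String scheme) {
    this.global = GLOBAL.computeIfAbsent(scheme, k -> new OpCounters());
  }

  /** Records one operation in both the per-instance and the global view. */
  void record(String op) {
    local.increment(op);
    global.increment(op);
  }

  long instanceCount(String op) { return local.get(op); }

  long globalCount(String op) { return global.get(op); }
}
```

With two InstanceStats objects created for the same scheme, each one reports only its own
operations through instanceCount, while globalCount reports the sum across both.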

We talked a bit about metrics2, but several things made it a poor fit for this statistics
interface.  One issue is that metrics2 assumes that statistics are permanent once created.
Effectively, it keeps them around until the JVM terminates.  metrics2 also tends to use a
fair amount of memory and to require a fair amount of boilerplate code compared to other
solutions.  Finally, because it is global, it can't do per-instance stats very effectively.

It would be nice for the new statistics interface to provide the same stats that are currently
provided by FileSystem#Statistics.  This would allow us to deprecate and eventually remove
FileSystem#Statistics as a public interface (although we might keep the implementation).
The removal could happen only in a new release of Hadoop, of course.  We also talked about
the benefits of providing an iterator over all statistics rather than a map of all statistics.
Relatedly, we talked about the desire to have a new interface that is abstract enough to
accommodate new, more efficient implementations in the future.
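
As a rough illustration of the iterator idea, the sketch below exposes statistics only
through an abstract iterator, so the backing store can change later without affecting
callers.  The names LongStat, StatsSource and MapBackedStats are made up for this example;
they are assumptions, not the interface that was agreed on.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical immutable name/value pair for one statistic. */
final class LongStat {
  final String name;
  final long value;

  LongStat(String name, long value) {
    this.name = name;
    this.value = value;
  }
}

/** Hypothetical abstract interface: callers only ever see an iterator. */
abstract class StatsSource {
  abstract Iterator<LongStat> iterator();
}

/** One possible implementation, backed by a concurrent map of counters. */
final class MapBackedStats extends StatsSource {
  private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

  void add(String name, long delta) {
    counters.computeIfAbsent(name, k -> new AtomicLong()).addAndGet(delta);
  }

  @Override
  Iterator<LongStat> iterator() {
    final Iterator<Map.Entry<String, AtomicLong>> it = counters.entrySet().iterator();
    // Wrap the map iterator so the map itself is never handed to callers.
    return new Iterator<LongStat>() {
      @Override
      public boolean hasNext() {
        return it.hasNext();
      }

      @Override
      public LongStat next() {
        Map.Entry<String, AtomicLong> e = it.next();
        return new LongStat(e.getKey(), e.getValue().get());
      }
    };
  }
}
```

Because callers depend only on StatsSource, a later implementation could replace the map
with something cheaper (for example a fixed array indexed by an enum) without an API change.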

For now, the new interface will deal with per-FS stats, but not per-stream ones.  We should
revisit per-stream statistics later.

> add per-operation stats to FileSystem.Statistics
> ------------------------------------------------
>                 Key: HADOOP-13065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13065
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Ram Venkatesh
>            Assignee: Mingliang Liu
>         Attachments: HDFS-10175.000.patch, HDFS-10175.001.patch, HDFS-10175.002.patch,
> HDFS-10175.003.patch, HDFS-10175.004.patch, HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java
>
> Currently FileSystem.Statistics exposes the following statistics:
> BytesRead
> BytesWritten
> ReadOps
> LargeReadOps
> WriteOps
> These are in turn exposed as job counters by MapReduce and other frameworks. There is
> logic within DfsClient to map operations to these counters that can be confusing; for instance,
> mkdirs counts as a writeOp.
> Proposed enhancement:
> Add a statistic for each DfsClient operation including create, append, createSymlink,
> delete, exists, mkdirs, rename and expose them as new properties on the Statistics object.
> The operation-specific counters can be used for analyzing the load imposed by a particular
> job on HDFS.
> For example, we can use them to identify jobs that end up creating a large number of
> Once this information is available in the Statistics object, the app frameworks like
> MapReduce can expose them as additional counters to be aggregated and recorded as part of
> job summary.

This message was sent by Atlassian JIRA
