hadoop-common-issues mailing list archives

From "Matt Foley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6949) Reduces RPC packet size for primitive arrays, especially long[], which is used at block reporting
Date Wed, 23 Feb 2011 18:16:38 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998482#comment-12998482 ]

Matt Foley commented on HADOOP-6949:

Regarding VersionedProtocols: By searching for all implementations of the VersionedProtocol
API "getProtocolVersion(String, long)", and the values they return, I found the following
16 version constants:
    hdfs.protocol.ClientDatanodeProtocol.versionID; (8)
    hdfs.protocol.ClientProtocol.versionID; (65)
    hdfs.server.protocol.DatanodeProtocol.versionID; (27)
    hdfs.server.protocol.InterDatanodeProtocol.versionID; (6)
    hdfs.server.protocol.NamenodeProtocol.versionID; (5)
    mapred.AdminOperationsProtocol.versionID; (3)
    mapred.InterTrackerProtocol.versionID; (31)
    mapred.TaskUmbilicalProtocol.versionID; (19)
    mapreduce.protocol.ClientProtocol.versionID; (36)

    hadoop.security.authorize.RefreshAuthorizationPolicyProtocol.versionID; (1)
    hadoop.security.RefreshUserMappingsProtocol.versionID; (1)
    hadoop.ipc.AvroRpcEngine.VERSION; (0)

    hadoop.security.TestDoAsEffectiveUser.TestProtocol.versionID (1)
    hadoop.ipc.MiniRPCBenchmark.MiniProtocol.versionID; (1)
    hadoop.ipc.TestRPC.TestProtocol.versionID; (1)
    mapred.TestTaskCommit.MyUmbilical unnamed constant (0)

The first nine are clearly production version numbers.  The next three (two security and one
avro) do not seem to have ever been incremented and I wonder if they need to be now.  The
last four are test-specific and I think should not be incremented.  So please advise:

1. Do all the first nine protocols use WritableRPCEngine and therefore need to have their
version numbers incremented?
2. Do the next three need to have their version numbers incremented for this change?
3. Do you agree that the four Test protocol versions should not change?
4. Did I miss any that you are aware of?

Thank you.  I will put together the versioning patch when we have consensus on what to change.
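For concreteness, the effect of bumping one of the constants above can be sketched as follows. This is a minimal illustration, not the actual Hadoop RPC code: the class name and the bumped value (8 -> 9 for ClientDatanodeProtocol) are hypothetical. The point of incrementing versionID is that a mismatched client fails fast at the getProtocolVersion handshake instead of silently misreading the new array wire format.

```java
// Sketch (hypothetical names/values) of a WritableRpcEngine-style version check.
public class VersionHandshakeSketch {
    // e.g. ClientDatanodeProtocol.versionID bumped 8 -> 9 for this change
    static final long versionID = 9L;

    static long getProtocolVersion(String protocol, long clientVersion) {
        if (clientVersion != versionID) {
            // Old client: reject up front rather than mis-deserialize later.
            throw new IllegalStateException("Protocol " + protocol
                + ": client speaks version " + clientVersion
                + " but server speaks " + versionID);
        }
        return versionID;
    }

    public static void main(String[] args) {
        System.out.println(getProtocolVersion("ClientDatanodeProtocol", 9L));
        try {
            getProtocolVersion("ClientDatanodeProtocol", 8L); // stale client
        } catch (IllegalStateException expected) {
            System.out.println("mismatch rejected");
        }
    }
}
```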

Konstantin, regarding your suggestion to extend this enhancement to arrays of non-primitive
types: there is a simple way to extend this approach to arrays of Strings and Writables.
I coded it up and have it available; it adds an ArrayWritable.Internal very similar to
ArrayPrimitiveWritable.Internal.  Nice and clean.  Of course the size improvement isn't as
dramatic, since the ratio of label size to object size isn't as bad as with primitives, but
the performance improvement is still there (not having to go through the decision loop for
every element of a long array).
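The label-overhead argument can be made concrete with a toy encoder. This is a back-of-envelope sketch, not the actual ObjectWritable or ArrayPrimitiveWritable code: per-element tagging repeats a type string for every entry, while the homogeneous encoding writes the component type once. For a long[1000] the difference is already large.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Toy comparison (sketch only) of per-element tagging vs. a one-shot type label.
public class EncodingSketch {
    // Per-element tagging: type string repeated for every entry.
    static byte[] tagged(long[] a) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF("[J");          // array type label
        out.writeInt(a.length);
        for (long v : a) {
            out.writeUTF("long");    // element type label, n times
            out.writeLong(v);
        }
        out.flush();
        return bos.toByteArray();
    }

    // Homogeneous encoding: component type written exactly once.
    static byte[] homogeneous(long[] a) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF("long");        // component type, once
        out.writeInt(a.length);
        for (long v : a) out.writeLong(v);
        out.flush();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        long[] blocks = new long[1000];
        // 4 + 4 + 1000*(6 + 8) = 14008 bytes
        System.out.println(tagged(blocks).length);
        // 6 + 4 + 1000*8 = 8010 bytes
        System.out.println(homogeneous(blocks).length);
    }
}
```

For Writables and Strings the payload per element is bigger, so the ratio improves less, which is the point made above.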

However, using it would cause a significant change in semantics:
The current ObjectWritable can handle non-homogeneous arrays containing different kinds of
Writables, and nulls.
The optimization we are discussing here removes the type-tagging of every array entry,
thereby assuming that the array is in fact strictly homogeneous, including having no null
entries.  There is also the question of what type declaration the container array has on entry,
and what type it should have on exit.  In the current code, the only restrictions on the array
type are:
    (Writable[]).type isAssignableFrom (X[]).type and
    X.type isAssignableFrom x[i].getClass() for all elements i
Also in the current code, the array produced on the receiving side is always simply Writable[].
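Those two restrictions can be spelled out with reflection. In this sketch Integer and Number stand in for a Writable subtype and the Writable interface; no Hadoop code is involved.

```java
// Sketch: the current code's array-type restrictions, via reflection.
// Integer stands in for a Writable subtype, Number for the Writable interface.
public class ArrayTypeRules {
    public static void main(String[] args) {
        Object[] x = new Integer[] { 1, 2, 3 };

        // (Writable[]).type isAssignableFrom (X[]).type
        System.out.println(Number[].class.isAssignableFrom(x.getClass())); // true

        // X.type isAssignableFrom x[i].getClass() for all elements i
        Class<?> component = x.getClass().getComponentType();
        boolean allOk = true;
        for (Object e : x) allOk &= component.isAssignableFrom(e.getClass());
        System.out.println(allOk); // true
    }
}
```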

1. Is it acceptable to assume/mandate that all arrays of Writables passed in to RPC shall
be homogeneous and have no null elements?  Note that this is a very strict form of homogeneity,
forbidding even subclass instances, because, for example, if you define a class FooWritable
and a subclass SubFooWritable, you can put them both in an array of declared type FooWritable[],
but the receiving-side deserializer will ONLY produce objects of type FooWritable, and will
fail entirely unless the serialized output of FooWritable.write() and SubFooWritable.write()
happen to be compatible (which is too complicated an exception to try to explain, IMO).
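The subclass failure mode can be demonstrated with a toy pair of classes (a sketch with hypothetical names, not actual Writables): when the sender tags the array once as the declared type, any extra state written by a subclass's write() is left unread on the receiving side, and the stream is misaligned from that point on.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of strict-homogeneity breakage: SubFoo writes more than Foo reads.
public class HomogeneitySketch {
    static class Foo {
        int a;
        void write(DataOutput out) throws IOException { out.writeInt(a); }
        void readFields(DataInput in) throws IOException { a = in.readInt(); }
    }
    static class SubFoo extends Foo {
        int b;
        @Override void write(DataOutput out) throws IOException {
            super.write(out);
            out.writeInt(b);   // extra 4 bytes the receiver never consumes
        }
    }

    public static void main(String[] args) throws IOException {
        Foo[] arr = { new Foo(), new SubFoo() };   // declared Foo[], mixed contents
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (Foo f : arr) f.write(out);            // sender: no per-element tags

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
        Foo[] got = new Foo[arr.length];
        for (int i = 0; i < got.length; i++) {     // receiver produces ONLY Foo
            got[i] = new Foo();
            got[i].readFields(in);
        }
        // 4 bytes of SubFoo.b were never read: the stream is now misaligned,
        // and anything deserialized after this array would be garbage.
        System.out.println(in.available());        // 4
    }
}
```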

2. On the receiving side, should we (i) continue producing an array of type Writable[], or
(ii) preserve the type of the array during the implicit encoding process, or (iii) produce
an array of componentType same as the actual element type, assuming the array is indeed strictly
homogeneous so all elements are the same type?  All three are easily done, but have implications
about what can be done with the array later.
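Options (i) and (iii) differ only in how the receiving array is allocated, which a short sketch makes visible (Integer again stands in for the actual element type; option (ii) would additionally require shipping the sender's declared array type over the wire):

```java
import java.lang.reflect.Array;

// Sketch of receiving-side allocation for options (i) and (iii).
public class ReceiveOptions {
    public static void main(String[] args) {
        Class<?> elementType = Integer.class;  // stand-in for the Writable subtype
        int n = 3;

        // (i) erase to the interface type: like always producing Writable[]
        Object[] opt1 = new Object[n];

        // (iii) componentType equals the actual element type
        Object[] opt3 = (Object[]) Array.newInstance(elementType, n);

        System.out.println(opt1.getClass().getSimpleName()); // Object[]
        System.out.println(opt3.getClass().getSimpleName()); // Integer[]
    }
}
```

With (i) the caller must cast every element; with (iii) the whole array can be cast once, but only because strict homogeneity was assumed in the first place.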

I would like to get the ArrayPrimitiveWritable committed soon, so unless the answers to the
above are really obvious to all interested parties, maybe I should open a new Jira for a longer
discussion.

Finally, regarding putting it in v22, it should port trivially, but is it acceptable to have
an incompatible protocol change in a released version?  Thanks.

> Reduces RPC packet size for primitive arrays, especially long[], which is used at block reporting
> -------------------------------------------------------------------------------------------------
>                 Key: HADOOP-6949
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6949
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Navis
>            Assignee: Matt Foley
>             Fix For: 0.23.0
>         Attachments: ObjectWritable.diff, arrayprim.patch, arrayprim_v4.patch
>   Original Estimate: 10m
>  Remaining Estimate: 10m
> Current implementation of oah.io.ObjectWritable marshals primitive array types as a general
> object array: array type string + array length + (element type string + value)*n
> It should not be necessary to specify each element's type for primitive arrays.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

