hive-issues mailing list archives

From "Gopal V (JIRA)" <>
Subject [jira] [Commented] (HIVE-13275) Add a toString method to BytesRefArrayWritable
Date Tue, 31 May 2016 20:12:13 GMT


Gopal V commented on HIVE-13275:

Sure. The String constructor does not appear to escape anything when decoding the bytes as
UTF-8, so this might be a problem when used against columns containing NUL bytes.

Utilities::formatBinaryString() was written for similar scenarios where non-printable binary
data needs to be escaped, though it lives in ql/ rather than in common/.
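
To make the concern concrete, here is a minimal, standalone sketch of escaping
non-printable bytes before building a String. It is not the code of
Utilities::formatBinaryString(); the class name BinaryEscapeSketch, the method
escape(), and the \xHH output format are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hive: shows one way to keep NUL and other
// non-printable bytes visible instead of passing them through unchanged.
final class BinaryEscapeSketch {

    /** Renders printable ASCII as-is; every other byte (and '\') becomes \xHH. */
    static String escape(byte[] data, int start, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = start; i < start + length; i++) {
            int b = data[i] & 0xff; // treat the byte as unsigned
            if (b >= 0x20 && b < 0x7f && b != '\\') {
                sb.append((char) b);
            } else {
                sb.append(String.format("\\x%02x", b));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {'a', 0, 'b'};
        // The plain String constructor keeps the NUL as a U+0000 character.
        String plain = new String(raw, StandardCharsets.UTF_8);
        System.out.println(plain.indexOf('\u0000') >= 0);
        // The escaped form contains only printable characters.
        System.out.println(escape(raw, 0, raw.length));
    }
}
```

Escaping the backslash itself keeps the output unambiguous, so a consumer could
reverse the encoding if it needed the original bytes.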

> Add a toString method to BytesRefArrayWritable
> ----------------------------------------------
>                 Key: HIVE-13275
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats, Serializers/Deserializers
>    Affects Versions: 1.1.0
>            Reporter: Harsh J
>            Assignee: Harsh J
>            Priority: Trivial
>         Attachments: HIVE-13275.000.patch
> RCFileInputFormat cannot be used externally for Hadoop Streaming today because Streaming
> generally relies on the K/V pairs being able to emit text representations (via toString()).
> Since BytesRefArrayWritable has no toString() method, using RCFileInputFormat
> prints default object representations, which are not useful.
> Also, unlike SequenceFiles, RCFiles store multiple "values" per row (i.e. an array),
> so it's important to output them in a valid/parseable manner, as opposed to choosing a simple
> joining delimiter over the string representations of the inner elements.
> I propose adding a standardised CSV formatting of the array data, such that users of
> Streaming can then parse the results in their own script. Since we have OpenCSV as a dependency
> already, we can make use of it for this purpose.
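
The difference between a plain delimiter join and CSV formatting can be sketched
as follows. This is not the attached patch and does not use OpenCSV; it hand-rolls
RFC 4180-style quoting with the JDK only, and the names CsvRowSketch, quote() and
toCsvLine() are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the HIVE-13275 patch: quote every field so
// that embedded commas and quotes do not break parsing on the consumer side.
final class CsvRowSketch {

    /** Wraps a field in double quotes and doubles any embedded double quote. */
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins the quoted fields of one row with commas. */
    static String toCsvLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(fields.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("a,b", "c\"d");
        // A naive join cannot be split back into the original two fields.
        System.out.println(String.join(",", row));
        // The quoted form stays parseable.
        System.out.println(toCsvLine(row));
    }
}
```

With the naive join, the row above looks like three fields to a script splitting
on commas; the quoted form preserves the two-field structure.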

This message was sent by Atlassian JIRA
