trafodion-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hans Zeller <hans.zel...@esgyn.com>
Subject Re: 答复: 答复: enhance TRANSLATE to support Chinese charset?
Date Wed, 06 Jan 2016 01:56:01 GMT
Hi again,

Sorry, I have to make a few more comments.

   - CQD HIVE_DEFAULT_CHARSET 'GBK': This won't work, at least not with
   further modification. If we would allow that, an INVOKE
   hive.hive.myhivetable would show string columns as mycol char(n) character
   set GBK. That's what this CQD means. Maybe we need another CQD,
   HIVE_FILE_CHARSET or something similar, that indicates the character set of
   the file to be converted to HIVE_DEFAULT_CHARSET.
   - If we do the implicit conversion for Hive columns, then we don't
   really need a new TRANSLATE function, at least not for users, it would only
   be used internally - or would there be any SQL expression where we would
   have a GBK character column?
   - You say that GBK is fixed at 2 bytes, but when I look at Wikipedia
   <https://en.wikipedia.org/wiki/GBK>, it says that US-ASCII characters
   (0-127) are encoded as single byte characters, and all other characters are
   2 bytes.
   - Allowing a NUL (null byte, code point 0) in strings: Actually,
   ISO8859-1, UTF-8 and GBK all allow that. UCS2 allows a double-byte NUL
   character. If we wouldn't allow it, we could not store binary data in
   ISO8859-1 columns. I'm not sure why Selva said in an earlier email that we
   can't handle NULs.

Thanks,

Hans

On Tue, Jan 5, 2016 at 5:16 PM, Liu, Ming (Ming) <ming.liu@esgyn.cn> wrote:

> Hi, all,
> Thanks all for the discussions, there are a lot of very good ideas and
> questions. I would like to summarize a little bit.
> 1. GBK vs. GB2312, and encoding.
> Yes, Dave, we will support GBK first, so it should cover GB2312 as well.
> We don't need to explicitly support GB2312 unless we meet specific issues.
> I am not aware of different GBK encodings, we will use Glibc standard API
> iconv() to do the conversion, it should handle all the details I assume.
> From some initial tests, it can convert some real source data correctly. I
> need to understand more about GBK here, but yes, we will support GBK, not
> GB2312.
>
> 2. Total support of GBK in Trafodion
> I agree to support GBK totally is a big project, so as Qifan suggested,
> for this task, we only allow GBK to appear in the TRANSLATE function. So we
> can allow user to change the hive character set to GBK via
> HIVE_DEFAULT_CHARSET, and use GBK2UTF8 in the TRANSLATE function to do the
> converting. This will not affect other part of the database, and not a lie.
> :-) And as Hans/Selva clarified, Hive cannot specific charset per column,
> so I feel we don't need to make it anymore complex. Qifan's proposal of
> allowing Trafodion to do implicit charset translation is very great, so we
> don't bother to write a very complex sql to add TRANSLATE for each 'string'
> column in Hive. Since we saw tables with 300+ columns in hive before...
>
> 3. Total support of GBK in Trafodion, again
> To many aspects, I feel GBK is very similar with UCS2. Its width is fixed
> as 2-bytes. And it will not allow '\0' (null) in any character byte. So I
> hope the effort will be controlled by simulating what Trafodion already do
> for UCS2. But we should launch a total support of GBK as a separate task,
> since for this proposal to TRANSLATE function, the goal is to allow GBK
> data load into UTF-8 columns in Trafodion simpler.
>
> Thanks,
> Ming
>
> -----邮件原件-----
> 发件人: Dave Birdsall [mailto:dave.birdsall@esgyn.com]
> 发送时间: 2016年1月6日 2:30
> 收件人: dev@trafodion.incubator.apache.org
> 主题: RE: 答复: enhance TRANSLATE to support Chinese charset?
>
> Hi,
>
> I was reading up on the character sets a little bit (e.g. Wikipedia
> articles on GB2312 and GBK). It seems GBK is an extension of GB2312, so I'm
> assuming that we are discussing support for GBK, correct? Also, it appears
> there are multiple possible encodings for GBK (no surprise, really; Unicode
> has multiple encodings too). We will want to specify which encoding(s) we
> wish to support.
>
> Dave
>
> -----Original Message-----
> From: Qifan Chen [mailto:qifan.chen@esgyn.com]
> Sent: Tuesday, January 5, 2016 10:22 AM
> To: dev <dev@trafodion.incubator.apache.org>
> Subject: Re: 答复: enhance TRANSLATE to support Chinese charset?
>
> Sorry, in my previous email, I should also mention that the GBK charset is
> only valid within the TRANSLATE function.
>
> Thanks --Qifan
>
> On Tue, Jan 5, 2016 at 12:17 PM, Qifan Chen <qifan.chen@esgyn.com> wrote:
>
> > Hi Hans,
> >
> > Yes, either translate GDB to UTF8 or stored as binary data in ISO88591
> > is fine.
> >
> > I feel the work to allow HIVE_DEFAULT_CHARSET cqd to take GBK, and
> > disabling creating trafodion tables with GBK character columns and GBK
> > literals (both in the parser) is still reasonably sized.
> >
> > With this, Trafodion only recognizes GBK as a source charset for Hive
> > tables, and is able to convert the GBK source to other data format.
> >
> > Thanks --Qifan
> >
> >
> > On Tue, Jan 5, 2016 at 12:02 PM, Hans Zeller <hans.zeller@esgyn.com>
> > wrote:
> >
> >> Just want to clarify: I hate to lie to the engine about character
> >> sets just as much. If someone can make a workable proposal to avoid
> >> it that would be great.
> >>
> >> Unlike in the predecessor products, which tried to interpret the
> >> data, we would treat the GBK strings like binary data stored in
> >> ISO88591, except in the one new translate function that could convert
> >> it to UTF-8. So, we would reduce the lie to the bad but already
> >> accepted one of storing binary data in ISO88591.
> >>
> >> Allowing a new GBK character set without adequate support for it may
> >> be more trouble than treating GBK as binary data. A proposal to do
> >> that should specify the limitations very carefully.
> >>
> >> Thanks,
> >>
> >> Hans
> >>
> >> On Tue, Jan 5, 2016 at 9:48 AM, Dave Birdsall
> >> <dave.birdsall@esgyn.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > I'm wondering how large a project it is to add a new character set
> >> > to Trafodion? I personally hate the idea of lying to the engine
> >> > (for
> >> example,
> >> > pretending that GBK is ISO88591 in most places except when we want
> >> > to translate to another character set; we tried this in one of our
> >> predecessor
> >> > products and it caused endless complications -- it is technical
> >> > debt
> >> that
> >> > will soon demand to be repaid).
> >> >
> >> > If we need to reduce the scope of the project, one way might be to
> >> > limit the new character set to Hive tables only (so we'd have to do
> >> > the metadata
> >> work
> >> > in Trafodion), and translate the data immediately in the scan nodes
> >> where
> >> > it
> >> > enters the engine (so we could limit run-time support to a
> >> > translate operator; other expression operators would operate only
> >> > on known
> >> character
> >> > sets).
> >> >
> >> > Dave
> >> >
> >> > -----Original Message-----
> >> > From: Hans Zeller [mailto:hans.zeller@esgyn.com]
> >> > Sent: Tuesday, January 5, 2016 9:21 AM
> >> > To: dev <dev@trafodion.incubator.apache.org>
> >> > Subject: Re: 答复: enhance TRANSLATE to support Chinese charset?
> >> >
> >> > Hi,
> >> >
> >> > Thanks, Ming, for confirming what the use case is.
> >> >
> >> > As Qifan and Selva have pointed out, the HIVE_DEFAULT_CHARSET CQD
> >> > determines the character set of string columns in Hive tables. I
> >> > don't think,
> >> however,
> >> > that it is feasible to allow GBK as a value for
> >> > HIVE_DEFAULT_CHARSET, unless you are ready to do a lot of work.
> >> > That's because Hive columns have
> >> regular
> >> > Trafodion data types, and supporting GBK in regular data types
> >> > would be
> >> a
> >> > large project.
> >> >
> >> > Therefore, I would suggest that we use cqd HIVE_DEFAULT_CHARSET
> >> 'iso88591'
> >> > instead. We also don't have a BINARY data type and use ISO88591 for
> >> that as
> >> > well. To make that happen, you could either use some CQD as we
> >> > talked about, or a special TRANSLATE flavor, something like
> >> > GBK_AS_ISO_TO_UTF8 (sorry
> >> for
> >> > all the long names, maybe you can shorten them). This may be better
> >> > than the CQD to suppress character set checks. I agree with Qifan
> >> > that we should
> >> try
> >> > hard to check all character sets. For the same reason, we should
> >> > not
> >> allow
> >> > UTF-8 as HIVE_DEFAULT_CHARSET when we store GBK data in Hive tables.
> >> >
> >> > Qifan mentions different character sets in a single Hive table. My
> >> > impression is that the Hive folks did not anticipate that, since
> >> > Hive
> >> used
> >> > to be just simple text files. So, I hope that won't be a common case.
> >> >
> >> > Thanks,
> >> >
> >> > Hans
> >> >
> >> > On Tue, Jan 5, 2016 at 8:35 AM, Selva Govindarajan <
> >> > selva.govindarajan@esgyn.com> wrote:
> >> >
> >> > > Hi Ming,
> >> > >
> >> > > I believe Hive doesn't have a way to specify the character set
> >> > > encoding of individual columns. It relies on the JVM default
> >> > > encoding for the table as a whole. The JVM default encoding
> >> > > depends on the file.encoding default or the LANG variable.
> >> > >
> >> > > In Trafodion, we currently support TEXT and SEQUENCE format hive
> >> > > tables only. Trafodion also reads the hive data directly as
> >> > > stream of bytes and imposes the data type conversion on those
> >> > > stream of bytes to match with hive table definition, the columns
> >> > > being separated by the column delimiters.
> >> > > This
> >> > > processing in Trafodion done in C++ and is not wide character aware.
> >> > > As long as the needed character set doesn't have null byte
> >> > > embedded like UTF8 or
> >> > > ISO88591 encoding, this processing works fine.
> >> > >
> >> > > Currently the CQD hive_default_charset takes in UTF8 and ISO88591
> >> only.
> >> > > However, you can extend this CQD to take different character sets
> >> > > as long as these character set doesn't have null byte embedded in
> >> > > its encoding. For other encodings, this processing should be made
> >> > > wide character aware.
> >> > >
> >> > > Selva
> >> > >
> >> > > -----Original Message-----
> >> > > From: Qifan Chen [mailto:qifan.chen@esgyn.com]
> >> > > Sent: Tuesday, January 5, 2016 7:56 AM
> >> > > To: dev <dev@trafodion.incubator.apache.org>
> >> > > Subject: Re: 答复: enhance TRANSLATE to support Chinese charset?
> >> > >
> >> > > Hi Ming,
> >> > >
> >> > > Thank you for explaining one of the bottlenecks in PoCs in China!
> >> > >
> >> > > From data integrity point of view, it is important that the data
> >> > > to be inserted into Trafodion tables is verified to be in the
> >> > > said charset, and Trafodion's TRANSLATE function can guarantee
> >> > > that. So I wonder if we can consider these possible cases as
> follows.
> >> > >
> >> > >    1. All char columns in the hive table are in GBK. Set CQD
> >> > >    hive_default_charset to GBK and replying on implicit casting
> >> > > to the target
> >> > >    column of the trafodion table. Note that since CQD
> >> > >    ALLOW_IMPLICIT_CHAR_CASTING is on by default, implicit casting
> >> calls
> >> > >    TRANSLATE under the table.
> >> > >    2. Only small number of char columns in the hive table are in
> GBK.
> >> > > Apply
> >> > >    the TRANSLATE function on these GBK columns in the UPSERT
> >> statement,
> >> > as
> >> > >    outlined in the Hans' email. The default charset of the source
> >> > > is
> >> > still
> >> > >    UTF8. If for a particular GBK column that the TRANSLATE
> >> > > function is
> >> > not
> >> > >    specified, Trafodion will give an error. Fix the UPSERT stmt
> >> > > and re-try.
> >> > >    3. Only small number of char columns in the hive table are
> >> > > *NOT* in GBK.
> >> > >    Set CQD hive_default_charset to GBK. Apply the TRANSLATE
> >> > > function on these
> >> > >    non-GBK columns in the UPSERT statement, similar to 2) above.
> >> > >    4. Non of the columns are in GBK. No special settings are
> >> necessary.
> >> > >
> >> > >
> >> > > With the above proposal, the data integrity is maintained and the
> >> > > amount of work to prepare the UPSERT for charset conversion is
> >> > > minimized. The needed work is to implement GBK2UTF8.
> >> > >
> >> > > Thanks --Qifan
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Jan 4, 2016 at 10:39 PM, Liu, Ming (Ming)
> >> > > <ming.liu@esgyn.cn>
> >> > > wrote:
> >> > >
> >> > > > Hi, Hans,
> >> > > >
> >> > > > You are right, each column has a charset type and TRANSLATE
> >> > > > need to match the source/target's charset before performing the
> >> > > > conversion.
> >> > > > I do meet this during the implementation after my proposal.
> >> > > >
> >> > > > My proposal is mainly for bulkloader usage, where the source
is
> >> > > > from hive table. I know very little about Hive, but it seems
to
> >> > > > me, a column in Hive table does not have a concept of charset.
> >> > > > As per my understanding ( I am learning hive), Hive treat all
> >> > > > string column as UTF8, since Hive is using java.
> >> > > > Hive does not check the value of a string column during normal
> >> > > > operation, it treats the string as a binary string without any
> >> > > > attempt to parse it.
> >> > > > So when Trafodion is doing bulk load, the first step is we copy
> >> > > > the raw data into HDFS, and create a hive external table. In
> >> > > > some cases, the source data is encoded in GBK, so we must
> >> > > > convert the source using 'iconv' before copy that file into
> >> > > > HDFS. This step is very time consuming if the source file is
> >> > > > big, iconv is not very fast, even it is fast by writing a
> >> > > > specific MapReduce job for this, it is still an extra step for
> >> > > > Trafodion bulk loading.
> >> > > > On the other hand, it is fine to leave the source data as
> >> > > > encoded in GBK, and create the hive external file. Hive will
> >> > > > not checking the string values's charset. And you can select
> >> > > > those GBK data via HQL, Hive will not do any conversion. So it
> >> > > > is possible that we can skip the 'iconv' step (which is very
> >> > > > slow, or will never success for a big file, say 100G ) , and
> >> > > > using the parallel bulk loader of Trafodion to load the data
in
> >> > > > GBK, but during the loading, for each row, do a TRANSLATE, so
> >> > > > convert it from GBK into UTF8. We assume this will be faster
> >> > > > than two separate step, and it is easier for the end user. This
> >> > > > is the goal of the proposal at first.
> >> > > >
> >> > > > But just as you pointed out, the source column's charset must
> >> > > > be
> >> > > matching.
> >> > > > So the GBKTOUTF8 transaction require the source column to be
in
> >> > > > GBK, which Trafodion doesn't support. And for unknown reason,
> >> > > > the column charset of any hive string column is UTF8 from
> >> > > > Trafodion point of view. So the TRANSLATE will fail. My current
> >> > > > solution is to bypass the source charset type matching , same
> >> > > > as a CQD ALLOW_TRANSLATE_SOURCE_CHARSTET_MISMATCH 'on'.
> >> > > >
> >> > > > In the long run, I can add the charset GBK as one of the
> >> > > > supported Trafodion column charset. Then we can use this
> >> > > > feature in more cases, and we can support a column to be
> >> > > > encoded as GBK which is also a nice feature.
> >> > > > But in short term, I think the most desired requirement is to
> >> > > > allow loading GBK data into Trafodion without an extra iconv
step.
> >> > > >
> >> > > > I think the CQD is a very good suggestion, I now ignore the
> >> > > > charset type matching, I can add this CQD and only bypass type
> >> > > > checking when this CQD is on. And we can turn this CQD on
> >> > > > during bulk loading and turn it off in other cases.
> >> > > >
> >> > > > Do you think this is better?
> >> > > >
> >> > > > Thanks,
> >> > > > Ming
> >> > > >
> >> > > > -----邮件原件-----
> >> > > > 发件人: Hans Zeller [mailto:hans.zeller@esgyn.com]
> >> > > > 发送时间: 2016年1月5日 10:40
> >> > > > 收件人: dev <dev@trafodion.incubator.apache.org>
> >> > > > 主题: Re: enhance TRANSLATE to support Chinese charset?
> >> > > >
> >> > > > Hi,
> >> > > >
> >> > > > Kevin's question make me think of another question: The
> >> > > > TRANSLATE function takes a value a with character set ca and
> >> > > > translates it into a new value b with character set cb. Since
> >> > > > we don't support GBK as character sets in data types, I wonder
> >> > > > how we could call this function. People may in some cases stuff
> >> > > > GBK data into an ISO88591 column, are you thinking of that
> >> > > > case, and would we then allow something like this:
> >> > > >
> >> > > > create table tiso(a char(32) character set iso88591); insert
> >> > > > into tiso
> >> > > ...
> >> > > > ; -- insert GBK data into the table without translation select
> >> > > > translate(a using GBKTOUTF8N) from tiso;
> >> > > >
> >> > > >
> >> > > > Or, maybe something like this, to load a GB2312 HDFS file into
> >> > > > a Trafodion
> >> > > > table:
> >> > > >
> >> > > > cqd hive_default_charset 'ISO88591'; -- data is really GB2312,
> >> > > > but that is not supported cqd
> >> > > > allow_translate_source_charset_mismatch
> >> > > > 'on';
> >> > > > -- some new CQD like the above could suppress errors on
> >> > > > mismatched source charset for translate upsert using load into
> >> > > > traf_utf8_table select translate(a using GBKTOUTF8N) from
> >> > > > hive.hive.src;
> >> > > >
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Hans
> >> > > >
> >> > > > On Mon, Jan 4, 2016 at 5:01 PM, Liu, Ming (Ming)
> >> > > > <ming.liu@esgyn.cn
> >> >
> >> > > > wrote:
> >> > > >
> >> > > > > Hi, Kevin,
> >> > > > >
> >> > > > > I didn't notice GB18030 before, but after some initial
> >> > > > > search, it seems a must to have feature, so Trafodion should
> >> > > > > support it. I will mark it after the GBK support, we saw
GBK
> >> > > > > in real customer site, but not GB18030 yet, but we should
> >> > > > > assume wide requirement of GB18030 very
> >> > > > soon.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Ming
> >> > > > >
> >> > > > > -----邮件原件-----
> >> > > > > 发件人: Kevin DeYager [mailto:kevin.deyager@esgyn.com]
> >> > > > > 发送时间: 2016年1月5日 8:11
> >> > > > > 收件人: dev@trafodion.incubator.apache.org
> >> > > > > 主题: RE: enhance TRANSLATE to support Chinese charset?
> >> > > > >
> >> > > > > Hi Ming,
> >> > > > >
> >> > > > > I am no expert in this area, but is GB18030 translation
also
> >> > > > > needed / desirable?
> >> > > > >
> >> > > > > Regards,
> >> > > > > - Kevin
> >> > > > >
> >> > > > > -----Original Message-----
> >> > > > > From: Liu, Ming (Ming) [mailto:ming.liu@esgyn.cn]
> >> > > > > Sent: Monday, December 21, 2015 4:51 PM
> >> > > > > To: dev@trafodion.incubator.apache.org
> >> > > > > Subject: enhance TRANSLATE to support Chinese charset?
> >> > > > >
> >> > > > > Hello,
> >> > > > >
> >> > > > > Trafodion currently has a TRANSLATE function, which can
do
> >> > > > > charset conversion among ISO88591, SJIS, UCS2 and UTF8.
> >> > > > > I would like to add GBK conversion into this function, it
can
> >> > > > > help for data loading sometimes. As we saw previously, source
> >> > > > > data are very typically encoded in GB2312, especially in
> >> > > > > China, so we have to do a 'iconv' from GBK to UTF8 before
> >> > > > > loading, if the data files are huge, it will take a some
time.
> >> > > > > If TRANSLATE can support GBKTOUTF8, so that conversion can
be
> >> > > > > done in one step during the 'LOAD' SQL command. I think
there
> >> > > > > are some other use cases as well.
> >> > > > >
> >> > > > > Do you feel this is worthy? If so, I would like to file
a
> >> > > > > JIRA and can work on it.
> >> > > > >
> >> > > > > At first glance, I would like to propose several translate
> >> flavors:
> >> > > > > GBKTOUTF8N : which will try to do conversion from GB2312
to
> >> > > > > UTF8, in case there is an error during the conversion, return
> >> > > > > NULL, no SQL Error raised, silently continue.
> >> > > > > GBKTOUTF8O: try to do conversion from GB2312 to UTF8, in
case
> >> > > > > there is an error during the conversion, return the original
> >> > > > > string without any conversion, no SQL Error raised, silently
> >> > continue.
> >> > > > > BGKTOUTF8: typical behavior, once there is a conversion
> >> > > > > error, raise a SQL Error.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Ming
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Regards, --Qifan
> >> > >
> >> >
> >>
> >
> >
> >
> > --
> > Regards, --Qifan
> >
> >
>
>
> --
> Regards, --Qifan
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message