trafodion-dev mailing list archives

From Dave Birdsall <dave.birds...@esgyn.com>
Subject RE: Re: enhance TRANSLATE to support Chinese charset?
Date Tue, 05 Jan 2016 17:48:01 GMT
Hi,

I'm wondering how large a project it would be to add a new character set
to Trafodion. I personally hate the idea of lying to the engine (for example,
pretending that GBK is ISO88591 in most places except when we want to
translate to another character set; we tried this in one of our predecessor
products and it caused endless complications -- it is technical debt that
will soon demand to be repaid).

If we need to reduce the scope of the project, one way might be to limit the
new character set to Hive tables only (so we'd have to do the metadata work
in Trafodion), and translate the data immediately in the scan nodes where it
enters the engine (so we could limit run-time support to a translate
operator; other expression operators would operate only on known character
sets).

Dave

-----Original Message-----
From: Hans Zeller [mailto:hans.zeller@esgyn.com]
Sent: Tuesday, January 5, 2016 9:21 AM
To: dev <dev@trafodion.incubator.apache.org>
Subject: Re: Re: enhance TRANSLATE to support Chinese charset?

Hi,

Thanks, Ming, for confirming what the use case is.

As Qifan and Selva have pointed out, the HIVE_DEFAULT_CHARSET CQD determines
the character set of string columns in Hive tables. I don't think, however,
that it is feasible to allow GBK as a value for HIVE_DEFAULT_CHARSET, unless
you are ready to do a lot of work. That's because Hive columns have regular
Trafodion data types, and supporting GBK in regular data types would be a
large project.

Therefore, I would suggest that we use cqd HIVE_DEFAULT_CHARSET 'iso88591'
instead. We also don't have a BINARY data type and use ISO88591 for that as
well. To make that happen, you could either use some CQD as we talked about,
or a special TRANSLATE flavor, something like GBK_AS_ISO_TO_UTF8 (sorry for
all the long names, maybe you can shorten them). This may be better than the
CQD to suppress character set checks. I agree with Qifan that we should try
hard to check all character sets. For the same reason, we should not allow
UTF-8 as HIVE_DEFAULT_CHARSET when we store GBK data in Hive tables.
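
Just to illustrate, a rough sketch of how that TRANSLATE flavor might be
used in a load statement (the GBK_AS_ISO_TO_UTF8 name is only the
suggestion above, not an existing flavor, and the table names are the
placeholders from the example further down this thread):

cqd HIVE_DEFAULT_CHARSET 'iso88591';
upsert using load into traf_utf8_table
  select translate(a using GBK_AS_ISO_TO_UTF8) from hive.hive.src;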

Qifan mentions different character sets in a single Hive table. My
impression is that the Hive folks did not anticipate that, since Hive used
to be just simple text files. So, I hope that won't be a common case.

Thanks,

Hans

On Tue, Jan 5, 2016 at 8:35 AM, Selva Govindarajan <selva.govindarajan@esgyn.com> wrote:

> Hi Ming,
>
> I believe Hive doesn't have a way to specify the character set
> encoding of individual columns. It relies on the JVM default encoding
> for the table as a whole. The JVM default encoding depends on the
> file.encoding default or the LANG variable.
>
> In Trafodion, we currently support TEXT and SEQUENCE format hive
> tables only. Trafodion reads the hive data directly as a stream of
> bytes and imposes the data type conversion on that byte stream to
> match the hive table definition, with the columns separated by the
> column delimiters. This processing in Trafodion is done in C++ and is
> not wide-character aware. As long as the needed character set does not
> embed null bytes, like the UTF8 or ISO88591 encodings, this processing
> works fine.
>
> Currently the CQD hive_default_charset accepts UTF8 and ISO88591 only.
> However, you could extend this CQD to accept other character sets, as
> long as those character sets don't embed null bytes in their encoding.
> For other encodings, this processing would need to be made
> wide-character aware.
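>
> For illustration only, the current and hypothetical settings might look
> like the sketch below; whether a value such as 'GBK' can simply be added
> to the accepted list is the open question:
>
> cqd hive_default_charset 'UTF8';     -- accepted today
> cqd hive_default_charset 'ISO88591'; -- accepted today
> cqd hive_default_charset 'GBK';      -- hypothetical; would require extending the CQD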
>
> Selva
>
> -----Original Message-----
> From: Qifan Chen [mailto:qifan.chen@esgyn.com]
> Sent: Tuesday, January 5, 2016 7:56 AM
> To: dev <dev@trafodion.incubator.apache.org>
> Subject: Re: Re: enhance TRANSLATE to support Chinese charset?
>
> Hi Ming,
>
> Thank you for explaining one of the bottlenecks in PoCs in China!
>
> From a data integrity point of view, it is important that data to be
> inserted into Trafodion tables is verified to be in the stated charset,
> and Trafodion's TRANSLATE function can guarantee that. So I wonder if
> we can consider the following possible cases.
>
>    1. All char columns in the hive table are in GBK. Set CQD
>    hive_default_charset to GBK and rely on implicit casting to the
>    target column of the trafodion table. Note that since CQD
>    ALLOW_IMPLICIT_CHAR_CASTING is on by default, implicit casting calls
>    TRANSLATE under the covers.
>    2. Only a small number of char columns in the hive table are in GBK.
>    Apply the TRANSLATE function on these GBK columns in the UPSERT
>    statement, as outlined in Hans' email. The default charset of the
>    source is still UTF8. If the TRANSLATE function is not specified for
>    a particular GBK column, Trafodion will give an error; fix the
>    UPSERT statement and re-try.
>    3. Only a small number of char columns in the hive table are *NOT*
>    in GBK. Set CQD hive_default_charset to GBK. Apply the TRANSLATE
>    function on these non-GBK columns in the UPSERT statement, similar
>    to 2) above.
>    4. None of the columns are in GBK. No special settings are necessary.
>
>
> With the above proposal, the data integrity is maintained and the
> amount of work to prepare the UPSERT for charset conversion is
> minimized. The needed work is to implement GBK2UTF8.
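>
> As a rough sketch of cases 1 and 2 (the table and column names are made
> up, and GBKTOUTF8 is the flavor Ming proposed, not an existing one):
>
> -- case 1: all char columns in the hive table are GBK
> cqd hive_default_charset 'GBK';
> upsert using load into traf_tbl select * from hive.hive.gbk_src;
> -- the implicit cast calls TRANSLATE under the covers
>
> -- case 2: only column b is GBK; the source default charset stays UTF8
> upsert using load into traf_tbl
>   select a, translate(b using GBKTOUTF8), c from hive.hive.src;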
>
> Thanks --Qifan
>
>
>
> On Mon, Jan 4, 2016 at 10:39 PM, Liu, Ming (Ming) <ming.liu@esgyn.cn>
> wrote:
>
> > Hi, Hans,
> >
> > You are right, each column has a charset type and TRANSLATE needs to
> > match the source/target charset before performing the conversion.
> > I did run into this during the implementation, after making my proposal.
> >
> > My proposal is mainly for bulk loader usage, where the source is a
> > hive table. I know very little about Hive, but it seems to me that a
> > column in a Hive table does not have a concept of charset. As per my
> > understanding (I am still learning hive), Hive treats all string
> > columns as UTF8, since Hive is using Java.
> > Hive does not check the value of a string column during normal
> > operation; it treats the string as a binary string without any
> > attempt to parse it.
> > So when Trafodion is doing a bulk load, the first step is to copy the
> > raw data into HDFS and create a hive external table. In some cases,
> > the source data is encoded in GBK, so we must convert the source
> > using 'iconv' before copying that file into HDFS. This step is very
> > time consuming if the source file is big; iconv is not very fast, and
> > even if we made it fast by writing a specific MapReduce job for this,
> > it would still be an extra step for Trafodion bulk loading.
> > On the other hand, it is fine to leave the source data encoded in
> > GBK and create the hive external table. Hive will not check the
> > string values' charset, and you can select that GBK data via HQL;
> > Hive will not do any conversion. So it is possible to skip the
> > 'iconv' step (which is very slow, or may never succeed for a big
> > file, say 100G), and use the parallel bulk loader of Trafodion to
> > load the data in GBK, but during the loading, for each row, do a
> > TRANSLATE to convert it from GBK into UTF8. We assume this will be
> > faster than two separate steps, and it is easier for the end user.
> > This was the original goal of the proposal.
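> >
> > In other words, the hope is to replace the separate iconv pass with a
> > single load that converts on the fly. A rough sketch, using the
> > GBKTOUTF8 flavor from my proposal (the table names are made up for
> > illustration):
> >
> > upsert using load into traf_utf8_table
> >   select translate(a using GBKTOUTF8) from hive.hive.gbk_src;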
> >
> > But just as you pointed out, the source column's charset must match.
> > So the GBKTOUTF8 translation requires the source column to be in GBK,
> > which Trafodion doesn't support. And, for some reason unknown to me,
> > the column charset of any hive string column is UTF8 from Trafodion's
> > point of view, so the TRANSLATE will fail. My current solution is to
> > bypass the source charset type matching, the same as a CQD
> > ALLOW_TRANSLATE_SOURCE_CHARSET_MISMATCH 'on'.
> >
> > In the long run, I can add GBK as one of the supported Trafodion
> > column charsets. Then we can use this feature in more cases, and we
> > can support columns encoded as GBK, which is also a nice feature.
> > But in the short term, I think the most pressing requirement is to
> > allow loading GBK data into Trafodion without an extra iconv step.
> >
> > I think the CQD is a very good suggestion. Right now I simply ignore
> > the charset type matching; I can add this CQD and bypass type
> > checking only when it is on. We can then turn this CQD on during
> > bulk loading and turn it off in other cases.
> >
> > Do you think this is better?
> >
> > Thanks,
> > Ming
> >
> > -----Original Message-----
> > From: Hans Zeller [mailto:hans.zeller@esgyn.com]
> > Sent: January 5, 2016 10:40
> > To: dev <dev@trafodion.incubator.apache.org>
> > Subject: Re: enhance TRANSLATE to support Chinese charset?
> >
> > Hi,
> >
> > Kevin's question makes me think of another question: the TRANSLATE
> > function takes a value a with character set ca and translates it
> > into a new value b with character set cb. Since we don't support GBK
> > as a character set in data types, I wonder how we could call this
> > function. People may in some cases stuff GBK data into an ISO88591
> > column; are you thinking of that case, and would we then allow
> > something like this:
> >
> > create table tiso(a char(32) character set iso88591);
> > insert into tiso ... ; -- insert GBK data into the table without translation
> > select translate(a using GBKTOUTF8N) from tiso;
> >
> >
> > Or, maybe something like this, to load a GB2312 HDFS file into a
> > Trafodion
> > table:
> >
> > cqd hive_default_charset 'ISO88591'; -- data is really GB2312, but that is not supported
> > cqd allow_translate_source_charset_mismatch 'on';
> > -- some new CQD like the above could suppress errors on a mismatched source charset for translate
> > upsert using load into traf_utf8_table
> >   select translate(a using GBKTOUTF8N) from hive.hive.src;
> >
> >
> > Thanks,
> >
> > Hans
> >
> > On Mon, Jan 4, 2016 at 5:01 PM, Liu, Ming (Ming) <ming.liu@esgyn.cn>
> > wrote:
> >
> > > Hi, Kevin,
> > >
> > > I didn't know about GB18030 before, but after some initial research,
> > > it seems like a must-have feature, so Trafodion should support it. I
> > > will put it after the GBK support: we have seen GBK at real customer
> > > sites, but not GB18030 yet, though we should expect wide demand for
> > > GB18030 very soon.
> > >
> > > Thanks,
> > > Ming
> > >
> > > -----Original Message-----
> > > From: Kevin DeYager [mailto:kevin.deyager@esgyn.com]
> > > Sent: January 5, 2016 8:11
> > > To: dev@trafodion.incubator.apache.org
> > > Subject: RE: enhance TRANSLATE to support Chinese charset?
> > >
> > > Hi Ming,
> > >
> > > I am no expert in this area, but is GB18030 translation also
> > > needed / desirable?
> > >
> > > Regards,
> > > - Kevin
> > >
> > > -----Original Message-----
> > > From: Liu, Ming (Ming) [mailto:ming.liu@esgyn.cn]
> > > Sent: Monday, December 21, 2015 4:51 PM
> > > To: dev@trafodion.incubator.apache.org
> > > Subject: enhance TRANSLATE to support Chinese charset?
> > >
> > > Hello,
> > >
> > > Trafodion currently has a TRANSLATE function, which can do charset
> > > conversion among ISO88591, SJIS, UCS2 and UTF8.
> > > I would like to add GBK conversion to this function; it can help
> > > with data loading sometimes. As we have seen before, source data is
> > > very often encoded in GB2312, especially in China, so we have to run
> > > 'iconv' from GBK to UTF8 before loading, and if the data files are
> > > huge, that takes quite some time.
> > > If TRANSLATE supported GBKTOUTF8, the conversion could be done in
> > > one step during the 'LOAD' SQL command. I think there are some other
> > > use cases as well.
> > >
> > > Do you feel this is worthwhile? If so, I would like to file a JIRA and
> > > can work on it.
> > >
> > > At first glance, I would like to propose several translate flavors:
> > > GBKTOUTF8N: try to convert from GB2312 to UTF8; if there is an error
> > > during the conversion, return NULL, raise no SQL error, and silently
> > > continue.
> > > GBKTOUTF8O: try to convert from GB2312 to UTF8; if there is an error
> > > during the conversion, return the original string without any
> > > conversion, raise no SQL error, and silently continue.
> > > GBKTOUTF8: typical behavior; on a conversion error, raise a SQL error.
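> > >
> > > For example, a sketch of how the three flavors would behave on a
> > > column c that contains one invalid GBK byte sequence (table t is
> > > made up for illustration):
> > >
> > > select translate(c using GBKTOUTF8N) from t; -- bad value returns NULL, no error
> > > select translate(c using GBKTOUTF8O) from t; -- bad value returns the original string, no error
> > > select translate(c using GBKTOUTF8)  from t; -- bad value raises a SQL error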
> > >
> > > Thanks,
> > > Ming
> > >
> >
>
>
>
> --
> Regards, --Qifan
>
