carbondata-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From xuchuan...@hust.edu.cn
Subject Grammar about supporting string longer than 32000 characters
Date Wed, 02 May 2018 02:30:22 GMT


Hi, community:

I'm implementing supporting string longer than 32000 characters in carbondata and have a question
about the grammar of this feature. Here I'd like to explain it and want to receive your feedbacks.

DESCRIPTION:

In previous implementation, carbondata internally uses a short to store the length of a string
(char,varchar,string are all treated as string) value. It does save memory and space for the
regular use case by using short instead of int, but will cause problem when the length of
string exceeds the range of short.

In order to support the above case and save memory/space if possible, we want to distinguish
which string columns are SHORT and which are LONG. (SHORT: lengthOfValue<32000, LONG: lengthOfValue>=32000)


SOLUTION:

Solution1. Add a TableProperty in CreateTableStatement
The grammar looks like below:
For SQL case:
```
    sql(
      s"""
         | CREATE TABLE if not exists $longStringTable(
         | id INT, name STRING, description STRING, address STRING
         | ) STORED BY 'carbondata'
         | TBLPROPERTIES('LONG_STRING_COLUMNS'='description', 'SORT_COLUMNS'='name')
         |""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
  .option("tableName", "longStringTable")
  .option("SORT_COLUMNS", 'name')
  .option('LONG_STRING_COLUMNS', 'description')
  .mode(SaveMode.Overwrite)
  .save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a property called 'LONG_STRING_COLUMNS' in the statement, its value contains all the
columns that are LONG. 
Note: This solution is available as PR2252.


Solution2. Add a new datatype
The grammar looks like below:
For SQL case:
```
    sql(
      s"""
         | CREATE TABLE if not exists $longStringTable(
         | id INT, name STRING, description TEXT, address STRING
         | ) STORED BY 'carbondata'
         | TBLPROPERTIES('SORT_COLUMNS'='name')
         |""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
  .option("tableName", "longStringTable")
  .option("SORT_COLUMNS", 'name')
  // SparkSQL does not have TEXT datatype, how to specify it?
  .mode(SaveMode.Overwrite)
  .save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a new datatype called 'TEXT' and treat 'description' as TEXT in the statement.

I'd prefer to solution1 since it is compatible with hive/sparksql while solution2 has problem
to migrate with them.

END



At last, how do you think the solutions provided above?

Please give your comments, moreover you can provide other solutions here to improve it.
Mime
View raw message