carbondata-commits mailing list archives

From chenliang...@apache.org
Subject [1/3] carbondata git commit: [CARBONDATA-2915] Updated & enhanced Documentation of CarbonData
Date Fri, 07 Sep 2018 04:01:51 GMT
Repository: carbondata
Updated Branches:
  refs/heads/master f04850f39 -> 67a8a37bf


http://git-wip-us.apache.org/repos/asf/carbondata/blob/67a8a37b/docs/installation-guide.md
----------------------------------------------------------------------
diff --git a/docs/installation-guide.md b/docs/installation-guide.md
deleted file mode 100644
index f679338..0000000
--- a/docs/installation-guide.md
+++ /dev/null
@@ -1,198 +0,0 @@
-<!--
-    Licensed to the Apache Software Foundation (ASF) under one or more 
-    contributor license agreements.  See the NOTICE file distributed with
-    this work for additional information regarding copyright ownership. 
-    The ASF licenses this file to you under the Apache License, Version 2.0
-    (the "License"); you may not use this file except in compliance with 
-    the License.  You may obtain a copy of the License at
-
-      http://www.apache.org/licenses/LICENSE-2.0
-
-    Unless required by applicable law or agreed to in writing, software 
-    distributed under the License is distributed on an "AS IS" BASIS, 
-    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-    See the License for the specific language governing permissions and 
-    limitations under the License.
--->
-
-# Installation Guide
-This tutorial guides you through the installation and configuration of CarbonData in the
following two modes:
-
-* [Installing and Configuring CarbonData on Standalone Spark Cluster](#installing-and-configuring-carbondata-on-standalone-spark-cluster)
-* [Installing and Configuring CarbonData on Spark on YARN Cluster](#installing-and-configuring-carbondata-on-spark-on-yarn-cluster)
-
-followed by:
-
-* [Query Execution using CarbonData Thrift Server](#query-execution-using-carbondata-thrift-server)
-
-## Installing and Configuring CarbonData on Standalone Spark Cluster
-
-### Prerequisites
-
-   - Hadoop HDFS and YARN should be installed and running.
-
-   - Spark should be installed and running on all the cluster nodes.
-
-   - The CarbonData user should have permission to access HDFS.
-
-
-### Procedure
-
-1. [Build the CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from `./assembly/target/scala-2.1x/carbondata_xxx.jar`. 
-
-2. Copy `./assembly/target/scala-2.1x/carbondata_xxx.jar` to `$SPARK_HOME/carbonlib` folder.
-
-     **NOTE**: Create the carbonlib folder if it does not exist inside the `$SPARK_HOME` path.
-
-3. Add the carbonlib folder path in the Spark classpath. (Edit `$SPARK_HOME/conf/spark-env.sh`
file and modify the value of `SPARK_CLASSPATH` by appending `$SPARK_HOME/carbonlib/*` to the
existing value)
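
   For illustration, the appended entry in `$SPARK_HOME/conf/spark-env.sh` could look like the
following sketch (the exact form depends on your existing `SPARK_CLASSPATH` value):

```
# Sketch: append the carbonlib folder to the Spark classpath
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$SPARK_HOME/carbonlib/*
```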
-
-4. Copy the `./conf/carbon.properties.template` file from CarbonData repository to `$SPARK_HOME/conf/`
folder and rename the file to `carbon.properties`.
-
-5. Repeat Step 2 to Step 4 on all the other nodes of the cluster.
-    
-6. On the master node, configure the properties mentioned in the following table in the
`$SPARK_HOME/conf/spark-defaults.conf` file.
-
-| Property | Value | Description |
-|---------------------------------|-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
-| spark.driver.extraJavaOptions | `-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties`
| A string of extra JVM options to pass to the driver. For instance, GC settings or other
logging. |
-| spark.executor.extraJavaOptions | `-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties`
| A string of extra JVM options to pass to executors. For instance, GC settings or other logging.
**NOTE**: You can enter multiple values separated by space. |
-
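A minimal sketch of the resulting `spark-defaults.conf` entries, mirroring the table above
(paths are illustrative and should match your installation):

```
spark.driver.extraJavaOptions   -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
```
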
-7. Add the following properties in `$SPARK_HOME/conf/carbon.properties` file:
-
-| Property             | Required | Description                                         
                                  | Example                             | Remark  |
-|----------------------|----------|----------------------------------------------------------------------------------------|-------------------------------------|---------|
-| carbon.storelocation | NO       | Location where CarbonData will create the store
and write the data in its own format. If not specified, it takes the spark.sql.warehouse.dir
path. | hdfs://HOSTNAME:PORT/Opt/CarbonStore      | It is recommended to set an HDFS directory |
-
-
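As a sketch, the corresponding `carbon.properties` entry could look like this (the store path
is illustrative):

```
carbon.storelocation=hdfs://HOSTNAME:PORT/Opt/CarbonStore
```
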
-8. Verify the installation. For example:
-
-```
-./spark-shell --master spark://HOSTNAME:PORT --total-executor-cores 2 --executor-memory 2G
-```
-
-**NOTE**: Make sure the user starting the driver and executors has read permission on the
CarbonData JARs and files.
-
-To get started with CarbonData: [Quick Start](quick-start-guide.md), [Data Management on
CarbonData](data-management-on-carbondata.md)
-
-## Installing and Configuring CarbonData on Spark on YARN Cluster
-
-   This section provides the procedure to install CarbonData on a "Spark on YARN" cluster.
-
-### Prerequisites
-   * Hadoop HDFS and YARN should be installed and running.
-   * Spark should be installed and running on all the client nodes.
-   * The CarbonData user should have permission to access HDFS.
-
-### Procedure
-
-   The following steps are only for driver nodes. (Driver nodes are the ones that start
the Spark context.)
-
-1. [Build the CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
project and get the assembly jar from `./assembly/target/scala-2.1x/carbondata_xxx.jar` and
copy it to the `$SPARK_HOME/carbonlib` folder.
-
-    **NOTE**: Create the carbonlib folder if it does not exist inside the `$SPARK_HOME` path.
-
-2. Copy the `./conf/carbon.properties.template` file from CarbonData repository to `$SPARK_HOME/conf/`
folder and rename the file to `carbon.properties`.
-
-3. Create a `tar.gz` file of the carbonlib folder and move it inside the carbonlib folder.
-
-```
-cd $SPARK_HOME
-tar -zcvf carbondata.tar.gz carbonlib/
-mv carbondata.tar.gz carbonlib/
-```
-
-4. Configure the properties mentioned in the following table in `$SPARK_HOME/conf/spark-defaults.conf`
file.
-
-| Property | Description | Value |
-|---------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|
-| spark.master | Set this value to run Spark on YARN. | Set to yarn-client
to run Spark in yarn-client mode. |
-| spark.yarn.dist.files | Comma-separated list of files to be placed in the working directory
of each executor. |`$SPARK_HOME/conf/carbon.properties` |
-| spark.yarn.dist.archives | Comma-separated list of archives to be extracted into the working
directory of each executor. |`$SPARK_HOME/carbonlib/carbondata.tar.gz` |
-| spark.executor.extraJavaOptions | A string of extra JVM options to pass to executors, for
instance GC settings or other logging. **NOTE**: You can enter multiple values separated by spaces. |`-Dcarbon.properties.filepath=carbon.properties` |
-| spark.executor.extraClassPath | Extra classpath entries to prepend to the classpath of
executors. **NOTE**: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append
the values to this parameter instead. |`carbondata.tar.gz/carbonlib/*` |
-| spark.driver.extraClassPath | Extra classpath entries to prepend to the classpath of the
driver. **NOTE**: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append
the value to this parameter instead. |`$SPARK_HOME/carbonlib/*` |
-| spark.driver.extraJavaOptions | A string of extra JVM options to pass to the driver. For
instance, GC settings or other logging. |`-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties`
|
-
-
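Putting the rows above together, a sketch of `spark-defaults.conf` for the YARN deployment
could look like the following (values are taken from the table and must be adapted to your
cluster):

```
spark.master                     yarn-client
spark.yarn.dist.files            $SPARK_HOME/conf/carbon.properties
spark.yarn.dist.archives         $SPARK_HOME/carbonlib/carbondata.tar.gz
spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=carbon.properties
spark.executor.extraClassPath    carbondata.tar.gz/carbonlib/*
spark.driver.extraClassPath      $SPARK_HOME/carbonlib/*
spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
```
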
-5. Add the following properties in `$SPARK_HOME/conf/carbon.properties`:
-
-| Property | Required | Description | Example | Remark |
-|----------------------|----------|----------------------------------------------------------------------------------------|-------------------------------------|---------------|
-| carbon.storelocation | NO | Location where CarbonData will create the store and write the
data in its own format. If not specified, it takes the spark.sql.warehouse.dir path.| hdfs://HOSTNAME:PORT/Opt/CarbonStore
| It is recommended to set an HDFS directory|
-
-6. Verify the installation.
-
-```
- ./bin/spark-shell --master yarn-client --driver-memory 1g \
- --executor-cores 2 --executor-memory 2G
-```
-  **NOTE**: Make sure the user starting the driver and executors has read permission on the
CarbonData JARs and files.
-
-  Getting started with CarbonData: [Quick Start](quick-start-guide.md), [Data Management
on CarbonData](data-management-on-carbondata.md)
-
-## Query Execution Using CarbonData Thrift Server
-
-### Starting CarbonData Thrift Server
-
-   a. cd `$SPARK_HOME`
-
-   b. Run the following command to start the CarbonData thrift server.
-
-```
-./bin/spark-submit \
---class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
-$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
-```
-
-| Parameter | Description | Example |
-|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
-| CARBON_ASSEMBLY_JAR | CarbonData assembly jar name present in the `$SPARK_HOME/carbonlib/`
folder. | carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar |
-| carbon_store_path | This is a parameter to the CarbonThriftServer class. This is an HDFS path
where CarbonData files will be kept. It is strongly recommended to use the same value as the carbon.storelocation
parameter in carbon.properties. If not specified, it takes the spark.sql.warehouse.dir path.
| `hdfs://<host_name>:port/user/hive/warehouse/carbon.store` |
-
-**NOTE**: From Spark 1.6, the Thrift server runs in multi-session mode by default, which
means each JDBC/ODBC connection owns a copy of its own SQL configuration and temporary function
registry. Cached tables are still shared though. If you prefer to run the Thrift server in
single-session mode and share all SQL configuration and the temporary function registry, set
the option `spark.sql.hive.thriftServer.singleSession` to `true`. You may either add this
option to `spark-defaults.conf`, or pass it to `spark-submit` via `--conf`:
-
-```
-./bin/spark-submit \
---conf spark.sql.hive.thriftServer.singleSession=true \
---class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
-$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR <carbon_store_path>
-```
-
-**But** in single-session mode, if one user changes the database from one connection, the
database of the other connections will be changed too.
-
-**Examples**
-   
-   * Start with default memory and executors.
-
-```
-./bin/spark-submit \
---class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
-$SPARK_HOME/carbonlib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
-hdfs://<host_name>:port/user/hive/warehouse/carbon.store
-```
-   
-   * Start with fixed executors and resources.
-
-```
-./bin/spark-submit \
---class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
---num-executors 3 --driver-memory 20g --executor-memory 250g \
---executor-cores 32 \
-/srv/OSCON/BigData/HACluster/install/spark/sparkJdbc/lib/carbondata_2.xx-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar \
-hdfs://<host_name>:port/user/hive/warehouse/carbon.store
-```
-  
-### Connecting to CarbonData Thrift Server Using Beeline
-
-```
-     cd $SPARK_HOME
-     ./sbin/start-thriftserver.sh
-     ./bin/beeline -u jdbc:hive2://<thriftserver_host>:port
-
-     Example
-     ./bin/beeline -u jdbc:hive2://10.10.10.10:10000
-```
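
Once connected, SQL can be issued from the Beeline prompt. A minimal sketch (the table name
and schema below are illustrative, not part of the original guide):

```
CREATE TABLE IF NOT EXISTS test_table(id INT, name STRING) STORED BY 'carbondata';
SELECT * FROM test_table;
```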
-

http://git-wip-us.apache.org/repos/asf/carbondata/blob/67a8a37b/docs/quick-start-guide.md
----------------------------------------------------------------------
diff --git a/docs/quick-start-guide.md b/docs/quick-start-guide.md
index 84f871d..1b3ffc2 100644
--- a/docs/quick-start-guide.md
+++ b/docs/quick-start-guide.md
@@ -7,7 +7,7 @@
     the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
-
+    
     Unless required by applicable law or agreed to in writing, software 
     distributed under the License is distributed on an "AS IS" BASIS, 
     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -16,10 +16,11 @@
 -->
 
 # Quick Start
-This tutorial provides a quick introduction to using CarbonData.
+This tutorial provides a quick introduction to using CarbonData. To follow along with this
guide, first download a packaged release of CarbonData from the [CarbonData website](https://dist.apache.org/repos/dist/release/carbondata/). Alternatively,
it can be built by following the [Building CarbonData](https://github.com/apache/carbondata/tree/master/build)
steps.
 
 ##  Prerequisites
-* [Installation and building CarbonData](https://github.com/apache/carbondata/blob/master/build).
+* Spark 2.2.1 is installed and running. CarbonData supports Spark versions up to 2.2.1. Please
follow the steps described on the [Spark docs website](https://spark.apache.org/docs/latest) for installing
and running Spark.
+
 * Create a sample.csv file using the following commands. The CSV file is required for loading
data into CarbonData.
 
   ```
@@ -43,7 +44,7 @@ Start Spark shell by running the following command in the Spark directory:
 ```
 ./bin/spark-shell --jars <carbondata assembly jar path>
 ```
-**NOTE**: Assembly jar will be available after [building CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md)
and can be copied from `./assembly/target/scala-2.1x/carbondata_xxx.jar`
+**NOTE**: Use the path where the packaged release of CarbonData was downloaded, or the assembly jar
available after [building CarbonData](https://github.com/apache/carbondata/blob/master/build/README.md),
which can be copied from `./assembly/target/scala-2.1x/carbondata_xxx.jar`
 
 In this shell, SparkSession is readily available as `spark` and Spark context is readily
available as `sc`.
 
@@ -62,7 +63,7 @@ import org.apache.spark.sql.CarbonSession._
 val carbon = SparkSession.builder().config(sc.getConf)
              .getOrCreateCarbonSession("<hdfs store path>")
 ```
-**NOTE**: By default metastore location is pointed to `../carbon.metastore`, user can provide
own metastore location to CarbonSession like `SparkSession.builder().config(sc.getConf)
+**NOTE**: By default the metastore location points to `../carbon.metastore`; the user can provide
their own metastore location to CarbonSession like `SparkSession.builder().config(sc.getConf)
 .getOrCreateCarbonSession("<hdfs store path>", "<local metastore path>")`
 
 #### Executing Queries

http://git-wip-us.apache.org/repos/asf/carbondata/blob/67a8a37b/docs/useful-tips-on-carbondata.md
----------------------------------------------------------------------
diff --git a/docs/useful-tips-on-carbondata.md b/docs/useful-tips-on-carbondata.md
index 641a7f3..5b543d6 100644
--- a/docs/useful-tips-on-carbondata.md
+++ b/docs/useful-tips-on-carbondata.md
@@ -7,7 +7,7 @@
     the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
-
+    
     Unless required by applicable law or agreed to in writing, software 
     distributed under the License is distributed on an "AS IS" BASIS, 
     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -30,21 +30,21 @@
 
   - **Table Column Description**
 
-  | Column Name | Data Type     | Cardinality | Attribution |
-  |-------------|---------------|-------------|-------------|
-  | msisdn      | String        | 30 million  | Dimension   |
-  | BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
-  | HOST        | String        | 1 million   | Dimension   |
-  | Dime_1      | String        | 1 Thousand  | Dimension   |
-  | counter_1   | Decimal       | NA          | Measure     |
-  | counter_2   | Numeric(20,0) | NA          | Measure     |
-  | ...         | ...           | NA          | Measure     |
-  | counter_100 | Decimal       | NA          | Measure     |
+| Column Name | Data Type     | Cardinality | Attribution |
+|-------------|---------------|-------------|-------------|
+| msisdn      | String        | 30 million  | Dimension   |
+| BEGIN_TIME  | BigInt        | 10 Thousand | Dimension   |
+| HOST        | String        | 1 million   | Dimension   |
+| Dime_1      | String        | 1 Thousand  | Dimension   |
+| counter_1   | Decimal       | NA          | Measure     |
+| counter_2   | Numeric(20,0) | NA          | Measure     |
+| ...         | ...           | NA          | Measure     |
+| counter_100 | Decimal       | NA          | Measure     |
 
 
-  - **Put the frequently-used column filter in the beginning**
+  - **Put the frequently-used column filter in the beginning of SORT_COLUMNS**
 
-  For example, MSISDN filter is used in most of the query then we must put the MSISDN in
the first column.
+  For example, if the MSISDN filter is used in most of the queries, then MSISDN must be put as
the first column in the SORT_COLUMNS property.
   The create table command can be modified as suggested below :
 
   ```
@@ -62,10 +62,10 @@
 
   Now the query with MSISDN in the filter will be more efficient.
 
-  - **Put the frequently-used columns in the order of low to high cardinality**
+  - **Put the frequently-used columns in the order of low to high cardinality in SORT_COLUMNS**
 
   If the table in the specified query has multiple columns which are frequently used to filter
the results, it is suggested to put
-  the columns in the order of cardinality low to high. This ordering of frequently used columns
improves the compression ratio and
+  the columns in the order of cardinality low to high in SORT_COLUMNS configuration. This
ordering of frequently used columns improves the compression ratio and
   enhances the performance of queries with filter on these columns.
 
   For example, if MSISDN, HOST and Dime_1 are frequently-used columns, then the column order
of table is suggested as
@@ -137,20 +137,18 @@
   If you do not have much memory to use, then you may prefer to slow the speed of data loading
instead of data load failure.
   You can configure CarbonData by tuning following properties in carbon.properties file to
get a better performance.
 
-  | Parameter | Default Value | Description/Tuning |
-  |-----------|-------------|--------|
-  |carbon.number.of.cores.while.loading|Default: 2.This value should be >= 2|Specifies
the number of cores used for data processing during data loading in CarbonData. |
-  |carbon.sort.size|Default: 100000. The value should be >= 100.|Threshold to write local
file in sort step when loading data|
-  |carbon.sort.file.write.buffer.size|Default:  50000.|DataOutputStream buffer. |
-  |carbon.number.of.cores.block.sort|Default: 7 | If you have huge memory and CPUs, increase
it as you will|
-  |carbon.merge.sort.reader.thread|Default: 3 |Specifies the number of cores used for temp
file merging during data loading in CarbonData.|
-  |carbon.merge.sort.prefetch|Default: true | You may want set this value to false if you
have not enough memory|
+| Parameter | Default Value | Description/Tuning |
+|-----------|-------------|--------|
+|carbon.number.of.cores.while.loading|Default: 2. This value should be >= 2|Specifies the
number of cores used for data processing during data loading in CarbonData. |
+|carbon.sort.size|Default: 100000. The value should be >= 100.|Threshold to write local
file in sort step when loading data|
+|carbon.sort.file.write.buffer.size|Default:  50000.|DataOutputStream buffer. |
+|carbon.merge.sort.reader.thread|Default: 3 |Specifies the number of cores used for temp
file merging during data loading in CarbonData.|
+|carbon.merge.sort.prefetch|Default: true | You may want to set this value to false if you do
not have enough memory|
 
  For example, suppose 10 million records are to be loaded into a CarbonData table on a machine
with only 16 cores and 64GB of memory.
  Using the default configuration, the load always fails in the sort step. Modify carbon.properties as
suggested below:
 
   ```
-  carbon.number.of.cores.block.sort=1
   carbon.merge.sort.reader.thread=1
   carbon.sort.size=5000
   carbon.sort.file.write.buffer.size=5000
@@ -162,18 +160,18 @@
   Recently we did some performance POC on CarbonData for Finance and telecommunication Field.
It involved detailed queries and aggregation
   scenarios. After the completion of POC, some of the configurations impacting the performance
have been identified and tabulated below :
 
-  | Parameter | Location | Used For  | Description | Tuning |
-  |----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-  | carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading
| During the loading of data, local temp is used to sort the data. This number specifies the
minimum number of intermediate files after which the  merge sort has to be initiated. | Increasing
the parameter to a higher value will improve the load performance. For example, when we increase
the value from 20 to 100, it increases the data load performance from 35MB/S to more than
50MB/S. Higher values of this parameter consumes  more memory during the load. |
-  | carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading
| Specifies the number of cores used for data processing during data loading in CarbonData.
| If you have more number of CPUs, then you can increase the number of CPUs, which will increase
the performance. For example if we increase the value from 2 to 4 then the CSV reading performance
can increase about 1 times |
-  | carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading
and Querying | For minor compaction, specifies the number of segments to be merged in stage
1 and number of compacted segments to be merged in stage 2. | Each CarbonData load will create
one segment, if every load is small in size it will generate many small file over a period
of time impacting the query performance. Configuring this parameter will merge the small segment
to one big segment which will sort the data and improve the performance. For Example in one
telecommunication scenario, the performance improves about 2 times after minor compaction.
|
-  | spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number
of task started when spark shuffle. | The value can be 1 to 2 times as much as the executor
cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time
from 17 to 9 seconds. |
-  | spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf
| Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In
the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good
performance. This 2 value does not mean more the better. It needs to be configured properly
in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores
each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For
example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query
which impact the query performance very much from the 3 second to more than 15 seconds. In
this scenario need to increase the memory or decrease the CPU cores. |
-  | carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer
size to store records, returned from the block scan. | In limit scenario this parameter is
very important. For example your query limit is 1000. But if we set this value to 3000 that
means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining
are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the
performance increase about 2 times in comparison to if we set this value to 12000. |
-  | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use
YARN local directories for multi-table load disk load balance | If this is set it to true
CarbonData will use YARN local directories for multi-table load disk load balance, that will
improve the data load performance. |
-  | carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether
to use multiple YARN local directories during table data loading for disk load balance | After
enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local
directories during data load for disk load balance, that will improve the data load performance.
Please enable this property when you encounter disk hotspot problem during data loading. |
-  | carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specify
the name of compressor to compress the intermediate sort temporary files during sort procedure
in data loading. | The optional values are 'SNAPPY','GZIP','BZIP2','LZ4','ZSTD' and empty.
By default, empty means that Carbondata will not compress the sort temp files. This parameter
will be useful if you encounter disk bottleneck. |
-  | carbon.load.skewedDataOptimization.enabled | spark/carbonlib/carbon.properties | Data
loading | Whether to enable size based block allocation strategy for data loading. | When
loading, carbondata will use file size based block allocation strategy for task distribution.
It will make sure that all the executors process the same size of data -- It's useful if the
size of your input data files varies widely, say 1MB~1GB. |
-  | carbon.load.min.size.enabled | spark/carbonlib/carbon.properties | Data loading | Whether
to enable node minumun input data size allocation strategy for data loading.| When loading,
carbondata will use node minumun input data size allocation strategy for task distribution.
It will make sure the node load the minimum amount of data -- It's useful if the size of your
input data files very small, say 1MB~256MB,Avoid generating a large number of small files.
|
-  
+| Parameter | Location | Used For  | Description | Tuning |
+|----------------------------------------------|-----------------------------------|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading
| During the loading of data, local temp is used to sort the data. This number specifies the
minimum number of intermediate files after which the  merge sort has to be initiated. | Increasing
the parameter to a higher value will improve the load performance. For example, when we increase
the value from 20 to 100, it increases the data load performance from 35MB/S to more than
50MB/S. Higher values of this parameter consume more memory during the load. |
+| carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading
| Specifies the number of cores used for data processing during data loading in CarbonData.
| If more CPUs are available, increasing this value will increase the performance. For example,
if we increase the value from 2 to 4, the CSV reading performance can roughly double. |
+| carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and
Querying | For minor compaction, specifies the number of segments to be merged in stage 1
and the number of compacted segments to be merged in stage 2. | Each CarbonData load will create
one segment; if every load is small in size it will generate many small files over a period
of time, impacting the query performance. Configuring this parameter will merge the small segments
into one big segment, which will sort the data and improve the performance. For example, in one
telecommunication scenario, the performance improved about 2 times after minor compaction.
|
+| spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of
tasks started during a Spark shuffle. | The value can be 1 to 2 times as much as the executor cores.
In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from
17 to 9 seconds. |
+| spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf
| Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In
the bank scenario, providing 4 CPU cores and 15 GB of memory for each executor gave good
performance. These two values do not mean the more the better; they need to be configured properly
in case of limited resources. For example, in the bank scenario each node has enough CPU (32 cores)
but less memory (64 GB), so we cannot give more CPU with less memory: with 4 cores and 12GB
for each executor, GC sometimes occurs during the query and degrades the query performance heavily,
from 3 seconds to more than 15 seconds. In this scenario you need to increase the memory or
decrease the CPU cores. |
+| carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer
size to store records returned from the block scan. | In a limit scenario this parameter is
very important. For example, if your query limit is 1000 but this value is set to 3000, then
3000 records are fetched from the scan while Spark only takes 1000 rows, so the remaining 2000
are useless. In one finance test case, after setting it to 100, the limit 1000 scenario
performance increased about 2 times compared to setting this value to 12000. |
+| carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether to use YARN
local directories for multi-table load disk load balance | If this is set to true, CarbonData
will use YARN local directories for multi-table load disk load balance, which will improve
the data load performance. |
+| carbon.use.multiple.temp.dir | spark/carbonlib/carbon.properties | Data loading | Whether
to use multiple YARN local directories during table data loading for disk load balance | After
enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use all YARN local
directories during data load for disk load balance, which will improve the data load performance.
Please enable this property when you encounter disk hotspot problems during data loading. |
+| carbon.sort.temp.compressor | spark/carbonlib/carbon.properties | Data loading | Specify
the name of compressor to compress the intermediate sort temporary files during sort procedure
in data loading. | The optional values are 'SNAPPY','GZIP','BZIP2','LZ4','ZSTD' and empty.
By default, empty means that Carbondata will not compress the sort temp files. This parameter
will be useful if you encounter disk bottleneck. |
+| carbon.load.skewedDataOptimization.enabled | spark/carbonlib/carbon.properties | Data loading
| Whether to enable size based block allocation strategy for data loading. | When loading,
carbondata will use file size based block allocation strategy for task distribution. It will
make sure that all the executors process the same size of data -- It's useful if the size
of your input data files varies widely, say 1MB~1GB. |
+| carbon.load.min.size.enabled | spark/carbonlib/carbon.properties | Data loading | Whether
to enable the node minimum input data size allocation strategy for data loading.| When loading,
carbondata will use the node minimum input data size allocation strategy for task distribution.
It will make sure each node loads a minimum amount of data -- it's useful if the size of your
input data files is very small, say 1MB~256MB, and avoids generating a large number of small files.
|
+
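  As an illustrative sketch only (the values come from the examples in the table above and
must be tuned for your own cluster and data), the tunables could be applied like this:

```
# spark/carbonlib/carbon.properties
carbon.sort.intermediate.files.limit=100
carbon.number.of.cores.while.loading=4
carbon.use.local.dir=true

# spark/conf/spark-defaults.conf
spark.sql.shuffle.partitions 32
```
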
   Note: If your CarbonData instance is provided only for query, you may specify the property
'spark.speculation=true' which is in conf directory of spark.

http://git-wip-us.apache.org/repos/asf/carbondata/blob/67a8a37b/hadoop/src/main/java/org/apache/carbondata/hadoop/testutil/StoreCreator.java
----------------------------------------------------------------------
diff --git a/hadoop/src/main/java/org/apache/carbondata/hadoop/testutil/StoreCreator.java
b/hadoop/src/main/java/org/apache/carbondata/hadoop/testutil/StoreCreator.java
index c113228..65ab426 100644
--- a/hadoop/src/main/java/org/apache/carbondata/hadoop/testutil/StoreCreator.java
+++ b/hadoop/src/main/java/org/apache/carbondata/hadoop/testutil/StoreCreator.java
@@ -176,8 +176,6 @@ public class StoreCreator {
         new File("../hadoop/src/test/resources/data.csv").getCanonicalPath();
     File storeDir = new File(storePath);
     CarbonUtil.deleteFoldersAndFiles(storeDir);
-    CarbonProperties.getInstance().addProperty(CarbonCommonConstants.STORE_LOCATION_HDFS,
-        storePath);
 
     CarbonTable table = createTable(absoluteTableIdentifier);
     writeDictionary(factFilePath, table);

http://git-wip-us.apache.org/repos/asf/carbondata/blob/67a8a37b/integration/presto/src/test/scala/org/apache/carbondata/presto/util/CarbonDataStoreCreator.scala
----------------------------------------------------------------------
diff --git a/integration/presto/src/test/scala/org/apache/carbondata/presto/util/CarbonDataStoreCreator.scala
b/integration/presto/src/test/scala/org/apache/carbondata/presto/util/CarbonDataStoreCreator.scala
index e57312b..895d6a5 100644
--- a/integration/presto/src/test/scala/org/apache/carbondata/presto/util/CarbonDataStoreCreator.scala
+++ b/integration/presto/src/test/scala/org/apache/carbondata/presto/util/CarbonDataStoreCreator.scala
@@ -79,9 +79,6 @@ object CarbonDataStoreCreator {
       //   val factFilePath: String = new File(dataFilePath).getCanonicalPath
       val storeDir: File = new File(absoluteTableIdentifier.getTablePath)
       CarbonUtil.deleteFoldersAndFiles(storeDir)
-      CarbonProperties.getInstance.addProperty(
-        CarbonCommonConstants.STORE_LOCATION_HDFS,
-        absoluteTableIdentifier.getTablePath)
       val table: CarbonTable = createTable(absoluteTableIdentifier)
       writeDictionary(dataFilePath, table, absoluteTableIdentifier)
       val schema: CarbonDataLoadSchema = new CarbonDataLoadSchema(table)

http://git-wip-us.apache.org/repos/asf/carbondata/blob/67a8a37b/processing/src/main/java/org/apache/carbondata/processing/loading/DataLoadProcessBuilder.java
----------------------------------------------------------------------
diff --git a/processing/src/main/java/org/apache/carbondata/processing/loading/DataLoadProcessBuilder.java
b/processing/src/main/java/org/apache/carbondata/processing/loading/DataLoadProcessBuilder.java
index a628d41..225da26 100644
--- a/processing/src/main/java/org/apache/carbondata/processing/loading/DataLoadProcessBuilder.java
+++ b/processing/src/main/java/org/apache/carbondata/processing/loading/DataLoadProcessBuilder.java
@@ -207,8 +207,6 @@ public final class DataLoadProcessBuilder {
             loadModel.getTaskNo(), false, false);
     CarbonProperties.getInstance().addProperty(tempLocationKey,
         StringUtils.join(storeLocation, File.pathSeparator));
-    CarbonProperties.getInstance()
-        .addProperty(CarbonCommonConstants.STORE_LOCATION_HDFS, loadModel.getTablePath());
 
     return createConfiguration(loadModel);
   }

http://git-wip-us.apache.org/repos/asf/carbondata/blob/67a8a37b/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SingleThreadFinalSortFilesMerger.java
----------------------------------------------------------------------
diff --git a/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SingleThreadFinalSortFilesMerger.java
b/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SingleThreadFinalSortFilesMerger.java
index 646969a..5e9c28d 100644
--- a/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SingleThreadFinalSortFilesMerger.java
+++ b/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SingleThreadFinalSortFilesMerger.java
@@ -41,7 +41,6 @@ import org.apache.carbondata.core.util.CarbonProperties;
 import org.apache.carbondata.processing.loading.row.IntermediateSortTempRow;
 import org.apache.carbondata.processing.loading.sort.SortStepRowHandler;
 import org.apache.carbondata.processing.sort.exception.CarbonSortKeyAndGroupByException;
-import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
 
 public class SingleThreadFinalSortFilesMerger extends CarbonIterator<Object[]> {
   /**
@@ -61,11 +60,6 @@ public class SingleThreadFinalSortFilesMerger extends CarbonIterator<Object[]>
{
   private int fileCounter;
 
   /**
-   * fileBufferSize
-   */
-  private int fileBufferSize;
-
-  /**
    * recordHolderHeap
    */
   private AbstractQueue<SortTempFileChunkHolder> recordHolderHeapLocal;
@@ -153,16 +147,11 @@ public class SingleThreadFinalSortFilesMerger extends CarbonIterator<Object[]>
{
       LOGGER.info("No files to merge sort");
       return;
     }
-    this.fileBufferSize = CarbonDataProcessorUtil
-        .getFileBufferSize(this.fileCounter, CarbonProperties.getInstance(),
-            CarbonCommonConstants.CONSTANT_SIZE_TEN);
 
     LOGGER.info("Started Final Merge");
 
     LOGGER.info("Number of temp file: " + this.fileCounter);
 
-    LOGGER.info("File Buffer Size: " + this.fileBufferSize);
-
     // create record holder heap
     createRecordHolderQueue();
 

http://git-wip-us.apache.org/repos/asf/carbondata/blob/67a8a37b/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SortParameters.java
----------------------------------------------------------------------
diff --git a/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SortParameters.java
b/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SortParameters.java
index 502fa05..d3d538a 100644
--- a/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SortParameters.java
+++ b/processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SortParameters.java
@@ -197,14 +197,6 @@ public class SortParameters implements Serializable {
     this.complexDimColCount = complexDimColCount;
   }
 
-  public int getFileBufferSize() {
-    return fileBufferSize;
-  }
-
-  public void setFileBufferSize(int fileBufferSize) {
-    this.fileBufferSize = fileBufferSize;
-  }
-
   public int getNumberOfIntermediateFileToBeMerged() {
     return numberOfIntermediateFileToBeMerged;
   }
@@ -405,12 +397,6 @@ public class SortParameters implements Serializable {
     LOGGER.info("Number of intermediate file to be merged: " + parameters
         .getNumberOfIntermediateFileToBeMerged());
 
-    // get file buffer size
-    parameters.setFileBufferSize(CarbonDataProcessorUtil
-        .getFileBufferSize(parameters.getNumberOfIntermediateFileToBeMerged(), carbonProperties,
-            CarbonCommonConstants.CONSTANT_SIZE_TEN));
-
-    LOGGER.info("File Buffer Size: " + parameters.getFileBufferSize());
 
     String[] carbonDataDirectoryPath = CarbonDataProcessorUtil.getLocalDataFolderLocation(
         tableIdentifier.getDatabaseName(), tableIdentifier.getTableName(),
@@ -507,13 +493,6 @@ public class SortParameters implements Serializable {
     LOGGER.info("Number of intermediate file to be merged: " + parameters
         .getNumberOfIntermediateFileToBeMerged());
 
-    // get file buffer size
-    parameters.setFileBufferSize(CarbonDataProcessorUtil
-        .getFileBufferSize(parameters.getNumberOfIntermediateFileToBeMerged(), carbonProperties,
-            CarbonCommonConstants.CONSTANT_SIZE_TEN));
-
-    LOGGER.info("File Buffer Size: " + parameters.getFileBufferSize());
-
     String[] carbonDataDirectoryPath = CarbonDataProcessorUtil
         .getLocalDataFolderLocation(databaseName, tableName, taskNo, segmentId,
             isCompactionFlow, false);

http://git-wip-us.apache.org/repos/asf/carbondata/blob/67a8a37b/processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java
----------------------------------------------------------------------
diff --git a/processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java
b/processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java
index 218bac0..c2b21a6 100644
--- a/processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java
+++ b/processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java
@@ -67,29 +67,6 @@ public final class CarbonDataProcessorUtil {
   }
 
   /**
-   * Below method will be used to get the buffer size
-   *
-   * @param numberOfFiles
-   * @return buffer size
-   */
-  public static int getFileBufferSize(int numberOfFiles, CarbonProperties instance,
-      int deafultvalue) {
-    int configuredBufferSize = 0;
-    try {
-      configuredBufferSize =
-          Integer.parseInt(instance.getProperty(CarbonCommonConstants.SORT_FILE_BUFFER_SIZE));
-    } catch (NumberFormatException e) {
-      configuredBufferSize = deafultvalue;
-    }
-    int fileBufferSize = (configuredBufferSize * CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR
-        * CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR) / numberOfFiles;
-    if (fileBufferSize < CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR) {
-      fileBufferSize = CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR;
-    }
-    return fileBufferSize;
-  }
-
-  /**
    * This method will be used to delete sort temp location is it is exites
    */
   public static void deleteSortLocationIfExists(String[] locations) {
@@ -419,7 +396,6 @@ public final class CarbonDataProcessorUtil {
    * This method update the column Name
    *
    * @param schema
-   * @param tableName
    */
   public static Set<String> getSchemaColumnNames(CarbonDataLoadSchema schema) {
     Set<String> columnNames = new HashSet<String>(CarbonCommonConstants.DEFAULT_COLLECTION_SIZE);

