phoenix-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-2648) Phoenix Spark Integration does not allow Dynamic Columns to be mapped
Date Thu, 25 Aug 2016 12:21:20 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436743#comment-15436743 ]

ASF GitHub Bot commented on PHOENIX-2648:
-----------------------------------------

GitHub user xiaopeng-liao opened a pull request:

    https://github.com/apache/phoenix/pull/196

    [PHOENIX-2648] Add dynamic column support for spark integration

    It supports both RDD and DataFrame read/write.
    Things needing consideration
    ======
    When loading from a DataFrame, the Catalyst data types need to be converted to Phoenix types, e.g. StringType to VARCHAR and Array<Integer> to INTEGER_ARRAY. The code is under phoenix-spark/src/main/scala/org.apache.phoenix.spark.DataFrameFunctions.scala
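
    The conversion described above might be sketched roughly as follows; the object and method names are illustrative assumptions, not the PR's actual code, which matches on Catalyst `DataType` instances rather than type-name strings.

    ```scala
    // Illustrative sketch only: mapping Catalyst type names to Phoenix SQL types.
    // The PR's real conversion lives in DataFrameFunctions.scala; these names are assumptions.
    object CatalystToPhoenixSketch {
      def phoenixTypeFor(catalystType: String): String = catalystType match {
        case "StringType"                  => "VARCHAR"
        case "IntegerType"                 => "INTEGER"
        case "LongType"                    => "BIGINT"
        case "ArrayType(IntegerType,true)" => "INTEGER_ARRAY"
        case other =>
          throw new IllegalArgumentException(s"No Phoenix mapping for $other")
      }
    }
    ```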
    
    Usages
    =======
    - **RDD**
    
    **Save**
    ```
    val dataSet = List((1L, "1", 1, 1), (2L, "2", 2, 2), (3L, "3", 3, 3))
    sc
      .parallelize(dataSet)
      .saveToPhoenix(
        "OUTPUT_TEST_TABLE",
        Seq("ID", "COL1", "COL2", "COL4<INTEGER"),
        hbaseConfiguration
      )
    ```
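
    The `Seq` above mixes plain column names with `NAME<TYPE` dynamic-column specifiers; splitting such a specifier could be sketched like this (the object name is an illustrative assumption, not the PR's code):

    ```scala
    // Illustrative sketch: splitting a NAME<TYPE dynamic-column specifier.
    // A plain column name contains no '<', so its type part is None.
    object DynamicColumnSpecSketch {
      def parse(spec: String): (String, Option[String]) =
        spec.split("<", 2) match {
          case Array(name, tpe) => (name, Some(tpe))
          case Array(name)      => (name, None)
        }
    }
    ```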
    
    **Read**
    ```
    val columnNames = Seq("ID", "COL1", "COL2", "COL5<INTEGER")
    // Load the results back
    val loaded = sc.phoenixTableAsRDD(
      "OUTPUT_TEST_TABLE", columnNames,
      conf = hbaseConfiguration
    )
    ```
    
    - **Dataframe**
    
    **Save**
    It reads the data types from the DataFrame schema and converts them to Phoenix-supported types.
    ```
    val dataSet = List((1L, "1", 1, 1, "2"), (2L, "2", 2, 2, "3"), (3L, "3", 3, 3, "4"))
    sc
      .parallelize(dataSet).toDF("ID", "COL1", "COL2", "COL6", "COL7")
      .saveToPhoenix("OUTPUT_TEST_TABLE", zkUrl = Some(quorumAddress))
    
    **Read**
    ```
    val df1 = sqlContext.phoenixTableAsDataFrame("OUTPUT_TEST_TABLE",
      Array("ID", "COL1", "COL6<INTEGER", "COL7<VARCHAR"), conf = hbaseConfiguration)
    ```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xiaopeng-liao/phoenix phoenix-addsparkdynamic

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/phoenix/pull/196.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #196
    
----
commit a2dc6101d96333f781ff9e905c47c035f8b89462
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-17T12:13:58Z

    add dynamic column support for SPARK rdd

commit 6969287db5ea341bc3876af55f7d0ef3acb035c2
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-18T09:46:38Z

    add dynamic column support for reading from PhoenixRDD.

commit 5688b6c90c66b02cc22fcac6e67b9712d7eb660e
Author: xiaopeng-liao <xp.em.liao@gmail.com>
Date:   2016-08-19T14:52:27Z

    Merge pull request #1 from apache/master
    
    merge in latest changes from phoenix

commit a9b217e55393f613e9ca168faccd93e7626c7324
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-23T10:51:34Z

    [PHOENIX-2648] add support for dynamic columns for RDD and Dataframe

commit 51190865375397581cbd1d6b960c79be7d727b97
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-23T10:52:27Z

    Merge branch 'phoenix-addsparkdynamic' of https://github.com/xiaopeng-liao/phoenix into phoenix-addsparkdynamic

commit 6cbd6314782a6eb1a4c69eae25371791e4d64f90
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-23T13:00:55Z

    Remove the configuration for enable dynamic column as it is not used anyway

commit 8602554c875229f376499c082894cc33999f3e7b
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-23T15:01:29Z

    More clean up, remove the configuration for dynamic column

commit d3a4f1575f4b376df32f6d28aeba14270ce58088
Author: xiaopeng liao <xiaopeng liao>
Date:   2016-08-25T08:44:47Z

    [PHOENIX-2648] change dynamic column format from COL:DataType to COL<DataType because it conflicts with index syntax

----


> Phoenix Spark Integration does not allow Dynamic Columns to be mapped
> ---------------------------------------------------------------------
>
>                 Key: PHOENIX-2648
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2648
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.6.0
>         Environment: phoenix-spark-4.6.0-HBase-0.98  , spark-1.5.0-bin-hadoop2.4
>            Reporter: Suman Datta
>              Labels: patch, phoenixTableAsRDD, spark
>             Fix For: 4.6.0
>
>
> I am using spark-1.5.0-bin-hadoop2.4 and phoenix-spark-4.6.0-HBase-0.98 to load Phoenix tables on HBase into Spark RDDs. Using the steps in https://phoenix.apache.org/phoenix_spark.html, I can successfully map standard columns in a table to a Phoenix RDD.
> But my table has some important dynamic columns (https://phoenix.apache.org/dynamic_columns.html) which are not getting mapped to the Spark RDD in this process (using sc.phoenixTableAsRDD).
> This is proving to be a showstopper for using Phoenix with Spark.
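
For context, Phoenix's dynamic-column syntax (described on the dynamic_columns.html page linked above) looks roughly like this; the table and column names are illustrative:

```sql
-- Upsert with a dynamic column declared inline after the fixed columns:
UPSERT INTO EventLog (eventId, eventTime, lastGCTime TIME)
VALUES (1, CURRENT_TIME(), CURRENT_TIME());

-- A select must re-declare the dynamic column and its type:
SELECT eventId, lastGCTime FROM EventLog(lastGCTime TIME);
```

Because the column and its type exist only per-statement, any Spark integration has to carry that type information itself, which is what the `COL<TYPE` specifiers in this PR provide.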



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
