spark-user mailing list archives

From Horváth Péter Gergely <horvath.peter.gerg...@gmail.com>
Subject Re: Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided
Date Wed, 13 Feb 2019 16:08:18 GMT
Hi Chris,

Thank you for the input, I know I can always write the table DDL manually.

But here I would like to rely on Spark generating the schema. What I don't
understand is the change in the behaviour of Spark: having the storage path
specified does not necessarily mean it should be an external table.

Is there any way to control/override this?

Thanks,
Peter


On Wed, Feb 13, 2019, 13:09 Chris Teoh <chris.teoh@gmail.com> wrote:

> Hey there,
>
> Could you not just create a managed table using the DDL in Spark SQL and
> then write the data frame to the underlying folder, or use Spark SQL to do
> an insert?
>
> Alternatively, try CREATE TABLE AS SELECT. IIRC, Hive creates managed tables
> this way.
>
> I've not confirmed this works but I think that might be worth trying.
>
> I hope that helps.
>
> Kind regards
> Chris
>
> On Wed., 13 Feb. 2019, 10:44 pm Horváth Péter Gergely, <
> horvath.peter.gergely@gmail.com> wrote:
>
>> Dear All,
>>
>> I am facing a strange issue with Spark 2.3, where I would like to create
>> a MANAGED table out of the content of a DataFrame with the storage path
>> overridden.
>>
>> Apparently, when one tries to create a Hive table via
>> DataFrameWriter.saveAsTable, supplying a "path" option causes Spark to
>> automatically create an external table.
>>
>> This demonstrates the behaviour:
>>
>> scala> val numbersDF = sc.parallelize((1 to 100).toList).toDF("numbers")
>> numbersDF: org.apache.spark.sql.DataFrame = [numbers: int]
>>
>> scala> numbersDF.write.format("orc").saveAsTable("numbers_table1")
>>
>> scala> spark.sql("describe formatted numbers_table1").filter(_.get(0).toString == "Type").show
>> +--------+---------+-------+
>> |col_name|data_type|comment|
>> +--------+---------+-------+
>> |    Type|  MANAGED|       |
>> +--------+---------+-------+
>>
>>
>> scala> numbersDF.write.format("orc").option("path",
>> "/user/foobar/numbers_table_data").saveAsTable("numbers_table2")
>>
>> scala> spark.sql("describe formatted numbers_table2").filter(_.get(0).toString == "Type").show
>> +--------+---------+-------+
>> |col_name|data_type|comment|
>> +--------+---------+-------+
>> |    Type| EXTERNAL|       |
>> +--------+---------+-------+
>>
>>
>>
>> I am wondering if there is any way to force creation of a managed table
>> with a custom path (which as far as I know, should be possible via standard
>> Hive commands).
>>
>> I often cannot find the appropriate documentation for the configuration
>> options of Spark APIs. Could someone please point me in the right
>> direction and tell me where these things are documented?
>>
>> Thanks,
>> Peter
>>
>>
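
The two workarounds suggested in the thread (create the table through DDL and then insert, or use CREATE TABLE AS SELECT) could be sketched roughly as follows. This is a minimal, unverified sketch, not a confirmed solution: the names numbers_table3, numbers_table4 and numbers_view are hypothetical, a Hive-enabled SparkSession is assumed, and neither variant sets a custom storage path. The thread does not confirm whether a custom path can be combined with a MANAGED table type.

```scala
import org.apache.spark.sql.SparkSession

object ManagedTableSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: Hive support is available on the classpath.
    val spark = SparkSession.builder()
      .appName("managed-table-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the thread.
    val numbersDF = (1 to 100).toList.toDF("numbers")

    // Variant 1: declare the table via DDL (no path given), then insert.
    // numbers_table3 is a hypothetical name.
    spark.sql("CREATE TABLE IF NOT EXISTS numbers_table3 (numbers INT) USING orc")
    // insertInto resolves columns by position, not by name;
    // rerunning this appends the rows again.
    numbersDF.write.mode("append").insertInto("numbers_table3")

    // Variant 2: CREATE TABLE AS SELECT from a temporary view.
    // numbers_view and numbers_table4 are hypothetical names.
    numbersDF.createOrReplaceTempView("numbers_view")
    spark.sql(
      "CREATE TABLE IF NOT EXISTS numbers_table4 USING orc AS SELECT numbers FROM numbers_view")

    // Inspect the resulting table type the same way the thread does.
    Seq("numbers_table3", "numbers_table4").foreach { table =>
      spark.sql(s"describe formatted $table")
        .filter(_.get(0).toString == "Type")
        .show()
    }

    spark.stop()
  }
}
```

Both variants leave the data in the default warehouse location, so they only address the table type, not the custom path asked about in the original question.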
