spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Teoh <chris.t...@gmail.com>
Subject Re: Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided
Date Wed, 13 Feb 2019 20:29:53 GMT
Thanks Peter.

I'm not sure if that is possible yet. The closest I can think of to
achieving what you want is to try something like:-
df.registerTempTable("mytable")
sql("create table mymanagedtable as select * from mytable")

I haven't used CTAS in Spark SQL before but have heard it works. This would
infer the schema for you and from what I have heard CTAS creates managed
tables.

Let me know if this works for you.

Kind Regards
Chris

On Thu, 14 Feb 2019 at 03:08 Horváth Péter Gergely <
horvath.peter.gergely@gmail.com> wrote:

> Hi Chris,
>
> Thank you for the input, I know I can always write the table DDL manually.
>
> But here I would like to rely on Spark generating the schema. What I don't
> understand is the change in the behaviour of Spark: having the storage path
> specified does not necessarily mean it should be an external table.
>
> Is there any way to control/override this?
>
> Thanks,
> Peter
>
>
> On Wed, Feb 13, 2019, 13:09 Chris Teoh <chris.teoh@gmail.com wrote:
>
>> Hey there,
>>
>> Could you not just create a managed table using the DDL in Spark SQL and
>> then written the data frame to the underlying folder or use Spark SQL to do
>> an insert?
>>
>> Alternatively try create table as select. Iirc hive creates managed
>> tables this way.
>>
>> I've not confirmed this works but I think that might be worth trying.
>>
>> I hope that helps.
>>
>> Kind regards
>> Chris
>>
>> On Wed., 13 Feb. 2019, 10:44 pm Horváth Péter Gergely, <
>> horvath.peter.gergely@gmail.com> wrote:
>>
>>> Dear All,
>>>
>>> I am facing a strange issue with Spark 2.3, where I would like to create
>>> a MANAGED table out of the content of a DataFrame with the storage path
>>> overridden.
>>>
>>> Apparently, when one tries to create a Hive table via
>>> DataFrameWriter.saveAsTable, supplying a "path" option causes Spark to
>>> automatically create an external table.
>>>
>>> This demonstrates the behaviour:
>>>
>>> scala> val numbersDF = sc.parallelize((1 to 100).toList).toDF("numbers")
>>> numbersDF: org.apache.spark.sql.DataFrame = [numbers: int]
>>>
>>> scala> numbersDF.write.format("orc").saveAsTable("numbers_table1")
>>>
>>> scala> spark.sql("describe formatted
>>> numbers_table1").filter(_.get(0).toString == "Type").show
>>> +--------+---------+-------+
>>> |col_name|data_type|comment|
>>> +--------+---------+-------+
>>> |    Type|  MANAGED|       |
>>> +--------+---------+-------+
>>>
>>>
>>> scala> numbersDF.write.format("orc").option("path",
>>> "/user/foobar/numbers_table_data").saveAsTable("numbers_table2")
>>>
>>> scala> spark.sql("describe formatted
>>> numbers_table2").filter(_.get(0).toString == "Type").show
>>> +--------+---------+-------+
>>> |col_name|data_type|comment|
>>> +--------+---------+-------+
>>> |    Type| EXTERNAL|       |
>>> +--------+---------+-------+
>>>
>>>
>>>
>>> I am wondering if there is any way to force creation of a managed table
>>> with a custom path (which as far as I know, should be possible via standard
>>> Hive commands).
>>>
>>> I often seem to have the problem that I cannot find the appropriate
>>> documentation for the option configuration of Spark APIs. Could someone
>>> please point me to the right direction and tell me where these things are
>>> documented?
>>>
>>> Thanks,
>>> Peter
>>>
>>>

Mime
View raw message