spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <mich...@databricks.com>
Subject Re: Dataframe schema...
Date Thu, 20 Oct 2016 08:46:08 GMT
What is the issue you see when unioning?

On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar <babloo80@gmail.com> wrote:

> Hello Michael,
>
> Thank you for looking into this query. In my case there seem to be an
> issue when I union a parquet file read from disk versus another dataframe
> that I construct in-memory. The only difference I see is the containsNull =
> true. In fact, I do not see any errors with union on the simple schema of
> "col1 thru col4" above. But the problem seem to exist only on that
> "some_histogram" column which contains the mixed containsNull = true/false.
> Let me know if this helps.
>
> Thanks,
> Muthu
>
>
>
> On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust <michael@databricks.com>
> wrote:
>
>> Nullable is just a hint to the optimizer that its impossible for there to
>> be a null value in this column, so that it can avoid generating code for
>> null-checks.  When in doubt, we set nullable=true since it is always safer
>> to check.
>>
>> Why in particular are you trying to change the nullability of the column?
>>
>> On Wed, Oct 19, 2016 at 6:07 PM, Muthu Jayakumar <babloo80@gmail.com>
>> wrote:
>>
>>> Hello there,
>>>
>>> I am trying to understand how and when does DataFrame (or Dataset) sets
>>> nullable = true vs false on a schema.
>>>
>>> Here is my observation from a sample code I tried...
>>>
>>>
>>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c",
>>> 2.0d))).toDF("col1", "col2", "col3").withColumn("col4",
>>> lit("bla")).printSchema()
>>> root
>>>  |-- col1: integer (nullable = false)
>>>  |-- col2: string (nullable = true)
>>>  |-- col3: double (nullable = false)
>>>  |-- col4: string (nullable = false)
>>>
>>>
>>> scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c",
>>> 2.0d))).toDF("col1", "col2", "col3").withColumn("col4",
>>> lit("bla")).write.parquet("/tmp/sample.parquet")
>>>
>>> scala> spark.read.parquet("/tmp/sample.parquet").printSchema()
>>> root
>>>  |-- col1: integer (nullable = true)
>>>  |-- col2: string (nullable = true)
>>>  |-- col3: double (nullable = true)
>>>  |-- col4: string (nullable = true)
>>>
>>>
>>> The place where this seem to get me into trouble is when I try to union
>>> one data-structure from in-memory (notice that in the below schema the
>>> highlighted element is represented as 'false' for in-memory created schema)
>>> and one from file that starts out with a schema like below...
>>>
>>>  |-- some_histogram: struct (nullable = true)
>>>  |    |-- values: array (nullable = true)
>>>  |    |    |-- element: double (containsNull = true)
>>>  |    |-- freq: array (nullable = true)
>>>  |    |    |-- element: long (containsNull = true)
>>>
>>> Is there a way to convert this attribute from true to false without
>>> running any mapping / udf on that column?
>>>
>>> Please advice,
>>> Muthu
>>>
>>
>>
>

Mime
View raw message