spark-issues mailing list archives

From "Don Drake (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17341) Can't read Parquet data with fields containing periods "."
Date Thu, 01 Sep 2016 00:03:22 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453806#comment-15453806 ]

Don Drake commented on SPARK-17341:
-----------------------------------

I just downloaded the nightly build from 8/31/2016 and gave it a try.

And it worked:

{code}
scala> inSquare.take(2)
res2: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])

scala> inSquare.show(false)
+-----+-------------+
|value|squared.value|
+-----+-------------+
|1    |1            |
|2    |4            |
|3    |9            |
|4    |16           |
|5    |25           |
+-----+-------------+
{code}
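For reference, {{inSquare}} above comes from re-running the reproduction quoted below against the nightly build. A minimal sketch of that setup, assuming a spark-shell session with the shell's implicits in scope (the backtick-escaped select at the end is just the general way to reference a dotted column name, not part of the original repro):

{code}
// Sketch of the reproduction from the issue description below,
// run in spark-shell against the nightly build.
val squaresDF = spark.sparkContext.makeRDD(1 to 5)
  .map(i => (i, i * i))
  .toDF("value", "squared.value")

// Write and read back the Parquet data; on 2.0.0 the read-side
// resolution of "squared.value" failed with an AnalysisException.
squaresDF.write.parquet("squares")
val inSquare = spark.read.parquet("squares")

// Columns whose names contain a "." can be referenced explicitly
// by escaping the name with backticks.
inSquare.select("`squared.value`").show(false)
{code}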

Thanks.

> Can't read Parquet data with fields containing periods "."
> ----------------------------------------------------------
>
>                 Key: SPARK-17341
>                 URL: https://issues.apache.org/jira/browse/SPARK-17341
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Don Drake
>
> I am porting a set of Spark 1.6.2 applications to Spark 2.0 and I have encountered a
> showstopper problem with Parquet datasets that have fields containing a "." in a field name.
> This data comes from an external provider (CSV) and we just pass through the field names.
> This has worked flawlessly in Spark 1.5 and 1.6, but now Spark can't seem to read these
> Parquet files.
> {code}
> Spark context available as 'sc' (master = local[*], app id = local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> ... (the above CodecConfig and ParquetOutputFormat lines repeat once per parallel writer)
> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory
> Scaling row group sizes to 96.54% for 7 writers
> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory
> Scaling row group sizes to 84.47% for 8 writers
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 16
> ... (flush lines repeat once per writer)
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [value] INT32: 1 values, 4B raw, 6B comp, 1 pages, encodings: [BIT_PACKED, PLAIN]
> ... (repeated once per column chunk)
> scala> val inSquare = spark.read.parquet("squares")
> inSquare: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
> scala> inSquare.take(2)
> org.apache.spark.sql.AnalysisException: Unable to resolve squared.value given [value, squared.value];
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2558)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
>   ... 48 elided
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
