spark-issues mailing list archives

From "Jianshi Huang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource
Date Wed, 25 Mar 2015 16:02:53 GMT

     [ https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jianshi Huang updated SPARK-6533:
---------------------------------
    Description: 
If spark.sql.parquet.useDataSourceApi is not set to false (which is the default), loading Parquet files using a file pattern throws errors.

*\*Wildcard*
{noformat}
scala> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot
be used because libhadoop cannot be loaded.
java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0*
  at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
  at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
  at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
  at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
  at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388)
  at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
{noformat}

And

*\[abc\]*
{noformat}
val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12]
  at java.net.URI.create(URI.java:859)
  at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
  at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
  at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388)
  at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
  ... 49 elided
Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12]
  at java.net.URI$Parser.fail(URI.java:2829)
  at java.net.URI$Parser.checkChars(URI.java:3002)
  at java.net.URI$Parser.parseHierarchical(URI.java:3086)
  at java.net.URI$Parser.parse(URI.java:3034)
  at java.net.URI.<init>(URI.java:595)
  at java.net.URI.create(URI.java:857)
{noformat}

If spark.sql.parquet.useDataSourceApi is disabled we cannot have partition discovery, schema evolution, etc., but being able to specify a file pattern is also very important to applications.

Please add this important feature.
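Until this is supported natively, one possible workaround (an untested sketch, assuming the standard Hadoop FileSystem API; parquetGlob is a hypothetical helper, not part of Spark) is to expand the glob ourselves with Hadoop's FileSystem.globStatus and pass the concrete paths to parquetFile, which already accepts multiple paths:

{noformat}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: expand the glob with Hadoop's FileSystem.globStatus,
// then load the matched directories explicitly via parquetFile(paths: _*).
def parquetGlob(sqlContext: SQLContext, pattern: String): DataFrame = {
  val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
  // globStatus returns null when nothing matches, so guard against it.
  val matches = Option(fs.globStatus(new Path(pattern)))
    .getOrElse(Array.empty[FileStatus])
  val paths = matches.map(_.getPath.toString)
  sqlContext.parquetFile(paths: _*)
}
{noformat}

This sidesteps both failures above, since the paths handed to the data source contain no glob characters.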

Jianshi


     Issue Type: Improvement  (was: Bug)
        Summary: Allow using wildcard and other file pattern in Parquet DataSource  (was:
Cannot use wildcard and other file pattern in sqlContext.parquetFile if spark.sql.parquet.useDataSourceApi
is not set to false)

> Allow using wildcard and other file pattern in Parquet DataSource
> -----------------------------------------------------------------
>
>                 Key: SPARK-6533
>                 URL: https://issues.apache.org/jira/browse/SPARK-6533
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1
>            Reporter: Jianshi Huang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
