spark-issues mailing list archives

From "Jon Chase (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled
Date Thu, 26 Mar 2015 13:00:54 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381831#comment-14381831
] 

Jon Chase commented on SPARK-6554:
----------------------------------

"spark.sql.parquet.filterPushdown" was the problem.  Leaving it set to false works around
the problem for now.  
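
For anyone else hitting this, a minimal sketch of the workaround (assuming an existing SQLContext/HiveContext named {{sqlContext}}; only the property name comes from this issue):

{code}
// Disable Parquet filter push-down at runtime via the Spark SQL conf:
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false");

// ...or set it on the SparkConf before the context is created:
SparkConf conf = new SparkConf().set("spark.sql.parquet.filterPushdown", "false");
{code}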

Thanks for jumping on this.

> Cannot use partition columns in where clause when Parquet filter push-down is enabled
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-6554
>                 URL: https://issues.apache.org/jira/browse/SPARK-6554
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Jon Chase
>            Assignee: Cheng Lian
>            Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, {{probeTypeId}} is a partition column, and the directory
structure looks like this:
> {noformat}
> /mydata
>     /probeTypeId=1
>         ...files...
>     /probeTypeId=2
>         ...files...
> {noformat}
> I see the column when I load a DataFrame using the /mydata directory and call {{df.printSchema()}}:
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
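> A rough sketch of the setup, for context (the exact load call and the temp-table name {{df}} are assumptions; only the path and schema shown here come from this report):
> {code}
> // Sketch only: load the partitioned Parquet data and register it for SQL queries.
> DataFrame df = sqlContext.parquetFile("/mydata");  // probeTypeId is discovered from the directory names
> df.printSchema();                                  // prints the schema above, including probeTypeId
> df.registerTempTable("df");                        // makes "select ... from df" queries possible
> {code}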
> Parquet is also aware of the column:
> {noformat}
>      optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition column.
> However, when I try to use a partition column in a where clause, I get an exception stating
that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0
failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException:
Column [probeTypeId] was not found in schema!
> 	at parquet.Preconditions.checkArgument(Preconditions.java:47)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> 	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> 	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> 	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> 	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug
attribute not set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate
appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender
as [STDOUT]
> 06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming
default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting
level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching
appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End
of configuration.
> 06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering
current configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
> INFO  org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui acls
disabled; users with view permissions: Set(jon); users with modify permissions: Set(jon)
> INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
> INFO  Remoting Starting remoting
> INFO  Remoting Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.134:62493]
> INFO  org.apache.spark.util.Utils Successfully started service 'sparkDriver' on port
62493.
> INFO  org.apache.spark.SparkEnv Registering MapOutputTracker
> INFO  org.apache.spark.SparkEnv Registering BlockManagerMaster
> INFO  o.a.spark.storage.DiskBlockManager Created local directory at /var/folders/x7/9hdp8kw9569864088tsl4jmm0000gn/T/spark-150e23b2-ff19-4a51-8cfc-25fb8e1b3f2b/blockmgr-6eea286c-7473-4bda-8886-7250156b68f4
> INFO  org.apache.spark.storage.MemoryStore MemoryStore started with capacity 1966.1 MB
> INFO  org.apache.spark.HttpFileServer HTTP File server directory is /var/folders/x7/9hdp8kw9569864088tsl4jmm0000gn/T/spark-cf4687bd-1563-4ddf-b697-21c96fd95561/httpd-6343b9c9-bb66-43ac-ac43-6da80c7a1f95
> INFO  org.apache.spark.HttpServer Starting HTTP Server
> INFO  o.spark-project.jetty.server.Server jetty-8.y.z-SNAPSHOT
> INFO  o.s.jetty.server.AbstractConnector Started SocketConnector@0.0.0.0:62494
> INFO  org.apache.spark.util.Utils Successfully started service 'HTTP file server' on
port 62494.
> INFO  org.apache.spark.SparkEnv Registering OutputCommitCoordinator
> INFO  o.spark-project.jetty.server.Server jetty-8.y.z-SNAPSHOT
> INFO  o.s.jetty.server.AbstractConnector Started SelectChannelConnector@0.0.0.0:4040
> INFO  org.apache.spark.util.Utils Successfully started service 'SparkUI' on port 4040.
> INFO  org.apache.spark.ui.SparkUI Started SparkUI at http://192.168.1.134:4040
> INFO  org.apache.spark.executor.Executor Starting executor ID <driver> on host
localhost
> INFO  org.apache.spark.util.AkkaUtils Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.134:62493/user/HeartbeatReceiver
> INFO  o.a.s.n.n.NettyBlockTransferService Server created on 62495
> INFO  o.a.spark.storage.BlockManagerMaster Trying to register BlockManager
> INFO  o.a.s.s.BlockManagerMasterActor Registering block manager localhost:62495 with
1966.1 MB RAM, BlockManagerId(<driver>, localhost, 62495)
> INFO  o.a.spark.storage.BlockManagerMaster Registered BlockManager
> INFO  o.a.h.conf.Configuration.deprecation mapred.max.split.size is deprecated. Instead,
use mapreduce.input.fileinputformat.split.maxsize
> INFO  o.a.h.conf.Configuration.deprecation mapred.reduce.tasks.speculative.execution
is deprecated. Instead, use mapreduce.reduce.speculative
> INFO  o.a.h.conf.Configuration.deprecation mapred.committer.job.setup.cleanup.needed
is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
> INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size.per.rack is deprecated.
Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
> INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size is deprecated. Instead,
use mapreduce.input.fileinputformat.split.minsize
> INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size.per.node is deprecated.
Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
> INFO  o.a.h.conf.Configuration.deprecation mapred.reduce.tasks is deprecated. Instead,
use mapreduce.job.reduces
> INFO  o.a.h.conf.Configuration.deprecation mapred.input.dir.recursive is deprecated.
Instead, use mapreduce.input.fileinputformat.input.dir.recursive
> INFO  o.a.h.hive.metastore.HiveMetaStore 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> INFO  o.a.h.hive.metastore.ObjectStore ObjectStore, initialize called
> INFO  DataNucleus.Persistence Property hive.metastore.integral.jdo.pushdown unknown -
will be ignored
> INFO  DataNucleus.Persistence Property datanucleus.cache.level2 unknown - will be ignored
> INFO  o.a.h.hive.metastore.ObjectStore Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> INFO  o.a.h.h.metastore.MetaStoreDirectSql MySQL check failed, assuming we are not on
mysql: Lexical error at line 1, column 5.  Encountered: "@" (64), after : "".
> INFO  DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MFieldSchema"
is tagged as "embedded-only" so does not have its own datastore table.
> INFO  DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MOrder"
is tagged as "embedded-only" so does not have its own datastore table.
> INFO  DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MFieldSchema"
is tagged as "embedded-only" so does not have its own datastore table.
> INFO  DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MOrder"
is tagged as "embedded-only" so does not have its own datastore table.
> INFO  DataNucleus.Query Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0"
since the connection used is closing
> INFO  o.a.h.hive.metastore.ObjectStore Initialized ObjectStore
> INFO  o.a.h.hive.metastore.HiveMetaStore Added admin role in metastore
> INFO  o.a.h.hive.metastore.HiveMetaStore Added public role in metastore
> INFO  o.a.h.hive.metastore.HiveMetaStore No user is added in admin role, since config
is empty
> INFO  o.a.h.hive.ql.session.SessionState No Tez session required at this point. hive.execution.engine=mr.
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> root
>  |-- clientMarketId: integer (nullable = true)
>  |-- clientCountryId: integer (nullable = true)
>  |-- clientRegionId: integer (nullable = true)
>  |-- clientStateId: integer (nullable = true)
>  |-- clientAsnId: integer (nullable = true)
>  |-- reporterZoneId: integer (nullable = true)
>  |-- reporterCustomerId: integer (nullable = true)
>  |-- responseCode: integer (nullable = true)
>  |-- measurementValue: integer (nullable = true)
>  |-- year: integer (nullable = true)
>  |-- month: integer (nullable = true)
>  |-- day: integer (nullable = true)
>  |-- providerOwnerZoneId: integer (nullable = true)
>  |-- providerOwnerCustomerId: integer (nullable = true)
>  |-- providerId: integer (nullable = true)
>  |-- probeTypeId: integer (nullable = true)
> ======================================================
> INFO  hive.ql.parse.ParseDriver Parsing command: select probeTypeId from df where probeTypeId
= 1 limit 1
> INFO  hive.ql.parse.ParseDriver Parse Completed
> ==== results for select probeTypeId from df where probeTypeId = 1 limit 1
> ======================================================
> INFO  o.a.s.sql.parquet.ParquetRelation2 Reading 33.33333333333333% of partitions
> INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(191336) called with curMem=0,
maxMem=2061647216
> INFO  org.apache.spark.storage.MemoryStore Block broadcast_0 stored as values in memory
(estimated size 186.9 KB, free 1966.0 MB)
> INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(27530) called with curMem=191336,
maxMem=2061647216
> INFO  org.apache.spark.storage.MemoryStore Block broadcast_0_piece0 stored as bytes in
memory (estimated size 26.9 KB, free 1965.9 MB)
> INFO  o.a.spark.storage.BlockManagerInfo Added broadcast_0_piece0 in memory on localhost:62495
(size: 26.9 KB, free: 1966.1 MB)
> INFO  o.a.spark.storage.BlockManagerMaster Updated info of block broadcast_0_piece0
> INFO  org.apache.spark.SparkContext Created broadcast 0 from NewHadoopRDD at newParquet.scala:447
> INFO  o.a.h.conf.Configuration.deprecation mapred.max.split.size is deprecated. Instead,
use mapreduce.input.fileinputformat.split.maxsize
> INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size is deprecated. Instead,
use mapreduce.input.fileinputformat.split.minsize
> INFO  o.a.s.s.p.ParquetRelation2$$anon$1$$anon$2 Using Task Side Metadata Split Strategy
> INFO  org.apache.spark.SparkContext Starting job: runJob at SparkPlan.scala:121
> INFO  o.a.spark.scheduler.DAGScheduler Got job 0 (runJob at SparkPlan.scala:121) with
1 output partitions (allowLocal=false)
> INFO  o.a.spark.scheduler.DAGScheduler Final stage: Stage 0(runJob at SparkPlan.scala:121)
> INFO  o.a.spark.scheduler.DAGScheduler Parents of final stage: List()
> INFO  o.a.spark.scheduler.DAGScheduler Missing parents: List()
> INFO  o.a.spark.scheduler.DAGScheduler Submitting Stage 0 (MapPartitionsRDD[3] at map
at SparkPlan.scala:96), which has no missing parents
> INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(5512) called with curMem=218866,
maxMem=2061647216
> INFO  org.apache.spark.storage.MemoryStore Block broadcast_1 stored as values in memory
(estimated size 5.4 KB, free 1965.9 MB)
> INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(3754) called with curMem=224378,
maxMem=2061647216
> INFO  org.apache.spark.storage.MemoryStore Block broadcast_1_piece0 stored as bytes in
memory (estimated size 3.7 KB, free 1965.9 MB)
> INFO  o.a.spark.storage.BlockManagerInfo Added broadcast_1_piece0 in memory on localhost:62495
(size: 3.7 KB, free: 1966.1 MB)
> INFO  o.a.spark.storage.BlockManagerMaster Updated info of block broadcast_1_piece0
> INFO  org.apache.spark.SparkContext Created broadcast 1 from broadcast at DAGScheduler.scala:839
> INFO  o.a.spark.scheduler.DAGScheduler Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3]
at map at SparkPlan.scala:96)
> INFO  o.a.s.scheduler.TaskSchedulerImpl Adding task set 0.0 with 1 tasks
> INFO  o.a.spark.scheduler.TaskSetManager Starting task 0.0 in stage 0.0 (TID 0, localhost,
PROCESS_LOCAL, 1687 bytes)
> INFO  org.apache.spark.executor.Executor Running task 0.0 in stage 0.0 (TID 0)
> INFO  o.a.s.s.p.ParquetRelation2$$anon$1 Input split: ParquetInputSplit{part: file:/Users/jon/Downloads/sparksql/1partitionsminusgeo/year=2015/month=1/day=14/providerOwnerZoneId=0/providerOwnerCustomerId=0/providerId=287/probeTypeId=1/part-r-00001.parquet
start: 0 end: 8851183 length: 8851183 hosts: [] requestedSchema: message root {
>   optional int32 probeTypeId;
> }
>  readSupportMetadata: {org.apache.spark.sql.parquet.row.requested_schema={"type":"struct","fields":[{"name":"probeTypeId","type":"integer","nullable":true,"metadata":{}}]},
org.apache.spark.sql.parquet.row.metadata={"type":"struct","fields":[{"name":"clientMarketId","type":"integer","nullable":true,"metadata":{}},{"name":"clientCountryId","type":"integer","nullable":true,"metadata":{}},{"name":"clientRegionId","type":"integer","nullable":true,"metadata":{}},{"name":"clientStateId","type":"integer","nullable":true,"metadata":{}},{"name":"clientAsnId","type":"integer","nullable":true,"metadata":{}},{"name":"reporterZoneId","type":"integer","nullable":true,"metadata":{}},{"name":"reporterCustomerId","type":"integer","nullable":true,"metadata":{}},{"name":"responseCode","type":"integer","nullable":true,"metadata":{}},{"name":"measurementValue","type":"integer","nullable":true,"metadata":{}},{"name":"year","type":"integer","nullable":true,"metadata":{}},{"name":"month","type":"integer","nullable":true,"metadata":{}},{"name":"day","type":"integer","nullable":true,"metadata":{}},{"name":"providerOwnerZoneId","type":"integer","nullable":true,"metadata":{}},{"name":"providerOwnerCustomerId","type":"integer","nullable":true,"metadata":{}},{"name":"providerId","type":"integer","nullable":true,"metadata":{}},{"name":"probeTypeId","type":"integer","nullable":true,"metadata":{}}]}}}
> ERROR org.apache.spark.executor.Executor Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
> 	at parquet.Preconditions.checkArgument(Preconditions.java:47) ~[parquet-common-1.6.0rc3.jar:na]
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
~[parquet-column-1.6.0rc3.jar:na]
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
~[parquet-column-1.6.0rc3.jar:na]
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
~[parquet-column-1.6.0rc3.jar:na]
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
~[parquet-column-1.6.0rc3.jar:na]
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
~[parquet-column-1.6.0rc3.jar:na]
> 	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) ~[parquet-column-1.6.0rc3.jar:na]
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
~[parquet-column-1.6.0rc3.jar:na]
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) ~[parquet-hadoop-1.6.0rc3.jar:na]
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) ~[parquet-hadoop-1.6.0rc3.jar:na]
> 	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
~[parquet-column-1.6.0rc3.jar:na]
> 	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) ~[parquet-hadoop-1.6.0rc3.jar:na]
> 	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
~[parquet-hadoop-1.6.0rc3.jar:na]
> 	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) ~[parquet-hadoop-1.6.0rc3.jar:na]
> 	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.scheduler.Task.run(Task.scala:64) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) ~[spark-core_2.10-1.3.0.jar:1.3.0]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_31]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_31]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
> WARN  o.a.spark.scheduler.TaskSetManager Lost task 0.0 in stage 0.0 (TID 0, localhost):
java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
> 	at parquet.Preconditions.checkArgument(Preconditions.java:47)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> 	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> 	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> 	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> 	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> 	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> 	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:64)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> ERROR o.a.spark.scheduler.TaskSetManager Task 0 in stage 0.0 failed 1 times; aborting
job
> INFO  o.a.s.scheduler.TaskSchedulerImpl Removed TaskSet 0.0, whose tasks have all completed,
from pool 
> INFO  o.a.s.scheduler.TaskSchedulerImpl Cancelling stage 0
> INFO  o.a.spark.scheduler.DAGScheduler Job 0 failed: runJob at SparkPlan.scala:121, took
0.132538 s
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/static,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/environment/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/environment,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
> INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs,null}
> INFO  org.apache.spark.ui.SparkUI Stopped Spark web UI at http://192.168.1.134:4040
> INFO  o.a.spark.scheduler.DAGScheduler Stopping DAGScheduler
> INFO  o.a.s.MapOutputTrackerMasterActor MapOutputTrackerActor stopped!
> INFO  org.apache.spark.storage.MemoryStore MemoryStore cleared
> INFO  o.apache.spark.storage.BlockManager BlockManager stopped
> INFO  o.a.spark.storage.BlockManagerMaster BlockManagerMaster stopped
> INFO  o.a.s.s.OutputCommitCoordinator$OutputCommitCoordinatorActor OutputCommitCoordinator
stopped!
> INFO  org.apache.spark.SparkContext Successfully stopped SparkContext
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0
failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException:
Column [probeTypeId] was not found in schema!
> 	at parquet.Preconditions.checkArgument(Preconditions.java:47)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> 	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> 	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> 	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> 	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> 	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> 	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:64)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> 	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> 	at scala.Option.foreach(Option.scala:236)
> 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
> 	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Mar 26, 2015 6:06:02 AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate:
eq(probeTypeId, 1)
> Mar 26, 2015 6:06:02 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize
counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> Mar 26, 2015 6:06:02 AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate:
eq(probeTypeId, 1)
> Process finished with exit code 255
> {noformat}




