spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: Using ORC input for mllib algorithms
Date Fri, 27 Mar 2015 18:07:33 GMT
There is a PR in review to support ORC via the SQL data source API:
https://github.com/apache/spark/pull/3753. You can try pulling that PR
and help test it. -Xiangrui

On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth <toth.zsolt.bme@gmail.com> wrote:
> Hi,
>
> I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class,
> OrcStruct.class) to use data in ORC format as an RDD. I ran some
> benchmarks of ORC input vs Text input for MLlib and ran into a few
> issues with ORC.
> Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor memory, 2g
> executor memoryOverhead, 1g driver memory. The cluster nodes have sufficient
> resources for the setup.
>
> Logistic regression: When using 1GB ORC input (stored in 4 blocks on HDFS),
> only one block (25%) is cached and only one executor is used. However, the
> whole RDD can be cached when the same data is read as a text file (around
> 5.5GB). Is it possible to make Spark use the available resources?
>
> Decision tree: Using 8GB ORC input, the job fails every time with the "Size
> exceeds Integer.MAX_VALUE" error. In addition, I see errors in the logs
> saying that the "container is running beyond physical memory limits". Is it
> possible to avoid this when using the ORC input format? I tried setting
> min.split.size/max.split.size and dfs.blocksize, but that didn't help.
>
> Again, none of these happen when using Text input.
>
> Cheers,
> Zsolt
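
A note on the "Size exceeds Integer.MAX_VALUE" error quoted above: it usually means a single cached or shuffled block has grown past 2 GB, which can happen when a few large ORC splits expand into very large partitions. The sketch below is only back-of-the-envelope arithmetic for choosing a partition count. PartitionEstimate and minPartitions are hypothetical names, not part of Spark, and the 5.5x expansion factor is an assumption borrowed from the 1GB ORC vs 5.5GB text figures in the message; the real in-memory size may differ.

```java
// Hypothetical helper, not part of Spark: estimates how many partitions are
// needed so that no single partition exceeds a chosen size.
final class PartitionEstimate {
    // A single block cannot hold more than Integer.MAX_VALUE bytes (about 2 GiB).
    static final long MAX_BLOCK_BYTES = Integer.MAX_VALUE;

    /**
     * @param inputBytes       on-disk size of the input
     * @param expansion        assumed ratio of expanded size to on-disk size
     * @param targetBlockBytes desired upper bound per partition, at most 2 GiB
     * @return the smallest partition count that keeps each partition under the target
     */
    static int minPartitions(long inputBytes, double expansion, long targetBlockBytes) {
        if (inputBytes < 0 || expansion <= 0
                || targetBlockBytes <= 0 || targetBlockBytes > MAX_BLOCK_BYTES) {
            throw new IllegalArgumentException("invalid size, expansion, or target");
        }
        double expandedBytes = inputBytes * expansion;
        long needed = (long) Math.ceil(expandedBytes / targetBlockBytes);
        return (int) Math.max(1L, needed);
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        long target = 512L << 20; // 512 MiB per partition, an arbitrary safety margin
        // 8 GiB of ORC input with the assumed 5.5x expansion.
        System.out.println(minPartitions(8 * gib, 5.5, target));
    }
}
```

If the estimate is far above the number of input splits, calling repartition(n) on the RDD returned by sc.hadoopFile before caching is one way to spread the data across more, smaller partitions, at the cost of a shuffle.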


