spark-dev mailing list archives

From Brian Hong <sungjinh...@devsisters.com>
Subject Re: [Spark SQL] Making InferSchema and JacksonParser public
Date Thu, 19 Jan 2017 03:40:18 GMT
Yes, that is the option I took when implementing this under Spark 1.4.  But
every time there is a major update in Spark, I need to re-copy the needed
parts, which is very time-consuming.

The reason is that InferSchema and JacksonParser use many other Spark
internal methods, which makes them very hard to copy and maintain.

Thanks!

On Thu, Jan 19, 2017 at 2:41 AM Reynold Xin <rxin@databricks.com> wrote:

> That is internal, but the amount of code is not a lot. Can you just copy
> the relevant classes over to your project?
>
> On Wed, Jan 18, 2017 at 5:52 AM Brian Hong <sungjinhong@devsisters.com>
> wrote:
>
> I work for a mobile game company. I'm solving a simple question: "Can we
> efficiently/cheaply query the logs of a particular user within a given
> date period?"
>
> I've created a special JSON text-based file format that has these traits:
>  - Snappy compressed, saved in AWS S3
>  - Partitioned by date, e.g. 2017-01-01.sz, 2017-01-02.sz, ...
>  - Sorted by a primary key (log_type) and a secondary key (user_id),
> then Snappy-compressed in 5 MB blocks
>  - Blocks are indexed by primary/secondary key in a file such as
> 2017-01-01.json
>  - Efficient block based random access on primary key (log_type) and
> secondary key (user_id) using the index
>
> I've created a Spark SQL DataFrame relation that can query this file
> format.  Since the schema of each log type is fairly consistent, I've
> reused the `InferSchema.inferSchema` method and `JacksonParser` in the
> Spark SQL code to support structured querying.  I've also implemented
> filter push-down to optimize the file access.
>
> Compared to the Parquet file format, it is very fast when querying for a
> single user, or for a single log type with a sampling ratio of 10000 to 1.
> (We do use Parquet for some log types when we need batch analysis.)
>
> One of the problems we face is that the methods we use above are private
> APIs.  So we are forced to resort to hacks to call them (things like
> copying the code or placing our code in the org.apache.spark.sql package
> namespace).
>
> I've been following the Spark SQL code since 1.4, and the JSON schema
> inference code and JacksonParser seem to have been relatively stable
> recently.  Can the core devs make these APIs public?
>
> We are willing to open source this file format because it is
> excellent for archiving user-related logs in S3.  The dependency on
> private APIs in Spark SQL is the main hurdle in making this a reality.
>
> Thank you for reading!
>
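
To illustrate the block index described in the quoted message, here is a minimal sketch of how a lookup on (log_type, user_id) might select the compressed blocks to fetch. It is not the actual implementation: the `BlockEntry` fields, the `blocks_for` helper, and the idea that the index stores the first key of each block are all assumptions made for illustration, since the real layout of the index file is not given in the thread.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple, Optional, Tuple


class BlockEntry(NamedTuple):
    """Hypothetical index entry; the real index file layout is not known."""
    first_key: Tuple[str, str]  # (log_type, user_id) of the block's first record
    offset: int                 # byte offset of the compressed block in the .sz file
    length: int                 # compressed length of the block in bytes


def blocks_for(index: List[BlockEntry], log_type: str,
               user_id: Optional[str] = None) -> List[BlockEntry]:
    """Return the blocks that may contain matching records.

    Assumes `index` is sorted by first_key, mirroring the sort order of the
    records themselves. The result is conservative: it can include one block
    that turns out to hold no match, but it never skips a matching block.
    """
    first_keys = [entry.first_key for entry in index]
    if user_id is None:
        # Cover every user_id of this log_type. The upper sentinel assumes
        # user IDs are strings that sort below U+10FFFF.
        lo_key = (log_type, "")
        hi_key = (log_type, "\U0010ffff")
    else:
        lo_key = hi_key = (log_type, user_id)

    # The block just before the first one starting at or after lo_key may
    # still end with records for lo_key, so step back by one.
    start = max(bisect_left(first_keys, lo_key) - 1, 0)
    # Blocks whose first key is greater than hi_key cannot contain a match.
    end = bisect_right(first_keys, hi_key)
    return index[start:end]


# Example index for one date partition (made-up values).
example_index = [
    BlockEntry(("login", "u001"), 0, 100),
    BlockEntry(("login", "u500"), 100, 120),
    BlockEntry(("purchase", "u001"), 220, 90),
    BlockEntry(("purchase", "u300"), 310, 80),
]
```

With such an index, an equality filter on both keys narrows the read to one or two byte ranges, which could then be fetched with ranged S3 reads and decompressed individually. A filter on log_type alone selects a contiguous run of blocks.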
