spark-issues mailing list archives

From "Tomas Bartalos (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26841) Timestamp pushdown on Kafka table
Date Fri, 08 Feb 2019 15:30:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763697#comment-16763697
] 

Tomas Bartalos commented on SPARK-26841:
----------------------------------------

Thank you for letting me know. I've filed the PR. It addresses only SQL queries, not
streaming queries.

It seems we're trying to achieve similar behaviour. The difference is that you introduce the
timestamp restriction at table creation time, while my solution applies it at table query time.

> Timestamp pushdown on Kafka table
> ---------------------------------
>
>                 Key: SPARK-26841
>                 URL: https://issues.apache.org/jira/browse/SPARK-26841
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.4.0
>            Reporter: Tomas Bartalos
>            Priority: Major
>              Labels: Kafka, pushdown, timestamp
>
> As a Spark user I'd like to have fast queries on a Kafka table restricted by timestamp.
> I'd like to have quick answers to questions like:
>  * What was inserted into Kafka in the past x minutes
>  * What was inserted into Kafka in a specified time range
> Example:
> {quote}select * from kafka_table where timestamp > from_unixtime(unix_timestamp() - 5 * 60, "yyyy-MM-dd HH:mm:ss")
> select * from kafka_table where timestamp > $from_time and timestamp < $end_time
> {quote}
> Currently timestamp restrictions are not pushed down to KafkaRelation, so querying by timestamp
> on a large Kafka topic takes a very long time to complete.
> *Technical solution*
> Technically it's possible to retrieve Kafka offsets for a given timestamp with the
> org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..) method. Afterwards we can
> query the Kafka topic by the retrieved offset ranges.
> Querying by offset range is already implemented, so this change should have minor impact.
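
The lookup described under "Technical solution" can be sketched roughly as follows. This is not the code from the PR and it does not call Kafka. It is a minimal in-memory model of the documented offsetsForTimes behaviour (for each partition, return the earliest offset whose timestamp is greater than or equal to the requested one, or nothing if no such record exists). The names offset_for_time and offset_ranges, the topic layout, and the assumption that timestamps never decrease within a partition are all hypothetical and exist only for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a Kafka topic: partition id -> record timestamps
# in milliseconds, where the list index plays the role of the offset.
# Assumption: timestamps are non-decreasing within each partition.
Topic = Dict[int, List[int]]


def offset_for_time(timestamps: List[int], target_ms: int) -> Optional[int]:
    """Earliest offset whose timestamp is >= target_ms, or None if there is none."""
    idx = bisect_left(timestamps, target_ms)
    return idx if idx < len(timestamps) else None


def offset_ranges(topic: Topic, from_ms: int, to_ms: int) -> Dict[int, Tuple[int, int]]:
    """Translate the timestamp range [from_ms, to_ms) into per-partition
    offset ranges [start, end). Partitions with no matching records are skipped."""
    ranges: Dict[int, Tuple[int, int]] = {}
    for partition, timestamps in topic.items():
        start = offset_for_time(timestamps, from_ms)
        if start is None:
            continue  # nothing at or after from_ms in this partition
        end = offset_for_time(timestamps, to_ms)
        if end is None:
            end = len(timestamps)  # range runs to the current end of the partition
        if start < end:
            ranges[partition] = (start, end)
    return ranges


if __name__ == "__main__":
    topic = {0: [100, 200, 300, 400], 1: [150, 250], 2: [10, 20]}
    print(offset_ranges(topic, 200, 400))
```

Only the offsets inside the returned ranges would then need to be read. Because the lookup is inclusive at the lower bound, a strict predicate such as timestamp > $from_time would still have to be applied to the fetched rows.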



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

