spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gabor Somogyi <gabor.g.somo...@gmail.com>
Subject Re: Structured streaming from Kafka by timestamp
Date Thu, 24 Jan 2019 18:15:00 GMT
Hi Tomas,

As a general note don't fully understand your use-case. You've mentioned
structured streaming but your query is more like a one-time SQL statement.
Kafka doesn't support predicates how it's integrated with spark. What can
be done from spark perspective is to look for an offset for a specific
lowest timestamp and start the reading from there.

BR,
G


On Thu, Jan 24, 2019 at 6:38 PM Tomas Bartalos <tomas.bartalos@gmail.com>
wrote:

> Hello,
>
> I'm trying to read Kafka via spark structured streaming. I'm trying to
> read data within specific time range:
>
> select count(*) from kafka_table where timestamp > cast('2019-01-23 1:00'
> as TIMESTAMP) and timestamp < cast('2019-01-23 1:01' as TIMESTAMP);
>
>
> The problem is that timestamp query is not pushed-down to Kafka, so Spark
> tries to read the whole topic from beginning.
>
>
> explain query:
>
> ....
>
>          +- *(1) Filter ((isnotnull(timestamp#57) && (timestamp#57 >
> 1535148000000000)) && (timestamp#57 < 1535234400000000))
>
>
> Scan KafkaRelation(strategy=Subscribe[keeper.Ticket.avro.v1---production],
> start=EarliestOffsetRangeLimit, end=LatestOffsetRangeLimit)
> [key#52,value#53,topic#54,partition#55,offset#56L,timestamp#57,timestampType#58]
> *PushedFilters: []*, ReadSchema:
> struct<key:binary,value:binary,topic:string,partition:int,offset:bigint,timestamp:timestamp,times...
>
>
> Obviously the query takes forever to complete. Is there a solution to this
> ?
>
> I'm using kafka and kafka-client version 1.1.1
>
>
> BR,
>
> Tomas
>

Mime
View raw message