spark-user mailing list archives

From Burak Yavuz <brk...@gmail.com>
Subject Re: REST Structured Streaming Sink
Date Thu, 02 Jul 2020 01:12:53 GMT
I'm not sure having a built-in sink that allows you to DDoS servers is the
best idea either. foreach with a ForeachWriter is typically used for such use
cases, not foreachBatch. It's also pretty hard to guarantee exactly-once
delivery, rate limiting, etc.
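
To make the rate-limiting concern concrete, here is a minimal sketch of a row-level writer in Python. It assumes PySpark, where DataStreamWriter.foreach() accepts an object with a process(row) method and optional open(partition_id, epoch_id) and close(error) methods. The class name RateLimitedRestWriter, the send callback, and the max_per_second parameter are hypothetical names chosen for illustration and are not part of Spark. The HTTP call is left to the caller, since the method, headers, and payload differ per service.

```python
import time
from typing import Callable


class RateLimitedRestWriter:
    """Hypothetical writer with the open/process/close shape that
    PySpark's DataStreamWriter.foreach() accepts. Not a Spark API."""

    def __init__(self, send: Callable[[dict], None], max_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.send = send                      # caller-supplied HTTP call (assumption)
        self.min_interval = 1.0 / max_per_second
        self.clock = clock                    # injectable so the pacing can be tested
        self.sleep = sleep
        self.next_allowed = 0.0
        self.sent = 0

    def open(self, partition_id, epoch_id):
        # Called once per partition and epoch; returning True means "process rows".
        self.next_allowed = self.clock()
        self.sent = 0
        return True

    def process(self, row):
        # Wait until the next send slot so requests are spaced out in time.
        wait = self.next_allowed - self.clock()
        if wait > 0:
            self.sleep(wait)
        # Spark Row objects expose asDict(); plain mappings are copied as-is.
        payload = row.asDict() if hasattr(row, "asDict") else dict(row)
        self.send(payload)
        self.sent += 1
        self.next_allowed = max(self.clock(), self.next_allowed) + self.min_interval

    def close(self, error):
        # Nothing to release in this sketch; a real writer might close a session.
        pass
```

With PySpark this could be attached as df.writeStream.foreach(RateLimitedRestWriter(post_json, 5)).start(), where post_json is a hypothetical function that issues the request. Note that the limit applies per task, so the total request rate scales with the number of partitions, and a retried epoch sends its rows again, so this is at-least-once at best.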

Best,
Burak

On Wed, Jul 1, 2020 at 5:54 PM Holden Karau <holden@pigscanfly.ca> wrote:

> I think adding something like this (if it doesn't already exist) could
> help make structured streaming easier to use; foreachBatch is not the best
> API.
>
> On Wed, Jul 1, 2020 at 2:21 PM Jungtaek Lim <kabhwan.opensource@gmail.com>
> wrote:
>
>> I guess the method, query parameters, headers, and payload would all be
>> different for almost every use case. That makes it hard to generalize, and
>> the implementation would have to be fairly complicated to be flexible enough.
>>
>> I'm not aware of any custom sink implementing REST, so your best bet would
>> be simply implementing your own with foreachBatch, but someone might
>> jump in and provide a pointer if there is something in the Spark ecosystem.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Thu, Jul 2, 2020 at 3:21 AM Sam Elamin <hussam.elamin@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>>
>>> We ingest a lot of RESTful APIs into our lake and I'm wondering if it is
>>> at all possible to create a REST sink in structured streaming?
>>>
>>> For now I'm only focusing on restful services that have an incremental
>>> ID so my sink can just poll for new data then ingest.
>>>
>>> I can't seem to find a connector that does this, and my gut instinct
>>> tells me it's probably because it isn't possible due to something
>>> completely obvious that I am missing.
>>>
>>> I know some RESTful APIs obfuscate the IDs to a hash of strings and that
>>> could be a problem, but since I'm planning on focusing on just numerical IDs
>>> that get incremented, I think I won't be facing that issue.
>>>
>>>
>>> Can anyone let me know if this sounds like a daft idea? Will I need
>>> something like Kafka or kinesis as a buffer and redundancy or am I
>>> overthinking this?
>>>
>>>
>>> I would love to bounce ideas off people who run structured streaming
>>> jobs in production.
>>>
>>>
>>> Kind regards
>>> Sam
>>>
>>>
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
