spark-user mailing list archives

From Andrew Melo <andrew.m...@gmail.com>
Subject Re: REST Structured Streaming Sink
Date Thu, 02 Jul 2020 01:40:13 GMT
On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz <brkyvz@gmail.com> wrote:
>
> I'm not sure having a built-in sink that allows you to DDoS servers is the best idea
> either. foreachWriter is typically used for such use cases, not foreachBatch. It's also
> pretty hard to guarantee exactly-once, rate limiting, etc.

If you control the machines and can run arbitrary code, you can DDoS
whatever you want. What's the difference between this proposal and
writing a UDF that opens 1,000 connections to a target machine?
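
For concreteness, the foreachWriter route mentioned above would look roughly like the
sketch below - a sketch only, not a tested implementation: the endpoint URL and payload
shape are placeholders, and delivery is at-least-once, since a replayed partition/epoch
re-POSTs its rows unless the receiving service deduplicates.

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    import org.apache.spark.sql.{ForeachWriter, Row}

    // Minimal sketch of a per-record REST sink (hypothetical endpoint).
    class RestForeachWriter(endpoint: String) extends ForeachWriter[Row] {
      // HttpClient is not serializable, so build it per partition in open().
      @transient private var client: HttpClient = _

      override def open(partitionId: Long, epochId: Long): Boolean = {
        client = HttpClient.newHttpClient()
        true // returning false skips this partition for this epoch
      }

      override def process(row: Row): Unit = {
        // Assumes the rows carry (id: Long, value: String) columns.
        val body = s"""{"id": ${row.getLong(0)}, "value": "${row.getString(1)}"}"""
        val request = HttpRequest.newBuilder(URI.create(endpoint))
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(body))
          .build()
        client.send(request, HttpResponse.BodyHandlers.ofString())
      }

      override def close(errorOrNull: Throwable): Unit = ()
    }

Wired up with df.writeStream.foreach(new RestForeachWriter(...)).start(), every task
opens its own connections with no global coordination - which is exactly why rate
limiting is hard to guarantee here.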

> Best,
> Burak
>
> On Wed, Jul 1, 2020 at 5:54 PM Holden Karau <holden@pigscanfly.ca> wrote:
>>
>> I think adding something like this (if it doesn't already exist) could help make
>> structured streaming easier to use; foreachBatch is not the best API.
>>
>> On Wed, Jul 1, 2020 at 2:21 PM Jungtaek Lim <kabhwan.opensource@gmail.com> wrote:
>>>
>>> I guess the method, query parameters, headers, and payload would all be different
>>> for almost every use case - that makes it hard to generalize, and the implementation
>>> would need to be quite complicated to be flexible enough.
>>>
>>> I'm not aware of any custom sink implementing REST, so your best bet would be to
>>> simply implement your own with foreachBatch - but someone might jump in and provide
>>> a pointer if there is something in the Spark ecosystem.
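>>>
>>> A rough sketch of what I mean (untested; the endpoint, the dedupe header, and
>>> serializing the whole micro-batch into one request are placeholders - collect()
>>> is only sensible for small batches):
>>>
>>>     import java.net.URI
>>>     import java.net.http.{HttpClient, HttpRequest, HttpResponse}
>>>
>>>     import org.apache.spark.sql.DataFrame
>>>
>>>     // df is the streaming DataFrame to write out.
>>>     val query = df.writeStream
>>>       .foreachBatch { (batch: DataFrame, batchId: Long) =>
>>>         // One JSON array per micro-batch (small batches only).
>>>         val payload = batch.toJSON.collect().mkString("[", ",", "]")
>>>         val request = HttpRequest
>>>           .newBuilder(URI.create("https://example.com/ingest"))
>>>           .header("Content-Type", "application/json")
>>>           .header("X-Batch-Id", batchId.toString) // lets the server dedupe replays
>>>           .POST(HttpRequest.BodyPublishers.ofString(payload))
>>>           .build()
>>>         HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
>>>         () // foreachBatch expects Unit
>>>       }
>>>       .start()
>>>
>>> Exactly-once then lands on the receiving service: a batch replayed after a failure
>>> arrives with the same X-Batch-Id, and the server has to treat it as a no-op.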
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Thu, Jul 2, 2020 at 3:21 AM Sam Elamin <hussam.elamin@gmail.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>>
>>>> We ingest a lot of RESTful APIs into our lake, and I'm wondering if it is at all
>>>> possible to create a REST sink in Structured Streaming?
>>>>
>>>> For now I'm only focusing on RESTful services that have an incremental ID, so my
>>>> sink can just poll for new data and then ingest it.
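>>>>
>>>> Roughly what I have in mind for the polling side (against a made-up API - the
>>>> since_id parameter and the regex ID extraction are just placeholders):
>>>>
>>>>     import java.net.URI
>>>>     import java.net.http.{HttpClient, HttpRequest, HttpResponse}
>>>>
>>>>     // Poll a hypothetical endpoint for records newer than the last seen ID.
>>>>     object IncrementalPoller {
>>>>       private val client = HttpClient.newHttpClient()
>>>>
>>>>       // Returns the raw response body plus the new high-water-mark ID.
>>>>       def poll(endpoint: String, lastId: Long): (String, Long) = {
>>>>         val request = HttpRequest
>>>>           .newBuilder(URI.create(s"$endpoint?since_id=$lastId"))
>>>>           .GET()
>>>>           .build()
>>>>         val body = client.send(request, HttpResponse.BodyHandlers.ofString()).body()
>>>>         // Naive regex pull of numeric "id" fields; real code would parse JSON.
>>>>         val ids = """"id"\s*:\s*(\d+)""".r
>>>>           .findAllMatchIn(body).map(_.group(1).toLong).toList
>>>>         val maxId = if (ids.isEmpty) lastId else ids.max
>>>>         (body, maxId)
>>>>       }
>>>>     }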
>>>>
>>>> I can't seem to find a connector that does this, and my gut instinct tells me
>>>> it's probably because it isn't possible due to something completely obvious that
>>>> I'm missing.
>>>>
>>>> I know some RESTful APIs obfuscate the IDs into a hash of strings, and that could
>>>> be a problem, but since I'm planning on focusing on just numerical IDs that get
>>>> incremented, I think I won't be facing that issue.
>>>>
>>>>
>>>> Can anyone let me know if this sounds like a daft idea? Will I need something
>>>> like Kafka or Kinesis as a buffer for redundancy, or am I overthinking this?
>>>>
>>>>
>>>> I would love to bounce ideas around with people who run Structured Streaming
>>>> jobs in production.
>>>>
>>>>
>>>> Kind regards
>>>> Sam
>>>>
>>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

