spark-user mailing list archives

From Jörn Franke <>
Subject Re: Processing huge amount of data from paged API
Date Sun, 21 Jan 2018 21:26:11 GMT
Which device provides messages as thousands of HTTP pages? This is obviously inefficient, and
running the requests in parallel will not help much. Furthermore, with paging you risk that
messages get lost or that you get duplicate messages. I still do not get why applications
nowadays download a lot of data through services that provide a paging mechanism: it has
failed in the past, it fails today, and it will fail in the future.
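
To make the paging risk concrete, here is a minimal Python sketch. It assumes offset-based paging (page number times page size), which the original question does not confirm; `fetch_page` and `crawl` are hypothetical stand-ins for the real HTTP calls, and the in-memory list stands in for the server's data, which changes while the client is still paging.

```python
def fetch_page(store, page, page_size):
    """Hypothetical stand-in for an offset-paged HTTP GET."""
    start = page * page_size
    return store[start:start + page_size]


def crawl(store, page_size, mutate_after_first_page):
    """Read every page in order; the store changes after page 0 is read."""
    seen = []
    page = 0
    while True:
        batch = fetch_page(store, page, page_size)
        if not batch:
            break
        seen.extend(batch)
        if page == 0:
            # Simulates the server-side data changing mid-crawl.
            mutate_after_first_page(store)
        page += 1
    return seen


# Newest-first listing: a new message arrives at the front, shifting
# everything one slot down, so "m3" is served again on the next page.
with_duplicate = crawl(["m4", "m3", "m2", "m1"], 2, lambda s: s.insert(0, "m5"))
print(with_duplicate)  # ['m4', 'm3', 'm3', 'm2', 'm1']

# Oldest-first listing: the oldest message is deleted, shifting
# everything one slot up, so "m3" is never served at all.
with_loss = crawl(["m1", "m2", "m3", "m4"], 2, lambda s: s.pop(0))
print(with_loss)  # ['m1', 'm2', 'm4']
```

Fetching the pages in parallel does not remove this problem, since each request still addresses data by a position that can shift between requests.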

Can’t the device push data onto a bus, e.g. Kafka? Maybe via STOMP or similar? If in doubt, the
device could prepare a file with all the measurements and make the file available through
HTTP (with resumable downloads, of course).
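
As a rough sketch of what a resumable download could look like on the client side, the Python below uses the standard HTTP `Range` header via `urllib.request`. It assumes the server supports byte ranges; the function names, the URL, and the file name are hypothetical and not part of any existing device API.

```python
import os
import urllib.request


def build_resume_request(url, local_path):
    """Build a GET that asks only for the bytes not yet on disk."""
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    req = urllib.request.Request(url)
    if offset > 0:
        # Open-ended range: everything from `offset` to the end of the file.
        req.add_header("Range", f"bytes={offset}-")
    return req, offset


def resume_download(url, local_path, chunk_size=64 * 1024):
    """Continue a partial download, or start over if ranges are ignored."""
    req, offset = build_resume_request(url, local_path)
    # Note: a server typically answers 416 (raised here as HTTPError)
    # when the local file is already complete.
    with urllib.request.urlopen(req) as resp:
        # 206 means the server honoured the range, so append.
        # 200 means it sent the whole file, so overwrite from scratch.
        mode = "ab" if resp.status == 206 and offset > 0 else "wb"
        with open(local_path, mode) as out:
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

A call such as `resume_download("http://device.example/measurements.json", "measurements.json")` (hypothetical URL) can be repeated after an interrupted transfer and picks up where the local file ends.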

> On 21. Jan 2018, at 21:33, anonymous <> wrote:
> Hello,
> I'm in an IoT company, and I have a use case for which I would like to know
> if Apache Spark could be helpful. It's a very broad question, and sorry if
> it's long winded.
> We have HTTP GET APIs to get two kinds of information:
> 1) The Device Messages API returns data about device messages (in JSON).
> 2) The Devices API returns information about devices (in JSON) -- for
> example, device name, device owner, etc. Each Device Message has a Device ID
> field, which points to the device which sent it.
> To make it clearer, we have devices, and each device can send many device
> messages.
> Our goal is to get device messages and send them to an ElasticSearch index.
> The two major problems are:
> 1) We need data to be denormalized. That is, we don't want to have one index
> for device messages, and a separate index for device information -- we want
> each device message to have the corresponding device's information attached
> to it. That is because ElasticSearch works best with denormalized data. So,
> we would like a solution that can join (as in an SQL join) the Device
> Message data with the Device data and apply some transformations to them
> before sending it to ElasticSearch. We can potentially have millions of
> devices and device messages, so this solution needs to be scalable.
> 2) Both the Device Messages API and the Devices API are paged, and can
> potentially have thousands of pages. We can potentially have millions of
> devices and device messages. Making HTTP requests for thousands of pages can
> become inefficient. So, it would be good to have a way to parallelize this
> process.
> So, to be short, we would like a solution that can help with:
> 1) Joining and transforming large amounts of data (from a paged API) before
> sending it to ElasticSearch.
> 2) Making the process of sifting through all the pages in the paged APIs
> more efficient.
> Can Apache Spark help with all this?
> Thank you in advance.
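
The denormalization asked for in the quoted message can be illustrated with a small sketch. This is plain Python performing an in-memory hash join, not Spark code, and the field names (`id`, `device_id`, `name`, `owner`, `payload`) are assumptions made for illustration; the thread does not give the real JSON schema.

```python
def denormalize(messages, devices):
    """Attach each message's device record to the message (left join)."""
    # Index devices by ID once so each message lookup is O(1).
    devices_by_id = {device["id"]: device for device in devices}
    documents = []
    for message in messages:
        document = dict(message)  # copy, so the input is not modified
        # None marks a message whose device is unknown.
        document["device"] = devices_by_id.get(message["device_id"])
        documents.append(document)
    return documents


devices = [{"id": 1, "name": "sensor-a", "owner": "alice"}]
messages = [
    {"device_id": 1, "payload": "21.5"},
    {"device_id": 2, "payload": "19.0"},
]
for document in denormalize(messages, devices):
    print(document)
```

Each resulting document carries its device information inline, which is the shape the question describes for the ElasticSearch index. At the scale of millions of records, the same join would be expressed in a distributed engine instead of a single in-memory dictionary.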