spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Call http request from within Spark
Date Thu, 14 Jul 2016 19:42:05 GMT
I second what Pedro said in his second paragraph.

Issuing an HTTP request per row would not scale.
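
A minimal sketch of the per-partition approach Pedro describes, assuming the
requests library is installed on the executors. The function name
fetch_profiles, the session_factory parameter, the 10-second timeout, and the
choice to return (id, status, body) tuples are illustrative assumptions, not
part of the original code. The URL and placeholder credentials are the ones
from Amit's message.

```python
PROFILE_URI = 'https://profile.localytics.com/x1/customers/{}'
API_KEY = 'txt'       # placeholder credentials from the original message
API_SECRET = 'yuyuy'


def fetch_profiles(rows, session_factory=None):
    """Meant to run once per partition: open one HTTP session and reuse it
    (and its connection pool) for every row in that partition."""
    if session_factory is None:
        import requests  # imported lazily, only when a real session is needed
        session_factory = requests.Session
    session = session_factory()
    session.auth = (API_KEY, API_SECRET)  # requests treats a tuple as basic auth
    try:
        for row in rows:
            cust_id = row.split(',')[0]
            resp = session.get(PROFILE_URI.format(cust_id), timeout=10)
            # Yield plain values rather than the Response object.
            yield cust_id, resp.status_code, resp.text
    finally:
        session.close()


# Intended usage with the RDD from the original message (not executed here):
#   profile_rdd = custid_rdd.mapPartitions(fetch_profiles)
#   profile_rdd.count()
```

This still issues one request per customer ID, so it only removes the
per-call connection setup. Any rate limit on the service side applies as
before.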

On Thu, Jul 14, 2016 at 12:26 PM, Pedro Rodriguez <ski.rodriguez@gmail.com>
wrote:

> Hi Amit,
>
> Have you tried running a subset of the IDs locally on a single thread? It
> would be useful to benchmark your getProfile function on a subset of the
> data, estimate how long the full data set would take, then divide by the
> number of Spark executor cores. This should at least serve as a sanity
> check. If things are much slower than expected, is it possible that the
> service has a rate limit per IP address that you are hitting?
>
> If requests is more efficient at batching requests together (I don't know
> much about its internal implementation and connection pools), you could do
> that with mapPartitions. This is useful when the initialization in the
> function passed to map is expensive (e.g. it sets up a connection pool for
> a DB or web service), as it allows you to initialize that resource once per
> partition and then reuse it for all the elements in the partition.
>
> Pedro
>
> On Thu, Jul 14, 2016 at 8:52 AM, Amit Dutta <amitkrdutta@outlook.com>
> wrote:
>
>> Hi All,
>>
>>
>> I have a requirement to call a REST service URL for 300k customer IDs.
>>
>> What I have tried so far:
>>
>>
>> import requests
>>
>> # getProfile is the method to do the http call
>> def getProfile(cust_id):
>>     api_key = 'txt'
>>     api_secret = 'yuyuy'
>>     profile_uri = 'https://profile.localytics.com/x1/customers/{}'
>>     customer_id = cust_id
>>
>>     data = None  # stays None when no customer id is given
>>     if customer_id is not None:
>>         data = requests.get(profile_uri.format(customer_id),
>>                             auth=requests.auth.HTTPBasicAuth(api_key, api_secret))
>>         # print json.dumps(data.json(), indent=4)
>>     return data
>>
>> # getting all the customer ids and building adds
>> custid_rdd = sc.textFile('file:////Users/zzz/CustomerID_SC/Inactive User Hashed LCID List.csv')
>> profile_rdd = custid_rdd.map(lambda r: getProfile(r.split(',')[0]))
>> profile_rdd.count()
>>
>>
>> When I print the JSON dump of the data, I see it returning results from
>> the REST call. But the count never finishes.
>>
>>
>> Is there an efficient way of dealing with this? Some posts say we have to
>> define a batch size etc., but I don't know how.
>>
>>
>> Appreciate your help
>>
>>
>> Regards,
>>
>> Amit
>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>
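
The sanity check from Pedro's first paragraph comes down to simple
arithmetic, sketched below. The function name estimate_wall_clock_seconds is
hypothetical, and so are the 0.25 seconds per call and the 16 executor cores.
Only the 300k ID count comes from the thread. The estimate assumes perfect
parallelism and no rate limiting, so it is a lower bound at best.

```python
def estimate_wall_clock_seconds(seconds_per_call, total_ids, executor_cores):
    """Extrapolate a single-threaded benchmark to the full data set, then
    divide by the number of cores that would run calls in parallel."""
    single_thread_seconds = seconds_per_call * total_ids
    return single_thread_seconds / executor_cores


# Hypothetical figures: 0.25 s per getProfile call, 16 executor cores.
estimate = estimate_wall_clock_seconds(0.25, 300000, 16)
print(estimate)  # 4687.5 seconds, roughly 78 minutes
```

If the real job runs far longer than such an estimate, that points at
something other than raw request latency, such as the per-IP rate limit
mentioned above.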
