drill-dev mailing list archives

From Ian Holsman <...@holsman.com.au>
Subject Re: Jeff Dean on fast response in an unreliable world
Date Wed, 12 Sep 2012 22:53:33 GMT
Hi Ted.

While you might not be able to use this to predict the query itself, my thought was to use
it at the file/block level, so that there are more copies of the data the queries will be
processing, similar to how a traditional RDBMS caches tables. It might be a better
discussion to have on the Hadoop list though, as it wouldn't be specific to Drill.


On Sep 13, 2012, at 2:32 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I would like to point out that neither of these options (both good) would
> affect query processing because replication is far too slow to help at
> query time.
>
> In another life, I found that we could predict popularity of video items
> using only the very early life history of the items.  Similarly, I have had
> good success predicting first weekend and total life-time revenue for a
> movie based on the first 3 hours on opening night.  These are very
> different domains, but I would think that data assets might be subject to
> the same flash crowd effects and thus be somewhat predictable given early
> interest.
>
> Seasonality and similar effects are also clearly visible in real customers.
> For instance, it is common for traffic summaries to be very popular for
> the first week and then have a popularity bump on the month, quarter and
> annual anniversaries.
>
> On Tue, Sep 11, 2012 at 9:46 PM, Ian Holsman <ian@holsman.com.au> wrote:
>> I don't know of any papers off hand, but I would think you could go down
>> two routes. A predictive trend algo to 'guess' which blocks could get hot
>> based on seasonal traffic, and a reactive one based on response time
>> regularized by the number of replicas each block is on.
>> Sent from my iPhone
>> On 12/09/2012, at 2:21 PM, Worthy LaFollette <worthyl@gmail.com> wrote:
>>> As Ian explained down thread, the paper gave two examples.  The first was
>>> static seeding of duplicates, the second was dynamic with a suggestion of a
>>> monitor which seeds additional copies based on some algorithm in response
>>> to "hot" queries (China being the topic of the example given).  I am
>>> curious if anyone was aware of any papers about this second part.  I can
>>> almost see a cost model where the query measures the overall cost of a
>>> query (latency, risk of latency?) and then generates copies in response.
>>> Part of this of course would be a recovery mechanism which removes these
>>> extra copies.
>>> W-
>>> On Tue, Sep 11, 2012 at 9:31 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> What do you mean by selective replication?
>>>> On Tue, Sep 11, 2012 at 7:23 PM, Worthy LaFollette <worthyl@gmail.com> wrote:
>>>>> Very good paper. I am now curious about strategies for selective
>>>>> replication, which, if done right, looks like it would make query
>>>>> generation more efficient.  Do you know of any papers on that subject?
>>>>> On Tue, Sep 11, 2012 at 1:37 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>> Headed into Thursday's meetup, this paper by Jeff Dean provides a good
>>>>>> description of strategies for getting fast response times with variable
>>>>>> quality infrastructure.
>>>>>> http://research.google.com/people/jeff/latency.html
>>>>>> The key point here is that it is very important to have asynchronous
>>>>>> queries with a cancel.  Above that level, there needs to be a simple
>>>>>> strategy for pushing second versions of queries out to the workers and
>>>>>> canceling defunct or redundant queries.
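
The "asynchronous queries with a cancel" idea from the first post in this thread can be
sketched roughly as follows. This is only an illustration under stated assumptions, not
code from the paper or from Drill: hedged_query, hedge_delay, slow_worker and fast_worker
are hypothetical names, and the workers are stand-ins that simply sleep.

```python
import asyncio


async def hedged_query(replicas, query, hedge_delay):
    """Send `query` to the first replica. If it has not answered within
    `hedge_delay` seconds, push a second copy to the next replica.
    Return the first result and cancel whatever is still running.

    `replicas` is a list of async callables (hypothetical worker stubs)."""
    tasks = [asyncio.create_task(replicas[0](query))]
    done, pending = await asyncio.wait(tasks, timeout=hedge_delay)

    if not done:
        # Primary is slow: issue the backup copy if another replica exists.
        if len(replicas) > 1:
            tasks.append(asyncio.create_task(replicas[1](query)))
        done, pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )

    # Cancel the now-redundant copies and wait for the cancellations to land.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    return next(iter(done)).result()


# Stand-in workers (assumptions for the demo only).
async def slow_worker(query):
    await asyncio.sleep(1.0)
    return f"slow:{query}"


async def fast_worker(query):
    await asyncio.sleep(0.01)
    return f"fast:{query}"


if __name__ == "__main__":
    print(asyncio.run(hedged_query([slow_worker, fast_worker], "q", 0.05)))
```

The delay before the second copy goes out is the main knob: a short delay cuts tail
latency but sends more duplicate work to the workers, which is why the cancel matters.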

Ian Holsman
PH: +61-400-988-964 Skype:iholsman
