trafficserver-users mailing list archives

From Yongming Zhao <ming....@gmail.com>
Subject Re: generating hash from packet content
Date Fri, 29 Aug 2014 07:22:10 GMT
I agree that Leif has pointed out the problem here. We may call this a de-duplication solution,
but by the time we have saved the content fetched from the origin, the disk storage is
already wasted: you only get the hash once all the data has arrived from the origin,
and by then the duplicate file already occupies disk space.

A good solution would be:
the origin sends the content with the common headers plus a SHA hash string and/or an MD5 hash
string, and then we can look up that key in our storage. Then it should work as expected.
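
To make the idea concrete, here is a minimal sketch in Python. It is not ATS code: the
DedupCache class, its method names, and the X-Content-SHA256 header are all hypothetical,
and a plain in-memory dict stands in for the cache storage. It only illustrates looking up
the origin-advertised digest before the body is written, with URLs mapped to digests and
digests mapped to objects.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical header name; the origin would have to cooperate and send it.
DIGEST_HEADER = "X-Content-SHA256"


class DedupCache:
    """Toy in-memory cache: URL -> content digest -> object body."""

    def __init__(self) -> None:
        self.url_to_digest: Dict[str, str] = {}
        self.objects: Dict[str, bytes] = {}

    def lookup(self, url: str) -> Optional[bytes]:
        """Normal cache lookup by URL, going through the digest."""
        digest = self.url_to_digest.get(url)
        return self.objects.get(digest) if digest is not None else None

    def on_response_headers(self, url: str, headers: Dict[str, str]) -> Optional[bytes]:
        """Return the stored body if the advertised digest is already cached.

        On a hit the URL is pointed at the existing object, so the body
        does not need to be downloaded or stored a second time.
        """
        digest = headers.get(DIGEST_HEADER)
        if digest is None or digest not in self.objects:
            return None
        self.url_to_digest[url] = digest
        return self.objects[digest]

    def store(self, url: str, body: bytes) -> str:
        """Store a fully fetched body under its SHA-256 digest."""
        digest = hashlib.sha256(body).hexdigest()
        self.objects.setdefault(digest, body)  # keep a single copy per digest
        self.url_to_digest[url] = digest
        return digest


cache = DedupCache()
body = b"same video bytes"
digest = cache.store("http://origin.example/a", body)

# A second URL whose origin response advertises the same digest.
hit = cache.on_response_headers("http://origin.example/b", {DIGEST_HEADER: digest})
```

In this sketch the second URL is served from the object already stored for the first one,
and only one copy of the body is kept. Without the digest header, on_response_headers
returns None and the content has to be fetched and hashed as usual.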




On Aug 29, 2014, at 4:09 AM, Leif Hedstrom <zwoop@apache.org> wrote:

> 
> On Aug 28, 2014, at 12:19 PM, Bill Zeng <billzeng2009@gmail.com> wrote:
> 
>> 
>> 
>> 
>> On Thu, Aug 28, 2014 at 10:41 AM, Leif Hedstrom <zwoop@apache.org> wrote:
>> 
>> On Aug 28, 2014, at 11:35 AM, Bill Zeng <billzeng2009@gmail.com> wrote:
>> 
>>> Just to throw another idea your way. We can insert another level of indirection
>>> between URLs and objects. Every object has a unique hash. URLs point to the hashes instead
>>> of objects. The hashes are used to look up objects. Even if multiple URLs are duplicated,
>>> and hence their hashes, they always point to the same object. It does not seem an easy
>>> project, though. It requires major changes to ATS.
>> 
>> 
>> I’m not sure I understand this, or how it helps this problem? However, isn’t
>> this sort of how the cache already works? There’s a hash from URL to the “header” entry,
>> which then has its own hash to the actual object. Alan?
>> 
>> Maybe I did not understand it correctly. Currently, ATS calculates a hash from a
>> URL and uses the hash to look up the actual object. That is "URL --> actual object". My
>> idea is "URL --> hash of an object --> actual object". We calculate the hash of a
>> URL and use that to look up the hash of an actual object, and then use the hash of the actual
>> object to look up the actual object.
> 
> 
> But what problem does that solve? You have URLs <A> and <B>, both of which point
> to the same object. How do you find that object based only on the client request (URL + headers)?
> How do you generate the “object hash” for the lookup, without going to origin first? That’s
> the problem here, afaik?
> 
> Or is your suggestion here to solve the cache deduping problem (which is not what the
> OP asked for)? If so, there was the beginning of that in the cache code, storing the hash
> of objects in the cache as well (but maybe that’s gone now?). There is also a CRC (checksum)
> feature in the cache; maybe the intention back then was to generalize the cache dedup with
> these checksums. Only John Plevyak would know :).
> 
> Fwiw, this problem is what Metalink is intended to solve for some use cases (e.g. site
> mirrors), but Metalink requires cooperation (additional Metalink headers) from the origin.
> It does not solve (or intend to solve) the issue where e.g. YouTube rotates the content URLs
> frequently.
> 
> — Leif

- Yongming Zhao 赵永明

