gora-dev mailing list archives

From Alparslan Avcı <alparslan.a...@agmlab.com>
Subject Re: Getting statistics about crawled pages
Date Wed, 19 Feb 2014 13:04:43 GMT
Hi Lewis,

Apologies, my mistake: I accidentally sent the mail to the dev@gora list 
instead of dev@nutch. :-(

Thanks for the comments by the way, especially about the data model.


On 19-02-2014 14:37, Lewis John Mcgibbney wrote:
> Hi Alparslan,
> On Wed, Feb 19, 2014 at 12:04 PM, <dev-digest-help@gora.apache.org> wrote:
>> I think. After Nutch provides this info, a data analysis process (with
>> the help of Pig, for example) can be run over the collected data. (Google
>> also saves this kind of info.)
> I know that Alfonso has been a user of Pig over data persisted through Gora,
> so maybe he can help you with this one.
>> We can create a new field for this, however this will change the data
>> model.
> Generally speaking, as you've indicated, changing the data model should be
> a last resort. Avro schema evolution is something I do not yet fully
> understand. This article [0] helped a _bit_, as it covers the basics, e.g.
> "Avro requires schemas when data is written or read. Most interesting is
> that you can use different schemas for serialization and deserialization,
> and Avro will handle the missing/extra/modified fields." In my experience,
> however, if your tools read from what is essentially unstructured data,
> i.e. data whose model has changed over time as new fields were added, you
> typically run into NPEs. On reflection, I suspect this was most likely
> because I was accessing the WebPage object fields directly instead of
> letting Avro deserialize the data.
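(To illustrate the point about missing fields: when the reader's schema has a field the writer's data lacks, Avro fills it from a declared default, and a field added *without* a default is typically what surfaces as an NPE when old records are read. A hedged sketch; the field names below are illustrative, not the actual Nutch WebPage schema:

```
{
  "type": "record",
  "name": "WebPage",
  "fields": [
    {"name": "baseUrl", "type": ["null", "string"], "default": null},
    {"name": "tags",
     "type": {"type": "map", "values": "string"},
     "default": {}}
  ]
}
```

With the `"default"` present, records written before `tags` existed deserialize with an empty map instead of a missing field.)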
>> Instead, we can add the tag info into the metadata map. (We can also add a
>> prefix to the map keys to distinguish the tag content from other info.)
> In short, yes, I find this the most efficient method of persisting
> additional information to WebPages. It may carry some overhead: in
> Cassandra, for example, if your metadata is persisted into a super column,
> you need to unpack it, read the contents of the super column, then pack it
> up again before pushing the data over the wire. On the other hand, this
> may not be an issue if your metadata is not too large.
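(A minimal sketch of the prefixed-key idea. It uses plain `String` keys and values for brevity, whereas Gora's WebPage metadata map actually holds `Utf8`/`ByteBuffer` entries, and the `tag:` prefix is just a hypothetical convention:

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataTagExample {
    // Hypothetical prefix separating tag entries from other metadata keys.
    static final String TAG_PREFIX = "tag:";

    // Pull only the tag entries back out of a mixed metadata map,
    // stripping the prefix from the returned keys.
    static Map<String, String> extractTags(Map<String, String> metadata) {
        Map<String, String> tags = new HashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (e.getKey().startsWith(TAG_PREFIX)) {
                tags.put(e.getKey().substring(TAG_PREFIX.length()),
                         e.getValue());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Stand-in for a WebPage's metadata map.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("fetchTime", "1392815083");      // ordinary metadata
        metadata.put(TAG_PREFIX + "language", "en");  // tag content
        metadata.put(TAG_PREFIX + "category", "news");

        Map<String, String> tags = extractTags(metadata);
        System.out.println(tags.size()); // prints 2
    }
}
```

A Pig job could then filter map entries by the same prefix, so no schema change is needed.)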
>> What do you think about this? Any comments or suggestions?
> Generally speaking, I think you're gunning for the right target by saving
> the overhead of changing your data model; however, I am by no means an
> expert in using Pig for data analysis on Gora data, so I cannot comment
> much on that side of the issue.
> hth
> Lewis
> [0]
> http://blog.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/
