incubator-droids-dev mailing list archives

From Mingfai <mingfai...@gmail.com>
Subject Re: ParseData to support custom data?
Date Sat, 04 Apr 2009 13:10:57 GMT
hi,

Let's not get into the SAX vs. DOM parsing discussion first. I agree with you
that SAX parsing performs better and consumes fewer resources. I picked the
Jericho HTML Parser because its API is easier to use.


On Thu, Apr 2, 2009 at 7:49 PM, Thorsten Scherler <
thorsten.scherler.ext@juntadeandalucia.es> wrote:

> I am not 100% sure whether it is advisable to store the parse (more so
> because you later talk about DOM). The problem I see is the
> consumption of resources. I reckon in many cases parsing twice is faster
> than parsing once and reusing the DOM object.
>

For a SAX parser, there should be no need to store the parser. There is not
much benefit in reusing it.

The re-use scenario is only relevant if we use a DOM parser. I did a quick
test that calls HttpProtocol.load().obtainContent() and passes the input
stream to Jericho to create a DOM source. (It's not exactly a DOM parser, but
it works similarly.) My test fetched a 165k page from a popular portal 100
times. The whole process took 9.1s, of which parsing consumed about 30% (2.7s).
I think *if* anyone uses DOM parsing, the parsing should be done only once.
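
A rough way to repeat this kind of measurement without a network fetch is sketched below. It is only an illustration under stated assumptions: it uses the JDK's built-in DOM parser rather than Jericho, a synthetic well-formed page rather than the real portal page, and the class and method names (ParseTiming, syntheticPage, timeDomParsing, percent) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of Droids or Jericho.
class ParseTiming {

    /** Builds a synthetic, well-formed XHTML page containing the given number of links. */
    static byte[] syntheticPage(int links) {
        StringBuilder sb = new StringBuilder("<html><body>");
        for (int i = 0; i < links; i++) {
            sb.append("<p><a href=\"/page/").append(i).append("\">link ")
              .append(i).append("</a></p>");
        }
        sb.append("</body></html>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Parses the page into a DOM tree 'runs' times and returns the elapsed nanoseconds. */
    static long timeDomParsing(byte[] page, int runs) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++) {
                Document doc = builder.parse(new ByteArrayInputStream(page));
                if (doc.getDocumentElement() == null) {
                    throw new IllegalStateException("empty document");
                }
            }
            return System.nanoTime() - start;
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    /** Share of 'part' in 'total', in percent. */
    static double percent(double part, double total) {
        return 100.0 * part / total;
    }
}
```

Timing the fetch separately and passing both numbers to percent(...) gives the parsing share; with the figures above, percent(2.7, 9.1) is a little under 30.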


>
> >
> >    In fact, if I have only one handler, it makes no difference for me to
> >    do my parsing and handling in the handler.
>
> The concept of the handler is to act on input, be it a stream from the
> direct URI or an object we retrieve via parsing. Remember parsing is
> optional! That means we should not have a fixed connection between parser
> and handler. The parser stage is there to determine new tasks or to limit
> the object that we pass on (e.g. extracting outlinks or certain information
> for filtering purposes).
>
> >
> >    And as I have implemented my own parsing anyway, the original outlink
> >    extraction could be skipped and there won't be duplicated parsing.
>
> Not sure about that.
>

For my case, I have to use DOM for the handler anyway. The question is
whether it is better to:

   1. use SAX parsing in the parsing stage to create the tasks, and do
   the handling my DOM way, or
   2. replace the SAX link extractor with a DOM link extractor, and store
   the parsed DOM for the handler.

Anyway, as Droids allows storing custom data, I prefer to go with the 2nd
approach first and consider optimizing it to the 1st in the future.
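
To make the 2nd approach concrete, here is a minimal sketch. It assumes the page is well-formed XHTML and uses the JDK DOM parser in place of Jericho; DomParseResult, DomLinkExtractor and TitleHandler are hypothetical names for illustration and are not the Droids API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical parse result: outlinks for new tasks plus custom data for the handler. */
class DomParseResult {
    private final List<String> outlinks;
    private final Object data;

    DomParseResult(List<String> outlinks, Object data) {
        this.outlinks = outlinks;
        this.data = data;
    }

    List<String> getOutlinks() { return outlinks; }
    Object getData() { return data; }
}

/** Hypothetical DOM-based link extractor that keeps the parsed tree. */
class DomLinkExtractor {
    static DomParseResult parse(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> links = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                // getAttribute returns "" when the attribute is absent
                String href = ((Element) anchors.item(i)).getAttribute("href");
                if (!href.isEmpty()) {
                    links.add(href);
                }
            }
            // Store the DOM itself as custom data so the handler need not re-parse.
            return new DomParseResult(links, doc);
        } catch (Exception e) {
            throw new IllegalStateException("page is not well-formed XML", e);
        }
    }
}

/** Hypothetical handler that reuses the stored DOM instead of parsing again. */
class TitleHandler {
    static String handle(DomParseResult result) {
        Document doc = (Document) result.getData();
        NodeList titles = doc.getElementsByTagName("title");
        return titles.getLength() == 0 ? "" : titles.item(0).getTextContent();
    }
}
```

The page is parsed exactly once: the extractor produces the outlinks for new tasks, and the handler casts the stored object back to a Document. The cast is the price of keeping the custom data typed as Object.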

I did consider one more case: the handler may be executed on another
distributed node. So any custom data to be stored has to be serializable, and
it is preferable not to store anything at all.
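
The serializability constraint could be expressed in the type itself. The sketch below is an assumption-laden illustration, not Droids code: SerializableParse and ParseTransport are hypothetical names, and plain Java serialization stands in for whatever transport a distributed setup would really use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parse result whose custom data must be Serializable. */
class SerializableParse implements Serializable {
    private static final long serialVersionUID = 1L;

    private final ArrayList<String> outlinks;
    private final Serializable data;

    SerializableParse(List<String> outlinks, Serializable data) {
        this.outlinks = new ArrayList<>(outlinks); // defensive, serializable copy
        this.data = data;
    }

    List<String> getOutlinks() { return outlinks; }
    Serializable getData() { return data; }
}

/** Hypothetical helper simulating the hand-over to another node. */
class ParseTransport {
    static byte[] toBytes(SerializableParse parse) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(parse);
            out.flush();
            return bytes.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("custom data could not be serialized", e);
        }
    }

    static SerializableParse fromBytes(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SerializableParse) in.readObject();
        } catch (Exception e) {
            throw new IllegalStateException("custom data could not be restored", e);
        }
    }
}
```

Note that the org.w3c.dom.Document interface does not extend Serializable, so a DOM tree would not be accepted here as is. Storing only the extracted values (for example a HashMap of fields) sidesteps that.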



>
> >
> >    - There are some minor comments on the API as follows:
> >    - it's good to merge Parse and ParseData. The meaning of "Parse" isn't
> >       too clear. ParseData is more meaningful, and ParsedData or
> >       ParseResult is clearer to me.
>

After some more thought, calling it Parse is just fine. For naming, the
shorter the better.


>
> >       - I suggest writing some lines in the class comment to mention the
> >       design purpose of these classes.
> >       - If the Parse/ParseData also stores a reference to the Parser, for a
> >       SAX Parser, it could be re-used by the handler. (However, for a DOM
> >       parser, it's confusing, as it should store the parsed data only.)
>
> I strongly discourage DOM parsing/storing for droids. However, droids
> allows you even that. I am not sure whether you really mean keeping a
> reference or the parsed object. If you mean a reference to the parser,
> then I am not convinced. Holding a reference to an object blocks that
> object from GC. We would need to clean up all these references after all
> handlers have finished.


I agree with you that it's not a good idea to pass a parser reference for now.



>
>
>
> Sounds swell, but you mean Parser.parse(...), right?


yes, it was a typo.


>
>
> >       - Re. Object getParseObject();, I suggest calling it Object
> >       getData instead.
> >
> >
> > btw, my understanding of Droids comes largely from the SimpleRuntime
> > usage. I hope I didn't miss the big picture.
>
> The SimpleRuntime is nice for showing the different setups of components;
> however, it fails to show features like automatic extensibility of the
> droid. I showed that in my presentation at ApacheCon, where I used the
> droids-spring sample.
>
> Thanks for your feedback mingfai.
>
> salu2
>
> >
>

Thanks for your comments.

regards,
mingfai
