incubator-droids-dev mailing list archives

From Mingfai <mingfai...@gmail.com>
Subject Re: [jira] Commented: (DROIDS-45) Fail to resovle outlink correctly
Date Fri, 03 Apr 2009 09:40:32 GMT
hi,

Thanks. I'd like to see the requirement included in Droids; the
implementation itself is not an issue for me. My code is not in the droids
package, and I didn't change the Link Extractor to use it at all. It is just
for reference, so don't worry if you want to discard it. It would be great if
the Link Extractor could be refactored, so that next time I could submit
a patch.

Re. instances: I believe that on a modern JVM, creating many short-lived
instances makes no real difference in performance, unless you keep a
reference to them and never release it for GC. And by the way, I don't think
my implementation, which uses indexOf instead of regex, makes a meaningful
improvement either. It's more my coding style. And indeed, I don't really
care about "micro" performance for a crawler. My primary language for the
project is Groovy, which is way slower than pure Java. :-)

regards,
mingfai


On Fri, Apr 3, 2009 at 4:11 PM, Thorsten Scherler <
thorsten.scherler.ext@juntadeandalucia.es> wrote:

> On Thu, 2009-04-02 at 06:10 -0700, Mingfai Ma (JIRA) wrote:
> > [https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694995#action_12694995]
> >
> > Mingfai Ma commented on DROIDS-45:
> > ----------------------------------
> >
> > the LinkExtractor doesn't append '/' automatically.
>
> Hmm, I just asked Javier to have a look into this, since he was the
> last one to work on the code. I will try to find some time to debug this
> weekend, since I need to finish a project ATM.
>
> One small thing I found in your class is the constructor: it is not
> optimal, since we would be forced to create a lot of instances (one for
> each base/link). That needs rethinking so the class can be reused.
>
> salu2
>
>
> > and I think it shouldn't, as it is possible for a server to handle URLs
> > with and without '/' differently. For a root domain URL it may be OK, but
> > for a deeper URL we can't just assume the last segment of the request
> > path is a directory.
> >
> > Apache mod_dir should append a trailing slash, but unfortunately not all
> > web servers on the internet have this feature enabled :-)
> > http://httpd.apache.org/docs/2.2/mod/mod_dir.html
> >
> > > Fail to resovle outlink correctly
> > > ---------------------------------
> > >
> > >                 Key: DROIDS-45
> > >                 URL: https://issues.apache.org/jira/browse/DROIDS-45
> > >             Project: Droids
> > >          Issue Type: Bug
> > >          Components: core
> > >    Affects Versions: 0.01
> > >            Reporter: Mingfai Ma
> > >
> > > I've encountered several cases where outlinks are not extracted
> > > correctly. Most are caused by the use of URI.resolve().
> > > 1. For a base URI of new URI("http://www.domain.com"),
> > > <a href="test.html">test.html</a> will be resolved to
> > > http://www.domain.comtest.html
> > > 2. For a base URI of new URI("http://www.domain.com/index.php"),
> > > <a href="?test=true">test with param</a> will be resolved to
> > > http://www.domain.com/?test=true
> > > 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve
> > > will throw an exception, whereas a browser can resolve the URI.
> > > (Remark: I didn't check whether this scenario affects the default
> > > Tika/NekoHTML parsing.)
> > > I suspect there are many different scenarios, many of them probably
> > > caused by non-standard usage (but a crawler has to handle non-standard
> > > usage in order to function). Obviously, we cannot cater for every case,
> > > and I suggest treating a resolve failure as a bug if a link works in a
> > > Mozilla browser but not in the Droids LinkExtractor.
> > > This issue is related to the LinkExtractor created in DROIDS-8.
> >
> --
> Thorsten Scherler <thorsten.at.apache.org>
> Open Source Java <consulting, training and solutions>
>
> Sociedad Andaluza para el Desarrollo de la Sociedad
> de la Información, S.A.U. (SADESI)
>
>
>
>
>
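
The three cases in the quoted DROIDS-45 description could be worked around
roughly as in the sketch below. This is only an illustration, not the code
attached to the issue and not the Droids LinkExtractor: the class name
OutlinkResolverSketch and its resolve() method are hypothetical. Because the
exact behaviour of java.net.URI.resolve() for cases 1 and 2 may differ
between JDK versions, the sketch handles both cases explicitly instead of
relying on it, and it assumes that trimming whitespace is an acceptable fix
for case 3.

```java
import java.net.URI;

// Hypothetical helper, not part of Droids: a stateless workaround sketch.
final class OutlinkResolverSketch {

    private OutlinkResolverSketch() {
        // Static use only, so no instance is created per base/link pair.
    }

    /**
     * Resolves href against base, covering the three reported cases.
     * Throws IllegalArgumentException if the trimmed href is still invalid.
     */
    static URI resolve(URI base, String href) {
        // Case 3: drop surrounding whitespace such as a trailing line break.
        String cleaned = href.trim();

        boolean hasAuthority =
                base.getScheme() != null && base.getRawAuthority() != null;
        String prefix = hasAuthority
                ? base.getScheme() + "://" + base.getRawAuthority()
                : null;
        String path = base.getRawPath();

        // Case 1: give an authority-only base an explicit root path first.
        if (hasAuthority && (path == null || path.isEmpty())) {
            path = "/";
            String query = base.getRawQuery();
            base = URI.create(prefix + path + (query == null ? "" : "?" + query));
        }

        // Case 2: a query-only reference keeps the base path, replaces the query.
        if (hasAuthority && cleaned.startsWith("?")) {
            return URI.create(prefix + path + cleaned);
        }

        // Everything else is left to the standard resolver.
        return base.resolve(URI.create(cleaned));
    }

    public static void main(String[] args) {
        URI root = URI.create("http://www.domain.com");
        URI page = URI.create("http://www.domain.com/index.php");
        System.out.println(resolve(root, "test.html"));
        System.out.println(resolve(page, "?test=true"));
        System.out.println(resolve(root, "http://www.yahoo.com\n"));
    }
}
```

Since resolve() is static and keeps no state, it also avoids creating one
object per base/link pair, which was the concern raised about the
constructor. Hrefs that are still invalid after trimming are not handled
here and surface as an IllegalArgumentException.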
