uima-dev mailing list archives

From Richard Eckart de Castilho <...@apache.org>
Subject Re: Avoid indexing of old UIMA documentation
Date Thu, 07 Apr 2016 20:57:18 GMT
We could try this:


# robots.txt for http://uima.apache.org

User-agent: *
Disallow: /docs/d/
Allow: /docs/d/ruta-current/
Allow: /docs/d/uima-addons-current/
Allow: /docs/d/uima-as-current/
Allow: /docs/d/uima-ducc-current/
Allow: /docs/d/uimacpp-current/
Allow: /docs/d/uimafit-current/
Allow: /docs/d/uimaj-current/


Sources on the net say that "Allow" wasn't part of the original robots.txt
specification, so if we do the above, it might be that some search engines stop
indexing the docs entirely. We might want to set the user-agent to "googlebot",
which does support "Allow".
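
A variant restricted to Googlebot could look roughly like this (a sketch only;
the fallback stanza for other crawlers and the exact paths would need checking):


# robots.txt for http://uima.apache.org

User-agent: googlebot
Disallow: /docs/d/
Allow: /docs/d/uimaj-current/

User-agent: *
Disallow:


The second stanza leaves everything open for crawlers that don't understand
"Allow", at the cost of them still indexing the old docs.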

Also, not all of the documentation sets use the "*-current" trick yet. But that
is easy to fix.


-- Richard

> On 07.04.2016, at 22:40, Richard Eckart de Castilho <rec@apache.org> wrote:
> We can just disallow /d and then allow all the  *-current folders
> under it explicitly. The only difference I see is that we'd have
> a couple more entries in the robots.txt.
> -- Richard
>> On 07.04.2016, at 22:36, Marshall Schor <msa@schor.com> wrote:
>> Hi,
>> This sounds like a good idea to me :-)
>> There's possibly one small issue with changing the folder structure. The DOCBOOK
>> schemes have some fancy way to link between docbooks; these require that the
>> books be kept relative to one another in some file tree structure. As long as
>> that's not changed, I think there will be no problem.
>> If anyone's curious, the relevant bits of config info are in the
>> uima-docbook-olink project, in the various "site.xml" files.  You can see refs
>> to the famous "d" folder there.  There may be a dependency on the "books" being
>> just one directory layer under d/, so adding an extra layer might break things
>> (but I'm not sure...).
>> Maybe there's a way to do this without introducing a new level in the directory?
>> -Marshall
>> On 4/6/2016 4:43 PM, Richard Eckart de Castilho wrote:
>>> Hi all,
>>> I believe some time back we were talking about a strategy to avoid search
>>> engines pointing to ancient versions of the UIMA documentation.
>>> I have read a bit on rel="canonical" and robots.txt.
>>> 1) per webpage - Apparently, one can place a `link rel="canonical"` element on
>>> any HTML page. Search engines seeing this tag will then not index this page
>>> because it is considered to be a duplicate of whatever other page the link
>>> points to.
>>> 2) via HTTP header/htaccess - Since we probably don't want to patch up all our
>>> JavaDoc files, the information about a canonical source can also be sent in
>>> the HTTP header, e.g. via a suitable htaccess file.
>>> I guess the idea would be that for any old documentation page, we would want
>>> it to point to its latest version as its canonical source. I mean for every
>>> page, not only for the index page. This seems a bit tedious.
>>> My suggestion would be an alternative that exploits the website folder
>>> structure and uses robots.txt.
>>> We disallow indexing of the "d" folder on the UIMA website.
>>> We place all the "*-current" folders (svn copies of the latest documentation
>>> versions) under a dedicated folder (e.g. "d/current") and allow indexing that.
>>> In that way, the outdated versions of the documentation should be hidden from
>>> the search engines and the respective latest versions should be indexed.
>>> Opinions? Does anybody have experience with SEO?
>>> Cheers,
>>> -- Richard
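
[Editor's note: the two canonical-source mechanisms sketched in points 1) and 2)
above could look roughly like this; the URLs and paths are illustrative, not the
actual uima.apache.org layout.]

In an HTML page (point 1):

<link rel="canonical" href="http://uima.apache.org/d/uimaj-current/index.html"/>

In an htaccess file (point 2), sending the same information as an HTTP Link
header, so the JavaDoc files themselves stay untouched:

<IfModule mod_headers.c>
  Header set Link "<http://uima.apache.org/d/uimaj-current/index.html>; rel=\"canonical\""
</IfModule>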
