uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Avoid indexing of old UIMA documentation
Date Thu, 07 Apr 2016 21:21:35 GMT
+1 -Marshall

On 4/7/2016 4:57 PM, Richard Eckart de Castilho wrote:
> We could try this:
>
> --- 
>
> # robots.txt for http://uima.apache.org
>
> User-agent: *
> Disallow: /docs/d/
> Allow: /docs/d/ruta-current/
> Allow: /docs/d/uima-addons-current/
> Allow: /docs/d/uima-as-current/
> Allow: /docs/d/uima-ducc-current/
> Allow: /docs/d/uimacpp-current/
> Allow: /docs/d/uimafit-current/
> Allow: /docs/d/uimaj-current/
>
> ---
>
> Sources on the net say that "Allow" wasn't part of the original robots.txt
> specification, so if we do the above, some search engines might stop indexing
> the docs entirely. We might want to set the user-agent to "googlebot".
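>
> For example (untested sketch), a Googlebot-specific section could look like
> this, with the remaining per-project "Allow" lines repeated as above:
>
> User-agent: Googlebot
> Disallow: /docs/d/
> Allow: /docs/d/uimaj-current/
> ...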
>
> Also, not all of the documentation sets use the "*-current" trick yet. But that
> is easy to fix.
>
> Cheers,
>
> -- Richard
>
>> On 07.04.2016, at 22:40, Richard Eckart de Castilho <rec@apache.org> wrote:
>>
>> We can just disallow /d and then allow all the *-current folders
>> under it explicitly. The only difference I see is that we'd have
>> a couple more entries in the robots.txt.
>>
>> -- Richard
>>
>>> On 07.04.2016, at 22:36, Marshall Schor <msa@schor.com> wrote:
>>>
>>> Hi,
>>>
>>> This sounds like a good idea to me :-)
>>>
>>> There's possibly one small issue with changing the folder structure.  The DocBook
>>> olink scheme has a fancy way to link between docbooks; it requires that the
>>> books be kept relative to one another in a particular file tree structure.  As long
>>> as that's not changed, I think there will be no problem.
>>>
>>> If anyone's curious, the relevant bits of config info are in the
>>> uima-docbook-olink project, in the various "site.xml" files.  You can see refs
>>> to the famous "d" folder there.  There may be a dependency on the "books" being
>>> just one directory layer under d/, so putting an extra layer might break things
>>> (but I'm not sure...).
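>>>
>>> As a rough illustration only (not the actual content of our site.xml files),
>>> an olink target database of that kind has roughly this shape, and this is
>>> where the directory layout under d/ gets baked in:
>>>
>>>   <targetset>
>>>     <sitemap>
>>>       <dir name="d">
>>>         <dir name="uimaj-current">  <!-- one <dir> per directory level -->
>>>           <document targetdoc="tools" baseuri="tools/">
>>>             <!-- generated target data for that book goes here -->
>>>           </document>
>>>         </dir>
>>>       </dir>
>>>     </sitemap>
>>>   </targetset>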
>>>
>>> Maybe there's a way to do this without introducing a new level in the directory?
>>>
>>> -Marshall
>>>
>>> On 4/6/2016 4:43 PM, Richard Eckart de Castilho wrote:
>>>> Hi all,
>>>>
>>>> I believe some time back we were talking about a strategy to avoid search
>>>> engines pointing to ancient versions of the UIMA documentation.
>>>>
>>>> I have read a bit on rel="canonical" and robots.txt.
>>>>
>>>> 1) per webpage - Apparently, one can place a `link rel="canonical"` element
>>>> on any HTML page. Search engines seeing this tag will then not index this page
>>>> because it is considered to be a duplicate of whatever other page the link points to.
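>>>>
>>>> As an untested sketch (the href below is just a made-up example), such a tag
>>>> in the head of an old page would look something like:
>>>>
>>>>   <link rel="canonical"
>>>>         href="http://uima.apache.org/docs/d/uimaj-current/tools.html"/>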
>>>>
>>>> 2) via http header/htaccess - Since we probably don't want to patch up all
>>>> our JavaDoc files, the information about a canonical source can also be sent
>>>> in the HTTP header, e.g. via a suitable htaccess file.
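>>>>
>>>> With Apache's mod_headers that could look roughly like this (again untested,
>>>> and the target URL is only an example):
>>>>
>>>>   <Files "index.html">
>>>>     Header set Link "<http://uima.apache.org/docs/d/uimaj-current/index.html>; rel=\"canonical\""
>>>>   </Files>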
>>>>
>>>> I guess the idea would be that for any old documentation page, we would want
>>>> it to point to its latest version as its canonical source. I mean for every page,
>>>> not only for the index page. This seems a bit tedious.
>>>>
>>>> My suggestion would be an alternative that exploits the website folder
>>>> structure and uses robots.txt.
>>>>
>>>> We disallow indexing of the "d" folder on the UIMA website.
>>>> We place all the "*-current" folders (svn copies of the latest documentation
>>>> versions) under a dedicated folder (e.g. "d/current") and allow indexing that.
>>>>
>>>> In that way, the outdated versions of the documentation should be hidden
>>>> from the search engines and the respective latest versions should be indexed.
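>>>>
>>>> The robots.txt would then shrink to something like this (just a sketch,
>>>> assuming the copies really end up under a d/current folder):
>>>>
>>>>   User-agent: *
>>>>   Disallow: /docs/d/
>>>>   Allow: /docs/d/current/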
>>>>
>>>> Opinions? Does anybody have experience with SEO?
>>>>
>>>> Cheers,
>>>>
>>>> -- Richard
>>>>
>>>>
>

