manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: PostgreSQL version to support MCF v2.10
Date Wed, 05 Sep 2018 07:57:27 GMT
I coded up the web connector feature I think we need.  See CONNECTORS-1528;
I've attached a patch.  Please apply and test it out to see if it solves
the case problem for your IIS site.
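
For readers following along, here is a minimal sketch of the kind of
lower-casing step discussed in this thread. It is not the CONNECTORS-1528
patch itself: the class name, the method name, and the boolean flag (standing
in for a per-rule checkbox) are all hypothetical. It lower-cases the whole URL
string, which is only appropriate when the site treats every part of the URL
as case-insensitive.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of ManifoldCF.
final class LowerCaseUrlSketch {

  // Returns the URL unchanged unless lower-casing is enabled for it.
  static String canonicalize(String url, boolean lowerCase) {
    if (url == null || !lowerCase) {
      return url;
    }
    // Locale.ROOT keeps the result independent of the JVM's default locale.
    return url.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(canonicalize("http://Example.com/Docs/Page.ASPX", true));
    System.out.println(canonicalize("http://Example.com/Docs/Page.ASPX", false));
  }
}
```

Doing this where the URL is first normalized, rather than in the output
connector, means the framework only ever sees one ID per document.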

For the "//" issue, can you be more specific about the mapping you need to
do?
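
As a starting point for that discussion, one possible reading of the "//"
mapping (taken from the earlier note about removing dual "/" in the _id) is
sketched below. This is an assumption, not a confirmed requirement: it
collapses runs of "/" after the scheme separator and leaves the query string
and fragment untouched. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of ManifoldCF.
final class SlashCollapseSketch {

  // Matches a leading scheme separator such as "http://" or "https://".
  private static final Pattern SCHEME =
      Pattern.compile("[A-Za-z][A-Za-z0-9+.-]*://");

  static String collapseSlashes(String url) {
    // Skip past the scheme so that its "//" is preserved.
    Matcher m = SCHEME.matcher(url);
    int start = m.lookingAt() ? m.end() : 0;

    // Stop at the first '?' or '#': query and fragment are left as they are.
    int end = url.length();
    for (int i = start; i < url.length(); i++) {
      char c = url.charAt(i);
      if (c == '?' || c == '#') {
        end = i;
        break;
      }
    }

    // Collapse every run of two or more slashes in the host/path part.
    String middle = url.substring(start, end).replaceAll("/{2,}", "/");
    return url.substring(0, start) + middle + url.substring(end);
  }

  public static void main(String[] args) {
    System.out.println(collapseSlashes("http://host//a///b/c.aspx?next=//x"));
  }
}
```

Note that a URL without a scheme (for example a protocol-relative "//host/x")
would also have its leading "//" collapsed by this sketch, so it would need
adjusting if such URLs matter.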

Karl


On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <daddywri@gmail.com> wrote:

> Hi Steph,
>
> Right, you wouldn't want to touch the framework.
>
> The effect of lower-casing the documentURI parameter in the
> addOrReplaceDocumentWithException method in an output connector would be to
> map multiple, independently-fetched, documents that differ only by the case
> of the URL together into one document in the index.  The ManifoldCF
> assumption is that a document with a certain URI can be tracked in the
> index using exactly that URI.  Mapping the URI to lower case would break
> that assumption so the framework would make the wrong decision in many
> cases.
>
> If you are picking up documents using the web connector, and you are
> getting duplicate documents because the document URLs are sloppy, it is
> therefore essential that, INSTEAD of mapping the document URI to lower
> case in the output connector, you map to lower case in the repository
> connector.  Otherwise the framework will not work right.
>
> There is a tab in the web connector that allows you to configure URL
> normalization, called "Canonicalization".  This would be a very appropriate
> place to add URL mapping to lower case.  It should be as simple as adding
> one more checkbox column in the table, and modifying the method that does
> the URL processing to include lower-casing.
>
> Karl
>
>
>
> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <steph@remcam.net>
> wrote:
>
>> Unless I have a massive misunderstanding somewhere...
>>
>>
>>
>>
>> *Steph van Schalkwyk*
>> Principal, Remcam Search Engines
>> +1.314.452.2896    steph@remcam.net   http://remcam.net
>> Skype: svanschalkwyk
>> http://linkedin.com/in/vanschalkwyk
>>
>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <steph@remcam.net>
>> wrote:
>>
>>> Hi Karl
>>> I'm addressing it in the ES Output Connector.
>>> Not touching the framework :)
>>> S
>>>
>>>
>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Let's make sure we're talking about the same thing.
>>>>
>>>> Here is the output connector method that receives the ID (as the
>>>> documentURI parameter):
>>>>
>>>>   public int addOrReplaceDocumentWithException(String documentURI,
>>>>       VersionContext pipelineDescription, RepositoryDocument document,
>>>>       String authorityNameString, IOutputAddActivity activities)
>>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>>
>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.  If
>>>> you make it case insensitive in an output connector, this will potentially
>>>> break a lot of things, for example incremental indexing (which organizes
>>>> the last indexed version by document ID).
>>>>
>>>> I therefore highly recommend that any "sloppiness" in this parameter be
>>>> addressed in the Repository Connector that constructs it.  If the connector
>>>> is crawling a repository that believes that URLs are case insensitive then
>>>> it should map these IDs to lower case.  If not, then it shouldn't.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net>
>>>> wrote:
>>>>
>>>>> Hi Karl.
>>>>> The issue is that the ES Output Connector uses the URI to create the
>>>>> _id. When used with IIS, which allows case variation in the URI, it
>>>>> creates multiple documents. Clients on Windows IIS are rarely cognizant
>>>>> of that issue as IIS is so lax in policing that out of the box.
>>>>> Currently, every case variation in the URI results in a new doc in the
>>>>> index. This is only in the ES output connector.
>>>>> I can add an optional checkbox to determine that particular action
>>>>> if that would help?
>>>>> Regards,
>>>>> Steph
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the update.
>>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>>> that care about case.  The web connector is one such because it's up to
>>>>>> the web service to decide if case matters, so the web connector does not
>>>>>> view URLs with case differences as being the same.  Other connectors
>>>>>> will likely care as well.  So I don't think lower-casing the document ID
>>>>>> is a smart thing to do.
>>>>>>
>>>>>> You could add this bit of configuration to the web connector, if
>>>>>> that's what you are using, or to whatever other connector constructs
>>>>>> the ID.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <steph@remcam.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Karl.
>>>>>>>
>>>>>>> I'll look into that.
>>>>>>>
>>>>>>> Another note:
>>>>>>> Regarding the ES connector - I have made three changes to it and
>>>>>>> should probably diff them for inclusion after approval:
>>>>>>> 1. Lowercased the _id (the doc URI).
>>>>>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy sources,
>>>>>>> particularly IIS...)
>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does
>>>>>>> not allow access to _id in the schema anymore, so no copy_field etc.
>>>>>>> from _id). Hence "url".
>>>>>>>
>>>>>>> Regards,
>>>>>>> Steph
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we may
>>>>>>>> need to upgrade it.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <
>>>>>>>> steph@remcam.net> wrote:
>>>>>>>>
>>>>>>>>> Olivier
>>>>>>>>> By all means.
>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>> which has to be restarted about once a week. Still trying to find
>>>>>>>>> the issue.
>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres 10
>>>>>>>>> may be a bit slower. I have no empirical evidence at the moment as
>>>>>>>>> I'm still delivering the project to UAT. Will keep you posted.
>>>>>>>>> Regards,
>>>>>>>>> Steph
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>>>>>> olivier.tavard@francelabs.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry for
>>>>>>>>>> the late answer). I will test it soon.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> These are the rpm installs:
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>
>>>>>>>>>> postgresql_version: 10
>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>> postgresql_packages:
>>>>>>>>>> - postgresql10-libs
>>>>>>>>>> - postgresql10
>>>>>>>>>> - postgresql10-server
>>>>>>>>>> - postgresql10-contrib
>>>>>>>>>> # - postgresql10-devel
>>>>>>>>>>
>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>> - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>> - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>> - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>> - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>> - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>> - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>
>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>> - option: unix_socket_directories
>>>>>>>>>>   value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>
>>>>>>>>>> - option: standard_conforming_strings
>>>>>>>>>>   value: 'on'
>>>>>>>>>>
>>>>>>>>>> - option: shared_buffers
>>>>>>>>>>   value: '1024MB'
>>>>>>>>>>
>>>>>>>>>> # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>> # checkpoint_segments=300
>>>>>>>>>> - option: max_wal_size
>>>>>>>>>>   value: '14400MB'
>>>>>>>>>>
>>>>>>>>>> - option: min_wal_size
>>>>>>>>>>   value: '80MB'
>>>>>>>>>>
>>>>>>>>>> - option: maintenance_work_mem
>>>>>>>>>>   value: '2MB'
>>>>>>>>>>
>>>>>>>>>> - option: listen_addresses
>>>>>>>>>>   value: '*'
>>>>>>>>>>
>>>>>>>>>> - option: max_connections
>>>>>>>>>>   value: '400'
>>>>>>>>>>
>>>>>>>>>> - option: checkpoint_timeout
>>>>>>>>>>   value: '900'
>>>>>>>>>>
>>>>>>>>>> - option: datestyle
>>>>>>>>>>   value: "iso, mdy"
>>>>>>>>>>
>>>>>>>>>> - option: autovacuum
>>>>>>>>>>   value: 'off'
>>>>>>>>>>
>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night,
>>>>>>>>>> # lazy vacuum every night)
>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>   cron:
>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>     hour: 8
>>>>>>>>>>     minute: 0
>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>   cron:
>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>     weekday: 0
>>>>>>>>>>     hour: 10
>>>>>>>>>>     minute: 0
>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>   cron:
>>>>>>>>>>     name: reindex
>>>>>>>>>>     weekday: 0
>>>>>>>>>>     hour: 12
>>>>>>>>>>     minute: 0
>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>> Steph
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>>
