Unless I have a massive misunderstanding somewhere...




Steph van Schalkwyk
Principal, Remcam Search Engines


On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <steph@remcam.net> wrote:
Hi Karl
I'm addressing it in the ES Output Connector. 
Not touching the framework :)
S



Steph van Schalkwyk
Principal, Remcam Search Engines


On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com> wrote:
Let's make sure we're talking about the same thing.

Here is the output connector method that receives the ID (as the documentURI parameter):

  public int addOrReplaceDocumentWithException(String documentURI, VersionContext pipelineDescription, RepositoryDocument document, String authorityNameString, IOutputAddActivity activities)
    throws ManifoldCFException, ServiceInterruption, IOException;

ManifoldCF doesn't say anywhere that this ID is case insensitive.  If you make it case insensitive in an output connector, this will potentially break a lot of things, for example incremental indexing (which organizes the last indexed version by document ID).

I therefore highly recommend that any "sloppiness" in this parameter be addressed in the Repository Connector that constructs it.  If the connector is crawling a repository that treats URLs as case insensitive, then it should map these IDs to lower case.  If not, then it shouldn't.
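
As a rough illustration of that recommendation, here is a minimal sketch of how a repository connector might fold case only when the repository is known to be case insensitive. The class name, method name, and the boolean flag are all hypothetical; none of them are part of the ManifoldCF API, and the flag stands in for whatever configuration option the connector would actually expose.

```java
import java.util.Locale;

/** Hypothetical helper, not part of ManifoldCF. */
final class DocumentIdNormalizer {
  private DocumentIdNormalizer() {}

  /**
   * Returns the document ID unchanged unless the repository is
   * declared case insensitive, in which case it is lower-cased.
   * Locale.ROOT avoids locale-specific surprises (e.g. Turkish "I").
   */
  static String normalize(String documentId, boolean repositoryIsCaseInsensitive) {
    if (documentId == null || !repositoryIsCaseInsensitive) {
      return documentId;
    }
    return documentId.toLowerCase(Locale.ROOT);
  }
}
```

Doing this where the ID is constructed means every downstream consumer, including incremental indexing, sees one consistent ID per document.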

Karl


On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net> wrote:
Hi Karl.
The issue is that the ES Output Connector uses the URI to create the _id. When used with IIS, which allows case variation in the URI, it creates multiple documents. Clients on Windows IIS are rarely aware of the issue, because IIS is so lax about policing case out of the box.
Currently, every case variation in the URI results in a new doc in the index. This happens only in the ES output connector.
I can add an optional checkbox to control that behavior, if that would help?
Regards,
Steph





Steph van Schalkwyk
Principal, Remcam Search Engines


On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com> wrote:
Thanks for the update.
Lower-casing the ID would be fine, except that some connectors care about case.  The web connector is one of them: it's up to the web service to decide whether case matters, so the web connector does not treat URLs that differ only in case as being the same.  Other connectors will likely care as well. So I don't think lower-casing the document ID is a smart thing to do.

You could add this bit of configuration to the web connector, if that's what you are using, or to whatever other connector constructs the ID.

Karl



On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <steph@remcam.net> wrote:
Thanks Karl. 

I'll look into that.

Another note:
Regarding the ES connector - I have made three changes to it and should probably diff them for inclusion after approval:
1. Lowercased the _id (the doc URI).
2. Collapsed doubled slashes ("//") in the _id (I have sloppy sources, particularly IIS...).
3. Added a "url" metadata field to the ES connector (ES 6.x no longer allows access to _id in the schema, so no copy_field etc. from _id). Hence "url".
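
For item 2, a minimal sketch of what collapsing doubled slashes could look like. This is not the actual diff; the class and method names are hypothetical, and it assumes that the "//" after the scheme must be kept and that the query string should be left alone.

```java
/** Hypothetical helper, not the actual ES connector change. */
final class UriSlashCollapser {
  private UriSlashCollapser() {}

  /** Collapses runs of "/" in the host-and-path part of a URI. */
  static String collapseSlashes(String uri) {
    if (uri == null) {
      return null;
    }
    // Keep "scheme://" intact (assumption: at most one scheme separator).
    int schemeEnd = uri.indexOf("://");
    int start = schemeEnd >= 0 ? schemeEnd + 3 : 0;
    String head = uri.substring(0, start);
    String rest = uri.substring(start);

    // Leave the query string untouched; it may legitimately contain "//".
    int q = rest.indexOf('?');
    String path = q >= 0 ? rest.substring(0, q) : rest;
    String query = q >= 0 ? rest.substring(q) : "";

    return head + path.replaceAll("/{2,}", "/") + query;
  }
}
```

With this sketch, "http://server//Docs///a.html" becomes "http://server/Docs/a.html".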

Regards,
Steph




Steph van Schalkwyk
Principal, Remcam Search Engines


On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Steph, I suspect that Jetty is leaking some resource, and we may need to upgrade it.

Karl


On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <steph@remcam.net> wrote:
Olivier
By all means.
The only issue I have seen (totally unrelated) is with Jetty, which has to be restarted about once a week. Still trying to find the issue.
I may be overly sensitive, but I suspect MCF 2.10 with Postgres 10 may be a bit slower. I have no empirical evidence at the moment, as I'm still delivering the project to UAT. Will keep you posted.
Regards,
Steph



Steph van Schalkwyk
Principal, Remcam Search Engines


On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <olivier.tavard@francelabs.com> wrote:
Hello,

Thanks a lot for sharing your PostgreSQL configuration (sorry for the late answer). I will test it soon.

Best regards,


Olivier TAVARD


Le 23 août 2018 à 19:20, Steph van Schalkwyk <steph@remcam.net> a écrit :



These are the rpm installs:

postgresql_version: 10
postgresql_data_dir: /var/lib/pgsql/10/data
postgresql_bin_path: /usr/pgsql-10/bin
postgresql_config_path: /var/lib/pgsql/10/data
postgresql_daemon: postgresql-10.service
postgresql_packages:
- postgresql10-libs
- postgresql10
- postgresql10-server
- postgresql10-contrib
# - postgresql10-devel

postgresql_hba_entries:
- { type: local, database: all, user: postgres, auth_method: peer }
- { type: local, database: all, user: all, auth_method: peer }
- { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
- { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
- { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
- { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }

postgresql_global_config_options:
  - option: unix_socket_directories
    value: '{{ postgresql_unix_socket_directories | join(",") }}'

  - option: standard_conforming_strings
    value: 'on'

  - option: shared_buffers
    value: '1024MB'

  # max_wal_size = (3 * checkpoint_segments) * 16MB
  # checkpoint_segments=300
  - option: max_wal_size
    value: '14400MB'

  - option: min_wal_size
    value: '80MB'

  - option: maintenance_work_mem
    value: '2MB'

  - option: listen_addresses
    value: '*'

  - option: max_connections
    value: '400'

  - option: checkpoint_timeout
    value: '900'

  - option: datestyle
    value: "iso, mdy"

  - option: autovacuum
    value: 'off'

# vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
- name: add postgresql cron lazy vacuum
  cron:
    name: lazy_vacuum
    hour: 8
    minute: 0
    job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
- name: add postgresql cron full vacuum
  cron:
    name: full_vacuum
    weekday: 0
    hour: 10
    minute: 0
    job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
# re-index all databases once a week
- name: add postgresql cron reindex
  cron:
    name: reindex
    weekday: 0
    hour: 12
    minute: 0
    job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "


This is how I run 2.10.
Been running fine for some weeks without user intervention.
@Karl: Any comments please?
Steph