james-server-dev mailing list archives

From "Benoit Tellier (Jira)" <server-...@james.apache.org>
Subject [jira] [Commented] (JAMES-3107) Log request when P99 is exceeded
Date Sun, 16 May 2021 04:29:00 GMT

    [ https://issues.apache.org/jira/browse/JAMES-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345621#comment-17345621 ]

Benoit Tellier commented on JAMES-3107:

https://github.com/apache/james-project/pull/434 (includes detailed performance insights)

Using metrics to capture the P99 leads to over-snapshotting and incurs a 33% throughput penalty
on JMAP draft.

Other SLOW logging implementations rely instead on manual time
measurements (e.g. the DataStax Cassandra driver).
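A minimal sketch of such manual time measurement, assuming a plain Java setting. `SlowOperationLogger`, its `time` method and the `Consumer<String>` sink are hypothetical names for illustration only; they are not a James or DataStax API. Each call is timed with `System.nanoTime()` and reported only when it exceeds a fixed threshold, so no metric registry or percentile snapshot is involved.

```java
import java.time.Duration;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical slow-operation logger (not a James or DataStax API).
 * Each call is timed manually with System.nanoTime(); only calls slower than
 * a fixed threshold are reported. No metric registry or percentile snapshot is used.
 */
final class SlowOperationLogger {
    private final long thresholdNanos;
    private final Consumer<String> sink; // where slow-call messages go, e.g. a logger's info method

    SlowOperationLogger(Duration threshold, Consumer<String> sink) {
        this.thresholdNanos = threshold.toNanos();
        this.sink = sink;
    }

    /** Runs the operation, returns its result, and reports it if it exceeded the threshold. */
    <T> T time(String operationName, Supplier<T> operation) {
        long start = System.nanoTime();
        try {
            return operation.get();
        } finally {
            long elapsedNanos = System.nanoTime() - start;
            if (elapsedNanos > thresholdNanos) {
                sink.accept(operationName + " took " + elapsedNanos / 1_000_000 + " ms");
            }
        }
    }
}
```

In this sketch the per-call overhead is two `System.nanoTime()` calls and a comparison, with no shared state to update.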

Given the low benefits and high costs, I propose
to deprecate the logP99 API and migrate away from it.

Tools like Glowroot can capture slow traces at a
lower cost and should be used instead.

> Log request when P99 is exceeded
> --------------------------------
>                 Key: JAMES-3107
>                 URL: https://issues.apache.org/jira/browse/JAMES-3107
>             Project: James Server
>          Issue Type: New Feature
>          Components: Metrics
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.5.0
> Given our current tooling I struggle to correctly review slow requests from James.
> My current procedure is:
>   - In Grafana, identify the timestamp of a spike
>   - Grok logs in Kibana until I find something that could correspond
>   - Pray and hope my analysis holds.
> This is time-consuming, hard to do, and unreliable.
> Identifying slow queries is important as it can point us to critical paths to optimize.
> Hence, I propose to log an info message when the P99 is exceeded for high-level functions (JMAP
> methods, IMAP processors, matchers, mailets and overall processing, mailbox listeners, and remote
> To avoid log spamming, I propose to log only when a function-specific threshold
> is exceeded (defaulting to 100ms).
> I believe it will help us come up with more meaningful performance analyses and better
> fixes, for the greater good of our production platforms.
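For reference, the behaviour proposed in the original issue (log a call when it exceeds the P99, but only above a per-function threshold defaulting to 100ms) could be sketched as follows. This is a hypothetical illustration, not the actual logP99 API: `P99SlowLogger`, the fixed-size ring buffer of recent durations and the nearest-rank P99 are assumptions chosen for simplicity. The P99 is recomputed from a copy of the recent samples on every call, which loosely illustrates where the snapshotting cost mentioned in the comment comes from.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.function.Consumer;

/**
 * Hypothetical sketch of the proposal (not the James logP99 API): log a call only
 * when it is slower than both the recent P99 and a per-function threshold.
 * The P99 is estimated from a fixed-size ring buffer of recent durations.
 */
final class P99SlowLogger {
    static final Duration DEFAULT_THRESHOLD = Duration.ofMillis(100);

    private final long[] samples;   // recent durations, in nanoseconds
    private int count;              // number of valid samples (<= samples.length)
    private int next;               // ring-buffer write position
    private final long thresholdNanos;
    private final Consumer<String> sink;

    P99SlowLogger(int windowSize, Duration threshold, Consumer<String> sink) {
        this.samples = new long[windowSize];
        this.thresholdNanos = threshold.toNanos();
        this.sink = sink;
    }

    /** Records one duration and logs it if it exceeds both the current P99 and the threshold. */
    synchronized void record(String function, long elapsedNanos) {
        long p99 = count == 0 ? Long.MAX_VALUE : currentP99(); // no history yet: never log
        if (elapsedNanos > p99 && elapsedNanos > thresholdNanos) {
            sink.accept(function + " took " + elapsedNanos / 1_000_000 + " ms, above P99 of "
                + p99 / 1_000_000 + " ms");
        }
        samples[next] = elapsedNanos;
        next = (next + 1) % samples.length;
        count = Math.min(count + 1, samples.length);
    }

    /** Nearest-rank P99 over the window: this per-call copy and sort is the costly "snapshot". */
    private long currentP99() {
        long[] snapshot = Arrays.copyOf(samples, count);
        Arrays.sort(snapshot);
        int rank = (99 * count + 99) / 100; // ceil(0.99 * count) in integer arithmetic, 1-based
        return snapshot[rank - 1];
    }
}
```

Requiring both conditions keeps the log quiet under uniformly slow traffic (nothing exceeds its own P99) and for fast outliers (they stay under the threshold).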

This message was sent by Atlassian Jira

To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org
