Cache inconsistency: `routeOwnersByClusterNode` out of sync
Activity
Guus der Kinderen March 24, 2025 at 8:33 PM (edited)
These two log messages were recorded on one cluster node:
2025.03.24 16:52:54 TRACE [hz.openfire.event-5]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCache[cache: Routing Users Cache] - Processing entry added event of node 'fd3e5003-0d3c-4a5a-ab1d-74d503aceefa' for key 'takurafelix@igniterealtime.org/phone'
2025.03.24 16:53:13 WARN [TaskEngine-pool-1930]: org.jivesoftware.openfire.SessionManager - Not removing detached session 'takurafelix@igniterealtime.org/phone' (9wdudv22r) that appears to have been replaced by another session.
Note that the first message is the result of a cache update that happens on a different cluster node than the local server.
The second message is generated by a bit of code that has this comment:
OF-1923: Only close the session if it has not been replaced by another session (if the session has been replaced, then the condition below will compare to distinct instances). This should not occur (but has been observed, prior to the fix of OF-1923). This check is left in as a safeguard.
Assuming that the first message is the result of a client reconnecting (using the same full JID) to a different cluster node, this check seems to prevent the old session on this cluster node from being removed. This is likely a source of the data consistency issues that are the subject of this issue.
It is undesirable for a new session (on a different cluster node) to be established before the old session for the same full JID has been removed from the collection of 'detached' sessions. That detached session should be removed (as its removal triggers a presence unavailable broadcast) before the new session can bind to the same resource/full JID.
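To make the suspected interaction concrete, here is a minimal sketch of an identity-based safeguard of the kind the OF-1923 comment describes. It is not Openfire's code: `DetachedSessionSketch`, `Session`, `route`, `removeDetached` and `isRouted` are hypothetical names, and a single in-memory map stands in for whatever lookup the real check performs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model; class and method names are illustrative, not Openfire's API.
class DetachedSessionSketch {

    static final class Session {
        final String fullJid;
        final String streamId;

        Session(String fullJid, String streamId) {
            this.fullJid = fullJid;
            this.streamId = streamId;
        }
    }

    // Full JID -> session that is currently routed for that JID (assumption:
    // one map stands in for the real routing lookup).
    private final Map<String, Session> routedSessions = new ConcurrentHashMap<>();

    // Registers a session, replacing any earlier session for the same full JID.
    void route(Session session) {
        routedSessions.put(session.fullJid, session);
    }

    // Returns true if the detached session was removed, false if the
    // safeguard skipped it because another instance now holds the full JID.
    boolean removeDetached(Session detached) {
        Session current = routedSessions.get(detached.fullJid);
        if (current != detached) {
            // Distinct instances: the session appears to have been replaced,
            // so nothing is cleaned up for the old one.
            return false;
        }
        // Session does not override equals, so this is still an identity match.
        return routedSessions.remove(detached.fullJid, detached);
    }

    boolean isRouted(String fullJid) {
        return routedSessions.containsKey(fullJid);
    }
}
```

In this model, routing a replacement for the same full JID before `removeDetached` runs on the old instance makes the call return `false`. Any bookkeeping that depends on that removal, such as a per-node ownership entry, would then never be cleaned up for the old session.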
If the above rationale holds true, then the cause for this issue may be related to the fix for https://igniterealtime.atlassian.net/browse/OF-3039
Guus der Kinderen March 24, 2025 at 7:58 PM
Bumping up the priority of this issue again, as the first commit did not resolve the problem, and additional cache inconsistencies (with regard to registered occupants of MUCs) have been observed that seem to be related to this.
Guus der Kinderen March 20, 2025 at 3:54 PM
I’ve reduced the priority of this issue, as it affects data that has a decent chance of being ‘self-corrected’, while the data is only used in exceptional circumstances (cluster splits).
This problem seems to have appeared recently in unreleased development builds of Openfire 5.0.0.
`org.jivesoftware.openfire.spi.RoutingTableImpl#routeOwnersByClusterNode` is a map that, on each cluster node, keeps track of what sessions (identified by full JID) live on what cluster nodes (identified by NodeID).
Recently, cache inconsistency messages of this format were detected:
It appears that there was an inconsistent state in which an entry was not removed from `routeOwnersByClusterNode`. So far, I’ve not been able to reproduce the problem.
This problem seems to be ‘new’. I suspect that it got introduced as a byproduct of behavior changes introduced by recent fixes around session management, such as https://igniterealtime.atlassian.net/browse/OF-1811 and https://igniterealtime.atlassian.net/browse/OF-3039. There have been various related changes applied recently.
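For readers unfamiliar with the structure, the following is a rough sketch of a map with the shape described above (cluster node ID to the full JIDs hosted on that node). It shows how a missed removal leaves a stale entry behind. The class `RouteOwnerIndexSketch` and its methods are hypothetical, plain strings stand in for NodeID and JID, and the sketch assumes that a full JID is expected to be owned by at most one cluster node at a time.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the structure described above; not Openfire's actual code.
class RouteOwnerIndexSketch {

    // Cluster node ID -> full JIDs of the sessions hosted on that node.
    private final Map<String, Set<String>> routeOwnersByClusterNode = new ConcurrentHashMap<>();

    // Records that a session with this full JID lives on the given node.
    void routeAdded(String nodeId, String fullJid) {
        routeOwnersByClusterNode
            .computeIfAbsent(nodeId, k -> ConcurrentHashMap.newKeySet())
            .add(fullJid);
    }

    // Forgets the session on the given node; drops the node entry once it is empty.
    void routeRemoved(String nodeId, String fullJid) {
        routeOwnersByClusterNode.computeIfPresent(nodeId, (k, jids) -> {
            jids.remove(fullJid);
            return jids.isEmpty() ? null : jids;
        });
    }

    // Returns every node that is recorded as owning the full JID. Under the
    // assumption stated above, more than one owner indicates a stale entry.
    Set<String> ownersOf(String fullJid) {
        Set<String> owners = new TreeSet<>();
        routeOwnersByClusterNode.forEach((nodeId, jids) -> {
            if (jids.contains(fullJid)) {
                owners.add(nodeId);
            }
        });
        return owners;
    }
}
```

If a client reconnects on a second node and `routeRemoved` is never called for the first node, `ownersOf` reports both nodes for the same full JID, which is the kind of leftover entry described above.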