
[Java] Avoid bind conflicts when removing and adding a subscription to the same channel. #1955

Merged
vyazelenko merged 1 commit into master from retry-connect-on-bind-error
Mar 9, 2026

@ZachBray
Contributor

@ZachBray ZachBray commented Mar 9, 2026

CHANGE TO SUBSCRIPTIONS

In 1.49, we began aligning the Java driver with the C driver by moving socket opening onto the conductor thread, but socket closing stayed on the receiver thread. As a result, a REMOVE_SUBSCRIPTION command might not have "finished" by the time the conductor processed a subsequent ADD_SUBSCRIPTION, which could cause a bind error. A recent commit continued that alignment: the Java driver now both opens and closes sockets within the conductor agent. As discussed in that commit, the flow now looks like this:

Client -> Conductor: Remove Subscription
Conductor -> Receiver: Stop using associated sockets
Receiver -> Conductor: I've stopped using the associated sockets
Conductor -> Receiver: Stop using associated sockets
Receiver -> Conductor: I've stopped using the associated sockets
Conductor: closes sockets
Conductor: closes status indicator
Conductor -> Client: operation completed

However, that change alone was not sufficient to prevent bind conflicts, because a command such as ADD_SUBSCRIPTION could still be processed while the removal was in flight. For example:

Client -> Conductor: Remove Subscription
Conductor -> Receiver: Stop using associated sockets
Client -> Conductor: Add Subscription
Conductor: bind exception

In the Java driver, we now set the endpoint's statusIndicator counter to CLOSING to indicate that the endpoint should not be reused. The driver also detects this state when adding a subscription and sends a RESOURCE_TEMPORARILY_UNAVAILABLE error back to the client. This matches the C driver behaviour.

Client -> Conductor: Remove Subscription
Conductor: Set endpoint status to CLOSING
Conductor -> Receiver: Stop using associated sockets
Client -> Conductor: Add Subscription
		 FAILURE when opening socket with same port
Conductor: Find endpoint with CLOSING status.
Conductor -> Client: Error RESOURCE_TEMPORARILY_UNAVAILABLE
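
The check above can be sketched as a toy model. This is not the driver's code: the class, enum, and method names are hypothetical, and it assumes one endpoint per channel with a single subscription per endpoint (no reference counting).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the conductor's endpoint bookkeeping; not the driver's actual code. */
public class ClosingEndpointModel {
    public enum Status { ACTIVE, CLOSING }
    public enum Result { OK, RESOURCE_TEMPORARILY_UNAVAILABLE }

    // Assumption: one endpoint per channel and one subscription per endpoint.
    private final Map<String, Status> endpointStatusByChannel = new HashMap<>();

    /** Client -> Conductor: Add Subscription. */
    public Result addSubscription(String channel) {
        // An endpoint that is still CLOSING holds the socket, so refuse instead of binding.
        if (endpointStatusByChannel.get(channel) == Status.CLOSING) {
            return Result.RESOURCE_TEMPORARILY_UNAVAILABLE;
        }
        endpointStatusByChannel.put(channel, Status.ACTIVE);
        return Result.OK;
    }

    /** Client -> Conductor: Remove Subscription. Marks the endpoint before the receiver is asked to stop. */
    public void removeSubscription(String channel) {
        endpointStatusByChannel.replace(channel, Status.ACTIVE, Status.CLOSING);
    }

    /** Receiver -> Conductor: sockets released. Only now can the channel be bound again. */
    public void onReceiverReleasedSockets(String channel) {
        endpointStatusByChannel.remove(channel, Status.CLOSING);
    }
}
```

In this model an add that arrives between `removeSubscription` and `onReceiverReleasedSockets` gets RESOURCE_TEMPORARILY_UNAVAILABLE rather than a bind exception, and the same add succeeds once the receiver has released the sockets.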

There are now two ways to safely close and reopen a subscription for the same channel:

  1. Wait for the Subscription's channel status indicator to disappear after closing, before reopening. N.B. this only works when the closed subscription is the sole user of the endpoint.

  2. Catch RegistrationException when opening a subscription and retry on errorCode == RESOURCE_TEMPORARILY_UNAVAILABLE.
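
A minimal sketch of the second option follows. To keep it self-contained, `ErrorCode` and `RegistrationException` here are local stand-ins for the Aeron types of the same names, and the `retry` helper, its attempt limit, and its fixed back-off are assumptions made for illustration, not part of this change.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import java.util.function.Supplier;

public class RetryOnUnavailable {
    /** Stand-in for Aeron's ErrorCode; only the values needed here. */
    public enum ErrorCode { GENERIC_ERROR, RESOURCE_TEMPORARILY_UNAVAILABLE }

    /** Stand-in for Aeron's RegistrationException. */
    public static class RegistrationException extends RuntimeException {
        private final ErrorCode errorCode;

        public RegistrationException(ErrorCode errorCode, String message) {
            super(message);
            this.errorCode = errorCode;
        }

        public ErrorCode errorCode() {
            return errorCode;
        }
    }

    /**
     * Calls open until it succeeds, retrying only on RESOURCE_TEMPORARILY_UNAVAILABLE.
     * Any other error, or exhausting maxAttempts, rethrows the last exception.
     */
    public static <T> T retry(Supplier<T> open, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return open.get();
            } catch (RegistrationException ex) {
                boolean retryable = ex.errorCode() == ErrorCode.RESOURCE_TEMPORARILY_UNAVAILABLE;
                if (!retryable || attempt >= maxAttempts) {
                    throw ex;
                }
                // Give the driver time to finish closing the old endpoint.
                LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(backoffMillis));
            }
        }
    }
}
```

With the real client, the supplier would wrap the call that opens the subscription, for example `retry(() -> aeron.addSubscription(channel, streamId), 10, 10)`, catching the real `RegistrationException` instead of the stand-in. A bounded attempt count keeps a persistent conflict from turning into an endless loop.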

Ideally, it would be possible to hide this complexity from the user, but that would be a far more involved change.


CHANGE TO AERON CLUSTER CLIENTS

One place where users are likely to run into this issue is when creating a new AeronCluster session, for example, after a session timeout. To improve usability of AeronCluster, we now handle RESOURCE_TEMPORARILY_UNAVAILABLE when creating the egress publication.


PUBLICATIONS NOT FIXED

There are similar problems around publications in the Java media driver; however, these are harder to fix, because the endpoint enters the equivalent CLOSING state on a time event rather than in response to a publication removal.

There are also complications when it comes to recreating a publication with the same explicit session-id, as one needs to consider the lifetime of the entries in the collection used to prevent session clashes.


OTHER TIDBITS

We discovered some issues/surprises on our journey:

  1. The channel status indicator counter is created when an endpoint is first created, and it carries the registrationId of the endpoint's initial resource. The endpoint can later be reused, and that initial resource may be closed; therefore, it is not obviously correct to look up the channel status indicator by registrationId, as we do in some places. Instead, one should use Subscription#channelStatusId to get the relevant counter identifier.

  2. There are still conflicts when removing and adding MDS destinations that map onto the same socket bind address.

  3. The C driver uses different channel endpoint status indicator values.
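
The first point can be illustrated with a toy model. Everything here is hypothetical (the class, its maps, and its method names); it only mimics the bookkeeping described above, where the counter is tagged with the registrationId of the resource that first created the endpoint.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of channel status counter allocation; not the driver's actual code. */
public class StatusCounterLookupModel {
    // counterId -> registrationId recorded when the counter was allocated
    private final Map<Integer, Long> registrationIdByCounterId = new HashMap<>();
    private final Map<String, Integer> counterIdByChannel = new HashMap<>();
    private int nextCounterId = 0;

    /** Returns the channel status counter id that the new subscription should use. */
    public int addSubscription(String channel, long registrationId) {
        Integer existing = counterIdByChannel.get(channel);
        if (existing != null) {
            return existing; // endpoint reused: the counter keeps the first registrationId
        }
        int counterId = nextCounterId++;
        counterIdByChannel.put(channel, counterId);
        registrationIdByCounterId.put(counterId, registrationId);
        return counterId;
    }

    /** Lookup by registrationId; returns null when no counter carries that id. */
    public Integer findCounterByRegistrationId(long registrationId) {
        for (Map.Entry<Integer, Long> entry : registrationIdByCounterId.entrySet()) {
            if (entry.getValue() == registrationId) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

In this model a second subscription on the same channel receives the existing counter id, yet a lookup by its own registrationId finds nothing. That is why the id handed to the subscription is the reliable handle.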

@ZachBray ZachBray requested a review from vyazelenko March 9, 2026 10:54
@ZachBray ZachBray force-pushed the retry-connect-on-bind-error branch from 6ab1d61 to ccd0859 Compare March 9, 2026 11:18
…o the same channel.

Co-authored-by: Dmytro Vyazelenko <696855+vyazelenko@users.noreply.github.com>
@ZachBray ZachBray force-pushed the retry-connect-on-bind-error branch from ccd0859 to 131fa73 Compare March 9, 2026 11:44
@vyazelenko vyazelenko merged commit f08e152 into master Mar 9, 2026
43 checks passed
vyazelenko pushed a commit that referenced this pull request Mar 11, 2026
…o the same channel. (#1955)

(cherry picked from commit f08e152)