Issue with Call-Home #578

sjd-xlnx · 2023-11-17T10:30:49Z

Hello,

I have an issue with the Call-Home procedure. I am using netopeer2-server NETCONF server, and ncclient NETCONF client. Sometimes, about 25%-50% of the time, when listening for call-home, I get an error with the client:

[ERROR]: ncclient.transport.ssh> Exception (client): Error reading SSH protocol banner[WinError 10054] An existing connection was forcibly closed by the remote host
[ERROR]: ncclient.transport.ssh> Traceback (most recent call last):
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\transport.py", line 2271, in _check_banner
[ERROR]: ncclient.transport.ssh>     buf = self.packetizer.readline(timeout)
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\packet.py", line 380, in readline
[ERROR]: ncclient.transport.ssh>     buf  = self._read_timeout(timeout)
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\packet.py", line 607, in _read_timeout
[ERROR]: ncclient.transport.ssh>     x = self.__socket.recv(128)
[ERROR]: ncclient.transport.ssh> ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
[ERROR]: ncclient.transport.ssh>
[ERROR]: ncclient.transport.ssh> During handling of the above exception, another exception occurred:
[ERROR]: ncclient.transport.ssh>
[ERROR]: ncclient.transport.ssh> Traceback (most recent call last):
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\transport.py", line 2094, in run
[ERROR]: ncclient.transport.ssh>     self._check_banner()
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\transport.py", line 2275, in _check_banner
[ERROR]: ncclient.transport.ssh>     raise SSHException(
[ERROR]: ncclient.transport.ssh> paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[WinError 10054] An existing connection was forcibly closed by the remote host
[ERROR]: ncclient.transport.ssh>

When performing call-home, the netopeer2-server cycles through it's known clients with a timeout of 500ms per attempt, and 3 attempts per client. (If there's only one client, it will keep retrying the same one over and over with 500ms gap between each attempt).

From looking at captured pcaps, it looks like the ncclient accepts the socket connect right at the edge of the server's timeout, and the server issues a RST message which causes the ncclient to throw the above error.

If I start the ncclient "call-home" listen first and then start the server "call-home" it works perfectly. It also seems to work fine on Linux. The error only exists on Windows, and only when we catch the end of the server's timeout window.

Furthermore, when it does fail, the socket remains open in listening state, which means further attempts fail with the following error OSError(10048, 'Only one usage of each socket address (protocol/network address/port) is normally permitted', None, 10048, None) until the socket is closed.

I looked at the call_home() and connect() code in manager.py, and ended-up writing my own version to help in debugging. Here's what I ended up with in my code:

        # Setup socket to listen on the port
        srv_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv_socket.bind(("", port))
        srv_socket.settimeout(10)
        srv_socket.listen()
        sock, remote_host = srv_socket.accept()
        logger.info(f"Call-home connection initiated from remote host {remote_host}")

        try:
            self._mgr = manager.connect(host=None, port=port, username=user, password=password, hostkey_verify=False, sock=sock)
        except Exception:
            srv_socket.close()
            raise

I added srv_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) and also the call to manager.connect() is in a try/except block to catch any error and close the socket.

This has greatly improved the situation. I guess I could go further and re-try the whole process in event of failure.

Are there any other suggestions for fixing this problem?

The text was updated successfully, but these errors were encountered:

sjd-xlnx · 2023-11-17T13:14:25Z

Slightly improved workaround, with 3 retries and disabled logging...

        # Disable logging
        logging.getLogger('ncclient.transport.ssh').disabled = True
        for _ in range(3):
            # Setup socket to listen on the port
            srv_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            srv_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv_socket.bind(("", port))
            srv_socket.settimeout(10)
            srv_socket.listen()
            sock, remote_host = srv_socket.accept()
            logger.info(f"Call-home connection initiated from remote host {remote_host}")

            try:
                self._mgr = manager.connect(host=None,
                                            port=port,
                                            username=user,
                                            password=password,
                                            hostkey_verify=False,
                                            sock=sock)
            except Exception as e:
                logger.error(repr(e))
                srv_socket.close()
                logger.info("Trying again...")
            else:
                break
        # Re-enable logging
        logging.getLogger('ncclient.transport.ssh').disabled = False

PS: I see in #561 that there was similar error reported, and so I adopted their idea to disable the errors during the call-home attempt.

einarnn · 2024-01-05T16:17:29Z

@sjd-xlnx, sorry for the long delay. Would you like to propose a PR?

I'd also comment that a 500ms timeout (as you cite the netopeer agent to use) for attempting a call-home operation is quite aggressive and makes a number of assumptions about network topology and the prevailing RTT.

einarnn added the Waiting Waiting for a response from issue or PR author or for someone to volunteer to pick up an enhancement label Jan 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with Call-Home #578

Issue with Call-Home #578

sjd-xlnx commented Nov 17, 2023

sjd-xlnx commented Nov 17, 2023 •

edited

Loading

einarnn commented Jan 5, 2024

Issue with Call-Home #578

Issue with Call-Home #578

Comments

sjd-xlnx commented Nov 17, 2023

sjd-xlnx commented Nov 17, 2023 • edited Loading

einarnn commented Jan 5, 2024

sjd-xlnx commented Nov 17, 2023 •

edited

Loading