Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with Call-Home #578

Open
sjd-xlnx opened this issue Nov 17, 2023 · 2 comments
Open

Issue with Call-Home #578

sjd-xlnx opened this issue Nov 17, 2023 · 2 comments
Labels
Waiting Waiting for a response from issue or PR author or for someone to volunteer to pick up an enhancement

Comments

@sjd-xlnx
Copy link

Hello,

I have an issue with the Call-Home procedure. I am using netopeer2-server NETCONF server, and ncclient NETCONF client. Sometimes, about 25%-50% of the time, when listening for call-home, I get an error with the client:

[ERROR]: ncclient.transport.ssh> Exception (client): Error reading SSH protocol banner[WinError 10054] An existing connection was forcibly closed by the remote host
[ERROR]: ncclient.transport.ssh> Traceback (most recent call last):
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\transport.py", line 2271, in _check_banner
[ERROR]: ncclient.transport.ssh>     buf = self.packetizer.readline(timeout)
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\packet.py", line 380, in readline
[ERROR]: ncclient.transport.ssh>     buf  = self._read_timeout(timeout)
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\packet.py", line 607, in _read_timeout
[ERROR]: ncclient.transport.ssh>     x = self.__socket.recv(128)
[ERROR]: ncclient.transport.ssh> ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
[ERROR]: ncclient.transport.ssh>
[ERROR]: ncclient.transport.ssh> During handling of the above exception, another exception occurred:
[ERROR]: ncclient.transport.ssh>
[ERROR]: ncclient.transport.ssh> Traceback (most recent call last):
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\transport.py", line 2094, in run
[ERROR]: ncclient.transport.ssh>     self._check_banner()
[ERROR]: ncclient.transport.ssh>   File "C:\Users\stevend\AppData\Local\Programs\Python\Python310\lib\site-packages\paramiko\transport.py", line 2275, in _check_banner
[ERROR]: ncclient.transport.ssh>     raise SSHException(
[ERROR]: ncclient.transport.ssh> paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[WinError 10054] An existing connection was forcibly closed by the remote host
[ERROR]: ncclient.transport.ssh>

When performing call-home, the netopeer2-server cycles through it's known clients with a timeout of 500ms per attempt, and 3 attempts per client. (If there's only one client, it will keep retrying the same one over and over with 500ms gap between each attempt).

From looking at captured pcaps, it looks like the ncclient accepts the socket connect right at the edge of the server's timeout, and the server issues a RST message which causes the ncclient to throw the above error.

If I start the ncclient "call-home" listen first and then start the server "call-home" it works perfectly. It also seems to work fine on Linux. The error only exists on Windows, and only when we catch the end of the server's timeout window.

Furthermore, when it does fail, the socket remains open in listening state, which means further attempts fail with the following error OSError(10048, 'Only one usage of each socket address (protocol/network address/port) is normally permitted', None, 10048, None) until the socket is closed.

I looked at the call_home() and connect() code in manager.py, and ended-up writing my own version to help in debugging. Here's what I ended up with in my code:

        # Setup socket to listen on the port
        srv_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv_socket.bind(("", port))
        srv_socket.settimeout(10)
        srv_socket.listen()
        sock, remote_host = srv_socket.accept()
        logger.info(f"Call-home connection initiated from remote host {remote_host}")

        try:
            self._mgr = manager.connect(host=None, port=port, username=user, password=password, hostkey_verify=False, sock=sock)
        except Exception:
            srv_socket.close()
            raise

I added srv_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) and also the call to manager.connect() is in a try/except block to catch any error and close the socket.

This has greatly improved the situation. I guess I could go further and re-try the whole process in event of failure.

Are there any other suggestions for fixing this problem?

@sjd-xlnx
Copy link
Author

sjd-xlnx commented Nov 17, 2023

Slightly improved workaround, with 3 retries and disabled logging...

        # Disable logging
        logging.getLogger('ncclient.transport.ssh').disabled = True
        for _ in range(3):
            # Setup socket to listen on the port
            srv_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            srv_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv_socket.bind(("", port))
            srv_socket.settimeout(10)
            srv_socket.listen()
            sock, remote_host = srv_socket.accept()
            logger.info(f"Call-home connection initiated from remote host {remote_host}")

            try:
                self._mgr = manager.connect(host=None,
                                            port=port,
                                            username=user,
                                            password=password,
                                            hostkey_verify=False,
                                            sock=sock)
            except Exception as e:
                logger.error(repr(e))
                srv_socket.close()
                logger.info("Trying again...")
            else:
                break
        # Re-enable logging
        logging.getLogger('ncclient.transport.ssh').disabled = False

PS: I see in #561 that there was similar error reported, and so I adopted their idea to disable the errors during the call-home attempt.

@einarnn
Copy link
Contributor

einarnn commented Jan 5, 2024

@sjd-xlnx, sorry for the long delay. Would you like to propose a PR?

I'd also comment that a 500ms timeout (as you cite the netopeer agent to use) for attempting a call-home operation is quite aggressive and makes a number of assumptions about network topology and the prevailing RTT.

@einarnn einarnn added the Waiting Waiting for a response from issue or PR author or for someone to volunteer to pick up an enhancement label Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting Waiting for a response from issue or PR author or for someone to volunteer to pick up an enhancement
Projects
None yet
Development

No branches or pull requests

2 participants