version='#44~20.04.1-Ubuntu SMP Thu Jun 22 12:21:12 UTC 2023'
A snippet of code that can reproduce the problem
I have not been able to reproduce this reliably. However, I think it would be reasonable to let clients address the symptoms of this problem with configurable timeouts.
Problem Description
We had an incident recently where we noticed all Py4J threads (64) were blocked. Here is an example of a blocked thread:
{threadName154} (blockedCount: 5, daemon: true, lockOwnerId: 356, threadState: WAITING, waitedCount: 4, waitingOn: <475476020> (a java.util.concurrent.locks.ReentrantLock$FairSync))
at jdk.internal.misc.Unsafe.park(Unsafe.java:-2)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:938)
at java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
at py4j.PythonClient.giveBackConnection(PythonClient.java:239)
at py4j.CallbackClient.sendCommand(CallbackClient.java:406)
at py4j.CallbackClient.sendCommand(CallbackClient.java:356)
This thread is blocked on threadName356. This is the thread-trace:
{threadName356} (blockedCount: 0, daemon: true, isNative: true, threadState: RUNNABLE, waitedCount: 1)
at sun.nio.ch.Net.connect0(Net.java:-2)
at sun.nio.ch.Net.connect(Net.java:579)
at sun.nio.ch.Net.connect(Net.java:568)
at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:588)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.net.Socket.connect(Socket.java:633)
at java.net.Socket.connect(Socket.java:583)
at java.net.Socket.<init>(Socket.java:507)
at java.net.Socket.<init>(Socket.java:319)
at javax.net.DefaultSocketFactory.createSocket(SocketFactory.java:277)
at py4j.PythonClient.startClientSocket(PythonClient.java:192)
at py4j.PythonClient.getConnection(PythonClient.java:213)
at py4j.CallbackClient.getConnectionLock(CallbackClient.java:250)
at py4j.CallbackClient.sendCommand(CallbackClient.java:377)
at py4j.CallbackClient.sendCommand(CallbackClient.java:356)
at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:106)
It appears that, for some reason, threads were blocked on an initial command from Java to Python, at the point where Py4J opens the socket connection. Specifically, the socket creation in py4j.PythonClient.startClientSocket (PythonClient.java:192 in the trace above).
Py4J does not specify a connect timeout there, so we fall back to the JDK default, which is to block indefinitely.
It is unclear to me why the OS blocked on this, and I cannot repro this. Still, this resulted in a non-trivial incident on our side! Therefore, I wanted to propose the following changes:
Potential Solutions
Solution 1: Add a way to specify a socket connection timeout down to the JDK level.
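A minimal sketch of what Solution 1 could look like, assuming the socket is created unconnected and then connected with an explicit timeout. `ConnectTimeoutSketch`, `openWithTimeout`, and `connectTimeoutMillis` are hypothetical names for illustration and are not part of Py4J; only the JDK calls (`SocketFactory.createSocket()`, `Socket.connect(SocketAddress, int)`) are real.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.net.SocketFactory;

// Hypothetical helper for illustration only; not Py4J code.
class ConnectTimeoutSketch {

    /**
     * Opens a socket with a bounded connect time.
     * A timeout of 0 keeps the JDK default (no timeout).
     */
    static Socket openWithTimeout(SocketFactory factory, InetAddress address,
                                  int port, int connectTimeoutMillis) throws IOException {
        // Create an unconnected socket first. The default factory supports this;
        // a custom factory may not, which is an assumption of this sketch.
        Socket socket = factory.createSocket();
        try {
            // Throws java.net.SocketTimeoutException if the connection
            // is not established within connectTimeoutMillis.
            socket.connect(new InetSocketAddress(address, port), connectTimeoutMillis);
            return socket;
        } catch (IOException e) {
            socket.close(); // do not leak the half-opened socket
            throw e;
        }
    }
}
```

With something like this, a connect that hangs would surface as a `SocketTimeoutException` on the thread holding the lock instead of blocking every other thread behind it.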
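Solution 2: Add a timeout for this lock (the ReentrantLock acquired in py4j.PythonClient.giveBackConnection, PythonClient.java:239 in the trace above). The intention is that the wait would time out, other threads would then attempt to send a command on a socket that is not connected, and we'd fail loudly.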
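For Solution 2, here is a rough sketch of a bounded lock wait using `ReentrantLock.tryLock(long, TimeUnit)`. `LockTimeoutSketch` and `withLock` are hypothetical names, and throwing `TimeoutException` is just one possible way to fail loudly; none of this is existing Py4J code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical wrapper for illustration only; not Py4J code.
class LockTimeoutSketch {

    // Fair lock, matching the ReentrantLock$FairSync seen in the thread dump.
    private final ReentrantLock lock = new ReentrantLock(true);

    /**
     * Runs the action while holding the lock, but waits at most
     * the given time to acquire it instead of blocking forever.
     */
    <T> T withLock(long timeout, TimeUnit unit, Callable<T> action) throws Exception {
        if (!lock.tryLock(timeout, unit)) {
            // The current owner is stuck (for example in connect), so fail loudly.
            throw new TimeoutException("could not acquire connection lock within "
                    + timeout + " " + unit);
        }
        try {
            return action.call();
        } finally {
            lock.unlock();
        }
    }
}
```

This alone would not unblock the thread stuck in connect, but the other 63 threads would get an exception rather than waiting indefinitely.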