Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConcurrentDictionary.TryRemove sometimes does not remove an existing key #107525

Open
odinserj opened this issue Sep 9, 2024 · 3 comments
Open

Comments

@odinserj
Copy link

odinserj commented Sep 9, 2024

Description

I noticed strange behavior of the ConcurrentDictionary.TryRemove method in the latest stable version of .NET 8.0. While it is expected to always remove an existing key as per Stephen Toub's answer and de-facto behavior in all the previous versions of .NET (including .NET 6.0 and 7.0) and .NET Framework, it doesn't do so in all the latest versions.

The problem likely occurs, when there are concurrent calls to GetOrAdd or TryAdd method that even don't modify anything, returning already existing item or false respectively, so looks like there's a race condition somewhere.

This behavior started to appear in the SDK 8.0.100-preview.2 version, and doesn't exist in the SDK 8.0.100-preview.1 version or earlier.

Please let me know if more information is required.

Reproduction Steps

Please consider the following sample program. It creates an instance of the ConcurrentDictionary class and performs simple parallel workloads in each iteration of the while loop. During each parallel iteration, it calls the GetOrAdd method that will create an entry that can be processed and removed only by a single thread at the same time – other threads will be waiting until the first winner remove the entry until GetOrAdd will modify anything.

All the waits are global – locks are performed on the dictionary instance to highlight the problem. Each entry is processed only by a single thread, and the behavior is enforced by such a "global" lock, so it's impossible that any other thread will remove anything between the calls to ContainsKey and TryRemove methods.

So when ContainsKey returns true, it is expected that the TryRemove method call succeeds and also returns true. But as we see from the messages in .NET 8.0, this is not true anymore.

The problem does not occur, if we disable parallelism.

As a workaround, we can force the removal by calling TryRemove in a loop, and eventually it will remove the entry, but it is not something that's expected to add in a concurrent dictionary unless documented.

using System.Collections.Concurrent;
using System.Diagnostics;

var dictionary = new ConcurrentDictionary<Guid, Entry>();
var started = Stopwatch.StartNew();

while (true)
{
    var parallelism = Environment.ProcessorCount;
    var key = Guid.NewGuid();

    Parallel.For(0, parallelism, new ParallelOptions { MaxDegreeOfParallelism = parallelism }, i =>
    {
        Entry entry;
        while (true)
        {
            entry = dictionary.GetOrAdd(key, static _ => new Entry());

            lock (dictionary)
            {
                while (entry.Acquired) Monitor.Wait(dictionary);
                if (entry.Finalized) { continue; }

                entry.Acquired = true;
                break;
            }
        }
        
        lock (dictionary)
        {
            entry.Acquired = false;
            entry.Finalized = true;

            var exists = dictionary.ContainsKey(key);
            if (!exists) Console.WriteLine($"Key {key} does not exists in the dictionary");

            var result = dictionary.TryRemove(key, out _);
            if (result == false)
            {
                var message = $"Failed to remove entry for key {key}, existed before: {exists}";
                Console.WriteLine(message);

                if (exists)
                {
                    while (!dictionary.TryRemove(key, out _)) { }
                    Console.WriteLine($"Key {key} successfully removed after enforcing removal");
                }
            }
            
            Monitor.PulseAll(dictionary);
        }
    });

    if (started.Elapsed > TimeSpan.FromSeconds(5))
    {
        Console.WriteLine($"{DateTime.Now}: still alive...");
        started = Stopwatch.StartNew();
    }
}

internal class Entry
{
    public bool Finalized;
    public bool Acquired;
}

Expected behavior

The sample program is expected to run indefinitely without any "failed" messages like it does in .NET 6.0/7.0 and earlier:

9/9/2024 11:39:26 AM: still alive...
9/9/2024 11:39:31 AM: still alive...
9/9/2024 11:39:36 AM: still alive...
9/9/2024 11:39:41 AM: still alive...
9/9/2024 11:39:46 AM: still alive...
9/9/2024 11:39:51 AM: still alive...
9/9/2024 11:39:56 AM: still alive...
9/9/2024 11:40:01 AM: still alive...

Actual behavior

But in .NET 8.0 and later versions, it shows "failed" messages like the following ones within the first N seconds:

Failed to remove entry for key 5f31ea5a-8690-4a40-9ae7-2da0379edf14, existed before: True
Key 5f31ea5a-8690-4a40-9ae7-2da0379edf14 successfully removed after enforcing removal
9/9/2024 11:38:54 AM: still alive...
Failed to remove entry for key 6d8d9957-20f1-447e-98d5-e64f63bffa1b, existed before: True
Key 6d8d9957-20f1-447e-98d5-e64f63bffa1b successfully removed after enforcing removal

Regression?

The sample above works fine in SDK 8.0.100-preview.1 and below (.NET 7.0, .NET 6.0 and .NET Framework 4.8), but doesn't work in SDK 8.0.100-preview.2 and above, including .NET 8.0.401 and 9.0.100-preview.7.

Known Workarounds

Removal eventually completes when calling it in a while loop and checking the result:

while (!dictionary.TryRemove(key, out _)) { }

Configuration

Version: .NET 8.0.401
OS: Windows 11 and Ubuntu 24.04
Architecture: x64
Processors: 24

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Sep 9, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 9, 2024
@huoyaoyuan huoyaoyuan added area-System.Collections and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Sep 9, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@Clockwork-Muse
Copy link
Contributor

locks are performed on the dictionary instance to highlight the problem.

Don't do this. Use a separate object (and on .NET 9 , use the new Lock object).

This use of locks would be suspicious real-world code, because it would mean you don't actually need ConcurrentDictionary.

entry.Acquired = false;
entry.Finalized = true;

These modifications are on the entry retrieved, but:

  • There's no guarantee that they're on the entry removed.
  • They may have been made by other threads already.

Regardless of any potential bug discovered here, if this reflects your real-world code you may have much larger application design issues.

@stephentoub stephentoub self-assigned this Sep 9, 2024
@stephentoub stephentoub added bug and removed untriaged New issue has not been triaged by the area owner labels Sep 9, 2024
@stephentoub
Copy link
Member

stephentoub commented Sep 9, 2024

Thanks for the quality report.

The bug is #82004. It tries to optimize removal on an empty bucket by reading the bucket's count prior to taking the lock. TryAdd adds an element and then increments the corresponding bucket count. That means there's a period of time between when ContainsKey/TryGetValue can find the item in the dictionary but TryRemove misses it if the concurrent TryAdd adds the item to an otherwise empty bucket.

The right answer is to revert that PR, unfortunately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants