Potential problem with L2P1 msg_data consumption #100
Problem description:
Some noc1 requests entering L2 pipe1 carry msg_data (e.g. nc_store, atomics, interrupt forward), but the pipeline does not always consume the data immediately. When such a request is pushed into the mshr, its msg_data is not stored in the mshr array but is left in the noc1 buffer. If another request then enters L2 pipe1 and also tries to consume msg_data, it reads out the msg_data of the previous request.
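To make the hazard concrete, here is a minimal behavioral sketch in Python, not the RTL. It assumes the noc1 data buffer behaves like a FIFO whose head is read by whichever request consumes msg_data next; the names `Request`, `run_pipe1` and `goes_to_mshr` are hypothetical and exist only for this illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    msg_data: int        # payload that arrives with the header
    goes_to_mshr: bool   # True if pipe1 defers the request without consuming its data


def run_pipe1(requests):
    """Process requests in order; return the data each consuming request actually read."""
    noc1_data = deque()  # assumption: the noc1 data buffer is modeled as a FIFO
    consumed = {}
    for req in requests:
        noc1_data.append(req.msg_data)
        if req.goes_to_mshr:
            # The request is parked in the mshr, but its data stays in the buffer.
            continue
        # The consuming request reads the head of the buffer, whatever it is.
        consumed[req.name] = noc1_data.popleft()
    return consumed


first = Request("nc_store_A", 0xAAAA, goes_to_mshr=True)
second = Request("nc_store_B", 0xBBBB, goes_to_mshr=False)
result = run_pipe1([first, second])
# nc_store_B ends up reading nc_store_A's payload instead of its own.
```

In this model the second request sees the stale payload because nothing records that the buffer head still belongs to the deferred request.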
Is this problem real and triggerable?
Only two types of request both carry msg_data and can be pushed into the mshr, leaving the msg_data pending in the buffer: atomic operations and nc stores.
atomic operation
An atomic operation is divided into two internal requests in L2: xx_P1 and xx_P2. In phase 1 it invalidates all sharers if the line is in S/M state; in phase 2 it reads msg_data and does the arithmetic computation. L2 stalls the first stage between phase 1 and phase 2, and does not ack the msg header until it reaches phase 2. Thus, the pipe cannot process any new request while the msg_data of an atomic operation is pending. A request in the mshr can still be recovered between phase 1 and phase 2, and that request may even be an nc_store. In that case, however, it was the nc_store that arrived at L2 first and left its msg_data pending, which is the case discussed in the next section.
So the conclusion is: because the msg header is acked late, new operations are already prevented from being consumed while the msg_data of an atomic operation is pending. @morenes also wrote tests that issue many consecutive atomic operations from multiple threads to try to trigger this problem, and nothing went wrong. This gives some empirical confidence that we are fine in this case.
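The late-ack behavior for atomics can be sketched as a small Python model. This is only an illustration under the assumption that "not acking the header" is equivalent to leaving the request at the head of the input queue until phase 2 finishes; `run_atomic_adds` and the use of an atomic add are hypothetical choices for the example.

```python
from collections import deque


def run_atomic_adds(incoming, memory):
    """Run atomic adds one at a time; return the (name, phase) trace."""
    pending = deque(incoming)  # entries are (name, addr, operand)
    trace = []
    while pending:
        # Peek only: the header is not acked yet, so nothing else can enter the pipe.
        name, addr, operand = pending[0]
        trace.append((name, "P1"))  # phase 1: invalidate sharers (not modeled)
        trace.append((name, "P2"))  # phase 2: read msg_data and compute
        memory[addr] = memory.get(addr, 0) + operand
        pending.popleft()           # ack the header only after phase 2
    return trace


memory = {0x40: 1}
trace = run_atomic_adds([("amo0", 0x40, 2), ("amo1", 0x40, 3)], memory)
```

Because the head entry is only removed after phase 2, the trace never interleaves one atomic's phases with another request, which is the property the argument above relies on.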
non-cacheable store
If the target line of the nc_store is already in S/M state, L2 first invalidates all the sharers and pushes the nc_store into the mshr. At this point the msg_data is pending, and a later request that also carries msg_data could read the wrong data if the pipe accepts it. However, with the Ariane core this case cannot be triggered, because the non-cacheable region is fixed and a line in the nc space is never stored in L2.
The plan is either to build a test with the sparc core that sends an nc_store to a previously cacheable address, or to use another device that sends an nc_store to a cacheable region.
This fix
The idea of this fix is to track when msg_data is pending. It raises a flag, msg_data_pending, when a request that is supposed to consume msg_data is instead pushed into the mshr without consuming it. While that flag is set, the pipeline does not accept any new request that also carries msg_data.
It has been verified that this fix does not break the system. But we need further tests to confirm that the problem can actually be triggered and that this fix solves it.
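The gating described above can be sketched behaviorally in Python. This is not the RTL change itself. It assumes the flag is cleared when the deferred request is replayed from the mshr and finally consumes its data, which the description does not state; `Pipe1Model`, `offer` and `replay_from_mshr` are hypothetical names.

```python
from collections import deque


class Pipe1Model:
    def __init__(self):
        self.noc1_data = deque()   # msg_data waiting in the noc1 buffer
        self.mshr = deque()        # names of deferred requests
        self.stalled = deque()     # requests held back while the flag is set
        self.msg_data_pending = False
        self.consumed = {}

    def offer(self, name, msg_data, defer):
        """Present a request carrying msg_data; return True if pipe1 accepts it."""
        if self.msg_data_pending:
            self.stalled.append((name, msg_data, defer))
            return False
        self.noc1_data.append(msg_data)
        if defer:
            self.mshr.append(name)
            self.msg_data_pending = True   # data left behind in the buffer
        else:
            self.consumed[name] = self.noc1_data.popleft()
        return True

    def replay_from_mshr(self):
        """Assumed clearing point: the deferred request consumes its own data."""
        name = self.mshr.popleft()
        self.consumed[name] = self.noc1_data.popleft()
        self.msg_data_pending = False
        while self.stalled and not self.msg_data_pending:
            self.offer(*self.stalled.popleft())


pipe = Pipe1Model()
accepted_a = pipe.offer("nc_store_A", 0xAAAA, defer=True)
accepted_b = pipe.offer("nc_store_B", 0xBBBB, defer=False)  # held back by the flag
pipe.replay_from_mshr()
```

In this model the second request is not accepted until the first has consumed its payload, so each request ends up reading its own msg_data.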