
Remove memory spike in Sampling #1235

Merged
8 commits merged on Oct 23, 2024
Conversation

marchdf
Contributor

@marchdf marchdf commented Sep 5, 2024

Summary

This removes the memory spike on the ioproc when doing sampling by initializing the sampling particles in parallel.
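The parallel-initialization idea can be sketched roughly as follows (a minimal standalone sketch, not the actual amr-wind code; `Particle` and `init_local_particles` are hypothetical names): instead of the I/O rank building the full particle list and scattering it, each rank constructs only its own contiguous share.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch (names are illustrative, not the amr-wind API):
// instead of the I/O rank allocating all nglobal particles at once
// (the source of the memory spike), each rank builds only its own
// contiguous share of the particle list.
struct Particle {
    double x, y, z;
};

std::vector<Particle> init_local_particles(long nglobal, int rank, int nranks)
{
    const long chunk = nglobal / nranks;
    const long rem = nglobal % nranks;
    // Ranks below 'rem' take one extra particle so the split is even.
    const long begin = rank * chunk + std::min(static_cast<long>(rank), rem);
    const long count = chunk + (rank < rem ? 1 : 0);

    std::vector<Particle> local;
    local.reserve(static_cast<std::size_t>(count));
    for (long i = begin; i < begin + count; ++i) {
        // Placeholder coordinates; the real code would look up the
        // i-th sampling location here.
        local.push_back({static_cast<double>(i), 0.0, 0.0});
    }
    return local;
}
```

The peak per-rank allocation is then roughly `nglobal / nranks` particles rather than `nglobal` on the I/O rank.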

Pull request type

Please check the type of change introduced:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Documentation content changes
  • Other (please describe):

Checklist

The following is included:

  • new unit-test(s)
  • new regression test(s)
  • documentation for new capability

This PR was tested by running:

  • the unit tests
    • on GPU
    • on CPU
  • the regression tests
    • on GPU
    • on CPU

Additional background

Issue Number: #1186

@marchdf
Contributor Author

marchdf commented Sep 5, 2024

This is great, except that now all ranks need to know about the probe locations. So it ends up needing more memory in total, though there is no spike. The next step I want to take before this PR is merged is to have each rank only know about the probe locations that are relevant to it. Basically, I want to pass a box to sampling_locations and only get the locations that are inside that box. But I need to switch to something else for a bit.
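The box-based filtering described here could look roughly like this (a hypothetical sketch with plain types standing in for the AMReX box and location machinery):

```cpp
#include <array>
#include <vector>

using Loc = std::array<double, 3>;

// Hypothetical sketch: a rank keeps only the probe locations that
// fall inside the box it owns, instead of storing all of them.
struct Box {
    Loc lo, hi;
    bool contains(const Loc& p) const
    {
        for (int d = 0; d < 3; ++d) {
            // Half-open test so adjacent boxes do not double-count
            // locations on shared faces.
            if (p[d] < lo[d] || p[d] >= hi[d]) {
                return false;
            }
        }
        return true;
    }
};

std::vector<Loc> sampling_locations(const std::vector<Loc>& all, const Box& box)
{
    std::vector<Loc> kept;
    for (const auto& p : all) {
        if (box.contains(p)) {
            kept.push_back(p);
        }
    }
    return kept;
}
```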

@marchdf marchdf force-pushed the sampling-spike branch 5 times, most recently from ac8f6e4 to fc43780 Compare September 11, 2024 20:41
@marchdf
Contributor Author

marchdf commented Sep 11, 2024

Status update before I leave this aside for a bit:

  • There is currently a hard check that the number of particles added matches what was expected to be added. This is a good check, but if a sampling location is not in the domain, the check will fail. Do I silently filter out sampling locations outside the domain, or abort? This shows up in the dam_break_godunov test case.
  • Now that I can dynamically add sampling locations, I am going to have each rank only handle the sampling locations that are in the boxes it owns. That should further lower the memory consumption.
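One way to make that hard check robust, sketched under the assumption that locations are plain coordinate triples (all names hypothetical): count only the requested locations that actually fall inside the domain, and compare the number of added particles against that count, so out-of-domain locations are handled explicitly rather than tripping a blind equality check.

```cpp
#include <array>
#include <vector>

using Loc = std::array<double, 3>;

struct Domain {
    Loc lo, hi;
};

// Hypothetical sketch: decide per location whether it lies inside the
// domain (inclusive bounds), then count the in-domain subset. The hard
// check would compare "particles added" against count_expected() and
// could report or abort on the out-of-domain remainder instead of
// silently dropping it.
bool in_domain(const Domain& d, const Loc& p)
{
    for (int dim = 0; dim < 3; ++dim) {
        if (p[dim] < d.lo[dim] || p[dim] > d.hi[dim]) {
            return false;
        }
    }
    return true;
}

long count_expected(const Domain& d, const std::vector<Loc>& locs)
{
    long n = 0;
    for (const auto& p : locs) {
        if (in_domain(d, p)) {
            ++n;
        }
    }
    return n;
}
```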

@marchdf marchdf force-pushed the sampling-spike branch 2 times, most recently from 5a54ba1 to c99cba2 Compare September 20, 2024 18:51
@marchdf
Contributor Author

marchdf commented Sep 20, 2024

Progress is being made... I need to fix the m_fill_val in the radar sampler to be the min of the domain (checking to see if that's allowed). Then the netcdf builds need to be fixed for the realvect change. And then finally I can call sampling_locations with the locally owned boxes. Phew.

@marchdf
Contributor Author

marchdf commented Oct 9, 2024

We are close. I added the box check which is the thing that should help the memory pressure. I need to check it on a large case and see if it actually works/causes no diffs. The only other thing I am "worried" about is the radar sampler since I don't have an input file for that.

@marchdf
Contributor Author

marchdf commented Oct 10, 2024

Ran the new reg test with MPI and everything on, and I don't see any diffs in the particle files. I would like to double check netcdf. But here's the upshot:

  • It successfully removes the memory spike: the 10GB spike is gone and the max we see is about 250MB on each rank.
  • However, initializing the particles is much slower with this PR. This is a bummer: basically we are trading memory usage for computation. TANSTAAFL. One thing I want to think about is whether there's a way to give it a bounding box...

This PR

Time spent in InitData():    106.9760745
Time spent in Evolve():      4.634056547

[memory plot: memory_30plane_dev]

main branch

Time spent in InitData():    10.57266587
Time spent in Evolve():      4.56230204

[memory plot: memory_30plane_main-2]

@marchdf
Contributor Author

marchdf commented Oct 10, 2024

This is the diff with the big case I am using to benchmark (40M probes):

           absolute_error  relative_error
uid          0.000000e+00    0.000000e+00
set_id       0.000000e+00    0.000000e+00
probe_id     0.000000e+00    0.000000e+00
xco          1.104635e-08    1.092440e-16
yco          0.000000e+00    0.000000e+00
zco          0.000000e+00    0.000000e+00
velocityx    4.813321e-13    9.463192e-18
velocityy    7.431559e-13    1.052051e-15
velocityz    0.000000e+00    0.000000e+00
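For reference, a per-field comparison like the table above could be computed along these lines (a sketch only; the exact error definitions behind the table are not stated, so max-norm absolute error and a reference-normalized relative error are assumptions):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch only (assumed definitions): absolute error is the max of
// |a_i - b_i| over the field; relative error normalizes each
// difference by the magnitude of the reference value b_i.
struct Errors {
    double abs_err;
    double rel_err;
};

Errors compare_field(const std::vector<double>& a, const std::vector<double>& b)
{
    Errors e{0.0, 0.0};
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const double diff = std::abs(a[i] - b[i]);
        e.abs_err = std::max(e.abs_err, diff);
        if (b[i] != 0.0) {
            e.rel_err = std::max(e.rel_err, diff / std::abs(b[i]));
        }
    }
    return e;
}
```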

@marchdf
Contributor Author

marchdf commented Oct 10, 2024

Ok here's the updated plots with the improvements I made:
[memory plot: memory_30plane_dev-2]

And now we are back to something reasonable:

Time spent in InitData():    16.04659739
Time spent in Evolve():      4.62255488

@marchdf
Contributor Author

marchdf commented Oct 10, 2024

Checklist:

  • check netcdf
  • rerun all the tests
  • check abl_sampling test and large test files from a GPU run
  • check free surface
  • I think I can actually simplify now that I did all that extra work

@marchdf
Contributor Author

marchdf commented Oct 11, 2024

Because I can't leave well enough alone... That last simplification means that now we are even faster than the main branch in the init (2X speedup). Turns out TANSTAAFL is not a thing.

Time spent in InitData():    5.056825252
Time spent in Evolve():      5.084482466

[memory plot: memory_30plane_dev-3]

@mbkuhn
Contributor

mbkuhn commented Oct 11, 2024

Not sure if you've incorporated my additions to the probe sampler, but hopefully those won't be too bad to address.

@marchdf
Contributor Author

marchdf commented Oct 11, 2024

> not sure if you've incorporated my additions to probe sampler, but hopefully those won't be too bad to address.

Yup I merged those in when you made the changes. You had good tests in there that made it easy.

@marchdf
Contributor Author

marchdf commented Oct 11, 2024

That last one made the init 10% faster than the one before.

Time spent in InitData():    4.615414819
Time spent in Evolve():      5.122051878

Not enough to claim that the 80s were right about greed. I will keep it though.

@marchdf
Contributor Author

marchdf commented Oct 11, 2024

Free surface sampler looks fine.

[attachment: plot_sampling_native.pdf]

@marchdf
Contributor Author

marchdf commented Oct 11, 2024

abl_sampling_netcdf with 10 ranks on Kestrel:
This PR:

Time spent in InitData():    0.38666686
Time spent in Evolve():      1.077089947

main branch:

Time spent in InitData():    0.483688814
Time spent in Evolve():      1.074861178

Things "look" right:

❯ cd post_processing_dev
❯ du -shc *
28K     abl_statistics00000.nc
32K     line_sampling00000.nc
136K    plane_sampling00000.nc
32K     probe_sampling00000.nc
5.8M    volume_sampling00000.nc
88K     volume_sampling200000.nc
6.1M    total
❯ ..
❯ cd post_processing_main
❯ du -shc *
28K     abl_statistics00000.nc
32K     line_sampling00000.nc
136K    plane_sampling00000.nc
32K     probe_sampling00000.nc
5.8M    volume_sampling00000.nc
88K     volume_sampling200000.nc
6.1M    total

But I would like to have an nccmp output... which isn't on any machine I have access to yet.

@marchdf
Contributor Author

marchdf commented Oct 11, 2024

got nccmp added to kestrel:

❯ nccmp -d post_processing_dev/volume_sampling00000.nc post_processing_main/volume_sampling00000.nc
❯ nccmp -d post_processing_dev/volume_sampling200000.nc post_processing_main/volume_sampling200000.nc

good to go for netcdf

@hgopalan
Contributor

I checked one of our cases that previously crashed on CPU in the first step. I changed the output format to native and it has already run fine for 100 seconds. Will keep you updated.

@marchdf marchdf force-pushed the sampling-spike branch 3 times, most recently from ebf5ccd to 0d55040 Compare October 16, 2024 15:38
@marchdf marchdf force-pushed the sampling-spike branch 3 times, most recently from ec868b1 to c51a440 Compare October 22, 2024 15:43
@marchdf
Contributor Author

marchdf commented Oct 22, 2024

With some final tweaks, I think we are good to go. Running the GPU tests now.

@marchdf marchdf marked this pull request as ready for review October 23, 2024 16:09
@marchdf marchdf merged commit d10f427 into Exawind:main Oct 23, 2024
15 checks passed
@marchdf marchdf deleted the sampling-spike branch October 23, 2024 16:09