Two restart tests have answer changes due to moving mountain update in cam6_4_078 #1284

Closed
cacraigucar opened this issue Mar 20, 2025 · 8 comments
Labels: bug (Something isn't working correctly), CoupledEval3, next tag (This issue is ready to be fixed in the next CAM tag), Priority 1 bug (Top priority bug)

Comments

@cacraigucar
Collaborator

What happened?

Restart tests for a couple of regression tests indicate answer changes between a full 9 time step run and a run that is broken in two pieces with a restart to finish the run. The test failures were missed when making cam6_4_078.

Subsequent investigation indicates the problem occurs when both the change to gw_convect.F90 and the new setting of effgw_beres_dp=0.15 are implemented together. The line in gw_convect.F90 is:

    hdepth = max(1000._r8, hdepth*qbo_hdepth_scaling)

Restarts are bit-for-bit when the line is reverted to:

    hdepth = hdepth*qbo_hdepth_scaling

It was noted in issue #1276: "Change in gw_convect.F90: we introduced a check to make sure that the latent heating depths (variable called hdepth) do not exceed the range of latent heating depths covered by the lookup table."
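
For illustration only, the two variants of the line can be compared in a small Python sketch. The function name, the `apply_floor` switch, and the sample depths are hypothetical, and the depth is assumed to be in meters; only the `max(1000, ...)` floor and the `qbo_hdepth_scaling` factor come from the lines quoted above.

```python
def scaled_hdepth(hdepth, qbo_hdepth_scaling, apply_floor=True, floor=1000.0):
    """Return the scaled heating depth, optionally floored at `floor`.

    apply_floor=True mirrors the new line, apply_floor=False the reverted one.
    Units are ASSUMED to be meters; this is not the CAM code itself.
    """
    scaled = hdepth * qbo_hdepth_scaling
    return max(floor, scaled) if apply_floor else scaled


# Hypothetical sample depths: only values below the floor are changed.
for h in (0.0, 800.0, 2500.0):
    print(h, scaled_hdepth(h, 1.0, apply_floor=False), scaled_hdepth(h, 1.0))
```

The sketch only shows which inputs the floor alters; it says nothing about why the restart comparison fails.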

A fix needs to be implemented to get restart tests to be bit-for-bit.

What are the steps to reproduce the bug?

Using cam6_4_078:

Run: ERP_D_Ln9.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp
This test was investigated in the detail described above.

The following test also fails restart and is assumed to fail for the same reason, but it has not been investigated:
ERS_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq9s

What CAM tag were you using?

cam6_4_078

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/derecho/scratch/cacraig/test_cac_intel_20250320150236

Will you be addressing this bug yourself?

Yes, but I will need some help

Extra info

@JulioTBacmeister @PeterHjortLauritzen @mbramberger

cacraigucar added the bug, next tag, and Priority 1 bug labels Mar 20, 2025
cacraigucar self-assigned this Mar 20, 2025
@PeterHjortLauritzen
Collaborator

PeterHjortLauritzen commented Mar 21, 2025

A little debugging info: The only field that is different in this test

ERP_D_Ln9.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp

is CS_RAINCERT which is part of the COSP simulator.

In the WACCM-x test

ERS_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq9s

differences are in atmImp_Faxa_swnet
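
A minimal sketch of what "the only field that is different" means in a bit-for-bit comparison. The real comparison is done by the test framework on netCDF output; here plain dictionaries stand in for the two outputs, and the function name, the field `T`, and all values are hypothetical.

```python
import math
import struct


def differing_fields(base, rest):
    """Return the sorted names of fields that are not bit-for-bit identical."""
    def bits(values):
        # Pack as IEEE 754 doubles so the comparison is on raw bit patterns.
        return struct.pack(f"{len(values)}d", *values)

    names = set(base) | set(rest)
    return sorted(
        n for n in names
        if n not in base or n not in rest or bits(base[n]) != bits(rest[n])
    )


# Hypothetical data: one field differs only in the last bit of one value.
base = {"T": [250.0, 260.0], "CS_RAINCERT": [0.1, 0.2]}
rest = {"T": [250.0, 260.0], "CS_RAINCERT": [0.1, math.nextafter(0.2, 1.0)]}
print(differing_fields(base, rest))
```

Even a one-bit difference in a single diagnostic field is enough to fail the restart comparison, which is why both tests are reported as failing although the remaining fields agree.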

@PeterHjortLauritzen
Collaborator

PeterHjortLauritzen commented Mar 25, 2025

Based on @fvitt's idea that tests shorter than the radiation time step are failing, I tried running the ERP test with 6, 10, 12, and 20 time steps instead of 9:

ERP_D_Ln6.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl - PASS

ERP_D_Ln10.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl - FAIL

ERP_D_Ln12.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl - PASS

ERP_D_Ln20.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl - PASS

This issue might be related to

#655

@cacraigucar
Collaborator Author

Based on @fvitt's idea that tests shorter than the radiation time step are failing, I tried running the ERP test with 20 time steps instead of 9:

ERP_D_Ln20.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl

and the test passes.

I will update both tests to use the outfrq3h use case. Hopefully that solves all the problems!

@fvitt

fvitt commented Mar 25, 2025

The short 9-step cam7 physics WACCMX restart test has failed since tag cam6_4_077, when it was first introduced. As noted by @PeterHjortLauritzen, differences are only in the atmImp_Faxa_swnet diagnostic field. Otherwise, the restarts are bit-for-bit. WACCMX uses a 5-minute time step, so the 9-step test covers 45 minutes of model time, which is shorter than the radiation transfer frequency. RRTMGP, used in cam7 physics, might be missing the update to the swnet diagnostic on the restart in this case.

3-hour restart test passes:

    PASS ERS_Lh3.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq3h RUN time=805
    PASS ERS_Lh3.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq3h COMPARE_base_rest
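
The run-length arithmetic above can be written out as a short sketch. The 5-minute time step and the reading of `Ln9` as 9 steps and `Lh3` as 3 hours come from this comment; the helper name is hypothetical, and the 60-minute radiation interval is an assumption, since the thread does not state the actual value.

```python
import re


def run_length_minutes(test_name, dt_minutes):
    """Convert the _Ln<steps> or _Lh<hours> part of a test name to model minutes."""
    match = re.search(r"_L([nh])(\d+)", test_name.split(".")[0])
    if match is None:
        raise ValueError(f"no _Ln/_Lh length modifier in {test_name!r}")
    unit, count = match.group(1), int(match.group(2))
    return count * dt_minutes if unit == "n" else count * 60


DT_MINUTES = 5           # WACCMX time step, from the comment above
RADIATION_INTERVAL = 60  # ASSUMPTION: hourly radiation; not stated in this thread

for name in (
    "ERS_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq9s",
    "ERS_Lh3.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq3h",
):
    minutes = run_length_minutes(name, DT_MINUTES)
    print(name.split(".")[0], minutes, minutes >= RADIATION_INTERVAL)
```

Under that assumed interval, the 9-step test (45 minutes) ends before a second radiation call, while the 3-hour test spans several.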

@cacraigucar
Collaborator Author

@fvitt - based on what you just reported, do we need to add the swnet diagnostic to the restart file? I was planning on updating the tests to outfrq3h based on what we discussed at the meeting, but it sounds like your fix is the "proper" one?

If we need to add that variable, could someone advise on how to do this? Believe it or not, I've never added a variable to the restart files and I'm not sure where these mods would need to happen.

@fvitt

fvitt commented Mar 25, 2025

> @fvitt - based on what you just reported, do we need to add the swnet diagnostic to the restart file? I was planning on updating the tests to outfrq3h based on what we discussed at the meeting, but it sounds like your fix is the "proper" one?
>
> If we need to add that variable, could someone advise on how to do this? Believe it or not, I've never added a variable to the restart files and I'm not sure where these mods would need to happen.

It would be good to get input from @brian-eaton on this. This seems to be a lower-priority issue and can probably wait for his return.

My first thought is to change how RRTMGP updates this diagnostic to be similar to RRTMG's behavior. Adding a field to the restart file is another option.
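
The suspected mechanism can be illustrated with a toy model. This is not RRTMGP or CAM code: the function, the integer "state", and the restart steps are all hypothetical, and the model assumes the diagnostic is refreshed on step 1 and then every `rad_every` steps. It covers only the restart-file option, not the change to how the diagnostic is updated.

```python
def run(n_steps, rad_every, restart_at=None, save_diag=False):
    """Toy model: return the final (state, diagnostic) after n_steps."""
    state, diag = 0, None
    for step in range(1, n_steps + 1):
        state += step                    # prognostic state, always restarted correctly
        if (step - 1) % rad_every == 0:
            diag = state * 2             # diagnostic refreshed only on radiation steps
        if step == restart_at and not save_diag:
            diag = None                  # lost unless it is written to the restart file
    return state, diag


# 9 steps with radiation every 12 steps: no radiation step follows the restart.
print(run(9, 12), run(9, 12, restart_at=5))
# Saving the diagnostic in the restart file restores agreement.
print(run(9, 12, restart_at=5, save_diag=True))
# 36 steps: a radiation step after the restart recomputes the diagnostic.
print(run(36, 12), run(36, 12, restart_at=19))
```

In this toy setting the state always matches and only the diagnostic differs in the short run, which is consistent with the pattern described above: everything is bit-for-bit except the swnet field.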

@brian-eaton
Collaborator

I can't look into details, but just want to mention that CS_RAINCERT is known to not be task independent. I thought I had added it to the fexcl in the user_nl_cam testmods file for cosp tests. Maybe I missed this one. All diagnostics from the cloudsat simulator (CS_*) may have this issue.
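
As a rough sketch of the effect of excluding such fields, the snippet below filters a list of field names before comparison. It does not show the `fexcl` namelist syntax, and it makes no assumption about whether `fexcl` accepts wildcards; the helper name and the field `T` are hypothetical.

```python
from fnmatch import fnmatchcase


def fields_to_compare(field_names, excluded_patterns):
    """Drop every field whose name matches one of the exclusion patterns."""
    return [
        name for name in field_names
        if not any(fnmatchcase(name, pattern) for pattern in excluded_patterns)
    ]


# Hypothetical field list; only CS_RAINCERT is named in this thread.
fields = ["T", "CS_RAINCERT"]
print(fields_to_compare(fields, ["CS_RAINCERT"]))  # exclude one named field
print(fields_to_compare(fields, ["CS_*"]))         # exclude all CS_* fields
```

Once a task-dependent field is left out of the output, it can no longer cause a spurious difference in the restart comparison.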

@peverwhee
Collaborator

@brian-eaton and @fvitt -

thank you both! I've opened a PR that contains both of the fixes y'all have suggested (fexcl-ing CS_RAINCERT, and fixing how we handle swnet when shortwave radiation is not performed) and the failing tests pass. We'll get it in soon.

#1289

@github-project-automation github-project-automation bot moved this from To Do to Done in CAM Development Mar 26, 2025