Two restart tests have answer changes due to moving mountain update in cam6_4_078 #1284

cacraigucar · 2025-03-20T21:56:41Z

What happened?

Two regression restart tests show answer changes between a full 9-time-step run and the same run split into two pieces, with a restart to finish it. The test failures were missed when making cam6_4_078.

Subsequent investigation indicates the problem occurs when both the change to gw_convect.F90 and the new setting of effgw_beres_dp=0.15 are implemented together. The line in gw_convect.F90 is:
hdepth = max(1000._r8, hdepth*qbo_hdepth_scaling)

Restarts are bit-for-bit when the line is reverted back to:
hdepth = hdepth*qbo_hdepth_scaling
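
For reference, the two forms of the line differ only when the scaled depth falls below the 1000 floor. The following is a minimal sketch of that difference, written in Python rather than Fortran; the function name and the input values are made up for illustration and are not taken from the CAM source.

```python
def scaled_hdepth(hdepth, qbo_hdepth_scaling, floor=None):
    """Scale hdepth, optionally applying a lower bound.

    floor=None mimics the reverted line (plain scaling);
    floor=1000.0 mimics the max(1000._r8, ...) form.
    """
    scaled = hdepth * qbo_hdepth_scaling
    if floor is None:
        return scaled
    return max(floor, scaled)


# Hypothetical inputs: the floor only matters when the scaled value is small.
for depth, scaling in [(2000.0, 0.25), (4000.0, 0.5)]:
    print(depth, scaling,
          scaled_hdepth(depth, scaling),
          scaled_hdepth(depth, scaling, floor=1000.0))
```

Any column where the floor is active takes a different value with the new line, so the two versions are not expected to give identical answers in general.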

Issue #1276 noted: "Change in gw_convect.f90: we introduced a check to make sure that the latent depths (variable called hdepth) do not exceed the range of latent heating depths covered by the lookup table."

A fix needs to be implemented so that the restart tests are bit-for-bit.

What are the steps to reproduce the bug?

Using cam6_4_078:

Run: ERP_D_Ln9.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp
This test was investigated in the detail described above.

The following test also fails restart. It is assumed to have the same cause, but it has not been investigated:
ERS_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq9s

What CAM tag were you using?

cam6_4_078

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/derecho/scratch/cacraig/test_cac_intel_20250320150236

Will you be addressing this bug yourself?

Yes, but I will need some help

Extra info

@JulioTBacmeister @PeterHjortLauritzen @mbramberger


PeterHjortLauritzen · 2025-03-21T16:41:47Z

A little debugging info: The only field that is different in this test

ERP_D_Ln9.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp

is CS_RAINCERT which is part of the COSP simulator.

In the WACCM-x test

ERS_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq9s

differences are in atmImp_Faxa_swnet

PeterHjortLauritzen · 2025-03-25T16:10:39Z

Based on @fvitt's idea that the failing tests are the ones shorter than the radiation time step, I tried running the ERP test with 6, 10, 12, and 20 time steps instead of 9:

ERP_D_Ln6.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl - PASS

ERP_D_Ln10.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl - FAIL

ERP_D_Ln12.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl - PASS

ERP_D_Ln20.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl - PASS

This issue might be related to

#655

cacraigucar · 2025-03-25T16:12:37Z

Based on @fvitt's idea that the failing tests are the ones shorter than the radiation time step, I tried running the ERP test with 20 time steps instead of 9

ERP_D_Ln20.ne30pg3_ne30pg3_mt232.QPC7.derecho_intel.cam-outfrq3s_cosp.20250325_094658_mcchhl

and the test passes.

I will update both tests to use the outfrq3h use case. Hopefully that solves all the problems!

fvitt · 2025-03-25T16:17:30Z

The short 9-step cam7 physics WACCMX restart test failed in tag cam6_4_077, when it was first introduced. As noted by @PeterHjortLauritzen, differences are only in the atmImp_Faxa_swnet diagnostic field. Otherwise, the restarts are bit-for-bit. WACCMX uses a 5-minute time step, so the 9-step test covers only 45 minutes of model time, which is shorter than the radiation transfer frequency. RRTMGP, used in cam7 physics, might be missing the update to the swnet diagnostic on the restart in this case.

3-hour restart test passes:

    PASS ERS_Lh3.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq3h RUN time=805
    PASS ERS_Lh3.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq3h COMPARE_base_rest
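
The length argument above can be checked with a small sketch. It reads the run length from the test name (taking `Ln9` as 9 steps and `Lh3` as 3 hours, as the descriptions in this thread imply) and compares it to a radiation interval. The 60-minute interval is an assumption for illustration only; the thread does not state the actual value, and the helper names are hypothetical.

```python
import re

WACCMX_STEP_MINUTES = 5                   # from the comment above
ASSUMED_RADIATION_INTERVAL_MINUTES = 60   # hypothetical, not from the thread


def run_length_minutes(test_name, step_minutes):
    """Return the model-time length encoded in a test name like ERS_Ln9.<...>."""
    match = re.search(r"_L([nh])(\d+)\.", test_name)
    if match is None:
        raise ValueError(f"no Ln/Lh length found in {test_name!r}")
    unit, count = match.group(1), int(match.group(2))
    # 'n' counts model time steps, 'h' counts hours.
    return count * step_minutes if unit == "n" else count * 60


def shorter_than_radiation(test_name, step_minutes=WACCMX_STEP_MINUTES,
                           interval=ASSUMED_RADIATION_INTERVAL_MINUTES):
    """True when the whole run is shorter than the assumed radiation interval."""
    return run_length_minutes(test_name, step_minutes) < interval


short_test = "ERS_Ln9.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq9s"
long_test = "ERS_Lh3.ne30pg3_ne30pg3_mg17.FHISTC_WXma.derecho_intel.cam-outfrq3h"

print(run_length_minutes(short_test, WACCMX_STEP_MINUTES))  # 9 steps * 5 min
print(shorter_than_radiation(short_test))
print(shorter_than_radiation(long_test))
```

Under that assumed interval, the 9-step run ends before the interval has elapsed, while the 3-hour run does not.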

cacraigucar · 2025-03-25T16:35:04Z

@fvitt - based on what you just reported, do we need to add the swnet diagnostic to the restart file? I was planning on updating the tests to outfrq3h based on what we discussed at the meeting, but it sounds like your fix is the "proper" one?

If we need to add that variable, could someone advise on how to do this? Believe it or not, I've never added a variable to the restart files and I'm not sure where these mods would need to happen.

fvitt · 2025-03-25T16:57:09Z

> @fvitt - based on what you just reported, do we need to add the swnet diagnostic to the restart file? I was planning on updating the tests to outfrq3h based on what we discussed at the meeting, but it sounds like your fix is the "proper" one?
>
> If we need to add that variable, could someone advise on how to do this? Believe it or not, I've never added a variable to the restart files and I'm not sure where these mods would need to happen.

It would be good to get input from @brian-eaton on this. This seems to be a lower-priority issue and can probably wait for his return.

My first thought is to change how RRTMGP updates this diagnostic to be similar to RRTMG's behavior. Adding a field to the restart file is another option.

brian-eaton · 2025-03-25T17:51:12Z

I can't look into details, but just want to mention that CS_RAINCERT is known to not be task independent. I thought I had added it to the fexcl in the user_nl_cam testmods file for cosp tests. Maybe I missed this one. All diagnostics from the cloudsat simulator (CS_*) may have this issue.

peverwhee · 2025-03-25T19:56:35Z

@brian-eaton and @fvitt -

thank you both! I've opened a PR that contains both of the fixes y'all have suggested (fexcl-ing CS_RAINCERT, and fixing how we handle swnet when shortwave radiation is not performed) and the failing tests pass. We'll get it in soon.

#1289

cacraigucar added the bug (Something isn't working correctly), next tag (This issue is ready to be fixed in the next CAM tag), and Priority 1 bug (Top priority bug) labels Mar 20, 2025

cacraigucar self-assigned this Mar 20, 2025

PeterHjortLauritzen mentioned this issue Mar 21, 2025

New low top baseline with cam6_4_078 NCAR/amwg_dev#651

Closed

PeterHjortLauritzen mentioned this issue Mar 21, 2025

f.e30.FHISTC_WAma.ne16pg3_mg17_L135_cam6_4_078_beres0.15 NCAR/wawg_dev#93

Open

cacraigucar added the CoupledEval3 label Mar 24, 2025

cacraigucar added this to CAM Development Mar 24, 2025

github-project-automation bot moved this to To Do in CAM Development Mar 24, 2025

peverwhee mentioned this issue Mar 25, 2025

cam6_4_081: Fixes for failing restart tests #1289

Merged

nusbaume closed this as completed Mar 26, 2025

github-project-automation bot moved this from To Do to Done in CAM Development Mar 26, 2025
