Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

733 review config newest #182

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open

733 review config newest #182

wants to merge 18 commits into from

Conversation

donadviser
Copy link
Collaborator

@donadviser donadviser commented Mar 5, 2025

Pull Request Title

Review config handling: Split config.json and implement merge utility.

Summary

Add your summary here - keep it brief, to the point, and in plain English.
This PR splits the existing config.json into two files: config_user.json and config_constants.json. A new utility script is added to merge these files back into config.json. Additionally, updates are made to copy_script_and_config.py and helper_functions.py to handle the merging, and logger setup is introduced in `mbs_results/init.py for use in the code pipeline.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Other (please describe):

Checklists

This pull request meets the following requirements:

Creator Checklist

  • Installable with all dependencies recorded
  • Runs without error
  • Follows PEP8 and project-specific conventions
  • Appropriate use of comments, for example, no descriptive comments
  • Functions documented using Numpy style docstrings
  • Assumptions and decisions log considered and updated if appropriate
  • Unit tests have been updated to cover essential functionality for a reasonable range of inputs and conditions
  • Other forms of testing such as end-to-end and user-interface testing have been considered and updated as required

If you feel some of these conditions do not apply for this pull request, please
add a comment to explain why.

Reviewer Checklist

  • Test suite passes (locally as a minimum)
  • Peer reviewed with review recorded

Additional Information

Please provide any additional information or context that would help the reviewer understand the changes in this pull request.

Related Issues

Link any related issues or pull requests here.

Copy link
Collaborator

@NathanKelly-ONS NathanKelly-ONS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Derrick, good work so far! I definitely can see how this is going to be useful and simplify the configs quite a lot. I have a few suggestions and a couple of questions. I think some aspects will be useful to get people's input at the show-and-tell, so looking forward to that.

The only thing I haven't been able to test yet is the logger functionality, I'm not sure why I can't seem to get it working, but I'll keep testing - it could be a me problem.

One more general thing is I think it makes sense to put the two configs into a separate folder. We'll need to check it still works as expected and doesn't break the pipeline since the filepaths will be slightly different, but it will allow us to have a README for that specific folder which will be useful - otherwise the config README that I'm working on won't be particularly visible.

import json


def merge_two_config_files(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two points about this function more generally:

I think it's worthwhile calling this function as part of the load_config function and testing whether it works - the new configs currently aren't used anywhere in the script as far as I can tell. I can't currently tell whether if we replace the current config with the combined config, whether this will work in place without any additional changes - but I think that would be ideal, so it's a good idea to test this to see whether it works.

Also, relevant to the above, this function is essentially loading the configs and merging them. I think it makes sense to separate this out - so perhaps update the load_config function to read in both configs, then once the configs have been read in, pass the actual variables into merge_two_config_files, then strip out the with open(...) parts of merge_two_config_files, so the merge function is literally just merging and the load function is literally just loading.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have updated the function merge_two_config_files to perform the merging and return the config object. I have also updated the create_testing_config in the helper_functions.py to call it.

I will retain the option of the creation of config.json from merging of config_user and config_dev. This keeping the current structure of running and testing of the pipeline.

Copy link
Collaborator

@NathanKelly-ONS NathanKelly-ONS Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Related to my previous comment) I think the load_config function is essentially redundant with this function, because it's only designed to read in a single config, which we won't have any more once we move towards having a user_config and dev_config.

I also don't believe that we're permanently writing out the combined config anywhere (by design, as we discussed yesterday) so there's no way for the load_config function to actually pick up the correct config when you're running the pipeline.

Whilst I think this function would do the job, I don't think it's best practice to have a function doing two things like this is (loading AND merging) - it's generally better for each function to do a single thing.

So I think the best thing to do is to update load_config so it takes two filepaths, and returns dev_config and user_config, and then you can pass those objects into merge_two_configs. That way, load_config is literally just loading the configs, and merge_two_configs is literally just merging two configs.

Also, this will mean that the new configs are actually being used in main too, so we can make sure the pipeline still runs as expected and it's passing the integration tests.

Also, sorry I missed this in my first review: could you add a unit test for this too please?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this part now. It's working fine as demoed in the show and tell earlier today. I have also added test_merge_two_config_files.py to test the merging of the two config files

"threshold_filepath":"",

"back_data_type":"type",
"imputation_marker_col":"imputation_flags_adjustedresponse",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing - do we think this makes sense to have in the user config? Or do we think it belongs with the other column names in the constants_config? I don't have a strong opinion either way, I'm just thinking that it lines up nicely with the various other column names in the constants config.

If there's a particular reason why a user might need to change this column more so than any other columns though, then I'm happy for it to stay here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct. back_data_type and imputation_marker_col should be in constants_config. As part of this review, constants_config has been changed to config_dev in the interest of consistency with the other one, config_user. I have move both to the config_dev

Copy link
Collaborator

@NathanKelly-ONS NathanKelly-ONS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Derrick, this looks a lot better, thanks for making those changes!

There's a couple of things I think still need to be changed. I don't believe, currently, that the new configs are being used in the pipeline at all - it's still using config.json, so it's difficult to tell if the pipeline/the integration tests will run on the new configs. Could you make sure the new configs are being run in main? (Probably the easiest way is to amend the load_config function as per my comment below, as that function is already being called in main), otherwise you'll need to make a change directly to main.py.

Aside from that, I think it'd be worthwhile creating a unit test for merge_two_config_files, just as good practice and to make sure we're maximising test coverage.

I'll be happy to merge after that, once I've seen it works as expected and passes the integration tests when incorporated into main :)

import json


def merge_two_config_files(
Copy link
Collaborator

@NathanKelly-ONS NathanKelly-ONS Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Related to my previous comment) I think the load_config function is essentially redundant with this function, because it's only designed to read in a single config, which we won't have any more once we move towards having a user_config and dev_config.

I also don't believe that we're permanently writing out the combined config anywhere (by design, as we discussed yesterday) so there's no way for the load_config function to actually pick up the correct config when you're running the pipeline.

Whilst I think this function would do the job, I don't think it's best practice to have a function doing two things like this is (loading AND merging) - it's generally better for each function to do a single thing.

So I think the best thing to do is to update load_config so it takes two filepaths, and returns dev_config and user_config, and then you can pass those objects into merge_two_configs. That way, load_config is literally just loading the configs, and merge_two_configs is literally just merging two configs.

Also, this will mean that the new configs are actually being used in main too, so we can make sure the pipeline still runs as expected and it's passing the integration tests.

Also, sorry I missed this in my first review: could you add a unit test for this too please?

@@ -1,32 +1,9 @@
{
"platform" : "network",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we discussed with Jordan earlier (not sure why I can't tag him) that having platform set to network by default is likely going to cause issues - we don't want the user to have to change (or, ideally, even see) the dev config. But users will be running the pipeline on DAP, so this should be set to s3. I'm not sure if we want to set this to s3 by default or potentially explore other options (otherwise, the devs will need to change this from s3 to network every time they want to test run the pipeline locally).

Might be worthwhile asking the team to see what they think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean here. I have updated the platform to s3 in config_dev.json. This will now be the default value for the users. Dev team can switch between s3 and network.

In addition, platform": "network has been added to the test_config dictionary in test_main.py. This will ensure that the test is carried out within the network option.

@AntonZogk AntonZogk mentioned this pull request Mar 11, 2025
15 tasks
if config_user_dict is not None:
config.update(config_user_dict)
logger.info("config dictionary updated with config user dictionary")
config["mbs_results_path"] = config["folder_path"] + config["mbs_file_name"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this "mbs_results_path" parameter? I don't think this is used anywhere in the pipeline so I don't think we need to create it here - unless there's a reason why you're creating it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mbs_results_path key was their in the original load_config() in inputs.py. So, I left it. However, as you mentioned, I have also searched and I couldn't find anywhere it was used. It is correct to tag it redundant, hence, can be removed. Should I remove it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants