-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
733 review config newest #182
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Derrick, good work so far! I definitely can see how this is going to be useful and simplify the configs quite a lot. I have a few suggestions and a couple of questions. I think some aspects will be useful to get people's input at the show-and-tell, so looking forward to that.
The only thing I haven't been able to test yet is the logger functionality, I'm not sure why I can't seem to get it working, but I'll keep testing - it could be a me problem.
One more general thing is I think it makes sense to put the two configs into a separate folder. We'll need to check it still works as expected and doesn't break the pipeline since the filepaths will be slightly different, but it will allow us to have a README for that specific folder which will be useful - otherwise the config README that I'm working on won't be particularly visible.
import json | ||
|
||
|
||
def merge_two_config_files( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two points about this function more generally:
I think it's worthwhile calling this function as part of the load_config function and testing whether it works - the new configs currently aren't used anywhere in the script as far as I can tell. I can't currently tell whether if we replace the current config with the combined config, whether this will work in place without any additional changes - but I think that would be ideal, so it's a good idea to test this to see whether it works.
Also, relevant to the above, this function is essentially loading the configs and merging them. I think it makes sense to separate this out - so perhaps update the load_config function to read in both configs, then once the configs have been read in, pass the actual variables into merge_two_config_files, then strip out the with open(...)
parts of merge_two_config_files, so the merge function is literally just merging and the load function is literally just loading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have updated the function merge_two_config_files
to perform the merging and return the config object. I have also updated the create_testing_config
in the helper_functions.py
to call it.
I will retain the option of the creation of config.json
from merging of config_user
and config_dev
. This keeping the current structure of running and testing of the pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Related to my previous comment) I think the load_config function is essentially redundant with this function, because it's only designed to read in a single config, which we won't have any more once we move towards having a user_config and dev_config.
I also don't believe that we're permanently writing out the combined config anywhere (by design, as we discussed yesterday) so there's no way for the load_config function to actually pick up the correct config when you're running the pipeline.
Whilst I think this function would do the job, I don't think it's best practice to have a function doing two things like this is (loading AND merging) - it's generally better for each function to do a single thing.
So I think the best thing to do is to update load_config so it takes two filepaths, and returns dev_config and user_config, and then you can pass those objects into merge_two_configs. That way, load_config is literally just loading the configs, and merge_two_configs is literally just merging two configs.
Also, this will mean that the new configs are actually being used in main too, so we can make sure the pipeline still runs as expected and it's passing the integration tests.
Also, sorry I missed this in my first review: could you add a unit test for this too please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated this part now. It's working fine as demoed in the show and tell earlier today. I have also added test_merge_two_config_files.py to test the merging of the two config files
mbs_results/config_user.json
Outdated
"threshold_filepath":"", | ||
|
||
"back_data_type":"type", | ||
"imputation_marker_col":"imputation_flags_adjustedresponse", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing - do we think this makes sense to have in the user config? Or do we think it belongs with the other column names in the constants_config? I don't have a strong opinion either way, I'm just thinking that it lines up nicely with the various other column names in the constants config.
If there's a particular reason why a user might need to change this column more so than any other columns though, then I'm happy for it to stay here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct. back_data_type
and imputation_marker_col
should be in constants_config
. As part of this review, constants_config
has been changed to config_dev
in the interest of consistency with the other one, config_user
. I have move both to the config_dev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Derrick, this looks a lot better, thanks for making those changes!
There's a couple of things I think still need to be changed. I don't believe, currently, that the new configs are being used in the pipeline at all - it's still using config.json, so it's difficult to tell if the pipeline/the integration tests will run on the new configs. Could you make sure the new configs are being run in main? (Probably the easiest way is to amend the load_config function as per my comment below, as that function is already being called in main), otherwise you'll need to make a change directly to main.py.
Aside from that, I think it'd be worthwhile creating a unit test for merge_two_config_files, just as good practice and to make sure we're maximising test coverage.
I'll be happy to merge after that, once I've seen it works as expected and passes the integration tests when incorporated into main :)
import json | ||
|
||
|
||
def merge_two_config_files( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Related to my previous comment) I think the load_config function is essentially redundant with this function, because it's only designed to read in a single config, which we won't have any more once we move towards having a user_config and dev_config.
I also don't believe that we're permanently writing out the combined config anywhere (by design, as we discussed yesterday) so there's no way for the load_config function to actually pick up the correct config when you're running the pipeline.
Whilst I think this function would do the job, I don't think it's best practice to have a function doing two things like this is (loading AND merging) - it's generally better for each function to do a single thing.
So I think the best thing to do is to update load_config so it takes two filepaths, and returns dev_config and user_config, and then you can pass those objects into merge_two_configs. That way, load_config is literally just loading the configs, and merge_two_configs is literally just merging two configs.
Also, this will mean that the new configs are actually being used in main too, so we can make sure the pipeline still runs as expected and it's passing the integration tests.
Also, sorry I missed this in my first review: could you add a unit test for this too please?
mbs_results/configs/config_dev.json
Outdated
@@ -1,32 +1,9 @@ | |||
{ | |||
"platform" : "network", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we discussed with Jordan earlier (not sure why I can't tag him) that having platform set to network by default is likely going to cause issues - we don't want the user to have to change (or, ideally, even see) the dev config. But users will be running the pipeline on DAP, so this should be set to s3. I'm not sure if we want to set this to s3 by default or potentially explore other options (otherwise, the devs will need to change this from s3 to network every time they want to test run the pipeline locally).
Might be worthwhile asking the team to see what they think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean here. I have updated the platform
to s3
in config_dev.json
. This will now be the default value for the users. Dev team can switch between s3
and network
.
In addition, platform": "network
has been added to the test_config
dictionary in test_main.py
. This will ensure that the test is carried out within the network
option.
mbs_results/utilities/inputs.py
Outdated
if config_user_dict is not None: | ||
config.update(config_user_dict) | ||
logger.info("config dictionary updated with config user dictionary") | ||
config["mbs_results_path"] = config["folder_path"] + config["mbs_file_name"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this "mbs_results_path" parameter? I don't think this is used anywhere in the pipeline so I don't think we need to create it here - unless there's a reason why you're creating it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mbs_results_path
key was their in the original load_config()
in inputs.py. So, I left it. However, as you mentioned, I have also searched and I couldn't find anywhere it was used. It is correct to tag it redundant, hence, can be removed. Should I remove it?
…tal/monthly-business-survey-results into 733-review-config-newest
…-survey-results into 733-review-config-newest
…tal/monthly-business-survey-results into 733-review-config-newest
Pull Request Title
Review config handling: Split
config.json
and implement merge utility.Summary
Add your summary here - keep it brief, to the point, and in plain English.
This PR splits the existing
config.json
into two files:config_user.json
andconfig_constants.json
. A new utility script is added to merge these files back intoconfig.json
. Additionally, updates are made tocopy_script_and_config.py
andhelper_functions.py
to handle the merging, andlogger
setup is introduced in `mbs_results/init.py for use in the code pipeline.Type of Change
Checklists
This pull request meets the following requirements:
Creator Checklist
If you feel some of these conditions do not apply for this pull request, please
add a comment to explain why.
Reviewer Checklist
Additional Information
Please provide any additional information or context that would help the reviewer understand the changes in this pull request.
Related Issues
Link any related issues or pull requests here.