
Reduce load on runs db #105

Open
JelleAalbers opened this issue May 28, 2017 · 4 comments


JelleAalbers commented May 28, 2017

I went through cax looking for code that heavily loads the runs db. Probably one of these three is most significant:

  1. massive_ruciax, massive_tsmclient and cax_tape_log_file basically do while True: fetch most of runs db:
    https://github.com/XENON1T/cax/blob/8f2a99450cc79a3db7ccec0294ead7a6952122fd~~/cax/main.py#L1146

The last one literally does this, the others have a bit of a query restricting it. massive_cax doesn't have this problem, as it uses a projection for the relevant query:

cax/cax/main.py

Line 289 in 8f2a994

projection=['start', 'number','name',

  2. Every correction downloads the latest correction doc every time we check a run. One doc in particular is huge -- the electron lifetime beast. Perhaps we should just query the version string first.

  3. Every cax task gets the full run doc of every run it runs on:

self.run_doc = self.collection.find_one({'_id': id})

That includes runs it's just checking to e.g. see if corrections are up to date.
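A minimal sketch of the "query the version string first" idea from the corrections item, assuming the correction docs carry a `version` field and that sorting by a `calculation_time` field identifies the latest one. Both field names and the helper `latest_correction_if_changed` are hypothetical and not taken from cax; the only real API relied on is pymongo's `find_one(filter, projection, sort=...)`.

```python
def latest_correction_if_changed(collection, cached_version):
    """Return the newest correction doc, or None if cached_version is current.

    `collection` is expected to behave like a pymongo Collection.
    The field names 'version' and 'calculation_time' are assumptions.
    """
    newest_first = [('calculation_time', -1)]  # -1 means descending order
    # Cheap query: only the version string (plus _id) is returned.
    head = collection.find_one({}, {'version': 1}, sort=newest_first)
    if head is None or head.get('version') == cached_version:
        return None
    # The version changed: only now fetch the full (possibly huge) document.
    return collection.find_one({'_id': head['_id']})
```

With something like this, a task that already holds the current version would issue one small query per check instead of downloading the whole doc each time.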

Then there are some more minor things:

  1. During '_process' there's also a full run doc query, when we're just checking if we need to process a run or not:

doc = collection.find_one(query) # Query DB

  2. Massive-ruciax and massive-tsmclient fetch most of the docs in the full runs db when starting up (but only once).
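
For the "just checking" cases in this comment, here is a sketch of fetching only the fields a check needs instead of the whole run doc. The helper name `needs_processing` and the `data` / `type` / `status` layout are assumptions for illustration, not necessarily the actual cax schema or logic; the only real API used is pymongo's `find_one(filter, projection)`.

```python
def needs_processing(collection, run_number):
    """Decide whether a run still needs processing, reading minimal data.

    `collection` is expected to behave like a pymongo Collection.
    The 'data' entries with 'type' and 'status' keys are an assumed layout.
    """
    doc = collection.find_one(
        {'number': run_number},
        # Projection: return only these fields instead of the full run doc.
        {'number': 1, 'data.type': 1, 'data.status': 1},
    )
    if doc is None:
        return False  # unknown run: nothing to process
    # Treat the run as done if any data entry is a transferred processed copy.
    return not any(
        d.get('type') == 'processed' and d.get('status') == 'transferred'
        for d in doc.get('data', [])
    )
```

The same pattern could apply to the per-task `find_one({'_id': id})` call: pass a projection listing the fields that task actually reads.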

XeBoris commented May 29, 2017

Hello,

Regarding 1:
Taking this link as example:

cax/cax/main.py

Line 723 in 8f2a994

docs = list(collection.find(query,
Maybe I don't fully understand the concept, but line 716 selects the columns that are read from the database, and the lines above it (698 - 713) limit the set of runs according to the input to massive-ruciax. The idea is to pull the necessary information from the run database once, to know which runs then need to be uploaded. I don't see how to reduce this request to the runDB.

@JelleAalbers

Ah, you're right, I was mistaken due to the comment "Select specific data sets" with the selections variable. In fact what you call selections is a projection, so you don't get all the runs info. In massive_cax the projection is called projection (

cax/cax/main.py

Line 289 in 8f2a994

projection=['start', 'number','name',
), so when I didn't see it in the other caxes, I thought it wasn't there.

Still, I think cax_tape_log_file doesn't have a projection, here:

cax/cax/main.py

Line 1146 in 8f2a994

while True: # yeah yeah

while True:
    query = {}
    docs = list(collection.find(query))

that looks like a full db dump, right?


XeBoris commented May 29, 2017

Yes, it does. This function is older and was written when I was less experienced with the selection process. On the other hand, this function is only called once or twice a week to test database entries. I can adjust it at some point in the future, but given the low usage I don't think it is necessary.

@JelleAalbers

Ok, I thought it was a program that was left running (since it has a while True that only breaks if --run_once is passed). I guess you always run it with run_once then.
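
If such a loop ever did need to stay running, one way to avoid the repeated full dump would be a projection plus a pause between passes. This is a sketch only: `poll_runs`, the `handle` callback, the projected fields and the wait interval are hypothetical, and `run_once` merely mirrors the `--run_once` flag mentioned above. It relies on pymongo's `find(filter, projection)` returning an iterable cursor.

```python
import time


def poll_runs(collection, handle, run_once=False, wait_seconds=600):
    """Repeatedly scan the runs collection while fetching only small fields.

    `collection` is expected to behave like a pymongo Collection and
    `handle` is a caller-supplied function applied to each projected doc.
    """
    while True:
        # The projection keeps each pass light, and iterating the cursor
        # avoids building the whole result in memory with list(...).
        for doc in collection.find({}, {'number': 1, 'name': 1}):
            handle(doc)
        if run_once:
            break
        time.sleep(wait_seconds)  # pause so the db is not queried back to back
```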
