Reduce job submission count #134
Comments
I ran a first test on datamanager, adding the correction tasks together with the checksum task, but they take too much time: each task loops over all the runs, and in the meantime the runs are left waiting to be verified.
The test was definitely negative, even with a new cax process,
Ciao, maybe I found a way to stop massive-cax from submitting thousands of useless jobs. Basically, I made a check on the variables present on RunDB in Of course, another check is whether the processed and minitree files are present in the local directories on Midway, together with working out where the code is running (which host).
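The kind of pre-submission check described above could look roughly like the sketch below. It is not the actual massive-cax code: the run document layout (`name`, `data`, `host`, `type`, `status`, `location`), the `"transferred"` status value, the minitree file naming, and the function name `needs_job` are all assumptions made for illustration.

```python
import os
import socket


def needs_job(run_doc, host=None, minitree_dir=".", required_types=("processed",)):
    """Return True if a job should still be submitted for this run.

    All field names below are hypothetical stand-ins for whatever
    the RunDB document really contains.
    """
    # Work out where the code is running unless the caller says so.
    host = host or socket.gethostname()

    # Only look at data entries registered for this host.
    entries = [d for d in run_doc.get("data", []) if d.get("host") == host]

    for data_type in required_types:
        done = any(
            d.get("type") == data_type
            and d.get("status") == "transferred"
            and os.path.exists(d.get("location", ""))
            for d in entries
        )
        if not done:
            return True  # something is missing, a job is justified

    # Hypothetical naming convention: one minitree file per run name.
    minitree = os.path.join(minitree_dir, run_doc["name"] + ".root")
    return not os.path.exists(minitree)
```

With a check like this, the submission loop can skip any run for which `needs_job(...)` returns `False`, so no job is created just to discover that there is nothing to do.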
@lucrlom I think we can raise the memory on xetransfer for the virtual machine xe1t-datamanager if it helps. In any case, I run two cax-like sessions (massive-cax and massive-ruciax) as the user xe1ttransfer. Each process needs ~12 GB of memory. I haven't yet understood why these processes need so much memory (it seems like a lot to me).
We need to reduce the number of short jobs being submitted on Midway. Some possible solutions, which may or may not be combined:
1. Bundling runs (which should be fine, since we're not running very long pax processing anymore) so that each job runs longer.
2. Using job arrays to reduce the number of jobs the scheduler handles (I think).
3. Running Corrections locally (this seems to be fast now after the previous hax improvements) and implementing local checks for intensive processes (e.g. `AddChecksum`, `ProcessBatchQueueHax`) before submitting jobs that actually run those tasks.
4. Adding minitrees to RunsDB, to facilitate the local checking in 3.
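Options 1 and 2 can be combined. The sketch below groups run names into bundles and writes a single batch script in which each array task handles one bundle. It assumes the scheduler on Midway is Slurm (hence `#SBATCH --array` and `SLURM_ARRAY_TASK_ID`); `process_run` is a hypothetical placeholder for the real per-run command, and the function names are made up for this example.

```python
import shlex


def bundle_runs(run_names, bundle_size):
    """Split a list of run names into consecutive bundles."""
    if bundle_size < 1:
        raise ValueError("bundle_size must be at least 1")
    return [run_names[i:i + bundle_size]
            for i in range(0, len(run_names), bundle_size)]


def build_array_script(bundles, command="process_run", job_name="bundled_runs"):
    """Return the text of one Slurm array script covering all bundles.

    `command` is a placeholder, not a real cax entry point.
    """
    if not bundles:
        raise ValueError("no bundles to submit")
    # One bash array element per bundle, runs separated by spaces.
    quoted = " ".join(shlex.quote(" ".join(b)) for b in bundles)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --array=0-{len(bundles) - 1}",
        f"BUNDLES=({quoted})",
        # Unquoted on purpose: the bundle string is split into run names.
        "for run in ${BUNDLES[$SLURM_ARRAY_TASK_ID]}; do",
        f'    {command} "$run"',
        "done",
    ]
    return "\n".join(lines) + "\n"
```

For example, `build_array_script(bundle_runs(runs, 20))` turns 1000 runs into one submission with 50 array tasks, instead of 1000 separate short jobs. Run names are assumed to contain no whitespace.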