(Note: maybe I'm completely misunderstanding how multiple allocs are supposed to be used, so please tell me if this can be solved in a different way!)

I have two use cases and I couldn't tell whether they are already supported (or whether they already work in some way I don't know about):
1. On some of our supercomputers at the Swiss supercomputing centre CSCS, we have two partitions, one CPU-only and one with GPUs. Therefore, I'd like to start two allocs, one for each (e.g. `-C mc` vs. `-C gpu`), give each of them a name, and when I submit a job to `hq`, specify on which of the two it should run. Is this already possible? (The docs state that the name is only used for debugging purposes.)
2. If I understand correctly, when I start the alloc, I specify all parameters, including how many nodes each worker should have. Is there a way to automatically start workers with different numbers of nodes? E.g. on a machine whose nodes have 128 cores, I might want jobs with 16, 32, 64 or 128 cores to run in workers that use 1 node; but I might also have jobs that I want to run on 2 or 4 nodes, and for these I'd like some workers to span 2 or 4 nodes as well. At the same time, I don't want the jobs with <= 128 cores to run on those workers, only on the 1-node ones. Can this already be achieved by creating 3 allocs with 1, 2 and 4 nodes respectively? I tried this but saw some strange results that I couldn't debug properly or test extensively yet. I think I had only the 1-node alloc at first; I submitted a 2-node job to `hq`, and I believe `hq` started submitting workers; once they entered the SLURM queue it realized they weren't suitable for the job, so it started submitting more. I then also created a 2-node alloc, but it didn't enter the queue for quite some time because of the machine's priorities, and in the meantime I kept having 1-node workers being submitted, starting, and dying after the 5-minute idle timeout, with new ones getting launched again.
For this use case, I would need either something like point 1 above (so that I can send a job to one specific alloc and avoid incompatible workers repeatedly starting, timing out and restarting), or some automatic way for `hq` to understand how many nodes a job requires and send it to the correct alloc.
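To make the first use case concrete, this is roughly the workflow I'm imagining. The `hq alloc add slurm` form (with everything after `--` passed through to `sbatch`) and `--name` are how I understood the docs; the `--alloc-queue` submit flag is purely hypothetical and is exactly the part I'm asking about:

```console
# Two named allocation queues, one per partition
# (arguments after "--" go to sbatch)
hq alloc add slurm --name cpu-queue --time-limit 1h -- -C mc
hq alloc add slurm --name gpu-queue --time-limit 1h -- -C gpu

# Hypothetical: pin a job to one specific queue at submit time
hq submit --alloc-queue gpu-queue ./my-gpu-job.sh
```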
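And for the second use case, a sketch of the setup I tried and the behaviour I'd like (again, targeting a specific alloc at submit time is hypothetical; the `--nodes` values after `--` are just the sbatch pass-through):

```console
# Three allocs whose workers span 1, 2 and 4 nodes
hq alloc add slurm --name nodes-1 -- --nodes=1
hq alloc add slurm --name nodes-2 -- --nodes=2
hq alloc add slurm --name nodes-4 -- --nodes=4

# A 2-node job: ideally hq would only spawn/use workers from the
# nodes-2 alloc, instead of cycling 1-node workers until they time out
hq submit --nodes=2 ./my-multinode-job.sh
```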