nntool.slurm¶
Classes¶
SlurmConfig | Configuration class for SLURM job submission and execution.
SlurmArgs | alias of SlurmConfig
SlurmFunction | The function for the slurm job, which can be used for a distributed or non-distributed job (controlled by use_distributed_env in the slurm dataclass).
Task | The base class for all tasks that will be run on Slurm.
DistributedTaskConfig | Configuration for distributed tasks.
PyTorchDistributedTask | A task that runs on Slurm and sets up the PyTorch distributed environment variables.
Functions¶
slurm_fn | A decorator to wrap a function to be run on slurm.
slurm_function | A decorator to annotate a function to be run in slurm.
slurm_launcher | A slurm launcher decorator for a distributed or non-distributed job (controlled by use_distributed_env in the slurm field).
Descriptions¶
- class nntool.slurm.SlurmConfig(mode='run', job_name='Job', partition='', output_parent_path='./', output_folder='slurm', node_list='', node_list_exclude='', num_of_node=1, tasks_per_node=1, gpus_per_task=0, cpus_per_task=1, gpus_per_node=None, mem='', timeout_min=9223372036854775807, stderr_to_stdout=False, setup=<factory>, pack_code=False, use_packed_code=False, code_root='.', code_file_suffixes=<factory>, exclude_code_folders=<factory>, use_distributed_env=False, distributed_env_task='torch', processes_per_task=1, distributed_launch_command='', extra_params_kwargs=<factory>, extra_submit_kwargs=<factory>, extra_task_kwargs=<factory>)[source]¶
Configuration class for SLURM job submission and execution.
- Parameters:
mode (Literal["run", "debug", "local", "slurm"]) – Running mode for the job. Options include: “run” (default, directly run the function), “debug” (run with debugging, which will invoke pdb when a breakpoint is reached), “local” (run the job locally via subprocess, without GPU allocation; CUDA_VISIBLE_DEVICES cannot be set), or “slurm” (run the job on a SLURM cluster).
job_name (str) – The name of the SLURM job. Default is ‘Job’.
partition (str) – The name of the SLURM partition to use. Default is ‘’.
output_parent_path (str) – The parent directory path for saving the slurm folder. Default is ‘./’.
output_folder (str) – The folder name where SLURM output files will be stored. Default is ‘slurm’.
node_list (str) – A string specifying the nodes to use. Leave blank to use all available nodes. Default is an empty string.
node_list_exclude (str) – A string specifying the nodes to exclude. Leave blank to use all nodes in the node list. Default is an empty string.
num_of_node (int) – The number of nodes to request. Default is 1.
tasks_per_node (int) – The number of tasks to run per node. Default is 1.
gpus_per_task (int) – The number of GPUs to request per task. Default is 0.
cpus_per_task (int) – The number of CPUs to request per task. Default is 1.
gpus_per_node (int) – The number of GPUs to request per node. If this is set, gpus_per_task will be ignored. Default is None.
mem (str) – The amount of memory (GB) to request. Leave blank to use the default memory configuration of the node. Default is an empty string.
timeout_min (int) – The time limit for the job in minutes. Default is sys.maxsize for effectively no limit.
stderr_to_stdout (bool) – Whether to redirect stderr to stdout. Default is False.
setup (List[str]) – A list of environment variable setup commands. Default is an empty list.
pack_code (bool) – Whether to pack the codebase before submission. Default is False.
use_packed_code (bool) – Whether to use the packed code for execution. Default is False.
code_root (str) – The root directory of the codebase, which will be used by code packing. Default is the current directory (.).
code_file_suffixes (List[str]) – A list of file extensions for code files to be included when packing. Default includes .py, .sh, .yaml, and .toml.
exclude_code_folders (List[str]) – A list of folder names relative to code_root that will be excluded from packing. Default excludes ‘wandb’, ‘outputs’, and ‘datasets’.
use_distributed_env (bool) – Whether to use a distributed environment for the job. Default is False.
distributed_env_task (Literal["torch"]) – The type of distributed environment task to use. Currently, only “torch” is supported. Default is “torch”.
processes_per_task (int) – The number of processes to run per task. This value is not used by SLURM itself but is needed to correctly set up the distributed environment. Default is 1.
distributed_launch_command (str) – The command to launch the distributed environment, using the variables {num_processes}, {num_machines}, {machine_rank}, {main_process_ip}, and {main_process_port} as placeholders. Default is an empty string.
extra_params_kwargs (Dict[str, str]) – Additional parameters for the SLURM job as a dictionary of key-value pairs. Default is an empty dictionary.
extra_submit_kwargs (Dict[str, str]) – Additional submit parameters for the SLURM job as a dictionary of key-value pairs. Default is an empty dictionary.
extra_task_kwargs (Dict[str, str]) – Additional task parameters for the SLURM job as a dictionary of key-value pairs. Default is an empty dictionary.
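A minimal construction sketch is shown below; the partition name, resource sizes, setup command, and the accelerate-style launch command are illustrative placeholders rather than values shipped with nntool, and the distributed fields are only relevant when use_distributed_env is True.

from nntool.slurm import SlurmConfig

config = SlurmConfig(
    mode="slurm",                  # or "run" / "debug" / "local" for non-SLURM execution
    job_name="example",
    partition="gpu",               # hypothetical partition name
    output_parent_path="./",
    output_folder="slurm",
    num_of_node=1,
    tasks_per_node=1,
    gpus_per_task=1,
    cpus_per_task=8,
    mem="32GB",
    setup=["export OMP_NUM_THREADS=8"],   # environment setup commands, illustrative
    # Distributed fields (only used when use_distributed_env=True).
    use_distributed_env=True,
    processes_per_task=1,
    # Illustrative template: the documented placeholders are substituted at launch
    # time; the accelerate flags are one possible mapping, not the only one.
    distributed_launch_command=(
        "accelerate launch"
        " --num_processes {num_processes}"
        " --num_machines {num_machines}"
        " --machine_rank {machine_rank}"
        " --main_process_ip {main_process_ip}"
        " --main_process_port {main_process_port}"
    ),
)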
- nntool.slurm.SlurmArgs[source]¶
alias of
SlurmConfig
- class nntool.slurm.SlurmFunction(submit_fn, default_submit_fn_args=None, default_submit_fn_kwargs=None)[source]¶
The function for the slurm job, which can be used for a distributed or non-distributed job (controlled by use_distributed_env in the slurm dataclass).
- afterany(*jobs)[source]¶
Mark that the function should be executed after any one of the provided slurm jobs has terminated.
- Returns:
the new slurm function with the condition
- Return type:
SlurmFunction
- afternotok(*jobs)[source]¶
Mark that the function should be executed after any one of the provided slurm jobs has failed.
- Returns:
the new slurm function with the condition
- Return type:
SlurmFunction
- afterok(*jobs)[source]¶
Mark that the function should be executed after all of the provided slurm jobs have completed successfully.
- Returns:
the new slurm function with the condition
- Return type:
SlurmFunction
- configure(slurm_config, slurm_params_kwargs=None, slurm_submit_kwargs=None, slurm_task_kwargs=None, system_argv=None, pack_code_include_fn=None, pack_code_exclude_fn=None)[source]¶
Update the slurm configuration for the slurm function and return a new configured copy, which can be used for a distributed or non-distributed job (controlled by use_distributed_env in the slurm dataclass).
Exported Distributed Environment Variables
- NNTOOL_SLURM_HAS_BEEN_SET_UP is a special environment variable to indicate that the slurm environment has been set up.
- After the setup, the distributed job will be launched and the following variables are exported:
num_processes: int
num_machines: int
machine_rank: int
main_process_ip: str
main_process_port: int
- Parameters:
slurm_config (SlurmConfig) – the slurm configuration dataclass, defaults to None
slurm_params_kwargs (Dict[str, str] | None) – extra slurm arguments for the slurm configuration, defaults to {}
slurm_submit_kwargs (Dict[str, str] | None) – extra slurm arguments for srun or sbatch, defaults to {}
slurm_task_kwargs (Dict[str, str] | None) – extra arguments for the setting of distributed task, defaults to {}
system_argv (List[str] | None) – the system arguments for the second launch in the distributed task (by default it will use the current system arguments sys.argv[1:]), defaults to None
- Returns:
a new copy with configured slurm parameters
- Return type:
SlurmFunction
- is_configured()[source]¶
Whether the slurm function has been configured.
- Returns:
True if the slurm function has been configured, False otherwise
- Return type:
bool
- is_distributed()[source]¶
Whether the slurm function is distributed.
- Returns:
True if the slurm function is distributed, False otherwise
- Return type:
bool
- map_array(*submit_fn_args, **submit_fn_kwargs)[source]¶
Run the submit_fn with the given arguments and keyword arguments. The call is non-blocking in slurm mode and blocking in the other modes. If no arguments or keyword arguments are given, the default arguments and keyword arguments will be used.
- Parameters:
submit_fn_args – arguments for the submit_fn
submit_fn_kwargs – keyword arguments for the submit_fn
- Raises:
Exception – if the submit_fn is not set up
- Returns:
Slurm Job or the return value of the submit_fn
- Return type:
Job[Any] | List[Job[Any]] | Any
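Since the return type above varies with the running mode (a single Job, a list of Jobs, or the plain return value), a defensive calling sketch might look like the following; the add function, the configuration values, and the arguments are hypothetical, and the arguments are simply passed through to submit_fn as the description states.

from nntool.slurm import SlurmConfig, slurm_fn

@slurm_fn
def add(a, b):
    return a + b

slurm_config = SlurmConfig(mode="slurm", partition="PARTITION", job_name="ARRAY")

# Arguments are forwarded to the decorated function, per the description above.
out = add[slurm_config].map_array(1, b=2)

if isinstance(out, list):        # slurm mode: a list of jobs
    results = [job.result() for job in out]
elif hasattr(out, "result"):     # slurm mode: a single job
    results = [out.result()]
else:                            # "run" / "debug" / "local": blocking call, raw value
    results = [out]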
- on_condition(jobs, condition='afterok')[source]¶
Mark that this job should be executed after the provided slurm jobs have finished. Different conditions can be combined by calling this function multiple times.
- Parameters:
jobs (Job | List[Job] | Tuple[Job]) – dependent jobs
condition (Literal['afterany', 'afterok', 'afternotok']) – run condition, defaults to “afterok”
- Returns:
the function itself
- Return type:
SlurmFunction
- submit(*submit_fn_args, **submit_fn_kwargs)[source]¶
An alias of __call__.
- Parameters:
submit_fn_args – arguments for the submit_fn
submit_fn_kwargs – keyword arguments for the submit_fn
- Raises:
Exception – if the submit_fn is not set up
- Returns:
Slurm Job or the return value of the submit_fn
- Return type:
Job | Any
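A dependency-chaining sketch combining the documented afterok and submit methods; the two pipeline functions are hypothetical, and it is assumed that the indexing form func[slurm_config] from the slurm_fn example below yields a configured SlurmFunction on which the dependency methods can be called.

from nntool.slurm import SlurmConfig, slurm_fn

@slurm_fn
def preprocess(dataset: str) -> str:
    return f"/tmp/{dataset}.processed"   # hypothetical preprocessing step

@slurm_fn
def train(dataset: str) -> None:
    print(f"training on {dataset}")      # hypothetical training step

slurm_config = SlurmConfig(mode="slurm", partition="PARTITION", job_name="PIPELINE")

# submit() is an alias of __call__ and returns a Job in slurm mode.
prep_job = preprocess[slurm_config].submit("imagenet")

# afterok() returns a new SlurmFunction whose job starts only after the
# given jobs have completed successfully.
train_job = train[slurm_config].afterok(prep_job).submit("imagenet")

train_job.result()   # block until the dependent job finishes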
- nntool.slurm.slurm_fn(submit_fn)[source]¶
A decorator to wrap a function to be run on slurm. The decorated function should be launched as shown in the example below. The decorated submit_fn is non-blocking; to block and get the return value, call job.result().
- Parameters:
submit_fn (Callable) – the function to be run on slurm
- Returns:
the function to be run on slurm
- Return type:
SlurmFunction
Example
>>> @slurm_fn
... def run_on_slurm(a, b):
...     return a + b
>>> slurm_config = SlurmConfig(
...     mode="slurm",
...     partition="PARTITION",
...     job_name="EXAMPLE",
...     tasks_per_node=1,
...     cpus_per_task=8,
...     mem="1GB",
... )
>>> job = run_on_slurm[slurm_config](1, b=2)
>>> result = job.result()  # block and get the result
- nntool.slurm.slurm_function(submit_fn)[source]¶
A decorator to annotate a function to be run in slurm. The decorated function should be launched as shown in the example below.
- Deprecated:
This function is deprecated and will be removed in future versions. Please use slurm_fn instead.
Example
>>> @slurm_function
... def run_on_slurm(a, b):
...     return a + b
>>> slurm_config = SlurmConfig(
...     mode="slurm",
...     partition="PARTITION",
...     job_name="EXAMPLE",
...     tasks_per_node=1,
...     cpus_per_task=8,
...     mem="1GB",
... )
>>> job = run_on_slurm(slurm_config)(1, b=2)
>>> result = job.result()  # block and get the result
- nntool.slurm.slurm_launcher(ArgsType, parser='tyro', slurm_key='slurm', slurm_params_kwargs={}, slurm_submit_kwargs={}, slurm_task_kwargs={}, *extra_args, **extra_kwargs)[source]¶
A slurm launcher decorator for a distributed or non-distributed job (controlled by use_distributed_env in the slurm field). This decorator should be used as the program entry point. The decorated function is non-blocking in slurm mode and blocking in the other modes.
- Parameters:
ArgsType (Type[Any]) – the experiment arguments type, which should be a dataclass (it must have a slurm field named by slurm_key)
slurm_key (str) – the key of the slurm field in the ArgsType, defaults to “slurm”
parser (str | Callable) – the parser for the arguments, defaults to “tyro”
slurm_config – SlurmConfig, the slurm configuration dataclass
slurm_params_kwargs (dict) – extra slurm arguments for the slurm configuration, defaults to {}
slurm_submit_kwargs (dict) – extra slurm arguments for srun or sbatch, defaults to {}
slurm_task_kwargs (dict) – extra arguments for the setting of distributed task, defaults to {}
extra_args – extra arguments for the parser
extra_kwargs – extra keyword arguments for the parser
- Returns:
decorator function with main entry
- Return type:
Callable[[Callable[[…], Any]], SlurmFunction]
- Exported Distributed Environment Variables:
NNTOOL_SLURM_HAS_BEEN_SET_UP is a special environment variable to indicate that the slurm environment has been set up.
After the setup, the distributed job will be launched and the following variables are exported: num_processes: int, num_machines: int, machine_rank: int, main_process_ip: str, main_process_port: int.
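A hedged entry-point sketch follows: the Args dataclass and its experiment fields are made up, and it is assumed (based on the parameter descriptions above) that tyro parses the dataclass, the slurm field supplies the SlurmConfig, and the decorated main receives the parsed Args instance.

from dataclasses import dataclass, field

from nntool.slurm import SlurmConfig, slurm_launcher

@dataclass
class Args:
    # The launcher looks up the SlurmConfig under slurm_key ("slurm" by default).
    slurm: SlurmConfig = field(default_factory=SlurmConfig)
    lr: float = 3e-4     # hypothetical experiment option
    epochs: int = 10     # hypothetical experiment option

@slurm_launcher(Args)
def main(args: Args) -> None:
    # Assumption: the decorated entry point receives the parsed Args instance.
    print(f"training for {args.epochs} epochs at lr={args.lr}")

if __name__ == "__main__":
    main()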
- class nntool.slurm.Task(argv, slurm_config, verbose=False)[source]¶
The base class for all tasks that will be run on Slurm. Especially useful for distributed tasks that need to set up the distributed environment variables.
- Parameters:
argv (list[str]) – the command line arguments to run the task. This will be passed to the command method to reconstruct the command line.
slurm_config (SlurmConfig) – the Slurm configuration to use for the task.
verbose (bool, optional) – whether to print verbose output. Defaults to False.
- class nntool.slurm.DistributedTaskConfig(num_processes='$nntool_num_processes', num_machines='$nntool_num_machines', machine_rank='$nntool_machine_rank', main_process_ip='$nntool_main_process_ip', main_process_port='$nntool_main_process_port')[source]¶
Configuration for distributed tasks. This is used to set up the distributed environment variables for PyTorch distributed training.
- Parameters:
num_processes (int) – The total number of processes to run across all machines.
num_machines (int) – The number of machines to run the task on.
machine_rank (int) – The rank of the current machine in the distributed setup.
main_process_ip (str) – The IP address of the main process (rank 0) in the distributed setup.
main_process_port (int) – The port of the main process (rank 0) in the distributed setup.
- class nntool.slurm.PyTorchDistributedTask(launch_cmd, argv, slurm_config, verbose=False, **env_setup_kwargs)[source]¶
A task that runs on Slurm and sets up the PyTorch distributed environment variables. It runs the command locally if in other modes.
- Parameters:
launch_cmd (str) – The command to launch the task.
argv (list[str]) – The command line arguments for the task.
slurm_config (SlurmConfig) – The Slurm configuration to use for the task.
verbose (bool, optional) – whether to print verbose output. Defaults to False.
References
https://github.com/huggingface/accelerate/issues/1239
https://github.com/yuvalkirstain/PickScore/blob/main/trainer/slurm_scripts/slurm_train.py
https://github.com/facebookincubator/submitit/pull/1703