nntool.slurm

Classes

SlurmConfig

Configuration class for SLURM job submission and execution.

SlurmArgs

alias of SlurmConfig

SlurmFunction

The function for a slurm job, which can be used for a distributed or non-distributed job (controlled by use_distributed_env in the slurm dataclass).

Task

The base class for all tasks that will be run on Slurm.

DistributedTaskConfig

Configuration for distributed tasks.

PyTorchDistributedTask

A task that runs on Slurm and sets up the PyTorch distributed environment variables.

Functions

slurm_fn

A decorator to wrap a function to be run on slurm.

slurm_function

A decorator to annotate a function to be run in slurm.

slurm_launcher

A slurm launcher decorator for distributed or non-distributed jobs (controlled by use_distributed_env in the slurm field).

Descriptions

class nntool.slurm.SlurmConfig(mode='run', job_name='Job', partition='', output_parent_path='./', output_folder='slurm', node_list='', node_list_exclude='', num_of_node=1, tasks_per_node=1, gpus_per_task=0, cpus_per_task=1, gpus_per_node=None, mem='', timeout_min=9223372036854775807, stderr_to_stdout=False, setup=<factory>, pack_code=False, use_packed_code=False, code_root='.', code_file_suffixes=<factory>, exclude_code_folders=<factory>, use_distributed_env=False, distributed_env_task='torch', processes_per_task=1, distributed_launch_command='', extra_params_kwargs=<factory>, extra_submit_kwargs=<factory>, extra_task_kwargs=<factory>)[source]

Configuration class for SLURM job submission and execution.

Parameters:
  • mode (Literal["run", "debug", "local", "slurm"]) – Running mode for the job. Options include: “run” (default, directly run the function), “debug” (run under the debugger, dropping into pdb if a breakpoint is reached), “local” (run the job locally via subprocess, without GPU allocation, so CUDA_VISIBLE_DEVICES cannot be set), or “slurm” (run the job on a SLURM cluster).

  • job_name (str) – The name of the SLURM job. Default is ‘Job’.

  • partition (str) – The name of the SLURM partition to use. Default is ‘’.

  • output_parent_path (str) – The parent directory path for saving the slurm folder. Default is ‘./’.

  • output_folder (str) – The folder name where SLURM output files will be stored. Default is ‘slurm’.

  • node_list (str) – A string specifying the nodes to use. Leave blank to use all available nodes. Default is an empty string.

  • node_list_exclude (str) – A string specifying the nodes to exclude. Leave blank to use all nodes in the node list. Default is an empty string.

  • num_of_node (int) – The number of nodes to request. Default is 1.

  • tasks_per_node (int) – The number of tasks to run per node. Default is 1.

  • gpus_per_task (int) – The number of GPUs to request per task. Default is 0.

  • cpus_per_task (int) – The number of CPUs to request per task. Default is 1.

  • gpus_per_node (int) – The number of GPUs to request per node. If this is set, gpus_per_task will be ignored. Default is None.

  • mem (str) – The amount of memory (GB) to request. Leave blank to use the default memory configuration of the node. Default is an empty string.

  • timeout_min (int) – The time limit for the job in minutes. Default is sys.maxsize for effectively no limit.

  • stderr_to_stdout (bool) – Whether to redirect stderr to stdout. Default is False.

  • setup (List[str]) – A list of environment variable setup commands. Default is an empty list.

  • pack_code (bool) – Whether to pack the codebase before submission. Default is False.

  • use_packed_code (bool) – Whether to use the packed code for execution. Default is False.

  • code_root (str) – The root directory of the codebase, which will be used by the code packing. Default is the current directory (.).

  • code_file_suffixes (List[str]) – A list of file extensions for code files to be included when packing. Default includes .py, .sh, .yaml, and .toml.

  • exclude_code_folders (List[str]) – A list of folder names relative to code_root that will be excluded from packing. Default excludes ‘wandb’, ‘outputs’, and ‘datasets’.

  • use_distributed_env (bool) – Whether to use a distributed environment for the job. Default is False.

  • distributed_env_task (Literal["torch"]) – The type of distributed environment task to use. Currently, only “torch” is supported. Default is “torch”.

  • processes_per_task (int) – The number of processes to run per task. This value is not used by SLURM itself but is needed to set up the distributed environment correctly. Default is 1.

  • distributed_launch_command (str) – The command to launch distributed environment setup, using environment variables like {num_processes}, {num_machines}, {machine_rank}, {main_process_ip}, {main_process_port}. Default is an empty string.

  • extra_params_kwargs (Dict[str, str]) – Additional parameters for the SLURM job as a dictionary of key-value pairs. Default is an empty dictionary.

  • extra_submit_kwargs (Dict[str, str]) – Additional submit parameters for the SLURM job as a dictionary of key-value pairs. Default is an empty dictionary.

  • extra_task_kwargs (Dict[str, str]) – Additional task parameters for the SLURM job as a dictionary of key-value pairs. Default is an empty dictionary.
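The placeholders in distributed_launch_command can be filled with Python's str.format. A minimal sketch, assuming Hugging Face accelerate as the launcher (the template and script name train.py are illustrative, not defaults shipped by nntool):

```python
# Illustrative template using the placeholders documented above; the choice
# of launcher (Hugging Face `accelerate launch`) is an assumption.
template = (
    "accelerate launch"
    " --num_processes {num_processes}"
    " --num_machines {num_machines}"
    " --machine_rank {machine_rank}"
    " --main_process_ip {main_process_ip}"
    " --main_process_port {main_process_port}"
    " train.py"
)

# Fill the placeholders with the values exported after setup.
cmd = template.format(
    num_processes=8,
    num_machines=2,
    machine_rank=0,
    main_process_ip="10.0.0.1",
    main_process_port=29500,
)
```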

set_output_path(output_parent_path)[source]

Set the output path and date for the slurm job.

Parameters:

output_parent_path (str) – The parent path for the output.

Returns:

The updated SlurmConfig instance.

Return type:

SlurmConfig

nntool.slurm.SlurmArgs[source]

alias of SlurmConfig

class nntool.slurm.SlurmFunction(submit_fn, default_submit_fn_args=None, default_submit_fn_kwargs=None)[source]

The function for a slurm job, which can be used for a distributed or non-distributed job (controlled by use_distributed_env in the slurm dataclass).

afterany(*jobs)[source]

Mark that the function should be executed after any one of the provided slurm jobs has terminated.

Returns:

the new slurm function with the condition

Return type:

SlurmFunction

afternotok(*jobs)[source]

Mark that the function should be executed after any one of the provided slurm jobs has failed.

Returns:

the new slurm function with the condition

Return type:

SlurmFunction

afterok(*jobs)[source]

Mark that the function should be executed after the provided slurm jobs have completed successfully.

Returns:

the new slurm function with the condition

Return type:

SlurmFunction

configure(slurm_config, slurm_params_kwargs=None, slurm_submit_kwargs=None, slurm_task_kwargs=None, system_argv=None, pack_code_include_fn=None, pack_code_exclude_fn=None)[source]

Update the slurm configuration for the slurm function, returning a new configured copy. The function can be used for a distributed or non-distributed job (controlled by use_distributed_env in the slurm dataclass).

Exported Distributed Environment Variables

  • NNTOOL_SLURM_HAS_BEEN_SET_UP is a special environment variable indicating that the Slurm environment has been set up.

  • After setup, the distributed job is launched and the following variables are exported:
    • num_processes: int

    • num_machines: int

    • machine_rank: int

    • main_process_ip: str

    • main_process_port: int

Parameters:
  • slurm_config (SlurmConfig) – SlurmConfig, the slurm configuration dataclass, defaults to None

  • slurm_params_kwargs (Dict[str, str] | None) – extra slurm arguments for the slurm configuration, defaults to {}

  • slurm_submit_kwargs (Dict[str, str] | None) – extra slurm arguments for srun or sbatch, defaults to {}

  • slurm_task_kwargs (Dict[str, str] | None) – extra arguments for the setting of distributed task, defaults to {}

  • system_argv (List[str] | None) – the system arguments for the second launch in the distributed task (by default it will use the current system arguments sys.argv[1:]), defaults to None

Returns:

a new copy with configured slurm parameters

Return type:

SlurmFunction

is_configured()[source]

Whether the slurm function has been configured.

Returns:

True if the slurm function has been configured, False otherwise

Return type:

bool

is_distributed()[source]

Whether the slurm function is distributed.

Returns:

True if the slurm function is distributed, False otherwise

Return type:

bool

map_array(*submit_fn_args, **submit_fn_kwargs)[source]

Run the submit_fn with the given arguments and keyword arguments. The function is non-blocking in slurm mode, while other modes block. If no arguments or keyword arguments are given, the default arguments and keyword arguments are used.

Parameters:
  • submit_fn_args – arguments for the submit_fn

  • submit_fn_kwargs – keyword arguments for the submit_fn

Raises:

Exception – if the submit_fn is not set up

Returns:

Slurm Job or the return value of the submit_fn

Return type:

Job[Any] | List[Job[Any]] | Any

on_condition(jobs, condition='afterok')[source]

Mark that this job should be executed after the provided slurm jobs have been done. Different conditions can be combined by calling this function multiple times.

Parameters:
  • jobs (Job | List[Job] | Tuple[Job]) – dependent jobs

  • condition (Literal['afterany', 'afterok', 'afternotok']) – run condition, defaults to “afterok”

Returns:

the function itself

Return type:

SlurmFunction
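Under the hood, these conditions map onto SLURM's --dependency directive. A minimal sketch of how such a flag can be built (this helper is illustrative, not part of nntool):

```python
# Build an sbatch/srun --dependency flag from a condition and job ids.
# SLURM supports conditions such as afterok, afterany, and afternotok.
def dependency_flag(condition, job_ids):
    allowed = {"afterok", "afterany", "afternotok"}
    if condition not in allowed:
        raise ValueError(f"unsupported condition: {condition}")
    # SLURM's syntax is --dependency=<condition>:<jobid>[:<jobid>...]
    return "--dependency=" + ":".join([condition] + [str(j) for j in job_ids])

flag = dependency_flag("afterok", [12345, 12346])
# → "--dependency=afterok:12345:12346"
```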

submit(*submit_fn_args, **submit_fn_kwargs)[source]

An alias function to __call__.

Parameters:
  • submit_fn_args – arguments for the submit_fn

  • submit_fn_kwargs – keyword arguments for the submit_fn

Raises:

Exception – if the submit_fn is not set up

Returns:

Slurm Job or the return value of the submit_fn

Return type:

Job | Any

nntool.slurm.slurm_fn(submit_fn)[source]

A decorator to wrap a function to be run on slurm. A function decorated by this decorator should be launched in the way shown below. The decorated submit_fn becomes non-blocking. To block and get the return value, call job.result().

Parameters:

submit_fn (Callable) – the function to be run on slurm

Returns:

the function to be run on slurm

Return type:

SlurmFunction

Example

>>> @slurm_fn
... def run_on_slurm(a, b):
...     return a + b
>>> slurm_config = SlurmConfig(
...     mode="slurm",
...     partition="PARTITION",
...     job_name="EXAMPLE",
...     tasks_per_node=1,
...     cpus_per_task=8,
...     mem="1GB",
... )
>>> job = run_on_slurm[slurm_config](1, b=2)
>>> result = job.result()  # block and get the result

nntool.slurm.slurm_function(submit_fn)[source]

A decorator to annotate a function to be run in slurm. A function decorated by this decorator should be launched in the way shown below.

Deprecated:

This function is deprecated and will be removed in future versions. Please use slurm_fn instead.

Example

>>> @slurm_function
... def run_on_slurm(a, b):
...     return a + b
>>> slurm_config = SlurmConfig(
...     mode="slurm",
...     partition="PARTITION",
...     job_name="EXAMPLE",
...     tasks_per_node=1,
...     cpus_per_task=8,
...     mem="1GB",
... )
>>> job = run_on_slurm(slurm_config)(1, b=2)
>>> result = job.result()  # block and get the result

nntool.slurm.slurm_launcher(ArgsType, parser='tyro', slurm_key='slurm', slurm_params_kwargs={}, slurm_submit_kwargs={}, slurm_task_kwargs={}, *extra_args, **extra_kwargs)[source]

A slurm launcher decorator for distributed or non-distributed jobs (controlled by use_distributed_env in the slurm field). This decorator should be used as the program entry point. The decorated function is non-blocking in slurm mode, while other modes block.

Parameters:
  • ArgsType (Type[Any]) – the experiment arguments type, which should be a dataclass (it must have a slurm field named by slurm_key)

  • slurm_key (str) – the key of the slurm field in the ArgsType, defaults to “slurm”

  • parser (str | Callable) – the parser for the arguments, defaults to “tyro”

  • slurm_params_kwargs (dict) – extra slurm arguments for the slurm configuration, defaults to {}

  • slurm_submit_kwargs (dict) – extra slurm arguments for srun or sbatch, defaults to {}

  • slurm_task_kwargs (dict) – extra arguments for the setting of distributed task, defaults to {}

  • extra_args – extra arguments for the parser

  • extra_kwargs – extra keyword arguments for the parser

Returns:

decorator function with main entry

Return type:

Callable[[Callable[[…], Any]], SlurmFunction]

Exported Distributed Environment Variables:
  1. NNTOOL_SLURM_HAS_BEEN_SET_UP is a special environment variable to indicate that the slurm has been set up.

  2. After the set up, the distributed job will be launched and the following variables are exported: num_processes: int, num_machines: int, machine_rank: int, main_process_ip: str, main_process_port: int.
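A usage sketch in the style of the slurm_fn example above; the Args dataclass and its lr field are illustrative, and actually submitting requires a SLURM cluster, so this is not runnable as-is:

```python
from dataclasses import dataclass, field

from nntool.slurm import SlurmConfig, slurm_launcher

@dataclass
class Args:
    # `slurm` matches the default slurm_key; other fields are experiment-specific
    slurm: SlurmConfig = field(default_factory=SlurmConfig)
    lr: float = 1e-3

@slurm_launcher(Args)
def main(args: Args):
    print(args.lr)

if __name__ == "__main__":
    main()
```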

class nntool.slurm.Task(argv, slurm_config, verbose=False)[source]

The base class for all tasks that will be run on Slurm. Especially useful for distributed tasks that need to set up the distributed environment variables.

Parameters:
  • argv (list[str]) – the command line arguments to run the task. This will be passed to the command method to reconstruct the command line.

  • slurm_config (SlurmConfig) – the Slurm configuration to use for the task.

  • verbose (bool, optional) – whether to print verbose output. Defaults to False.

checkpoint()[source]

Return a checkpoint for the task. This is used to save the state of the task.

command()[source]

Return the command to run the task. This method should be implemented by subclasses to return the actual command line to run the task.

Raises:

NotImplementedError – If the method is not implemented by the subclass.

Returns:

the command to run the task.

Return type:

str

log(msg)[source]

Log a message to the console if verbose is enabled.

Parameters:

msg (str) – the message to log.

class nntool.slurm.DistributedTaskConfig(num_processes='$nntool_num_processes', num_machines='$nntool_num_machines', machine_rank='$nntool_machine_rank', main_process_ip='$nntool_main_process_ip', main_process_port='$nntool_main_process_port')[source]

Configuration for distributed tasks. This is used to set up the distributed environment variables for PyTorch distributed training.

Parameters:
  • num_processes (int) – The total number of processes to run across all machines.

  • num_machines (int) – The number of machines to run the task on.

  • machine_rank (int) – The rank of the current machine in the distributed setup.

  • main_process_ip (str) – The IP address of the main process (rank 0) in the distributed setup.

  • main_process_port (int) – The port of the main process (rank 0) in the distributed setup.

export_bash(output_folder)[source]

Export the distributed environment variables to a bash script. This script can be sourced to set the environment variables for the distributed task.

Parameters:

output_folder (str) – the folder to save the bash script to.
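A minimal sketch of what such an export script might contain, assuming the nntool_ variable-name prefix shown in the constructor defaults above (an illustration, not nntool's actual implementation):

```python
import os
import tempfile

# Write the five documented distributed values as a sourceable bash script.
# The `nntool_` prefix mirrors the constructor defaults shown above.
def export_bash_sketch(output_folder, values):
    path = os.path.join(output_folder, "dist_env.sh")
    with open(path, "w") as f:
        for name, value in values.items():
            f.write(f'export nntool_{name}="{value}"\n')
    return path

folder = tempfile.mkdtemp()
script = export_bash_sketch(
    folder,
    {"num_processes": 8, "num_machines": 2, "machine_rank": 0,
     "main_process_ip": "10.0.0.1", "main_process_port": 29500},
)
```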

class nntool.slurm.PyTorchDistributedTask(launch_cmd, argv, slurm_config, verbose=False, **env_setup_kwargs)[source]

A task that runs on Slurm and sets up the PyTorch distributed environment variables. In other modes, the command is run locally.

Parameters:
  • launch_cmd (str) – The command to launch the task.

  • argv (list[str]) – The command line arguments for the task.

  • slurm_config (SlurmConfig) – The Slurm configuration to use for the task.

  • verbose (bool, optional) – whether to print verbose output. Defaults to False.

References

  • https://github.com/huggingface/accelerate/issues/1239

  • https://github.com/yuvalkirstain/PickScore/blob/main/trainer/slurm_scripts/slurm_train.py

  • https://github.com/facebookincubator/submitit/pull/1703

command()[source]

Return the command to run the task.

Returns:

the command to run the task.

Return type:

str

set_up_dist_env()[source]

Set up the distributed environment variables for PyTorch distributed training.
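The values involved are typically derived from variables that SLURM itself exports to each task. A hedged sketch under that assumption (standard SLURM variables such as SLURM_NNODES and SLURM_NODEID; not nntool's actual code):

```python
import os

# Derive the documented distributed values from standard SLURM environment
# variables. This mirrors the idea behind the setup, not nntool's
# actual implementation.
def derive_dist_env(processes_per_task=1, main_port=29500):
    num_machines = int(os.environ.get("SLURM_NNODES", "1"))
    machine_rank = int(os.environ.get("SLURM_NODEID", "0"))
    return {
        "num_processes": processes_per_task * num_machines,
        "num_machines": num_machines,
        "machine_rank": machine_rank,
        "main_process_port": main_port,
    }
```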