nntool.slurm

Classes

SlurmConfig

Configuration class for SLURM job submission and execution.

SlurmArgs

alias of SlurmConfig

SlurmFunction

The function for a slurm job, which can be used for a distributed or non-distributed job (controlled by use_distributed_env in the slurm dataclass).

Task

The base class for all tasks that will be run on Slurm.

DistributedTaskConfig

Configuration for distributed tasks.

PyTorchDistributedTask

A task that runs on Slurm and sets up the PyTorch distributed environment variables.

Functions

slurm_fn

A decorator to wrap a function to be run on slurm.

slurm_function

A decorator to annotate a function to be run in slurm.

slurm_launcher

A slurm launcher decorator for distributed or non-distributed jobs (controlled by use_distributed_env in the slurm field).

Descriptions

class nntool.slurm.SlurmConfig(mode='run', job_name='Job', partition='', output_parent_path='./', output_folder='slurm', node_list='', node_list_exclude='', num_of_node=1, tasks_per_node=1, gpus_per_task=0, cpus_per_task=1, gpus_per_node=None, mem='', timeout_min=9223372036854775807, stderr_to_stdout=False, setup=<factory>, pack_code=False, use_packed_code=False, code_root='.', code_file_suffixes=<factory>, exclude_code_folders=<factory>, use_distributed_env=False, distributed_env_task='torch', processes_per_task=1, distributed_launch_command='', extra_params_kwargs=<factory>, extra_submit_kwargs=<factory>, extra_task_kwargs=<factory>)[source]

Configuration class for SLURM job submission and execution.

Parameters:
  • mode (Literal["run", "debug", "local", "slurm"]) – Running mode for the job. Options include: “run” (default, directly run the function), “debug” (run under the debugger, dropping into pdb if a breakpoint is reached), “local” (run the job locally via subprocess, without GPU allocation, so CUDA_VISIBLE_DEVICES cannot be set), or “slurm” (run the job on a SLURM cluster).

  • job_name (str) – The name of the SLURM job. Default is ‘Job’.

  • partition (str) – The name of the SLURM partition to use. Default is ‘’.

  • output_parent_path (str) – The parent directory path for saving the slurm folder. Default is ‘./’.

  • output_folder (str) – The folder name where SLURM output files will be stored. Default is ‘slurm’.

  • node_list (str) – A string specifying the nodes to use. Leave blank to use all available nodes. Default is an empty string.

  • node_list_exclude (str) – A string specifying the nodes to exclude. Leave blank to use all nodes in the node list. Default is an empty string.

  • num_of_node (int) – The number of nodes to request. Default is 1.

  • tasks_per_node (int) – The number of tasks to run per node. Default is 1.

  • gpus_per_task (int) – The number of GPUs to request per task. Default is 0.

  • cpus_per_task (int) – The number of CPUs to request per task. Default is 1.

  • gpus_per_node (int) – The number of GPUs to request per node. If this is set, gpus_per_task will be ignored. Default is None.

  • mem (str) – The amount of memory (GB) to request. Leave blank to use the default memory configuration of the node. Default is an empty string.

  • timeout_min (int) – The time limit for the job in minutes. Default is sys.maxsize for effectively no limit.

  • stderr_to_stdout (bool) – Whether to redirect stderr to stdout. Default is False.

  • setup (List[str]) – A list of environment variable setup commands. Default is an empty list.

  • pack_code (bool) – Whether to pack the codebase before submission. Default is False.

  • use_packed_code (bool) – Whether to use the packed code for execution. Default is False.

  • code_root (str) – The root directory of the codebase, which will be used by the code packing. Default is the current directory (.).

  • code_file_suffixes (List[str]) – A list of file extensions for code files to be included when packing. Default includes .py, .sh, .yaml, and .toml.

  • exclude_code_folders (List[str]) – A list of folder names relative to code_root that will be excluded from packing. Default excludes ‘wandb’, ‘outputs’, and ‘datasets’.

  • use_distributed_env (bool) – Whether to use a distributed environment for the job. Default is False.

  • distributed_env_task (Literal["torch"]) – The type of distributed environment task to use. Currently, only “torch” is supported. Default is “torch”.

  • processes_per_task (int) – The number of processes to run per task. This value is not used by SLURM itself but is needed to set up the distributed environment correctly. Default is 1.

  • distributed_launch_command (str) – The command to launch distributed environment setup, using environment variables like {num_processes}, {num_machines}, {machine_rank}, {main_process_ip}, {main_process_port}. Default is an empty string.

  • extra_params_kwargs (Dict[str, str]) – Additional parameters for the SLURM job as a dictionary of key-value pairs. Default is an empty dictionary.

  • extra_submit_kwargs (Dict[str, str]) – Additional submit parameters for the SLURM job as a dictionary of key-value pairs. Default is an empty dictionary.

  • extra_task_kwargs (Dict[str, str]) – Additional task parameters for the SLURM job as a dictionary of key-value pairs. Default is an empty dictionary.
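The placeholders in distributed_launch_command can be filled with Python's str.format. A minimal sketch, assuming Hugging Face accelerate as the launcher (the template and script name train.py are illustrative, not defaults shipped by nntool):

```python
# Illustrative template using the placeholders documented above; the choice
# of launcher (Hugging Face `accelerate launch`) is an assumption.
template = (
    "accelerate launch"
    " --num_processes {num_processes}"
    " --num_machines {num_machines}"
    " --machine_rank {machine_rank}"
    " --main_process_ip {main_process_ip}"
    " --main_process_port {main_process_port}"
    " train.py"
)

# Fill the placeholders with the values exported after setup.
cmd = template.format(
    num_processes=8,
    num_machines=2,
    machine_rank=0,
    main_process_ip="10.0.0.1",
    main_process_port=29500,
)
```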

set_output_path(output_parent_path)[source]

Set the output path and date for the slurm job.

Parameters:

output_parent_path (str) – The parent path for the output.

Returns:

The updated SlurmConfig instance.

Return type:

SlurmConfig

nntool.slurm.SlurmArgs[source]

alias of SlurmConfig

class nntool.slurm.SlurmFunction(submit_fn, default_submit_fn_args=None, default_submit_fn_kwargs=None)[source]

The function for a slurm job, which can be used for a distributed or non-distributed job (controlled by use_distributed_env in the slurm dataclass).

afterany(*jobs)[source]

Mark that the function should be executed after any one of the provided slurm jobs has terminated.

Returns:

the new slurm function with the condition

Return type:

SlurmFunction

afternotok(*jobs)[source]

Mark that the function should be executed after any one of the provided slurm jobs has failed.

Returns:

the new slurm function with the condition

Return type:

SlurmFunction

afterok(*jobs)[source]

Mark that the function should be executed after the provided slurm jobs have completed successfully.

Returns:

the new slurm function with the condition

Return type:

SlurmFunction

configure(slurm_config, slurm_params_kwargs=None, slurm_submit_kwargs=None, slurm_task_kwargs=None, system_argv=None, pack_code_include_fn=None, pack_code_exclude_fn=None)[source]

Update the slurm configuration for the slurm function, returning a new configured copy. The function can be used for a distributed or non-distributed job (controlled by use_distributed_env in the slurm dataclass).

Exported Distributed Environment Variables

  • NNTOOL_SLURM_HAS_BEEN_SET_UP is a special environment variable indicating that the Slurm environment has been set up.

  • After setup, the distributed job is launched and the following variables are exported:
    • num_processes: int

    • num_machines: int

    • machine_rank: int

    • main_process_ip: str

    • main_process_port: int

Parameters:
  • slurm_config (SlurmConfig) – SlurmConfig, the slurm configuration dataclass, defaults to None

  • slurm_params_kwargs (Dict[str, str] | None) – extra slurm arguments for the slurm configuration, defaults to {}

  • slurm_submit_kwargs (Dict[str, str] | None) – extra slurm arguments for srun or sbatch, defaults to {}

  • slurm_task_kwargs (Dict[str, str] | None) – extra arguments for the setting of distributed task, defaults to {}

  • system_argv (List[str] | None) – the system arguments for the second launch in the distributed task (by default it will use the current system arguments sys.argv[1:]), defaults to None

Returns:

a new copy with configured slurm parameters

Return type:

SlurmFunction

is_configured()[source]

Whether the slurm function has been configured.

Returns:

True if the slurm function has been configured, False otherwise

Return type:

bool

is_distributed()[source]

Whether the slurm function is distributed.

Returns:

True if the slurm function is distributed, False otherwise

Return type:

bool

map_array(*submit_fn_args, **submit_fn_kwargs)[source]

Run the submit_fn with the given arguments and keyword arguments. The function is non-blocking in slurm mode, while other modes block. If no arguments or keyword arguments are given, the default arguments and keyword arguments are used.

Parameters:
  • submit_fn_args – arguments for the submit_fn

  • submit_fn_kwargs – keyword arguments for the submit_fn

Raises:

Exception – if the submit_fn is not set up

Returns:

Slurm Job or the return value of the submit_fn

Return type:

Job[Any] | List[Job[Any]] | Any

on_condition(jobs, condition='afterok')[source]

Mark that this job should be executed after the provided slurm jobs have been done. Different conditions can be combined by calling this function multiple times.

Parameters:
  • jobs (Job | List[Job] | Tuple[Job]) – dependent jobs

  • condition (Literal['afterany', 'afterok', 'afternotok']) – run condition, defaults to “afterok”

Returns:

the function itself

Return type:

SlurmFunction
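Under the hood, these conditions map onto SLURM's --dependency directive. A minimal sketch of how such a flag can be built (this helper is illustrative, not part of nntool):

```python
# Build an sbatch/srun --dependency flag from a condition and job ids.
# SLURM supports conditions such as afterok, afterany, and afternotok.
def dependency_flag(condition, job_ids):
    allowed = {"afterok", "afterany", "afternotok"}
    if condition not in allowed:
        raise ValueError(f"unsupported condition: {condition}")
    # SLURM's syntax is --dependency=<condition>:<jobid>[:<jobid>...]
    return "--dependency=" + ":".join([condition] + [str(j) for j in job_ids])

flag = dependency_flag("afterok", [12345, 12346])
# → "--dependency=afterok:12345:12346"
```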

submit(*submit_fn_args, **submit_fn_kwargs)[source]

An alias function to __call__.

Parameters:
  • submit_fn_args – arguments for the submit_fn

  • submit_fn_kwargs – keyword arguments for the submit_fn

Raises:

Exception – if the submit_fn is not set up

Returns:

Slurm Job or the return value of the submit_fn

Return type:

Job | Any

nntool.slurm.slurm_fn(submit_fn)[source]

A decorator to wrap a function to be run on slurm. A function decorated by this decorator should be launched in the way shown below. The decorated submit_fn becomes non-blocking. To block and get the return value, call job.result().

Parameters:

submit_fn (Callable) – the function to be run on slurm

Returns:

the function to be run on slurm

Return type:

SlurmFunction

Example

>>> @slurm_fn
... def run_on_slurm(a, b):
...     return a + b
>>> slurm_config = SlurmConfig(
...     mode="slurm",
...     partition="PARTITION",
...     job_name="EXAMPLE",
...     tasks_per_node=1,
...     cpus_per_task=8,
...     mem="1GB",
... )
>>> job = run_on_slurm[slurm_config](1, b=2)
>>> result = job.result()  # block and get the result

nntool.slurm.slurm_function(submit_fn)[source]

A decorator to annotate a function to be run in slurm. A function decorated by this decorator should be launched in the way shown below.

Deprecated:

This function is deprecated and will be removed in future versions. Please use slurm_fn instead.

Example

>>> @slurm_function
... def run_on_slurm(a, b):
...     return a + b
>>> slurm_config = SlurmConfig(
...     mode="slurm",
...     partition="PARTITION",
...     job_name="EXAMPLE",
...     tasks_per_node=1,
...     cpus_per_task=8,
...     mem="1GB",
... )
>>> job = run_on_slurm(slurm_config)(1, b=2)
>>> result = job.result()  # block and get the result

nntool.slurm.slurm_launcher(ArgsType, parser='tyro', slurm_key='slurm', slurm_params_kwargs={}, slurm_submit_kwargs={}, slurm_task_kwargs={}, *extra_args, **extra_kwargs)[source]

A slurm launcher decorator for distributed or non-distributed jobs (controlled by use_distributed_env in the slurm field). This decorator should be used as the program entry point. The decorated function is non-blocking in slurm mode, while other modes block.

Parameters:
  • ArgsType (Type[Any]) – the experiment arguments type, which should be a dataclass (it must have a slurm field named by slurm_key)

  • slurm_key (str) – the key of the slurm field in the ArgsType, defaults to “slurm”

  • parser (str | Callable) – the parser for the arguments, defaults to “tyro”

  • slurm_params_kwargs (dict) – extra slurm arguments for the slurm configuration, defaults to {}

  • slurm_submit_kwargs (dict) – extra slurm arguments for srun or sbatch, defaults to {}

  • slurm_task_kwargs (dict) – extra arguments for the setting of distributed task, defaults to {}

  • extra_args – extra arguments for the parser

  • extra_kwargs – extra keyword arguments for the parser

Returns:

decorator function with main entry

Return type:

Callable[[Callable[[…], Any]], SlurmFunction]

Exported Distributed Environment Variables:
  1. NNTOOL_SLURM_HAS_BEEN_SET_UP is a special environment variable to indicate that the slurm has been set up.

  2. After the set up, the distributed job will be launched and the following variables are exported: num_processes: int, num_machines: int, machine_rank: int, main_process_ip: str, main_process_port: int.
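A usage sketch in the style of the slurm_fn example above; the Args dataclass and its lr field are illustrative, and actually submitting requires a SLURM cluster, so this is not runnable as-is:

```python
from dataclasses import dataclass, field

from nntool.slurm import SlurmConfig, slurm_launcher

@dataclass
class Args:
    # `slurm` matches the default slurm_key; other fields are experiment-specific
    slurm: SlurmConfig = field(default_factory=SlurmConfig)
    lr: float = 1e-3

@slurm_launcher(Args)
def main(args: Args):
    print(args.lr)

if __name__ == "__main__":
    main()
```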

class nntool.slurm.Task(argv, slurm_config, verbose=False)[source]

The base class for all tasks that will be run on Slurm. Especially useful for distributed tasks that need to set up the distributed environment variables.

Parameters:
  • argv (list[str]) – the command line arguments to run the task. This will be passed to the command method to reconstruct the command line.

  • slurm_config (SlurmConfig) – the Slurm configuration to use for the task.

  • verbose (bool, optional) – whether to print verbose output. Defaults to False.

checkpoint()[source]

Return a checkpoint for the task. This is used to save the state of the task.

command()[source]

Return the command to run the task. This method should be implemented by subclasses to return the actual command line to run the task.

Raises:

NotImplementedError – If the method is not implemented by the subclass.

Returns:

the command to run the task.

Return type:

str

log(msg)[source]

Log a message to the console if verbose is enabled.

Parameters:

msg (str) – the message to log.

class nntool.slurm.DistributedTaskConfig(num_processes='$nntool_num_processes', num_machines='$nntool_num_machines', machine_rank='$nntool_machine_rank', main_process_ip='$nntool_main_process_ip', main_process_port='$nntool_main_process_port')[source]

Configuration for distributed tasks. This is used to set up the distributed environment variables for PyTorch distributed training.

Parameters:
  • num_processes (int) – The total number of processes to run across all machines.

  • num_machines (int) – The number of machines to run the task on.

  • machine_rank (int) – The rank of the current machine in the distributed setup.

  • main_process_ip (str) – The IP address of the main process (rank 0) in the distributed setup.

  • main_process_port (int) – The port of the main process (rank 0) in the distributed setup.

export_bash(output_folder)[source]

Export the distributed environment variables to a bash script. This script can be sourced to set the environment variables for the distributed task.

Parameters:

output_folder (str) – the folder to save the bash script to.
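A minimal sketch of what such an export script might contain, assuming the nntool_ variable-name prefix shown in the constructor defaults above (an illustration, not nntool's actual implementation):

```python
import os
import tempfile

# Write the five documented distributed values as a sourceable bash script.
# The `nntool_` prefix mirrors the constructor defaults shown above.
def export_bash_sketch(output_folder, values):
    path = os.path.join(output_folder, "dist_env.sh")
    with open(path, "w") as f:
        for name, value in values.items():
            f.write(f'export nntool_{name}="{value}"\n')
    return path

folder = tempfile.mkdtemp()
script = export_bash_sketch(
    folder,
    {"num_processes": 8, "num_machines": 2, "machine_rank": 0,
     "main_process_ip": "10.0.0.1", "main_process_port": 29500},
)
```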

class nntool.slurm.PyTorchDistributedTask(launch_cmd, argv, slurm_config, verbose=False, **env_setup_kwargs)[source]

A task that runs on Slurm and sets up the PyTorch distributed environment variables. In other modes, the command is run locally.

Parameters:
  • launch_cmd (str) – The command to launch the task.

  • argv (list[str]) – The command line arguments for the task.

  • slurm_config (SlurmConfig) – The Slurm configuration to use for the task.

  • verbose (bool, optional) – whether to print verbose output. Defaults to False.

References

  • https://github.com/huggingface/accelerate/issues/1239

  • https://github.com/yuvalkirstain/PickScore/blob/main/trainer/slurm_scripts/slurm_train.py

  • https://github.com/facebookincubator/submitit/pull/1703

command()[source]

Return the command to run the task.

Returns:

the command to run the task.

Return type:

str

set_up_dist_env()[source]

Set up the distributed environment variables for PyTorch distributed training.
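The values involved are typically derived from variables that SLURM itself exports to each task. A hedged sketch under that assumption (standard SLURM variables such as SLURM_NNODES and SLURM_NODEID; not nntool's actual code):

```python
import os

# Derive the documented distributed values from standard SLURM environment
# variables. This mirrors the idea behind the setup, not nntool's
# actual implementation.
def derive_dist_env(processes_per_task=1, main_port=29500):
    num_machines = int(os.environ.get("SLURM_NNODES", "1"))
    machine_rank = int(os.environ.get("SLURM_NODEID", "0"))
    return {
        "num_processes": processes_per_task * num_machines,
        "num_machines": num_machines,
        "machine_rank": machine_rank,
        "main_process_port": main_port,
    }
```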