nntool.slurm.function

Classes

SlurmFunction(submit_fn[, ...])

The function for a slurm job, which can be used for distributed or non-distributed jobs (controlled by use_distributed_env in the slurm dataclass).

class nntool.slurm.function.SlurmFunction(submit_fn, default_submit_fn_args=None, default_submit_fn_kwargs=None)[source]

The function for a slurm job, which can be used for distributed or non-distributed jobs (controlled by use_distributed_env in the slurm dataclass).

is_configured()[source]

Whether the slurm function has been configured.

Returns:

True if the slurm function has been configured, False otherwise

Return type:

bool

is_distributed()[source]

Whether the slurm function is distributed.

Returns:

True if the slurm function is distributed, False otherwise

Return type:

bool

configure(slurm_config, slurm_params_kwargs=None, slurm_submit_kwargs=None, slurm_task_kwargs=None, system_argv=None, pack_code_include_fn=None, pack_code_exclude_fn=None)[source]

Update the slurm configuration for the slurm function. The configured function can be used for distributed or non-distributed jobs (controlled by use_distributed_env in the slurm dataclass).

Exported Distributed Environment Variables

  • NNTOOL_SLURM_HAS_BEEN_SET_UP is a special environment variable that indicates slurm has been set up.

  • After the setup, the distributed job is launched and the following variables are exported:
    • num_processes: int

    • num_machines: int

    • machine_rank: int

    • main_process_ip: str

    • main_process_port: int
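As an illustration of how a launched task might consume these values, the sketch below parses them from an environment mapping. The variable names are copied verbatim from the list above; the exact names and casing that nntool exports may differ, so treat them as an assumption. The helpers `slurm_has_been_set_up` and `read_distributed_env` and the `DistributedEnv` dataclass are hypothetical and not part of nntool.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistributedEnv:
    """Typed view of the exported values (hypothetical helper, not part of nntool)."""
    num_processes: int
    num_machines: int
    machine_rank: int
    main_process_ip: str
    main_process_port: int


def slurm_has_been_set_up(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: the presence of the variable is the signal; its value is not inspected.
    return "NNTOOL_SLURM_HAS_BEEN_SET_UP" in env


def read_distributed_env(env: Mapping[str, str] = os.environ) -> DistributedEnv:
    # Assumption: the variables are exported under exactly the names listed above.
    return DistributedEnv(
        num_processes=int(env["num_processes"]),
        num_machines=int(env["num_machines"]),
        machine_rank=int(env["machine_rank"]),
        main_process_ip=env["main_process_ip"],
        main_process_port=int(env["main_process_port"]),
    )


# Hand-written mapping standing in for a real slurm allocation.
fake_env = {
    "NNTOOL_SLURM_HAS_BEEN_SET_UP": "1",
    "num_processes": "8",
    "num_machines": "2",
    "machine_rank": "0",
    "main_process_ip": "10.0.0.1",
    "main_process_port": "29500",
}

if slurm_has_been_set_up(fake_env):
    dist = read_distributed_env(fake_env)
    print(dist)
```

Passing the mapping explicitly keeps the parsing testable outside a cluster; inside a job, calling the helpers without arguments would read from os.environ.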

Parameters:
  • slurm_config (SlurmConfig) – the slurm configuration dataclass

  • slurm_params_kwargs (Dict[str, str] | None) – extra slurm arguments for the slurm configuration, defaults to {}

  • slurm_submit_kwargs (Dict[str, str] | None) – extra slurm arguments for srun or sbatch, defaults to {}

  • slurm_task_kwargs (Dict[str, str] | None) – extra arguments for the setting of distributed task, defaults to {}

  • system_argv (List[str] | None) – the system arguments for the second launch in the distributed task (by default it will use the current system arguments sys.argv[1:]), defaults to None

Returns:

a new copy with configured slurm parameters

Return type:

SlurmFunction
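Because configure returns a new copy rather than modifying the object in place, the original function stays unconfigured. The sketch below shows one way such copy-on-configure behavior could look, using a stand-in class `SketchFunction` that is hypothetical and is not nntool's implementation. The plain dict passed as the configuration and its `partition` key are placeholders for a real SlurmConfig.

```python
import copy
from typing import Any, Callable, Dict, Optional


class SketchFunction:
    """Stand-in for illustration only; not nntool's SlurmFunction."""

    def __init__(self, submit_fn: Callable[..., Any]) -> None:
        self.submit_fn = submit_fn
        self.slurm_config: Optional[Any] = None
        self.slurm_params_kwargs: Dict[str, str] = {}

    def is_configured(self) -> bool:
        return self.slurm_config is not None

    def configure(
        self,
        slurm_config: Any,
        slurm_params_kwargs: Optional[Dict[str, str]] = None,
    ) -> "SketchFunction":
        # Work on a shallow copy so the original object stays unconfigured.
        configured = copy.copy(self)
        configured.slurm_config = slurm_config
        # A None default is replaced by a fresh dict to avoid shared mutable state.
        configured.slurm_params_kwargs = dict(slurm_params_kwargs or {})
        return configured


base = SketchFunction(print)
# Placeholder configuration; a real call would pass a SlurmConfig instance.
configured = base.configure({"partition": "gpu"}, {"mem": "16G"})

print(base.is_configured())        # False
print(configured.is_configured())  # True
```

With this pattern, one base function can be configured several times with different settings without the variants affecting each other.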

submit(*submit_fn_args, **submit_fn_kwargs)[source]

An alias function to __call__.

Parameters:
  • submit_fn_args – arguments for the submit_fn

  • submit_fn_kwargs – keyword arguments for the submit_fn

Raises:

Exception – if the submit_fn is not set up

Returns:

Slurm Job or the return value of the submit_fn

Return type:

Job | Any

map_array(*submit_fn_args, **submit_fn_kwargs)[source]

Run the submit_fn with the given arguments and keyword arguments. The call is non-blocking in slurm mode, while other modes block. If no arguments or keyword arguments are given, the default arguments and keyword arguments are used.

Parameters:
  • submit_fn_args – arguments for the submit_fn

  • submit_fn_kwargs – keyword arguments for the submit_fn

Raises:

Exception – if the submit_fn is not set up

Returns:

Slurm Job or the return value of the submit_fn

Return type:

Job[Any] | List[Job[Any]] | Any

on_condition(jobs, condition='afterok')[source]

Mark this job to be executed after the provided slurm jobs have finished. Different conditions can be combined by calling this function multiple times.

Parameters:
  • jobs (Job | List[Job] | Tuple[Job]) – dependent jobs

  • condition (Literal['afterany', 'afterok', 'afternotok']) – run condition, defaults to “afterok”

Returns:

the function itself

Return type:

SlurmFunction
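The three condition names match the dependency types accepted by sbatch's --dependency option, where each entry has the form type:job_id[:job_id] and entries joined by commas must all be satisfied. The sketch below shows how combined conditions could be rendered into such a string. The helper `build_dependency` is hypothetical, job IDs are plain integers here, and whether nntool builds its dependency string this way is an assumption.

```python
from typing import Dict, List, Sequence

VALID_CONDITIONS = ("afterany", "afterok", "afternotok")


def build_dependency(conditions: Dict[str, Sequence[int]]) -> str:
    """Join {condition: job ids} into a value for sbatch's --dependency option."""
    parts: List[str] = []
    for condition, job_ids in conditions.items():
        if condition not in VALID_CONDITIONS:
            raise ValueError(f"unsupported condition: {condition!r}")
        if not job_ids:
            # Skip conditions that have no jobs attached.
            continue
        parts.append(condition + ":" + ":".join(str(j) for j in job_ids))
    # A comma between entries means every listed dependency must be satisfied.
    return ",".join(parts)


dependency = build_dependency({"afterok": [101, 102], "afterany": [103]})
print(dependency)  # afterok:101:102,afterany:103
```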

afterok(*jobs)[source]

Mark the function to be executed after the provided slurm jobs have completed successfully.

Returns:

the new slurm function with the condition

Return type:

SlurmFunction

afterany(*jobs)[source]

Mark the function to be executed after the provided slurm jobs have terminated, regardless of their exit status.

Returns:

the new slurm function with the condition

Return type:

SlurmFunction

afternotok(*jobs)[source]

Mark the function to be executed after the provided slurm jobs have terminated in a failed state.

Returns:

the new slurm function with the condition

Return type:

SlurmFunction
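The three shortcut methods read naturally as thin wrappers around on_condition, and they can be chained to combine conditions. The sketch below illustrates that pattern with a stand-in class `ChainSketch`, which is hypothetical and is not nntool's implementation. Jobs are represented by integer IDs, and returning a new copy from every call is an assumption made for the sketch.

```python
import copy
from typing import List, Tuple


class ChainSketch:
    """Stand-in for illustration only; not nntool's SlurmFunction."""

    def __init__(self) -> None:
        self.dependencies: List[Tuple[str, Tuple[int, ...]]] = []

    def on_condition(self, jobs, condition: str = "afterok") -> "ChainSketch":
        # Accept a single job or a list/tuple of jobs.
        if not isinstance(jobs, (list, tuple)):
            jobs = (jobs,)
        new = copy.copy(self)
        # Build a new list so the original object's dependencies are untouched.
        new.dependencies = self.dependencies + [(condition, tuple(jobs))]
        return new

    def afterok(self, *jobs: int) -> "ChainSketch":
        return self.on_condition(jobs, "afterok")

    def afterany(self, *jobs: int) -> "ChainSketch":
        return self.on_condition(jobs, "afterany")

    def afternotok(self, *jobs: int) -> "ChainSketch":
        return self.on_condition(jobs, "afternotok")


start = ChainSketch()
chained = start.afterok(1, 2).afternotok(3)
print(chained.dependencies)  # [('afterok', (1, 2)), ('afternotok', (3,))]
```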