nntool.slurm.task
Classes
| DistributedTaskConfig | Configuration for distributed tasks. |
| PyTorchDistributedTask | A task that runs on Slurm and sets up the PyTorch distributed environment variables. |
| Task | The base class for all tasks that will be run on Slurm. |
- class nntool.slurm.task.Task(argv, slurm_config, verbose=False)
The base class for all tasks that will be run on Slurm. It is especially useful for distributed tasks that need to set up the distributed environment variables.
- Parameters:
argv (list[str]) – The command line arguments to run the task. These are passed to the command method to reconstruct the command line.
slurm_config (SlurmConfig) – The Slurm configuration to use for the task.
verbose (bool, optional) – Whether to print verbose output. Defaults to False.
- log(msg)
Log a message to the console if verbose is enabled.
- Parameters:
msg (str) – The message to log.
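The behaviour described for Task can be illustrated with a small stand-in class. This is only a sketch under stated assumptions: `SketchTask` is a hypothetical name, it does not model SlurmConfig, and the way its `command` method rebuilds the command line (via `shlex.join`) is a guess at what "reconstruct the command line" might mean, not nntool's implementation.

```python
import shlex


class SketchTask:
    """Hypothetical stand-in for a Task-like class (not nntool's code)."""

    def __init__(self, argv, slurm_config=None, verbose=False):
        self.argv = list(argv)
        # SlurmConfig is not modelled here; the value is stored untouched.
        self.slurm_config = slurm_config
        self.verbose = verbose

    def log(self, msg):
        # Print only when verbose output was requested.
        if self.verbose:
            print(msg)

    def command(self):
        # Assumption: rebuild a shell-safe command line from the stored argv.
        return shlex.join(self.argv)


task = SketchTask(["python", "train.py", "--name", "my run"], verbose=True)
task.log("submitting: " + task.command())
```

With `verbose=False` the `log` call produces no output, which matches the documented meaning of the flag.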
- class nntool.slurm.task.DistributedTaskConfig(num_processes='$nntool_num_processes', num_machines='$nntool_num_machines', machine_rank='$nntool_machine_rank', main_process_ip='$nntool_main_process_ip', main_process_port='$nntool_main_process_port')
Configuration for distributed tasks. This is used to set up the distributed environment variables for PyTorch distributed training. Note that the defaults in the signature are placeholder strings (such as '$nntool_num_processes') rather than values of the listed types; presumably they are replaced with concrete values when the task is launched.
- Parameters:
num_processes (int) – The total number of processes to run across all machines.
num_machines (int) – The number of machines to run the task on.
machine_rank (int) – The rank of the current machine in the distributed setup.
main_process_ip (str) – The IP address of the main process (rank 0) in the distributed setup.
main_process_port (int) – The port of the main process (rank 0) in the distributed setup.
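The `$nntool_...` defaults in the signature look like template placeholders. The sketch below shows one way such strings could be filled with concrete values, using the standard library's `string.Template`. How nntool actually resolves them is not stated here, so `PLACEHOLDERS` and `fill_placeholders` are hypothetical names used only for illustration.

```python
from string import Template

# Placeholder defaults copied from the documented signature.
PLACEHOLDERS = {
    "num_processes": "$nntool_num_processes",
    "num_machines": "$nntool_num_machines",
    "machine_rank": "$nntool_machine_rank",
    "main_process_ip": "$nntool_main_process_ip",
    "main_process_port": "$nntool_main_process_port",
}


def fill_placeholders(values):
    """Substitute concrete values into the placeholder strings.

    Assumption: each field `name` is filled from a variable `nntool_<name>`.
    A missing value raises KeyError, as Template.substitute does.
    """
    mapping = {f"nntool_{name}": str(value) for name, value in values.items()}
    return {
        field: Template(text).substitute(mapping)
        for field, text in PLACEHOLDERS.items()
    }


filled = fill_placeholders(
    {
        "num_processes": 8,
        "num_machines": 2,
        "machine_rank": 0,
        "main_process_ip": "10.0.0.1",
        "main_process_port": 29500,
    }
)
print(filled["num_processes"], filled["main_process_ip"])
```

The substituted values are strings, so a caller would still need to convert the numeric fields to int where the documented types require it.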
- class nntool.slurm.task.PyTorchDistributedTask(launch_cmd, argv, slurm_config, verbose=False, **env_setup_kwargs)
A task that runs on Slurm and sets up the PyTorch distributed environment variables. In other modes, it runs the command locally.
- Parameters:
launch_cmd (str) – The command to launch the task.
argv (list[str]) – The command line arguments for the task.
slurm_config (SlurmConfig) – The Slurm configuration to use for the task.
verbose (bool, optional) – Whether to print verbose output. Defaults to False.
References
https://github.com/huggingface/accelerate/issues/1239
https://github.com/yuvalkirstain/PickScore/blob/main/trainer/slurm_scripts/slurm_train.py
https://github.com/facebookincubator/submitit/pull/1703
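As a rough illustration of the environment setup that PyTorchDistributedTask is described as performing, the sketch below derives the five DistributedTaskConfig fields from Slurm's `SLURM_NNODES` and `SLURM_NODEID` variables and maps them onto the `MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` variables that PyTorch's environment-based initialization reads. The function names, the `procs_per_node` and `main_host` arguments, and the port 29500 are assumptions for this example; which variables nntool actually reads or exports is not stated in this page.

```python
def derive_distributed_values(env, procs_per_node, main_host, main_port=29500):
    """Map Slurm job variables onto the five DistributedTaskConfig fields.

    `env` is any mapping of environment variables (for example os.environ).
    `main_host` must be supplied by the caller; resolving the first node of
    the allocation is outside the scope of this sketch.
    """
    num_machines = int(env["SLURM_NNODES"])  # nodes in the allocation
    machine_rank = int(env["SLURM_NODEID"])  # relative index of this node
    return {
        "num_processes": num_machines * procs_per_node,
        "num_machines": num_machines,
        "machine_rank": machine_rank,
        "main_process_ip": main_host,
        "main_process_port": main_port,
    }


def as_master_env(values):
    """Express the rendezvous settings as PyTorch-style environment variables."""
    return {
        "MASTER_ADDR": values["main_process_ip"],
        "MASTER_PORT": str(values["main_process_port"]),
        "WORLD_SIZE": str(values["num_processes"]),
    }


# Example with a fake two-node job, four processes per node.
fake_env = {"SLURM_NNODES": "2", "SLURM_NODEID": "1"}
values = derive_distributed_values(fake_env, procs_per_node=4, main_host="node001")
print(values)
print(as_master_env(values))
```

Passing the environment as a plain mapping keeps the sketch testable outside a Slurm job; inside a job, `os.environ` could be passed instead.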