DeepSpeed Autotune: User Guide#
Important
Deprecation Notice: DeepSpeed Autotune is deprecated and will be removed in an upcoming release. For information on supported search methods, please visit Search Methods.
Getting the most out of DeepSpeed (DS) requires aligning the many DS parameters with the specific
properties of your hardware and model. Determined AI’s DeepSpeed Autotune (dsat) helps you
optimize these settings through an easy-to-use API that requires very few changes to your user
code, as described in the remainder of this user guide. dsat can be used with
DeepSpeedTrial, Core API, and
Hugging Face Trainer.
How it Works#
You do not need to create a special configuration file to use dsat. Assuming you already have
working DeepSpeed code, autotuning is as easy as inserting one or two helper functions into
your code and modifying the launch command.
For instance, let’s say your directory contains DeepSpeed code and a corresponding single trial
experiment configuration file deepspeed.yaml. Then, after inserting a line or two of
dsat-specific code per the instructions in the following sections, launching the dsat
experiments is as easy as replacing the usual experiment-launching command:
det experiment create deepspeed.yaml .
with:
python3 -m determined.pytorch.dsat asha deepspeed.yaml .
The above uses Determined AI’s DeepSpeed Autotune with the asha algorithm, one of three
available search methods:
- asha: Adaptively searches over randomly selected DeepSpeed configurations, allocating more compute resources to well-performing configurations. See this introduction to ASHA for more details.
- binary: Performs a simple binary search over the batch size for randomly generated DS configurations.
- random: Conducts a search over random DeepSpeed configurations with an aggressive early-stopping criterion based on domain knowledge of DeepSpeed and the search history.
DeepSpeed Autotune is built on top of Custom Searcher (see Custom Search Methods) which starts up two separate experiments:
- single Search Runner Experiment: This experiment coordinates and schedules the trials that run the model code.
- custom Experiment: This experiment contains the trials referenced above, whose results are reported back to the search runner.
Initially, a profiling trial is created to gather information regarding the model and computational
resources. The search runner experiment takes this initial profiling information and creates a
series of trials to search for the DS settings that optimize FLOPS_per_gpu, throughput
(samples/second), or latency timings. The results of all such trials can be viewed in the
custom experiment above. The search is informed both by the initial profiling trial and the
results of each subsequent trial, all of whose results are fed back to the search runner.
Warning
Determined’s DeepSpeed Autotune is not compatible with pipeline or model parallelism. The
to-be-trained model must be a DeepSpeedEngine instance (not a PipelineEngine instance).
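If you are unsure whether your setup qualifies, a quick check along these lines can help. This is only a sketch and assumes model_engine is the object returned by deepspeed.initialize in your own code:

from deepspeed.runtime.pipe.engine import PipelineEngine

# model_engine is assumed to come from deepspeed.initialize elsewhere in your code.
assert not isinstance(model_engine, PipelineEngine), (
    "DeepSpeed Autotune does not support pipeline parallelism"
)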
User Code Changes#
To use dsat with DeepSpeedTrial, Core API, and Hugging
Face Trainer, certain changes must be made to your user code. The following sections describe
each use case and the changes it requires.
DeepSpeedTrial#
To use Determined’s DeepSpeed Autotune with DeepSpeedTrial, you must meet the following
requirements.
First, it is assumed that a base DeepSpeed configuration exists in a file (written following the
DeepSpeed documentation here). We then require that
your Determined yaml configuration points to the location of that file through a
deepspeed_config key in its hyperparameters section. For example, if your default DeepSpeed
configuration is stored in ds_config.json at the top-level of your model directory, your
hyperparameters section should include:
hyperparameters:
  deepspeed_config: ds_config.json
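If you do not already have a base DeepSpeed configuration file, the following is a minimal, hedged sketch of one way to produce a ds_config.json for dsat to start from. The field values here are illustrative assumptions, not recommendations:

import json

# Illustrative starting values only; any valid DeepSpeed configuration works here.
base_ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}

with open("ds_config.json", "w") as f:
    json.dump(base_ds_config, f, indent=2)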
Second, your DeepSpeedTrial code must use our
get_ds_config_from_hparams() helper function to get the DeepSpeed
configuration dictionary which is generated by DeepSpeed Autotune for each trial. These dictionaries
are generated by overwriting certain fields in the base DeepSpeed configuration referenced in the
step above. The returned dictionary can then be passed to deepspeed.initialize as usual:
import deepspeed

from determined.pytorch.deepspeed import DeepSpeedTrial, DeepSpeedTrialContext
from determined.pytorch import dsat


class MyDeepSpeedTrial(DeepSpeedTrial):
    def __init__(self, context: DeepSpeedTrialContext) -> None:
        self.context = context
        self.hparams = self.context.get_hparams()
        config = dsat.get_ds_config_from_hparams(self.hparams)
        model = ...
        model_parameters = ...
        model_engine, optimizer, train_loader, lr_scheduler = deepspeed.initialize(
            model=model, model_parameters=model_parameters, config=config
        )
Using Determined’s DeepSpeed Autotune with a DeepSpeedTrial
instance requires no further changes to your code.
For a complete example of how to use DeepSpeed Autotune with DeepSpeedTrial, visit the
Determined GitHub Repo
and navigate to examples/deepspeed_autotune/torchvision/deepspeed_trial.
Note
To find out more about DeepSpeedTrial, visit Usage Guide.
Core API#
When using DeepSpeed Autotune with a Core API experiment, there is one additional change to make beyond the steps described in the DeepSpeedTrial section above.
The forward, backward, and step methods of the DeepSpeedEngine class need to be
wrapped in the dsat_reporting_context() context manager. This
addition ensures that the autotuning metrics from each trial are captured and reported back to the
Determined master.
Here is an example sketch of dsat code with Core API:
for op in core_context.searcher.operations():
    for (inputs, labels) in trainloader:
        with dsat.dsat_reporting_context(core_context, op):  # <-- The new code
            outputs = model_engine(inputs)
            loss = criterion(outputs, labels)
            model_engine.backward(loss)
            model_engine.step()
In this code snippet, core_context is the Context instance which was
initialized with determined.core.init(). The context manager requires access to both
core_context and the current SearcherOperation instance (op) to
appropriately report results. Outside of a dsat context, dsat_reporting_context is a no-op,
so there is no need to remove the context manager after the dsat trials have completed.
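To show where core_context and op come from, here is a hedged sketch of a complete Core API script that combines the deepspeed_config handling from the DeepSpeedTrial section with the reporting context above. The model, data loader, and loss are placeholders, and reading hyperparameters via det.get_cluster_info() is one common pattern rather than a requirement:

import deepspeed
import determined as det
from determined.pytorch import dsat


def train(core_context, hparams):
    config = dsat.get_ds_config_from_hparams(hparams)
    model = ...  # assumed user-defined model
    model_parameters = ...  # assumed parameters passed to the optimizer
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model_parameters, config=config
    )
    criterion = ...  # assumed loss function
    trainloader = ...  # assumed data loader
    for op in core_context.searcher.operations():
        for inputs, labels in trainloader:
            with dsat.dsat_reporting_context(core_context, op):
                outputs = model_engine(inputs)
                loss = criterion(outputs, labels)
                model_engine.backward(loss)
                model_engine.step()


if __name__ == "__main__":
    info = det.get_cluster_info()
    hparams = info.trial.hparams
    with det.core.init() as core_context:
        train(core_context, hparams)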
For a complete example of how to use DeepSpeed Autotune with Core API, visit the Determined GitHub
Repo
and navigate to examples/deepspeed_autotune/torchvision/core_api.
Hugging Face Trainer#
You can also use Determined’s DeepSpeed Autotune with the Hugging Face (HF) Trainer and Determined’s
DetCallback callback object to optimize your DeepSpeed parameters.
As in the previous sections, you need to add a deepspeed_config field to the
hyperparameters section of your experiment configuration file, specifying the relative path to
the DS JSON config file.
Reporting results back to the Determined master requires both the dsat.dsat_reporting_context
context manager and DetCallback.
Furthermore, since dsat performs a search over different batch sizes and Hugging Face expects
parameters to be specified as command-line arguments, an additional helper function,
get_hf_args_with_overwrites(), is needed to create consistent Hugging
Face arguments.
Here is an example code snippet from a Hugging Face Trainer script that contains key pieces of relevant code:
import sys

import determined as det
from determined.pytorch import dsat
from determined.transformers import DetCallback
from transformers import HfArgumentParser, Trainer, TrainingArguments

hparams = det.get_cluster_info().trial.hparams
parser = HfArgumentParser(TrainingArguments)
args = sys.argv[1:]
args = dsat.get_hf_args_with_overwrites(args, hparams)
(training_args,) = parser.parse_args_into_dataclasses(args, look_for_args_file=False)

det_callback = DetCallback(core_context, ...)
trainer = Trainer(..., args=training_args)

with dsat.dsat_reporting_context(core_context, op=det_callback.current_op):
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
Important
- The dsat_reporting_context context manager shares the same initial SearcherOperation as the DetCallback instance through its op=det_callback.current_op argument.
- The entire train method of the Hugging Face trainer is wrapped in the dsat_reporting_context context manager.
To find examples that use DeepSpeed Autotune with Hugging Face Trainer, visit the Determined GitHub
Repo and navigate
to examples/hf_trainer_api.
Advanced Options#
The command-line entrypoint to dsat has various available options, some of them
search-algorithm-specific. All available options for any given search method can be found through
the command:
python3 -m determined.pytorch.dsat asha --help
and similar for the binary and random search methods.
Flags that are particularly important are detailed below.
General Options#
The following options are available for every search method.
- --max-trials: The maximum number of trials to run. Default: 64.
- --max-concurrent-trials: The maximum number of trials that can run concurrently. Default: 16.
- --max-slots: The maximum number of slots that can be used concurrently. Defaults to None, i.e., there is no limit by default.
- --metric: The metric to be optimized. Defaults to FLOPS-per-gpu. Other available options are throughput, forward, backward, and latency.
- --run-full-experiment: If specified, after the dsat experiment has completed, a single experiment will be launched using the specifications in deepspeed.yaml, overwritten with the best-found DS configuration parameters.
- --zero-stages: This flag allows the user to limit the search to a subset of the stages by providing a space-separated list, as in --zero-stages 2 3. Default: 1 2 3.
asha Options#
The asha search algorithm randomly generates various DeepSpeed configurations and attempts to
tune the batch size for each configuration through a binary search. asha adaptively allocates
resources to explore each configuration (providing more resources to promising lineages) where the
resource is the number of steps taken in each binary search (i.e., the number of trials).
asha can be configured with the following flags:
- --max-rungs: The maximum total number of rungs to use in the ASHA algorithm. Larger values allow for longer binary searches. Default: 5.
- --min-binary-search-trials: The minimum number of trials to use for each binary search. The r parameter in the ASHA paper. Default: 3.
- --divisor: Factor controlling the increased computational allotment across rungs, and the decrease in their population size. The eta parameter in the ASHA paper. Default: 2.
- --search-range-factor: The inclusive, initial hi bound on the binary search is set by an approximate computation (the lo bound is always initialized to 1). This parameter adjusts the hi bound by a factor of search-range-factor. Default: 1.0.
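As a rough illustration of how these flags interact, assuming ASHA's usual geometric schedule in which each successive rung multiplies the per-configuration budget by the divisor, the defaults above imply the following binary-search budgets per rung:

# A sketch assuming the standard ASHA budget schedule r * eta**rung.
min_binary_search_trials = 3  # r
divisor = 2  # eta
max_rungs = 5
budgets = [min_binary_search_trials * divisor**rung for rung in range(max_rungs)]
print(budgets)  # [3, 6, 12, 24, 48]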
binary Options#
The binary search algorithm performs a straightforward search over the batch size for a
collection of randomly-drawn DS configurations. A single option is available for this search:
--search-range-factor, which plays precisely the same role as in the asha Options section
above.
random Options#
The random search algorithm performs a search over randomly drawn DS configurations and uses a
semi-random search over the batch size.
random can be configured with the following flags:
- --trials-per-random-config: The maximum number of batch-size trials that will be tested for a given DS configuration. Default: 5.
- --early-stopping: If provided, the experiment will terminate if a new best configuration has not been found in the last early-stopping trials. Default: None, corresponding to no early stopping.