Trainers and Training Runners
This page contains the reference documentation for trainers and training runners.
General
These are general interfaces, classes, and utility functions for trainers and training runners:
- Interface for trainers.
- Base class for training runner implementations.
- Top-level configuration structure.
- Model configuration structure.
- Base class for all specific algorithm configurations.
- Base class for model selection strategies.
- Best model selection strategy.
- Abstract interface for policy evaluation.
- Evaluates the given policy using multiple different evaluators (run in sequence).
- Evaluates a given policy by rolling it out and collecting the mean reward.
- Value transformation (e.g. scale reduction).
- Scale reduction value transform according to Pohlen et al. (2018).
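The Pohlen et al. (2018) scale-reduction transform squashes large value targets while remaining invertible in closed form. A minimal sketch (function names and the epsilon constant are illustrative, not necessarily the library's API):

```python
import numpy as np

EPS = 1e-2  # regularization weight used by Pohlen et al. (2018)


def transform_value(x: np.ndarray) -> np.ndarray:
    """h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + EPS * x


def transform_value_inverse(y: np.ndarray) -> np.ndarray:
    """Closed-form inverse of h, so targets can be mapped back to reward scale."""
    return np.sign(y) * (
        ((np.sqrt(1.0 + 4.0 * EPS * (np.abs(y) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2
        - 1.0
    )
```

Because the epsilon term keeps the transform strictly monotonic, the inverse is exact rather than approximate.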
- Converts a support vector to a scalar by probability-weighted interpolation.
- Converts a tensor of scalars into probability support vectors corresponding to the provided range.
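The two support-vector conversions above are inverses of each other: a scalar is split across the two nearest atoms of a fixed support ("two-hot" projection), and a probability vector is collapsed back via its expectation. A sketch under assumed names and a uniformly spaced support (not the library's exact signatures):

```python
import numpy as np


def scalar_to_support(x: np.ndarray, v_min: float, v_max: float, n_atoms: int) -> np.ndarray:
    """Project scalars onto a categorical support by linear interpolation
    between the two nearest atoms."""
    x = np.clip(x, v_min, v_max)
    delta = (v_max - v_min) / (n_atoms - 1)
    b = (x - v_min) / delta                    # fractional atom index
    lo = np.floor(b).astype(int)
    hi = np.minimum(lo + 1, n_atoms - 1)
    probs = np.zeros((x.shape[0], n_atoms))
    rows = np.arange(x.shape[0])
    probs[rows, lo] += 1.0 - (b - lo)          # weight on the lower atom
    probs[rows, hi] += b - lo                  # remaining weight on the upper atom
    return probs


def support_to_scalar(probs: np.ndarray, v_min: float, v_max: float) -> np.ndarray:
    """Probability-weighted interpolation back to a scalar (the expectation)."""
    support = np.linspace(v_min, v_max, probs.shape[-1])
    return probs @ support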
- Abstract interface for all replay buffer implementations.
- Replay buffer for off-policy learning.
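An off-policy replay buffer of this kind is typically a fixed-capacity ring buffer with uniform sampling. A minimal illustrative sketch (class and method names are assumptions, not the library's interface):

```python
import random


class UniformReplayBuffer:
    """Fixed-capacity ring buffer with uniform random sampling."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0  # next slot to overwrite once the buffer is full

    def add(self, transition) -> None:
        """Append a transition, overwriting the oldest one when full."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size: int):
        """Draw a batch of distinct transitions uniformly at random."""
        return random.sample(self.buffer, batch_size)

    def __len__(self) -> int:
        return len(self.buffer)
```

Uniform sampling keeps implementation simple; prioritized variants would replace `sample` with importance-weighted draws.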
Trainers
These are interfaces, classes, and utility functions for the built-in trainers:
Actor-Critics (AC)
- Abstract base class of AC runners.
- Runner for single-threaded training, based on SequentialVectorEnv.
- Runner for locally distributed training, based on SubprocVectorEnv.
- Base class for actor-critic trainers.
- Event interface defining the statistics emitted by the A2CTrainer.
- Advantage Actor-Critic (A2C) trainer.
- Algorithm parameters for the multi-step A2C model.
- Proximal Policy Optimization (PPO) trainer.
- Algorithm parameters for the multi-step PPO model.
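The PPO trainer's core update maximizes the clipped surrogate objective, which bounds how far the new policy can move from the one that collected the data. A sketch under assumed names (the clipping threshold 0.2 is the commonly used default, not necessarily this library's):

```python
import numpy as np


def ppo_clipped_objective(log_probs_new: np.ndarray,
                          log_probs_old: np.ndarray,
                          advantages: np.ndarray,
                          clip_eps: float = 0.2) -> float:
    """Clipped surrogate: mean over min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = np.exp(log_probs_new - log_probs_old)       # importance ratio r
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```

The `min` makes the bound pessimistic: a step only helps the objective while the ratio stays inside the clip range.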
- Multi-step Advantage Actor-Critic.
- Algorithm parameters for IMPALA.
- Events specific to the IMPALA algorithm, recorded to analyse its behaviour in more detail.
- Common superclass for IMPALA runners, implementing the main training controls.
- Runner for single-threaded training, based on SequentialVectorEnv.
- Runner for locally distributed training, based on SubprocVectorEnv.
- Computes action log-probs from policy logits, actions, and action_spaces.
- V-trace for softmax policies.
- V-trace from log importance weights.
- Computes the log_rhos for the V-trace calculation from the selected log-probs of the behaviour and target policies for multi-discrete actions.
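V-trace corrects off-policy returns by clipping the importance weights rho (for the value target) and c (for the trace) and accumulating temporal-difference errors backwards through time. A sketch for 1-D time-major inputs, with illustrative names and the usual clipping thresholds of 1.0 assumed:

```python
import numpy as np


def vtrace_from_log_rhos(log_rhos: np.ndarray,
                         discounts: np.ndarray,
                         rewards: np.ndarray,
                         values: np.ndarray,
                         bootstrap_value: float,
                         clip_rho_threshold: float = 1.0,
                         clip_c_threshold: float = 1.0) -> np.ndarray:
    """Compute V-trace value targets from log importance weights."""
    rhos = np.exp(log_rhos)
    clipped_rhos = np.minimum(clip_rho_threshold, rhos)
    cs = np.minimum(clip_c_threshold, rhos)

    # one-step TD errors, scaled by the clipped importance weights
    values_t_plus_1 = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + discounts * values_t_plus_1 - values)

    # backward recursion: vs_t - V(x_t) = delta_t + gamma_t * c_t * (vs_{t+1} - V(x_{t+1}))
    acc = 0.0
    vs_minus_v = np.zeros_like(values)
    for t in reversed(range(len(values))):
        acc = deltas[t] + discounts[t] * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v
```

In the on-policy case (all log_rhos zero) the targets reduce to the ordinary n-step discounted returns.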
|
Multi step soft actor critic. |
|
Algorithm parameters for SAC. |
|
Events specific for the SAC algorithm, in order to record and analyse it’s behaviour in more detail |
|
Common superclass for SAC runners, implementing the main training controls. |
|
Runner for single-threaded training, based on SequentialVectorEnv. |
Evolutionary Strategies (ES)
- Trainer class for OpenAI Evolution Strategies.
- Algorithm parameters for the evolution strategies model.
- Event interface defining the statistics emitted by the ESTrainer.
- Base class of ES training master runners (serves as basis for dev and other runners).
- Runner config for single-threaded training, based on ESDummyDistributedRollouts.
- A fixed-length vector of deterministically generated pseudo-random floats.
- Abstract base class of an optimizer to be used with ES.
- Stochastic gradient descent with momentum.
- Adam optimizer.
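ES optimizers of this kind operate on a flat numpy parameter vector rather than on framework tensors. A minimal Adam sketch with bias correction (class shape and defaults are assumptions, not the library's interface):

```python
import numpy as np


class Adam:
    """Adam on flat numpy parameter vectors, as typically used with ES."""

    def __init__(self, dim: int, lr: float = 0.01,
                 beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(dim)  # first-moment estimate
        self.v = np.zeros(dim)  # second-moment estimate
        self.t = 0              # step counter for bias correction

    def step(self, theta: np.ndarray, grad: np.ndarray) -> np.ndarray:
        """Return updated parameters given the current (estimated) gradient."""
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return theta - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

The first step moves each coordinate by roughly `lr * sign(grad)`, which is why Adam is robust to the noisy gradient estimates ES produces.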
- Result structure for distributed rollouts.
- Implementation of the ES distribution that runs the rollouts synchronously in the same process.
- Abstract base class of the ES rollout distribution.
- Exception raised if the current rollout is intentionally aborted.
- Binds rollout generation to a single worker environment by implementing it as a wrapper class.
- Gets the parameters of all sub-policies as a single flat vector.
- Overwrites the parameters of all sub-policies with a single flat vector.
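Flattening all sub-policy parameters into one vector (and restoring them from it) is what lets ES perturb and broadcast a policy as a single array. A sketch over a dict of numpy parameter arrays (names and the dict-based layout are illustrative):

```python
import numpy as np


def get_flat(params: dict) -> np.ndarray:
    """Concatenate a dict of parameter arrays into one flat vector."""
    return np.concatenate([p.ravel() for p in params.values()])


def set_flat(params: dict, flat: np.ndarray) -> None:
    """Overwrite a dict of parameter arrays from one flat vector, in place."""
    offset = 0
    for key, p in params.items():
        n = p.size
        params[key] = flat[offset:offset + n].reshape(p.shape)
        offset += n
    assert offset == flat.size, "size mismatch between vector and parameters"
```

Iteration order must be identical in both functions, which Python's insertion-ordered dicts guarantee here.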
Imitation Learning (IL) and Learning from Demonstrations (LfD)
- Event interface defining the statistics emitted by the imitation learning trainers.
- Dev runner for imitation learning.
- Trainer for behavioral cloning.
- Algorithm parameters for behavioral cloning.
- Evaluates a given policy on validation data.
- Loss function for behavioral cloning.
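For discrete actions, a behavioral cloning loss is the negative log-likelihood of the expert's actions under the policy, i.e. a cross-entropy. A numerically stable numpy sketch (function name and signature are assumptions):

```python
import numpy as np


def bc_loss(logits: np.ndarray, expert_actions: np.ndarray) -> float:
    """Mean negative log-likelihood of expert actions under the policy logits."""
    # stable log-softmax: subtract the row-wise max before exponentiating
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # pick the log-prob of each expert action and average
    return float(-np.mean(log_probs[np.arange(len(expert_actions)), expert_actions]))
```

A uniform policy over k actions yields a loss of ln(k), a useful sanity baseline during training.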
Utilities
- Stacks a list of dictionaries holding numpy arrays as values.
- Inverse of the numpy dictionary stacking operation.
- Computes the cumulative gradient norm of all provided parameters.
- Stacks a list of dictionaries holding torch tensors as values.
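The dictionary stack/unstack pair converts between a list of per-step observation dicts and a single batched dict, which is the usual shape change between rollout collection and training. A numpy sketch of the round trip (function names are illustrative; the torch variant is analogous with `torch.stack`):

```python
import numpy as np


def stack_numpy_dict_list(dict_list: list) -> dict:
    """Stack a list of dicts of numpy arrays into one dict of batched arrays."""
    return {k: np.stack([d[k] for d in dict_list]) for k in dict_list[0]}


def unstack_numpy_list_dict(stacked: dict) -> list:
    """Inverse: split a dict of batched arrays back into a list of dicts."""
    n = len(next(iter(stacked.values())))
    return [{k: v[i] for k, v in stacked.items()} for i in range(n)]
```

All dicts in the list are assumed to share the same keys and per-key shapes; `np.stack` adds the new leading batch dimension.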