Samplers
Sampler
class meta_policy_search.samplers.Sampler(env, policy, batch_size, max_path_length)
Bases: object
Sampler interface
Parameters:
- env (gym.Env) – environment object
- policy (meta_policy_search.policies.base.Policy) – policy object
- batch_size (int) – number of trajectories per task
- max_path_length (int) – max number of steps per trajectory
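Since Sampler only fixes the constructor arguments, a concrete subclass supplies the collection logic. Below is a minimal, hypothetical sketch, assuming the base constructor stores its arguments as attributes and that the policy exposes a get_action method; it is an illustration of the interface, not library code:

    class SingleTaskSampler(Sampler):
        """Collects batch_size trajectories of at most max_path_length steps each."""

        def obtain_samples(self, log=False, log_prefix=''):
            paths = []
            for _ in range(self.batch_size):
                obs = self.env.reset()
                path = {'observations': [], 'actions': [], 'rewards': []}
                for _ in range(self.max_path_length):
                    action, agent_info = self.policy.get_action(obs)
                    next_obs, reward, done, env_info = self.env.step(action)
                    path['observations'].append(obs)
                    path['actions'].append(action)
                    path['rewards'].append(reward)
                    obs = next_obs
                    if done:
                        break
                paths.append(path)
            return paths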
class meta_policy_search.samplers.MetaSampler(env, policy, rollouts_per_meta_task, meta_batch_size, max_path_length, envs_per_task=None, parallel=False)
Bases: meta_policy_search.samplers.base.Sampler
Sampler for Meta-RL
Parameters:
- env (meta_policy_search.envs.base.MetaEnv) – environment object
- policy (meta_policy_search.policies.base.Policy) – policy object
- rollouts_per_meta_task (int) – number of trajectories to sample per task
- meta_batch_size (int) – number of meta tasks
- max_path_length (int) – max number of steps per trajectory
- envs_per_task (int) – number of envs to run vectorized for each task (influences the memory usage)
- parallel (bool) – whether to execute the environments in parallel across worker processes rather than iteratively
obtain_samples(log=False, log_prefix='')
Collects rollouts_per_meta_task trajectories from each task
Parameters:
- log (boolean) – whether to log sampling times
- log_prefix (str) – prefix for logger
Returns: A dict of paths of size [meta_batch_size] x (batch_size) x [5] x (max_path_length)
Return type: (dict)
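A hypothetical usage sketch; only the constructor signature and obtain_samples come from the docs above, while the env and policy objects are placeholders:

    sampler = MetaSampler(
        env=meta_env,                  # a meta_policy_search.envs.base.MetaEnv
        policy=policy,                 # a meta_policy_search.policies.base.Policy
        rollouts_per_meta_task=20,
        meta_batch_size=40,
        max_path_length=100,
        envs_per_task=5,
        parallel=False,
    )
    paths = sampler.obtain_samples(log=True, log_prefix='train-')
    # paths maps each of the 40 meta-task indices to a list of 20 trajectories;
    # the [5] in the shape note presumably refers to the per-path entries
    # (observations, actions, rewards, env_infos, agent_infos).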
Sample Processor
class meta_policy_search.samplers.SampleProcessor(baseline, discount=0.99, gae_lambda=1, normalize_adv=False, positive_adv=False)
Bases: object
Sample processor interface:
- fits a reward baseline (use a zero baseline to skip this step)
- performs Generalized Advantage Estimation (GAE) to compute advantages (see Schulman et al. 2015, https://arxiv.org/abs/1506.02438)
Parameters:
- baseline (Baseline) – a reward baseline object
- discount (float) – reward discount factor
- gae_lambda (float) – Generalized Advantage Estimation lambda
- normalize_adv (bool) – indicates whether to normalize the estimated advantages (zero mean and unit std)
- positive_adv (bool) – indicates whether to shift the (normalized) advantages so that they are all positive
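For reference, here is a minimal NumPy sketch of the estimator that discount and gae_lambda parameterize; the helper name is hypothetical and this illustrates the GAE computation rather than the library's exact code:

    import numpy as np

    def gae_advantages(rewards, baselines, discount=0.99, gae_lambda=1.0):
        """GAE (Schulman et al. 2015) for a single path.

        rewards:   (T,) rewards along the path
        baselines: (T,) baseline (value) predictions for the path's states
        """
        # TD residuals: delta_t = r_t + discount * V(s_{t+1}) - V(s_t),
        # with V = 0 after the final step.
        values = np.append(baselines, 0.0)
        deltas = rewards + discount * values[1:] - values[:-1]
        # advantage_t = sum_k (discount * gae_lambda)^k * delta_{t+k}
        advantages = np.zeros_like(deltas)
        running = 0.0
        for t in reversed(range(len(deltas))):
            running = deltas[t] + discount * gae_lambda * running
            advantages[t] = running
        return advantages

With gae_lambda=1 this reduces to the discounted return minus the baseline; gae_lambda=0 gives the one-step TD residual.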
process_samples(paths, log=False, log_prefix='')
Processes sampled paths. This involves:
- computing discounted rewards (returns)
- fitting baseline estimator using the path returns and predicting the return baselines
- estimating the advantages using GAE (+ advantage normalization if desired)
- stacking the path data
- logging statistics of the paths
Parameters:
- paths (list) – a list of paths of size (batch_size) x [5] x (max_path_length)
- log (boolean) – indicates whether to log
- log_prefix (str) – prefix for the logging keys
Returns: Processed sample data of size [7] x (batch_size x max_path_length)
Return type: (dict)
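A hypothetical single-task usage sketch; LinearFeatureBaseline stands in for any Baseline implementation, and the listed keys are an assumption consistent with the [7]-entry shape note above:

    processor = SampleProcessor(
        baseline=LinearFeatureBaseline(),
        discount=0.99,
        gae_lambda=0.97,
        normalize_adv=True,
    )
    samples = processor.process_samples(paths, log=True, log_prefix='task0-')
    # Expected flat arrays of shape (batch_size * max_path_length, ...), e.g.
    # samples['observations'], samples['actions'], samples['rewards'],
    # samples['returns'], samples['advantages'], plus env_infos / agent_infos.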
class meta_policy_search.samplers.DiceSampleProcessor(baseline, max_path_length, discount=0.99, gae_lambda=1, normalize_adv=True, positive_adv=False, return_baseline=None)
Bases: meta_policy_search.samplers.base.SampleProcessor
Sample processor for DICE implementations:
- fits a reward baseline (use a zero baseline to skip this step)
- computes adjusted rewards (reward - baseline)
- normalizes the adjusted rewards if desired
- zero-pads paths to max_path_length
- stacks the padded path data
Parameters:
- baseline (Baseline) – a time-dependent reward baseline object
- max_path_length (int) – maximum path length
- discount (float) – reward discount factor
- gae_lambda (float) – Generalized Advantage Estimation lambda
- normalize_adv (bool) – indicates whether to normalize the estimated advantages (zero mean and unit std)
- positive_adv (bool) – indicates whether to shift the (normalized) advantages so that they are all positive
- return_baseline (Baseline) – (optional) a state(-time) dependent baseline; if provided, it is also fitted and used to compute GAE advantage estimates
process_samples(paths, log=False, log_prefix='')
Processes sampled paths. This involves:
- computing discounted rewards
- fitting a reward baseline
- computing adjusted rewards (reward - baseline)
- normalizing adjusted rewards if desired
- stacking the padded path data
- creating a mask which indicates padded values by zero and original values by one
- logging statistics of the paths
Parameters:
- paths (list) – a list of paths of size (batch_size) x [5] x (max_path_length)
- log (boolean) – indicates whether to log
- log_prefix (str) – prefix for the logging keys
Returns:
Processed sample data. A dict containing the following items with respective shapes:
- mask: (batch_size, max_path_length)
- observations: (batch_size, max_path_length, ndim_obs)
- actions: (batch_size, max_path_length, ndim_act)
- rewards: (batch_size, max_path_length)
- adjusted_rewards: (batch_size, max_path_length)
- env_infos: dict of ndarrays of shape (batch_size, max_path_length, ?)
- agent_infos: dict of ndarrays of shape (batch_size, max_path_length, ?)
Return type: (dict)
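To make the padding and mask convention concrete, here is a small self-contained illustration (not library code):

    import numpy as np

    def pad_and_mask(rewards_per_path, max_path_length):
        """Zero-pad variable-length reward sequences and build the 0/1 mask."""
        batch_size = len(rewards_per_path)
        rewards = np.zeros((batch_size, max_path_length))
        mask = np.zeros((batch_size, max_path_length))
        for i, r in enumerate(rewards_per_path):
            rewards[i, :len(r)] = r
            mask[i, :len(r)] = 1.0   # 1 marks original values, 0 marks padding
        return rewards, mask

    rewards, mask = pad_and_mask([[1.0, 1.0, 0.5], [0.2]], max_path_length=4)
    # rewards -> [[1.0, 1.0, 0.5, 0.0], [0.2, 0.0, 0.0, 0.0]]
    # mask    -> [[1.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]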
class meta_policy_search.samplers.MetaSampleProcessor(baseline, discount=0.99, gae_lambda=1, normalize_adv=False, positive_adv=False)
Bases: meta_policy_search.samplers.base.SampleProcessor
process_samples(paths_meta_batch, log=False, log_prefix='')
Processes sampled paths. This involves:
- computing discounted rewards (returns)
- fitting baseline estimator using the path returns and predicting the return baselines
- estimating the advantages using GAE (+ advantage normalization if desired)
- stacking the path data
- logging statistics of the paths
Parameters:
- paths_meta_batch (dict) – a dict of lists of paths, size: [meta_batch_size] x (batch_size) x [5] x (max_path_length)
- log (boolean) – indicates whether to log
- log_prefix (str) – prefix for the logging keys
Returns: Processed sample data for each task in the meta-batch; size: [meta_batch_size] x [7] x (batch_size x max_path_length)
Return type: (list of dicts)
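A hypothetical end-to-end sketch combining the MetaSampler and MetaSampleProcessor documented above (LinearFeatureBaseline again stands in for any Baseline implementation):

    sample_processor = MetaSampleProcessor(
        baseline=LinearFeatureBaseline(),
        discount=0.99,
        gae_lambda=1.0,
        normalize_adv=True,
    )
    paths_meta_batch = sampler.obtain_samples(log=True, log_prefix='meta-')
    samples_data = sample_processor.process_samples(paths_meta_batch)
    # samples_data holds one processed-sample dict per meta task.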
Vectorized Environment Executor
class meta_policy_search.samplers.vectorized_env_executor.MetaIterativeEnvExecutor(env, meta_batch_size, envs_per_task, max_path_length)
Bases: object
Wraps multiple environments of the same kind and provides functionality to reset / step the environments in a vectorized manner. Internally, the environments are executed iteratively.
Parameters:
- env (meta_policy_search.envs.base.MetaEnv) – meta environment object
- meta_batch_size (int) – number of meta tasks
- envs_per_task (int) – number of environments per meta task
- max_path_length (int) – maximum length of sampled environment paths; if max_path_length is reached, the respective environment is reset
num_envs
Number of wrapped environments
Returns: number of environments
Return type: (int)
reset()
Resets the environments
Returns: list of the new initial observations (np.ndarray)
Return type: (list)
set_tasks(tasks)
Assigns a task from the list to each environment
Parameters: tasks (list) – list of the tasks for each environment
step(actions)
Steps the wrapped environments with the provided actions
Parameters: actions (list) – list of actions, of length meta_batch_size x envs_per_task
Returns: a length-4 tuple of lists containing obs (np.array), rewards (float), dones (bool), env_infos (dict); each list is of length meta_batch_size x envs_per_task (assumes that every task has the same number of envs)
Return type: (tuple)
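A hypothetical interaction loop; only the constructor and the members documented above come from the docs, while the sample_tasks helper on the meta environment and the policy call are assumed placeholders:

    executor = MetaIterativeEnvExecutor(
        env=meta_env, meta_batch_size=40, envs_per_task=5, max_path_length=100,
    )
    executor.set_tasks(meta_env.sample_tasks(40))   # one task per meta task (assumed helper)
    obs_list = executor.reset()                     # num_envs = 40 * 5 = 200 observations
    for _ in range(100):
        actions = [get_action(obs) for obs in obs_list]   # placeholder policy call
        obs_list, rewards, dones, env_infos = executor.step(actions)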
class meta_policy_search.samplers.vectorized_env_executor.MetaParallelEnvExecutor(env, meta_batch_size, envs_per_task, max_path_length)
Bases: object
Wraps multiple environments of the same kind and provides functionality to reset / step the environments in a vectorized manner. The environments are distributed among meta_batch_size worker processes and executed in parallel.
Parameters:
- env (meta_policy_search.envs.base.MetaEnv) – meta environment object
- meta_batch_size (int) – number of meta tasks
- envs_per_task (int) – number of environments per meta task
- max_path_length (int) – maximum length of sampled environment paths; if max_path_length is reached, the respective environment is reset
num_envs
Number of wrapped environments
Returns: number of environments
Return type: (int)
reset()
Resets the environments of each worker
Returns: list of the new initial observations (np.ndarray)
Return type: (list)
set_tasks(tasks=None)
Assigns a task from the list to each worker
Parameters: tasks (list) – list of the tasks for each worker
step(actions)
Executes the provided actions on each worker's environments
Parameters: actions (list) – list of actions, of length meta_batch_size x envs_per_task
Returns: a length-4 tuple of lists containing obs (np.array), rewards (float), dones (bool), env_infos (dict); each list is of length meta_batch_size x envs_per_task (assumes that every task has the same number of envs)
Return type: (tuple)
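Both executors expose the same interface, so callers can switch between them with a flag; a sketch consistent with MetaSampler's parallel argument:

    # Iterative execution is simpler and avoids multiprocessing overhead;
    # parallel execution amortizes expensive env steps across worker processes.
    if parallel:
        vec_env = MetaParallelEnvExecutor(env, meta_batch_size, envs_per_task, max_path_length)
    else:
        vec_env = MetaIterativeEnvExecutor(env, meta_batch_size, envs_per_task, max_path_length)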