Samplers

Sampler

class meta_policy_search.samplers.Sampler(env, policy, batch_size, max_path_length)[source]

Bases: object

Sampler interface

Parameters:
  • env (gym.Env) – environment object
  • policy (meta_policy_search.policies.base.Policy) – policy object
  • batch_size (int) – number of trajectories per task
  • max_path_length (int) – max number of steps per trajectory
obtain_samples()[source]

Collect batch_size trajectories

Returns: A list of paths.
Return type: (list)
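
As a rough illustration of the obtain_samples() contract, the sketch below shows what a concrete implementation typically does: roll out batch_size trajectories of at most max_path_length steps each and return them as a list of path dicts. The rollout helper, the path keys, and the policy.get_action interface are illustrative assumptions, not part of this page.

    import numpy as np

    def rollout(env, policy, max_path_length):
        # run a single trajectory of at most max_path_length steps
        observations, actions, rewards = [], [], []
        obs = env.reset()
        for _ in range(max_path_length):
            action, agent_info = policy.get_action(obs)           # assumed policy interface
            next_obs, reward, done, env_info = env.step(action)   # gym-style step
            observations.append(obs)
            actions.append(action)
            rewards.append(reward)
            obs = next_obs
            if done:
                break
        return dict(observations=np.asarray(observations),
                    actions=np.asarray(actions),
                    rewards=np.asarray(rewards))

    def obtain_samples(env, policy, batch_size, max_path_length):
        # collect batch_size trajectories, mirroring Sampler.obtain_samples()
        return [rollout(env, policy, max_path_length) for _ in range(batch_size)]
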
class meta_policy_search.samplers.MetaSampler(env, policy, rollouts_per_meta_task, meta_batch_size, max_path_length, envs_per_task=None, parallel=False)[source]

Bases: meta_policy_search.samplers.base.Sampler

Sampler for Meta-RL

Parameters:
  • env (meta_policy_search.envs.base.MetaEnv) – environment object
  • policy (meta_policy_search.policies.base.Policy) – policy object
  • rollouts_per_meta_task (int) – number of trajectories collected per task (the per-task batch size, referred to as batch_size below)
  • meta_batch_size (int) – number of meta tasks
  • max_path_length (int) – max number of steps per trajectory
  • envs_per_task (int) – number of envs to run vectorized for each task (influences the memory usage)
  • parallel (bool) – whether to run the environments in parallel worker processes rather than iteratively
obtain_samples(log=False, log_prefix='')[source]

Collect batch_size trajectories from each task

Parameters:
  • log (boolean) – whether to log sampling times
  • log_prefix (str) – prefix for logger
Returns:

A dict of paths of size [meta_batch_size] x (batch_size) x [5] x (max_path_length)

Return type:

(dict)

update_tasks()[source]

Samples a new goal for each meta task
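
A hedged usage sketch of the per-iteration sampling loop: sample fresh tasks, then collect rollouts for every meta task. Only the constructor arguments and methods documented above are used; the env, the policy, and the default values shown here are placeholders, and the meaning of the parallel flag is inferred from its name and the executors documented further down.

    from meta_policy_search.samplers import MetaSampler

    def sample_meta_batch(env, policy, meta_batch_size=40, rollouts_per_meta_task=20,
                          max_path_length=100):
        sampler = MetaSampler(
            env=env,                  # a MetaEnv instance
            policy=policy,            # a meta policy instance
            rollouts_per_meta_task=rollouts_per_meta_task,
            meta_batch_size=meta_batch_size,
            max_path_length=max_path_length,
            parallel=False,           # assumed: True samples with parallel worker processes
        )
        sampler.update_tasks()        # sample a new goal for each meta task
        # dict with one list of rollouts_per_meta_task paths per meta task
        return sampler.obtain_samples(log=True, log_prefix='train-')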

Sample Processor

class meta_policy_search.samplers.SampleProcessor(baseline, discount=0.99, gae_lambda=1, normalize_adv=False, positive_adv=False)[source]

Bases: object

Sample processor interface
  • fits a reward baseline (use zero baseline to skip this step)
  • performs Generalized Advantage Estimation to provide advantages (see Schulman et al. 2015 - https://arxiv.org/abs/1506.02438)
Parameters:
  • baseline (Baseline) – a reward baseline object
  • discount (float) – reward discount factor
  • gae_lambda (float) – Generalized Advantage Estimation lambda
  • normalize_adv (bool) – indicates whether to normalize the estimated advantages (zero mean and unit std)
  • positive_adv (bool) – indicates whether to shift the (normalized) advantages so that they are all positive
process_samples(paths, log=False, log_prefix='')[source]
Processes sampled paths. This involves:
  • computing discounted rewards (returns)
  • fitting baseline estimator using the path returns and predicting the return baselines
  • estimating the advantages using GAE (+ advantage normalization if desired)
  • stacking the path data
  • logging statistics of the paths
Parameters:
  • paths (list) – A list of paths of size (batch_size) x [5] x (max_path_length)
  • log (boolean) – indicates whether to log
  • log_prefix (str) – prefix for the logging keys
Returns:

Processed sample data of size [7] x (batch_size x max_path_length)

Return type:

(dict)
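
The numpy sketch below illustrates the core of what process_samples computes per path: discounted returns and GAE advantages (Schulman et al. 2015), with optional normalization. It is a standalone rewrite for clarity, not the class's internal code; baseline_values stands in for the fitted baseline's predictions.

    import numpy as np

    def discount_cumsum(x, discount):
        # returns R_t = sum_k discount^k * x_{t+k}
        out = np.zeros_like(x, dtype=np.float64)
        running = 0.0
        for t in reversed(range(len(x))):
            running = x[t] + discount * running
            out[t] = running
        return out

    def gae_advantages(rewards, baseline_values, discount=0.99, gae_lambda=1.0,
                       normalize_adv=False):
        # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V = 0 past the path end
        values = np.append(baseline_values, 0.0)
        deltas = rewards + discount * values[1:] - values[:-1]
        advantages = discount_cumsum(deltas, discount * gae_lambda)
        if normalize_adv:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        return advantages

    rewards = np.array([1.0, 0.0, 0.5, 1.0])
    baseline_values = np.array([0.6, 0.4, 0.7, 0.8])
    returns = discount_cumsum(rewards, discount=0.99)         # discounted returns
    advantages = gae_advantages(rewards, baseline_values)     # GAE estimates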

class meta_policy_search.samplers.DiceSampleProcessor(baseline, max_path_length, discount=0.99, gae_lambda=1, normalize_adv=True, positive_adv=False, return_baseline=None)[source]

Bases: meta_policy_search.samplers.base.SampleProcessor

Sample processor for DICE implementations
  • fits a reward baseline (use zero baseline to skip this step)
  • computes adjusted rewards (reward - baseline)
  • normalizes adjusted rewards if desired
  • zero-pads paths to max_path_length
  • stacks the padded path data
Parameters:
  • baseline (Baseline) – a time dependent reward baseline object
  • max_path_length (int) – maximum path length
  • discount (float) – reward discount factor
  • normalize_adv (bool) – indicates whether to normalize the estimated advantages (zero mean and unit std)
  • positive_adv (bool) – indicates whether to shift the (normalized) advantages so that they are all positive
  • return_baseline (Baseline) – (optional) a state(-time) dependent baseline - if provided it is also fitted and used to calculate GAE advantage estimates
process_samples(paths, log=False, log_prefix='')[source]
Processes sampled paths. This involves:
  • computing discounted rewards
  • fitting a reward baseline
  • computing adjusted rewards (reward - baseline)
  • normalizing adjusted rewards if desired
  • stacking the padded path data
  • creating a mask which indicates padded values by zero and original values by one
  • logging statistics of the paths
Parameters:
  • paths (list) – A list of paths of size (batch_size) x [5] x (max_path_length)
  • log (boolean) – indicates whether to log
  • log_prefix (str) – prefix for the logging keys
Returns:

Processed sample data. A dict containing the following items with respective shapes:
  • mask: (batch_size, max_path_length)
  • observations: (batch_size, max_path_length, ndim_obs)
  • actions: (batch_size, max_path_length, ndim_act)
  • rewards: (batch_size, max_path_length)
  • adjusted_rewards: (batch_size, max_path_length)
  • env_infos: dict of ndarrays of shape (batch_size, max_path_length, ?)
  • agent_infos: dict of ndarrays of shape (batch_size, max_path_length, ?)

Return type:

(dict)
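
A standalone numpy sketch of the zero-padding and masking step described above: each path is padded to max_path_length and a binary mask marks the original time steps. The baseline used for the adjusted rewards is a placeholder; in the class it is a fitted time-dependent Baseline object.

    import numpy as np

    def pad_and_mask(paths, max_path_length):
        rewards = np.zeros((len(paths), max_path_length))
        mask = np.zeros((len(paths), max_path_length))
        for i, path in enumerate(paths):
            T = len(path['rewards'])
            rewards[i, :T] = path['rewards']
            mask[i, :T] = 1.0          # 1 where data is real, 0 where padded
        return rewards, mask

    paths = [dict(rewards=np.array([1.0, 0.5])),
             dict(rewards=np.array([0.2, 0.0, 1.0]))]
    rewards, mask = pad_and_mask(paths, max_path_length=4)
    baseline_values = rewards.mean(axis=0)                  # placeholder time-dependent baseline
    adjusted_rewards = (rewards - baseline_values) * mask   # reward - baseline, zero on padding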

class meta_policy_search.samplers.MetaSampleProcessor(baseline, discount=0.99, gae_lambda=1, normalize_adv=False, positive_adv=False)[source]

Bases: meta_policy_search.samplers.base.SampleProcessor

process_samples(paths_meta_batch, log=False, log_prefix='')[source]
Processes sampled paths. This involves:
  • computing discounted rewards (returns)
  • fitting baseline estimator using the path returns and predicting the return baselines
  • estimating the advantages using GAE (+ advantage normalization if desired)
  • stacking the path data
  • logging statistics of the paths
Parameters:
  • paths_meta_batch (dict) – A dict of lists of paths, size: [meta_batch_size] x (batch_size) x [5] x (max_path_length)
  • log (boolean) – indicates whether to log
  • log_prefix (str) – prefix for the logging keys
Returns:

Processed sample data for each task in the meta-batch; size: [meta_batch_size] x [7] x (batch_size x max_path_length)

Return type:

(list of dicts)
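
A hedged end-to-end sketch of how MetaSampler and MetaSampleProcessor are typically chained once per training iteration. The baseline argument is only assumed to implement the Baseline interface referenced above; its concrete import path is not part of this page.

    from meta_policy_search.samplers import MetaSampler, MetaSampleProcessor

    def sample_and_process(env, policy, baseline, meta_batch_size=40):
        sampler = MetaSampler(env, policy, rollouts_per_meta_task=20,
                              meta_batch_size=meta_batch_size, max_path_length=100)
        processor = MetaSampleProcessor(baseline, discount=0.99, gae_lambda=1.0,
                                        normalize_adv=True)
        sampler.update_tasks()
        paths_meta_batch = sampler.obtain_samples(log=True, log_prefix='train-')
        # processed sample data per meta task; size [meta_batch_size] x [7] x (batch_size x max_path_length)
        return processor.process_samples(paths_meta_batch, log=True, log_prefix='train-')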

Vectorized Environment Executor

class meta_policy_search.samplers.vectorized_env_executor.MetaIterativeEnvExecutor(env, meta_batch_size, envs_per_task, max_path_length)[source]

Bases: object

Wraps multiple environments of the same kind and provides functionality to reset / step the environments in a vectorized manner. Internally, the environments are executed iteratively.

Parameters:
  • env (meta_policy_search.envs.base.MetaEnv) – meta environment object
  • meta_batch_size (int) – number of meta tasks
  • envs_per_task (int) – number of environments per meta task
  • max_path_length (int) – maximum length of sampled environment paths - if the max_path_length is reached, the respective environment is reset
num_envs

Number of environments

Returns: number of environments
Return type: (int)
reset()[source]

Resets the environments

Returns: list of (np.ndarray) with the new initial observations.
Return type: (list)
set_tasks(tasks)[source]

Sets a task for each environment

Parameters: tasks (list) – list of the tasks for each environment
step(actions)[source]

Steps the wrapped environments with the provided actions

Parameters: actions (list) – list of actions, of length meta_batch_size x envs_per_task
Returns: a length-4 tuple of lists containing obs (np.array), rewards (float), dones (bool), env_infos (dict); each list is of length meta_batch_size x envs_per_task (assumes that every task has the same number of envs)
Return type: (tuple)
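
A standalone sketch of the iterative execution pattern this class implements: a flat list of meta_batch_size x envs_per_task environments stepped one after another in a Python loop, each reset whenever it reports done or reaches max_path_length. The helper below is illustrative, not the class's internal code.

    def iterative_step(envs, actions, step_counts, max_path_length):
        obs, rewards, dones, env_infos = [], [], [], []
        for i, (env, action) in enumerate(zip(envs, actions)):
            o, r, d, info = env.step(action)      # gym-style step
            step_counts[i] += 1
            if d or step_counts[i] >= max_path_length:
                o = env.reset()                   # reset the finished environment in place
                step_counts[i] = 0
                d = True
            obs.append(o)
            rewards.append(r)
            dones.append(d)
            env_infos.append(info)
        return obs, rewards, dones, env_infos
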
class meta_policy_search.samplers.vectorized_env_executor.MetaParallelEnvExecutor(env, meta_batch_size, envs_per_task, max_path_length)[source]

Bases: object

Wraps multiple environments of the same kind and provides functionality to reset / step the environments in a vectorized manner. Thereby the environments are distributed among meta_batch_size processes and executed in parallel.

Parameters:
  • env (meta_policy_search.envs.base.MetaEnv) – meta environment object
  • meta_batch_size (int) – number of meta tasks
  • envs_per_task (int) – number of environments per meta task
  • max_path_length (int) – maximum length of sampled environment paths - if the max_path_length is reached, the respective environment is reset
num_envs

Number of environments

Returns: number of environments
Return type: (int)
reset()[source]

Resets the environments of each worker

Returns: list of (np.ndarray) with the new initial observations.
Return type: (list)
set_tasks(tasks=None)[source]

Sets a task for each worker

Parameters: tasks (list) – list of the tasks for each worker
step(actions)[source]

Executes actions on each env

Parameters: actions (list) – list of actions, of length meta_batch_size x envs_per_task
Returns: a length-4 tuple of lists containing obs (np.array), rewards (float), dones (bool), env_infos (dict); each list is of length meta_batch_size x envs_per_task (assumes that every task has the same number of envs)
Return type: (tuple)
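
A hedged usage sketch of the two executors. Both expose the same constructor and reset/set_tasks/step interface documented above, so they can be swapped; selecting between them via a parallel flag (as MetaSampler's signature suggests) is an assumption here, as is the batched policy.get_actions call.

    from meta_policy_search.samplers.vectorized_env_executor import (
        MetaIterativeEnvExecutor, MetaParallelEnvExecutor)

    def make_executor(env, meta_batch_size, envs_per_task, max_path_length, parallel=False):
        cls = MetaParallelEnvExecutor if parallel else MetaIterativeEnvExecutor
        return cls(env, meta_batch_size, envs_per_task, max_path_length)

    def vectorized_rollout_step(vec_env, policy, obs):
        # one vectorized interaction step across all wrapped environments
        actions, agent_infos = policy.get_actions(obs)   # assumed batched policy interface
        next_obs, rewards, dones, env_infos = vec_env.step(actions)
        return next_obs, rewards, dones, env_infos

    # typical setup (tasks come from the MetaEnv's task distribution):
    # vec_env = make_executor(env, meta_batch_size=40, envs_per_task=1, max_path_length=100)
    # vec_env.set_tasks(tasks)
    # obs = vec_env.reset()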