Samplers

Sampler

class meta_policy_search.samplers.Sampler(env, policy, batch_size, max_path_length)[source]

Bases: object

Sampler interface

Parameters:
  • env (gym.Env) – environment object
  • policy (meta_policy_search.policies.base.Policy) – policy object
  • batch_size (int) – number of trajectories per task
  • max_path_length (int) – max number of steps per trajectory
obtain_samples()[source]

Collect batch_size trajectories

Returns: A list of paths.
Return type: (list)
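
As a rough illustration of the obtain_samples() contract, the sketch below shows what a concrete implementation typically does: roll out batch_size trajectories of at most max_path_length steps each and return them as a list of path dicts. The rollout helper, the path keys, and the policy.get_action interface are illustrative assumptions, not part of this page.

    import numpy as np

    def rollout(env, policy, max_path_length):
        # run a single trajectory of at most max_path_length steps
        observations, actions, rewards = [], [], []
        obs = env.reset()
        for _ in range(max_path_length):
            action, agent_info = policy.get_action(obs)           # assumed policy interface
            next_obs, reward, done, env_info = env.step(action)   # gym-style step
            observations.append(obs)
            actions.append(action)
            rewards.append(reward)
            obs = next_obs
            if done:
                break
        return dict(observations=np.asarray(observations),
                    actions=np.asarray(actions),
                    rewards=np.asarray(rewards))

    def obtain_samples(env, policy, batch_size, max_path_length):
        # collect batch_size trajectories, mirroring Sampler.obtain_samples()
        return [rollout(env, policy, max_path_length) for _ in range(batch_size)]
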
class meta_policy_search.samplers.MetaSampler(env, policy, rollouts_per_meta_task, meta_batch_size, max_path_length, envs_per_task=None, parallel=False)[source]

Bases: meta_policy_search.samplers.base.Sampler

Sampler for Meta-RL

Parameters:
  • env (meta_policy_search.envs.base.MetaEnv) – environment object
  • policy (meta_policy_search.policies.base.Policy) – policy object
  • rollouts_per_meta_task (int) – number of trajectories collected per task (the per-task batch size, referred to as batch_size below)
  • meta_batch_size (int) – number of meta tasks
  • max_path_length (int) – max number of steps per trajectory
  • envs_per_task (int) – number of envs to run vectorized for each task (influences the memory usage)
  • parallel (bool) – whether to run the environments in parallel worker processes rather than iteratively
obtain_samples(log=False, log_prefix='')[source]

Collect batch_size trajectories from each task

Parameters:
  • log (boolean) – whether to log sampling times
  • log_prefix (str) – prefix for logger
Returns:

A dict of paths of size [meta_batch_size] x (batch_size) x [5] x (max_path_length)

Return type:

(dict)

update_tasks()[source]

Samples a new goal for each meta task
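
A hedged usage sketch of the per-iteration sampling loop: sample fresh tasks, then collect rollouts for every meta task. Only the constructor arguments and methods documented above are used; the env, the policy, and the default values shown here are placeholders, and the meaning of the parallel flag is inferred from its name and the executors documented further down.

    from meta_policy_search.samplers import MetaSampler

    def sample_meta_batch(env, policy, meta_batch_size=40, rollouts_per_meta_task=20,
                          max_path_length=100):
        sampler = MetaSampler(
            env=env,                  # a MetaEnv instance
            policy=policy,            # a meta policy instance
            rollouts_per_meta_task=rollouts_per_meta_task,
            meta_batch_size=meta_batch_size,
            max_path_length=max_path_length,
            parallel=False,           # assumed: True samples with parallel worker processes
        )
        sampler.update_tasks()        # sample a new goal for each meta task
        # dict with one list of rollouts_per_meta_task paths per meta task
        return sampler.obtain_samples(log=True, log_prefix='train-')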

Sample Processor

class meta_policy_search.samplers.SampleProcessor(baseline, discount=0.99, gae_lambda=1, normalize_adv=False, positive_adv=False)[source]

Bases: object

Sample processor interface
  • fits a reward baseline (use zero baseline to skip this step)
  • performs Generalized Advantage Estimation to provide advantages (see Schulman et al. 2015 - https://arxiv.org/abs/1506.02438)
Parameters:
  • baseline (Baseline) – a reward baseline object
  • discount (float) – reward discount factor
  • gae_lambda (float) – Generalized Advantage Estimation lambda
  • normalize_adv (bool) – indicates whether to normalize the estimated advantages (zero mean and unit std)
  • positive_adv (bool) – indicates whether to shift the (normalized) advantages so that they are all positive
process_samples(paths, log=False, log_prefix='')[source]
Processes sampled paths. This involves:
  • computing discounted rewards (returns)
  • fitting baseline estimator using the path returns and predicting the return baselines
  • estimating the advantages using GAE (+ advantage normalization if desired)
  • stacking the path data
  • logging statistics of the paths
Parameters:
  • paths (list) – A list of paths of size (batch_size) x [5] x (max_path_length)
  • log (boolean) – indicates whether to log
  • log_prefix (str) – prefix for the logging keys
Returns:

Processed sample data of size [7] x (batch_size x max_path_length)

Return type:

(dict)
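
The numpy sketch below illustrates the core of what process_samples computes per path: discounted returns and GAE advantages (Schulman et al. 2015), with optional normalization. It is a standalone rewrite for clarity, not the class's internal code; baseline_values stands in for the fitted baseline's predictions.

    import numpy as np

    def discount_cumsum(x, discount):
        # returns R_t = sum_k discount^k * x_{t+k}
        out = np.zeros_like(x, dtype=np.float64)
        running = 0.0
        for t in reversed(range(len(x))):
            running = x[t] + discount * running
            out[t] = running
        return out

    def gae_advantages(rewards, baseline_values, discount=0.99, gae_lambda=1.0,
                       normalize_adv=False):
        # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V = 0 past the path end
        values = np.append(baseline_values, 0.0)
        deltas = rewards + discount * values[1:] - values[:-1]
        advantages = discount_cumsum(deltas, discount * gae_lambda)
        if normalize_adv:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        return advantages

    rewards = np.array([1.0, 0.0, 0.5, 1.0])
    baseline_values = np.array([0.6, 0.4, 0.7, 0.8])
    returns = discount_cumsum(rewards, discount=0.99)         # discounted returns
    advantages = gae_advantages(rewards, baseline_values)     # GAE estimates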

class meta_policy_search.samplers.DiceSampleProcessor(baseline, max_path_length, discount=0.99, gae_lambda=1, normalize_adv=True, positive_adv=False, return_baseline=None)[source]

Bases: meta_policy_search.samplers.base.SampleProcessor

Sample processor for DICE implementations
  • fits a reward baseline (use zero baseline to skip this step)
  • computes adjusted rewards (reward - baseline)
  • normalizes adjusted rewards if desired
  • zero-pads paths to max_path_length
  • stacks the padded path data
Parameters:
  • baseline (Baseline) – a time dependent reward baseline object
  • max_path_length (int) – maximum path length
  • discount (float) – reward discount factor
  • normalize_adv (bool) – indicates whether to normalize the estimated advantages (zero mean and unit std)
  • positive_adv (bool) – indicates whether to shift the (normalized) advantages so that they are all positive
  • return_baseline (Baseline) – (optional) a state(-time) dependent baseline - if provided it is also fitted and used to calculate GAE advantage estimates
process_samples(paths, log=False, log_prefix='')[source]
Processes sampled paths. This involves:
  • computing discounted rewards
  • fitting a reward baseline
  • computing adjusted rewards (reward - baseline)
  • normalizing adjusted rewards if desired
  • stacking the padded path data
  • creating a mask which indicates padded values by zero and original values by one
  • logging statistics of the paths
Parameters:
  • paths (list) – A list of paths of size (batch_size) x [5] x (max_path_length)
  • log (boolean) – indicates whether to log
  • log_prefix (str) – prefix for the logging keys
Returns:

Processed sample data. A dict containing the following items with respective shapes:
  • mask: (batch_size, max_path_length)
  • observations: (batch_size, max_path_length, ndim_obs)
  • actions: (batch_size, max_path_length, ndim_act)
  • rewards: (batch_size, max_path_length)
  • adjusted_rewards: (batch_size, max_path_length)
  • env_infos: dict of ndarrays of shape (batch_size, max_path_length, ?)
  • agent_infos: dict of ndarrays of shape (batch_size, max_path_length, ?)

Return type:

(dict)
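
A standalone numpy sketch of the zero-padding and masking step described above: each path is padded to max_path_length and a binary mask marks the original time steps. The baseline used for the adjusted rewards is a placeholder; in the class it is a fitted time-dependent Baseline object.

    import numpy as np

    def pad_and_mask(paths, max_path_length):
        rewards = np.zeros((len(paths), max_path_length))
        mask = np.zeros((len(paths), max_path_length))
        for i, path in enumerate(paths):
            T = len(path['rewards'])
            rewards[i, :T] = path['rewards']
            mask[i, :T] = 1.0          # 1 where data is real, 0 where padded
        return rewards, mask

    paths = [dict(rewards=np.array([1.0, 0.5])),
             dict(rewards=np.array([0.2, 0.0, 1.0]))]
    rewards, mask = pad_and_mask(paths, max_path_length=4)
    baseline_values = rewards.mean(axis=0)                  # placeholder time-dependent baseline
    adjusted_rewards = (rewards - baseline_values) * mask   # reward - baseline, zero on padding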

class meta_policy_search.samplers.MetaSampleProcessor(baseline, discount=0.99, gae_lambda=1, normalize_adv=False, positive_adv=False)[source]

Bases: meta_policy_search.samplers.base.SampleProcessor

process_samples(paths_meta_batch, log=False, log_prefix='')[source]
Processes sampled paths. This involves:
  • computing discounted rewards (returns)
  • fitting baseline estimator using the path returns and predicting the return baselines
  • estimating the advantages using GAE (+ advantage normalization if desired)
  • stacking the path data
  • logging statistics of the paths
Parameters:
  • paths_meta_batch (dict) – A dict of lists of paths, size: [meta_batch_size] x (batch_size) x [5] x (max_path_length)
  • log (boolean) – indicates whether to log
  • log_prefix (str) – prefix for the logging keys
Returns:

Processed sample data for each task in the meta-batch; size: [meta_batch_size] x [7] x (batch_size x max_path_length)

Return type:

(list of dicts)
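
A hedged end-to-end sketch of how MetaSampler and MetaSampleProcessor are typically chained once per training iteration. The baseline argument is only assumed to implement the Baseline interface referenced above; its concrete import path is not part of this page.

    from meta_policy_search.samplers import MetaSampler, MetaSampleProcessor

    def sample_and_process(env, policy, baseline, meta_batch_size=40):
        sampler = MetaSampler(env, policy, rollouts_per_meta_task=20,
                              meta_batch_size=meta_batch_size, max_path_length=100)
        processor = MetaSampleProcessor(baseline, discount=0.99, gae_lambda=1.0,
                                        normalize_adv=True)
        sampler.update_tasks()
        paths_meta_batch = sampler.obtain_samples(log=True, log_prefix='train-')
        # processed sample data per meta task; size [meta_batch_size] x [7] x (batch_size x max_path_length)
        return processor.process_samples(paths_meta_batch, log=True, log_prefix='train-')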

Vectorized Environment Executor

class meta_policy_search.samplers.vectorized_env_executor.MetaIterativeEnvExecutor(env, meta_batch_size, envs_per_task, max_path_length)[source]

Bases: object

Wraps multiple environments of the same kind and provides functionality to reset / step the environments in a vectorized manner. Internally, the environments are executed iteratively.

Parameters:
  • env (meta_policy_search.envs.base.MetaEnv) – meta environment object
  • meta_batch_size (int) – number of meta tasks
  • envs_per_task (int) – number of environments per meta task
  • max_path_length (int) – maximum length of sampled environment paths - if the max_path_length is reached, the respective environment is reset
num_envs

Number of environments

Returns: number of environments
Return type: (int)
reset()[source]

Resets the environments

Returns: list of (np.ndarray) with the new initial observations.
Return type: (list)
set_tasks(tasks)[source]

Sets a task for each environment

Parameters: tasks (list) – list of the tasks for each environment
step(actions)[source]

Steps the wrapped environments with the provided actions

Parameters: actions (list) – list of actions, of length meta_batch_size x envs_per_task
Returns: a length-4 tuple of lists containing obs (np.array), rewards (float), dones (bool), env_infos (dict); each list is of length meta_batch_size x envs_per_task (assumes that every task has the same number of envs)
Return type: (tuple)
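
A standalone sketch of the iterative execution pattern this class implements: a flat list of meta_batch_size x envs_per_task environments stepped one after another in a Python loop, each reset whenever it reports done or reaches max_path_length. The helper below is illustrative, not the class's internal code.

    def iterative_step(envs, actions, step_counts, max_path_length):
        obs, rewards, dones, env_infos = [], [], [], []
        for i, (env, action) in enumerate(zip(envs, actions)):
            o, r, d, info = env.step(action)      # gym-style step
            step_counts[i] += 1
            if d or step_counts[i] >= max_path_length:
                o = env.reset()                   # reset the finished environment in place
                step_counts[i] = 0
                d = True
            obs.append(o)
            rewards.append(r)
            dones.append(d)
            env_infos.append(info)
        return obs, rewards, dones, env_infos
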
class meta_policy_search.samplers.vectorized_env_executor.MetaParallelEnvExecutor(env, meta_batch_size, envs_per_task, max_path_length)[source]

Bases: object

Wraps multiple environments of the same kind and provides functionality to reset / step the environments in a vectorized manner. Thereby the environments are distributed among meta_batch_size processes and executed in parallel.

Parameters:
  • env (meta_policy_search.envs.base.MetaEnv) – meta environment object
  • meta_batch_size (int) – number of meta tasks
  • envs_per_task (int) – number of environments per meta task
  • max_path_length (int) – maximum length of sampled environment paths - if the max_path_length is reached, the respective environment is reset
num_envs

Number of environments

Returns: number of environments
Return type: (int)
reset()[source]

Resets the environments of each worker

Returns: list of (np.ndarray) with the new initial observations.
Return type: (list)
set_tasks(tasks=None)[source]

Sets a task for each worker

Parameters: tasks (list) – list of the tasks for each worker
step(actions)[source]

Executes actions on each env

Parameters: actions (list) – list of actions, of length meta_batch_size x envs_per_task
Returns: a length-4 tuple of lists containing obs (np.array), rewards (float), dones (bool), env_infos (dict); each list is of length meta_batch_size x envs_per_task (assumes that every task has the same number of envs)
Return type: (tuple)
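
A hedged usage sketch of the two executors. Both expose the same constructor and reset/set_tasks/step interface documented above, so they can be swapped; selecting between them via a parallel flag (as MetaSampler's signature suggests) is an assumption here, as is the batched policy.get_actions call.

    from meta_policy_search.samplers.vectorized_env_executor import (
        MetaIterativeEnvExecutor, MetaParallelEnvExecutor)

    def make_executor(env, meta_batch_size, envs_per_task, max_path_length, parallel=False):
        cls = MetaParallelEnvExecutor if parallel else MetaIterativeEnvExecutor
        return cls(env, meta_batch_size, envs_per_task, max_path_length)

    def vectorized_rollout_step(vec_env, policy, obs):
        # one vectorized interaction step across all wrapped environments
        actions, agent_infos = policy.get_actions(obs)   # assumed batched policy interface
        next_obs, rewards, dones, env_infos = vec_env.step(actions)
        return next_obs, rewards, dones, env_infos

    # typical setup (tasks come from the MetaEnv's task distribution):
    # vec_env = make_executor(env, meta_batch_size=40, envs_per_task=1, max_path_length=100)
    # vec_env.set_tasks(tasks)
    # obs = vec_env.reset()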