Policies

Policy Interfaces

class meta_policy_search.policies.Policy(obs_dim, action_dim, name='policy', hidden_sizes=(32, 32), learn_std=True, hidden_nonlinearity=<function tanh>, output_nonlinearity=None, **kwargs)[source]

Bases: meta_policy_search.utils.serializable.Serializable

A container for storing the current pre- and post-update policies. Also provides functions for executing the policy and updating its parameters.

Note

The pre-update policy is stored as tf.Variables, while the post-update policy is stored in numpy arrays and executed through tf.placeholders.

Parameters:
  • obs_dim (int) – dimensionality of the observation space -> specifies the input size of the policy
  • action_dim (int) – dimensionality of the action space -> specifies the output size of the policy
  • name (str) – Name used for scoping variables in policy
  • hidden_sizes (tuple) – size of hidden layers of network
  • learn_std (bool) – whether to learn variance of network output
  • hidden_nonlinearity (Operation) – nonlinearity used between hidden layers of network
  • output_nonlinearity (Operation) – nonlinearity used after the final layer of network
build_graph()[source]

Builds computational graph for policy

distribution

Returns this policy’s distribution

Returns:this policy’s distribution
Return type:(Distribution)
distribution_info_keys(obs, state_infos)[source]
Parameters:
  • obs (placeholder) – symbolic variable for observations
  • state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

distribution_info_sym(obs_var, params=None)[source]

Return the symbolic distribution information about the actions.

Parameters:
  • obs_var (placeholder) – symbolic variable for observations
  • params (None or dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)
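
A minimal usage sketch of the distribution-info methods, assuming TensorFlow 1.x graph mode, an already-built concrete policy instance `policy`, and that the distribution parameters are Gaussian-style "mean"/"log_std" entries (these names are illustrative, not guaranteed by the interface):

    import tensorflow as tf

    # symbolic batch of observations, shape (batch_size, obs_dim);
    # obs_dim is assumed to match the value passed to the constructor
    obs_ph = tf.placeholder(tf.float32, shape=(None, obs_dim), name="obs")

    # dict of symbolic distribution parameters for the current policy weights,
    # e.g. {"mean": <tf.Tensor>, "log_std": <tf.Tensor>} for a Gaussian policy
    dist_info_sym = policy.distribution_info_sym(obs_ph)

    # the same graph can be rebuilt with explicit parameter tensors
    # (e.g. adapted post-update parameters) by passing them via `params`
    adapted_dist_info_sym = policy.distribution_info_sym(obs_ph, params=policy.get_params())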

get_action(observation)[source]

Runs a single observation through the specified policy

Parameters:observation (array) – single observation
Returns:single action for the given observation
Return type:(array)
get_actions(observations)[source]

Runs each set of observations through each task specific policy

Parameters:observations (array) – array of arrays of observations generated by each task and env
Returns:array of arrays of actions for each env, shape (meta_batch_size, batch_size, action_dim), together with an array of arrays of agent_info dicts
Return type:(tuple)
get_param_values()[source]

Gets a list of all the current weights in the network (in the original code this list is flattened)

Returns:list of values for parameters
Return type:(list)
get_params()[source]

Get the tf.Variables representing the trainable weights of the network (symbolic)

Returns:a dict of all trainable Variables
Return type:(dict)
likelihood_ratio_sym(obs, action, dist_info_old, policy_params)[source]

Computes the likelihood ratio p_new(a|s) / p_old(a|s) of the actions, where the new distribution is built from policy_params and the old distribution is given by dist_info_old

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • dist_info_old (dict) – dictionary of tf.placeholders with old policy information
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:likelihood ratio
Return type:(tf.Tensor)

log_diagnostics(paths)[source]

Log extra information per iteration based on the collected paths

log_likelihood_sym(obs, action, policy_params)[source]

Computes the log-likelihood log p(a|s) of the actions under the policy

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:log likelihood
Return type:(tf.Tensor)
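
For intuition, the quantities computed symbolically by log_likelihood_sym and likelihood_ratio_sym correspond, for a diagonal Gaussian policy, to the closed-form expressions below. This is a plain-numpy sketch of the math only, not the library's TensorFlow implementation:

    import numpy as np

    def gaussian_log_likelihood(actions, mean, log_std):
        """Log-density log p(a|s) of a diagonal Gaussian, summed over action dims."""
        var = np.exp(2 * log_std)
        return np.sum(
            -0.5 * np.log(2 * np.pi) - log_std - 0.5 * (actions - mean) ** 2 / var,
            axis=-1,
        )

    # toy batch: 3 actions of dimension 2
    actions = np.zeros((3, 2))
    mean_old, log_std_old = np.zeros((3, 2)), np.zeros((3, 2))
    mean_new, log_std_new = 0.1 * np.ones((3, 2)), -0.1 * np.ones((3, 2))

    log_p_old = gaussian_log_likelihood(actions, mean_old, log_std_old)
    log_p_new = gaussian_log_likelihood(actions, mean_new, log_std_new)

    # likelihood ratio p_new(a|s) / p_old(a|s), as used in surrogate objectives
    ratio = np.exp(log_p_new - log_p_old)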

set_params(policy_params)[source]

Sets the parameters for the graph

Parameters:policy_params (dict) – dictionary of variable names and corresponding parameter values
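
A hypothetical save/restore round trip using the parameter accessors above (a sketch only; it assumes an already-built policy, an active TensorFlow session, and that the structure returned by get_param_values() is accepted by set_params()):

    # snapshot the current numeric weights of the network
    saved_params = policy.get_param_values()

    # ... gradient steps or other updates that modify the network weights ...

    # write the saved values back into the graph
    policy.set_params(saved_params)
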
class meta_policy_search.policies.MetaPolicy(*args, **kwargs)[source]

Bases: meta_policy_search.policies.base.Policy

build_graph()[source]

Builds the computational graph for the policy; should also create lists of variables and the corresponding assign ops

distribution

Returns this policy’s distribution

Returns:this policy’s distribution
Return type:(Distribution)
distribution_info_keys(obs, state_infos)
Parameters:
  • obs (placeholder) – symbolic variable for observations
  • state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

distribution_info_sym(obs_var, params=None)

Return the symbolic distribution information about the actions.

Parameters:
  • obs_var (placeholder) – symbolic variable for observations
  • params (None or dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

get_action(observation)

Runs a single observation through the specified policy

Parameters:observation (array) – single observation
Returns:single action for the given observation
Return type:(array)
get_actions(observations)[source]

Runs each set of observations through each task specific policy

Parameters:observations (array) – array of arrays of observations generated by each task and env
Returns:array of arrays of actions for each env, shape (meta_batch_size, batch_size, action_dim), together with an array of arrays of agent_info dicts
Return type:(tuple)
get_param_values()

Gets a list of all the current weights in the network (in the original code this list is flattened)

Returns:list of values for parameters
Return type:(list)
get_params()

Get the tf.Variables representing the trainable weights of the network (symbolic)

Returns:a dict of all trainable Variables
Return type:(dict)
likelihood_ratio_sym(obs, action, dist_info_old, policy_params)

Computes the likelihood ratio p_new(a|s) / p_old(a|s) of the actions, where the new distribution is built from policy_params and the old distribution is given by dist_info_old

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • dist_info_old (dict) – dictionary of tf.placeholders with old policy information
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:likelihood ratio
Return type:(tf.Tensor)

log_diagnostics(paths)

Log extra information per iteration based on the collected paths

log_likelihood_sym(obs, action, policy_params)

Computes the log-likelihood log p(a|s) of the actions under the policy

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:log likelihood
Return type:(tf.Tensor)

policies_params_feed_dict

Returns a fully prepared feed dict for feeding the currently saved policy parameter values into the lightweight policy graph
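
A hypothetical sketch of how this feed dict could be combined with an observation feed when evaluating the lightweight policy graph manually; `action_var` and `obs_ph` stand in for the output tensor and observation placeholder of that graph and are not part of the documented interface (in normal use get_actions presumably handles this internally):

    # merge the saved parameter values with the observation feed and run the graph
    feed_dict = {**meta_policy.policies_params_feed_dict, obs_ph: obs_batch}
    sampled_actions = sess.run(action_var, feed_dict=feed_dict)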

set_params(policy_params)

Sets the parameters for the graph

Parameters:policy_params (dict) – dictionary of variable names and corresponding parameter values
switch_to_pre_update()[source]

Switches get_action to pre-update policy

update_task_parameters(updated_policies_parameters)[source]
Parameters:
  • updated_policies_parameters (list) – list of size meta_batch_size; each element is a dict with the policy parameters as numpy arrays
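
The expected argument is one entry per task; a hypothetical sketch of the data structure (the key names and shapes are placeholders, not the library's actual variable names, and `meta_policy` is assumed to be a built meta-policy instance):

    import numpy as np

    meta_batch_size = 2  # number of tasks

    # one dict of adapted (post-update) parameters per task, stored as numpy arrays
    updated_policies_parameters = [
        {
            "mean_network/hidden_0/W": np.zeros((11, 64)),   # placeholder name/shape
            "mean_network/hidden_0/b": np.zeros(64),
        }
        for _ in range(meta_batch_size)
    ]

    meta_policy.update_task_parameters(updated_policies_parameters)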

Gaussian-Policies

class meta_policy_search.policies.GaussianMLPPolicy(*args, init_std=1.0, min_std=1e-06, **kwargs)[source]

Bases: meta_policy_search.policies.base.Policy

Gaussian multi-layer perceptron policy (diagonal covariance matrix). Provides functions for executing the policy and updating its parameters, and serves as a container for storing the current pre- and post-update policies.

Parameters:
  • obs_dim (int) – dimensionality of the observation space -> specifies the input size of the policy
  • action_dim (int) – dimensionality of the action space -> specifies the output size of the policy
  • name (str) – name of the policy used as tf variable scope
  • hidden_sizes (tuple) – tuple of integers specifying the hidden layer sizes of the MLP
  • hidden_nonlinearity (tf.op) – nonlinearity function of the hidden layers
  • output_nonlinearity (tf.op or None) – nonlinearity function of the output layer
  • learn_std (boolean) – whether the standard deviation / variance is trainable or fixed
  • init_std (float) – initial policy standard deviation
  • min_std (float) – minimal policy standard deviation
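
A construction sketch using the arguments listed above (the dimensions and hyperparameters are illustrative, and TensorFlow 1.x is assumed):

    import tensorflow as tf
    from meta_policy_search.policies import GaussianMLPPolicy

    policy = GaussianMLPPolicy(
        obs_dim=11,                      # illustrative observation dimensionality
        action_dim=3,                    # illustrative action dimensionality
        name="gaussian_mlp_policy",
        hidden_sizes=(64, 64),
        hidden_nonlinearity=tf.tanh,
        output_nonlinearity=None,
        learn_std=True,
        init_std=1.0,
        min_std=1e-6,
    )
    # depending on the implementation, build_graph() may need to be called explicitly
    # and the variables initialized in a tf.Session before the policy can be executed
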
build_graph()[source]

Builds computational graph for policy

distribution

Returns this policy’s distribution

Returns:this policy’s distribution
Return type:(Distribution)
distribution_info_keys(obs, state_infos)[source]
Parameters:
  • obs (placeholder) – symbolic variable for observations
  • state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

distribution_info_sym(obs_var, params=None)[source]

Return the symbolic distribution information about the actions.

Parameters:
  • obs_var (placeholder) – symbolic variable for observations
  • params (dict) – a dictionary of placeholders or vars with the parameters of the MLP
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

get_action(observation)[source]

Runs a single observation through the specified policy and samples an action

Parameters:observation (ndarray) – single observation - shape: (obs_dim,)
Returns:single action - shape: (action_dim,)
Return type:(ndarray)
get_actions(observations)[source]

Runs each set of observations through each task specific policy

Parameters:observations (ndarray) – array of observations - shape: (batch_size, obs_dim)
Returns:array of sampled actions - shape: (batch_size, action_dim)
Return type:(ndarray)
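
A sampling sketch illustrating the documented shapes (assumes a built policy with an active session and initialized variables, and the obs_dim/action_dim values from the construction example; whether agent_info dicts are returned alongside the actions may vary, so only the documented return values are shown):

    import numpy as np

    single_obs = np.zeros(11)                   # shape: (obs_dim,)
    action = policy.get_action(single_obs)      # single sampled action, shape: (action_dim,)

    batch_obs = np.zeros((50, 11))              # shape: (batch_size, obs_dim)
    actions = policy.get_actions(batch_obs)     # sampled actions, shape: (batch_size, action_dim)
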
get_param_values()

Gets a list of all the current weights in the network (in the original code this list is flattened)

Returns:list of values for parameters
Return type:(list)
get_params()

Get the tf.Variables representing the trainable weights of the network (symbolic)

Returns:a dict of all trainable Variables
Return type:(dict)
likelihood_ratio_sym(obs, action, dist_info_old, policy_params)

Computes the likelihood ratio p_new(a|s) / p_old(a|s) of the actions, where the new distribution is built from policy_params and the old distribution is given by dist_info_old

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • dist_info_old (dict) – dictionary of tf.placeholders with old policy information
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:likelihood ratio
Return type:(tf.Tensor)

load_params(policy_params)[source]
Parameters:policy_params (ndarray) – array of policy parameters for each task
log_diagnostics(paths, prefix='')[source]

Log extra information per iteration based on the collected paths

log_likelihood_sym(obs, action, policy_params)

Computes the log-likelihood log p(a|s) of the actions under the policy

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:log likelihood
Return type:(tf.Tensor)

set_params(policy_params)

Sets the parameters for the graph

Parameters:policy_params (dict) – dictionary of variable names and corresponding parameter values
class meta_policy_search.policies.MetaGaussianMLPPolicy(meta_batch_size, *args, **kwargs)[source]

Bases: meta_policy_search.policies.gaussian_mlp_policy.GaussianMLPPolicy, meta_policy_search.policies.base.MetaPolicy

build_graph()[source]

Builds computational graph for policy

distribution

Returns this policy’s distribution

Returns:this policy’s distribution
Return type:(Distribution)
distribution_info_keys(obs, state_infos)
Parameters:
  • obs (placeholder) – symbolic variable for observations
  • state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

distribution_info_sym(obs_var, params=None)

Return the symbolic distribution information about the actions.

Parameters:
  • obs_var (placeholder) – symbolic variable for observations
  • params (dict) – a dictionary of placeholders or vars with the parameters of the MLP
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

get_action(observation, task=0)[source]

Runs a single observation through the specified policy and samples an action

Parameters:observation (ndarray) – single observation - shape: (obs_dim,)
Returns:single action - shape: (action_dim,)
Return type:(ndarray)
get_actions(observations)[source]
Parameters:observations (list) – List of numpy arrays of shape (meta_batch_size, batch_size, obs_dim)
Returns:A tuple containing a list of numpy arrays of sampled actions and a list of lists of agent_info dicts
Return type:(tuple)
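
A meta-batched sampling sketch (assumes a built MetaGaussianMLPPolicy with an active session, and interprets the documented shape as a list of length meta_batch_size containing arrays of shape (batch_size, obs_dim)):

    import numpy as np

    meta_batch_size, batch_size, obs_dim = 2, 10, 11

    # one observation batch per task
    observations = [np.zeros((batch_size, obs_dim)) for _ in range(meta_batch_size)]

    actions, agent_infos = meta_policy.get_actions(observations)
    # actions: list of length meta_batch_size with one array of sampled actions per task
    # agent_infos: list of lists of dicts with per-step policy information
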
get_param_values()

Gets a list of all the current weights in the network (in the original code this list is flattened)

Returns:list of values for parameters
Return type:(list)
get_params()

Get the tf.Variables representing the trainable weights of the network (symbolic)

Returns:a dict of all trainable Variables
Return type:(dict)
likelihood_ratio_sym(obs, action, dist_info_old, policy_params)

Computes the likelihood ratio p_new(a|s) / p_old(a|s) of the actions, where the new distribution is built from policy_params and the old distribution is given by dist_info_old

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • dist_info_old (dict) – dictionary of tf.placeholders with old policy information
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:likelihood ratio
Return type:(tf.Tensor)

load_params(policy_params)
Parameters:policy_params (ndarray) – array of policy parameters for each task
log_diagnostics(paths, prefix='')

Log extra information per iteration based on the collected paths

log_likelihood_sym(obs, action, policy_params)

Computes the log-likelihood log p(a|s) of the actions under the policy

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:log likelihood
Return type:(tf.Tensor)

policies_params_feed_dict

Returns a fully prepared feed dict for feeding the currently saved policy parameter values into the lightweight policy graph

set_params(policy_params)

Sets the parameters for the graph

Parameters:policy_params (dict) – dictionary of variable names and corresponding parameter values
switch_to_pre_update()

Switches get_action to pre-update policy

update_task_parameters(updated_policies_parameters)
Parameters:
  • updated_policies_parameters (list) – list of size meta_batch_size; each element is a dict with the policy parameters as numpy arrays
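
To tie the meta-policy methods together, the following is a hypothetical pre-/post-update workflow sketch. The construction arguments and the source of the adapted parameters are illustrative; in the actual training loop the adapted parameters come from the meta-learning algorithm's inner-loop updates and the observations from the samplers:

    import numpy as np
    import tensorflow as tf
    from meta_policy_search.policies import MetaGaussianMLPPolicy

    meta_batch_size, obs_dim, action_dim = 2, 11, 3

    meta_policy = MetaGaussianMLPPolicy(
        meta_batch_size=meta_batch_size,
        obs_dim=obs_dim,
        action_dim=action_dim,
        name="meta_gaussian_mlp_policy",
        hidden_sizes=(64, 64),
    )

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # 1) sample with the shared pre-update policy
        meta_policy.switch_to_pre_update()
        obs_per_task = [np.zeros((10, obs_dim)) for _ in range(meta_batch_size)]
        pre_actions, pre_infos = meta_policy.get_actions(obs_per_task)

        # 2) after inner-loop adaptation, store the per-task post-update parameters
        #    (here the current weights are reused purely for illustration)
        adapted_params = [meta_policy.get_param_values() for _ in range(meta_batch_size)]
        meta_policy.update_task_parameters(adapted_params)

        # 3) sample again, now with each task's post-update parameters
        post_actions, post_infos = meta_policy.get_actions(obs_per_task)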