Policies¶
Policy Interfaces¶
-
class meta_policy_search.policies.Policy(obs_dim, action_dim, name='policy', hidden_sizes=(32, 32), learn_std=True, hidden_nonlinearity=<function tanh>, output_nonlinearity=None, **kwargs)[source]¶
Bases: meta_policy_search.utils.serializable.Serializable
A container for storing the current pre- and post-update policies. It also provides functions for executing and updating policy parameters.
Note
The pre-update policy is stored as tf.Variables, while the post-update policy is stored in numpy arrays and executed through tf.placeholders.
Parameters: - obs_dim (int) – dimensionality of the observation space; specifies the input size of the policy
- action_dim (int) – dimensionality of the action space; specifies the output size of the policy
- name (str) – name used for scoping the policy's variables
- hidden_sizes (tuple) – sizes of the hidden layers of the network
- learn_std (bool) – whether to learn the variance of the network output
- hidden_nonlinearity (Operation) – nonlinearity used between the hidden layers of the network
- output_nonlinearity (Operation) – nonlinearity applied after the final layer of the network
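To make the constructor arguments concrete, the following is a minimal numpy sketch (not the library's actual TensorFlow implementation) of the forward pass such a network performs; the dimensions obs_dim=4 and action_dim=2 are hypothetical:

```python
import numpy as np

def mlp_forward(obs, weights, biases, hidden_nonlinearity=np.tanh,
                output_nonlinearity=None):
    """Plain MLP forward pass: every hidden layer applies the hidden
    nonlinearity; the final layer applies output_nonlinearity (or identity)."""
    h = obs
    for W, b in zip(weights[:-1], biases[:-1]):
        h = hidden_nonlinearity(h @ W + b)
    out = h @ weights[-1] + biases[-1]
    return output_nonlinearity(out) if output_nonlinearity else out

# hypothetical dimensions: obs_dim=4, hidden_sizes=(32, 32), action_dim=2
rng = np.random.default_rng(0)
sizes = [4, 32, 32, 2]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
action_mean = mlp_forward(rng.normal(size=4), weights, biases)
print(action_mean.shape)  # (2,) -- the output size equals action_dim
```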
-
distribution
¶ Returns this policy’s distribution
Returns: this policy’s distribution Return type: (Distribution)
-
distribution_info_keys
(obs, state_infos)[source]¶ Parameters: - obs (placeholder) – symbolic variable for observations
- state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns: a dictionary of tf placeholders for the policy output distribution
Return type: (dict)
-
distribution_info_sym
(obs_var, params=None)[source]¶ Return the symbolic distribution information about the actions.
Parameters: - obs_var (placeholder) – symbolic variable for observations
- params (None or dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns: a dictionary of tf placeholders for the policy output distribution
Return type: (dict)
-
get_action
(observation)[source]¶ Runs a single observation through the specified policy
Parameters: observation (array) – single observation Returns: array of arrays of actions for each env Return type: (array)
-
get_actions
(observations)[source]¶ Runs each set of observations through each task specific policy
Parameters: observations (array) – array of arrays of observations generated by each task and env
Returns: an array of arrays of actions for each env, shape (meta_batch_size) x (batch_size) x (action_dim), together with an array of arrays of agent_info dicts
Return type: (tuple)
-
get_param_values
()[source]¶ Gets a list of all the current weights in the network
Returns: list of values for parameters Return type: (list)
-
get_params
()[source]¶ Get the tf.Variables representing the trainable weights of the network (symbolic)
Returns: a dict of all trainable Variables Return type: (dict)
-
likelihood_ratio_sym
(obs, action, dist_info_old, policy_params)[source]¶ Computes the likelihood ratio p_new(act|obs) / p_old(act|obs) between the new and the old policy
Parameters: - obs (tf.Tensor) – symbolic variable for observations
- action (tf.Tensor) – symbolic variable for actions
- dist_info_old (dict) – dictionary of tf.placeholders with old policy information
- policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns: likelihood ratio
Return type: (tf.Tensor)
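Assuming a diagonal Gaussian policy (the case used throughout this package), the ratio can be computed in log space, exp(log p_new(act|obs) - log p_old(act|obs)). A numpy sketch of that computation (the library builds it symbolically in TensorFlow):

```python
import numpy as np

def gaussian_log_likelihood(action, mean, log_std):
    """Log density of a diagonal Gaussian, summed over action dimensions."""
    z = (action - mean) / np.exp(log_std)
    return -0.5 * np.sum(z ** 2 + 2 * log_std + np.log(2 * np.pi), axis=-1)

def likelihood_ratio(action, mean_new, log_std_new, mean_old, log_std_old):
    """p_new(act|obs) / p_old(act|obs), computed in log space for stability."""
    return np.exp(gaussian_log_likelihood(action, mean_new, log_std_new)
                  - gaussian_log_likelihood(action, mean_old, log_std_old))

action = np.array([0.5, -0.2])
mean = np.zeros(2)
log_std = np.zeros(2)
# identical old and new distributions -> the ratio is exactly 1
print(likelihood_ratio(action, mean, log_std, mean, log_std))  # 1.0
```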
-
log_likelihood_sym
(obs, action, policy_params)[source]¶ Computes the log likelihood log p(act|obs)
Parameters: - obs (tf.Tensor) – symbolic variable for observations
- action (tf.Tensor) – symbolic variable for actions
- policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns: log likelihood
Return type: (tf.Tensor)
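For a diagonal Gaussian with mean μ and standard deviation σ, the log likelihood has the closed form log p(act|obs) = -1/2 Σ_i [((a_i - μ_i)/σ_i)² + 2 log σ_i + log 2π]. A numpy sketch, assuming that parameterization:

```python
import numpy as np

def gaussian_log_likelihood(action, mean, log_std):
    """Closed-form log density of a diagonal Gaussian policy,
    summed over action dimensions."""
    z = (action - mean) / np.exp(log_std)
    return -0.5 * np.sum(z ** 2 + 2 * log_std + np.log(2 * np.pi), axis=-1)

# sanity check: a standard normal evaluated at its mean gives
# log p = -d/2 * log(2*pi)
d = 2
ll = gaussian_log_likelihood(np.zeros(d), np.zeros(d), np.zeros(d))
print(np.isclose(ll, -0.5 * d * np.log(2 * np.pi)))  # True
```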
-
class meta_policy_search.policies.MetaPolicy(*args, **kwargs)[source]¶
Bases: meta_policy_search.policies.base.Policy
-
distribution
¶ Returns this policy’s distribution
Returns: this policy’s distribution Return type: (Distribution)
-
distribution_info_keys
(obs, state_infos)¶ Parameters: - obs (placeholder) – symbolic variable for observations
- state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns: a dictionary of tf placeholders for the policy output distribution
Return type: (dict)
-
distribution_info_sym
(obs_var, params=None)¶ Return the symbolic distribution information about the actions.
Parameters: - obs_var (placeholder) – symbolic variable for observations
- params (None or dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns: a dictionary of tf placeholders for the policy output distribution
Return type: (dict)
-
get_action
(observation)¶ Runs a single observation through the specified policy
Parameters: observation (array) – single observation Returns: array of arrays of actions for each env Return type: (array)
-
get_actions
(observations)[source]¶ Runs each set of observations through each task specific policy
Parameters: observations (array) – array of arrays of observations generated by each task and env
Returns: an array of arrays of actions for each env, shape (meta_batch_size) x (batch_size) x (action_dim), together with an array of arrays of agent_info dicts
Return type: (tuple)
-
get_param_values
()¶ Gets a list of all the current weights in the network
Returns: list of values for parameters Return type: (list)
-
get_params
()¶ Get the tf.Variables representing the trainable weights of the network (symbolic)
Returns: a dict of all trainable Variables Return type: (dict)
-
likelihood_ratio_sym
(obs, action, dist_info_old, policy_params)¶ Computes the likelihood ratio p_new(act|obs) / p_old(act|obs) between the new and the old policy
Parameters: - obs (tf.Tensor) – symbolic variable for observations
- action (tf.Tensor) – symbolic variable for actions
- dist_info_old (dict) – dictionary of tf.placeholders with old policy information
- policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns: likelihood ratio
Return type: (tf.Tensor)
-
log_diagnostics
(paths)¶ Log extra information per iteration based on the collected paths
-
log_likelihood_sym
(obs, action, policy_params)¶ Computes the log likelihood log p(act|obs)
Parameters: - obs (tf.Tensor) – symbolic variable for observations
- action (tf.Tensor) – symbolic variable for actions
- policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns: log likelihood
Return type: (tf.Tensor)
-
policies_params_feed_dict
¶ Returns a fully prepared feed dict for feeding the currently saved policy parameter values into the lightweight policy graph
-
set_params
(policy_params)¶ Sets the parameters for the graph
Parameters: policy_params (dict) – dict of variable names and corresponding parameter values
-
Gaussian-Policies¶
-
class meta_policy_search.policies.GaussianMLPPolicy(*args, init_std=1.0, min_std=1e-06, **kwargs)[source]¶
Bases: meta_policy_search.policies.base.Policy
Gaussian multi-layer perceptron policy (diagonal covariance matrix). Provides functions for executing and updating policy parameters, and serves as a container for storing the current pre- and post-update policies.
Parameters: - obs_dim (int) – dimensionality of the observation space -> specifies the input size of the policy
- action_dim (int) – dimensionality of the action space -> specifies the output size of the policy
- name (str) – name of the policy used as tf variable scope
- hidden_sizes (tuple) – tuple of integers specifying the hidden layer sizes of the MLP
- hidden_nonlinearity (tf.op) – nonlinearity function of the hidden layers
- output_nonlinearity (tf.op or None) – nonlinearity function of the output layer
- learn_std (boolean) – whether the standard deviation / variance is a trainable or fixed variable
- init_std (float) – initial policy standard deviation
- min_std (float) – minimal policy standard deviation
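A minimal numpy sketch of how a policy with these parameters could sample an action; flooring the log-std at log(min_std) is one plausible way to enforce the minimal standard deviation (the actual TensorFlow implementation may differ):

```python
import numpy as np

def sample_action(mean, log_std_param, min_std=1e-6, rng=None):
    """Sample from a diagonal Gaussian whose std is bounded below.
    Clipping log_std at log(min_std) keeps the variance from collapsing."""
    if rng is None:
        rng = np.random.default_rng()
    log_std = np.maximum(log_std_param, np.log(min_std))
    std = np.exp(log_std)
    return mean + std * rng.standard_normal(mean.shape)

mean = np.array([0.0, 1.0])          # hypothetical network output, action_dim=2
log_std = np.log(np.full(2, 1.0))    # init_std = 1.0
action = sample_action(mean, log_std, rng=np.random.default_rng(0))
print(action.shape)  # (2,)
```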
-
distribution
¶ Returns this policy’s distribution
Returns: this policy’s distribution Return type: (Distribution)
-
distribution_info_keys
(obs, state_infos)[source]¶ Parameters: - obs (placeholder) – symbolic variable for observations
- state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns: a dictionary of tf placeholders for the policy output distribution
Return type: (dict)
-
distribution_info_sym
(obs_var, params=None)[source]¶ Return the symbolic distribution information about the actions.
Parameters: - obs_var (placeholder) – symbolic variable for observations
- params (dict) – a dictionary of placeholders or vars with the parameters of the MLP
Returns: a dictionary of tf placeholders for the policy output distribution
Return type: (dict)
-
get_action
(observation)[source]¶ Runs a single observation through the specified policy and samples an action
Parameters: observation (ndarray) – single observation - shape: (obs_dim,) Returns: single action - shape: (action_dim,) Return type: (ndarray)
-
get_actions
(observations)[source]¶ Runs each set of observations through each task specific policy
Parameters: observations (ndarray) – array of observations - shape: (batch_size, obs_dim) Returns: array of sampled actions - shape: (batch_size, action_dim) Return type: (ndarray)
-
get_param_values
()¶ Gets a list of all the current weights in the network
Returns: list of values for parameters Return type: (list)
-
get_params
()¶ Get the tf.Variables representing the trainable weights of the network (symbolic)
Returns: a dict of all trainable Variables Return type: (dict)
-
likelihood_ratio_sym
(obs, action, dist_info_old, policy_params)¶ Computes the likelihood ratio p_new(act|obs) / p_old(act|obs) between the new and the old policy
Parameters: - obs (tf.Tensor) – symbolic variable for observations
- action (tf.Tensor) – symbolic variable for actions
- dist_info_old (dict) – dictionary of tf.placeholders with old policy information
- policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns: likelihood ratio
Return type: (tf.Tensor)
-
load_params
(policy_params)[source]¶ Parameters: policy_params (ndarray) – array of policy parameters for each task
-
log_diagnostics
(paths, prefix='')[source]¶ Log extra information per iteration based on the collected paths
-
log_likelihood_sym
(obs, action, policy_params)¶ Computes the log likelihood log p(act|obs)
Parameters: - obs (tf.Tensor) – symbolic variable for observations
- action (tf.Tensor) – symbolic variable for actions
- policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns: log likelihood
Return type: (tf.Tensor)
-
set_params
(policy_params)¶ Sets the parameters for the graph
Parameters: policy_params (dict) – dict of variable names and corresponding parameter values
-
class meta_policy_search.policies.MetaGaussianMLPPolicy(meta_batch_size, *args, **kwargs)[source]¶
Bases: meta_policy_search.policies.gaussian_mlp_policy.GaussianMLPPolicy, meta_policy_search.policies.base.MetaPolicy
-
distribution
¶ Returns this policy’s distribution
Returns: this policy’s distribution Return type: (Distribution)
-
distribution_info_keys
(obs, state_infos)¶ Parameters: - obs (placeholder) – symbolic variable for observations
- state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns: a dictionary of tf placeholders for the policy output distribution
Return type: (dict)
-
distribution_info_sym
(obs_var, params=None)¶ Return the symbolic distribution information about the actions.
Parameters: - obs_var (placeholder) – symbolic variable for observations
- params (dict) – a dictionary of placeholders or vars with the parameters of the MLP
Returns: a dictionary of tf placeholders for the policy output distribution
Return type: (dict)
-
get_action
(observation, task=0)[source]¶ Runs a single observation through the specified policy and samples an action
Parameters: observation (ndarray) – single observation - shape: (obs_dim,) Returns: single action - shape: (action_dim,) Return type: (ndarray)
-
get_actions
(observations)[source]¶ Parameters: observations (list) – List of numpy arrays of shape (meta_batch_size, batch_size, obs_dim) Returns: A tuple containing a list of numpy arrays of action, and a list of list of dicts of agent infos Return type: (tuple)
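To make the meta-batched shapes concrete, here is a hypothetical numpy sketch: observations arrive as a list of length meta_batch_size, each entry is routed through its task-specific policy (stubbed here as a fixed linear map), and actions and agent infos come back as parallel per-task lists:

```python
import numpy as np

def get_actions_sketch(observations, task_policies):
    """observations: list of length meta_batch_size, each entry of shape
    (batch_size, obs_dim). Each entry is routed to its task-specific policy;
    outputs are collected into parallel per-task lists."""
    actions, agent_infos = [], []
    for obs_batch, policy in zip(observations, task_policies):
        mean = policy(obs_batch)                        # (batch_size, action_dim)
        actions.append(mean)                            # deterministic sketch
        agent_infos.append([{"mean": m} for m in mean]) # one dict per sample
    return actions, agent_infos

meta_batch_size, batch_size, obs_dim, action_dim = 3, 5, 4, 2
obs = [np.zeros((batch_size, obs_dim)) for _ in range(meta_batch_size)]
# hypothetical task-specific policies: one fixed linear map per task
policies = [lambda o, W=np.ones((obs_dim, action_dim)) * k: o @ W
            for k in range(meta_batch_size)]
actions, infos = get_actions_sketch(obs, policies)
print(len(actions), actions[0].shape)  # 3 (5, 2)
```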
-
get_param_values
()¶ Gets a list of all the current weights in the network
Returns: list of values for parameters Return type: (list)
-
get_params
()¶ Get the tf.Variables representing the trainable weights of the network (symbolic)
Returns: a dict of all trainable Variables Return type: (dict)
-
likelihood_ratio_sym
(obs, action, dist_info_old, policy_params)¶ Computes the likelihood ratio p_new(act|obs) / p_old(act|obs) between the new and the old policy
Parameters: - obs (tf.Tensor) – symbolic variable for observations
- action (tf.Tensor) – symbolic variable for actions
- dist_info_old (dict) – dictionary of tf.placeholders with old policy information
- policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns: likelihood ratio
Return type: (tf.Tensor)
-
load_params
(policy_params)¶ Parameters: policy_params (ndarray) – array of policy parameters for each task
-
log_diagnostics
(paths, prefix='')¶ Log extra information per iteration based on the collected paths
-
log_likelihood_sym
(obs, action, policy_params)¶ Computes the log likelihood log p(act|obs)
Parameters: - obs (tf.Tensor) – symbolic variable for observations
- action (tf.Tensor) – symbolic variable for actions
- policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns: log likelihood
Return type: (tf.Tensor)
-
policies_params_feed_dict
¶ Returns a fully prepared feed dict for feeding the currently saved policy parameter values into the lightweight policy graph
-
set_params
(policy_params)¶ Sets the parameters for the graph
Parameters: policy_params (dict) – dict of variable names and corresponding parameter values
-
switch_to_pre_update
()¶ Switches get_action to pre-update policy
-
update_task_parameters
(updated_policies_parameters)¶ Parameters: updated_policies_parameters (list) – list of size meta_batch_size; each entry contains a dict with the policy's parameters as numpy arrays