Policies

Policy Interfaces

class meta_policy_search.policies.Policy(obs_dim, action_dim, name='policy', hidden_sizes=(32, 32), learn_std=True, hidden_nonlinearity=<function tanh>, output_nonlinearity=None, **kwargs)[source]

Bases: meta_policy_search.utils.serializable.Serializable

A container for storing the current pre- and post-update policies. Also provides functions for executing the policy and updating its parameters.

Note

The pre-update policy is stored as tf.Variables, while the post-update policy is stored in numpy arrays and executed through tf.placeholders.

Parameters:
  • obs_dim (int) – dimensionality of the observation space -> specifies the input size of the policy
  • action_dim (int) – dimensionality of the action space -> specifies the output size of the policy
  • name (str) – Name used for scoping variables in policy
  • hidden_sizes (tuple) – size of hidden layers of network
  • learn_std (bool) – whether to learn variance of network output
  • hidden_nonlinearity (Operation) – nonlinearity used between hidden layers of network
  • output_nonlinearity (Operation) – nonlinearity used after the final layer of network
build_graph()[source]

Builds computational graph for policy

distribution

Returns this policy’s distribution

Returns:this policy’s distribution
Return type:(Distribution)
distribution_info_keys(obs, state_infos)[source]
Parameters:
  • obs (placeholder) – symbolic variable for observations
  • state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

distribution_info_sym(obs_var, params=None)[source]

Return the symbolic distribution information about the actions.

Parameters:
  • obs_var (placeholder) – symbolic variable for observations
  • params (None or dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)
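
A minimal usage sketch of the distribution-info methods, assuming TensorFlow 1.x graph mode, an already-built concrete policy instance `policy`, and that the distribution parameters are Gaussian-style "mean"/"log_std" entries (these names are illustrative, not guaranteed by the interface):

    import tensorflow as tf

    # symbolic batch of observations, shape (batch_size, obs_dim);
    # obs_dim is assumed to match the value passed to the constructor
    obs_ph = tf.placeholder(tf.float32, shape=(None, obs_dim), name="obs")

    # dict of symbolic distribution parameters for the current policy weights,
    # e.g. {"mean": <tf.Tensor>, "log_std": <tf.Tensor>} for a Gaussian policy
    dist_info_sym = policy.distribution_info_sym(obs_ph)

    # the same graph can be rebuilt with explicit parameter tensors
    # (e.g. adapted post-update parameters) by passing them via `params`
    adapted_dist_info_sym = policy.distribution_info_sym(obs_ph, params=policy.get_params())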

get_action(observation)[source]

Runs a single observation through the specified policy

Parameters:observation (array) – single observation
Returns:single action for the given observation
Return type:(array)
get_actions(observations)[source]

Runs each set of observations through each task specific policy

Parameters:observations (array) – array of arrays of observations generated by each task and env
Returns:array of arrays of actions for each env, shape (meta_batch_size, batch_size, action_dim), together with an array of arrays of agent_info dicts
Return type:(tuple)
get_param_values()[source]

Gets a list of all the current weights in the network (in the original code this list is flattened)

Returns:list of values for parameters
Return type:(list)
get_params()[source]

Get the tf.Variables representing the trainable weights of the network (symbolic)

Returns:a dict of all trainable Variables
Return type:(dict)
likelihood_ratio_sym(obs, action, dist_info_old, policy_params)[source]

Computes the likelihood ratio p_new(a|s) / p_old(a|s) of the actions, where the new distribution is built from policy_params and the old distribution is given by dist_info_old

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • dist_info_old (dict) – dictionary of tf.placeholders with old policy information
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:likelihood ratio
Return type:(tf.Tensor)

log_diagnostics(paths)[source]

Log extra information per iteration based on the collected paths

log_likelihood_sym(obs, action, policy_params)[source]

Computes the log-likelihood log p(a|s) of the actions under the policy

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:log likelihood
Return type:(tf.Tensor)
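
For intuition, the quantities computed symbolically by log_likelihood_sym and likelihood_ratio_sym correspond, for a diagonal Gaussian policy, to the closed-form expressions below. This is a plain-numpy sketch of the math only, not the library's TensorFlow implementation:

    import numpy as np

    def gaussian_log_likelihood(actions, mean, log_std):
        """Log-density log p(a|s) of a diagonal Gaussian, summed over action dims."""
        var = np.exp(2 * log_std)
        return np.sum(
            -0.5 * np.log(2 * np.pi) - log_std - 0.5 * (actions - mean) ** 2 / var,
            axis=-1,
        )

    # toy batch: 3 actions of dimension 2
    actions = np.zeros((3, 2))
    mean_old, log_std_old = np.zeros((3, 2)), np.zeros((3, 2))
    mean_new, log_std_new = 0.1 * np.ones((3, 2)), -0.1 * np.ones((3, 2))

    log_p_old = gaussian_log_likelihood(actions, mean_old, log_std_old)
    log_p_new = gaussian_log_likelihood(actions, mean_new, log_std_new)

    # likelihood ratio p_new(a|s) / p_old(a|s), as used in surrogate objectives
    ratio = np.exp(log_p_new - log_p_old)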

set_params(policy_params)[source]

Sets the parameters for the graph

Parameters:policy_params (dict) – dictionary of variable names and corresponding parameter values
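
A hypothetical save/restore round trip using the parameter accessors above (a sketch only; it assumes an already-built policy, an active TensorFlow session, and that the structure returned by get_param_values() is accepted by set_params()):

    # snapshot the current numeric weights of the network
    saved_params = policy.get_param_values()

    # ... gradient steps or other updates that modify the network weights ...

    # write the saved values back into the graph
    policy.set_params(saved_params)
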
class meta_policy_search.policies.MetaPolicy(*args, **kwargs)[source]

Bases: meta_policy_search.policies.base.Policy

build_graph()[source]

Builds the computational graph for the policy; should also create lists of variables and the corresponding assign ops

distribution

Returns this policy’s distribution

Returns:this policy’s distribution
Return type:(Distribution)
distribution_info_keys(obs, state_infos)
Parameters:
  • obs (placeholder) – symbolic variable for observations
  • state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

distribution_info_sym(obs_var, params=None)

Return the symbolic distribution information about the actions.

Parameters:
  • obs_var (placeholder) – symbolic variable for observations
  • params (None or dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

get_action(observation)

Runs a single observation through the specified policy

Parameters:observation (array) – single observation
Returns:single action for the given observation
Return type:(array)
get_actions(observations)[source]

Runs each set of observations through each task specific policy

Parameters:observations (array) – array of arrays of observations generated by each task and env
Returns:array of arrays of actions for each env, shape (meta_batch_size, batch_size, action_dim), together with an array of arrays of agent_info dicts
Return type:(tuple)
get_param_values()

Gets a list of all the current weights in the network (in the original code this list is flattened)

Returns:list of values for parameters
Return type:(list)
get_params()

Get the tf.Variables representing the trainable weights of the network (symbolic)

Returns:a dict of all trainable Variables
Return type:(dict)
likelihood_ratio_sym(obs, action, dist_info_old, policy_params)

Computes the likelihood ratio p_new(a|s) / p_old(a|s) of the actions, where the new distribution is built from policy_params and the old distribution is given by dist_info_old

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • dist_info_old (dict) – dictionary of tf.placeholders with old policy information
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:likelihood ratio
Return type:(tf.Tensor)

log_diagnostics(paths)

Log extra information per iteration based on the collected paths

log_likelihood_sym(obs, action, policy_params)

Computes the log-likelihood log p(a|s) of the actions under the policy

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:log likelihood
Return type:(tf.Tensor)

policies_params_feed_dict

Returns a fully prepared feed dict for feeding the currently saved policy parameter values into the lightweight policy graph
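
A hypothetical sketch of how this feed dict could be combined with an observation feed when evaluating the lightweight policy graph manually; `action_var` and `obs_ph` stand in for the output tensor and observation placeholder of that graph and are not part of the documented interface (in normal use get_actions presumably handles this internally):

    # merge the saved parameter values with the observation feed and run the graph
    feed_dict = {**meta_policy.policies_params_feed_dict, obs_ph: obs_batch}
    sampled_actions = sess.run(action_var, feed_dict=feed_dict)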

set_params(policy_params)

Sets the parameters for the graph

Parameters:policy_params (dict) – dictionary of variable names and corresponding parameter values
switch_to_pre_update()[source]

Switches get_action to pre-update policy

update_task_parameters(updated_policies_parameters)[source]
Parameters:
  • updated_policies_parameters (list) – list of size meta_batch_size; each element is a dict with the policy parameters as numpy arrays
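
The expected argument is one entry per task; a hypothetical sketch of the data structure (the key names and shapes are placeholders, not the library's actual variable names, and `meta_policy` is assumed to be a built meta-policy instance):

    import numpy as np

    meta_batch_size = 2  # number of tasks

    # one dict of adapted (post-update) parameters per task, stored as numpy arrays
    updated_policies_parameters = [
        {
            "mean_network/hidden_0/W": np.zeros((11, 64)),   # placeholder name/shape
            "mean_network/hidden_0/b": np.zeros(64),
        }
        for _ in range(meta_batch_size)
    ]

    meta_policy.update_task_parameters(updated_policies_parameters)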

Gaussian-Policies

class meta_policy_search.policies.GaussianMLPPolicy(*args, init_std=1.0, min_std=1e-06, **kwargs)[source]

Bases: meta_policy_search.policies.base.Policy

Gaussian multi-layer perceptron policy (diagonal covariance matrix). Provides functions for executing the policy and updating its parameters, and serves as a container for storing the current pre- and post-update policies.

Parameters:
  • obs_dim (int) – dimensionality of the observation space -> specifies the input size of the policy
  • action_dim (int) – dimensionality of the action space -> specifies the output size of the policy
  • name (str) – name of the policy used as tf variable scope
  • hidden_sizes (tuple) – tuple of integers specifying the hidden layer sizes of the MLP
  • hidden_nonlinearity (tf.op) – nonlinearity function of the hidden layers
  • output_nonlinearity (tf.op or None) – nonlinearity function of the output layer
  • learn_std (boolean) – whether the standard deviation / variance is trainable or fixed
  • init_std (float) – initial policy standard deviation
  • min_std (float) – minimal policy standard deviation
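
A construction sketch using the arguments listed above (the dimensions and hyperparameters are illustrative, and TensorFlow 1.x is assumed):

    import tensorflow as tf
    from meta_policy_search.policies import GaussianMLPPolicy

    policy = GaussianMLPPolicy(
        obs_dim=11,                      # illustrative observation dimensionality
        action_dim=3,                    # illustrative action dimensionality
        name="gaussian_mlp_policy",
        hidden_sizes=(64, 64),
        hidden_nonlinearity=tf.tanh,
        output_nonlinearity=None,
        learn_std=True,
        init_std=1.0,
        min_std=1e-6,
    )
    # depending on the implementation, build_graph() may need to be called explicitly
    # and the variables initialized in a tf.Session before the policy can be executed
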
build_graph()[source]

Builds computational graph for policy

distribution

Returns this policy’s distribution

Returns:this policy’s distribution
Return type:(Distribution)
distribution_info_keys(obs, state_infos)[source]
Parameters:
  • obs (placeholder) – symbolic variable for observations
  • state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

distribution_info_sym(obs_var, params=None)[source]

Return the symbolic distribution information about the actions.

Parameters:
  • obs_var (placeholder) – symbolic variable for observations
  • params (dict) – a dictionary of placeholders or vars with the parameters of the MLP
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

get_action(observation)[source]

Runs a single observation through the specified policy and samples an action

Parameters:observation (ndarray) – single observation - shape: (obs_dim,)
Returns:single action - shape: (action_dim,)
Return type:(ndarray)
get_actions(observations)[source]

Runs each set of observations through each task specific policy

Parameters:observations (ndarray) – array of observations - shape: (batch_size, obs_dim)
Returns:array of sampled actions - shape: (batch_size, action_dim)
Return type:(ndarray)
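
A sampling sketch illustrating the documented shapes (assumes a built policy with an active session and initialized variables, and the obs_dim/action_dim values from the construction example; whether agent_info dicts are returned alongside the actions may vary, so only the documented return values are shown):

    import numpy as np

    single_obs = np.zeros(11)                   # shape: (obs_dim,)
    action = policy.get_action(single_obs)      # single sampled action, shape: (action_dim,)

    batch_obs = np.zeros((50, 11))              # shape: (batch_size, obs_dim)
    actions = policy.get_actions(batch_obs)     # sampled actions, shape: (batch_size, action_dim)
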
get_param_values()

Gets a list of all the current weights in the network (in the original code this list is flattened)

Returns:list of values for parameters
Return type:(list)
get_params()

Get the tf.Variables representing the trainable weights of the network (symbolic)

Returns:a dict of all trainable Variables
Return type:(dict)
likelihood_ratio_sym(obs, action, dist_info_old, policy_params)

Computes the likelihood ratio p_new(a|s) / p_old(a|s) of the actions, where the new distribution is built from policy_params and the old distribution is given by dist_info_old

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • dist_info_old (dict) – dictionary of tf.placeholders with old policy information
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:likelihood ratio
Return type:(tf.Tensor)

load_params(policy_params)[source]
Parameters:policy_params (ndarray) – array of policy parameters for each task
log_diagnostics(paths, prefix='')[source]

Log extra information per iteration based on the collected paths

log_likelihood_sym(obs, action, policy_params)

Computes the log-likelihood log p(a|s) of the actions under the policy

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:log likelihood
Return type:(tf.Tensor)

set_params(policy_params)

Sets the parameters for the graph

Parameters:policy_params (dict) – dictionary of variable names and corresponding parameter values
class meta_policy_search.policies.MetaGaussianMLPPolicy(meta_batch_size, *args, **kwargs)[source]

Bases: meta_policy_search.policies.gaussian_mlp_policy.GaussianMLPPolicy, meta_policy_search.policies.base.MetaPolicy

build_graph()[source]

Builds computational graph for policy

distribution

Returns this policy’s distribution

Returns:this policy’s distribution
Return type:(Distribution)
distribution_info_keys(obs, state_infos)
Parameters:
  • obs (placeholder) – symbolic variable for observations
  • state_infos (dict) – a dictionary of placeholders that contains information about the state of the policy at the time it received the observation
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

distribution_info_sym(obs_var, params=None)

Return the symbolic distribution information about the actions.

Parameters:
  • obs_var (placeholder) – symbolic variable for observations
  • params (dict) – a dictionary of placeholders or vars with the parameters of the MLP
Returns:a dictionary of tf placeholders for the policy output distribution
Return type:(dict)

get_action(observation, task=0)[source]

Runs a single observation through the specified policy and samples an action

Parameters:observation (ndarray) – single observation - shape: (obs_dim,)
Returns:single action - shape: (action_dim,)
Return type:(ndarray)
get_actions(observations)[source]
Parameters:observations (list) – List of numpy arrays of shape (meta_batch_size, batch_size, obs_dim)
Returns:A tuple containing a list of numpy arrays of sampled actions and a list of lists of agent_info dicts
Return type:(tuple)
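
A meta-batched sampling sketch (assumes a built MetaGaussianMLPPolicy with an active session, and interprets the documented shape as a list of length meta_batch_size containing arrays of shape (batch_size, obs_dim)):

    import numpy as np

    meta_batch_size, batch_size, obs_dim = 2, 10, 11

    # one observation batch per task
    observations = [np.zeros((batch_size, obs_dim)) for _ in range(meta_batch_size)]

    actions, agent_infos = meta_policy.get_actions(observations)
    # actions: list of length meta_batch_size with one array of sampled actions per task
    # agent_infos: list of lists of dicts with per-step policy information
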
get_param_values()

Gets a list of all the current weights in the network (in the original code this list is flattened)

Returns:list of values for parameters
Return type:(list)
get_params()

Get the tf.Variables representing the trainable weights of the network (symbolic)

Returns:a dict of all trainable Variables
Return type:(dict)
likelihood_ratio_sym(obs, action, dist_info_old, policy_params)

Computes the likelihood ratio p_new(a|s) / p_old(a|s) of the actions, where the new distribution is built from policy_params and the old distribution is given by dist_info_old

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • dist_info_old (dict) – dictionary of tf.placeholders with old policy information
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:likelihood ratio
Return type:(tf.Tensor)

load_params(policy_params)
Parameters:policy_params (ndarray) – array of policy parameters for each task
log_diagnostics(paths, prefix='')

Log extra information per iteration based on the collected paths

log_likelihood_sym(obs, action, policy_params)

Computes the log-likelihood log p(a|s) of the actions under the policy

Parameters:
  • obs (tf.Tensor) – symbolic variable for observations
  • action (tf.Tensor) – symbolic variable for actions
  • policy_params (dict) – dictionary of the policy parameters (each value is a tf.Tensor)
Returns:log likelihood
Return type:(tf.Tensor)

policies_params_feed_dict

Returns a fully prepared feed dict for feeding the currently saved policy parameter values into the lightweight policy graph

set_params(policy_params)

Sets the parameters for the graph

Parameters:policy_params (dict) – dictionary of variable names and corresponding parameter values
switch_to_pre_update()

Switches get_action to pre-update policy

update_task_parameters(updated_policies_parameters)
Parameters:
  • updated_policies_parameters (list) – list of size meta_batch_size; each element is a dict with the policy parameters as numpy arrays
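
To tie the meta-policy methods together, the following is a hypothetical pre-/post-update workflow sketch. The construction arguments and the source of the adapted parameters are illustrative; in the actual training loop the adapted parameters come from the meta-learning algorithm's inner-loop updates and the observations from the samplers:

    import numpy as np
    import tensorflow as tf
    from meta_policy_search.policies import MetaGaussianMLPPolicy

    meta_batch_size, obs_dim, action_dim = 2, 11, 3

    meta_policy = MetaGaussianMLPPolicy(
        meta_batch_size=meta_batch_size,
        obs_dim=obs_dim,
        action_dim=action_dim,
        name="meta_gaussian_mlp_policy",
        hidden_sizes=(64, 64),
    )

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # 1) sample with the shared pre-update policy
        meta_policy.switch_to_pre_update()
        obs_per_task = [np.zeros((10, obs_dim)) for _ in range(meta_batch_size)]
        pre_actions, pre_infos = meta_policy.get_actions(obs_per_task)

        # 2) after inner-loop adaptation, store the per-task post-update parameters
        #    (here the current weights are reused purely for illustration)
        adapted_params = [meta_policy.get_param_values() for _ in range(meta_batch_size)]
        meta_policy.update_task_parameters(adapted_params)

        # 3) sample again, now with each task's post-update parameters
        post_actions, post_infos = meta_policy.get_actions(obs_per_task)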