agents

FSVI_Agent

Bases: PBVI_Agent

A particular flavor of the Point-Based Value Iteration (PBVI) agent. The general concept relies on model-based reinforcement learning as described in: Pineau, J., Gordon, G., & Thrun, S. (2003, August). Point-based value iteration: An anytime algorithm for POMDPs. The Forward Search Value Iteration algorithm is described in: Shani, G., Brafman, R. I., & Shimony, S. E. (2007, January). Forward Search Value Iteration for POMDPs.

The training consists of two steps:

  • Expand: Belief points are explored based on some strategy (to be defined by subclasses).

  • Backup: Using the generated belief points, the value function is updated.

The belief points are probability distributions over the state space and are therefore vectors of |S| elements.

Actions are chosen based on a value function. A value function is a set of alpha vectors of dimensionality |S|. Each alpha vector is associated with a single action, but multiple alpha vectors can be associated with the same action. To choose an action at a given belief point, a dot product is taken between each alpha vector and the belief point, and the action associated with the highest result is chosen.
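
As a minimal sketch of this selection rule (assuming NumPy arrays; the names below are illustrative, not part of the library's API):

import numpy as np

# Hypothetical value function: one alpha vector per row, each tagged with an action.
alpha_vectors = np.array([[0.2, 0.7, 0.1],   # associated with action 0
                          [0.5, 0.1, 0.4],   # associated with action 1
                          [0.3, 0.3, 0.3]])  # also associated with action 1
actions = np.array([0, 1, 1])

# A belief point: a probability distribution over the |S| = 3 states.
belief = np.array([0.6, 0.3, 0.1])

# Dot product of every alpha vector with the belief; the action of the best one is chosen.
values = alpha_vectors @ belief
best_action = actions[np.argmax(values)]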

Forward Search exploration concept: It relies on the solution of the fully observable (MDP) problem to guide the exploration of belief points. The agent starts at a random state in the environment and takes steps following the MDP solution while generating belief points along the way. Each time the expand function is called, it generates a new set of belief points, and the update function uses only the latest generated belief points to update the value function.
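
The belief update applied at each exploration step (the b.update(a, o) call in the source below) follows the standard POMDP filtering rule. A rough NumPy sketch, assuming a transition tensor T[s, a, s'] and an observation tensor O[a, s', o] (hypothetical names, not the library's internals):

import numpy as np

def update_belief(b: np.ndarray, a: int, o: int,
                  T: np.ndarray, O: np.ndarray) -> np.ndarray:
    '''Bayes filter: b'(s') is proportional to O[a, s', o] * sum_s T[s, a, s'] * b(s).'''
    b_next = O[a, :, o] * (b @ T[:, a, :])
    return b_next / b_next.sum()  # normalize to a probability distribution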

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e., if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Alternatively, a dict of strings and action vectors where the strings represent the action labels. If none is provided, all unit movement vectors are included by default, and this for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, an exact conversion is used that keeps the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated with the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes it is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

mdp_policy ValueFunction

The solution to the fully observable (MDP) version of the problem.
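
For illustration, constructing such an agent might look as follows (a hedged sketch: the import path is taken from the source path shown below, and 'env' stands for an olfactory Environment built elsewhere):

from olfactory_navigation.agents.fsvi_agent import FSVI_Agent

# 'env' is assumed to be an olfactory_navigation Environment instance built beforehand.
agent = FSVI_Agent(environment=env,
                   threshold=3e-6,
                   seed=12131415)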

Source code in olfactory_navigation/agents/fsvi_agent.py
class FSVI_Agent(PBVI_Agent):
    '''
    A particular flavor of the Point-Based Value Iteration based agent.
    The general concept relies on Model-Based reinforcement learning as described in: Pineau, J., Gordon, G., & Thrun, S. (2003, August). Point-based value iteration: An anytime algorithm for POMDPs
    The Forward Search Value Iteration algorithm is described in: Shani, G., Brafman, R. I., & Shimony, S. E. (2007, January). Forward Search Value Iteration for POMDPs

    The training consist in two steps:

    - Expand: Where belief points are explored based on the some strategy (to be defined by subclasses).

    - Backup: Using the generated belief points, the value function is updated.

    The belief points are probability distributions over the state space and are therefore vectors of |S| elements.

    Actions are chosen based on a value function. A value function is a set of alpha vectors of dimentionality |S|.
    Each alpha vector is associated to a single action but multiple alpha vectors can be associated to the same action.
    To choose an action at a given belief point, a dot product is taken between each alpha vector and the belief point and the action associated with the highest result is chosen.

    Forward Search exploration concept:
    It relies of the solution of the Fully-Observable (MDP) problem to guide the exploration of belief points.
    It makes an agent start randomly in the environment and makes him take steps following the MDP solution while generating belief points along the way.
    Each time the expand function is called it starts generated a new set of belief points and the update function uses only the latest generated belief points to make update the value function.

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of threshold is provided, he agent should be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of action available to the agent. It should match the type of environment (ie: if the environment has layers, it should contain a layer component to the action vector, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included and shuch for all layers (if the environment has layers.)
    name : str, optional
        A custom name to give the agent. If not provided is will be a combination of the class-name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact convertion that keeps the shape of the environment to make the amount of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed of the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducability).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    mdp_policy : ValueFunction
        The solution to the fully version of the problem.
    '''
    # FSVI special attribute
    mdp_policy = None

    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int,
               mdp_policy: ValueFunction
               ) -> BeliefSet:
        '''
        Function implementing the exploration process using the MDP policy in order to generate a sequence of Beliefs following the the Forward Search Value Iteration principles.
        It is a loop is started by a initial state 's' and using the MDP policy, chooses the best action to take.
        Following this, a random next state 's_p' is being sampled from the transition probabilities and a random observation 'o' based on the observation probabilities.
        Then the given belief is updated using the chosen action and the observation received and the updated belief is added to the sequence.
        Once the state is a goal state, the loop is done and the belief sequence is returned.

        Parameters
        ----------
        belief_set : BeliefSet
            A belief set containing a single belief to start the sequence with.
            A random state will be chosen based on the probability distribution of the belief.
        value_function : ValueFunction
            The current value function. (NOT USED)
        max_generation : int
            How many beliefs to be generated at most.
        mdp_policy : ValueFunction
            The mdp policy used to choose the action from with the given state 's'.

        Returns
        -------
        belief_set : BeliefSet
            A new sequence of beliefs.
        '''
        # GPU support
        xp = np if not self.on_gpu else cp
        model = self.model

        # Getting initial belief
        b0 = belief_set.belief_list[0]
        belief_list = [b0]

        # Choose a random starting state
        s = b0.random_state()

        # Setting the working belief
        b = b0

        for _ in range(max_generation - 1): #-1 due to a one belief already being present in the set
            # Choose action based on mdp value function
            a_star = xp.argmax(mdp_policy.alpha_vector_array[:,s])

            # Pick a random next state (weighted by transition probabilities)
            s_p = model.transition(s, a_star)

            # Pick a random observation weighted by observation probabilities in state s_p and after having done action a_star
            o = model.observe(s_p, a_star)

            # Generate a new belief based on a_star and o
            b_p = b.update(a_star, o)

            # Record new belief
            belief_list.append(b_p)

            # Updating s and b
            s = s_p
            b = b_p

            # Reset and belief if end state is reached
            if s in model.end_states:
                s = b0.random_state()
                b = b0

        return BeliefSet(model, belief_list)


    def train(self,
              expansions: int,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              mdp_policy: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists in 2 steps, Backup and Expand.
        1. Expand: Expands the belief set base with a expansion strategy given by the parameter expand_function
        2. Backup: Updates the alpha vectors based on the current belief set

        Foward Search Value Iteration:
        - By default it performs the backup only on set of beliefs generated by the expand function. (so it full_backup=False)

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        mdp_policy : ValueFunction, optional
            The MDP solution to guide the expand process.
            If it is not provided, the Value Iteration for the MDP version of the problem will be run. (using the same gamma and eps as set here; horizon=1000)
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed changed for the value function.
            Bellow the amound of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        if mdp_policy is not None:
            self.mdp_policy = mdp_policy
        elif (self.mdp_policy is None) or overwrite_training:
            log('MDP_policy, not provided. Solving MDP with Value Iteration...')
            self.mdp_policy, hist = vi_solver.solve(model = self.model,
                                                    horizon = 1000,
                                                    initial_value_function = initial_value_function,
                                                    gamma = gamma,
                                                    eps = eps,
                                                    use_gpu = use_gpu,
                                                    history_tracking_level = 1,
                                                    print_progress = print_progress)

            if print_stats:
                print(hist.summary)

        return super().train(expansions = expansions,
                             full_backup = False,
                             update_passes = update_passes,
                             max_belief_growth = max_belief_growth,
                             initial_belief = initial_belief,
                             initial_value_function = initial_value_function,
                             prune_level = prune_level,
                             prune_interval = prune_interval,
                             limit_value_function_size = limit_value_function_size,
                             gamma = gamma,
                             eps = eps,
                             use_gpu = use_gpu,
                             history_tracking_level = history_tracking_level,
                             overwrite_training = overwrite_training,
                             print_progress = print_progress,
                             print_stats = print_stats,
                             mdp_policy = self.mdp_policy)

expand(belief_set, value_function, max_generation, mdp_policy)

Function implementing the exploration process using the MDP policy in order to generate a sequence of beliefs following the Forward Search Value Iteration principles. A loop is started from an initial state 's' and, using the MDP policy, the best action to take is chosen. Following this, a random next state 's_p' is sampled from the transition probabilities and a random observation 'o' from the observation probabilities. The given belief is then updated using the chosen action and the received observation, and the updated belief is added to the sequence. When a goal state is reached, the state and belief are reset to the initial belief, and generation continues until 'max_generation' beliefs have been produced.

Parameters:

Name Type Description Default
belief_set BeliefSet

A belief set containing a single belief to start the sequence with. A random state will be chosen based on the probability distribution of the belief.

required
value_function ValueFunction

The current value function. (NOT USED)

required
max_generation int

The maximum number of beliefs to generate.

required
mdp_policy ValueFunction

The MDP policy used to choose the action for the given state 's'.

required

Returns:

Name Type Description
belief_set BeliefSet

A new sequence of beliefs.

Source code in olfactory_navigation/agents/fsvi_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int,
           mdp_policy: ValueFunction
           ) -> BeliefSet:
    '''
    Function implementing the exploration process using the MDP policy in order to generate a sequence of Beliefs following the the Forward Search Value Iteration principles.
    It is a loop is started by a initial state 's' and using the MDP policy, chooses the best action to take.
    Following this, a random next state 's_p' is being sampled from the transition probabilities and a random observation 'o' based on the observation probabilities.
    Then the given belief is updated using the chosen action and the observation received and the updated belief is added to the sequence.
    Once the state is a goal state, the loop is done and the belief sequence is returned.

    Parameters
    ----------
    belief_set : BeliefSet
        A belief set containing a single belief to start the sequence with.
        A random state will be chosen based on the probability distribution of the belief.
    value_function : ValueFunction
        The current value function. (NOT USED)
    max_generation : int
        How many beliefs to be generated at most.
    mdp_policy : ValueFunction
        The mdp policy used to choose the action from with the given state 's'.

    Returns
    -------
    belief_set : BeliefSet
        A new sequence of beliefs.
    '''
    # GPU support
    xp = np if not self.on_gpu else cp
    model = self.model

    # Getting initial belief
    b0 = belief_set.belief_list[0]
    belief_list = [b0]

    # Choose a random starting state
    s = b0.random_state()

    # Setting the working belief
    b = b0

    for _ in range(max_generation - 1): #-1 due to a one belief already being present in the set
        # Choose action based on mdp value function
        a_star = xp.argmax(mdp_policy.alpha_vector_array[:,s])

        # Pick a random next state (weighted by transition probabilities)
        s_p = model.transition(s, a_star)

        # Pick a random observation weighted by observation probabilities in state s_p and after having done action a_star
        o = model.observe(s_p, a_star)

        # Generate a new belief based on a_star and o
        b_p = b.update(a_star, o)

        # Record new belief
        belief_list.append(b_p)

        # Updating s and b
        s = s_p
        b = b_p

        # Reset and belief if end state is reached
        if s in model.end_states:
            s = b0.random_state()
            b = b0

    return BeliefSet(model, belief_list)

train(expansions, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, mdp_policy=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup. 1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function. 2. Backup: Updates the alpha vectors based on the current belief set.

Forward Search Value Iteration: - By default, the backup is performed only on the set of beliefs generated by the expand function (i.e., full_backup=False).

Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)

required
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
mdp_policy ValueFunction

The MDP solution to guide the expand process. If it is not provided, the Value Iteration for the MDP version of the problem will be run. (using the same gamma and eps as set here; horizon=1000)

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function. If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and belief sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.
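
A hedged usage sketch of this call (argument values are illustrative; 'agent' is an FSVI_Agent constructed as shown earlier on this page):

history = agent.train(expansions=100,
                      update_passes=1,
                      max_belief_growth=10,
                      gamma=0.99,
                      eps=1e-6,
                      use_gpu=False,
                      print_progress=True,
                      print_stats=True)

# A precomputed MDP policy can also be passed to skip the internal Value Iteration step:
# history = agent.train(expansions=100, mdp_policy=my_mdp_policy)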

Source code in olfactory_navigation/agents/fsvi_agent.py
def train(self,
          expansions: int,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          mdp_policy: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists in 2 steps, Backup and Expand.
    1. Expand: Expands the belief set base with a expansion strategy given by the parameter expand_function
    2. Backup: Updates the alpha vectors based on the current belief set

    Foward Search Value Iteration:
    - By default it performs the backup only on set of beliefs generated by the expand function. (so it full_backup=False)

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    mdp_policy : ValueFunction, optional
        The MDP solution to guide the expand process.
        If it is not provided, the Value Iteration for the MDP version of the problem will be run. (using the same gamma and eps as set here; horizon=1000)
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed changed for the value function.
        Bellow the amound of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    if mdp_policy is not None:
        self.mdp_policy = mdp_policy
    elif (self.mdp_policy is None) or overwrite_training:
        log('MDP_policy, not provided. Solving MDP with Value Iteration...')
        self.mdp_policy, hist = vi_solver.solve(model = self.model,
                                                horizon = 1000,
                                                initial_value_function = initial_value_function,
                                                gamma = gamma,
                                                eps = eps,
                                                use_gpu = use_gpu,
                                                history_tracking_level = 1,
                                                print_progress = print_progress)

        if print_stats:
            print(hist.summary)

    return super().train(expansions = expansions,
                         full_backup = False,
                         update_passes = update_passes,
                         max_belief_growth = max_belief_growth,
                         initial_belief = initial_belief,
                         initial_value_function = initial_value_function,
                         prune_level = prune_level,
                         prune_interval = prune_interval,
                         limit_value_function_size = limit_value_function_size,
                         gamma = gamma,
                         eps = eps,
                         use_gpu = use_gpu,
                         history_tracking_level = history_tracking_level,
                         overwrite_training = overwrite_training,
                         print_progress = print_progress,
                         print_stats = print_stats,
                         mdp_policy = self.mdp_policy)

HSVI_Agent

Bases: PBVI_Agent

A flavor of the PBVI Agent.

TODO: Document the HSVI agent

TODO: FIX HSVI expand

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e., if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Alternatively, a dict of strings and action vectors where the strings represent the action labels. If none is provided, all unit movement vectors are included by default, and this for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, an exact conversion is used that keeps the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated with the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes it is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

Source code in olfactory_navigation/agents/hsvi_agent.py
class HSVI_Agent(PBVI_Agent):
    '''
    A flavor of the PBVI Agent. 

    # TODO: Do document of HSVI agent
    # TODO: FIX HSVI expand

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of threshold is provided, he agent should be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of action available to the agent. It should match the type of environment (ie: if the environment has layers, it should contain a layer component to the action vector, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included and shuch for all layers (if the environment has layers.)
    name : str, optional
        A custom name to give the agent. If not provided is will be a combination of the class-name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact convertion that keeps the shape of the environment to make the amount of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed of the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducability).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int
               ) -> BeliefSet:
        '''
        The expand function of the  Heuristic Search Value Iteration (HSVI) technique.
        It is a redursive function attempting to minimize the bound between the upper and lower estimations of the value function.

        It is developped by Smith T. and Simmons R. and described in the paper "Heuristic Search Value Iteration for POMDPs".

        Parameters
        ----------
        belief_set : BeliefSet
            List of beliefs to expand on.
        value_function : ValueFunction
            The current value function. Used to compute the value at belief points.
        max_generation : int, default=10
            The max amount of beliefs that can be added to the belief set at once.

        Returns
        -------
        belief_set : BeliefSet
            A new sequence of beliefs.
        '''
        # GPU support
        xp = np if not self.on_gpu else cp
        model = self.model

        if conv_term is None:
            conv_term = self.eps

        # Update convergence term
        conv_term /= self.gamma

        # Find best a based on upper bound v
        max_qv = -xp.inf
        best_a = -1
        for a in model.actions:
            b_probs = xp.einsum('sor,s->o', model.reachable_transitional_observation_table[:,a,:,:], b.values)

            b_prob_val = 0
            for o in model.observations:
                b_prob_val += (b_probs[o] * upper_bound_belief_value_map.evaluate(b.update(a,o)))

            qva = float(xp.dot(model.expected_rewards_table[:,a], b.values) + (self.gamma * b_prob_val))

            # qva = upper_bound_belief_value_map.qva(b, a, gamma=self.gamma)
            if qva > max_qv:
                max_qv = qva
                best_a = a

        # Choose o that max gap between bounds
        b_probs = xp.einsum('sor,s->o', model.reachable_transitional_observation_table[:,best_a,:,:], b.values)

        max_o_val = -xp.inf
        best_v_diff = -xp.inf
        next_b = b

        for o in model.observations:
            bao = b.update(best_a, o)

            upper_v_bao = upper_bound_belief_value_map.evaluate(bao)
            lower_v_bao = xp.max(xp.dot(value_function.alpha_vector_array, bao.values))

            v_diff = (upper_v_bao - lower_v_bao)

            o_val = b_probs[o] * v_diff

            if o_val > max_o_val:
                max_o_val = o_val
                best_v_diff = v_diff
                next_b = bao

        # if bounds_split < conv_term or max_generation <= 0:
        if best_v_diff < conv_term or max_generation <= 1:
            return BeliefSet(model, [next_b])

        # Add the belief point and associated value to the belief-value mapping
        upper_bound_belief_value_map.add(b, max_qv)

        # Go one step deeper in the recursion
        b_set = self.expand_hsvi(model=model,
                                 b=next_b,
                                 value_function=value_function,
                                 upper_bound_belief_value_map=upper_bound_belief_value_map,
                                 conv_term=conv_term,
                                 max_generation=max_generation-1)

        # Append the nex belief of this iteration to the deeper beliefs
        new_belief_list = b_set.belief_list
        new_belief_list.append(next_b)

        return BeliefSet(model, new_belief_list)


    def train(self,
              expansions: int,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists in 2 steps, Backup and Expand.
        1. Expand: Expands the belief set base with a expansion strategy given by the parameter expand_function
        2. Backup: Updates the alpha vectors based on the current belief set

        Heuristic Search Value Iteration:
        - By default it performs the backup only on set of beliefs generated by the expand function. (so it full_backup=False)

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed changed for the value function.
            Bellow the amound of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        return super().train(expansions = expansions,
                             full_backup = False,
                             update_passes = update_passes,
                             max_belief_growth = max_belief_growth,
                             initial_belief = initial_belief,
                             initial_value_function = initial_value_function,
                             prune_level = prune_level,
                             prune_interval = prune_interval,
                             limit_value_function_size = limit_value_function_size,
                             gamma = gamma,
                             eps = eps,
                             use_gpu = use_gpu,
                             history_tracking_level = history_tracking_level,
                             overwrite_training = overwrite_training,
                             print_progress = print_progress,
                             print_stats = print_stats)

expand(belief_set, value_function, max_generation)

The expand function of the Heuristic Search Value Iteration (HSVI) technique. It is a recursive function attempting to minimize the gap between the upper and lower estimates of the value function.

It was developed by Smith, T. and Simmons, R. and is described in the paper "Heuristic Search Value Iteration for POMDPs".
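
Since the implementation below is flagged as needing a fix, the following is only a conceptual sketch of the published HSVI exploration heuristic (after Smith & Simmons), not this class's code: the action is chosen greedily with respect to the upper bound, and the observation by the probability-weighted gap between the bounds. The tensor layouts (T[s, a, s'], O[a, s', o], R[s, a]) and the callables v_upper/v_lower are assumptions made for the sketch.

import numpy as np

def hsvi_explore_step(b, T, O, R, gamma, v_upper, v_lower):
    '''One exploration step of HSVI (conceptual sketch).
    b: belief over |S| states, T[s, a, s']: transitions, O[a, s', o]: observations,
    R[s, a]: expected rewards, v_upper/v_lower: callables evaluating the value bounds.'''
    n_actions, n_obs = T.shape[1], O.shape[2]

    def belief_update(b, a, o):
        nb = O[a, :, o] * (b @ T[:, a, :])
        return nb / nb.sum()

    def obs_prob(a, o):
        return float((b @ T[:, a, :]) @ O[a, :, o])

    # 1. Greedy action with respect to the upper-bound Q-value.
    def q_upper(a):
        future = sum(obs_prob(a, o) * v_upper(belief_update(b, a, o))
                     for o in range(n_obs) if obs_prob(a, o) > 0)
        return float(b @ R[:, a]) + gamma * future
    a_star = max(range(n_actions), key=q_upper)

    # 2. Observation maximizing the probability-weighted gap between the bounds.
    def weighted_gap(o):
        p = obs_prob(a_star, o)
        if p == 0:
            return -np.inf
        nb = belief_update(b, a_star, o)
        return p * (v_upper(nb) - v_lower(nb))
    o_star = max(range(n_obs), key=weighted_gap)

    return belief_update(b, a_star, o_star), a_star, o_star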

Parameters:

Name Type Description Default
belief_set BeliefSet

List of beliefs to expand on.

required
value_function ValueFunction

The current value function. Used to compute the value at belief points.

required
max_generation int

The max amount of beliefs that can be added to the belief set at once.

10

Returns:

Name Type Description
belief_set BeliefSet

A new sequence of beliefs.

Source code in olfactory_navigation/agents/hsvi_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int
           ) -> BeliefSet:
    '''
    The expand function of the  Heuristic Search Value Iteration (HSVI) technique.
    It is a redursive function attempting to minimize the bound between the upper and lower estimations of the value function.

    It is developped by Smith T. and Simmons R. and described in the paper "Heuristic Search Value Iteration for POMDPs".

    Parameters
    ----------
    belief_set : BeliefSet
        List of beliefs to expand on.
    value_function : ValueFunction
        The current value function. Used to compute the value at belief points.
    max_generation : int, default=10
        The max amount of beliefs that can be added to the belief set at once.

    Returns
    -------
    belief_set : BeliefSet
        A new sequence of beliefs.
    '''
    # GPU support
    xp = np if not self.on_gpu else cp
    model = self.model

    if conv_term is None:
        conv_term = self.eps

    # Update convergence term
    conv_term /= self.gamma

    # Find best a based on upper bound v
    max_qv = -xp.inf
    best_a = -1
    for a in model.actions:
        b_probs = xp.einsum('sor,s->o', model.reachable_transitional_observation_table[:,a,:,:], b.values)

        b_prob_val = 0
        for o in model.observations:
            b_prob_val += (b_probs[o] * upper_bound_belief_value_map.evaluate(b.update(a,o)))

        qva = float(xp.dot(model.expected_rewards_table[:,a], b.values) + (self.gamma * b_prob_val))

        # qva = upper_bound_belief_value_map.qva(b, a, gamma=self.gamma)
        if qva > max_qv:
            max_qv = qva
            best_a = a

    # Choose o that max gap between bounds
    b_probs = xp.einsum('sor,s->o', model.reachable_transitional_observation_table[:,best_a,:,:], b.values)

    max_o_val = -xp.inf
    best_v_diff = -xp.inf
    next_b = b

    for o in model.observations:
        bao = b.update(best_a, o)

        upper_v_bao = upper_bound_belief_value_map.evaluate(bao)
        lower_v_bao = xp.max(xp.dot(value_function.alpha_vector_array, bao.values))

        v_diff = (upper_v_bao - lower_v_bao)

        o_val = b_probs[o] * v_diff

        if o_val > max_o_val:
            max_o_val = o_val
            best_v_diff = v_diff
            next_b = bao

    # if bounds_split < conv_term or max_generation <= 0:
    if best_v_diff < conv_term or max_generation <= 1:
        return BeliefSet(model, [next_b])

    # Add the belief point and associated value to the belief-value mapping
    upper_bound_belief_value_map.add(b, max_qv)

    # Go one step deeper in the recursion
    b_set = self.expand_hsvi(model=model,
                             b=next_b,
                             value_function=value_function,
                             upper_bound_belief_value_map=upper_bound_belief_value_map,
                             conv_term=conv_term,
                             max_generation=max_generation-1)

    # Append the nex belief of this iteration to the deeper beliefs
    new_belief_list = b_set.belief_list
    new_belief_list.append(next_b)

    return BeliefSet(model, new_belief_list)

train(expansions, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup. 1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function. 2. Backup: Updates the alpha vectors based on the current belief set.

Heuristic Search Value Iteration: - By default, the backup is performed only on the set of beliefs generated by the expand function (i.e., full_backup=False).

Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)

required
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function. If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and belief sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.

Source code in olfactory_navigation/agents/hsvi_agent.py
def train(self,
          expansions: int,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists in 2 steps, Backup and Expand.
    1. Expand: Expands the belief set base with a expansion strategy given by the parameter expand_function
    2. Backup: Updates the alpha vectors based on the current belief set

    Heuristic Search Value Iteration:
    - By default it performs the backup only on set of beliefs generated by the expand function. (so it full_backup=False)

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed changed for the value function.
        Bellow the amound of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    return super().train(expansions = expansions,
                         full_backup = False,
                         update_passes = update_passes,
                         max_belief_growth = max_belief_growth,
                         initial_belief = initial_belief,
                         initial_value_function = initial_value_function,
                         prune_level = prune_level,
                         prune_interval = prune_interval,
                         limit_value_function_size = limit_value_function_size,
                         gamma = gamma,
                         eps = eps,
                         use_gpu = use_gpu,
                         history_tracking_level = history_tracking_level,
                         overwrite_training = overwrite_training,
                         print_progress = print_progress,
                         print_stats = print_stats)
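
A minimal usage sketch of this training loop (not taken from the library's documentation): it assumes agent is an already constructed instance of this agent class, built elsewhere with an olfactory Environment; the parameter values are illustrative.

```python
# Sketch only: 'agent' is assumed to be an instance of this HSVI-style agent,
# already constructed with an Environment. Parameter values are illustrative.
history = agent.train(expansions=20,        # how many belief-set expansions to run
                      update_passes=1,       # backup passes after each expansion
                      max_belief_growth=10,  # at most 10 new beliefs per expansion
                      gamma=0.99,            # discount factor
                      eps=1e-6)              # convergence tolerance
# 'history' is the returned solving history, which offers plotting options.
```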

Infotaxis_Agent

Bases: Agent

An agent following the Infotaxis principle. It is a Model-Based approach that takes steps towards where the agent is most likely to reduce the entropy of its belief. The belief is (as for the PBVI agent) a probability distribution over the state space representing how confident the agent is about being in each state. The technique was developed and described in the following article: Vergassola, M., Villermaux, E., & Shraiman, B. I. (2007). 'Infotaxis' as a strategy for searching without gradients.

It does not need to be trained, so the train(), save() and load() functions are not implemented.
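
Since it requires no training, the agent can be used directly inside a simulation loop. A minimal sketch, assuming agent is an Infotaxis_Agent instance constructed elsewhere; the odor readings and termination flags would normally come from the simulation, so the arrays below are placeholders.

```python
import numpy as np

# Sketch only: 'agent' is assumed to be an Infotaxis_Agent built with an Environment.
n = 4                        # number of simultaneous simulations
agent.initialize_state(n=n)  # one initial belief per simulation

for _ in range(1000):
    moves = agent.choose_action()             # movement vectors, one per simulation
    # The simulation would apply 'moves' and return odor cues and termination flags;
    # the arrays below are placeholders.
    observation = np.zeros(n)                 # odor cue read by each agent
    source_reached = np.zeros(n, dtype=bool)  # whether each agent reached the source
    ok = agent.update_state(observation=observation, source_reached=source_reached)
    if ok is not None and not ok.all():
        break  # some belief updates failed (belief summed to zero)
```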

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e.: if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Alternatively, a dict of strings and action vectors where the strings represent the action labels. If none is provided, by default, all unit movement vectors are included, and this for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, we use an exact conversion that keeps the shape of the environment to define the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes he is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

Source code in olfactory_navigation/agents/infotaxis_agent.py
class Infotaxis_Agent(Agent):
    '''
    An agent following the Infotaxis principle.
    It is a Model-Based approach that aims to make steps towards where the agent has the greatest likelihood to minimize the entropy of the belief.
    The belief is (as for the PBVI agent) a probability distribution over the state space of how much the agent is to be confident in each state.
    The technique was developped and described in the following article: Vergassola, M., Villermaux, E., & Shraiman, B. I. (2007). 'Infotaxis' as a strategy for searching without gradients.

    It does not need to be trained to the train(), save() and load() function are not implemented.


    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of threshold is provided, he agent should be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of action available to the agent. It should match the type of environment (ie: if the environment has layers, it should contain a layer component to the action vector, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included and shuch for all layers (if the environment has layers.)
    name : str, optional
        A custom name to give the agent. If not provided is will be a combination of the class-name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact convertion that keeps the shape of the environment to make the amount of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed of the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducability).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def __init__(self,
                 environment: Environment,
                 threshold: float | None = 3e-6,
                 actions: dict[str, np.ndarray] | np.ndarray | None = None,
                 name: str | None=None,
                 seed: int = 12131415,
                 model: Model | None = None,
                 environment_converter: Callable | None = None,
                 **converter_parameters
                 ) -> None:
        super().__init__(
            environment = environment,
            threshold = threshold,
            actions = actions,
            name = name,
            seed = seed
        )

        # Converting the olfactory environment to a POMDP Model
        if model is not None:
            loaded_model = model
        elif callable(environment_converter):
            loaded_model = environment_converter(agent=self, **converter_parameters)
        else:
            # Using the exact converter
            loaded_model = exact_converter(agent=self)
        self.model:Model = loaded_model

        # Status variables
        self.belief = None
        self.action_played = None


    def to_gpu(self) -> Agent:
        '''
        Function to send the numpy arrays of the agent to the gpu.
        It returns a new instance of the Agent class with the arrays on the gpu

        Returns
        -------
        gpu_agent
        '''
        # Generating a new instance
        cls = self.__class__
        gpu_agent = cls.__new__(cls)

        # Copying arguments to gpu
        for arg, val in self.__dict__.items():
            if isinstance(val, np.ndarray):
                setattr(gpu_agent, arg, cp.array(val))
            elif arg == 'rnd_state':
                setattr(gpu_agent, arg, cp.random.RandomState(self.seed))
            elif isinstance(val, Model):
                setattr(gpu_agent, arg, val.gpu_model)
            elif isinstance(val, BeliefSet) or isinstance(val, Belief):
                setattr(gpu_agent, arg, val.to_gpu())
            else:
                setattr(gpu_agent, arg, val)

        # Self reference instances
        self._alternate_version = gpu_agent
        gpu_agent._alternate_version = self

        gpu_agent.on_gpu = True
        return gpu_agent


    def initialize_state(self,
                         n: int = 1
                         ) -> None:
        '''
        To use an agent within a simulation, the agent's state needs to be initialized.
        The initialization consists of setting the agent's initial belief.
        Multiple agents can be used at once for simulations, for this reason, the belief parameter is a BeliefSet by default.

        Parameters
        ----------
        n : int, default=1
            How many agents are to be used during the simulation.
        '''
        self.belief = BeliefSet(self.model, [Belief(self.model) for _ in range(n)])


    def choose_action(self) -> np.ndarray:
        '''
        Function to let the agent or set of agents choose an action based on their current belief.
        Following the Infotaxis principle, it will choose an action that will minimize the sum of next entropies.

        Returns
        -------
        movement_vector : np.ndarray
            A single or a list of actions chosen by the agent(s) based on their belief.
        '''
        xp = np if not self.on_gpu else cp

        n = len(self.belief)

        best_entropy = xp.ones(n) * -1
        best_action = xp.ones(n, dtype=int) * -1

        current_entropy = self.belief.entropies

        for a in self.model.actions:
            total_entropy = xp.zeros(n)

            for o in self.model.observations:
                b_ao = self.belief.update(actions=xp.ones(n, dtype=int)*a,
                                           observations=xp.ones(n, dtype=int)*o,
                                           throw_error=False)

                # Computing entropy
                with warnings.catch_warnings():
                    warnings.simplefilter('ignore')
                    b_ao_entropy = b_ao.entropies

                b_prob = xp.dot(self.belief.belief_array, xp.sum(self.model.reachable_transitional_observation_table[:,a,o,:], axis=1))

                total_entropy += (b_prob * (current_entropy - b_ao_entropy))

            # Checking if action is superior to previous best
            superiority_mask = best_entropy < total_entropy
            best_action[superiority_mask] = a
            best_entropy[superiority_mask] = total_entropy[superiority_mask]

        # Recording the action played
        self.action_played = best_action

        # Converting action indexes to movement vectors
        movement_vector = self.action_set[best_action,:]

        return movement_vector


    def update_state(self,
                     observation: np.ndarray,
                     source_reached: np.ndarray
                     ) -> None | np.ndarray:
        '''
        Function to update the internal state(s) of the agent(s) based on the previous action(s) taken and the observation(s) received.

        Parameters
        ----------
        observation : np.ndarray
            The observation(s) the agent(s) made.
        source_reached : np.ndarray
            A boolean array of whether the agent(s) have reached the source or not.

        Returns
        -------
        update_successfull : np.ndarray, optional
            If nothing is returned, it means all the agent's state updates have been successfull.
            Else, a boolean np.ndarray of size n can be returned confirming for each agent whether the update has been successful or not.
        '''
        assert self.belief is not None, "Agent was not initialized yet, run the initialize_state function first"

        # GPU support
        xp = np if not self.on_gpu else cp

        # TODO: Make dedicated observation discretization function
        # Set the thresholds as a vector
        threshold = self.threshold
        if not isinstance(threshold, list):
            threshold = [threshold]

        # Ensure 0.0 and 1.0 begin and end the threshold list
        if threshold[0] != -xp.inf:
            threshold = [-xp.inf] + threshold

        if threshold[-1] != xp.inf:
            threshold = threshold + [xp.inf]
        threshold = xp.array(threshold)

        # Setting observation ids
        observation_ids = xp.argwhere((observation[:,None] >= threshold[:-1][None,:]) & (observation[:,None] < threshold[1:][None,:]))[:,1]
        observation_ids[source_reached] = len(threshold) # Observe source, goal is always last observation with len(threshold)-1 being the amount of observation buckets.

        # Update the set of belief
        self.belief = self.belief.update(actions=self.action_played, observations=observation_ids)

        # Check for failed updates
        update_successful = (self.belief.belief_array.sum(axis=1) != 0.0)

        return update_successful


    def kill(self,
             simulations_to_kill: np.ndarray
             ) -> None:
        '''
        Function to kill any simulations that have not reached the source but can't continue further

        Parameters
        ----------
        simulations_to_kill : np.ndarray
            A boolean array of the simulations to kill.
        '''
        if all(simulations_to_kill):
            self.belief = None
        else:
            self.belief = BeliefSet(self.belief.model, self.belief.belief_array[~simulations_to_kill])

choose_action()

Function to let the agent or set of agents choose an action based on their current belief. Following the Infotaxis principle, it chooses the action that maximizes the expected reduction in entropy of the belief.

Returns:

Name Type Description
movement_vector ndarray

A single or a list of actions chosen by the agent(s) based on their belief.
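
Concretely, this is a restatement of the loop in the source below: for the current belief b with entropy H(b), the chosen action is a* = argmax_a Σ_o P(o | b, a) · (H(b) − H(b^{a,o})), where b^{a,o} is the belief updated after taking action a and observing o, and P(o | b, a) is computed from the model's reachable transition-observation table.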

Source code in olfactory_navigation/agents/infotaxis_agent.py
def choose_action(self) -> np.ndarray:
    '''
    Function to let the agent or set of agents choose an action based on their current belief.
    Following the Infotaxis principle, it will choose an action that will minimize the sum of next entropies.

    Returns
    -------
    movement_vector : np.ndarray
        A single or a list of actions chosen by the agent(s) based on their belief.
    '''
    xp = np if not self.on_gpu else cp

    n = len(self.belief)

    best_entropy = xp.ones(n) * -1
    best_action = xp.ones(n, dtype=int) * -1

    current_entropy = self.belief.entropies

    for a in self.model.actions:
        total_entropy = xp.zeros(n)

        for o in self.model.observations:
            b_ao = self.belief.update(actions=xp.ones(n, dtype=int)*a,
                                       observations=xp.ones(n, dtype=int)*o,
                                       throw_error=False)

            # Computing entropy
            with warnings.catch_warnings():
                warnings.simplefilter('ignore')
                b_ao_entropy = b_ao.entropies

            b_prob = xp.dot(self.belief.belief_array, xp.sum(self.model.reachable_transitional_observation_table[:,a,o,:], axis=1))

            total_entropy += (b_prob * (current_entropy - b_ao_entropy))

        # Checking if action is superior to previous best
        superiority_mask = best_entropy < total_entropy
        best_action[superiority_mask] = a
        best_entropy[superiority_mask] = total_entropy[superiority_mask]

    # Recording the action played
    self.action_played = best_action

    # Converting action indexes to movement vectors
    movement_vector = self.action_set[best_action,:]

    return movement_vector

initialize_state(n=1)

To use an agent within a simulation, the agent's state needs to be initialized. The initialization consists of setting the agent's initial belief. Multiple agents can be used at once for simulations, for this reason, the belief parameter is a BeliefSet by default.

Parameters:

Name Type Description Default
n int

How many agents are to be used during the simulation.

1
Source code in olfactory_navigation/agents/infotaxis_agent.py
def initialize_state(self,
                     n: int = 1
                     ) -> None:
    '''
    To use an agent within a simulation, the agent's state needs to be initialized.
    The initialization consists of setting the agent's initial belief.
    Multiple agents can be used at once for simulations, for this reason, the belief parameter is a BeliefSet by default.

    Parameters
    ----------
    n : int, default=1
        How many agents are to be used during the simulation.
    '''
    self.belief = BeliefSet(self.model, [Belief(self.model) for _ in range(n)])

kill(simulations_to_kill)

Function to kill any simulations that have not reached the source but can't continue further

Parameters:

Name Type Description Default
simulations_to_kill ndarray

A boolean array of the simulations to kill.

required
Source code in olfactory_navigation/agents/infotaxis_agent.py
def kill(self,
         simulations_to_kill: np.ndarray
         ) -> None:
    '''
    Function to kill any simulations that have not reached the source but can't continue further

    Parameters
    ----------
    simulations_to_kill : np.ndarray
        A boolean array of the simulations to kill.
    '''
    if all(simulations_to_kill):
        self.belief = None
    else:
        self.belief = BeliefSet(self.belief.model, self.belief.belief_array[~simulations_to_kill])

to_gpu()

Function to send the numpy arrays of the agent to the gpu. It returns a new instance of the Agent class with the arrays on the gpu

Returns:

Type Description
gpu_agent
Source code in olfactory_navigation/agents/infotaxis_agent.py
def to_gpu(self) -> Agent:
    '''
    Function to send the numpy arrays of the agent to the gpu.
    It returns a new instance of the Agent class with the arrays on the gpu

    Returns
    -------
    gpu_agent
    '''
    # Generating a new instance
    cls = self.__class__
    gpu_agent = cls.__new__(cls)

    # Copying arguments to gpu
    for arg, val in self.__dict__.items():
        if isinstance(val, np.ndarray):
            setattr(gpu_agent, arg, cp.array(val))
        elif arg == 'rnd_state':
            setattr(gpu_agent, arg, cp.random.RandomState(self.seed))
        elif isinstance(val, Model):
            setattr(gpu_agent, arg, val.gpu_model)
        elif isinstance(val, BeliefSet) or isinstance(val, Belief):
            setattr(gpu_agent, arg, val.to_gpu())
        else:
            setattr(gpu_agent, arg, val)

    # Self reference instances
    self._alternate_version = gpu_agent
    gpu_agent._alternate_version = self

    gpu_agent.on_gpu = True
    return gpu_agent

update_state(observation, source_reached)

Function to update the internal state(s) of the agent(s) based on the previous action(s) taken and the observation(s) received.

Parameters:

Name Type Description Default
observation ndarray

The observation(s) the agent(s) made.

required
source_reached ndarray

A boolean array of whether the agent(s) have reached the source or not.

required

Returns:

Name Type Description
update_successfull (ndarray, optional)

If nothing is returned, it means all the agent's state updates have been successful. Else, a boolean np.ndarray of size n can be returned confirming for each agent whether the update has been successful or not.
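
The thresholds are first padded with -inf and +inf, then each odor reading is assigned the index of the interval it falls in. A minimal standalone sketch of that discretization step, mirroring the source below (values are illustrative):

```python
import numpy as np

thresholds = np.array([-np.inf, 3e-6, np.inf])  # a single threshold, padded on both sides
observation = np.array([1e-7, 5e-6, 2e-6])      # illustrative odor readings

# Index of the interval each reading falls into: 0 = below the threshold, 1 = above it.
observation_ids = np.argwhere(
    (observation[:, None] >= thresholds[:-1][None, :]) &
    (observation[:, None] <  thresholds[1:][None, :])
)[:, 1]
print(observation_ids)  # [0 1 0]
```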

Source code in olfactory_navigation/agents/infotaxis_agent.py
def update_state(self,
                 observation: np.ndarray,
                 source_reached: np.ndarray
                 ) -> None | np.ndarray:
    '''
    Function to update the internal state(s) of the agent(s) based on the previous action(s) taken and the observation(s) received.

    Parameters
    ----------
    observation : np.ndarray
        The observation(s) the agent(s) made.
    source_reached : np.ndarray
        A boolean array of whether the agent(s) have reached the source or not.

    Returns
    -------
    update_successfull : np.ndarray, optional
        If nothing is returned, it means all the agent's state updates have been successfull.
        Else, a boolean np.ndarray of size n can be returned confirming for each agent whether the update has been successful or not.
    '''
    assert self.belief is not None, "Agent was not initialized yet, run the initialize_state function first"

    # GPU support
    xp = np if not self.on_gpu else cp

    # TODO: Make dedicated observation discretization function
    # Set the thresholds as a vector
    threshold = self.threshold
    if not isinstance(threshold, list):
        threshold = [threshold]

    # Ensure 0.0 and 1.0 begin and end the threshold list
    if threshold[0] != -xp.inf:
        threshold = [-xp.inf] + threshold

    if threshold[-1] != xp.inf:
        threshold = threshold + [xp.inf]
    threshold = xp.array(threshold)

    # Setting observation ids
    observation_ids = xp.argwhere((observation[:,None] >= threshold[:-1][None,:]) & (observation[:,None] < threshold[1:][None,:]))[:,1]
    observation_ids[source_reached] = len(threshold) # Observe source, goal is always last observation with len(threshold)-1 being the amount of observation buckets.

    # Update the set of belief
    self.belief = self.belief.update(actions=self.action_played, observations=observation_ids)

    # Check for failed updates
    update_successful = (self.belief.belief_array.sum(axis=1) != 0.0)

    return update_successful

PBVI_Agent

Bases: Agent

A generic Point-Based Value Iteration based agent. It relies on Model-Based reinforcement learning as described in: Pineau, J. et al., Point-based value iteration: An anytime algorithm for POMDPs. The training consists of two steps:

  • Expand: Where belief points are explored based on the some strategy (to be defined by subclasses).

  • Backup: Using the generated belief points, the value function is updated.

The belief points are probability distributions over the state space and are therefore vectors of |S| elements.

Actions are chosen based on a value function. A value function is a set of alpha vectors of dimensionality |S|. Each alpha vector is associated with a single action, but multiple alpha vectors can be associated with the same action. To choose an action at a given belief point, a dot product is taken between each alpha vector and the belief point, and the action associated with the highest result is chosen.
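
As a concrete illustration of this action-selection rule (a standalone sketch with made-up arrays, not the library's API):

```python
import numpy as np

# Illustrative value function: 3 alpha vectors over |S| = 4 states,
# each associated with one action index.
alpha_vectors = np.array([[0.2, 0.1, 0.0, 0.3],
                          [0.0, 0.4, 0.1, 0.1],
                          [0.3, 0.0, 0.2, 0.0]])
alpha_actions = np.array([0, 1, 0])

belief = np.array([0.1, 0.6, 0.2, 0.1])  # probability distribution over |S| states

values = alpha_vectors @ belief          # dot product of every alpha vector with the belief
chosen_action = alpha_actions[np.argmax(values)]
print(chosen_action)                     # 1
```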

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e.: if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Alternatively, a dict of strings and action vectors where the strings represent the action labels. If none is provided, by default, all unit movement vectors are included, and this for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, we use an exact conversion that keeps the shape of the environment to define the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes he is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

Source code in olfactory_navigation/agents/pbvi_agent.py
class PBVI_Agent(Agent):
    '''
    A generic Point-Based Value Iteration based agent. It relies on Model-Based reinforcement learning as described in: Pineau J. et al, Point-based value iteration: An anytime algorithm for POMDPs
    The training consist in two steps:

    - Expand: Where belief points are explored based on the some strategy (to be defined by subclasses).

    - Backup: Using the generated belief points, the value function is updated.

    The belief points are probability distributions over the state space and are therefore vectors of |S| elements.

    Actions are chosen based on a value function. A value function is a set of alpha vectors of dimentionality |S|.
    Each alpha vector is associated to a single action but multiple alpha vectors can be associated to the same action.
    To choose an action at a given belief point, a dot product is taken between each alpha vector and the belief point and the action associated with the highest result is chosen.

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of threshold is provided, he agent should be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of action available to the agent. It should match the type of environment (ie: if the environment has layers, it should contain a layer component to the action vector, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included and shuch for all layers (if the environment has layers.)
    name : str, optional
        A custom name to give the agent. If not provided is will be a combination of the class-name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact convertion that keeps the shape of the environment to make the amount of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed of the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : Model
        The environment converted to a POMDP model using the "from_environment" constructor of the Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducability).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def __init__(self,
                 environment: Environment,
                 threshold: float | None = 3e-6,
                 actions: dict[str, np.ndarray] | np.ndarray | None = None,
                 name: str | None = None,
                 seed: int = 12131415,
                 model: Model | None = None,
                 environment_converter: Callable | None = None,
                 **converter_parameters
                 ) -> None:
        super().__init__(
            environment = environment,
            threshold = threshold,
            actions = actions,
            name = name,
            seed = seed
        )

        # Converting the olfactory environment to a POMDP Model
        if model is not None:
            loaded_model = model
        elif callable(environment_converter):
            loaded_model = environment_converter(agent=self, **converter_parameters)
        else:
            # Using the exact converter
            loaded_model = exact_converter(agent=self)
        self.model:Model = loaded_model

        # Trainable variables
        self.trained_at = None
        self.value_function = None

        # Status variables
        self.belief = None
        self.action_played = None


    def to_gpu(self) -> Agent:
        '''
        Function to send the numpy arrays of the agent to the gpu.
        It returns a new instance of the Agent class with the arrays on the gpu

        Returns
        -------
        gpu_agent : Agent
            A copy of the agent with the arrays on the GPU.
        '''
        assert gpu_support, "GPU support is not enabled, Cupy might need to be installed..."

        # Generating a new instance
        cls = self.__class__
        gpu_agent = cls.__new__(cls)

        # Copying arguments to gpu
        for arg, val in self.__dict__.items():
            if isinstance(val, np.ndarray):
                setattr(gpu_agent, arg, cp.array(val))
            elif arg == 'rnd_state':
                setattr(gpu_agent, arg, cp.random.RandomState(self.seed))
            elif isinstance(val, Model):
                setattr(gpu_agent, arg, val.gpu_model)
            elif isinstance(val, ValueFunction):
                setattr(gpu_agent, arg, val.to_gpu())
            elif isinstance(val, BeliefSet) or isinstance(val, Belief):
                setattr(gpu_agent, arg, val.to_gpu())
            else:
                setattr(gpu_agent, arg, val)

        # Self reference instances
        self._alternate_version = gpu_agent
        gpu_agent._alternate_version = self

        gpu_agent.on_gpu = True
        return gpu_agent


    def save(self,
             folder: str | None = None,
             force: bool = False,
             save_environment: bool = False
             ) -> None:
        '''
        The save function for PBVI Agents consists in recording the value function after the training.
        It saves the agent in a folder with the name of the agent (class name + training timestamp).
        In this folder, there will be the metadata of the agent (all the attributes) in a json format and the value function.

        Optionally, the environment can be saved too to be able to load it alongside the agent for future reuse.
        If the agent has already been saved, the saving will not happen unless the force parameter is toggled.

        Parameters
        ----------
        folder : str, optional
            The folder under which to save the agent (a subfolder will be created under this folder).
            The agent will therefore be saved at <folder>/Agent-<agent_name> .
            By default the current folder is used.
        force : bool, default=False
            Whether to overwrite an already saved agent with the same name at the same path.
        save_environment : bool, default=False
            Whether to save the environment data along with the agent.
        '''
        assert self.trained_at is not None, "The agent is not trained, there is nothing to save."

        # GPU support
        if self.on_gpu:
            self.to_cpu().save(folder=folder, force=force, save_environment=save_environment)
            return

        # Adding env name to folder path
        if folder is None:
            folder = f'./Agent-{self.name}'
        else:
            folder += '/Agent-' + self.name

        # Checking the folder exists or creates it
        if not os.path.exists(folder):
            os.mkdir(folder)
        elif len(os.listdir(folder)):
            if force:
                shutil.rmtree(folder)
                os.mkdir(folder)
            else:
                raise Exception(f'{folder} is not empty. If you want to overwrite the saved model, enable "force".')

        # If requested save environment
        if save_environment:
            self.environment.save(folder=folder)

        # TODO: Add actions to save function
        # Generating the metadata arguments dictionary
        arguments = {}
        arguments['name'] = self.name
        arguments['class'] = self.class_name
        arguments['threshold'] = self.threshold
        arguments['environment_name'] = self.environment.name
        arguments['environment_saved_at'] = self.environment.saved_at
        arguments['trained_at'] = self.trained_at
        arguments['seed'] = self.seed

        # Output the arguments to a METADATA file
        with open(folder + '/METADATA.json', 'w') as json_file:
            json.dump(arguments, json_file, indent=4)

        # Save value function
        self.value_function.save(folder=folder, file_name='Value_Function.npy')

        # Finalization
        self.saved_at = os.path.abspath(folder).replace('\\', '/')
        print(f'Agent saved to: {folder}')


    @classmethod
    def load(cls,
             folder: str
             ) -> 'PBVI_Agent':
        '''
        Function to load a PBVI agent from a given folder it has been saved to.
        It will load the environment the agent has been trained on along with it.

        If it is a subclass of the PBVI_Agent, an instance of that specific subclass will be returned.

        Parameters
        ----------
        folder : str
            The agent folder.

        Returns
        -------
        instance : PBVI_Agent
            The loaded instance of the PBVI Agent.
        '''
        # Load arguments
        arguments = None
        with open(folder + '/METADATA.json', 'r') as json_file:
            arguments = json.load(json_file)

        # Load environment
        environment = Environment.load(arguments['environment_saved_at'])

        # Load specific class
        if arguments['class'] != 'PBVI_Agent':
            from olfactory_navigation import agents
            cls = {name:obj for name, obj in inspect.getmembers(agents)}[arguments['class']]

        # Build instance
        instance = cls(
            environment=environment,
            threshold=arguments['threshold'],
            name=arguments['name'],
            seed=arguments['seed']
        )

        # Load and set the value function on the instance
        instance.value_function = ValueFunction.load(
            file=folder + '/Value_Function.npy',
            model=instance.model
        )
        instance.trained_at = arguments['trained_at']
        instance.saved_at = folder

        return instance


    def train(self,
              expansions: int,
              full_backup: bool = True,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True,
              **expand_arguments
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists in 2 steps, Backup and Expand.
        1. Expand: Expands the belief set base with a expansion strategy given by the parameter expand_function
        2. Backup: Updates the alpha vectors based on the current belief set

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
        full_backup : bool, default=True
            Whether to force the backup function has to be run on the full set beliefs uncovered since the beginning or only on the new points.
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed changed for the value function.
            Bellow the amound of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.
        expand_arguments : kwargs
            An arbitrary amount of parameters that will be passed on to the expand function.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        # GPU support
        if use_gpu and not self.on_gpu:
            gpu_agent = self.to_gpu()
            solver_history = super(self.__class__, gpu_agent).train(
                expansions=expansions,
                full_backup=full_backup,
                update_passes=update_passes,
                max_belief_growth=max_belief_growth,
                initial_belief=initial_belief,
                initial_value_function=initial_value_function,
                prune_level=prune_level,
                prune_interval=prune_interval,
                limit_value_function_size=limit_value_function_size,
                gamma=gamma,
                eps=eps,
                use_gpu=use_gpu,
                history_tracking_level=history_tracking_level,
                overwrite_training=overwrite_training,
                print_progress=print_progress,
                print_stats=print_stats,
                **expand_arguments
            )
            self.value_function = gpu_agent.value_function.to_cpu()
            return solver_history

        xp = np if not self.on_gpu else cp

        # Getting model
        model = self.model

        # Initial belief
        if initial_belief is None:
            belief_set = BeliefSet(model, [Belief(model)])
        elif isinstance(initial_belief, BeliefSet):
            belief_set = initial_belief.to_gpu() if self.on_gpu else initial_belief 
        else:
            initial_belief = Belief(model, xp.array(initial_belief.values))
            belief_set = BeliefSet(model, [initial_belief])

        # Handling the case where the agent is already trained
        if (self.value_function is not None):
            if overwrite_training:
                print('[warning] The value function is being overwritten')
                self.trained_at = None
                self.name = '-'.join(self.name.split('-')[:-1])
                self.value_function = None
            else:
                initial_value_function = self.value_function

        # Initial value function
        if initial_value_function is None:
            value_function = ValueFunction(model, model.expected_rewards_table.T, model.actions)
        else:
            value_function = initial_value_function.to_gpu() if self.on_gpu else initial_value_function

        # Convergence check boundary
        max_allowed_change = eps * (gamma / (1-gamma))

        # History tracking
        training_history = TrainingHistory(tracking_level=history_tracking_level,
                                           model=model,
                                           gamma=gamma,
                                           eps=eps,
                                           expand_append=full_backup,
                                           initial_value_function=value_function,
                                           initial_belief_set=belief_set)

        # Loop
        iteration = 0
        expand_value_function = value_function
        old_value_function = value_function

        try:
            iterator = trange(expansions, desc='Expansions') if print_progress else range(expansions)
            iterator_postfix = {}
            for expansion_i in iterator:

                # 1: Expand belief set
                start_ts = datetime.now()

                new_belief_set = self.expand(belief_set=belief_set,
                                             value_function=value_function,
                                             max_generation=max_belief_growth,
                                             **expand_arguments)

                # Add new beliefs points to the total belief_set
                belief_set = belief_set.union(new_belief_set)

                expand_time = (datetime.now() - start_ts).total_seconds()
                training_history.add_expand_step(expansion_time=expand_time, belief_set=belief_set)

                # 2: Backup, update value function (alpha vector set)
                for _ in range(update_passes) if (not print_progress or update_passes <= 1) else trange(update_passes, desc=f'Backups {expansion_i}'):
                    start_ts = datetime.now()

                    # Backup step
                    value_function = self.backup(belief_set if full_backup else new_belief_set,
                                                 value_function,
                                                 gamma=gamma,
                                                 append=(not full_backup),
                                                 belief_dominance_prune=False)
                    backup_time = (datetime.now() - start_ts).total_seconds()

                    # Additional pruning
                    if (iteration % prune_interval) == 0 and iteration > 0:
                        start_ts = datetime.now()
                        vf_len = len(value_function)

                        value_function.prune(prune_level)

                        prune_time = (datetime.now() - start_ts).total_seconds()
                        alpha_vectors_pruned = len(value_function) - vf_len
                        training_history.add_prune_step(prune_time, alpha_vectors_pruned)

                    # Check if value function size is above threshold
                    if limit_value_function_size >= 0 and len(value_function) > limit_value_function_size:
                        # Compute matrix multiplications between avs and beliefs
                        alpha_value_per_belief = xp.matmul(value_function.alpha_vector_array, belief_set.belief_array.T)

                        # Select the useful alpha vectors
                        best_alpha_vector_per_belief = xp.argmax(alpha_value_per_belief, axis=0)
                        useful_alpha_vectors = xp.unique(best_alpha_vector_per_belief)

                        # Select a random selection of vectors to delete
                        unuseful_alpha_vectors = xp.delete(xp.arange(len(value_function)), useful_alpha_vectors)
                        random_vectors_to_delete = self.rnd_state.choice(unuseful_alpha_vectors,
                                                                         size=max_belief_growth,
                                                                         p=(xp.arange(len(unuseful_alpha_vectors))[::-1] / xp.sum(xp.arange(len(unuseful_alpha_vectors)))))
                                                                         # replace=False,
                                                                         # p=1/len(unuseful_alpha_vectors))

                        value_function = ValueFunction(model=model,
                                                       alpha_vectors=xp.delete(value_function.alpha_vector_array, random_vectors_to_delete, axis=0),
                                                       action_list=xp.delete(value_function.actions, random_vectors_to_delete))

                        iterator_postfix['|useful|'] = useful_alpha_vectors.shape[0]

                    # Compute the change between value functions
                    max_change = self.compute_change(value_function, old_value_function, belief_set)

                    # History tracking
                    training_history.add_backup_step(backup_time, max_change, value_function)

                    # Convergence check
                    if max_change < max_allowed_change:
                        break

                    old_value_function = value_function

                    # Update iteration counter
                    iteration += 1

                # Compute change with old expansion value function
                expand_max_change = self.compute_change(expand_value_function, value_function, belief_set)

                if expand_max_change < max_allowed_change:
                    if print_progress:
                        print('Converged!')
                    break

                expand_value_function = value_function

                iterator_postfix['|V|'] = len(value_function)
                iterator_postfix['|B|'] = len(belief_set)

                if print_progress:
                    iterator.set_postfix(iterator_postfix)

        except MemoryError as e:
            print(f'Memory full: {e}')
            print('Returning value function and history as is...\n')

        # Final pruning
        start_ts = datetime.now()
        vf_len = len(value_function)

        value_function.prune(prune_level)

        # History tracking
        prune_time = (datetime.now() - start_ts).total_seconds()
        alpha_vectors_pruned = len(value_function) - vf_len
        training_history.add_prune_step(prune_time, alpha_vectors_pruned)

        # Record when it was trained
        self.trained_at = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.name += f'-trained_{self.trained_at}'

        # Saving value function
        self.value_function = value_function

        # Print stats if requested
        if print_stats:
            print(training_history.summary)

        return training_history


    def compute_change(self,
                       value_function: ValueFunction,
                       new_value_function: ValueFunction,
                       belief_set: BeliefSet
                       ) -> float:
        '''
        Function to compute the maximum change between two value functions over a set of beliefs, used to decide convergence based on the eps parameter of the Solver.
        For each belief, it takes the maximum value under each value function and computes the largest absolute difference between the two.
        Convergence is considered reached when this maximum change is lower than eps * (gamma / (1 - gamma)).

        Parameters
        ----------
        value_function : ValueFunction
            The first value function to compare.
        new_value_function : ValueFunction
            The second value function to compare.
        belief_set : BeliefSet
            The set of beliefs at which to evaluate the value functions when computing the maximum change.

        Returns
        -------
        max_change : float
            The maximum change between value functions at belief points.
        '''
        # Get numpy corresponding to the arrays
        xp = np if not gpu_support else cp.get_array_module(value_function.alpha_vector_array)

        # Computing Delta for each beliefs
        max_val_per_belief = xp.max(xp.matmul(belief_set.belief_array, value_function.alpha_vector_array.T), axis=1)
        new_max_val_per_belief = xp.max(xp.matmul(belief_set.belief_array, new_value_function.alpha_vector_array.T), axis=1)
        max_change = xp.max(xp.abs(new_max_val_per_belief - max_val_per_belief))

        return max_change


    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int,
               **kwargs
               ) -> BeliefSet:
        '''
        Abstract function!
        This function should be implemented in subclasses.
        The expand function consists in the exploration of the belief set.
        It takes as input a belief set and generates at most 'max_generation' beliefs from it.

        The current value function is also passed as an argument as it is used in some PBVI techniques to guide the belief exploration.

        Parameters
        ----------
        belief_set : BeliefSet
            The belief or set of beliefs to be used as a starting point for the exploration.
        value_function : ValueFunction
            The current value function. To be used to guide the exploration process.
        max_generation : int
            How many beliefs to be generated at most.
        kwargs
            Special parameters for the particular flavors of the PBVI Agent.

        Returns
        -------
        new_belief_set : BeliefSet
            A new (or expanded) set of beliefs.
        '''
        raise NotImplementedError('The PBVI class is abstract so the expand function is not implemented; create a PBVI_Agent subclass to implement this method')


    def backup(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               gamma: float = 0.99,
               append: bool = False,
               belief_dominance_prune: bool = True
               ) -> ValueFunction:
        '''
        This function's purpose is to update the set of alpha vectors. It does so in 3 steps:
        1. It creates projections from each alpha vector for each possible action and each possible observation.
        2. It collapses this set of generated alpha vectors, for each action and each belief, by summing the alpha vectors weighted by the observation probabilities.
        3. It then further collapses the set by keeping the best alpha vector and action per belief.
        In the end we have a set of alpha vectors as large as the number of beliefs.

        The alpha vectors are also pruned to avoid duplicates and remove dominated ones.

        Parameters
        ----------
        belief_set : BeliefSet
            The belief set to use to generate the new alpha vectors with.
        value_function : ValueFunction
            The alpha vectors to generate the new set from.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        append : bool, default=False
            Whether to append the new alpha vectors generated to the old alpha vectors before pruning.
        belief_dominance_prune : bool, default=True
            Whether, before returning the new value function, to keep only the new alpha vectors whose value at their belief point is superior to that of the previous value function.

        Returns
        -------
        new_alpha_set : ValueFunction
            A list of updated alpha vectors.
        '''
        xp = np if not self.on_gpu else cp
        model = self.model

        # Step 1
        vector_array = value_function.alpha_vector_array
        vectors_array_reachable_states = vector_array[xp.arange(vector_array.shape[0])[:,None,None,None], model.reachable_states[None,:,:,:]]

        gamma_a_o_t = gamma * xp.einsum('saor,vsar->aovs', model.reachable_transitional_observation_table, vectors_array_reachable_states)

        # Step 2
        belief_array = belief_set.belief_array # bs
        best_alpha_ind = xp.argmax(xp.tensordot(belief_array, gamma_a_o_t, (1,3)), axis=3) # argmax(bs,aovs->baov) -> bao

        best_alphas_per_o = gamma_a_o_t[model.actions[None,:,None,None], model.observations[None,None,:,None], best_alpha_ind[:,:,:,None], model.states[None,None,None,:]] # baos

        alpha_a = model.expected_rewards_table.T + xp.sum(best_alphas_per_o, axis=2) # as + bas

        # Step 3
        best_actions = xp.argmax(xp.einsum('bas,bs->ba', alpha_a, belief_array), axis=1)
        alpha_vectors = xp.take_along_axis(alpha_a, best_actions[:,None,None],axis=1)[:,0,:]

        # Belief domination
        if belief_dominance_prune:
            best_value_per_belief = xp.sum((belief_array * alpha_vectors), axis=1)
            old_best_value_per_belief = xp.max(xp.matmul(belief_array, vector_array.T), axis=1)
            dominating_vectors = best_value_per_belief > old_best_value_per_belief

            best_actions = best_actions[dominating_vectors]
            alpha_vectors = alpha_vectors[dominating_vectors]

        # Creation of value function
        new_value_function = ValueFunction(model, alpha_vectors, best_actions)

        # Union with previous value function
        if append:
            new_value_function.extend(value_function)

        return new_value_function


    def modify_environment(self,
                           new_environment: Environment
                           ) -> 'Agent':
        '''
        Function to modify the environment of the agent.
        If the agent is already trained, the trained element should also be adapted to fit this new environment.

        Parameters
        ----------
        new_environment : Environment
            A modified environment.

        Returns
        -------
        modified_agent : PBVI_Agent
            A new PBVI agent with the modified environment.
        '''
        # GPU support
        if self.on_gpu:
            return self.to_cpu().modify_environment(new_environment=new_environment)

        # Creating a new agent instance
        modified_agent = self.__class__(environment=new_environment,
                                        threshold=self.threshold,
                                        name=self.name)

        # Modifying the value function
        if self.value_function is not None:
            reshaped_vf_array = np.array([cv2.resize(av, np.array(modified_agent.model.state_grid.shape)[::-1]).ravel()
                                          for av in self.value_function.alpha_vector_array.reshape(len(self.value_function), *self.model.state_grid.shape)])
            modified_vf = ValueFunction(modified_agent.model, alpha_vectors=reshaped_vf_array, action_list=self.value_function.actions)
            modified_agent.value_function = modified_vf

        return modified_agent


    def initialize_state(self,
                         n: int = 1
                         ) -> None:
        '''
        To use an agent within a simulation, the agent's state needs to be initialized.
        The initialization consists of setting the agent's initial belief.
        Multiple agents can be used at once in simulations; for this reason, the belief attribute is a BeliefSet by default.

        Parameters
        ----------
        n : int, default=1
            How many agents are to be used during the simulation.
        '''
        assert self.value_function is not None, "Agent was not trained, run the training function first..."

        self.belief = BeliefSet(self.model, [Belief(self.model) for _ in range(n)])


    def choose_action(self) -> np.ndarray:
        '''
        Function to let the agent or set of agents choose an action based on their current belief.

        Returns
        -------
        movement_vector : np.ndarray
            A single or a list of actions chosen by the agent(s) based on their belief.
        '''
        assert self.belief is not None, "Agent was not initialized yet, run the initialize_state function first"

        # Evaluated value function
        _, action = self.value_function.evaluate_at(self.belief)

        # Recording the action played
        self.action_played = action

        # Converting action indexes to movement vectors
        movement_vector = self.action_set[action,:]

        return movement_vector


    def update_state(self,
                     observation: np.ndarray,
                     source_reached: np.ndarray
                     ) -> None | np.ndarray:
        '''
        Function to update the internal state(s) of the agent(s) based on the previous action(s) taken and the observation(s) received.

        Parameters
        ----------
        observation : np.ndarray
            The observation(s) the agent(s) made.
        source_reached : np.ndarray
            A boolean array of whether the agent(s) have reached the source or not.

        Returns
        -------
        update_successful : np.ndarray, optional
            If nothing is returned, it means all the agents' state updates have been successful.
            Else, a boolean np.ndarray of size n can be returned confirming for each agent whether the update has been successful or not.
        '''
        assert self.belief is not None, "Agent was not initialized yet, run the initialize_state function first"

        # GPU support
        xp = np if not self.on_gpu else cp

        # TODO: Make dedicated observation discretization function
        # Set the thresholds as a vector
        threshold = self.threshold
        if not isinstance(threshold, list):
            threshold = [threshold]

        # Ensure -inf and +inf begin and end the threshold list
        if threshold[0] != -xp.inf:
            threshold = [-xp.inf] + threshold

        if threshold[-1] != xp.inf:
            threshold = threshold + [xp.inf]
        threshold = xp.array(threshold)

        # Setting observation ids
        observation_ids = xp.argwhere((observation[:,None] >= threshold[:-1][None,:]) & (observation[:,None] < threshold[1:][None,:]))[:,1]
        observation_ids[source_reached] = len(threshold) # Observing the source: the goal is always the last observation, with len(threshold)-1 being the number of odor-level observation buckets.

        # Update the set of beliefs
        self.belief = self.belief.update(actions=self.action_played, observations=observation_ids, throw_error=False)

        # Check for failed updates
        update_successful = (self.belief.belief_array.sum(axis=1) != 0.0)

        return update_successful


    def kill(self,
             simulations_to_kill: np.ndarray
             ) -> None:
        '''
        Function to kill any simulations that have not reached the source but can't continue further

        Parameters
        ----------
        simulations_to_kill : np.ndarray
            A boolean array of the simulations to kill.
        '''
        if all(simulations_to_kill):
            self.belief = None
        else:
            self.belief = BeliefSet(self.belief.model, self.belief.belief_array[~simulations_to_kill])

backup(belief_set, value_function, gamma=0.99, append=False, belief_dominance_prune=True)

This function's purpose is to update the set of alpha vectors. It does so in 3 steps: 1. it creates projections from each alpha vector for each possible action and each possible observation; 2. it collapses this set of generated alpha vectors, for each action and each belief, by summing the alpha vectors weighted by the observation probabilities; 3. it then further collapses the set by keeping the best alpha vector and action per belief. In the end we have a set of alpha vectors as large as the number of beliefs.

The alpha vectors are also pruned to avoid duplicates and remove dominated ones.

Parameters:

Name Type Description Default
belief_set BeliefSet

The belief set to use to generate the new alpha vectors with.

required
value_function ValueFunction

The alpha vectors to generate the new set from.

required
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
append bool

Whether to append the new alpha vectors generated to the old alpha vectors before pruning.

False
belief_dominance_prune bool

Whether, before returning the new value function, to keep only the new alpha vectors whose value at their belief point is superior to that of the previous value function.

True

Returns:

Name Type Description
new_alpha_set ValueFunction

A list of updated alpha vectors.

Source code in olfactory_navigation/agents/pbvi_agent.py
def backup(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           gamma: float = 0.99,
           append: bool = False,
           belief_dominance_prune: bool = True
           ) -> ValueFunction:
    '''
    This function's purpose is to update the set of alpha vectors. It does so in 3 steps:
    1. It creates projections from each alpha vector for each possible action and each possible observation.
    2. It collapses this set of generated alpha vectors, for each action and each belief, by summing the alpha vectors weighted by the observation probabilities.
    3. It then further collapses the set by keeping the best alpha vector and action per belief.
    In the end we have a set of alpha vectors as large as the number of beliefs.

    The alpha vectors are also pruned to avoid duplicates and remove dominated ones.

    Parameters
    ----------
    belief_set : BeliefSet
        The belief set to use to generate the new alpha vectors with.
    value_function : ValueFunction
        The alpha vectors to generate the new set from.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    append : bool, default=False
        Whether to append the new alpha vectors generated to the old alpha vectors before pruning.
    belief_dominance_prune : bool, default=True
        Whether, before returning the new value function, to keep only the new alpha vectors whose value at their belief point is superior to that of the previous value function.

    Returns
    -------
    new_alpha_set : ValueFunction
        A list of updated alpha vectors.
    '''
    xp = np if not self.on_gpu else cp
    model = self.model

    # Step 1
    vector_array = value_function.alpha_vector_array
    vectors_array_reachable_states = vector_array[xp.arange(vector_array.shape[0])[:,None,None,None], model.reachable_states[None,:,:,:]]

    gamma_a_o_t = gamma * xp.einsum('saor,vsar->aovs', model.reachable_transitional_observation_table, vectors_array_reachable_states)

    # Step 2
    belief_array = belief_set.belief_array # bs
    best_alpha_ind = xp.argmax(xp.tensordot(belief_array, gamma_a_o_t, (1,3)), axis=3) # argmax(bs,aovs->baov) -> bao

    best_alphas_per_o = gamma_a_o_t[model.actions[None,:,None,None], model.observations[None,None,:,None], best_alpha_ind[:,:,:,None], model.states[None,None,None,:]] # baos

    alpha_a = model.expected_rewards_table.T + xp.sum(best_alphas_per_o, axis=2) # as + bas

    # Step 3
    best_actions = xp.argmax(xp.einsum('bas,bs->ba', alpha_a, belief_array), axis=1)
    alpha_vectors = xp.take_along_axis(alpha_a, best_actions[:,None,None],axis=1)[:,0,:]

    # Belief domination
    if belief_dominance_prune:
        best_value_per_belief = xp.sum((belief_array * alpha_vectors), axis=1)
        old_best_value_per_belief = xp.max(xp.matmul(belief_array, vector_array.T), axis=1)
        dominating_vectors = best_value_per_belief > old_best_value_per_belief

        best_actions = best_actions[dominating_vectors]
        alpha_vectors = alpha_vectors[dominating_vectors]

    # Creation of value function
    new_value_function = ValueFunction(model, alpha_vectors, best_actions)

    # Union with previous value function
    if append:
        new_value_function.extend(value_function)

    return new_value_function
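
Below is a minimal usage sketch of a single backup pass. It is illustrative only: the agent, belief_set and value_function names are assumptions standing in for an existing PBVI_Agent subclass instance and for objects normally produced inside train().

# Illustrative sketch: one backup pass over an existing belief set.
# `agent`, `belief_set` and `value_function` are assumed to already exist.
new_value_function = agent.backup(
    belief_set,
    value_function,
    gamma=0.99,
    append=False,
    belief_dominance_prune=True,
)
print(len(new_value_function))  # at most one alpha vector kept per belief point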

choose_action()

Function to let the agent or set of agents choose an action based on their current belief.

Returns:

Name Type Description
movement_vector ndarray

A single or a list of actions chosen by the agent(s) based on their belief.

Source code in olfactory_navigation/agents/pbvi_agent.py
def choose_action(self) -> np.ndarray:
    '''
    Function to let the agent or set of agents choose an action based on their current belief.

    Returns
    -------
    movement_vector : np.ndarray
        A single or a list of actions chosen by the agent(s) based on their belief.
    '''
    assert self.belief is not None, "Agent was not initialized yet, run the initialize_state function first"

    # Evaluated value function
    _, action = self.value_function.evaluate_at(self.belief)

    # Recording the action played
    self.action_played = action

    # Converting action indexes to movement vectors
    movement_vector = self.action_set[action,:]

    return movement_vector

compute_change(value_function, new_value_function, belief_set)

Function to compute the maximum change between two value functions over a set of beliefs, used to decide convergence based on the eps parameter of the Solver. For each belief, it takes the maximum value under each value function and computes the largest absolute difference between the two. Convergence is considered reached when this maximum change is lower than eps * (gamma / (1 - gamma)).

Parameters:

Name Type Description Default
value_function ValueFunction

The first value function to compare.

required
new_value_function ValueFunction

The second value function to compare.

required
belief_set BeliefSet

The set of beliefs at which to evaluate the value functions when computing the maximum change.

required

Returns:

Name Type Description
max_change float

The maximum change between value functions at belief points.

Source code in olfactory_navigation/agents/pbvi_agent.py
def compute_change(self,
                   value_function: ValueFunction,
                   new_value_function: ValueFunction,
                   belief_set: BeliefSet
                   ) -> float:
    '''
    Function to compute the maximum change between two value functions over a set of beliefs, used to decide convergence based on the eps parameter of the Solver.
    For each belief, it takes the maximum value under each value function and computes the largest absolute difference between the two.
    Convergence is considered reached when this maximum change is lower than eps * (gamma / (1 - gamma)).

    Parameters
    ----------
    value_function : ValueFunction
        The first value function to compare.
    new_value_function : ValueFunction
        The second value function to compare.
    belief_set : BeliefSet
        The set of beliefs at which to evaluate the value functions when computing the maximum change.

    Returns
    -------
    max_change : float
        The maximum change between value functions at belief points.
    '''
    # Get numpy corresponding to the arrays
    xp = np if not gpu_support else cp.get_array_module(value_function.alpha_vector_array)

    # Computing Delta for each beliefs
    max_val_per_belief = xp.max(xp.matmul(belief_set.belief_array, value_function.alpha_vector_array.T), axis=1)
    new_max_val_per_belief = xp.max(xp.matmul(belief_set.belief_array, new_value_function.alpha_vector_array.T), axis=1)
    max_change = xp.max(xp.abs(new_max_val_per_belief - max_val_per_belief))

    return max_change
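
The quantity returned above can be reproduced with plain numpy. The following self-contained sketch uses made-up beliefs and alpha vectors (not the library objects) to illustrate the convergence criterion:

import numpy as np

# Made-up belief points (|B| x |S|) and alpha vectors (|V| x |S|) for illustration only.
beliefs = np.array([[1.0, 0.0], [0.5, 0.5]])
old_alphas = np.array([[0.0, 1.0], [1.0, 0.0]])
new_alphas = np.array([[0.1, 1.0], [1.0, 0.2]])

# Best achievable value at each belief under each value function.
old_vals = np.max(beliefs @ old_alphas.T, axis=1)
new_vals = np.max(beliefs @ new_alphas.T, axis=1)
max_change = np.max(np.abs(new_vals - old_vals))

# Convergence check used by the solver.
gamma, eps = 0.99, 1e-6
converged = max_change < eps * (gamma / (1 - gamma))
print(max_change, converged)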

expand(belief_set, value_function, max_generation, **kwargs)

Abstract function! This function should be implemented in subclasses. The expand function consists in the exploration of the belief set. It takes as input a belief set and generates at most 'max_generation' beliefs from it.

The current value function is also passed as an argument as it is used in some PBVI techniques to guide the belief exploration.

Parameters:

Name Type Description Default
belief_set BeliefSet

The belief or set of beliefs to be used as a starting point for the exploration.

required
value_function ValueFunction

The current value function. To be used to guide the exploration process.

required
max_generation int

How many beliefs to be generated at most.

required
kwargs

Special parameters for the particular flavors of the PBVI Agent.

{}

Returns:

Name Type Description
new_belief_set BeliefSet

A new (or expanded) set of beliefs.

Source code in olfactory_navigation/agents/pbvi_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int,
           **kwargs
           ) -> BeliefSet:
    '''
    Abstract function!
    This function should be implemented in subclasses.
    The expand function consists in the exploration of the belief set.
    It takes as input a belief set and generates at most 'max_generation' beliefs from it.

    The current value function is also passed as an argument as it is used in some PBVI techniques to guide the belief exploration.

    Parameters
    ----------
    belief_set : BeliefSet
        The belief or set of beliefs to be used as a starting point for the exploration.
    value_function : ValueFunction
        The current value function. To be used to guide the exploration process.
    max_generation : int
        How many beliefs to be generated at most.
    kwargs
        Special parameters for the particular flavors of the PBVI Agent.

    Returns
    -------
    new_belief_set : BeliefSet
        A new (or expanded) set of beliefs.
    '''
    raise NotImplementedError('The PBVI class is abstract so the expand function is not implemented; create a PBVI_Agent subclass to implement this method')
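
Since expand is abstract, a subclass has to provide it. The sketch below is a deliberately naive, CPU-only example (not one of the library's actual strategies): it perturbs existing belief points to create new ones. The import path for Belief and BeliefSet is an assumption and may need adjusting to the actual package layout.

import numpy as np

# Assumed import location; Belief and BeliefSet may live in another module of the package.
from olfactory_navigation.agents.pbvi_agent import PBVI_Agent, Belief, BeliefSet


class NoisyExpand_Agent(PBVI_Agent):
    '''Illustrative subclass: expand() returns slightly perturbed copies of existing beliefs.'''

    def expand(self, belief_set, value_function, max_generation, **kwargs):
        rng = np.random.default_rng(0)
        new_beliefs = []
        for values in belief_set.belief_array[:max_generation]:
            noisy = values + rng.uniform(0.0, 1e-3, size=values.shape)
            noisy /= noisy.sum()  # keep it a valid probability distribution over states
            new_beliefs.append(Belief(self.model, noisy))
        return BeliefSet(self.model, new_beliefs)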

initialize_state(n=1)

To use an agent within a simulation, the agent's state needs to be initialized. The initialization consists of setting the agent's initial belief. Multiple agents can be used at once in simulations; for this reason, the belief attribute is a BeliefSet by default.

Parameters:

Name Type Description Default
n int

How many agents are to be used during the simulation.

1
Source code in olfactory_navigation/agents/pbvi_agent.py
def initialize_state(self,
                     n: int = 1
                     ) -> None:
    '''
    To use an agent within a simulation, the agent's state needs to be initialized.
    The initialization consists of setting the agent's initial belief.
    Multiple agents can be used at once in simulations; for this reason, the belief attribute is a BeliefSet by default.

    Parameters
    ----------
    n : int, default=1
        How many agents are to be used during the simulation.
    '''
    assert self.value_function is not None, "Agent was not trained, run the training function first..."

    self.belief = BeliefSet(self.model, [Belief(self.model) for _ in range(n)])

kill(simulations_to_kill)

Function to kill any simulations that have not reached the source but can't continue further

Parameters:

Name Type Description Default
simulations_to_kill ndarray

A boolean array of the simulations to kill.

required
Source code in olfactory_navigation/agents/pbvi_agent.py
def kill(self,
         simulations_to_kill: np.ndarray
         ) -> None:
    '''
    Function to kill any simulations that have not reached the source but can't continue further

    Parameters
    ----------
    simulations_to_kill : np.ndarray
        A boolean array of the simulations to kill.
    '''
    if all(simulations_to_kill):
        self.belief = None
    else:
        self.belief = BeliefSet(self.belief.model, self.belief.belief_array[~simulations_to_kill])

load(folder) classmethod

Function to load a PBVI agent from a given folder it has been saved to. It will load the environment the agent has been trained on along with it.

If it is a subclass of the PBVI_Agent, an instance of that specific subclass will be returned.

Parameters:

Name Type Description Default
folder str

The agent folder.

required

Returns:

Name Type Description
instance PBVI_Agent

The loaded instance of the PBVI Agent.

Source code in olfactory_navigation/agents/pbvi_agent.py
@classmethod
def load(cls,
         folder: str
         ) -> 'PBVI_Agent':
    '''
    Function to load a PBVI agent from a given folder it has been saved to.
    It will load the environment the agent has been trained on along with it.

    If it is a subclass of the PBVI_Agent, an instance of that specific subclass will be returned.

    Parameters
    ----------
    folder : str
        The agent folder.

    Returns
    -------
    instance : PBVI_Agent
        The loaded instance of the PBVI Agent.
    '''
    # Load arguments
    arguments = None
    with open(folder + '/METADATA.json', 'r') as json_file:
        arguments = json.load(json_file)

    # Load environment
    environment = Environment.load(arguments['environment_saved_at'])

    # Load specific class
    if arguments['class'] != 'PBVI_Agent':
        from olfactory_navigation import agents
        cls = {name:obj for name, obj in inspect.getmembers(agents)}[arguments['class']]

    # Build instance
    instance = cls(
        environment=environment,
        threshold=arguments['threshold'],
        name=arguments['name'],
        seed=arguments['seed']
    )

    # Load and set the value function on the instance
    instance.value_function = ValueFunction.load(
        file=folder + '/Value_Function.npy',
        model=instance.model
    )
    instance.trained_at = arguments['trained_at']
    instance.saved_at = folder

    return instance
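
A short usage sketch of load(); the folder name below is a hypothetical example of a previously saved agent folder, not a real path.

from olfactory_navigation.agents.pbvi_agent import PBVI_Agent

# Hypothetical folder produced by a previous save(); load() reads METADATA.json and
# Value_Function.npy from it and rebuilds the correct agent subclass.
agent = PBVI_Agent.load('./Agent-FSVI_Agent-3e-06-trained_20240101_120000')
print(agent.name, agent.trained_at)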

modify_environment(new_environment)

Function to modify the environment of the agent. If the agent is already trained, the trained element should also be adapted to fit this new environment.

Parameters:

Name Type Description Default
new_environment Environment

A modified environment.

required

Returns:

Name Type Description
modified_agent PBVI_Agent

A new PBVI agent with the modified environment.

Source code in olfactory_navigation/agents/pbvi_agent.py
def modify_environment(self,
                       new_environment: Environment
                       ) -> 'Agent':
    '''
    Function to modify the environment of the agent.
    If the agent is already trained, the trained element should also be adapted to fit this new environment.

    Parameters
    ----------
    new_environment : Environment
        A modified environment.

    Returns
    -------
    modified_agent : PBVI_Agent
        A new PBVI agent with the modified environment.
    '''
    # GPU support
    if self.on_gpu:
        return self.to_cpu().modify_environment(new_environment=new_environment)

    # Creating a new agent instance
    modified_agent = self.__class__(environment=new_environment,
                                    threshold=self.threshold,
                                    name=self.name)

    # Modifying the value function
    if self.value_function is not None:
        reshaped_vf_array = np.array([cv2.resize(av, np.array(modified_agent.model.state_grid.shape)[::-1]).ravel()
                                      for av in self.value_function.alpha_vector_array.reshape(len(self.value_function), *self.model.state_grid.shape)])
        modified_vf = ValueFunction(modified_agent.model, alpha_vectors=reshaped_vf_array, action_list=self.value_function.actions)
        modified_agent.value_function = modified_vf

    return modified_agent
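
A brief usage sketch; agent and new_env are assumptions standing for a trained agent and another Environment instance (for example the same plume data cropped or rescaled).

# Illustrative only: `agent` is a trained PBVI_Agent subclass, `new_env` another Environment.
adapted_agent = agent.modify_environment(new_environment=new_env)
# The value function, if present, is resized with cv2 to match the new state grid.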

save(folder=None, force=False, save_environment=False)

The save function for PBVI Agents consists in recording the value function after the training. It saves the agent in a folder with the name of the agent (class name + training timestamp). In this folder, there will be the metadata of the agent (all the attributes) in a json format and the value function.

Optionally, the environment can be saved too to be able to load it alongside the agent for future reuse. If the agent has already been saved, the saving will not happen unless the force parameter is toggled.

Parameters:

Name Type Description Default
folder str

The folder under which to save the agent (a subfolder will be created under this folder). The agent will therefore be saved at <folder>/Agent-<agent_name>. By default the current folder is used.

None
force bool

Whether to overwrite an already saved agent with the same name at the same path.

False
save_environment bool

Whether to save the environment data along with the agent.

False
Source code in olfactory_navigation/agents/pbvi_agent.py
def save(self,
         folder: str | None = None,
         force: bool = False,
         save_environment: bool = False
         ) -> None:
    '''
    The save function for PBVI Agents consists in recording the value function after the training.
    It saves the agent in a folder with the name of the agent (class name + training timestamp).
    In this folder, there will be the metadata of the agent (all the attributes) in a json format and the value function.

    Optionally, the environment can be saved too to be able to load it alongside the agent for future reuse.
    If the agent has already been saved, the saving will not happen unless the force parameter is toggled.

    Parameters
    ----------
    folder : str, optional
        The folder under which to save the agent (a subfolder will be created under this folder).
        The agent will therefore be saved at <folder>/Agent-<agent_name> .
        By default the current folder is used.
    force : bool, default=False
        Whether to overwrite an already saved agent with the same name at the same path.
    save_environment : bool, default=False
        Whether to save the environment data along with the agent.
    '''
    assert self.trained_at is not None, "The agent is not trained, there is nothing to save."

    # GPU support
    if self.on_gpu:
        self.to_cpu().save(folder=folder, force=force, save_environment=save_environment)
        return

    # Adding env name to folder path
    if folder is None:
        folder = f'./Agent-{self.name}'
    else:
        folder += '/Agent-' + self.name

    # Checking the folder exists or creates it
    if not os.path.exists(folder):
        os.mkdir(folder)
    elif len(os.listdir(folder)):
        if force:
            shutil.rmtree(folder)
            os.mkdir(folder)
        else:
            raise Exception(f'{folder} is not empty. If you want to overwrite the saved model, enable "force".')

    # If requested save environment
    if save_environment:
        self.environment.save(folder=folder)

    # TODO: Add actions to save function
    # Generating the metadata arguments dictionary
    arguments = {}
    arguments['name'] = self.name
    arguments['class'] = self.class_name
    arguments['threshold'] = self.threshold
    arguments['environment_name'] = self.environment.name
    arguments['environment_saved_at'] = self.environment.saved_at
    arguments['trained_at'] = self.trained_at
    arguments['seed'] = self.seed

    # Output the arguments to a METADATA file
    with open(folder + '/METADATA.json', 'w') as json_file:
        json.dump(arguments, json_file, indent=4)

    # Save value function
    self.value_function.save(folder=folder, file_name='Value_Function.npy')

    # Finalization
    self.saved_at = os.path.abspath(folder).replace('\\', '/')
    print(f'Agent saved to: {folder}')
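
A short usage sketch of save(); the folder and flag values are example choices and assume agent has already been trained and that the parent folder exists.

# Illustrative only: `agent` is assumed to be a trained PBVI_Agent subclass and
# the './trained_agents' parent folder is assumed to already exist.
agent.save(folder='./trained_agents', force=True, save_environment=True)
# The agent ends up in ./trained_agents/Agent-<agent_name>/ with METADATA.json
# and Value_Function.npy inside.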

to_gpu()

Function to send the numpy arrays of the agent to the gpu. It returns a new instance of the Agent class with the arrays on the gpu

Returns:

Name Type Description
gpu_agent Agent

A copy of the agent with the arrays on the GPU.

Source code in olfactory_navigation/agents/pbvi_agent.py
def to_gpu(self) -> Agent:
    '''
    Function to send the numpy arrays of the agent to the gpu.
    It returns a new instance of the Agent class with the arrays on the gpu

    Returns
    -------
    gpu_agent : Agent
        A copy of the agent with the arrays on the GPU.
    '''
    assert gpu_support, "GPU support is not enabled, Cupy might need to be installed..."

    # Generating a new instance
    cls = self.__class__
    gpu_agent = cls.__new__(cls)

    # Copying arguments to gpu
    for arg, val in self.__dict__.items():
        if isinstance(val, np.ndarray):
            setattr(gpu_agent, arg, cp.array(val))
        elif arg == 'rnd_state':
            setattr(gpu_agent, arg, cp.random.RandomState(self.seed))
        elif isinstance(val, Model):
            setattr(gpu_agent, arg, val.gpu_model)
        elif isinstance(val, ValueFunction):
            setattr(gpu_agent, arg, val.to_gpu())
        elif isinstance(val, BeliefSet) or isinstance(val, Belief):
            setattr(gpu_agent, arg, val.to_gpu())
        else:
            setattr(gpu_agent, arg, val)

    # Self reference instances
    self._alternate_version = gpu_agent
    gpu_agent._alternate_version = self

    gpu_agent.on_gpu = True
    return gpu_agent
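
A minimal sketch; it assumes cupy is installed and that agent is an existing (CPU) agent instance.

# Illustrative only: requires cupy; `agent` is an existing agent instance.
gpu_agent = agent.to_gpu()
assert gpu_agent.on_gpu  # arrays, model, value function and beliefs now live on the GPU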

train(expansions, full_backup=True, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True, **expand_arguments)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup: 1. Expand: expands the belief set with an expansion strategy given by the expand function; 2. Backup: updates the alpha vectors based on the current belief set.

Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)

required
full_backup bool

Whether to run the backup function on the full set of beliefs uncovered since the beginning, or only on the newly generated points.

True
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function. If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True
expand_arguments kwargs

An arbitrary amount of parameters that will be passed on to the expand function.

{}

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.

Source code in olfactory_navigation/agents/pbvi_agent.py
def train(self,
          expansions: int,
          full_backup: bool = True,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True,
          **expand_arguments
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists of 2 steps, Expand and Backup.
    1. Expand: Expands the belief set with an expansion strategy given by the expand function.
    2. Backup: Updates the alpha vectors based on the current belief set.

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
    full_backup : bool, default=True
        Whether to run the backup function on the full set of beliefs uncovered since the beginning, or only on the newly generated points.
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function.
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.
    expand_arguments : kwargs
        An arbitrary amount of parameters that will be passed on to the expand function.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    # GPU support
    if use_gpu and not self.on_gpu:
        gpu_agent = self.to_gpu()
        solver_history = super(self.__class__, gpu_agent).train(
            expansions=expansions,
            full_backup=full_backup,
            update_passes=update_passes,
            max_belief_growth=max_belief_growth,
            initial_belief=initial_belief,
            initial_value_function=initial_value_function,
            prune_level=prune_level,
            prune_interval=prune_interval,
            limit_value_function_size=limit_value_function_size,
            gamma=gamma,
            eps=eps,
            use_gpu=use_gpu,
            history_tracking_level=history_tracking_level,
            overwrite_training=overwrite_training,
            print_progress=print_progress,
            print_stats=print_stats,
            **expand_arguments
        )
        self.value_function = gpu_agent.value_function.to_cpu()
        return solver_history

    xp = np if not self.on_gpu else cp

    # Getting model
    model = self.model

    # Initial belief
    if initial_belief is None:
        belief_set = BeliefSet(model, [Belief(model)])
    elif isinstance(initial_belief, BeliefSet):
        belief_set = initial_belief.to_gpu() if self.on_gpu else initial_belief 
    else:
        initial_belief = Belief(model, xp.array(initial_belief.values))
        belief_set = BeliefSet(model, [initial_belief])

    # Handling the case where the agent is already trained
    if (self.value_function is not None):
        if overwrite_training:
            print('[warning] The value function is being overwritten')
            self.trained_at = None
            self.name = '-'.join(self.name.split('-')[:-1])
            self.value_function = None
        else:
            initial_value_function = self.value_function

    # Initial value function
    if initial_value_function is None:
        value_function = ValueFunction(model, model.expected_rewards_table.T, model.actions)
    else:
        value_function = initial_value_function.to_gpu() if self.on_gpu else initial_value_function

    # Convergence check boundary
    max_allowed_change = eps * (gamma / (1-gamma))

    # History tracking
    training_history = TrainingHistory(tracking_level=history_tracking_level,
                                       model=model,
                                       gamma=gamma,
                                       eps=eps,
                                       expand_append=full_backup,
                                       initial_value_function=value_function,
                                       initial_belief_set=belief_set)

    # Loop
    iteration = 0
    expand_value_function = value_function
    old_value_function = value_function

    try:
        iterator = trange(expansions, desc='Expansions') if print_progress else range(expansions)
        iterator_postfix = {}
        for expansion_i in iterator:

            # 1: Expand belief set
            start_ts = datetime.now()

            new_belief_set = self.expand(belief_set=belief_set,
                                         value_function=value_function,
                                         max_generation=max_belief_growth,
                                         **expand_arguments)

            # Add new beliefs points to the total belief_set
            belief_set = belief_set.union(new_belief_set)

            expand_time = (datetime.now() - start_ts).total_seconds()
            training_history.add_expand_step(expansion_time=expand_time, belief_set=belief_set)

            # 2: Backup, update value function (alpha vector set)
            for _ in range(update_passes) if (not print_progress or update_passes <= 1) else trange(update_passes, desc=f'Backups {expansion_i}'):
                start_ts = datetime.now()

                # Backup step
                value_function = self.backup(belief_set if full_backup else new_belief_set,
                                             value_function,
                                             gamma=gamma,
                                             append=(not full_backup),
                                             belief_dominance_prune=False)
                backup_time = (datetime.now() - start_ts).total_seconds()

                # Additional pruning
                if (iteration % prune_interval) == 0 and iteration > 0:
                    start_ts = datetime.now()
                    vf_len = len(value_function)

                    value_function.prune(prune_level)

                    prune_time = (datetime.now() - start_ts).total_seconds()
                    alpha_vectors_pruned = len(value_function) - vf_len
                    training_history.add_prune_step(prune_time, alpha_vectors_pruned)

                # Check if value function size is above threshold
                if limit_value_function_size >= 0 and len(value_function) > limit_value_function_size:
                    # Compute matrix multiplications between avs and beliefs
                    alpha_value_per_belief = xp.matmul(value_function.alpha_vector_array, belief_set.belief_array.T)

                    # Select the useful alpha vectors
                    best_alpha_vector_per_belief = xp.argmax(alpha_value_per_belief, axis=0)
                    useful_alpha_vectors = xp.unique(best_alpha_vector_per_belief)

                    # Select a random selection of vectors to delete
                    unuseful_alpha_vectors = xp.delete(xp.arange(len(value_function)), useful_alpha_vectors)
                    random_vectors_to_delete = self.rnd_state.choice(unuseful_alpha_vectors,
                                                                     size=max_belief_growth,
                                                                     p=(xp.arange(len(unuseful_alpha_vectors))[::-1] / xp.sum(xp.arange(len(unuseful_alpha_vectors)))))
                                                                     # replace=False,
                                                                     # p=1/len(unuseful_alpha_vectors))

                    value_function = ValueFunction(model=model,
                                                   alpha_vectors=xp.delete(value_function.alpha_vector_array, random_vectors_to_delete, axis=0),
                                                   action_list=xp.delete(value_function.actions, random_vectors_to_delete))

                    iterator_postfix['|useful|'] = useful_alpha_vectors.shape[0]

                # Compute the change between value functions
                max_change = self.compute_change(value_function, old_value_function, belief_set)

                # History tracking
                training_history.add_backup_step(backup_time, max_change, value_function)

                # Convergence check
                if max_change < max_allowed_change:
                    break

                old_value_function = value_function

                # Update iteration counter
                iteration += 1

            # Compute change with old expansion value function
            expand_max_change = self.compute_change(expand_value_function, value_function, belief_set)

            if expand_max_change < max_allowed_change:
                if print_progress:
                    print('Converged!')
                break

            expand_value_function = value_function

            iterator_postfix['|V|'] = len(value_function)
            iterator_postfix['|B|'] = len(belief_set)

            if print_progress:
                iterator.set_postfix(iterator_postfix)

    except MemoryError as e:
        print(f'Memory full: {e}')
        print('Returning value function and history as is...\n')

    # Final pruning
    start_ts = datetime.now()
    vf_len = len(value_function)

    value_function.prune(prune_level)

    # History tracking
    prune_time = (datetime.now() - start_ts).total_seconds()
    alpha_vectors_pruned = len(value_function) - vf_len
    training_history.add_prune_step(prune_time, alpha_vectors_pruned)

    # Record when it was trained
    self.trained_at = datetime.now().strftime("%Y%m%d_%H%M%S")
    self.name += f'-trained_{self.trained_at}'

    # Saving value function
    self.value_function = value_function

    # Print stats if requested
    if print_stats:
        print(training_history.summary)

    return training_history
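
For reference, the size-limiting step in the training loop above can be illustrated in isolation. Below is a minimal NumPy sketch (toy sizes, hypothetical names) of how the 'useful' alpha vectors are identified: each belief keeps the alpha vector maximizing its dot product, and any vector not selected by at least one belief becomes a candidate for removal.

import numpy as np

rng = np.random.default_rng(0)
n_alpha, n_belief, n_state = 8, 5, 4                      # toy sizes (assumptions)
alpha_vectors = rng.random((n_alpha, n_state))
beliefs = rng.random((n_belief, n_state))
beliefs /= beliefs.sum(axis=1, keepdims=True)             # rows sum to 1, like belief points

# Value of every alpha vector at every belief point
alpha_value_per_belief = alpha_vectors @ beliefs.T        # shape (n_alpha, n_belief)

# Alpha vectors used by at least one belief are kept; the rest can be dropped
useful = np.unique(np.argmax(alpha_value_per_belief, axis=0))
removable = np.setdiff1d(np.arange(n_alpha), useful)
print(useful, removable)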

update_state(observation, source_reached)

Function to update the internal state(s) of the agent(s) based on the previous action(s) taken and the observation(s) received.

Parameters:

Name Type Description Default
observation ndarray

The observation(s) the agent(s) made.

required
source_reached ndarray

A boolean array of whether the agent(s) have reached the source or not.

required

Returns:

Name Type Description
update_successful (ndarray, optional)

If nothing is returned, it means all the agents' state updates have been successful. Else, a boolean np.ndarray of size n can be returned confirming for each agent whether the update has been successful or not.

Source code in olfactory_navigation/agents/pbvi_agent.py
def update_state(self,
                 observation: np.ndarray,
                 source_reached: np.ndarray
                 ) -> None | np.ndarray:
    '''
    Function to update the internal state(s) of the agent(s) based on the previous action(s) taken and the observation(s) received.

    Parameters
    ----------
    observation : np.ndarray
        The observation(s) the agent(s) made.
    source_reached : np.ndarray
        A boolean array of whether the agent(s) have reached the source or not.

    Returns
    -------
    update_successful : np.ndarray, optional
        If nothing is returned, it means all the agents' state updates have been successful.
        Else, a boolean np.ndarray of size n can be returned confirming for each agent whether the update has been successful or not.
    '''
    assert self.belief is not None, "Agent was not initialized yet, run the initialize_state function first"

    # GPU support
    xp = np if not self.on_gpu else cp

    # TODO: Make dedicated observation discretization function
    # Set the thresholds as a vector
    threshold = self.threshold
    if not isinstance(threshold, list):
        threshold = [threshold]

    # Ensure -inf and +inf begin and end the threshold list
    if threshold[0] != -xp.inf:
        threshold = [-xp.inf] + threshold

    if threshold[-1] != xp.inf:
        threshold = threshold + [xp.inf]
    threshold = xp.array(threshold)

    # Setting observation ids
    observation_ids = xp.argwhere((observation[:,None] >= threshold[:-1][None,:]) & (observation[:,None] < threshold[1:][None,:]))[:,1]
    observation_ids[source_reached] = len(threshold) # Observing the source: the goal is always the last observation, with len(threshold)-1 being the number of odor observation buckets.

    # Update the set of beliefs
    self.belief = self.belief.update(actions=self.action_played, observations=observation_ids, throw_error=False)

    # Check for failed updates
    update_successful = (self.belief.belief_array.sum(axis=1) != 0.0)

    return update_successful
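
The observation discretization performed above can be reproduced on its own. Here is a minimal sketch, assuming plain NumPy arrays and made-up threshold values, of how raw odor readings are bucketed between thresholds padded with -inf and +inf:

import numpy as np

thresholds = np.array([-np.inf, 3e-6, 1e-4, np.inf])   # hypothetical 2-threshold setup -> 3 odor levels
observation = np.array([0.0, 5e-6, 2e-4])              # raw odor readings for 3 simulated agents

# Bucket id of each reading: index of the interval [thresholds[i], thresholds[i+1]) it falls into
observation_ids = np.argwhere(
    (observation[:, None] >= thresholds[:-1][None, :]) &
    (observation[:, None] <  thresholds[1:][None, :])
)[:, 1]
print(observation_ids)   # -> [0 1 2]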

PBVI_GER_Agent

Bases: PBVI_Agent

A flavor of the PBVI Agent. The expand function chooses the belief points that most reduce the error in the value function (and therefore improve its value the most).

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (ie: if the environment has layers, it should contain a layer component in the action vector, and similarly for a third dimension). Else, a dict of strings and action vectors where the strings represent the action labels. If none is provided, by default, all unit movement vectors are included, and such for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, we use an exact conversion that keeps the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes he is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

Source code in olfactory_navigation/agents/pbvi_ger_agent.py
class PBVI_GER_Agent(PBVI_Agent):
    '''
    A flavor of the PBVI Agent. The expand function chooses the belief points that most reduce the error in the value function (and therefore improve its value the most).

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of actions available to the agent. It should match the type of environment (ie: if the environment has layers, it should contain a layer component in the action vector, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included, and such for all layers (if the environment has layers).
    name : str, optional
        A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact conversion that keeps the shape of the environment to determine the number of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducibility).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int
               ) -> BeliefSet:
        '''
        Greedy Error Reduction.
        It attempts to choose the beliefs that will maximize the improvement of the value function by minimizing the error.
        The error is computed as the dot product between the difference of the two beliefs and the difference of their two corresponding alpha vectors.

        Parameters
        ----------
        belief_set : BeliefSet
            List of beliefs to expand on.
        value_function : ValueFunction
            The current value function. Used to compute the value at belief points.
        max_generation : int, default=10
            The max amount of beliefs that can be added to the belief set at once.

        Returns
        -------
        belief_set_new : BeliefSet
            Union of the belief_set and the expansions of the beliefs in the belief_set.
        '''
        # GPU support
        xp = np if not self.on_gpu else cp
        model = self.model

        old_shape = belief_set.belief_array.shape
        to_generate = min(max_generation, old_shape[0])

        new_belief_array = xp.empty((old_shape[0] + to_generate, old_shape[1]))
        new_belief_array[:old_shape[0]] = belief_set.belief_array

        # Finding the min and max rewards for computation of the epsilon
        r_min = model._min_reward / (1 - self.gamma)
        r_max = model._max_reward / (1 - self.gamma)

        # Generation of all potential successor beliefs
        successor_beliefs = xp.array([[[b.update(a,o).values for o in model.observations] for a in model.actions] for b in belief_set.belief_list])

        # Finding the alphas associated with each previous beliefs
        best_alpha = xp.argmax(xp.dot(belief_set.belief_array, value_function.alpha_vector_array.T), axis = 1)
        b_alphas = value_function.alpha_vector_array[best_alpha]

        # Difference between beliefs and their successors
        b_diffs = successor_beliefs - belief_set.belief_array[:,None,None,:]

        # Computing a 'next' alpha vector made of the max and min
        alphas_p = xp.where(b_diffs >= 0, r_max, r_min)

        # Difference between alpha vectors and their successors alpha vector
        alphas_diffs = alphas_p - b_alphas[:,None,None,:]

        # Computing epsilon for all successor beliefs
        eps = xp.einsum('baos,baos->bao', alphas_diffs, b_diffs)

        # Computing the probability of the b and doing action a and receiving observation o
        bao_probs = xp.einsum('bs,saor->bao', belief_set.belief_array, model.reachable_transitional_observation_table)

        # Taking the sumproduct of the probs with the epsilons
        res = xp.einsum('bao,bao->ba', bao_probs, eps)

        # Picking the correct amount of initial beliefs and ideal actions
        b_stars, a_stars = xp.unravel_index(xp.argsort(res, axis=None)[::-1][:to_generate], res.shape)

        # And picking the ideal observations
        o_star = xp.argmax(bao_probs[b_stars[:,None], a_stars[:,None], model.observations[None,:]] * eps[b_stars[:,None], a_stars[:,None], model.observations[None,:]], axis=1)

        # Selecting the successor beliefs
        new_belief_array = successor_beliefs[b_stars[:,None], a_stars[:,None], o_star[:,None], model.states[None,:]]

        return BeliefSet(model, new_belief_array)


    def train(self,
              expansions: int,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists of 2 steps, Expand and Backup.
        1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
        2. Backup: Updates the alpha vectors based on the current belief set

        Greedy Error Reduction Point-Based Value Iteration:
        - By default, it performs the backup on the whole set of beliefs generated since the start (i.e. full_backup=True).

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed change for the value function.
            Below this amount of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        return super().train(expansions = expansions,
                             full_backup = True,
                             update_passes = update_passes,
                             max_belief_growth = max_belief_growth,
                             initial_belief = initial_belief,
                             initial_value_function = initial_value_function,
                             prune_level = prune_level,
                             prune_interval = prune_interval,
                             limit_value_function_size = limit_value_function_size,
                             gamma = gamma,
                             eps = eps,
                             use_gpu = use_gpu,
                             history_tracking_level = history_tracking_level,
                             overwrite_training = overwrite_training,
                             print_progress = print_progress,
                             print_stats = print_stats)

expand(belief_set, value_function, max_generation)

Greedy Error Reduction. It attempts to choose the beliefs that will maximize the improvement of the value function by minimizing the error. The error is computed as the dot product between the difference of the two beliefs and the difference of their two corresponding alpha vectors.

Parameters:

Name Type Description Default
belief_set BeliefSet

List of beliefs to expand on.

required
value_function ValueFunction

The current value function. Used to compute the value at belief points.

required
max_generation int

The max amount of beliefs that can be added to the belief set at once.

10

Returns:

Name Type Description
belief_set_new BeliefSet

Union of the belief_set and the expansions of the beliefs in the belief_set.

Source code in olfactory_navigation/agents/pbvi_ger_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int
           ) -> BeliefSet:
    '''
    Greedy Error Reduction.
    It attempts to choose the beliefs that will maximize the improvement of the value function by minimizing the error.
    The error is computed as the dot product between the difference of the two beliefs and the difference of their two corresponding alpha vectors.

    Parameters
    ----------
    belief_set : BeliefSet
        List of beliefs to expand on.
    value_function : ValueFunction
        The current value function. Used to compute the value at belief points.
    max_generation : int, default=10
        The max amount of beliefs that can be added to the belief set at once.

    Returns
    -------
    belief_set_new : BeliefSet
        Union of the belief_set and the expansions of the beliefs in the belief_set.
    '''
    # GPU support
    xp = np if not self.on_gpu else cp
    model = self.model

    old_shape = belief_set.belief_array.shape
    to_generate = min(max_generation, old_shape[0])

    new_belief_array = xp.empty((old_shape[0] + to_generate, old_shape[1]))
    new_belief_array[:old_shape[0]] = belief_set.belief_array

    # Finding the min and max rewards for computation of the epsilon
    r_min = model._min_reward / (1 - self.gamma)
    r_max = model._max_reward / (1 - self.gamma)

    # Generation of all potential successor beliefs
    successor_beliefs = xp.array([[[b.update(a,o).values for o in model.observations] for a in model.actions] for b in belief_set.belief_list])

    # Finding the alphas associated with each previous beliefs
    best_alpha = xp.argmax(xp.dot(belief_set.belief_array, value_function.alpha_vector_array.T), axis = 1)
    b_alphas = value_function.alpha_vector_array[best_alpha]

    # Difference between beliefs and their successors
    b_diffs = successor_beliefs - belief_set.belief_array[:,None,None,:]

    # Computing a 'next' alpha vector made of the max and min
    alphas_p = xp.where(b_diffs >= 0, r_max, r_min)

    # Difference between alpha vectors and their successors alpha vector
    alphas_diffs = alphas_p - b_alphas[:,None,None,:]

    # Computing epsilon for all successor beliefs
    eps = xp.einsum('baos,baos->bao', alphas_diffs, b_diffs)

    # Computing the probability of the b and doing action a and receiving observation o
    bao_probs = xp.einsum('bs,saor->bao', belief_set.belief_array, model.reachable_transitional_observation_table)

    # Taking the sumproduct of the probs with the epsilons
    res = xp.einsum('bao,bao->ba', bao_probs, eps)

    # Picking the correct amount of initial beliefs and ideal actions
    b_stars, a_stars = xp.unravel_index(xp.argsort(res, axis=None)[::-1][:to_generate], res.shape)

    # And picking the ideal observations
    o_star = xp.argmax(bao_probs[b_stars[:,None], a_stars[:,None], model.observations[None,:]] * eps[b_stars[:,None], a_stars[:,None], model.observations[None,:]], axis=1)

    # Selecting the successor beliefs
    new_belief_array = successor_beliefs[b_stars[:,None], a_stars[:,None], o_star[:,None], model.states[None,:]]

    return BeliefSet(model, new_belief_array)
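
To make the error-bound computation above more concrete, here is a minimal NumPy sketch with toy shapes (all sizes and reward bounds are assumptions). It mirrors the einsum step that scores each (belief, action, observation) successor by how much it could improve the value function:

import numpy as np

rng = np.random.default_rng(0)
B, A, O, S = 4, 3, 2, 5                      # toy counts of beliefs, actions, observations, states
r_min, r_max = -10.0, 10.0                   # hypothetical discounted reward bounds

beliefs = rng.random((B, S))
beliefs /= beliefs.sum(axis=1, keepdims=True)
successor_beliefs = rng.random((B, A, O, S))
successor_beliefs /= successor_beliefs.sum(axis=-1, keepdims=True)
b_alphas = rng.random((B, S))                # alpha vector backing each belief

b_diffs = successor_beliefs - beliefs[:, None, None, :]
alphas_p = np.where(b_diffs >= 0, r_max, r_min)            # optimistic/pessimistic bound per component
alphas_diffs = alphas_p - b_alphas[:, None, None, :]
eps = np.einsum('baos,baos->bao', alphas_diffs, b_diffs)   # error bound per (belief, action, observation)
print(eps.shape)                             # (4, 3, 2)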

train(expansions, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup. 1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function 2. Backup: Updates the alpha vectors based on the current belief set

Greedy Error Reduction Point-Based Value Iteration: - By default, it performs the backup on the whole set of beliefs generated since the start (i.e. full_backup=True).

Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)

required
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.

Source code in olfactory_navigation/agents/pbvi_ger_agent.py
def train(self,
          expansions: int,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists of 2 steps, Expand and Backup.
    1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
    2. Backup: Updates the alpha vectors based on the current belief set

    Greedy Error Reduction Point-Based Value Iteration:
    - By default, it performs the backup on the whole set of beliefs generated since the start (i.e. full_backup=True).

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    return super().train(expansions = expansions,
                         full_backup = True,
                         update_passes = update_passes,
                         max_belief_growth = max_belief_growth,
                         initial_belief = initial_belief,
                         initial_value_function = initial_value_function,
                         prune_level = prune_level,
                         prune_interval = prune_interval,
                         limit_value_function_size = limit_value_function_size,
                         gamma = gamma,
                         eps = eps,
                         use_gpu = use_gpu,
                         history_tracking_level = history_tracking_level,
                         overwrite_training = overwrite_training,
                         print_progress = print_progress,
                         print_stats = print_stats)
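
A minimal usage sketch of this training entry point; the environment object env and the exact import path are assumptions and may differ in your setup:

from olfactory_navigation.agents.pbvi_ger_agent import PBVI_GER_Agent

# env is assumed to be an already constructed olfactory Environment instance
agent = PBVI_GER_Agent(environment=env, threshold=3e-6)
history = agent.train(expansions=20,
                      update_passes=1,
                      max_belief_growth=10,
                      gamma=0.99,
                      eps=1e-6)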

PBVI_RA_Agent

Bases: PBVI_Agent

A flavor of the PBVI Agent. The expand function chooses random belief points.

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (ie: if the environment has layers, it should contain a layer component in the action vector, and similarly for a third dimension). Else, a dict of strings and action vectors where the strings represent the action labels. If none is provided, by default, all unit movement vectors are included, and such for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, we use an exact conversion that keeps the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes he is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

Source code in olfactory_navigation/agents/pbvi_ra_agent.py
class PBVI_RA_Agent(PBVI_Agent):
    '''
    A flavor of the PBVI Agent. The expand function chooses random belief points.

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of actions available to the agent. It should match the type of environment (ie: if the environment has layers, it should contain a layer component in the action vector, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included, and such for all layers (if the environment has layers).
    name : str, optional
        A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact conversion that keeps the shape of the environment to determine the number of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducibility).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int
               ) -> BeliefSet:
        '''
        This expansion technique relies only on randomness and will generate at most 'max_generation' beliefs.

        Parameters
        ----------
        belief_set : BeliefSet
            List of beliefs to expand on.
        value_function : ValueFunction
            The current value function. (NOT USED)
        max_generation : int, default=10
            The max amount of beliefs that can be added to the belief set at once.

        Returns
        -------
        new_belief_set : BeliefSet
            Union of the belief_set and the expansions of the beliefs in the belief_set.
        '''
        # GPU support
        xp = np if not self.on_gpu else cp
        model = self.model

        # How many new beliefs to add
        generation_count = min(belief_set.belief_array.shape[0], max_generation)

        # Generation of the new beliefs at random
        new_beliefs = self.rnd_state.random((generation_count, model.state_count))
        new_beliefs /= xp.sum(new_beliefs, axis=1)[:,None]

        return BeliefSet(model, new_beliefs)


    def train(self,
              expansions: int,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists of 2 steps, Expand and Backup.
        1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
        2. Backup: Updates the alpha vectors based on the current belief set

        Random Point-Based Value Iteration:
        - By default, it performs the backup on the whole set of beliefs generated since the start (i.e. full_backup=True).

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed change for the value function.
            Below this amount of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        return super().train(expansions = expansions,
                             full_backup = True,
                             update_passes = update_passes,
                             max_belief_growth = max_belief_growth,
                             initial_belief = initial_belief,
                             initial_value_function = initial_value_function,
                             prune_level = prune_level,
                             prune_interval = prune_interval,
                             limit_value_function_size = limit_value_function_size,
                             gamma = gamma,
                             eps = eps,
                             use_gpu = use_gpu,
                             history_tracking_level = history_tracking_level,
                             overwrite_training = overwrite_training,
                             print_progress = print_progress,
                             print_stats = print_stats)

expand(belief_set, value_function, max_generation)

This expansion technique relies only on randomness and will generate at most 'max_generation' beliefs.

Parameters:

Name Type Description Default
belief_set BeliefSet

List of beliefs to expand on.

required
value_function ValueFunction

The current value function. (NOT USED)

required
max_generation int

The max amount of beliefs that can be added to the belief set at once.

10

Returns:

Name Type Description
new_belief_set BeliefSet

Union of the belief_set and the expansions of the beliefs in the belief_set.

Source code in olfactory_navigation/agents/pbvi_ra_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int
           ) -> BeliefSet:
    '''
    This expansion technique relies only on randomness and will generate at most 'max_generation' beliefs.

    Parameters
    ----------
    belief_set : BeliefSet
        List of beliefs to expand on.
    value_function : ValueFunction
        The current value function. (NOT USED)
    max_generation : int, default=10
        The max amount of beliefs that can be added to the belief set at once.

    Returns
    -------
    new_belief_set : BeliefSet
        Union of the belief_set and the expansions of the beliefs in the belief_set.
    '''
    # GPU support
    xp = np if not self.on_gpu else cp
    model = self.model

    # How many new beliefs to add
    generation_count = min(belief_set.belief_array.shape[0], max_generation)

    # Generation of the new beliefs at random
    new_beliefs = self.rnd_state.random((generation_count, model.state_count))
    new_beliefs /= xp.sum(new_beliefs, axis=1)[:,None]

    return BeliefSet(model, new_beliefs)
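
The random expansion above amounts to sampling points on the probability simplex by normalizing uniform draws, exactly as in the listing. A standalone NumPy sketch (toy state count, seed chosen arbitrarily):

import numpy as np

rng = np.random.RandomState(12131415)           # the agent keeps a similar RandomState internally
state_count = 6                                 # toy number of states
generation_count = 4

new_beliefs = rng.random_sample((generation_count, state_count))
new_beliefs /= new_beliefs.sum(axis=1)[:, None] # each row now sums to 1 -> a valid belief point
print(new_beliefs.sum(axis=1))                  # -> [1. 1. 1. 1.]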

train(expansions, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup. 1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function 2. Backup: Updates the alpha vectors based on the current belief set

Random Point-Based Value Iteration: - By default, it performs the backup on the whole set of beliefs generated since the start (i.e. full_backup=True).

Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)

required
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.

Source code in olfactory_navigation/agents/pbvi_ra_agent.py
def train(self,
          expansions: int,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists of 2 steps, Expand and Backup.
    1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
    2. Backup: Updates the alpha vectors based on the current belief set

    Random Point-Based Value Iteration:
    - By default, it performs the backup on the whole set of beliefs generated since the start (i.e. full_backup=True).

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    return super().train(expansions = expansions,
                         full_backup = True,
                         update_passes = update_passes,
                         max_belief_growth = max_belief_growth,
                         initial_belief = initial_belief,
                         initial_value_function = initial_value_function,
                         prune_level = prune_level,
                         prune_interval = prune_interval,
                         limit_value_function_size = limit_value_function_size,
                         gamma = gamma,
                         eps = eps,
                         use_gpu = use_gpu,
                         history_tracking_level = history_tracking_level,
                         overwrite_training = overwrite_training,
                         print_progress = print_progress,
                         print_stats = print_stats)

PBVI_SSEA_Agent

Bases: PBVI_Agent

A flavor of the PBVI Agent. The expand function chooses belief points that are furthest away (in L2 distance) from any belief point already in the belief set.

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (ie: if the environment has layers, it should contain a layer component in the action vector, and similarly for a third dimension). Else, a dict of strings and action vectors where the strings represent the action labels. If none is provided, by default, all unit movement vectors are included, and such for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, we use an exact conversion that keeps the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes he is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.
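A minimal usage sketch is given below. It relies only on the constructor and train parameters documented on this page; the construction of the Environment is not shown, and the parameter values are illustrative assumptions rather than recommendations.

from olfactory_navigation.agents import PBVI_SSEA_Agent

def train_ssea_agent(environment):
    # 'environment' is assumed to be an already-built olfactory Environment instance.
    agent = PBVI_SSEA_Agent(environment=environment, threshold=3e-6, seed=12131415)

    # Expand the belief set 20 times, adding at most 10 beliefs per expansion,
    # with one backup pass after each expansion.
    history = agent.train(expansions=20, update_passes=1, max_belief_growth=10)
    return agent, history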

Source code in olfactory_navigation/agents/pbvi_ssea_agent.py
class PBVI_SSEA_Agent(PBVI_Agent):
    '''
    A flavor of the PBVI Agent. The expand function consists in choosing the candidate belief points that are furthest away (in L2 distance) from any belief point already in the belief set.

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of thresholds is provided, the agent should be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of actions available to the agent. It should match the type of environment (i.e.: if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included, and this is done for all layers (if the environment has layers).
    name : str, optional
        A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact conversion that preserves the shape of the environment to determine the number of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducibility).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int
               ) -> BeliefSet:
        '''
        Stochastic Simulation with Exploratory Action.
        Simulates running steps forward for each possible action, assuming we are in a state s chosen randomly according to the belief probability.
        These lead to a new state s_p and an observation o for each action.
        From each action a and observation o we can generate an updated belief.
        It then keeps the beliefs that are furthest away from the other beliefs, meaning it explores the belief space the most.

        Parameters
        ----------
        belief_set : BeliefSet
            List of beliefs to expand on.
        value_function : ValueFunction
            The current value function. (NOT USED)
        max_generation : int, default=10
            The max amount of beliefs that can be added to the belief set at once.

        Returns
        -------
        belief_set_new : BeliefSet
            Union of the belief_set and the expansions of the beliefs in the belief_set.
        '''
        # GPU support
        xp = np if not self.on_gpu else cp
        model = self.model

        old_shape = belief_set.belief_array.shape
        to_generate = min(max_generation, old_shape[0])

        # Generation of successors
        successor_beliefs = xp.array([[[b.update(a,o).values for o in model.observations] for a in model.actions] for b in belief_set.belief_list])

        # Compute the distances between each successor belief and each source belief
        diff = (belief_set.belief_array[:, None,None,None, :] - successor_beliefs)
        dist = xp.sqrt(xp.einsum('bnaos,bnaos->bnao', diff, diff))

        # Taking the min distance for each belief
        belief_min_dists = xp.min(dist,axis=0)

        # Taking the successors furthest from the existing belief set
        b_star, a_star, o_star = xp.unravel_index(xp.argsort(belief_min_dists, axis=None)[::-1][:to_generate], successor_beliefs.shape[:-1])

        # Selecting successor beliefs
        new_belief_array = successor_beliefs[b_star[:,None], a_star[:,None], o_star[:,None], model.states[None,:]]

        return BeliefSet(model, new_belief_array)


    def train(self,
              expansions: int,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists of 2 steps, Expand and Backup.
        1. Expand: Expands the belief set with the expansion strategy given by the expand function
        2. Backup: Updates the alpha vectors based on the current belief set

        Stochastic Search with Exploratory Action Point-Based Value Iteration:
        - By default it performs the backup on the whole set of beliefs generated since the start (so full_backup=True).

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed change for the value function.
            Below this amount of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        return super().train(expansions = expansions,
                             full_backup = True,
                             update_passes = update_passes,
                             max_belief_growth = max_belief_growth,
                             initial_belief = initial_belief,
                             initial_value_function = initial_value_function,
                             prune_level = prune_level,
                             prune_interval = prune_interval,
                             limit_value_function_size = limit_value_function_size,
                             gamma = gamma,
                             eps = eps,
                             use_gpu = use_gpu,
                             history_tracking_level = history_tracking_level,
                             overwrite_training = overwrite_training,
                             print_progress = print_progress,
                             print_stats = print_stats)

expand(belief_set, value_function, max_generation)

Stochastic Simulation with Exploratory Action. Simulates running steps forward for each possible action, assuming we are in a state s chosen randomly according to the belief probability. These lead to a new state s_p and an observation o for each action. From each action a and observation o we can generate an updated belief. It then keeps the beliefs that are furthest away from the other beliefs, meaning it explores the belief space the most.

Parameters:

Name Type Description Default
belief_set BeliefSet

List of beliefs to expand on.

required
value_function ValueFunction

The current value function. (NOT USED)

required
max_generation int

The max amount of beliefs that can be added to the belief set at once.

10

Returns:

Name Type Description
belief_set_new BeliefSet

Union of the belief_set and the expansions of the beliefs in the belief_set.
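For intuition, the selection step can be sketched with plain NumPy as below. This is a simplified illustration of the distance computation, assuming the candidate successor beliefs have already been generated into an array; it is not the library's implementation.

import numpy as np

def select_furthest_successors(belief_array, successor_beliefs, to_generate):
    # belief_array:      (B, S) array of current belief points
    # successor_beliefs: (B, A, O, S) candidate successors (one per belief/action/observation)
    diff = belief_array[:, None, None, None, :] - successor_beliefs
    dist = np.sqrt(np.einsum('bnaos,bnaos->bnao', diff, diff))

    # Distance from each candidate to its closest existing belief point
    min_dist = dist.min(axis=0)

    # Keep the candidates whose nearest existing belief is furthest away
    flat_order = np.argsort(min_dist, axis=None)[::-1][:to_generate]
    b, a, o = np.unravel_index(flat_order, successor_beliefs.shape[:-1])
    return successor_beliefs[b, a, o]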

Source code in olfactory_navigation/agents/pbvi_ssea_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int
           ) -> BeliefSet:
    '''
    Stochastic Simulation with Exploratory Action.
    Simulates running steps forward for each possible action, assuming we are in a state s chosen randomly according to the belief probability.
    These lead to a new state s_p and an observation o for each action.
    From each action a and observation o we can generate an updated belief.
    It then keeps the beliefs that are furthest away from the other beliefs, meaning it explores the belief space the most.

    Parameters
    ----------
    belief_set : BeliefSet
        List of beliefs to expand on.
    value_function : ValueFunction
        The current value function. (NOT USED)
    max_generation : int, default=10
        The max amount of beliefs that can be added to the belief set at once.

    Returns
    -------
    belief_set_new : BeliefSet
        Union of the belief_set and the expansions of the beliefs in the belief_set.
    '''
    # GPU support
    xp = np if not self.on_gpu else cp
    model = self.model

    old_shape = belief_set.belief_array.shape
    to_generate = min(max_generation, old_shape[0])

    # Generation of successors
    successor_beliefs = xp.array([[[b.update(a,o).values for o in model.observations] for a in model.actions] for b in belief_set.belief_list])

    # Compute the distances between each successor belief and each source belief
    diff = (belief_set.belief_array[:, None,None,None, :] - successor_beliefs)
    dist = xp.sqrt(xp.einsum('bnaos,bnaos->bnao', diff, diff))

    # Taking the min distance for each belief
    belief_min_dists = xp.min(dist,axis=0)

    # Taking the successors furthest from the existing belief set
    b_star, a_star, o_star = xp.unravel_index(xp.argsort(belief_min_dists, axis=None)[::-1][:to_generate], successor_beliefs.shape[:-1])

    # Selecting successor beliefs
    new_belief_array = successor_beliefs[b_star[:,None], a_star[:,None], o_star[:,None], model.states[None,:]]

    return BeliefSet(model, new_belief_array)

train(expansions, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup. 1. Expand: Expands the belief set with the expansion strategy given by the expand function. 2. Backup: Updates the alpha vectors based on the current belief set.

Stochastic Search with Exploratory Action Point-Based Value Iteration: - By default it performs the backup on the whole set of beliefs generated since the start (so full_backup=True).

Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)

required
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function. If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.
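A hedged example of a typical call, using only the parameters documented above (the chosen values are illustrative, not recommendations):

# 'agent' is assumed to be an already-constructed PBVI_SSEA_Agent
history = agent.train(
    expansions=30,                 # expand/backup rounds
    update_passes=1,               # backup passes per expansion
    max_belief_growth=10,          # beliefs added per expansion
    prune_interval=10,             # prune the value function every 10 backups
    limit_value_function_size=-1,  # let the value function grow freely
    use_gpu=False,
)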

Source code in olfactory_navigation/agents/pbvi_ssea_agent.py
def train(self,
          expansions: int,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists of 2 steps, Expand and Backup.
    1. Expand: Expands the belief set with the expansion strategy given by the expand function
    2. Backup: Updates the alpha vectors based on the current belief set

    Stochastic Search with Exploratory Action Point-Based Value Iteration:
    - By default it performs the backup on the whole set of beliefs generated since the start (so full_backup=True).

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    return super().train(expansions = expansions,
                         full_backup = True,
                         update_passes = update_passes,
                         max_belief_growth = max_belief_growth,
                         initial_belief = initial_belief,
                         initial_value_function = initial_value_function,
                         prune_level = prune_level,
                         prune_interval = prune_interval,
                         limit_value_function_size = limit_value_function_size,
                         gamma = gamma,
                         eps = eps,
                         use_gpu = use_gpu,
                         history_tracking_level = history_tracking_level,
                         overwrite_training = overwrite_training,
                         print_progress = print_progress,
                         print_stats = print_stats)

PBVI_SSGA_Agent

Bases: PBVI_Agent

A flavor of the PBVI Agent. The expand function consists in choosing actions in an epsilon-greedy fashion, generating random observations, and generating belief points based on those.

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent should be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e.: if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Else, a dict of strings and action vectors where the strings represent the action labels. If none is provided, by default, all unit movement vectors are included, and this is done for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, we use an exact conversion that preserves the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes he is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

Source code in olfactory_navigation/agents/pbvi_ssga_agent.py
class PBVI_SSGA_Agent(PBVI_Agent):
    '''
    A flavor of the PBVI Agent. The expand function consists in choosing actions in an epsilon-greedy fashion, generating random observations, and generating belief points based on those.

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of thresholds is provided, the agent should be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of actions available to the agent. It should match the type of environment (i.e.: if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included, and this is done for all layers (if the environment has layers).
    name : str, optional
        A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact conversion that preserves the shape of the environment to determine the number of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducibility).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int,
               epsilon: float = 0.99
               ) -> BeliefSet:
        '''
        Stochastic Simulation with Greedy Action.
        Simulates running a single-step forward from the beliefs in the "belief_set".
        The step forward is taken assuming we are in a random state s (weighted by the belief),
        then taking the best action a based on the belief with probability 'epsilon', and a random action otherwise.
        These lead to a new state s_p and an observation o.
        From this action a and observation o we can update our belief.

        Parameters
        ----------
        belief_set : BeliefSet
            List of beliefs to expand on.
        value_function : ValueFunction
            The current value function. (NOT USED)
        max_generation : int, default=10
            The max amount of beliefs that can be added to the belief set at once.
        epsilon : float, default=0.99
            The epsilon parameter that determines whether to choose an action greedily or randomly.

        Returns
        -------
        belief_set_new : BeliefSet
            Union of the belief_set and the expansions of the beliefs in the belief_set.
        '''
        # GPU support
        xp = np if not self.on_gpu else cp
        model = self.model

        old_shape = belief_set.belief_array.shape
        to_generate = min(max_generation, old_shape[0])

        new_belief_array = xp.empty((to_generate, old_shape[1]))

        # Random previous beliefs
        rand_ind = self.rnd_state.choice(np.arange(old_shape[0]), to_generate, replace=False)

        for i, belief_vector in enumerate(belief_set.belief_array[rand_ind]):
            b = Belief(model, belief_vector)
            s = b.random_state()

            if self.rnd_state.random() < epsilon:
                best_alpha_index = xp.argmax(xp.dot(value_function.alpha_vector_array, b.values))
                a = value_function.actions[best_alpha_index]
            else:
                a = self.rnd_state.choice(model.actions)

            s_p = model.transition(s, a)
            o = model.observe(s_p, a)
            b_new = b.update(a, o)

            new_belief_array[i] = b_new.values

        return BeliefSet(model, new_belief_array)


    def train(self,
              expansions: int,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True,
              epsilon: float = 0.99
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists of 2 steps, Expand and Backup.
        1. Expand: Expands the belief set with the expansion strategy given by the expand function
        2. Backup: Updates the alpha vectors based on the current belief set

        Stochastic Search with Greedy Action Point-Based Value Iteration:
        - By default it performs the backup on the whole set of beliefs generated since the start (so full_backup=True).

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed change for the value function.
            Below this amount of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.
        epsilon : float, default=0.99
            Expand function parameter: the probability with which the action is chosen greedily rather than randomly.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        return super().train(expansions = expansions,
                             full_backup = True,
                             update_passes = update_passes,
                             max_belief_growth = max_belief_growth,
                             initial_belief = initial_belief,
                             initial_value_function = initial_value_function,
                             prune_level = prune_level,
                             prune_interval = prune_interval,
                             limit_value_function_size = limit_value_function_size,
                             gamma = gamma,
                             eps = eps,
                             use_gpu = use_gpu,
                             history_tracking_level = history_tracking_level,
                             overwrite_training = overwrite_training,
                             print_progress = print_progress,
                             print_stats = print_stats,
                             epsilon = epsilon)

expand(belief_set, value_function, max_generation, epsilon=0.99)

Stochastic Simulation with Greedy Action. Simulates running a single step forward from the beliefs in the "belief_set". The step forward is taken assuming we are in a random state s (weighted by the belief), then taking the best action a based on the belief with probability 'epsilon', and a random action otherwise. These lead to a new state s_p and an observation o. From this action a and observation o we can update our belief.

Parameters:

Name Type Description Default
belief_set BeliefSet

List of beliefs to expand on.

required
value_function ValueFunction

The current value function. (NOT USED)

required
max_generation int

The max amount of beliefs that can be added to the belief set at once.

10
epsilon float

The epsilon parameter that determines whether to choose an action greedily or randomly.

0.99

Returns:

Name Type Description
belief_set_new BeliefSet

Union of the belief_set and the expansions of the beliefs in the belief_set.
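For intuition, the action choice inside this expansion can be sketched as below. This is a simplified, hypothetical helper (the function name and signature are illustrative, not the library's API): with probability epsilon the greedy action with respect to the current value function is taken, otherwise a uniformly random one.

import numpy as np

def epsilon_greedy_action(rnd_state, alpha_vectors, alpha_actions, belief_values, epsilon, n_actions):
    # With probability epsilon, exploit the current value function (greedy action);
    # otherwise explore with a random action.
    if rnd_state.random() < epsilon:
        best_alpha = np.argmax(alpha_vectors @ belief_values)
        return alpha_actions[best_alpha]
    return rnd_state.choice(n_actions)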

Source code in olfactory_navigation/agents/pbvi_ssga_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int,
           epsilon: float = 0.99
           ) -> BeliefSet:
    '''
    Stochastic Simulation with Greedy Action.
    Simulates running a single-step forward from the beliefs in the "belief_set".
    The step forward is taken assuming we are in a random state s (weighted by the belief),
    then taking the best action a based on the belief with probability 'epsilon', and a random action otherwise.
    These lead to a new state s_p and an observation o.
    From this action a and observation o we can update our belief.

    Parameters
    ----------
    belief_set : BeliefSet
        List of beliefs to expand on.
    value_function : ValueFunction
        The current value function. (NOT USED)
    max_generation : int, default=10
        The max amount of beliefs that can be added to the belief set at once.
    epsilon : float, default=0.99
        The epsilon parameter that determines whether to choose an action greedily or randomly.

    Returns
    -------
    belief_set_new : BeliefSet
        Union of the belief_set and the expansions of the beliefs in the belief_set.
    '''
    # GPU support
    xp = np if not self.on_gpu else cp
    model = self.model

    old_shape = belief_set.belief_array.shape
    to_generate = min(max_generation, old_shape[0])

    new_belief_array = xp.empty((to_generate, old_shape[1]))

    # Random previous beliefs
    rand_ind = self.rnd_state.choice(np.arange(old_shape[0]), to_generate, replace=False)

    for i, belief_vector in enumerate(belief_set.belief_array[rand_ind]):
        b = Belief(model, belief_vector)
        s = b.random_state()

        if self.rnd_state.random() < epsilon:
            best_alpha_index = xp.argmax(xp.dot(value_function.alpha_vector_array, b.values))
            a = value_function.actions[best_alpha_index]
        else:
            a = self.rnd_state.choice(model.actions)

        s_p = model.transition(s, a)
        o = model.observe(s_p, a)
        b_new = b.update(a, o)

        new_belief_array[i] = b_new.values

    return BeliefSet(model, new_belief_array)

train(expansions, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True, epsilon=0.99)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup. 1. Expand: Expands the belief set with the expansion strategy given by the expand function. 2. Backup: Updates the alpha vectors based on the current belief set.

Stochastic Search with Greedy Action Point-Based Value Iteration: - By default it performs the backup on the whole set of beliefs generated since the start (so full_backup=True).

Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)

required
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function. If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True
epsilon float

Expand function parameter: the probability with which the action is chosen greedily rather than randomly.

0.99

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.
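As with the other PBVI flavors, train forwards its arguments to the shared PBVI training loop; the only SSGA-specific addition is epsilon, which is passed through to the expand function. A short illustrative call (values are assumptions, not recommendations):

# 'agent' is assumed to be an already-constructed PBVI_SSGA_Agent
history = agent.train(expansions=30, max_belief_growth=10, epsilon=0.99)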

Source code in olfactory_navigation/agents/pbvi_ssga_agent.py
def train(self,
          expansions: int,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True,
          epsilon: float = 0.99
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists of 2 steps, Expand and Backup.
    1. Expand: Expands the belief set with the expansion strategy given by the expand function
    2. Backup: Updates the alpha vectors based on the current belief set

    Stochastic Search with Greedy Action Point-Based Value Iteration:
    - By default it performs the backup on the whole set of beliefs generated since the start (so full_backup=True).

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (the size will be doubled every time, eg: for 5, the belief set will be of size 32)
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.
    epsilon : float, default=0.99
        Expand function parameter: the probability with which the action is chosen greedily rather than randomly.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    return super().train(expansions = expansions,
                         full_backup = True,
                         update_passes = update_passes,
                         max_belief_growth = max_belief_growth,
                         initial_belief = initial_belief,
                         initial_value_function = initial_value_function,
                         prune_level = prune_level,
                         prune_interval = prune_interval,
                         limit_value_function_size = limit_value_function_size,
                         gamma = gamma,
                         eps = eps,
                         use_gpu = use_gpu,
                         history_tracking_level = history_tracking_level,
                         overwrite_training = overwrite_training,
                         print_progress = print_progress,
                         print_stats = print_stats,
                         epsilon = epsilon)

PBVI_SSRA_Agent

Bases: PBVI_Agent

A flavor of the PBVI Agent. The expand function consists in choosing random actions and observations and generating belief points based on that.

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent should be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e.: if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Else, a dict of strings and action vectors where the strings represent the action labels. If none is provided, by default, all unit movement vectors are included, and this is done for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, we use an exact conversion that preserves the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes he is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.
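For reference, each simulated step (state s drawn from the belief, random action a, sampled next state s_p and observation o) yields a new belief point through the standard POMDP Bayesian update, which the library's Belief.update is presumed to implement; in the usual transition/observation notation it reads:

b'(s') = \frac{O(o \mid s', a) \, \sum_{s} T(s' \mid s, a) \, b(s)}{\Pr(o \mid b, a)}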

Source code in olfactory_navigation/agents/pbvi_ssra_agent.py
class PBVI_SSRA_Agent(PBVI_Agent):
    '''
    A flavor of the PBVI Agent. The expand function consists in choosing random actions and observations and generating belief points based on that.

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of thresholds is provided, the agent should be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of actions available to the agent. It should match the type of environment (i.e.: if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included, and this is done for all layers (if the environment has layers).
    name : str, optional
        A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact conversion that preserves the shape of the environment to determine the number of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducibility).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int
               ) -> BeliefSet:
        '''
        Stochastic Simulation with Random Action.
        Simulates running a single step forward from the beliefs in the "belief_set".
        The step forward is taken assuming we are in a random state (weighted by the belief) and taking a random action, leading to a state s_p and an observation o.
        From this action a and observation o we can update our belief.

        Parameters
        ----------
        belief_set : BeliefSet
            List of beliefs to expand on.
        value_function : ValueFunction
            The current value function. (NOT USED)
        max_generation : int, default=10
            The maximum number of beliefs that can be added to the belief set at once.

        Returns
        -------
        belief_set_new : BeliefSet
            The set of newly generated beliefs, expanded from the beliefs in belief_set.
        '''
        # GPU support
        xp = np if not self.on_gpu else cp
        model = self.model

        old_shape = belief_set.belief_array.shape
        to_generate = min(max_generation, old_shape[0])

        new_belief_array = xp.empty((to_generate, old_shape[1]))

        # Random previous beliefs
        rand_ind = self.rnd_state.choice(np.arange(old_shape[0]), to_generate, replace=False)

        for i, belief_vector in enumerate(belief_set.belief_array[rand_ind]):
            b = Belief(model, belief_vector)
            s = b.random_state()
            a = self.rnd_state.choice(model.actions)
            s_p = model.transition(s, a)
            o = model.observe(s_p, a)
            b_new = b.update(a, o)

            new_belief_array[i] = b_new.values

        return BeliefSet(model, new_belief_array)


    def train(self,
              expansions: int,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists of 2 steps, Expand and Backup.
        1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
        2. Backup: Updates the alpha vectors based on the current belief set

        Stochastic Simulation with Random Action Point-Based Value Iteration:
        - By default it performs the backup on the whole set of beliefs generated since the start (so full_backup=True).

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (the size will be doubled every time, e.g. for 5, the belief set will be of size 32)
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed change for the value function.
            Below this amount of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        return super().train(expansions = expansions,
                             full_backup = True,
                             update_passes = update_passes,
                             max_belief_growth = max_belief_growth,
                             initial_belief = initial_belief,
                             initial_value_function = initial_value_function,
                             prune_level = prune_level,
                             prune_interval = prune_interval,
                             limit_value_function_size = limit_value_function_size,
                             gamma = gamma,
                             eps = eps,
                             use_gpu = use_gpu,
                             history_tracking_level = history_tracking_level,
                             overwrite_training = overwrite_training,
                             print_progress = print_progress,
                             print_stats = print_stats)

expand(belief_set, value_function, max_generation)

Stochastic Simulation with Random Action. Simulates running a single step forward from the beliefs in the "belief_set". The step forward is taken assuming we are in a random state (weighted by the belief) and taking a random action, leading to a state s_p and an observation o. From this action a and observation o we can update our belief.

Parameters:

Name Type Description Default
belief_set BeliefSet

List of beliefs to expand on.

required
value_function ValueFunction

The current value function. (NOT USED)

required
max_generation int

The maximum number of beliefs that can be added to the belief set at once.

10

Returns:

Name Type Description
belief_set_new BeliefSet

The set of newly generated beliefs, expanded from the beliefs in belief_set.

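For reference, the belief update performed by b.update(a, o) follows the standard POMDP Bayes rule: the new belief over s' is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s). A minimal NumPy sketch of that rule, assuming hypothetical arrays T[s, a, s_p] and O[s_p, a, o] standing in for the model's transition and observation tables (the names and layouts are illustrative, not the library's API):

import numpy as np

def update_belief(b: np.ndarray, a: int, o: int, T: np.ndarray, O: np.ndarray) -> np.ndarray:
    # b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)
    b_new = O[:, a, o] * (b @ T[:, a, :])
    return b_new / b_new.sum()  # normalize so the result remains a probability distribution
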
Source code in olfactory_navigation/agents/pbvi_ssra_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int
           ) -> BeliefSet:
    '''
    Stochastic Simulation with Random Action.
    Simulates running a single step forward from the beliefs in the "belief_set".
    The step forward is taken assuming we are in a random state (weighted by the belief) and taking a random action, leading to a state s_p and an observation o.
    From this action a and observation o we can update our belief.

    Parameters
    ----------
    belief_set : BeliefSet
        List of beliefs to expand on.
    value_function : ValueFunction
        The current value function. (NOT USED)
    max_generation : int, default=10
        The maximum number of beliefs that can be added to the belief set at once.

    Returns
    -------
    belief_set_new : BeliefSet
        The set of newly generated beliefs, expanded from the beliefs in belief_set.
    '''
    # GPU support
    xp = np if not self.on_gpu else cp
    model = self.model

    old_shape = belief_set.belief_array.shape
    to_generate = min(max_generation, old_shape[0])

    new_belief_array = xp.empty((to_generate, old_shape[1]))

    # Random previous beliefs
    rand_ind = self.rnd_state.choice(np.arange(old_shape[0]), to_generate, replace=False)

    for i, belief_vector in enumerate(belief_set.belief_array[rand_ind]):
        b = Belief(model, belief_vector)
        s = b.random_state()
        a = self.rnd_state.choice(model.actions)
        s_p = model.transition(s, a)
        o = model.observe(s_p, a)
        b_new = b.update(a, o)

        new_belief_array[i] = b_new.values

    return BeliefSet(model, new_belief_array)

train(expansions, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup. 1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function 2. Backup: Updates the alpha vectors based on the current belief set

Stochastic Simulation with Random Action Point-Based Value Iteration: - By default it performs the backup on the whole set of beliefs generated since the start (so full_backup=True).

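As a rough sketch of the loop described above (not the library's actual implementation), train() alternates the two steps as follows; union() and backup() are hypothetical stand-ins for the internal belief-set merging and value-function update:

all_beliefs = initial_beliefs
value_function = initial_value_function
for _ in range(expansions):
    # Expand: generate up to max_belief_growth new belief points
    new_beliefs = agent.expand(all_beliefs, value_function, max_generation=max_belief_growth)
    all_beliefs = union(all_beliefs, new_beliefs)  # hypothetical helper
    # Backup: update the alpha vectors on the whole accumulated set (full_backup=True)
    for _ in range(update_passes):
        value_function = backup(all_beliefs, value_function)  # hypothetical helper
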
Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (the size will be doubled every time, e.g. for 5, the belief set will be of size 32)

required
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function. If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.

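A hedged usage sketch of the call described above. The import path and the Environment construction are assumptions; only the PBVI_SSRA_Agent and train() parameters shown are taken from this documentation:

from olfactory_navigation.agents import PBVI_SSRA_Agent  # import path is an assumption

env = ...  # an olfactory Environment instance, built elsewhere
agent = PBVI_SSRA_Agent(environment=env, threshold=3e-6)

# 20 expansions, at most 10 new beliefs per expansion, one backup pass each time.
history = agent.train(expansions=20,
                      update_passes=1,
                      max_belief_growth=10,
                      gamma=0.99,
                      print_progress=True)
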
Source code in olfactory_navigation/agents/pbvi_ssra_agent.py
def train(self,
          expansions: int,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists of 2 steps, Expand and Backup.
    1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
    2. Backup: Updates the alpha vectors based on the current belief set

    Stochastic Simulation with Random Action Point-Based Value Iteration:
    - By default it performs the backup on the whole set of beliefs generated since the start (so full_backup=True).

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (the size will be doubled every time, e.g. for 5, the belief set will be of size 32)
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    return super().train(expansions = expansions,
                         full_backup = True,
                         update_passes = update_passes,
                         max_belief_growth = max_belief_growth,
                         initial_belief = initial_belief,
                         initial_value_function = initial_value_function,
                         prune_level = prune_level,
                         prune_interval = prune_interval,
                         limit_value_function_size = limit_value_function_size,
                         gamma = gamma,
                         eps = eps,
                         use_gpu = use_gpu,
                         history_tracking_level = history_tracking_level,
                         overwrite_training = overwrite_training,
                         print_progress = print_progress,
                         print_stats = print_stats)

Perseus_Agent

Bases: PBVI_Agent

A flavor of the PBVI Agent.

TODO: Document the Perseus agent

TODO: FIX Perseus expand

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent should be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e., if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Else, a dict of strings and action vectors where the strings represent the action labels. If none is provided, by default, all unit movement vectors are included, and this for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, we use an exact conversion that keeps the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes he is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

Source code in olfactory_navigation/agents/perseus_agent.py
class Perseus_Agent(PBVI_Agent):
    '''
    A flavor of the PBVI Agent. 

    # TODO: Document the Perseus agent
    # TODO: FIX Perseus expand

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of thresholds is provided, the agent should be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of actions available to the agent. It should match the type of environment (i.e., if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included, and this for all layers (if the environment has layers).
    name : str, optional
        A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact conversion that keeps the shape of the environment to determine the number of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed to the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducibility).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int
               ) -> BeliefSet:
        '''
        # TODO

        Parameters
        ----------
        belief_set : BeliefSet
            List of beliefs to expand on.
        value_function : ValueFunction
            The current value function. (NOT USED)
        max_generation : int, default=10
            The maximum number of beliefs that can be added to the belief set at once.

        Returns
        -------
        belief_set : BeliefSet
            A new sequence of beliefs.
        '''
        # GPU support
        xp = np if not self.on_gpu else cp
        model = self.model

        b = belief_set.belief_list[0]
        belief_sequence = []

        for i in range(max_generation):
            # Choose random action
            a = int(self.rnd_state.choice(model.actions, size=1)[0])

            # Choose random observation based on prob: P(o|b,a)
            obs_prob = xp.einsum('sor,s->o', model.reachable_transitional_observation_table[:,a,:,:], b.values)
            o = int(self.rnd_state.choice(model.observations, size=1, p=obs_prob)[0])

            # Update belief
            bao = b.update(a,o)

            # Finalization
            belief_sequence.append(bao)
            b = bao

        return BeliefSet(model, belief_sequence)


    def train(self,
              expansions: int,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists of 2 steps, Expand and Backup.
        1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
        2. Backup: Updates the alpha vectors based on the current belief set

        Perseus Point-Based Value Iteration:
        - By default it performs the backup only on the set of beliefs generated by the expand function (so full_backup=False).

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (the size will be doubled every time, e.g. for 5, the belief set will be of size 32)
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed change for the value function.
            Below this amount of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        return super().train(expansions = expansions,
                             full_backup = False,
                             update_passes = update_passes,
                             max_belief_growth = max_belief_growth,
                             initial_belief = initial_belief,
                             initial_value_function = initial_value_function,
                             prune_level = prune_level,
                             prune_interval = prune_interval,
                             limit_value_function_size = limit_value_function_size,
                             gamma = gamma,
                             eps = eps,
                             use_gpu = use_gpu,
                             history_tracking_level = history_tracking_level,
                             overwrite_training = overwrite_training,
                             print_progress = print_progress,
                             print_stats = print_stats)

expand(belief_set, value_function, max_generation)

TODO

Parameters:

Name Type Description Default
belief_set BeliefSet

List of beliefs to expand on.

required
value_function ValueFunction

The current value function. (NOT USED)

required
max_generation int

The maximum number of beliefs that can be added to the belief set at once.

10

Returns:

Name Type Description
belief_set BeliefSet

A new sequence of beliefs.

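The einsum in the source below computes P(o | b, a), the probability of each observation given the current belief and the chosen action, which is then used to sample o. A minimal sketch of the same quantity, assuming illustrative tables T[s, a, s_p] and O[s_p, a, o] in place of the model's reachable_transitional_observation_table (the names and layouts are assumptions):

import numpy as np

def observation_probabilities(b: np.ndarray, a: int, T: np.ndarray, O: np.ndarray) -> np.ndarray:
    # P(o | b, a) = sum_s b(s) * sum_s' T(s' | s, a) * O(o | s', a)
    next_state_dist = b @ T[:, a, :]     # probability of landing in each state s'
    return next_state_dist @ O[:, a, :]  # probability of observing each o
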
Source code in olfactory_navigation/agents/perseus_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int
           ) -> BeliefSet:
    '''
    # TODO

    Parameters
    ----------
    belief_set : BeliefSet
        List of beliefs to expand on.
    value_function : ValueFunction
        The current value function. (NOT USED)
    max_generation : int, default=10
        The maximum number of beliefs that can be added to the belief set at once.

    Returns
    -------
    belief_set : BeliefSet
        A new sequence of beliefs.
    '''
    # GPU support
    xp = np if not self.on_gpu else cp
    model = self.model

    b = belief_set.belief_list[0]
    belief_sequence = []

    for i in range(max_generation):
        # Choose random action
        a = int(self.rnd_state.choice(model.actions, size=1)[0])

        # Choose random observation based on prob: P(o|b,a)
        obs_prob = xp.einsum('sor,s->o', model.reachable_transitional_observation_table[:,a,:,:], b.values)
        o = int(self.rnd_state.choice(model.observations, size=1, p=obs_prob)[0])

        # Update belief
        bao = b.update(a,o)

        # Finalization
        belief_sequence.append(bao)
        b = bao

    return BeliefSet(model, belief_sequence)

train(expansions, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup. 1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function 2. Backup: Updates the alpha vectors based on the current belief set

Perseus Point-Based Value Iteration: - By default it performs the backup only on the set of beliefs generated by the expand function (so full_backup=False).

Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (the size will be doubled every time, e.g. for 5, the belief set will be of size 32)

required
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function. If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.

Source code in olfactory_navigation/agents/perseus_agent.py
def train(self,
          expansions: int,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists of 2 steps, Expand and Backup.
    1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
    2. Backup: Updates the alpha vectors based on the current belief set

    Perseus Point-Based Value Iteration:
    - By default it performs the backup only on the set of beliefs generated by the expand function (so full_backup=False).

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (the size will be doubled every time, e.g. for 5, the belief set will be of size 32)
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    return super().train(expansions = expansions,
                         full_backup = False,
                         update_passes = update_passes,
                         max_belief_growth = max_belief_growth,
                         initial_belief = initial_belief,
                         initial_value_function = initial_value_function,
                         prune_level = prune_level,
                         prune_interval = prune_interval,
                         limit_value_function_size = limit_value_function_size,
                         gamma = gamma,
                         eps = eps,
                         use_gpu = use_gpu,
                         history_tracking_level = history_tracking_level,
                         overwrite_training = overwrite_training,
                         print_progress = print_progress,
                         print_stats = print_stats)

QMDP_Agent

Bases: PBVI_Agent

An agent that relies on Model-Based Reinforcement Learning. It is a simplified version of the PBVI_Agent. It runs a Value Iteration solver, assuming full observability. The value function that results from this is then used to make choices.

As stated, during simulations, the agent chooses actions by an argmax over the dot products of the value function's vectors with the belief vector.

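A minimal sketch of that action-selection rule, assuming the value function is available as an array alpha_vectors of shape (n_alpha, |S|) together with an array alpha_actions giving the action tied to each vector (both names are illustrative, not the library's API):

import numpy as np

def choose_action(belief: np.ndarray, alpha_vectors: np.ndarray, alpha_actions: np.ndarray) -> int:
    scores = alpha_vectors @ belief               # one dot product per alpha vector
    return int(alpha_actions[np.argmax(scores)])  # action tied to the best-scoring vector
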
Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent should be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e., if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Else, a dict of strings and action vectors where the strings represent the action labels. If none is provided, by default, all unit movement vectors are included, and this for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, we use an exact conversion that keeps the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes he is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

Source code in olfactory_navigation/agents/qmdp_agent.py
class QMDP_Agent(PBVI_Agent):
    '''
    An agent that relies on Model-Based Reinforcement Learning. It is a simplified version of the PBVI_Agent.
    It runs a Value Iteration solver, assuming full observability. The value function that results from this is then used to make choices.

    As stated, during simulations, the agent chooses actions by an argmax over the dot products of the value function's vectors with the belief vector.


    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of thresholds is provided, the agent should be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of actions available to the agent. It should match the type of environment (i.e., if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension).
        Else, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included, and this for all layers (if the environment has layers).
    name : str, optional
        A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, we use an exact conversion that keeps the shape of the environment to determine the number of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ---------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed to the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducibility).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes he is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    '''
    def train(self,
              expansions: int,
              initial_value_function: ValueFunction | None = None,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Simplified version of the training. It consists in running the Value Iteration process.

        Parameters
        ----------
        expansions : int
            How many iterations to run the Value Iteration process for.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed change for the value function.
            Below this amount of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        # Handling the case where the agent is already trained
        if (self.value_function is not None):
            if overwrite_training:
                self.trained_at = None
                self.name = '-'.join(self.name.split('-')[:-1])
                self.value_function = None
            else:
                initial_value_function = self.value_function

        model = self.model if not use_gpu else self.model.gpu_model

        # Value Iteration solving
        value_function, hist = vi_solver.solve(model = model,
                                               horizon = expansions,
                                               initial_value_function = initial_value_function,
                                               gamma = gamma,
                                               eps = eps,
                                               use_gpu = use_gpu,
                                               history_tracking_level = history_tracking_level,
                                               print_progress = print_progress)

        # Record when it was trained
        self.trained_at = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.name += f'-trained_{self.trained_at}'

        self.value_function = value_function.to_cpu() if not self.on_gpu else value_function.to_gpu()

        # Print stats if requested
        if print_stats:
            print(hist.summary)

        return hist

train(expansions, initial_value_function=None, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Simplified version of the training. It consists in running the Value Iteration process.

Parameters:

Name Type Description Default
expansions int

How many iterations to run the Value Iteration process for.

required
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.

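A short usage sketch of this simplified training; as before, the Environment construction is elided and the import path is an assumption:

from olfactory_navigation.agents import QMDP_Agent  # import path is an assumption

env = ...  # an olfactory Environment instance, built elsewhere
agent = QMDP_Agent(environment=env, threshold=3e-6)

# Run Value Iteration for up to 500 iterations (or until the change falls below eps).
history = agent.train(expansions=500, gamma=0.99, eps=1e-6)
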
Source code in olfactory_navigation/agents/qmdp_agent.py
def train(self,
          expansions: int,
          initial_value_function: ValueFunction | None = None,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Simplified version of the training. It consists in running the Value Iteration process.

    Parameters
    ----------
    expansions : int
        How many iterations to run the Value Iteration process for.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    # Handling the case where the agent is already trained
    if (self.value_function is not None):
        if overwrite_training:
            self.trained_at = None
            self.name = '-'.join(self.name.split('-')[:-1])
            self.value_function = None
        else:
            initial_value_function = self.value_function

    model = self.model if not use_gpu else self.model.gpu_model

    # Value Iteration solving
    value_function, hist = vi_solver.solve(model = model,
                                           horizon = expansions,
                                           initial_value_function = initial_value_function,
                                           gamma = gamma,
                                           eps = eps,
                                           use_gpu = use_gpu,
                                           history_tracking_level = history_tracking_level,
                                           print_progress = print_progress)

    # Record when it was trained
    self.trained_at = datetime.now().strftime("%Y%m%d_%H%M%S")
    self.name += f'-trained_{self.trained_at}'

    self.value_function = value_function.to_cpu() if not self.on_gpu else value_function.to_gpu()

    # Print stats if requested
    if print_stats:
        print(hist.summary)

    return hist