
qmdp_agent

QMDP_Agent

Bases: PBVI_Agent

An agent that relies on model-based reinforcement learning. It is a simplified version of the PBVI_Agent: it runs a Value Iteration solver under the assumption of full observability, and the resulting value function is then used to choose actions.

During simulations, the agent weights the action-values of the full-observability value function by its current belief and plays the action with the highest expected value, i.e. the argmax over actions of the product between the belief and the action-value matrix (the QMDP approximation).
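The belief-weighted argmax described above can be sketched in a few lines. This is a minimal, pure-Python illustration and not the library's implementation: the function name `qmdp_action` and the toy `q_values` table (one row per state, one column per action) are assumptions made for the example.

```python
def qmdp_action(belief, q_values):
    """Pick the action maximising sum_s belief[s] * q_values[s][a].

    belief   : list of probabilities over states (hypothetical toy input)
    q_values : q_values[s][a] is the full-observability action-value
    """
    n_actions = len(q_values[0])
    # Expected action-value under the belief, one entry per action
    expected = [
        sum(b * q_row[a] for b, q_row in zip(belief, q_values))
        for a in range(n_actions)
    ]
    # Argmax over actions
    best = max(range(n_actions), key=lambda a: expected[a])
    return best, expected


# Toy problem: three states, two actions
q_values = [
    [1.0, 0.0],
    [0.0, 2.0],
    [0.5, 0.5],
]
belief = [0.7, 0.2, 0.1]

best, expected = qmdp_action(belief, q_values)
print(best, expected)
```

With this belief, most of the probability mass sits on the first state, so the first action wins even though the second action has the largest single action-value.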

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
thresholds float or list[float] or dict[str, float] or dict[str, list[float]]

The olfactory thresholds. An odor cue above the threshold is detected by the agent; a cue below it is not. If a list of thresholds is provided, the agent can distinguish |thresholds|+1 levels of odor. A dictionary of thresholds (or of lists of thresholds) can also be provided when the environment is layered. In that case, the number of layers provided and their labels must match those of the environment. The thresholds provided will be converted to an array where the levels start with -inf and end with +inf.

= 3e-6
space_aware bool

Whether the agent is aware of its own position in space. This is to be used in scenarios where, for example, the agent is an enclosed container and the source is the variable. Note: The observation array will have a different shape when returned to the update_state function!

= False
spacial_subdivisions ndarray

How many spatial compartments the agent uses to internally represent the space it lives in. By default, it will be as many as there are grid points in the environment.

None
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e. if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Alternatively, a dict mapping action labels (strings) to action vectors can be provided. If none is provided, all unit steps in all cardinal directions are included by default, for every layer if the environment has layers.

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
rng int or Generator

A seed for random generation or directly a numpy random generator.

= np.random.default_rng()
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, an exact conversion is used: it keeps the shape of the environment to define the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

= exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}
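To make the thresholds description concrete, here is a small sketch of how a set of thresholds bounded by -inf and +inf splits odor readings into |thresholds|+1 levels. The helper names `to_levels` and `observation_level` are hypothetical, and treating a reading exactly equal to a threshold as "not above" is an assumption; the library's own conversion may differ.

```python
import bisect
import math


def to_levels(thresholds):
    """Wrap one or more thresholds into [-inf, t1, ..., tn, +inf]."""
    if isinstance(thresholds, (int, float)):
        thresholds = [thresholds]
    return [-math.inf, *sorted(thresholds), math.inf]


def observation_level(odor, levels):
    """Return the index of the level the odor reading falls in.

    Counts how many finite thresholds are strictly below the reading,
    so a reading must be above a threshold to reach the next level.
    """
    inner = levels[1:-1]  # finite thresholds only
    return bisect.bisect_left(inner, odor)


# Two thresholds give three detectable levels: 0, 1 and 2
levels = to_levels([1e-6, 1e-4])
readings = [1e-7, 5e-5, 1e-3]
observed = [observation_level(r, levels) for r in readings]
print(levels, observed)
```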

Attributes:

Name Type Description
environment Environment
thresholds ndarray

An array of the thresholds of detection, starting with -inf and ending with +inf. In the case of a 2D array of thresholds, the rows of thresholds apply to the different layers of the environment.

space_aware bool
spacial_subdivisions ndarray
trained bool

Whether the agent has been trained. If an agent does not need training, this attribute is set to True by default.

name str
action_set ndarray

The actions available to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

is_on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

rng Generator

A random number generator.

on_cpu PBVI_Agent

An instance of the agent on the CPU. If it already is, it returns itself.

on_gpu PBVI_Agent

An instance of the agent on the GPU. If it already is, it returns itself.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes it is in the state space. It is a list of n belief points, where n is the number of simulations running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records which action was last played by the agent. It is a list of n actions, where n is the number of simulations running at once.

Source code in olfactory_navigation/agents/qmdp_agent.py
class QMDP_Agent(PBVI_Agent):
    '''
    An agent that relies on model-based reinforcement learning. It is a simplified version of the PBVI_Agent.
    It runs a Value Iteration solver, assuming full observability. The resulting value function is then used to choose actions.

    During simulations, the agent weights the action-values of the full-observability value function by its current belief and plays the action with the highest expected value (i.e., the QMDP approximation).


    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    thresholds : float or list[float] or dict[str, float] or dict[str, list[float]], default = 3e-6
        The olfactory thresholds. An odor cue above the threshold is detected by the agent; a cue below it is not.
        If a list of thresholds is provided, the agent can distinguish |thresholds|+1 levels of odor.
        A dictionary of thresholds (or of lists of thresholds) can also be provided when the environment is layered.
        In that case, the number of layers provided and their labels must match those of the environment.
        The thresholds provided will be converted to an array where the levels start with -inf and end with +inf.
    space_aware : bool, default = False
        Whether the agent is aware of its own position in space.
        This is to be used in scenarios where, for example, the agent is an enclosed container and the source is the variable.
        Note: The observation array will have a different shape when returned to the update_state function!
    spacial_subdivisions : np.ndarray, optional
        How many spatial compartments the agent uses to internally represent the space it lives in.
        By default, it will be as many as there are grid points in the environment.
    actions : dict or np.ndarray, optional
        The set of actions available to the agent. It should match the type of environment (i.e. if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension).
        Alternatively, a dict mapping action labels (strings) to action vectors can be provided.
        If none is provided, all unit steps in all cardinal directions are included by default, for every layer if the environment has layers.
    name : str, optional
        A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.
    rng : int or np.random.Generator, default = np.random.default_rng()
        A seed for random generation or directly a numpy random generator.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default = exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, an exact conversion is used: it keeps the shape of the environment to define the number of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ----------
    environment : Environment
    thresholds : np.ndarray
        An array of the thresholds of detection, starting with -inf and ending with +inf.
        In the case of a 2D array of thresholds, the rows of thresholds apply to the different layers of the environment.
    space_aware : bool
    spacial_subdivisions : np.ndarray
    trained : bool
        Whether the agent has been trained. If an agent does not need training, this attribute is set to True by default.
    name : str
    action_set : np.ndarray
        The actions available to the agent, formulated as movement vectors [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    is_on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    rng : np.random.Generator
        A random number generator.
    on_cpu : PBVI_Agent
        An instance of the agent on the CPU. If it already is, it returns itself.
    on_gpu : PBVI_Agent
        An instance of the agent on the GPU. If it already is, it returns itself.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes it is in the state space.
        It is a list of n belief points, where n is the number of simulations running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records which action was last played by the agent.
        It is a list of n actions, where n is the number of simulations running at once.
    '''
    def train(self,
              expansions: int = 10,
              initial_value_function: ValueFunction = None,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Simplified version of the training. It consists in running the Value Iteration process.

        Parameters
        ----------
        expansions : int, default = 10
            How many iterations to run the Value Iteration process for.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        gamma : float, default = 0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default = 1e-6
            The smallest allowed change for the value function.
            Below this amount of change, the value function is considered converged and the value iteration process ends early.
        use_gpu : bool, default = False
            Whether to use the GPU with cupy array to accelerate solving.
        history_tracking_level : int, default = 1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default = False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default = True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default = True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        # Handling the case where the agent is already trained
        if (self.value_function is not None):
            if overwrite_training:
                self.trained_at = None
                self.name = '-'.join(self.name.split('-')[:-1])
                self.value_function = None
            else:
                initial_value_function = self.value_function

        # Value Iteration solving
        value_function, hist = VI.solve(model = self.model,
                                        horizon = expansions,
                                        initial_value_function = initial_value_function,
                                        gamma = gamma,
                                        eps = eps,
                                        use_gpu = use_gpu,
                                        use_reachability = self.use_reachability,
                                        history_tracking_level = history_tracking_level,
                                        print_progress = print_progress)

        # Record when it was trained
        self.trained_at = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.name += f'-trained_{self.trained_at}'

        self.value_function = value_function.on_cpu if not self.is_on_gpu else value_function.on_gpu

        # Print stats if requested
        if print_stats:
            print(hist.summary)

        # Validate training
        self.trained = True

        return hist

train(expansions=10, initial_value_function=None, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Simplified version of the training. It consists in running the Value Iteration process.

Parameters:

Name Type Description Default
expansions int

How many iterations to run the Value Iteration process for.

= 10
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

= 0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process ends early.

= 1e-6
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

= False
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)

= 1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

= False
print_progress bool

Whether or not to print out the progress of the value iteration process.

= True
print_stats bool

Whether or not to print out statistics at the end of the training run.

= True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.
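The roles of expansions, gamma and eps can be illustrated with a generic value iteration loop. The sketch below is a textbook version written in plain Python on a toy two-state problem; it is not the code behind `VI.solve`, and the function name `value_iteration`, the data layout and the max-absolute-change stopping rule are all assumptions made for the example.

```python
def value_iteration(transitions, rewards, gamma=0.99, eps=1e-6, horizon=10):
    """Generic value iteration on a small, fully observable model.

    transitions[s][a] : list of (next_state, probability) pairs
    rewards[s][a]     : immediate reward for playing a in s
    horizon           : maximum number of iterations (like `expansions`)
    """
    n_states = len(transitions)
    values = [0.0] * n_states
    iterations_run = 0

    for _ in range(horizon):
        iterations_run += 1
        # Bellman backup: best immediate reward plus discounted future value
        new_values = [
            max(
                rewards[s][a]
                + gamma * sum(p * values[s2] for s2, p in transitions[s][a])
                for a in range(len(transitions[s]))
            )
            for s in range(n_states)
        ]
        change = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        # Stop early once the value function has stopped changing
        if change < eps:
            break

    return values, iterations_run


# State 0: action 0 stays (reward 0), action 1 moves to state 1 (reward 1)
# State 1: absorbing, no further reward
transitions = [
    [[(0, 1.0)], [(1, 1.0)]],
    [[(1, 1.0)]],
]
rewards = [
    [0.0, 1.0],
    [0.0],
]

values, iterations_run = value_iteration(transitions, rewards)
print(values, iterations_run)
```

On this toy problem the second sweep changes nothing, so the loop stops well before the ten allowed iterations.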

Source code in olfactory_navigation/agents/qmdp_agent.py
def train(self,
          expansions: int = 10,
          initial_value_function: ValueFunction = None,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Simplified version of the training. It consists in running the Value Iteration process.

    Parameters
    ----------
    expansions : int, default = 10
        How many iterations to run the Value Iteration process for.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    gamma : float, default = 0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default = 1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process ends early.
    use_gpu : bool, default = False
        Whether to use the GPU with cupy array to accelerate solving.
    history_tracking_level : int, default = 1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default = False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default = True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default = True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    # Handling the case where the agent is already trained
    if (self.value_function is not None):
        if overwrite_training:
            self.trained_at = None
            self.name = '-'.join(self.name.split('-')[:-1])
            self.value_function = None
        else:
            initial_value_function = self.value_function

    # Value Iteration solving
    value_function, hist = VI.solve(model = self.model,
                                    horizon = expansions,
                                    initial_value_function = initial_value_function,
                                    gamma = gamma,
                                    eps = eps,
                                    use_gpu = use_gpu,
                                    use_reachability = self.use_reachability,
                                    history_tracking_level = history_tracking_level,
                                    print_progress = print_progress)

    # Record when it was trained
    self.trained_at = datetime.now().strftime("%Y%m%d_%H%M%S")
    self.name += f'-trained_{self.trained_at}'

    self.value_function = value_function.on_cpu if not self.is_on_gpu else value_function.on_gpu

    # Print stats if requested
    if print_stats:
        print(hist.summary)

    # Validate training
    self.trained = True

    return hist
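The name handling in train can be tried in isolation. The sketch below reproduces the two string operations from the source (appending a `-trained_<timestamp>` suffix, then dropping the last hyphen-separated piece when training is overwritten); the helper names and the base name `QMDP_Agent-thresh_3e-06` are hypothetical and only serve the illustration.

```python
from datetime import datetime


def add_trained_suffix(name, when):
    """Append the training timestamp, as done at the end of train()."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{name}-trained_{stamp}", stamp


def strip_trained_suffix(name):
    """Drop the last hyphen-separated piece, as done when overwriting."""
    return "-".join(name.split("-")[:-1])


base_name = "QMDP_Agent-thresh_3e-06"  # hypothetical agent name
trained_name, stamp = add_trained_suffix(base_name, datetime(2024, 1, 2, 3, 4, 5))
restored_name = strip_trained_suffix(trained_name)
print(trained_name, restored_name)
```

Because the timestamp format contains an underscore rather than a hyphen, removing the last hyphen-separated piece removes exactly the suffix, even when the base name itself contains hyphens.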