
fsvi_agent

FSVI_Agent

Bases: PBVI_Agent

A particular flavor of the Point-Based Value Iteration (PBVI) agent. The general concept relies on model-based reinforcement learning as described in: Pineau, J., Gordon, G., & Thrun, S. (2003, August). Point-based value iteration: An anytime algorithm for POMDPs. The Forward Search Value Iteration algorithm is described in: Shani, G., Brafman, R. I., & Shimony, S. E. (2007, January). Forward Search Value Iteration for POMDPs.

The training consists of two steps:

  • Expand: Belief points are explored based on some strategy (to be defined by subclasses).

  • Backup: Using the generated belief points, the value function is updated.

The belief points are probability distributions over the state space and are therefore vectors of |S| elements.

Actions are chosen based on a value function. A value function is a set of alpha vectors of dimensionality |S|. Each alpha vector is associated with a single action, but multiple alpha vectors can be associated with the same action. To choose an action at a given belief point, the dot product is taken between each alpha vector and the belief point, and the action associated with the highest-scoring alpha vector is chosen.
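For illustration only (not part of the library), a minimal NumPy sketch of this action-selection rule, with made-up alpha vectors, action assignments and belief point:

import numpy as np

# Hypothetical example: 4 alpha vectors over |S| = 3 states.
alpha_vectors = np.array([[0.2, 0.5, 0.1],
                          [0.4, 0.1, 0.3],
                          [0.0, 0.6, 0.2],
                          [0.3, 0.3, 0.3]])
alpha_actions = np.array([0, 1, 1, 2])  # action associated with each alpha vector

belief = np.array([0.1, 0.7, 0.2])  # probability distribution over the states

# Dot product of each alpha vector with the belief point; the action of the
# highest-scoring alpha vector is chosen.
values = alpha_vectors @ belief
chosen_action = alpha_actions[np.argmax(values)]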

Forward Search exploration concept: It relies on the solution of the fully observable (MDP) problem to guide the exploration of belief points. The agent is started at a random state in the environment and takes steps following the MDP solution, generating belief points along the way. Each time the expand function is called, it generates a new set of belief points, and the update function uses only the latest generated belief points to update the value function.

Parameters:

Name Type Description Default
environment Environment

The olfactory environment to train the agent with.

required
threshold float or list[float]

The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not. If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.

3e-6
actions dict or ndarray

The set of actions available to the agent. It should match the type of environment (i.e., if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension). Alternatively, a dict of strings and action vectors where the strings represent the action labels. If none is provided, all unit movement vectors are included by default, and this for all layers (if the environment has layers).

None
name str

A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.

None
seed int

For reproducible randomness.

12131415
model Model

A POMDP model to use to represent the olfactory environment. If not provided, the environment_converter parameter will be used.

None
environment_converter Callable

A function to convert the olfactory environment instance to a POMDP Model instance. By default, an exact conversion is used that keeps the shape of the environment to determine the number of states of the POMDP Model. This parameter will be ignored if the model parameter is provided.

exact_converter
converter_parameters dict

A set of additional parameters to be passed down to the environment converter.

{}

Attributes:

Name Type Description
environment Environment
threshold float or list[float]
name str
action_set ndarray

The actions allowed to the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].

action_labels list[str]

The labels associated to the action vectors present in the action set.

model Model

The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.

saved_at str

The place on disk where the agent has been saved (None if not saved yet).

on_gpu bool

Whether the agent has been sent to the gpu or not.

class_name str

The name of the class of the agent.

seed int

The seed used for the random operations (to allow for reproducibility).

rnd_state RandomState

The random state variable used to generate random values.

trained_at str

A string timestamp of when the agent has been trained (None if not trained yet).

value_function ValueFunction

The value function used for the agent to make decisions.

belief BeliefSet

Used only during simulations. Part of the Agent's status. Where the agent believes it is over the state space. It is a list of n belief points based on how many simulations are running at once.

action_played list[int]

Used only during simulations. Part of the Agent's status. Records what action was last played by the agent. A list of n actions played based on how many simulations are running at once.

mdp_policy ValueFunction

The solution to the fully observable (MDP) version of the problem.
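As a usage illustration (not taken from the library's documentation), a minimal sketch assuming an olfactory Environment instance env has already been built; the import path is inferred from the source file location shown below:

from olfactory_navigation.agents.fsvi_agent import FSVI_Agent

# `env` is assumed to be a previously constructed Environment instance.
agent = FSVI_Agent(environment=env,
                   threshold=3e-6,
                   seed=12131415)

# Train with the forward-search expansion strategy (see the train method below).
history = agent.train(expansions=100,
                      max_belief_growth=10,
                      gamma=0.99,
                      eps=1e-6)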

Source code in olfactory_navigation/agents/fsvi_agent.py
class FSVI_Agent(PBVI_Agent):
    '''
    A particular flavor of the Point-Based Value Iteration (PBVI) agent.
    The general concept relies on model-based reinforcement learning as described in: Pineau, J., Gordon, G., & Thrun, S. (2003, August). Point-based value iteration: An anytime algorithm for POMDPs.
    The Forward Search Value Iteration algorithm is described in: Shani, G., Brafman, R. I., & Shimony, S. E. (2007, January). Forward Search Value Iteration for POMDPs.

    The training consists of two steps:

    - Expand: Belief points are explored based on some strategy (to be defined by subclasses).

    - Backup: Using the generated belief points, the value function is updated.

    The belief points are probability distributions over the state space and are therefore vectors of |S| elements.

    Actions are chosen based on a value function. A value function is a set of alpha vectors of dimensionality |S|.
    Each alpha vector is associated with a single action, but multiple alpha vectors can be associated with the same action.
    To choose an action at a given belief point, the dot product is taken between each alpha vector and the belief point, and the action associated with the highest-scoring alpha vector is chosen.

    Forward Search exploration concept:
    It relies on the solution of the fully observable (MDP) problem to guide the exploration of belief points.
    The agent is started at a random state in the environment and takes steps following the MDP solution, generating belief points along the way.
    Each time the expand function is called, it generates a new set of belief points, and the update function uses only the latest generated belief points to update the value function.

    Parameters
    ----------
    environment : Environment
        The olfactory environment to train the agent with.
    threshold : float or list[float], default=3e-6
        The olfactory threshold. If an odor cue above this threshold is detected, the agent detects it, else it does not.
        If a list of thresholds is provided, the agent will be able to detect |thresholds|+1 levels of odor.
    actions : dict or np.ndarray, optional
        The set of actions available to the agent. It should match the type of environment (i.e., if the environment has layers, the action vectors should contain a layer component, and similarly for a third dimension).
        Alternatively, a dict of strings and action vectors where the strings represent the action labels.
        If none is provided, by default, all unit movement vectors are included, and this for all layers (if the environment has layers).
    name : str, optional
        A custom name to give the agent. If not provided, it will be a combination of the class name and the threshold.
    seed : int, default=12131415
        For reproducible randomness.
    model : Model, optional
        A POMDP model to use to represent the olfactory environment.
        If not provided, the environment_converter parameter will be used.
    environment_converter : Callable, default=exact_converter
        A function to convert the olfactory environment instance to a POMDP Model instance.
        By default, an exact conversion is used that keeps the shape of the environment to determine the number of states of the POMDP Model.
        This parameter will be ignored if the model parameter is provided.
    converter_parameters : dict, optional
        A set of additional parameters to be passed down to the environment converter.

    Attributes
    ----------
    environment : Environment
    threshold : float or list[float]
    name : str
    action_set : np.ndarray
        The actions allowed to the agent. Formulated as movement vectors as [(layer,) (dz,) dy, dx].
    action_labels : list[str]
        The labels associated to the action vectors present in the action set.
    model : pomdp.Model
        The environment converted to a POMDP model using the "from_environment" constructor of the pomdp.Model class.
    saved_at : str
        The place on disk where the agent has been saved (None if not saved yet).
    on_gpu : bool
        Whether the agent has been sent to the gpu or not.
    class_name : str
        The name of the class of the agent.
    seed : int
        The seed used for the random operations (to allow for reproducibility).
    rnd_state : np.random.RandomState
        The random state variable used to generate random values.
    trained_at : str
        A string timestamp of when the agent has been trained (None if not trained yet).
    value_function : ValueFunction
        The value function used for the agent to make decisions.
    belief : BeliefSet
        Used only during simulations.
        Part of the Agent's status. Where the agent believes it is over the state space.
        It is a list of n belief points based on how many simulations are running at once.
    action_played : list[int]
        Used only during simulations.
        Part of the Agent's status. Records what action was last played by the agent.
        A list of n actions played based on how many simulations are running at once.
    mdp_policy : ValueFunction
        The solution to the fully observable (MDP) version of the problem.
    '''
    # FSVI special attribute
    mdp_policy = None

    def expand(self,
               belief_set: BeliefSet,
               value_function: ValueFunction,
               max_generation: int,
               mdp_policy: ValueFunction
               ) -> BeliefSet:
        '''
        Function implementing the exploration process using the MDP policy in order to generate a sequence of beliefs following the Forward Search Value Iteration principles.
        A loop is started from an initial state 's' and, using the MDP policy, the best action to take is chosen.
        Following this, a random next state 's_p' is sampled from the transition probabilities and a random observation 'o' from the observation probabilities.
        Then the given belief is updated using the chosen action and the observation received and the updated belief is added to the sequence.
        Once the state is a goal state, the loop is done and the belief sequence is returned.

        Parameters
        ----------
        belief_set : BeliefSet
            A belief set containing a single belief to start the sequence with.
            A random state will be chosen based on the probability distribution of the belief.
        value_function : ValueFunction
            The current value function. (NOT USED)
        max_generation : int
            How many beliefs to be generated at most.
        mdp_policy : ValueFunction
            The MDP policy used to choose the action for the given state 's'.

        Returns
        -------
        belief_set : BeliefSet
            A new sequence of beliefs.
        '''
        # GPU support
        xp = np if not self.on_gpu else cp
        model = self.model

        # Getting initial belief
        b0 = belief_set.belief_list[0]
        belief_list = [b0]

        # Choose a random starting state
        s = b0.random_state()

        # Setting the working belief
        b = b0

        for _ in range(max_generation - 1): # -1 due to one belief already being present in the set
            # Choose action based on mdp value function
            a_star = xp.argmax(mdp_policy.alpha_vector_array[:,s])

            # Pick a random next state (weighted by transition probabilities)
            s_p = model.transition(s, a_star)

            # Pick a random observation weighted by observation probabilities in state s_p and after having done action a_star
            o = model.observe(s_p, a_star)

            # Generate a new belief based on a_star and o
            b_p = b.update(a_star, o)

            # Record new belief
            belief_list.append(b_p)

            # Updating s and b
            s = s_p
            b = b_p

            # Reset state and belief if an end state is reached
            if s in model.end_states:
                s = b0.random_state()
                b = b0

        return BeliefSet(model, belief_list)


    def train(self,
              expansions: int,
              update_passes: int = 1,
              max_belief_growth: int = 10,
              initial_belief: BeliefSet | Belief | None = None,
              initial_value_function: ValueFunction | None = None,
              mdp_policy: ValueFunction | None = None,
              prune_level: int = 1,
              prune_interval: int = 10,
              limit_value_function_size: int = -1,
              gamma: float = 0.99,
              eps: float = 1e-6,
              use_gpu: bool = False,
              history_tracking_level: int = 1,
              overwrite_training: bool = False,
              print_progress: bool = True,
              print_stats: bool = True
              ) -> TrainingHistory:
        '''
        Main loop of the Point-Based Value Iteration algorithm.
        It consists of 2 steps, Expand and Backup.
        1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
        2. Backup: Updates the alpha vectors based on the current belief set

        Forward Search Value Iteration:
        - By default it performs the backup only on the set of beliefs generated by the expand function (so full_backup=False).

        Parameters
        ----------
        expansions : int
            How many times the algorithm has to expand the belief set. (The size will be doubled every time, e.g., for 5, the belief set will be of size 32.)
        update_passes : int, default=1
            How many times the backup function has to be run every time the belief set is expanded.
        max_belief_growth : int, default=10
            How many beliefs can be added at every expansion step to the belief set.
        initial_belief : BeliefSet or Belief, optional
            An initial list of beliefs to start with.
        initial_value_function : ValueFunction, optional
            An initial value function to start the solving process with.
        mdp_policy : ValueFunction, optional
            The MDP solution to guide the expand process.
            If it is not provided, the Value Iteration for the MDP version of the problem will be run. (using the same gamma and eps as set here; horizon=1000)
        prune_level : int, default=1
            Parameter to prune the value function further before the expand function.
        prune_interval : int, default=10
            How often to prune the value function. It is counted in number of backup iterations.
        limit_value_function_size : int, default=-1
            When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
            If set to -1, the value function can grow without bounds.
        use_gpu : bool, default=False
            Whether to use the GPU with cupy array to accelerate solving.
        gamma : float, default=0.99
            The discount factor to value immediate rewards more than long term rewards.
            The learning rate is 1/gamma.
        eps : float, default=1e-6
            The smallest allowed change for the value function.
            Below this amount of change, the value function is considered converged and the value iteration process will end early.
        history_tracking_level : int, default=1
            How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
        overwrite_training : bool, default=False
            Whether to force the overwriting of the training if a value function already exists for this agent.
        print_progress : bool, default=True
            Whether or not to print out the progress of the value iteration process.
        print_stats : bool, default=True
            Whether or not to print out statistics at the end of the training run.

        Returns
        -------
        solver_history : SolverHistory
            The history of the solving process with some plotting options.
        '''
        if mdp_policy is not None:
            self.mdp_policy = mdp_policy
        elif (self.mdp_policy is None) or overwrite_training:
            log('MDP_policy not provided. Solving MDP with Value Iteration...')
            self.mdp_policy, hist = vi_solver.solve(model = self.model,
                                                    horizon = 1000,
                                                    initial_value_function = initial_value_function,
                                                    gamma = gamma,
                                                    eps = eps,
                                                    use_gpu = use_gpu,
                                                    history_tracking_level = 1,
                                                    print_progress = print_progress)

            if print_stats:
                print(hist.summary)

        return super().train(expansions = expansions,
                             full_backup = False,
                             update_passes = update_passes,
                             max_belief_growth = max_belief_growth,
                             initial_belief = initial_belief,
                             initial_value_function = initial_value_function,
                             prune_level = prune_level,
                             prune_interval = prune_interval,
                             limit_value_function_size = limit_value_function_size,
                             gamma = gamma,
                             eps = eps,
                             use_gpu = use_gpu,
                             history_tracking_level = history_tracking_level,
                             overwrite_training = overwrite_training,
                             print_progress = print_progress,
                             print_stats = print_stats,
                             mdp_policy = self.mdp_policy)

expand(belief_set, value_function, max_generation, mdp_policy)

Function implementing the exploration process using the MDP policy in order to generate a sequence of beliefs following the Forward Search Value Iteration principles. A loop is started from an initial state 's' and, using the MDP policy, the best action to take is chosen. Following this, a random next state 's_p' is sampled from the transition probabilities and a random observation 'o' from the observation probabilities. Then the given belief is updated using the chosen action and the observation received, and the updated belief is added to the sequence. Once the state is a goal state, the loop is done and the belief sequence is returned.

Parameters:

Name Type Description Default
belief_set BeliefSet

A belief set containing a single belief to start the sequence with. A random state will be chosen based on the probability distribution of the belief.

required
value_function ValueFunction

The current value function. (NOT USED)

required
max_generation int

How many beliefs to be generated at most.

required
mdp_policy ValueFunction

The MDP policy used to choose the action for the given state 's'.

required

Returns:

Name Type Description
belief_set BeliefSet

A new sequence of beliefs.
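In normal use, expand is driven by train, but as a hypothetical illustration it can also be called directly to inspect the belief trajectory it generates. Here b0_set is assumed to be a BeliefSet holding a single starting belief, and agent.mdp_policy is assumed to have been set by a previous train() call:

# Hypothetical direct call; `b0_set` and a trained `agent` are assumptions.
new_beliefs = agent.expand(belief_set=b0_set,
                           value_function=agent.value_function,  # not used by this expand
                           max_generation=100,
                           mdp_policy=agent.mdp_policy)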

Source code in olfactory_navigation/agents/fsvi_agent.py
def expand(self,
           belief_set: BeliefSet,
           value_function: ValueFunction,
           max_generation: int,
           mdp_policy: ValueFunction
           ) -> BeliefSet:
    '''
    Function implementing the exploration process using the MDP policy in order to generate a sequence of beliefs following the Forward Search Value Iteration principles.
    A loop is started from an initial state 's' and, using the MDP policy, the best action to take is chosen.
    Following this, a random next state 's_p' is sampled from the transition probabilities and a random observation 'o' from the observation probabilities.
    Then the given belief is updated using the chosen action and the observation received and the updated belief is added to the sequence.
    Once the state is a goal state, the loop is done and the belief sequence is returned.

    Parameters
    ----------
    belief_set : BeliefSet
        A belief set containing a single belief to start the sequence with.
        A random state will be chosen based on the probability distribution of the belief.
    value_function : ValueFunction
        The current value function. (NOT USED)
    max_generation : int
        How many beliefs to be generated at most.
    mdp_policy : ValueFunction
        The MDP policy used to choose the action for the given state 's'.

    Returns
    -------
    belief_set : BeliefSet
        A new sequence of beliefs.
    '''
    # GPU support
    xp = np if not self.on_gpu else cp
    model = self.model

    # Getting initial belief
    b0 = belief_set.belief_list[0]
    belief_list = [b0]

    # Choose a random starting state
    s = b0.random_state()

    # Setting the working belief
    b = b0

    for _ in range(max_generation - 1): # -1 due to one belief already being present in the set
        # Choose action based on mdp value function
        a_star = xp.argmax(mdp_policy.alpha_vector_array[:,s])

        # Pick a random next state (weighted by transition probabilities)
        s_p = model.transition(s, a_star)

        # Pick a random observation weighted by observation probabilities in state s_p and after having done action a_star
        o = model.observe(s_p, a_star)

        # Generate a new belief based on a_star and o
        b_p = b.update(a_star, o)

        # Record new belief
        belief_list.append(b_p)

        # Updating s and b
        s = s_p
        b = b_p

        # Reset state and belief if an end state is reached
        if s in model.end_states:
            s = b0.random_state()
            b = b0

    return BeliefSet(model, belief_list)

train(expansions, update_passes=1, max_belief_growth=10, initial_belief=None, initial_value_function=None, mdp_policy=None, prune_level=1, prune_interval=10, limit_value_function_size=-1, gamma=0.99, eps=1e-06, use_gpu=False, history_tracking_level=1, overwrite_training=False, print_progress=True, print_stats=True)

Main loop of the Point-Based Value Iteration algorithm. It consists of 2 steps, Expand and Backup:

  1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function.

  2. Backup: Updates the alpha vectors based on the current belief set.

Forward Search Value Iteration: by default, the backup is performed only on the set of beliefs generated by the expand function (so full_backup=False).

Parameters:

Name Type Description Default
expansions int

How many times the algorithm has to expand the belief set. (The size will be doubled every time, e.g., for 5, the belief set will be of size 32.)

required
update_passes int

How many times the backup function has to be run every time the belief set is expanded.

1
max_belief_growth int

How many beliefs can be added at every expansion step to the belief set.

10
initial_belief BeliefSet or Belief

An initial list of beliefs to start with.

None
initial_value_function ValueFunction

An initial value function to start the solving process with.

None
mdp_policy ValueFunction

The MDP solution to guide the expand process. If it is not provided, the Value Iteration for the MDP version of the problem will be run. (using the same gamma and eps as set here; horizon=1000)

None
prune_level int

Parameter to prune the value function further before the expand function.

1
prune_interval int

How often to prune the value function. It is counted in number of backup iterations.

10
limit_value_function_size int

When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function. If set to -1, the value function can grow without bounds.

-1
use_gpu bool

Whether to use the GPU with cupy array to accelerate solving.

False
gamma float

The discount factor to value immediate rewards more than long term rewards. The learning rate is 1/gamma.

0.99
eps float

The smallest allowed change for the value function. Below this amount of change, the value function is considered converged and the value iteration process will end early.

1e-6
history_tracking_level int

How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and belief sets)

1
overwrite_training bool

Whether to force the overwriting of the training if a value function already exists for this agent.

False
print_progress bool

Whether or not to print out the progress of the value iteration process.

True
print_stats bool

Whether or not to print out statistics at the end of the training run.

True

Returns:

Name Type Description
solver_history SolverHistory

The history of the solving process with some plotting options.
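As a hedged illustration (not from the source), the mdp_policy parameter allows a precomputed MDP solution to be reused across training runs so that the Value Iteration step is not repeated. Here vi_solver is assumed to be the same solver module that fsvi_agent.py imports internally; the call mirrors the one made inside train:

# Solve the fully observable (MDP) version of the problem once...
mdp_policy, hist = vi_solver.solve(model=agent.model,
                                   horizon=1000,
                                   initial_value_function=None,
                                   gamma=0.99,
                                   eps=1e-6,
                                   use_gpu=False,
                                   history_tracking_level=1,
                                   print_progress=True)

# ...and reuse it in one or more training runs.
history = agent.train(expansions=100, mdp_policy=mdp_policy)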

Source code in olfactory_navigation/agents/fsvi_agent.py
def train(self,
          expansions: int,
          update_passes: int = 1,
          max_belief_growth: int = 10,
          initial_belief: BeliefSet | Belief | None = None,
          initial_value_function: ValueFunction | None = None,
          mdp_policy: ValueFunction | None = None,
          prune_level: int = 1,
          prune_interval: int = 10,
          limit_value_function_size: int = -1,
          gamma: float = 0.99,
          eps: float = 1e-6,
          use_gpu: bool = False,
          history_tracking_level: int = 1,
          overwrite_training: bool = False,
          print_progress: bool = True,
          print_stats: bool = True
          ) -> TrainingHistory:
    '''
    Main loop of the Point-Based Value Iteration algorithm.
    It consists of 2 steps, Expand and Backup.
    1. Expand: Expands the belief set with an expansion strategy given by the parameter expand_function
    2. Backup: Updates the alpha vectors based on the current belief set

    Forward Search Value Iteration:
    - By default it performs the backup only on the set of beliefs generated by the expand function (so full_backup=False).

    Parameters
    ----------
    expansions : int
        How many times the algorithm has to expand the belief set. (The size will be doubled every time, e.g., for 5, the belief set will be of size 32.)
    update_passes : int, default=1
        How many times the backup function has to be run every time the belief set is expanded.
    max_belief_growth : int, default=10
        How many beliefs can be added at every expansion step to the belief set.
    initial_belief : BeliefSet or Belief, optional
        An initial list of beliefs to start with.
    initial_value_function : ValueFunction, optional
        An initial value function to start the solving process with.
    mdp_policy : ValueFunction, optional
        The MDP solution to guide the expand process.
        If it is not provided, the Value Iteration for the MDP version of the problem will be run. (using the same gamma and eps as set here; horizon=1000)
    prune_level : int, default=1
        Parameter to prune the value function further before the expand function.
    prune_interval : int, default=10
        How often to prune the value function. It is counted in number of backup iterations.
    limit_value_function_size : int, default=-1
        When the value function size crosses this threshold, a random selection of 'max_belief_growth' alpha vectors will be removed from the value function
        If set to -1, the value function can grow without bounds.
    use_gpu : bool, default=False
        Whether to use the GPU with cupy array to accelerate solving.
    gamma : float, default=0.99
        The discount factor to value immediate rewards more than long term rewards.
        The learning rate is 1/gamma.
    eps : float, default=1e-6
        The smallest allowed change for the value function.
        Below this amount of change, the value function is considered converged and the value iteration process will end early.
    history_tracking_level : int, default=1
        How thorough the tracking of the solving process should be. (0: Nothing; 1: Times and sizes of belief sets and value function; 2: The actual value functions and beliefs sets)
    overwrite_training : bool, default=False
        Whether to force the overwriting of the training if a value function already exists for this agent.
    print_progress : bool, default=True
        Whether or not to print out the progress of the value iteration process.
    print_stats : bool, default=True
        Whether or not to print out statistics at the end of the training run.

    Returns
    -------
    solver_history : SolverHistory
        The history of the solving process with some plotting options.
    '''
    if mdp_policy is not None:
        self.mdp_policy = mdp_policy
    elif (self.mdp_policy is None) or overwrite_training:
        log('MDP_policy not provided. Solving MDP with Value Iteration...')
        self.mdp_policy, hist = vi_solver.solve(model = self.model,
                                                horizon = 1000,
                                                initial_value_function = initial_value_function,
                                                gamma = gamma,
                                                eps = eps,
                                                use_gpu = use_gpu,
                                                history_tracking_level = 1,
                                                print_progress = print_progress)

        if print_stats:
            print(hist.summary)

    return super().train(expansions = expansions,
                         full_backup = False,
                         update_passes = update_passes,
                         max_belief_growth = max_belief_growth,
                         initial_belief = initial_belief,
                         initial_value_function = initial_value_function,
                         prune_level = prune_level,
                         prune_interval = prune_interval,
                         limit_value_function_size = limit_value_function_size,
                         gamma = gamma,
                         eps = eps,
                         use_gpu = use_gpu,
                         history_tracking_level = history_tracking_level,
                         overwrite_training = overwrite_training,
                         print_progress = print_progress,
                         print_stats = print_stats,
                         mdp_policy = self.mdp_policy)