The classic “agent-environment loop” pictured below is a simplified representation of reinforcement learning that Gymnasium implements.
This loop is implemented using the following Gymnasium code:
import gymnasium as gym

env = gym.make("LunarLander-v2", render_mode="human")
observation, info = env.reset()

for _ in range(1000):
    action = env.action_space.sample()  # agent policy that uses the observation and info
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()
First, an environment is created using make with an additional keyword "render_mode" that specifies how the environment should be visualised. See render for details on the default meaning of different render modes. In this example, we use the "LunarLander" environment where the agent controls a spaceship that needs to land safely.
After initializing the environment, we reset the environment to get its first observation. To initialize the environment with a particular random seed or options (see the environment documentation for possible values), use the seed or options parameters of reset.
Next, the agent performs an action in the environment using step. This can be imagined as moving a robot or pressing a button on a game controller that causes a change within the environment. As a result, the agent receives a new observation from the updated environment along with a reward for taking the action. This reward could be, for instance, positive for destroying an enemy or negative for moving into lava. One such action-observation exchange is referred to as a timestep.
However, after some timesteps, the environment may end; this is called the terminal state. For instance, the robot may have crashed, or the agent may have succeeded in completing a task, and the environment needs to stop because the agent cannot continue. In Gymnasium, if the environment has terminated, step reports this through the terminated value it returns. Similarly, we may also want the environment to end after a fixed number of timesteps; in this case, the environment issues a truncated signal. If either terminated or truncated is true, reset should be called next to restart the environment.
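The difference between the two signals can be illustrated with a toy environment written in plain Python. CountdownEnv and run_episode are hypothetical and not part of Gymnasium; they only mimic the shape of the values returned by reset and step, and the time limit is handled inside the class for brevity.

```python
class CountdownEnv:
    """Hypothetical toy environment that mimics the reset/step return shapes."""

    def __init__(self, start=3, max_steps=5):
        self.start = start
        self.max_steps = max_steps

    def reset(self):
        self.value = self.start
        self.elapsed = 0
        return self.value, {}

    def step(self, action):
        # The action is the amount subtracted from the counter (0 means "wait").
        self.value -= action
        self.elapsed += 1
        terminated = self.value <= 0  # the task itself is finished
        truncated = self.elapsed >= self.max_steps  # the time limit was hit
        reward = 1.0 if terminated else 0.0
        return self.value, reward, terminated, truncated, {}


def run_episode(env, action):
    """Repeat one fixed action until the episode ends, then report why."""
    observation, info = env.reset()
    while True:
        observation, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            return terminated, truncated, env.elapsed


env = CountdownEnv(start=3, max_steps=5)
print(run_episode(env, action=1))  # (True, False, 3): reached the terminal state
print(run_episode(env, action=0))  # (False, True, 5): cut off by the time limit
```

In both cases the next call must be reset, but only the first outcome says anything about the task itself.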