How to use Arena

Whether you are developing models for delivery route optimisation or energy distribution, Arena enables you to build and deploy faster than ever before and at no cost.

For the best user experience, we recommend using Google Chrome.

How can I upload and validate a custom environment?
To upload a custom environment, first navigate to the Custom Environments section on the sidebar and select the New Environment button on the top right.

This will bring up a pop up window where you will be able to start setting up your custom environment. Get started by inputting the environment Name and Description as well as selecting whether it is a Single-agent or Multi-agent environment.

Once done, click Next. Now you are ready to start uploading your environment file(s).

If your environment is more complex than a single file, you can upload a folder containing nested folders and scripts. At the highest level within the directory, a requirements.txt file should be included to specify dependencies, and a config.yaml file should also be included within a configs directory to outline any arguments that the environment will require.

Click Next to upload your Requirements file or type out your dependencies in the box on the right.

Once done, you can move on to uploading a config file or providing any arguments required by the constructor of your environment class.


After the Configuration step, you can move on to selecting the path to your environment class. Arena will automatically detect the environment classes in your upload and display a list of possible options. Select the entrypoint you want to use from the list.

Here, you also have the option to rename the first version of the environment. "v1" is the default version name; however, you can change this in the entry field shown.


If you want to make changes after uploading your file(s), you can select any file to bring up the code. Once you have made changes, click the Validate button to run the validation checks against the entire environment. If you would like to save your changes, simply click Save.

If you make changes to any of the files and would like to keep the original version, simply click the Save button and enter a new version name. This will create a new version of your environment.

You can view the various versions and the results of the validation checks by selecting the custom environment name in the breadcrumb. This will open up the full list of versions in an accordion.

Once the environment has been successfully validated, you can select it as an environment in the Environment section of the training workflow (follow the "How do I train an agent?" FAQ, which walks through the training workflow). When you navigate to this section, you can locate your custom environment by searching for it or looking through the Environments table.

After selecting the environment, you will be shown a summary of the environment, the validation checks, and, for your reference, a graph showing the reward distribution of a random agent. A rendered GIF of your environment is also shown if your environment has a render function and an rgb_array render mode.
What format should my custom environment be in?
Custom environments must be implemented in Python and they must follow the Gymnasium API. Uploaded environments will be validated to ensure they follow the Gymnasium API and you will not be able to commence training until these checks are all completed successfully.

If uploaded environments use a render method, the environment will only be rendered during validation if an rgb_array render mode is provided.

Custom environment packages should be uploaded as a folder, containing all necessary files. At the highest level within the directory, a requirements.txt file should be included to specify dependencies, and a config.yaml file within a configs directory to outline any arguments that the environment will require. You can also upload a single script if your environment is contained entirely within this script.
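
To make the folder layout described above concrete, the sketch below creates a skeleton package in a temporary directory and lists its files. Only requirements.txt and configs/config.yaml come from the description above. The my_env folder, the line_world.py script and the keys written to config.yaml are hypothetical placeholders, and the exact schema Arena expects for config.yaml is not shown here.

```python
import tempfile
from pathlib import Path


def scaffold_environment_package(root: Path) -> list[str]:
    """Create a skeleton custom-environment folder and return its relative file paths."""
    files = {
        # Required at the top level: the environment's dependencies.
        "requirements.txt": "gymnasium\nnumpy\n",
        # Required inside a configs directory: arguments for the environment.
        # The keys below are illustrative only.
        "configs/config.yaml": "size: 8\nmax_steps: 50\n",
        # Hypothetical nested package holding the environment code.
        "my_env/__init__.py": "",
        "my_env/line_world.py": "# environment class goes here\n",
    }
    for relative_path, content in files.items():
        path = root / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    layout = scaffold_environment_package(Path(tmp) / "line_world_package")
    print("\n".join(layout))
```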
How do I train an agent?
Start by creating a group, then set up a project within that group. Inside the project, you can create an experiment, where you can train an agent. This structure helps you keep your work organised.

Note:
Group: A Group is the top-level organization unit. It helps you manage and categorize multiple projects, allowing teams to collaborate and share resources.
Project: Within a Group, you create Projects. Each Project is a workspace where you can manage related experiments and keep all the work for a specific goal or task together.
Experiment: An Experiment is part of a Project and is where you define and run your tests. It's where you configure the settings, parameters, and data to train an agent.
Agent: The Agent is what you train in an Experiment. It's the entity that learns and improves based on the data and parameters you've set in the Experiment.

To start, you will need to create a group by selecting the Experiment Groups tab on the sidebar and then selecting the New Group button.


Once you have created a group, select the View button on the new tile which will take you to a new page allowing you to create projects for your experiments.

Select the New Project button (or the Create Project button if you are creating the first project within the group). A pop up will appear allowing you to enter a project name and description.


Once a project has been created, you can host multiple experiments within it. To create an experiment in your project, select the New Experiment button on the project page. A pop up will appear allowing you to enter an experiment name and description.

Once an experiment is created, you will be taken into the experiment creation flow. To train an agent you will have to choose an environment. This can either be one of the many popular environments available on the platform or a custom environment which has been previously uploaded.

Once you have selected an environment, the validation results will be displayed. The Environment Summary table contains a summary of the operational parameters of the selected environment as well as its known category (if not custom) and version saved. The Environment Test Results from the environment validation are displayed in the top right quadrant of the page. Finally, a random agent rollout of the environment is displayed with the scores plotted over 100 episodes and the environment rendering if available.

Next, you will be prompted to select an algorithm that is compatible with your chosen environment. DQN, DQN Rainbow and PPO are algorithms used to train discrete-action type agents. DDPG, TD3 and PPO are algorithms used to train continuous-action type agents. If you need to learn about any of the given parameters or specifications, you can hover over the information icons provided. The AgileRL framework documentation also offers a great overview of SOTA reinforcement learning algorithms and techniques along with references.

An MLP (linear layers) or a CNN (convolutional layers stacked with linear layers) is loaded depending on whether the environment has vector or image observations, respectively. These architectures can be modified by adding or deleting layers directly on the page.
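
The compatibility rules in the two paragraphs above can be summarised in a few lines. This is only a sketch: the function name suggest_setup is made up, and treating any 3-dimensional observation shape as an image is an assumption rather than Arena's documented detection logic.

```python
ALGORITHMS_BY_ACTION_TYPE = {
    "discrete": ("DQN", "DQN Rainbow", "PPO"),
    "continuous": ("DDPG", "TD3", "PPO"),
}


def suggest_setup(action_type, observation_shape):
    """Return the compatible algorithms and the default network type."""
    if action_type not in ALGORITHMS_BY_ACTION_TYPE:
        raise ValueError(f"unknown action type: {action_type!r}")
    # Assumption: image observations are 3-D arrays (for example height x width
    # x channels); anything else is treated as a vector observation.
    network = "CNN" if len(observation_shape) == 3 else "MLP"
    return {"algorithms": ALGORITHMS_BY_ACTION_TYPE[action_type], "network": network}


print(suggest_setup("discrete", (4,)))
print(suggest_setup("continuous", (84, 84, 3)))
```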

After completing the Algorithm section, click the Next button to proceed to the Training setup. All experiment configurations are pre-populated with template values. The training sections include environment vectorization (each agent in the population trains separately on its own vectorized environments) and, for off-policy algorithms, the replay buffer and its parameters.

AgileRL Arena uses evolutionary population training with agent mutations. At the end of each epoch, tournament selection takes place: the fittest agents are preserved, cloned and mutated. Each tournament draws a random subset of the population and selects the fittest agent from it, and elitism controls whether the overall fittest agent is always kept. The selected agents are then mutated according to probabilities that the user can set manually. The mutations evolve the neural network architectures and the algorithm's learning hyperparameters.
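
The selection step described above can be sketched in plain Python as follows. This is a simplified illustration, not AgileRL's implementation: the function name and arguments are made up, and cloning and mutating the selected agents would happen afterwards.

```python
import random


def tournament_selection(fitnesses, tournament_size, elitism, rng):
    """Return the indices of the agents that seed the next generation."""
    indices = range(len(fitnesses))
    selected = []
    if elitism:
        # Elitism: always keep the overall fittest agent.
        selected.append(max(indices, key=fitnesses.__getitem__))
    while len(selected) < len(fitnesses):
        # Each tournament draws a random subset and keeps its fittest member.
        contenders = rng.sample(indices, k=tournament_size)
        selected.append(max(contenders, key=fitnesses.__getitem__))
    return selected


rng = random.Random(42)
fitnesses = [12.0, 30.5, 7.2, 25.1, 18.9, 3.3]  # illustrative fitness values
parents = tournament_selection(fitnesses, tournament_size=3, elitism=True, rng=rng)
print(parents)
```

A larger tournament size increases selection pressure, since weak agents are less likely to be the best of a bigger random subset. A smaller one keeps more diversity in the population.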

When you are done with the Training section, select the Next button to go to the Resources page, which displays a summary of the experiment settings. On this page, you can select the size of the experiment's compute node.

You can either start training immediately by clicking the Train button in the top-right corner shown in the figure above, or save the experiment and submit a training job later. Once you click Train or Save, you will be redirected to the Experiments page, which will show the new experiment as well as previous experiments.
How can I monitor an experiment?

Visualising experiments and experiment controls
Running and Completed experiments can be viewed on the Results tab. You may apply further controls to your experiments (as shown below): Stop training and Edit mutation parameters while an experiment is running, and View logs both during and after an experiment. The logs may be helpful in making decisions for a live experiment.

Experiments currently in training will be listed as Running, and you will be able to visualise their performance results in real time. You can show or hide the data for each experiment using the eye-shaped icon (which will be shown as whole or crossed, respectively). If you think that an experiment can be ended early (its results have converged), or that the mutation probabilities and ranges should be changed, you may click the relevant icon at the end of its row in the Running table. This will give you the ability to Stop or Edit the experiment.

Updating Hyperparameter Optimisation
Over the course of an experiment, you may want to adjust the mutations to boost the learning of the agents. You can do this in two ways: adjusting the probability of specific mutations, and adjusting the range of the target changes, such as the learning step. To do this, select the Edit mutation parameters button in the experiment menu. A popup will appear which will allow you to update the mutation parameters in real time.

Note: You can also disable the mutation process for the agents by setting probabilities to 0 for all mutation types except None.
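
To see why setting every other probability to 0 disables mutation, consider this small sketch of probability-weighted mutation sampling and range-limited changes. The mutation names, the 20% step size and both function names are assumptions made for illustration. They are not Arena's actual parameter names.

```python
import random


def sample_mutation(probabilities, rng):
    """Pick one mutation type, weighted by its user-set probability."""
    names = list(probabilities)
    weights = [probabilities[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def mutate_within_range(value, low, high, rng):
    """Scale a hyperparameter up or down by 20% (assumed step) and clamp it to [low, high]."""
    return min(high, max(low, value * rng.choice((0.8, 1.2))))


rng = random.Random(0)
enabled = {"None": 0.4, "architecture": 0.3, "hyperparameters": 0.3}
disabled = {"None": 1.0, "architecture": 0.0, "hyperparameters": 0.0}

print(sample_mutation(enabled, rng))
# With all other probabilities at 0, only "None" can ever be drawn.
print({sample_mutation(disabled, rng) for _ in range(1000)})  # {'None'}
print(mutate_within_range(1e-3, 1e-4, 1e-3, rng))
```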

Looking at the logs
In order to match a particular mutation and agent selection to a performance change in one of the visualisation plots, you may filter the logs by time range via the Run Query button on the top right along with the start/end time and dates selection tool. Then you can inspect the fitness evaluations for each agent and judge the quality of the current hyperparameters selection. You may also directly search for a particular keyword or substring in the logs.

Resuming a stopped experiment
An experiment which has run its course (succeeded) or has been halted can be resumed from the Experiments page. It will run again for the same number of steps that was set when it was created. In other words, an experiment of 1 million steps will be scheduled to run for that many steps every time it is run.

How can I analyse the performance of my agent?
Once training has been started, population performance can be tracked via the Results page. Performance metrics of an experiment will be updated periodically, until the experiment completes (and the visualisations reach their final state).
How to visualise results of an experiment
One or more experiments may be selected for visualisation. When more than one experiment is selected for visualisation, it is possible to directly compare the performance of distinct algorithms (or environments) simultaneously, on the same plots. A visualisation will load the data from an experiment run. The eye-shaped icon is used to show or hide this data from the plots of an experiment.

Users have the option to view default plots or create custom graphs. The accordion on the right of the graph page is split into 3 sections: the Custom Charts section, where users can create their own graphs; the Default Training Charts section, which shows preconfigured graphs (explained below); and the Hyperparameter and Network Optimization Charts section, which displays default charts showing how the hyperparameters evolve over time.

Note: Plots only appear if an experiment has already produced data to show. Experiments which have just been created will not immediately have the data.

Visualising a single run
Visualising your experiment as a single run alone yields insight into the evolutionary HPO. To do so, select or expose only one of the experiments. This will reveal metrics for each of the agents in the population. Those are: the score (i.e. taken from the active training or learning episodes), fitness (evaluated separately without the training), learning algorithm losses, episode time (calculated from the evaluations) and finally the number of steps through time (to compare the speed of different agents).

In any of the plots, clicking on one of the legends allows you to hide or display the entity in the plot (in the case of this single experiment inspection: each agent in the population).

Comparing two or more runs
When comparing two or more runs, the visualisations switch from the agent-level to population-level. Now, for each experiment, you’ll be able to see the averaged score across agents within the population, the best fitness, the average learning algorithm losses, the average episode time, and the number of steps taken.
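
The switch from agent-level to population-level metrics amounts to a simple aggregation, sketched below. The dictionary keys and the numbers are made up for illustration and do not reflect the format of Arena's stored results.

```python
from statistics import mean


def population_summary(agents):
    """Collapse per-agent metrics into one population-level record."""
    return {
        "mean_score": mean(agent["score"] for agent in agents),
        "best_fitness": max(agent["fitness"] for agent in agents),
        "mean_loss": mean(agent["loss"] for agent in agents),
        "mean_episode_time": mean(agent["episode_time"] for agent in agents),
    }


# Illustrative metrics for a population of three agents at one point in training.
agents = [
    {"score": 10.0, "fitness": 12.0, "loss": 0.5, "episode_time": 2.0},
    {"score": 20.0, "fitness": 25.0, "loss": 0.25, "episode_time": 4.0},
    {"score": 30.0, "fitness": 18.0, "loss": 0.75, "episode_time": 6.0},
]
summary = population_summary(agents)
print(summary)
```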

Experiment summaries and comparison
There will be a table beneath the performance plots, which can be expanded for each domain (e.g. Environment, Algorithm…), to display specification summaries of the selected visualisation experiments side-by-side.

A concise summary can be extracted by filtering the display to show only the differences across experiments, which can be useful for determining the best training configuration.
How can I deploy and use my trained agent?
Agents are deployed from the Experiments page. Select the Deploy button on the row of the experiment you wish to deploy. A pop up will then appear from which you can pick the checkpoint you wish to deploy. By default, the latest checkpoint is selected.

After you have selected Confirm, the agent will be added to the Deployments page permanently. There, it can be activated and de-activated via the ‘Connect’ and ‘Disconnect’ icons on its corresponding row.

The Deployments page comes with a default API (script) that you should use for querying deployed agents. A connection is established to a deployed agent via its URL. An API key token is also provided for each experiment and is passed to the API for authentication (with OAuth2). A query must supply a batch of one or more environment states, for which the agent returns actions (with the generic objective of stepping through the corresponding environments), while the agent remains deployed natively in AgileRL Arena. A successful query returns HTTP status 200 along with the action value(s) and, as an acknowledgement, the input batch size. Otherwise, the error code is returned along with a JSON response if available.
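
The script on the Deployments page is the reference for the real request format. Purely to illustrate the flow described above, the sketch below builds a batched request and checks a response using only the Python standard library. Everything specific in it is an assumption: the URL is a placeholder, the "observations" and "actions" field names are hypothetical, and sending the token as a Bearer header is a common OAuth2 convention rather than a confirmed detail of Arena's API. No request is actually sent.

```python
import json
import urllib.request


def build_action_request(agent_url, api_token, observations):
    """Build (but do not send) a POST request carrying a batch of environment states."""
    payload = json.dumps({"observations": observations}).encode("utf-8")  # hypothetical field
    return urllib.request.Request(
        agent_url,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_token}",  # assumed OAuth2 bearer scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_action_response(status, body, batch_size):
    """Return the actions from a successful response, checking the batch size."""
    if status != 200:
        raise RuntimeError(f"query failed with HTTP status {status}: {body}")
    actions = json.loads(body)["actions"]  # hypothetical field
    if len(actions) != batch_size:
        raise ValueError("response batch size does not match the query")
    return actions


batch = [[0.0, 0.1], [0.2, 0.3]]  # two environment states
request = build_action_request("https://example.invalid/agents/demo", "demo-token", batch)
print(request.get_method(), request.full_url)

# A made-up response body, standing in for what a deployed agent might return.
print(parse_action_response(200, '{"actions": [1, 0]}', batch_size=len(batch)))
```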
How does evolutionary hyperparameter optimisation work?
Traditionally, hyperparameter optimization (HPO) for reinforcement learning (RL) is particularly difficult when compared to other types of machine learning. This is for several reasons, including the relative sample inefficiency of RL and its sensitivity to hyperparameters.

AgileRL significantly improves HPO for reinforcement learning through the use of evolutionary algorithms to reduce overall training time whilst making the process more robust. Evolutionary algorithms have been shown to allow faster, automatic convergence to optimal hyperparameters than other HPO methods by taking advantage of shared memory between a population of agents acting in identical environments.

At regular intervals, after learning from shared experiences, a population of agents can be evaluated in an environment. Through tournament selection, the best agents are selected to survive until the next generation, and their offspring are mutated to further explore the hyperparameter space. Eventually, the optimal hyperparameters for learning in a given environment can be reached in significantly fewer steps than are required using other HPO methods.
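
The evaluate, select and mutate cycle can be illustrated on a toy problem. In the sketch below, each "agent" is just a learning rate and the fitness function is an artificial stand-in that peaks at 1e-3. In real training, fitness would come from evaluating the agent in the environment. The mutation factors and all names are assumptions chosen for illustration, not AgileRL's implementation.

```python
import math
import random


def fitness(learning_rate):
    """Artificial fitness that is highest when the learning rate equals 1e-3."""
    return -((math.log10(learning_rate) + 3.0) ** 2)


def evolve(population, generations, rng, tournament_size=2):
    """Run the evaluate -> select -> mutate cycle and record the best fitness per generation."""
    history = []
    for _ in range(generations):
        scores = [fitness(lr) for lr in population]
        history.append(max(scores))
        # Elitism: the fittest agent survives unchanged.
        next_population = [population[scores.index(max(scores))]]
        while len(next_population) < len(population):
            contenders = rng.sample(range(len(population)), k=tournament_size)
            parent = population[max(contenders, key=scores.__getitem__)]
            # Mutation: the offspring's learning rate is halved, kept or doubled.
            next_population.append(parent * rng.choice((0.5, 1.0, 2.0)))
        population = next_population
    return population, history


final_population, history = evolve([1e-1, 3e-2, 1e-5, 5e-6], generations=20, rng=random.Random(0))
print(f"best fitness: {history[0]:.2f} -> {history[-1]:.2f}")
```

Because the fittest agent is carried over unchanged, the best fitness in this sketch can never decrease from one generation to the next, all within a single run.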

Our evolutionary approach allows for HPO in a single training run compared to Bayesian methods that require multiple sequential training runs to achieve similar, and often inferior, results.

Interactive library

Check out the interactive library to help you get the most out of Arena.