The idea for this project was to build a bot that can learn to play Pokémon, specifically to battle other trainers. The bot would learn the different mechanics of the game, from choosing the optimal move each turn to forming long-term strategies to win matches.
The most practical platform for developing such a bot is Pokémon Showdown, an online battle simulator that is lightweight, free to play, and very accessible for this purpose. Previous work exists on similar projects, notably with the Poke-env library, which provides easy access to all the battle data the bot needs and eliminates much of the technical implementation a classic Pokémon game would require.
The goal was to build a bot for the online game Pokemon Showdown using reinforcement learning methods such as:
- DDQN
- PPO
- REINFORCE
The bot would be hosted on the online Pokemon Showdown server, allowing players to battle against it with the help of Poke-env.
REINFORCE is a policy gradient method that directly optimizes the agent's policy through trial and error by adjusting action probabilities based on rewards. It relies solely on the return from the environment to update the policy, without the need for a value function. While simple, it can be slow and less stable due to the high variance of its updates, especially in complex environments with delayed rewards.
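As a minimal illustration (not the project's training code), the core REINFORCE update can be sketched in PyTorch as follows; `log_probs` is assumed to hold the log-probability of each action taken during one episode, and `rewards` the per-step rewards:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE loss for one episode: -sum_t log pi(a_t|s_t) * G_t,
    where G_t is the discounted return from step t onward."""
    returns = []
    g = 0.0
    for r in reversed(rewards):  # compute returns back-to-front
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()
```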
Proximal Policy Optimization (PPO) is a policy gradient method that improves on REINFORCE by using a clipped objective to prevent large, destabilizing policy updates. Unlike REINFORCE, PPO often pairs with an Actor-Critic architecture, where the critic estimates the value function to stabilize learning. Its stability and efficiency make it a more robust choice, especially for continuous and large-scale tasks.
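The clipped surrogate objective can be sketched as follows, assuming advantages and the behavior policy's log-probabilities have already been computed:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """PPO clipped surrogate: keeps the probability ratio within [1-eps, 1+eps]
    so a single update cannot move the policy too far."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```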
Double Deep Q-Network (DDQN) is a value-based method that refines the original DQN by separating action selection and evaluation to avoid overestimating Q-values. Unlike PPO and REINFORCE, which focus on learning a policy, DDQN learns the value of state-action pairs and uses these values to guide decision-making. This method is particularly effective in environments where learning precise action values is crucial for long-term success.
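A sketch of the Double DQN target computation; `online_net` and `target_net` are assumed to map a batch of states to per-action Q-values, and `rewards` and `dones` to be float tensors:

```python
import torch

def ddqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the online network *selects* the next action,
    the target network *evaluates* it, which curbs Q-value overestimation."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones) * next_q
```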
The Actor-Critic method combines two networks: the actor, which selects actions according to the current policy, and the critic, which estimates the value of the current state to guide the actor's updates. This architecture reduces the high variance typically seen in pure policy gradient methods like REINFORCE by incorporating value estimates. By leveraging the critic's feedback, the actor improves its policy more efficiently, making Actor-Critic well suited to continuous action spaces and complex environments.

The final model we used was an Actor-Critic trained with the PPO objective. The architecture consists of an actor network and a critic network, with the following layers:
**Actor network:**
- Input Layer: Takes in the state of the environment (`state_dim` features).
- 1st Hidden Layer: Fully connected layer with 64 units and Tanh activation.
- 2nd Hidden Layer: Fully connected layer with 128 units and Tanh activation.
- 3rd Hidden Layer: Another fully connected layer with 128 units and Tanh activation.
- Output Layer: Fully connected layer with `action_dim` units, using Softmax activation to output probabilities for each action.
**Critic network:**
- Input Layer: Same as the actor network, takes in the state (`state_dim` features).
- 1st Hidden Layer: Fully connected layer with 64 units and Tanh activation.
- 2nd Hidden Layer: Fully connected layer with 128 units and Tanh activation.
- 3rd Hidden Layer: Another fully connected layer with 128 units and Tanh activation.
- Output Layer: A single unit (scalar output), representing the estimated value of the input state (used for value prediction).
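Expressed in PyTorch, the two networks look roughly as follows (a sketch reconstructed from the layer description above; the actual code in `PPO2.py` may differ in details). Here `state_dim` is 12 and `action_dim` is 9, matching the state and action spaces described below.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 128), nn.Tanh(),
            nn.Linear(128, 128), nn.Tanh(),
            nn.Linear(128, action_dim), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: maps a state to a scalar value estimate."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 128), nn.Tanh(),
            nn.Linear(128, 128), nn.Tanh(),
            nn.Linear(128, 1),
        )

    def forward(self, state):
        return self.net(state)
```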
The state space *S* consists of all possible states in the environment. Each state *s* is built once per turn by concatenating 12 battle elements (a construction sketch follows the list), which correspond to:
- [0] Our Active Pokémon index
- [1] Opponent Active Pokémon index
- [2-5] Active Pokémon move base powers (default to -1 if a move has no base power)
- [6-9] Active Pokémon move damage multipliers
- [10] Our remaining Pokémon
- [11] Opponent remaining Pokémon
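A sketch of how this vector can be assembled from a Poke-env `Battle` object. The slot-position encoding of the two "index" elements and the base-power scaling are assumptions, and the exact `damage_multiplier` signature varies between Poke-env versions:

```python
import numpy as np

def embed_battle(battle):
    """Build the 12-element state vector described above (a sketch)."""
    # [2-5] base power of the active Pokémon's moves, -1 when unavailable
    moves_base_power = -np.ones(4)
    # [6-9] type-effectiveness multipliers against the opponent's active Pokémon
    moves_dmg_multiplier = np.ones(4)
    for i, move in enumerate(battle.available_moves[:4]):
        if move.base_power:
            moves_base_power[i] = move.base_power / 100  # /100 normalization is an assumption
        if move.type:
            moves_dmg_multiplier[i] = move.type.damage_multiplier(
                battle.opponent_active_pokemon.type_1,
                battle.opponent_active_pokemon.type_2,
            )
    # [0-1] active Pokémon "indices" (here: slot in each team, an assumption)
    own_idx = list(battle.team.values()).index(battle.active_pokemon)
    opp_idx = list(battle.opponent_team.values()).index(battle.opponent_active_pokemon)
    # [10-11] non-fainted Pokémon counts
    own_left = len([p for p in battle.team.values() if not p.fainted])
    opp_left = len([p for p in battle.opponent_team.values() if not p.fainted])
    return np.concatenate(
        [[own_idx, opp_idx], moves_base_power, moves_dmg_multiplier, [own_left, opp_left]]
    )
```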
The action space *A* consists of all possible actions the agent can take. It is the integer range [0, 8], for a total of 9 actions. Each action *a* in *A* corresponds to one of the following choices (a mapping sketch follows the list):
- [0] Use 1st Active Pokémon move
- [1] Use 2nd Active Pokémon move
- [2] Use 3rd Active Pokémon move
- [3] Use 4th Active Pokémon move
- [4] Switch to 1st next Pokémon
- [5] Switch to 2nd next Pokémon
- [6] Switch to 3rd next Pokémon
- [7] Switch to 4th next Pokémon
- [8] Switch to 5th next Pokémon
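A sketch of how these indices can be turned into battle orders with Poke-env (its environment players implement an equivalent mapping internally); an illegal choice falls back to a random legal move here:

```python
def action_to_order(player, action, battle):
    """Map an action index in [0, 8] to a Showdown battle order (a sketch)."""
    if action < 4 and action < len(battle.available_moves):
        # Actions 0-3: use one of the active Pokémon's moves.
        return player.create_order(battle.available_moves[action])
    if action >= 4 and action - 4 < len(battle.available_switches):
        # Actions 4-8: switch to one of the remaining Pokémon.
        return player.create_order(battle.available_switches[action - 4])
    # Fall back to a random legal move when the chosen action is unavailable.
    return player.choose_random_move(battle)
```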
To set up the bot:

- Ensure Python 3.8 or later and `torch` are installed on your system.
- Install the required Python dependencies using pip:

  ```
  pip install -r requirements.txt
  ```
Demo video: `part1.mp4`
To battle the bot, follow these steps:
1. Create Two Pokémon Showdown Accounts:
   - You need two accounts: one to host the bot and another for yourself.
   - Create these accounts at Pokémon Showdown.
2. Prepare the Account Information:
   - Create a file named `Account.txt` in the same directory as your `PPO2.py` script.
   - This file should contain the username and password of the account you will use to host the bot (see the example below).
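   The exact layout of `Account.txt` depends on how `PPO2.py` parses it; one field per line, as shown here, is an assumption:

   ```
   YourBotUsername
   YourBotPassword
   ```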
3. Run the PPO Script:
   - Ensure the `PPO2.py` script, the model weights file, and `Account.txt` are all in the same folder.
   - Execute the script with the following command:

     ```
     python PPO2.py
     ```
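   Under the hood, hosting a bot on the official server with Poke-env looks roughly like the sketch below. The class names are from recent Poke-env versions, `RandomPlayer` merely stands in for the trained agent, and the battle format is an assumption:

   ```python
   import asyncio

   from poke_env import AccountConfiguration, ShowdownServerConfiguration
   from poke_env.player import RandomPlayer

   async def main():
       # RandomPlayer is only a stand-in; PPO2.py wires in the trained policy instead.
       bot = RandomPlayer(
           account_configuration=AccountConfiguration("YourBotUsername", "YourBotPassword"),
           server_configuration=ShowdownServerConfiguration,
           battle_format="gen9randombattle",  # assumption: use the format you battle in
       )
       await bot.accept_challenges(None, 1)  # accept one incoming challenge

   asyncio.run(main())
   ```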
4. Set Up Your Team:
   - Go to the Pokémon Showdown team builder and create a team using the following string. Copy and paste it into the team builder:

     ```
     Qwilfish (Qwilfish-Hisui) @ Eviolite
     Ability: Intimidate
     Level: 83
     Tera Type: Flying
     EVs: 85 HP / 85 Atk / 85 Def / 85 SpA / 85 SpD / 85 Spe
     - Toxic Spikes
     - Crunch
     - Gunk Shot
     - Spikes

     Medicham @ Choice Band
     Ability: Pure Power
     Level: 86
     Tera Type: Fighting
     EVs: 85 HP / 85 Atk / 85 Def / 85 SpA / 85 SpD / 85 Spe
     - Zen Headbutt
     - Ice Punch
     - Poison Jab
     - Close Combat

     Orthworm @ Chesto Berry
     Ability: Earth Eater
     Level: 88
     Tera Type: Electric
     EVs: 85 HP / 85 Atk / 85 Def / 85 SpA / 85 SpD / 85 Spe
     - Body Press
     - Coil
     - Rest
     - Iron Tail

     Chandelure @ Choice Scarf
     Ability: Flash Fire
     Level: 83
     Tera Type: Fire
     EVs: 85 HP / 85 Def / 85 SpA / 85 SpD / 85 Spe
     IVs: 0 Atk
     - Trick
     - Shadow Ball
     - Energy Ball
     - Fire Blast

     Floatzel @ Leftovers
     Ability: Water Veil
     Level: 85
     Tera Type: Dark
     EVs: 85 HP / 85 Atk / 85 Def / 85 SpA / 85 SpD / 85 Spe
     - Crunch
     - Low Kick
     - Wave Crash
     - Bulk Up

     Spiritomb @ Leftovers
     Ability: Infiltrator
     Level: 90
     Tera Type: Dark
     EVs: 85 HP / 85 Atk / 85 Def / 85 SpA / 85 SpD / 85 Spe
     - Poltergeist
     - Toxic
     - Foul Play
     - Sucker Punch
     ```
5. Challenge the Bot:
   - In Pokémon Showdown, use the search feature to find the username associated with the bot.
   - Challenge the bot. It should automatically accept the challenge.
6. Enjoy the Battle:
   - Have fun battling the bot!