This software project and dataset accompany the research paper: Does Spatial Cognition Emerge in Frontier Models?, Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Krähenbühl, Vladlen Koltun. Published at ICLR 2025.
We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.
- Install mamba following the instructions from Miniforge, then create a mamba environment.
mamba create -n space-benchmark python=3.9 cmake=3.14.0 -y
mamba activate space-benchmark
- Clone this repo and install requirements.
pip install -r requirements.txt
- Install habitat-sim for the large-scale spatial cognition navigation experiments (a quick import check follows this list).
mamba install habitat-sim=0.3.0 headless -c conda-forge -c aihabitat
- Generate a security token from huggingface and set the environment variables.
export HF_TOKEN=<security token>
export HF_HUB_ENABLE_HF_TRANSFER=1
- Set OpenAI and Anthropic API keys for evaluating GPT and Claude models.
export OPENAI_API_KEY=<openai api key>
export ANTHROPIC_API_KEY=<anthropic api key>
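Optionally, the habitat-sim install can be verified with a quick import check in the space-benchmark environment (this only confirms the package loads; it is not a required step):
python -c "import habitat_sim"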
The SPACE dataset is available at the URL below. Download it to <repository_root>/data/SPACE_data_release:
mkdir <repository_root>/data
cd <repository_root>/data
wget https://ml-site.cdn-apple.com/datasets/space/space.tar.gz
tar -xvzf space.tar.gz
rm space.tar.gz
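After extraction, listing the dataset directory is a quick way to confirm that the per-task folders referenced by the commands below are in place (a sanity check only):
ls data/SPACE_data_release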
Command for evaluating on the QA tasks:
python -m space.evaluate_qas \
--model_name <model_name>_qa \
--data_path data/SPACE_data_release/<qa_task_name>/qas.json \
--save_dir experiments/<qa_task_name>
QA task names: CBTT_text, CBTT_vision, DirectionEstimationBEVImage, DirectionEstimationBEVText, DirectionEstimationEgo, DistanceEstimationBEVImage, DistanceEstimationBEVText, DistanceEstimationEgo, JLO_text, JLO_vision, MPFB_text, MPFB_vision, MRT_text, MRT_vision, MapSketchingBEVImage, MapSketchingBEVText, MapSketchingEgo, PTT_text, PTT_vision, SAdd_text, SAdd_vision, SAtt_text, SAtt_vision, WLT_vision
Models supported for multimodal presentation: claude35sonnet, gpt4o, gpt4v, phi35vision, pixtral12b
Models supported for text-only presentation: claude35sonnet, gpt4o, gpt4v, llama3_8b, llama3_70b, mixtral8x7b, mixtral8x22b, mistral123b, yi15_9b, yi15_34b
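For example, to evaluate gpt4o on the vision version of the mental rotation test (any model/task pairing from the lists above substitutes into the template the same way):
python -m space.evaluate_qas \
    --model_name gpt4o_qa \
    --data_path data/SPACE_data_release/MRT_vision/qas.json \
    --save_dir experiments/MRT_vision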
These navigation evaluations require habitat-sim to be installed.
Command for egocentric navigation:
python -m space.evaluate_egonav \
--model_name <model_name>_egonav \
--envs_dir data/SPACE_data_release/3D_scenes/ \
--save_dir experiments/RouteRetracingEgo \
--walkthrough_key shortestpath \
--max_steps 250
Command for discrete map image navigation:
python -m space.evaluate_dmnav \
--model_name <model_name>_dminav \
--envs_dir data/SPACE_data_release/2D_scenes \
--walkthroughs_dir data/SPACE_data_release/BEV_image_walkthroughs/ \
--obs_type image \
--save_dir experiments/RouteRetracingDiscreteMapImage \
--walkthrough_key shortestpath
Command for discrete map text navigation:
python -m space.evaluate_dmnav \
--model_name <model_name>_dmtnav \
--envs_dir data/SPACE_data_release/2D_scenes \
--walkthroughs_dir data/SPACE_data_release/BEV_text_walkthroughs/ \
--obs_type text \
--save_dir experiments/RouteRetracingDiscreteMapText \
--walkthrough_key shortestpath
Notes:
- These commands evaluate on route retracing. Set --walkthrough_key walkthrough for evaluating on shortcut discovery.
- Models supported for multimodal presentation: claude35sonnet, gpt4o, gpt4v
- Models supported for text-only presentation: claude35sonnet, gpt4o, gpt4v, llama3_8b, llama3_70b, mixtral8x7b, mixtral8x22b, mistral123b, yi15_9b, yi15_34b
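For example, to evaluate gpt4o on egocentric shortcut discovery instead of route retracing (the save_dir name here is only a suggestion, not a prescribed path):
python -m space.evaluate_egonav \
    --model_name gpt4o_egonav \
    --envs_dir data/SPACE_data_release/3D_scenes/ \
    --save_dir experiments/ShortcutDiscoveryEgo \
    --walkthrough_key walkthrough \
    --max_steps 250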
Command for CSWM with multimodal presentation:
python -m space.evaluate_cswm \
--model_name <model_name>_cswm_vision \
--envs_dir data/SPACE_data_release/CSWM_vision/ \
--save_dir experiments/CSWM_vision \
--game_mode vision
Command for text-only presentation:
python -m space.evaluate_cswm \
--model_name <model_name>_cswm_text \
--envs_dir data/SPACE_data_release/CSWM_text/ \
--save_dir experiments/CSWM_text \
--game_mode text
Models supported for multimodal presentation: claude35sonnet, gpt4o, gpt4v
Models supported for text-only presentation: claude35sonnet, gpt4o, gpt4v, llama3_8b, llama3_70b, mixtral8x7b, mixtral8x22b, mistral123b, yi15_9b, yi15_34b
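For example, to evaluate llama3_70b under the text-only presentation (a direct substitution into the template above):
python -m space.evaluate_cswm \
    --model_name llama3_70b_cswm_text \
    --envs_dir data/SPACE_data_release/CSWM_text/ \
    --save_dir experiments/CSWM_text \
    --game_mode text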
Command for MCT with multimodal presentation:
python -m space.evaluate_mct \
--model_name <model_name>_mct_vision \
--envs_dir data/SPACE_data_release/MCT_vision/envs \
--save_dir experiments/MCT_vision
Command for text-only presentation:
python -m space.evaluate_mct \
--model_name <model_name>_mct_text \
--envs_dir data/SPACE_data_release/MCT_text/envs \
--save_dir experiments/MCT_text
Models supported for multimodal presentation: claude35sonnet, gpt4o, gpt4v
Models supported for text-only presentation: claude35sonnet, gpt4o, gpt4v, llama3_8b, llama3_70b, mixtral8x7b, mixtral8x22b, mistral123b, yi15_9b, yi15_34b
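For example, to evaluate claude35sonnet under the multimodal presentation (a direct substitution into the template above):
python -m space.evaluate_mct \
    --model_name claude35sonnet_mct_vision \
    --envs_dir data/SPACE_data_release/MCT_vision/envs \
    --save_dir experiments/MCT_vision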
If you find this work useful, please cite:
@inproceedings{ramakrishnan2025space,
title={Does Spatial Cognition Emerge in Frontier Models?},
author={Ramakrishnan, Santhosh Kumar and Wijmans, Erik and Kraehenbuehl, Philipp and Koltun, Vladlen},
booktitle={International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=WK6K1FMEQ1}
}
This project's code is released under the Apple Sample Code License (see LICENSE). This project's data is released under the CC-BY-NC-ND license (see LICENSE_DATA).
Our codebase is built using open-source contributions; please see Acknowledgements for more details.
Please check the paper for a complete list of references and datasets used in this work.