Affiliations:
1. University of Trento, Italy
2. Snap Inc., USA
3. LTCI, Télécom Paris, Institut Polytechnique de Paris, France
4. MPI for Informatics, SIC, Germany
5. University of Trento, Fondazione Bruno Kessler, Italy
Abstract
Neural video game simulators have emerged as powerful tools to generate and edit videos. Their idea is to represent games as the evolution of an environment’s state driven by the actions of its agents. While such a paradigm enables users to play a game action-by-action, its rigidity precludes more semantic forms of control. To overcome this limitation, we augment game models with prompts specified as a set of natural language actions and desired states. The result—a Promptable Game Model (PGM)—makes it possible for a user to play the game by prompting it with high- and low-level action sequences. Most captivatingly, our PGM unlocks the director’s mode, where the game is played by specifying goals for the agents in the form of a prompt. This requires learning “game AI,” encapsulated by our animation model, to navigate the scene using high-level constraints, play against an adversary, and devise a strategy to win a point. To render the resulting state, we use a compositional NeRF representation encapsulated in our synthesis model. To foster future research, we present newly collected, annotated and calibrated Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state-of-the-art. Our framework, data, and models are available at snap-research.github.io/promptable-game-models.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design
Cited by: 1 article.