IBRL is a sample-efficient reinforcement learning method for real-world robot learning. It first trains an imitation learning (IL) policy on a small set of demonstrations and then uses that policy to propose alternative actions for both online exploration and target value bootstrapping, greatly accelerating RL.
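A minimal sketch of this action-proposal idea, assuming generic actor/critic callables; `rl_actor`, `il_policy`, and `q_net` are placeholders for illustration, not the paper's interfaces:

```python
import torch

def ibrl_act(obs, q_net, rl_actor, il_policy):
    """Let the Q-function arbitrate between the RL actor's and the IL policy's proposals."""
    a_rl, a_il = rl_actor(obs), il_policy(obs)
    return a_il if q_net(obs, a_il) > q_net(obs, a_rl) else a_rl

def ibrl_bootstrap_target(reward, done, next_obs, q_target, rl_actor, il_policy, gamma=0.99):
    """Bootstrap from the better of the two proposals at the next state."""
    q_next = torch.max(q_target(next_obs, rl_actor(next_obs)),
                       q_target(next_obs, il_policy(next_obs)))
    return reward + gamma * (1.0 - done) * q_next

# Toy stand-ins just to exercise the functions.
obs = torch.zeros(4)
rl_actor = lambda o: torch.tensor([0.1])
il_policy = lambda o: torch.tensor([0.3])
q_net = lambda o, a: a.sum()          # toy Q-function that prefers larger actions
print(ibrl_act(obs, q_net, rl_actor, il_policy))     # -> tensor([0.3000])
```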
We equip robots with commonsense reasoning skills by enabling them to actively gather missing information from the environment. To reason in the real world, robots must go beyond passively querying LLMs and actively collect the information required to make the right decision. We propose an approach that combines an LLM with a vision language model (VLM) to help a robot actively perceive its environment and perform grounded commonsense reasoning.
InstructRL enables humans to specify what kind of strategies they expect from their AI partners through natural language instructions. We use pretrained large language models to generate a prior policy conditioned on the human instruction and use the prior to regularize the RL objective. This leads to the RL agent converging to equilibria that are aligned with human preferences.
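One way such regularization can be implemented is by adding a KL term that pulls the learned policy toward the instruction-conditioned LLM prior. The sketch below assumes discrete actions and precomputed prior log-probabilities; it illustrates the idea rather than the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def instruction_regularized_pg_loss(logits, actions, advantages, prior_logprobs, lam=0.1):
    """Policy-gradient loss plus KL(pi || prior) toward the LLM-derived prior policy.

    logits:         (B, A) action logits from the RL policy
    prior_logprobs: (B, A) log-probs of the prior policy generated from the instruction
    lam:            strength of the pull toward the instruction-following prior
    """
    logp = F.log_softmax(logits, dim=-1)
    chosen_logp = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(advantages * chosen_logp).mean()
    kl = (logp.exp() * (logp - prior_logprobs)).sum(dim=-1).mean()
    return pg_loss + lam * kl

# Toy usage with random tensors.
B, A = 8, 4
loss = instruction_regularized_pg_loss(
    torch.randn(B, A, requires_grad=True), torch.randint(A, (B,)), torch.randn(B),
    F.log_softmax(torch.randn(B, A), dim=-1))
loss.backward()
```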
We create Cicero, the first AI to achieve human-level performance in Diplomacy, a board game that requires complex strategic planning and natural language communication. We address this challenge using a combination of imitation learning, reinforcement learning, search, and large language models.
We produce meaningfully diverse and reasonable joint policies using adversarial reward shaping. We show that naively applying adversarial rewards leads to agents learning deliberate sabotage behaviors, and we address this using variants of off-belief learning.
We study self-explaining deviations (SEDs), a class of coordination problems in which players deviate from the common understanding of what reasonable behavior would be under normal circumstances, assuming that other agents will realize, using theory of mind, that the circumstance must be abnormal.
In multi-agent partially observable settings, the belief function becomes inaccurate when policies deviate, as they often do when performing search at test time. We address this changing-belief problem by fine-tuning the belief model on the fly at test time, achieving better performance and eliminating the need for exact belief tracking.
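A minimal sketch of such a test-time fine-tuning loop, assuming a generic `belief_model` that predicts private features from public trajectories and a `sample_batch()` helper that generates labeled data by rolling out the current (deviated) policies; both names are placeholders:

```python
import torch
import torch.nn.functional as F

def finetune_belief_at_test_time(belief_model, sample_batch, lr=1e-4, steps=200):
    """Adapt the belief network to the policies actually being played at test time."""
    optimizer = torch.optim.Adam(belief_model.parameters(), lr=lr)
    belief_model.train()
    for _ in range(steps):
        public_feats, private_targets = sample_batch()   # generated with the deviated joint policy
        logits = belief_model(public_feats)              # predict hidden info (e.g., own hand in Hanabi)
        loss = F.cross_entropy(logits, private_targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    belief_model.eval()

# Toy usage: a linear "belief model" over 16 public features and 5 hidden classes.
model = torch.nn.Linear(16, 5)
sample_batch = lambda: (torch.randn(32, 16), torch.randint(5, (32,)))
finetune_belief_at_test_time(model, sample_batch, steps=10)
```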
Off-belief learning (OBL) provably converges to a fixed sequence of joint policies in Dec-POMDPs regardless of random seeds, hyper-parameters, or even the underlying RL algorithm, which effectively solves the ZSC problem identified in Other-Play. OBL also achieves the best human-AI performance in Hanabi among methods that do not use human data, thanks to its grounded, hierarchical reasoning that resembles how humans reason.
We propose piKL, which regularizes search toward a human behavioral cloning (BC) policy to obtain better models of humans. piKL not only achieves higher scores but also predicts human moves better than the BC policy trained for that exact purpose.
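The core of this regularization can be summarized as sampling actions in proportion to the human prior weighted by exponentiated search values. The sketch below assumes discrete actions and illustrates only that rule, not the full algorithm:

```python
import numpy as np

def pikl_action_distribution(q_values, bc_probs, lam=1.0):
    """KL-regularized search policy: pi(a) proportional to pi_BC(a) * exp(Q(a) / lam).

    Small lam -> trust the search values; large lam -> stay close to the human BC policy.
    """
    logits = np.log(np.clip(bc_probs, 1e-12, None)) + np.asarray(q_values) / lam
    logits -= logits.max()                     # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage: search slightly prefers action 0, the human model strongly prefers action 1.
print(pikl_action_distribution(q_values=[1.0, 0.8], bc_probs=[0.1, 0.9], lam=0.5))
```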
We apply piKL (regularizing search towards a human model) to both search and RL in the coordination game of Hanabi and show that it achieves better performance with a diverse group of human players in large-scale human experiments.
Search algorithms such as Monte-Carlo Tree Search (MCTS) are often hand-designed by humans. Here, we use RL itself as a test-time policy improvement operator in place of traditional search. RL search can be applied to environments where tabular search is too expensive to run, and it outperforms tabular search in Atari games.
We modernize the idea of K-level reasoning in the context of deep learning, training a sequence of agents with increasingly capable yet predictable reasoning. We empirically show that agents produced this way achieve good human-AI coordination performance.
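A sketch of the training schedule this implies: start from an agent that assumes nothing about its partners (e.g., uniformly random play) and repeatedly train a best response to the previous level. `make_level0_policy` and `train_best_response` are hypothetical training utilities, not a specific API:

```python
def train_k_level_hierarchy(num_levels, make_level0_policy, train_best_response):
    """Level-k agents are trained as best responses to level-(k-1) agents."""
    agents = [make_level0_policy()]                      # level 0: e.g., uniformly random play
    for k in range(1, num_levels + 1):
        # Each new level reasons one step deeper about its (fixed) lower-level partner.
        agents.append(train_best_response(partner=agents[k - 1]))
    return agents

# Toy usage showing the structure of the hierarchy.
print(train_k_level_hierarchy(3, lambda: "random", lambda partner: f"BR({partner})"))
# ['random', 'BR(random)', 'BR(BR(random))', 'BR(BR(BR(random)))']
```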
We discover that in Dec-POMDPs, policies trained with the exact same algorithm but with small differences such as random seeds can be completely incompatible with one another. We formalize this problem and propose zero-shot coordination (ZSC) as a sanity check for MARL algorithms. We then create a method that addresses it by preventing arbitrary symmetry breaking.
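A sketch of one way to prevent arbitrary symmetry breaking during self-play: relabel the partner's observations and actions with a randomly drawn environment symmetry each episode, so conventions that depend on arbitrary labels (e.g., which of two interchangeable colors means "play") cannot form. The `env` interface and the `symmetries` list are assumptions for illustration:

```python
import random

def symmetrized_selfplay_episode(env, policy, symmetries):
    """Self-play episode where player 1 acts in a randomly permuted frame of the game.

    Each symmetry is a pair (relabel_obs, relabel_action); relabel_action maps the
    action chosen in the permuted frame back into the real game's labels.
    """
    relabel_obs, relabel_action = random.choice(symmetries)
    obs_pair = env.reset()                     # (obs_player0, obs_player1)
    done, total_reward = False, 0.0
    while not done:
        a0 = policy(obs_pair[0])                               # player 0 sees the raw game
        a1 = relabel_action(policy(relabel_obs(obs_pair[1])))  # player 1 sees a relabeled game
        obs_pair, reward, done, _ = env.step((a0, a1))
        total_reward += reward
    return total_reward
```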
We propose Trajectory Diversity (TrajeDi), a differentiable objective for generating diverse reinforcement learning policies. We apply this method in multi-agent environments to produce diverse agents and then train a common best-response policy that generalizes better to unseen partners.
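The trajectory-level objective in the paper is more involved, but its spirit can be shown with a per-state simplification: a differentiable divergence between each policy's action distribution and the population average, added as a bonus to the RL loss. The tensor shapes below are assumptions:

```python
import torch

def action_diversity_bonus(policy_probs):
    """Mean KL from each policy's action distribution to the population average.

    policy_probs: (n_policies, batch, n_actions) action probabilities on shared states.
    Maximizing this term pushes the policies apart while remaining differentiable.
    """
    mean_probs = policy_probs.mean(dim=0, keepdim=True)
    log_ratio = policy_probs.clamp_min(1e-12).log() - mean_probs.clamp_min(1e-12).log()
    return (policy_probs * log_ratio).sum(dim=-1).mean()

# Toy usage: three policies evaluated on the same batch of 8 states with 4 actions.
probs = torch.softmax(torch.randn(3, 8, 4, requires_grad=True), dim=-1)
bonus = action_diversity_bonus(probs)
bonus.backward()
```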
We significantly speed up the SPARTA search algorithm using a learned belief model and bootstrapping from Q-functions. Thanks to the neural-network belief model, the new search algorithm generalizes to unseen partners, making it applicable in human-AI coordination settings.
Rather than following the gradient, which corresponds to a locally greedy direction, we follow the eigenvectors of the Hessian ("ridges"). By iteratively following and branching amongst these ridges, we effectively span the loss surface to find qualitatively different solutions.
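A toy illustration of stepping along a Hessian eigenvector instead of the gradient; it materializes the full Hessian, which is only feasible for tiny parameter vectors, whereas the paper relies on more scalable eigenvector tracking:

```python
import torch

def ridge_step(loss_fn, params, ridge_index=0, step_size=1e-2):
    """Take one step along the chosen eigenvector ("ridge") of the Hessian."""
    hessian = torch.autograd.functional.hessian(loss_fn, params)
    eigvals, eigvecs = torch.linalg.eigh(hessian)   # eigenvalues in ascending order
    ridge = eigvecs[:, ridge_index]                 # different indices -> different branches
    return (params - step_size * ridge).detach()

# Toy 2-D loss surface with several qualitatively different minima.
loss = lambda p: (p[0] ** 2 - 1.0) ** 2 + (p[1] ** 2 - 1.0) ** 2
theta = torch.tensor([0.1, -0.1])
for _ in range(5):
    theta = ridge_step(loss, theta, ridge_index=0)
print(theta)
```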
In this paper, we propose two different search techniques that can be applied to improve an arbitrary agreed-upon policy in a cooperative, partially observable game. We prove that these search procedures are theoretically guaranteed to at least maintain the original performance of the agreed-upon policy, and they achieve a new SOTA on the Hanabi benchmark.
We present the Simplified Action Decoder (SAD), which resolves the conflict between exploratory and informative actions in multi-agent RL. During training, SAD lets agents observe not only the (exploratory) action chosen but also the greedy action of their teammates. SAD establishes a new SOTA for 2-5 players on the self-play part of the Hanabi challenge.
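A sketch of one acting step under this convention; the `env.step` interface and the `sample_random_action` helper are assumptions used for illustration:

```python
import random
import torch

def sad_act(env, obs, q_net, epsilon=0.1):
    """Execute an epsilon-greedy action but expose the greedy action to teammates.

    The environment transitions on the (possibly exploratory) executed action, while
    teammates additionally condition on the greedy action during training, so that
    exploration noise does not corrupt the information conveyed by actions.
    """
    q_values = q_net(obs)
    greedy_action = int(torch.argmax(q_values))
    executed_action = (env.sample_random_action()
                       if random.random() < epsilon else greedy_action)
    next_obs, reward, done, info = env.step(executed_action)
    info["greedy_action_for_teammates"] = greedy_action
    return next_obs, reward, done, info
```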
We explore using latent natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a latent plan in natural language, which is then executed by a separate model. We create a new miniRTS environment and collect human language and trajectory data for this task.