Multimodal Representations for Vision, Language, and Embodied AI

OCLC : 1258029490

Book Synopsis: Multimodal Representations for Vision, Language, and Embodied AI by Kevin Chen

Multimodal Representations for Vision, Language, and Embodied AI was written by Kevin Chen and released in 2021. Book excerpt: Recent years have seen incredible growth and advances in artificial intelligence research. Much of this progress has been made on three fronts: computer vision, natural language processing, and robotics. For example, image recognition is widely considered the holy grail of computer vision, whereas language modeling and translation have been fundamental tasks in natural language processing. However, many practical applications require going beyond these domain-specific problems and solving problems that involve all three domains together. An autonomous system not only needs to recognize objects in an image, but also needs to interpret natural language descriptions or commands and understand how they relate to its visual observations. Furthermore, a robot needs to use this information for decision-making and for determining which physical actions to take in order to complete a task.
In the first part of this dissertation, I present a method for learning how to relate natural language and 3D shapes, such that the system can connect words like "round" in a text description with the corresponding geometric attributes of a 3D object. To relate the two modalities, we rely on a cross-modal embedding space for multimodal reasoning and learn this space without fine-grained, attribute-level categorical annotations. By learning how to relate these two modalities, we can perform tasks such as text-to-shape retrieval and shape manipulation, and also enable new tasks such as text-to-shape generation.
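To make the idea of text-to-shape retrieval in a shared embedding space concrete, here is a minimal sketch. It is not the dissertation's method: the excerpt does not say how the embeddings are learned or compared, so cosine similarity, the function names, and the toy 3-dimensional vectors below are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve_shapes(text_embedding, shape_embeddings, top_k=1):
    """Return the ids of the top_k shapes closest to the text embedding.

    shape_embeddings maps a shape id to its vector in the shared space.
    """
    scored = [
        (cosine_similarity(text_embedding, emb), shape_id)
        for shape_id, emb in shape_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [shape_id for _, shape_id in scored[:top_k]]


# Hypothetical embeddings: in a real system these would come from trained
# text and shape encoders, not hand-written numbers.
SHAPE_EMBEDDINGS = {
    "round_table": [0.9, 0.1, 0.0],
    "square_table": [0.1, 0.9, 0.0],
    "office_chair": [0.0, 0.2, 0.9],
}
QUERY_ROUND = [0.8, 0.2, 0.1]  # stand-in for an encoded "a round table"

if __name__ == "__main__":
    print(retrieve_shapes(QUERY_ROUND, SHAPE_EMBEDDINGS, top_k=2))
```

Once both modalities live in one space, retrieval reduces to a nearest-neighbor search, which is why no attribute-level labels are needed at query time.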
In the second part of this dissertation, we allow the agent to be embodied and explore a task which relies on all three domains (computer vision, natural language, and robotics): robot navigation by following natural language instructions. Rather than relying on a fixed dataset of images or 3D objects, the agent is now situated in a physical environment and captures its own visual observations of the space using an onboard camera. To draw connections between vision, language, and robot physical state, we propose a system that performs planning and control using a topological map. This fundamental abstraction allows the agent to relate parts of the language instruction with relevant spatial regions of the environment and to relate a stream of visual observations with physical movements and actions.
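As a rough illustration of planning over a topological map, the sketch below treats the map as a graph whose nodes are places and whose edges are traversable connections, then finds a route with breadth-first search. The excerpt does not describe the planner actually used, so the search algorithm, the `plan_route` name, and the room names are assumptions made for this example only.

```python
from collections import deque


def plan_route(topo_map, start, goal):
    """Return the route with the fewest hops from start to goal, or None.

    topo_map maps each node name to a list of directly reachable nodes.
    """
    if start not in topo_map or goal not in topo_map:
        raise KeyError("start and goal must both be nodes in the map")
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            route = []
            while node is not None:
                route.append(node)
                node = parents[node]
            return route[::-1]
        for neighbor in topo_map[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # goal is not reachable from start


# Hypothetical map of a small apartment; "garage" is deliberately isolated.
TOPO_MAP = {
    "hallway": ["kitchen", "bedroom"],
    "kitchen": ["hallway", "dining_room"],
    "bedroom": ["hallway"],
    "dining_room": ["kitchen"],
    "garage": [],
}

if __name__ == "__main__":
    print(plan_route(TOPO_MAP, "bedroom", "dining_room"))
```

A graph of this kind gives an instruction such as "go through the hallway to the kitchen" something discrete to be matched against, with low-level control left to move the robot between adjacent nodes.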


Multimodal Representations for Vision, Language, and Embodied AI Related Books

Multimodal Intelligent Information Presentation
Language: en
Pages: 346
Authors: Oliviero Stock
Categories: Language Arts & Disciplines
Type: BOOK - Published: 2006-03-30 - Publisher: Springer Science & Business Media

Intelligent Multimodal Information Presentation relates to the ability of a computer system to automatically produce interactive information presentations, taki
Multimodal Vision-language Representation Learning
Language: en
Pages: 0
Authors: 葛玉莹
Categories: Computer vision
Type: BOOK - Published: 2023 - Publisher:

Advances in Natural Multimodal Dialogue Systems
Language: en
Pages: 392
Authors: Jan van Kuppevelt
Categories: Computers
Type: BOOK - Published: 2005-12-06 - Publisher: Springer Science & Business Media

The main topic of this volume is natural multimodal interaction. The book is unique in that it brings together a great many contributions regarding aspects of n
MultiMedia Modeling
Language: en
Pages: 523
Authors: Stevan Rudinac
Categories:
Type: BOOK - Published: - Publisher: Springer Nature