ICML 2024 Workshop

Multi-modal Foundation Model meets Embodied AI

The Forty-first International Conference on Machine Learning
@ Messe Wien Exhibition Congress Center, Vienna
Fri 26 Jul

Overview

Welcome to the ICML 2024 Workshop on Multi-modal Foundation Model meets Embodied AI (MFM-EAI).

In recent years, Multi-modal Foundation Models (MFMs) such as CLIP (20), ImageBind (13), DALL·E 3 (3), GPT-4V (18), Gemini (22), and Sora (5) have emerged as one of the most captivating and rapidly advancing areas in AI. The open-source community around MFMs has also grown vigorously, producing models and algorithms such as LLaVA (17), LAMM (26), MiniGPT-4 (28), Stable Diffusion (21), and OpenSora (14). These MFMs are now being applied to scenarios far beyond traditional computer vision and natural language processing tasks. Bolstered by MFMs, Embodied Artificial Intelligence (EAI) agents have made considerable progress on complex tasks in both simulators and real-world environments. At the intersection of these fields, however, a multitude of open questions and unexplored territories remain, including long-horizon decision-making, motion planning with specific trajectories, and adaptation to new environments. This workshop is dedicated to exploring several critical challenges, including but not limited to:
Generalist capability of MFMs: Recent studies have revealed that the prominent perceptual and reasoning abilities of multi-modal large language models (MLLMs) can enhance embodied agents' decision-making (19). Furthermore, the massive knowledge acquired during pre-training and the ability to decompose tasks can make it easier for embodied agents to generalize to new environments and tasks (6). We encourage studies that explore the crucial capabilities of MFMs that contribute to significant performance gains for embodied agents. Tentative research questions: what potential capabilities do MFMs possess that can empower embodied agents, and how can these capabilities be further improved and evaluated?
Multi-modal foundation models for embodied agents: The generalist abilities and extensive knowledge of MFMs can potentially address complex, long-horizon, and open-ended EAI tasks that elude traditional methods. We welcome research that leverages MFMs to guide high-level perception and planning (8), assist with low-level control and execution (27), or directly control physical entities end to end (4). We also encourage the exploration of innovative applications of MFMs and novel framework designs for embodied agents; a schematic sketch of one common design appears at the end of this overview. Tentative research questions: what constitutes an effective system architecture for MFM-based embodied agents, and how can MFMs augment agents' perceptual and decision-making capabilities?
Generative models as world simulators: Simulators for EAI scenarios need to reproduce the real world with high fidelity, both visually (respecting cues such as parallax, visual contrast, and linear perspective) and in terms of interaction feedback. Recent works such as Sora have shown preliminarily that effective simulation is feasible in specific scenarios, making it increasingly plausible to simulate the real world from controllably generated content. Tentative research questions: can generative models serve as physics-aware world models, and what benchmarks or evaluation criteria should be developed to assess the fidelity of these simulations?
Data collection for imitation learning: The accumulation of diverse and large-scale demonstrations is essential for improving the adaptability and proficiency of robot imitation learning (IL). When MFMs are introduced to construct embodied agents, the existing data in the robotics field becomes even more insufficient. We welcome studies that gather rich and varied datasets for IL (10), or that collect in-the-wild data cost-effectively and design policies to learn from it (11). Tentative research questions: how can demonstrations for IL be collected efficiently beyond teleoperation, and how can the collected demonstrations be used effectively, in particular to bridge gaps between different domains?
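As a concrete illustration of the hierarchical framework design mentioned above, the sketch below is a minimal, hypothetical example (not the method of any cited work): an MFM acts as a high-level planner that decomposes an instruction into sub-goals from a visual observation, while a separate low-level policy executes each sub-goal. All identifiers (query_mllm, LowLevelPolicy, Env) are placeholders rather than APIs of any existing system.

```python
"""Minimal sketch of a hierarchical MFM-based embodied agent.

Hypothetical placeholders only: a real system would replace `query_mllm`
with a call to a multi-modal LLM, `LowLevelPolicy` with a learned
visuomotor controller, and `Env` with a simulator or robot interface.
"""

from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    rgb: bytes            # camera frame (placeholder)
    proprio: List[float]  # proprioceptive state, e.g. joint positions


def query_mllm(image: bytes, instruction: str) -> List[str]:
    """Hypothetical high-level planner: decompose the instruction into sub-goals.

    A real implementation would send the image and instruction to an MLLM
    and parse its textual plan; a fixed plan is returned here so the sketch
    runs end to end.
    """
    return ["locate the mug", "grasp the mug", "place the mug on the shelf"]


class LowLevelPolicy:
    """Hypothetical controller mapping (observation, sub-goal) to an action."""

    def act(self, obs: Observation, subgoal: str) -> List[float]:
        # In practice: a learned policy, e.g. trained by imitation learning.
        return [0.0] * 7  # dummy 7-DoF action


class Env:
    """Stand-in for a simulator or real-robot interface."""

    def reset(self) -> Observation:
        return Observation(rgb=b"", proprio=[0.0] * 7)

    def step(self, action: List[float]) -> Observation:
        return Observation(rgb=b"", proprio=action)

    def subgoal_done(self, subgoal: str) -> bool:
        return True  # placeholder success check


def run_episode(instruction: str) -> None:
    env, policy = Env(), LowLevelPolicy()
    obs = env.reset()
    plan = query_mllm(obs.rgb, instruction)  # high-level planning with the MFM
    for subgoal in plan:                     # low-level execution per sub-goal
        for _ in range(10):                  # bounded control loop
            obs = env.step(policy.act(obs, subgoal))
            if env.subgoal_done(subgoal):
                break
        print(f"completed: {subgoal}")


if __name__ == "__main__":
    run_episode("put the mug on the shelf")
```

Submissions on framework design are of course free to depart from this pattern, for instance by letting the MFM emit low-level actions directly in an end-to-end fashion (4).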

Call for Papers

This workshop is designed to unite researchers and practitioners from MFM and EAI, providing a platform to discuss the most recent advancements, methodologies, and practical applications. By encouraging collaboration between these two domains, we aim to create new opportunities to tackle complex challenges at the intersection of both fields. We invite submissions on the following topics:
  • Training and evaluation of MFM in open-ended scenarios (17; 26)
  • Data collection for training embodied agents (7; 10; 12; 6)
  • Framework designs for MFM-powered embodied agents (23; 19)
  • Perception and high-level planning in embodied agents empowered by MFM (9; 1; 15)
  • Decision-making and low-level control in embodied agents empowered by MFM (16; 2)
  • Evaluation of the capability of embodied agents (24)
  • Generative models as world simulators (25)
  • Limitations of MFM in empowering EAI
Submission Guide

    Reviewer Guidelines

    Thank you for serving as a reviewer for the MFM-EAI workshop at ICML 2024. Your expertise and dedication contribute greatly to the success of this event. The following guidelines explain how to review papers for this workshop.
    • Confidentiality. All review assignments and the content of the papers you review should be kept confidential. Do not share these materials or discuss them with others unless they are also reviewers for the same paper.
    • Conflict of Interest. If you recognize a conflict of interest with any paper you are assigned to review, please notify the program chairs immediately.
    • Length Requirement. Any submission exceeding 4 pages, excluding references, should be rejected directly; please verify the paper's length before proceeding with the review.
    • Review Criteria. We aim for an inclusive and diverse selection of papers, with quality as the primary focus. Consider the following aspects:
    (1) Relevance: Does the paper align with the theme of the workshop, i.e., multi-modal foundation models and embodied AI?
    (2) Originality: Does the paper present new ideas or results, or does it significantly build upon previous work?
    (3) Technical Soundness: Is the methodology correct and properly explained? Are the claims supported by theoretical analysis or experimental results?
    (4) Clarity: Is the paper well-written and well-structured? Is it easy for readers to understand the problem, the approach, and the results?
    (5) Impact: If the results are applied, do they have the potential to contribute to the advancement of multi-modal foundation models and embodied AI?

    Key Dates

    Paper Submission Open: May 5, 2024
    Paper Submission Deadline: May 30, 2024
    Acceptance Notification: June 16, 2024
    Camera-Ready Deadline: June 30, 2024
    Workshop Date: July 26, 2024
    All deadlines are specified in AoE (Anywhere on Earth).

    Submission Q&A

    What is the deadline for submission?

    The deadline for submission is May 30, 2024, anywhere on Earth (AoE), as listed under Key Dates.

    What is the page limit for submissions (4 pages or 8 pages, with or without references)?

    The page limit for submissions is 4 pages, excluding references.

    What template do we use for the submission?

    The template should be the same as the one used for ICML 2024 submissions.

    Are papers previously presented at other conferences accepted?

    Yes, papers previously presented at other conferences are accepted. If submitting a paper previously accepted at another conference, a copy of the acceptance notification email must be provided.

    Is it fine if we submit our work to a different conference at the same time as the workshop?

    Yes, it's acceptable to submit your work to a different conference at the same time as our workshop.

    Will the papers be archived/will there be proceedings?

    Accepted papers will not be included in the proceedings but will have a poster presentation on the day of the workshop.

    Is anonymized submission required for papers?

    Yes, all submitted papers must be anonymized. Papers previously accepted at other conferences do not need to be anonymized.

    Is supplementary material allowed? What formats are supported?

    Yes, supplementary material is optional, and supported formats include pdf, mp4, and zip.

    How will papers be reviewed?

    All papers will be peer-reviewed by three experts in the field in a double-blind manner.

    References

    [1] Ajay, A., Han, S., Du, Y., Li, S., Gupta, A., Jaakkola, T., Tenenbaum, J., Kaelbling, L., Srivastava, A., Agrawal, P.: Compositional foundation models for hierarchical planning (2023)

    [2] Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., Clune, J.: Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems 35, 24639–24654 (2022)

    [3] Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2(3), 8 (2023)

    [4] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023)

    [5] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators

    [6] Chen, Z., Shi, Z., Lu, X., He, L., Qian, S., Fang, H.S., Yin, Z., Ouyang, W., Shao, J., Qiao, Y., et al.: Rh20t-p: A primitive-level robotic dataset towards composable generalization agents. arXiv preprint arXiv:2403.19622 (2024)

    [7] Open X-Embodiment Collaboration, Padalkar, A., Pooley, A., Jain, A., Bewley, A., Herzog, A., Irpan, A., Khazatsky, A., Rai, A., Singh, A., et al.: Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864 (2023)

    [8] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

    [9] Du, Y., Yang, M., Dai, B., Dai, H., Nachum, O., Tenenbaum, J.B., Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. arXiv preprint arXiv:2302.00111 (2023)

    [10] Fang, H.S., Fang, H., Tang, Z., Liu, J., Wang, J., Zhu, H., Lu, C.: Rh20t: A robotic dataset for learning diverse skills in one-shot. arXiv preprint arXiv:2307.00595 (2023)

    [11] Fang, H., Fang, H.S., Wang, Y., Ren, J., Chen, J., Zhang, R., Wang, W., Lu, C.: Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild. arXiv preprint arXiv:2309.14975 (2023)

    [12] Fu, Z., Zhao, T.Z., Finn, C.: Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint (2024)

    [13] Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15180–15190 (2023)

    [14] Lab, P.Y., et al.: Open-Sora-Plan (Apr 2024). https://doi.org/10.5281/zenodo.10948109

    [15] Lai, B., Dai, X., Chen, L., Pang, G., Rehg, J.M., Liu, M.: Lego: Learning egocentric action frame generation via visual instruction tuning. arXiv preprint arXiv:2312.03849 (2023)

    [16] Lifshitz, S., Paster, K., Chan, H., Ba, J., McIlraith, S.: Steve-1: A generative model for text-to-behavior in minecraft. arXiv preprint arXiv:2306.00937 (2023)

    [17] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2023)

    [18] OpenAI: Gpt-4v(ision) system card (2023), https://openai.com/research/gpt-4v-system-card

    [19] Qin, Y., Zhou, E., Liu, Q., Yin, Z., Sheng, L., Zhang, R., Qiao, Y., Shao, J.: Mp5: A multi-modal open-ended embodied system in minecraft via active perception. arXiv preprint arXiv:2312.07472 (2023)

    [20] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

    [21] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

    [22] Gemini Team, Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., et al.: Gemini: A family of highly capable multimodal models (2023)

    [23] Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., Anandkumar, A.: Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 (2023)

    [24] Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Kaelbling, L., Schuurmans, D., Abbeel, P.: Learning interactive real-world simulators (2024)

    [25] Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Kaelbling, L., Schuurmans, D., Abbeel, P.: Learning interactive real-world simulators (2024)

    [26] Yin, Z., Wang, J., Cao, J., Shi, Z., Liu, D., Li, M., Sheng, L., Bai, L., Huang, X., Wang, Z., et al.: Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687 (2023)

    [27] Zhou, E., Qin, Y., Yin, Z., Huang, Y., Zhang, R., Sheng, L., Qiao, Y., Shao, J.: Minedreamer: Learning to follow instructions via chain-of-imagination for simulated-world control. arXiv preprint arXiv:2403.12037 (2024)

    [28] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)