• 28/8/2023 - Workshop schedule is announced.
  • 8/8/2023 - Workshop papers notification is announced.
  • 19/7/2023 - Workshop papers submission deadline is extended.
  • 27/4/2023 - Important dates are updated.
  • 6/4/2023 - CFP is released.
  • 6/4/2023 - Workshop homepage is now available.

Call for Papers

This workshop intends to 1) provide a platform for researchers to present their latest work and receive feedback from experts in the field, 2) foster discussions on current challenges and opportunities in multimodal analysis and applications, 3) identify emerging trends and opportunities in the field, and 4) explore their potential impact on future research and development. Potential topics include, but are not limited to:
  • Multimodal content creation
  • Multimodal data analysis and understanding
  • Multimodal question answering
  • Multimodal information retrieval
  • Multimodal recommendation
  • Multimodal summarization and text generation
  • Multimodal conversational agents
  • Multimodal machine translation
  • Multimodal fusion and integration of information
  • Multimodal applications/pipelines
  • Multimodal data management and indexing
Important dates:
  • Workshop Papers Submission: 26 July 2023 (extended from 21 July 2023)
  • Workshop Papers Notification: 8 August 2023
  • Camera-ready Submission: 27 August 2023 (extended from 12 August 2023)
  • Conference dates: 28 October 2023 – 3 November 2023
Please note: The submission deadline is 11:59 p.m. Anywhere on Earth (AoE) on the stated deadline date.


Submission Guidelines:
Submitted papers (.pdf format) must use the ACM Article Template. Please remember to add Concepts and Keywords. Submissions may vary in length from 4 to 8 pages, plus additional pages for references; i.e., the reference page(s) do not count toward the 4-to-8-page limit. There is no distinction between long and short papers, and authors may decide on the appropriate length for their paper. All papers will undergo the same review process and review period. Submissions must conform to the "double-blind" review policy. All papers will be peer-reviewed by experts in the field. Acceptance will be based on relevance to the workshop, scientific novelty, and technical quality. The workshop papers will be published in the ACM Digital Library.
Submission Site:
Organizers

    • Zheng Wang (Huawei Singapore Research Center, Singapore)
    • Cheng Long (Nanyang Technological University, Singapore)
    • Shihao Xu (Huawei Singapore Research Center, Singapore)
    • Bingzheng Gan (Huawei Singapore Research Center, Singapore)
    • Wei Shi (Huawei Singapore Research Center, Singapore)
    • Zhao Cao (Huawei Technologies Co., Ltd, China)
    • Tat-Seng Chua (National University of Singapore, Singapore)


    Keynote 1

    Prof. Mike Zheng Shou is a tenure-track Assistant Professor at the National University of Singapore and a former Research Scientist at Facebook AI in the Bay Area. He holds a PhD degree from Columbia University in the City of New York, where he worked with Prof. Shih-Fu Chang. He was awarded the Wei Family Private Foundation Fellowship. His work was a best paper finalist at CVPR'22 and received a best student paper nomination at CVPR'17. His team won 1st place in multiple international challenges, including ActivityNet 2017, EPIC-Kitchens 2022, and Ego4D 2022 & 2023. He is a Fellow of the National Research Foundation (NRF) Singapore and has been named on the Forbes 30 Under 30 Asia list.

    Talk Title: Large Generative Models Meet Multimodal Video Intelligence
    Abstract: In this talk, I'd like to share my recent research on multimodal video intelligence in the era of large generative models. I will first talk about video-language pretraining techniques (All-in-one, EgoVLP) that use one single model to power various understanding tasks ranging from retrieval to QA. Then I will introduce the challenges of adapting these large pretrained models to a real-world application, the AI Assistant, and our efforts in this direction (AssistQ, AssistGPT). Finally, I will delve into the reverse problem, i.e., given an open-world textual description, how to generate videos (Tune-A-Video, Show-1).

    Keynote 2

    Prof. Boyang Li is a Nanyang Associate Professor at the School of Computer Science and Technology, Nanyang Technological University. His research interests lie in computational narrative intelligence, multimodal learning, and machine learning. In 2021, he received the National Research Foundation Fellowship, a prestigious research award of 2.5 million Singapore dollars. Prior to that, he worked as a senior research scientist at Baidu Research USA and a research scientist at Disney Research Pittsburgh, where he led an independent research group. He received his Ph.D. degree from the Georgia Institute of Technology in 2015 and his Bachelor's degree from Nanyang Technological University in 2008. He currently serves as a senior action editor for ACL Rolling Review and an associate editor for IEEE Transactions on Audio, Speech and Language Processing. His work has been covered by international media outlets such as The Guardian, New Scientist, US National Public Radio, Engadget, and TechCrunch.

    Talk Title: Unlocking Multimedia Capabilities of Gigantic Pretrained Language Models
    Abstract: A large language model (LLM) can be analogized to an enormous treasure box guarded by a lock. It contains extensive knowledge, but applying the appropriate knowledge to solve the problem at hand requires special techniques. In this talk, I will discuss techniques to unlock the capability of LLMs to process both visual and linguistic information. VisualGPT is one of the earliest works that finetunes an LLM for a vision-language task. InstructBLIP is an instruction-tuned large vision-language model, which set a new state of the art on several vision-language tasks. In addition, I will talk about how to unlock zero-shot capabilities without end-to-end finetuning. Plug-and-Play VQA and Img2LLM achieve excellent results on visual question answering by simply connecting existing pretrained networks using natural language and model interpretations. Finally, I will describe a new multimodal dataset, Synopses of Movie Narratives, or SyMoN, for story understanding. I will argue that story understanding is an important objective in the pursuit of artificial general intelligence (AGI). Compared to other multimodal story datasets, the special advantages of SyMoN include (1) event descriptions at the right level of granularity, (2) abundant mental state descriptions, (3) the use of diverse storytelling techniques, and (4) the provision of easy-to-use automatic performance evaluation.

    Keynote 3

    Prof. Ziwei Liu is currently a Nanyang Assistant Professor at Nanyang Technological University, Singapore. His research revolves around computer vision, machine learning, and computer graphics. He has published extensively in top-tier conferences and journals in relevant fields, including CVPR, ICCV, ECCV, NeurIPS, ICLR, SIGGRAPH, TPAMI, TOG, and Nature Machine Intelligence, with around 30,000 citations. He is the recipient of the Microsoft Young Fellowship, the Hong Kong PhD Fellowship, the ICCV Young Researcher Award, the HKSTP Best Paper Award, a CVPR Best Paper Award candidacy, the WAIC Yunfan Award, and the ICBS Frontiers of Science Award. He has won championships in major computer vision competitions, including the DAVIS Video Segmentation Challenge 2017, the MSCOCO Instance Segmentation Challenge 2018, the FAIR Self-Supervision Challenge 2019, the Video Virtual Try-on Challenge 2020, and the Computer Vision in the Wild Challenge 2022. He is also the lead contributor to several renowned computer vision benchmarks and software packages, including CelebA, DeepFashion, MMHuman3D, and MMFashion. He serves as an Area Chair of CVPR, ICCV, NeurIPS, and ICLR, as well as an Associate Editor of IJCV.

    Talk Title: Multi-Modal Generative AI with Foundation Models
    Abstract: Generating photorealistic and controllable visual content has been a long-pursued goal of artificial intelligence (AI), with extensive real-world applications. It is also at the core of embodied intelligence. In this talk, I will discuss our work on AI-driven visual content generation of humans, objects, and scenes, with an emphasis on combining the power of neural rendering with large multimodal foundation models. Our generative AI framework has shown its effectiveness and generalizability on a wide range of tasks.