Speakers
Keynote 1
Prof. Mike Zheng Shou is a tenure-track Assistant Professor at the National University of Singapore and a former Research Scientist at Facebook AI in the Bay Area. He holds a PhD degree from Columbia University in the City of New York, where he worked with Prof. Shih-Fu Chang. He was awarded the Wei Family Private Foundation Fellowship. He was a best paper finalist at CVPR'22 and received a best student paper nomination at CVPR'17. His team won 1st place in multiple international challenges, including ActivityNet 2017, EPIC-Kitchens 2022, and Ego4D 2022 & 2023. He is a Fellow of the National Research Foundation (NRF) Singapore and has been named on the Forbes 30 Under 30 Asia list.
Talk Title: Large Generative Models Meet Multimodal Video Intelligence
Abstract: In this talk, I'd like to share my recent research on multimodal video intelligence in the era of large generative models. I will first talk about video-language pretraining techniques (All-in-one, EgoVLP) that use one single model to power various understanding tasks ranging from retrieval to QA. Then I will introduce the challenges of adapting these large pretrained models to a real-world application, the AI assistant, and our efforts in this direction (AssistQ, AssistGPT). Finally, I will delve into the reverse problem, i.e., how to generate videos given open-world textual descriptions (Tune-A-Video, Show-1).
Keynote 2
Prof. Boyang Li is a Nanyang Associate Professor at the School of Computer Science and Technology, Nanyang Technological University. His research interests lie in computational narrative intelligence, multimodal learning, and machine learning. In 2021, he received the National Research Foundation Fellowship, a prestigious research award of 2.5 million Singapore dollars. Prior to that, he worked as a Senior Research Scientist at Baidu Research USA and a Research Scientist at Disney Research Pittsburgh, where he led an independent research group. He received his Ph.D. degree from the Georgia Institute of Technology in 2015 and his Bachelor's degree from Nanyang Technological University in 2008. He currently serves as a Senior Action Editor for ACL Rolling Review and an Associate Editor for IEEE Transactions on Audio, Speech and Language Processing. His work has been covered by international media outlets such as The Guardian, New Scientist, US National Public Radio, Engadget, and TechCrunch.
Talk Title: Unlocking Multimedia Capabilities of Gigantic Pretrained Language Models
Abstract: A large language model (LLM) can be analogized to an enormous treasure box guarded by a lock. It contains extensive knowledge, but applying the appropriate knowledge to the problem at hand requires special techniques. In this talk, I will discuss techniques for unlocking the capability of LLMs to process both visual and linguistic information. VisualGPT is one of the earliest works that finetunes an LLM for a vision-language task. InstructBLIP is an instruction-tuned large vision-language model, which set new state-of-the-art results on several vision-language tasks.
In addition, I will talk about how to unlock zero-shot capabilities without end-to-end finetuning. Plug-and-Play VQA and Img2LLM achieve excellent results on visual question-answering by simply connecting existing pretrained networks using natural language and model interpretations. Finally, I will describe a new multimodal dataset, Synopses of Movie Narratives, or SyMoN, for story understanding. I will argue that story understanding is an important objective in the pursuit of artificial general intelligence (AGI). Compared to other multimodal story datasets, the special advantages of SyMoN include (1) event descriptions at the right level of granularity, (2) abundant mental state descriptions, (3) the use of diverse storytelling techniques, and (4) the provision of easy-to-use automatic performance evaluation.
Keynote 3
Prof. Ziwei Liu is currently a Nanyang Assistant Professor at Nanyang Technological University, Singapore. His research revolves around computer vision, machine learning, and computer graphics. He has published extensively in top-tier conferences and journals in relevant fields, including CVPR, ICCV, ECCV, NeurIPS, ICLR, SIGGRAPH, TPAMI, TOG, and Nature Machine Intelligence, with around 30,000 citations. He is the recipient of the Microsoft Young Fellowship, the Hong Kong PhD Fellowship, the ICCV Young Researcher Award, the HKSTP Best Paper Award, a CVPR Best Paper Award Candidate, the WAIC Yunfan Award, and the ICBS Frontiers of Science Award. He has won championships in major computer vision competitions, including the DAVIS Video Segmentation Challenge 2017, the MSCOCO Instance Segmentation Challenge 2018, the FAIR Self-Supervision Challenge 2019, the Video Virtual Try-on Challenge 2020, and the Computer Vision in the Wild Challenge 2022. He is also the lead contributor of several renowned computer vision benchmarks and software packages, including CelebA, DeepFashion, MMHuman3D, and MMFashion. He serves as an Area Chair of CVPR, ICCV, NeurIPS, and ICLR, as well as an Associate Editor of IJCV.
Talk Title: Multi-Modal Generative AI with Foundation Models
Abstract: Generating photorealistic and controllable visual content has been a long-pursued goal of artificial intelligence (AI), with extensive real-world applications. It is also at the core of embodied intelligence. In this talk, I will discuss our work on AI-driven visual content generation of humans, objects, and scenes, with an emphasis on combining the power of neural rendering with large multimodal foundation models. Our generative AI framework has shown its effectiveness and generalizability on a wide range of tasks.
Workshop Schedule
On-site venue: Auditorium A
Date & time: 2nd November 2023, Local Ottawa Time
Zoom link: https://ntu-sg.zoom.us/j/89276840596?pwd=eFRmSWo0TDZSQ2ExTVl0NlVFSm5Kdz09
(Meeting ID: 892 7684 0596, Passcode: 080304)
Time | Title |
09:00 AM - 09:10 AM | Welcome Message from the Chairs |
09:10 AM - 09:40 AM | Keynote 1: Large Generative Models Meet Multimodal Video Intelligence |
09:40 AM - 10:10 AM | Keynote 2: Unlocking Multimedia Capabilities of Gigantic Pretrained Language Models |
10:10 AM - 10:40 AM | Keynote 3: Multi-Modal Generative AI with Foundation Models |
10:40 AM - 11:10 AM | Coffee Break |
11:10 AM - 11:30 AM | Presentation 1: SAT: Self-Attention Control for Diffusion Models Training |
11:30 AM - 11:50 AM | Presentation 2: NeurSEG: A Segment Driven Deep Neural Model for Nested Named Entity Recognition |
11:50 AM - 02:00 PM | Lunch Break |
02:00 PM - 02:20 PM | Presentation 3: ImEW: A Framework for Editing Image in the Wild |
02:20 PM - 02:40 PM | Presentation 4: Generating Multimodal Augmentations with LLMs from Song Metadata for Music Information Retrieval |
02:40 PM - 03:00 PM | Presentation 5: CGSMP: Controllable Generative Summarization via Multimodal Prompt |
03:00 PM - 03:30 PM | Coffee Break |
03:30 PM - 03:50 PM | Presentation 6: Fashion-GPT: Integrating LLMs with Fashion Retrieval System |
03:50 PM - 04:10 PM | Presentation 7: Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model |
04:10 PM - 04:30 PM | Presentation 8: Multimodal Data Augmentation for Image Captioning using Diffusion Models |
04:30 PM - 04:40 PM | Workshop Closing |