See you soon!


Dr. Cordelia Schmid – Advances in Dense Video Captioning, Vision-Guided Navigation, and Robot Manipulation


In this talk, we first present recent progress in large-scale learning of multimodal video representations. We present Vid2Seq, a model for dense video captioning that takes as input video and speech and predicts both temporal boundaries and textual descriptions simultaneously. We then present an approach for video question answering and image captioning that relies on a retrieval-augmented visual language model that learns to encode world knowledge into a large-scale memory and to retrieve from it to answer knowledge-intensive queries. We show that our approach achieves state-of-the-art results in visual question answering and image captioning.

In the second part of the talk, we introduce recent work on vision-guided navigation and robot manipulation given language instructions. This work builds on and extends vision-language transformers by integrating action history and predicting actions. The History Aware Multimodal Transformer (HAMT) outperforms the state of the art on several vision-language-navigation benchmarks. Further improvements are achieved by integrating map information into the transformer architecture. We show object goal navigation in the real world on the Tiago robot. Next, we demonstrate that such a transformer-based approach can also be used for manipulation and provide evidence of the importance of 3D visual representations. Our approach achieves excellent real-world performance on a UR5 arm.


Dr. Cordelia Schmid holds an M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate in Computer Science from the Institut National Polytechnique de Grenoble (INPG).

Her doctoral thesis on “Local Greyvalue Invariants for Image Matching and Retrieval” received the best thesis award from INPG in 1996. She received the Habilitation degree in 2001 for her thesis entitled “From Image Matching to Learning Visual Models”. Dr. Schmid was a post-doctoral research assistant in the Robotics Research Group of Oxford University from 1996 to 1997. Since 1997 she has held a permanent research position at Inria, where she is a research director.

Dr. Schmid is a member of the German National Academy of Sciences, Leopoldina and a fellow of IEEE and the ELLIS society. She was awarded the Longuet-Higgins Prize in 2006, 2014 and 2016, the Koenderink Prize in 2018, and the Helmholtz Prize in 2023, for fundamental contributions in computer vision that have withstood the test of time. She received an ERC advanced grant in 2013, the Humboldt Research Award in 2015, the Inria & French Academy of Science Grand Prix in 2016, the Royal Society Milner Award in 2020, the PAMI Distinguished Researcher Award in 2021, and the Körber European Science Prize in 2023. Dr. Schmid has been an Associate Editor for IEEE PAMI (2001–2005) and for IJCV (2004–2012), an editor-in-chief for IJCV (2013–2018), a program chair of IEEE CVPR 2005 and ECCV 2012 as well as a general chair of IEEE CVPR 2015, ECCV 2020 and ICCV 2023. Since 2018 she has held a joint appointment with Google Research.

Prof. Dr. Wolfram Burgard – Probabilistic and Deep Learning Techniques for Robot Navigation and Automated Driving
The talk by Wolfram Burgard is available on YouTube.


For autonomous robots and automated driving, the capability to robustly perceive the environment and execute actions is the ultimate goal. The key challenge is that no sensors and actuators are perfect, which means that robots and cars need the ability to properly deal with the resulting uncertainty. In this presentation, I will introduce the probabilistic approach to robotics, which provides a rigorous statistical methodology to deal with state estimation problems. I will furthermore discuss how this approach can be combined with state-of-the-art technology from machine learning to deal with complex and changing real-world environments.


Wolfram Burgard is a distinguished Professor for Robotics and Artificial Intelligence at the University of Technology Nuremberg, where he also serves as Founding Chair of the Engineering Department. Previously, he held the position of Professor of Computer Science at the University of Freiburg from 1999 to 2021, where he established the renowned research lab for Autonomous Intelligent Systems. His expertise lies in artificial intelligence and mobile robots, focusing on the development of robust and adaptive techniques for state estimation and control.

Wolfram Burgard’s achievements include deploying the first interactive mobile tour-guide robot, Rhino, at the Deutsches Museum Bonn in 1997. In 2008, he and his team also developed a groundbreaking approach that allowed a car to autonomously navigate through a complex parking garage and park itself. In 2012, he and his team developed the robot Obelix, which autonomously navigated like a pedestrian from the campus of the Faculty of Engineering to the city center of Freiburg. Wolfram Burgard has published over 350 papers and articles in robotics and artificial intelligence conferences and journals. Additionally, he co-authored the two books “Principles of Robot Motion – Theory, Algorithms, and Implementations” and “Probabilistic Robotics”. In 2009, he was honored with the Gottfried Wilhelm Leibniz Prize, the most prestigious research award in Germany. He is a member of the Heidelberg Academy of Sciences and the German Academy of Sciences Leopoldina.

Organized and Supported by