Lip-reading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lip-reading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a), but existing end-to-end models perform only word classification rather than sentence-level sequence prediction. Studies have shown that human lip-reading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features that capture temporal context in an ambiguous communication channel.

To the best of our knowledge, LipNet, developed in 2016 by Yannis Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas at Oxford University in collaboration with Google DeepMind, was the first end-to-end sentence-level lip-reading model that simultaneously learns spatiotemporal visual features and a sequence model. Motivated by this observation, we present our model, LipSync, which maps a variable-length sequence of video frames to text using deep neural networks and a classification loss, trained entirely end-to-end.
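To make the "variable-length sequence of video frames to text" idea concrete, here is a minimal, hypothetical sketch of the data flow only: per-frame spatial features feeding a simple recurrence that carries temporal context, producing one row of character scores per frame. All names, shapes, and the random weights are illustrative assumptions, not the actual LipNet or LipSync architecture (which uses spatiotemporal convolutions, GRUs, and CTC).

```python
import numpy as np

# Illustrative sketch only: maps a variable-length video (T, H, W) to
# per-frame character scores (T, |VOCAB|). Weights are random; no training.
rng = np.random.default_rng(0)

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")  # 27 output symbols (assumed)
H, W, D = 50, 100, 32                        # frame size and feature dim (assumed)

W_feat = rng.standard_normal((H * W, D)) * 0.01   # per-frame spatial projection
W_rec = rng.standard_normal((D, D)) * 0.01        # simple temporal recurrence
W_out = rng.standard_normal((D, len(VOCAB))) * 0.01

def lipread_logits(frames):
    """frames: (T, H, W) grayscale video; returns (T, len(VOCAB)) scores."""
    h = np.zeros(D)
    logits = []
    for frame in frames:                 # temporal loop handles any length T
        x = frame.reshape(-1) @ W_feat   # spatial features for this frame
        h = np.tanh(x + h @ W_rec)       # hidden state carries temporal context
        logits.append(h @ W_out)         # per-timestep character scores
    return np.stack(logits)

video = rng.standard_normal((75, H, W))  # e.g. a 3-second clip at 25 fps
scores = lipread_logits(video)
print(scores.shape)                      # one score row per input frame
```

Because the temporal loop runs for however many frames arrive, the same function accepts clips of any length, which is the property the end-to-end sentence-level setting relies on.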