|
|||
|
|
|
|
|
|||
|
00:00 |
(Beginning of video)
|
|
|
|||
|
00:00 |
Another key technical challenge in multimodal is alignment.
|
|
|
|||
|
00:06 |
Alignment comes in many different flavors and many different applications, and almost all problems will have an aspect of alignment. For example, in media description, like image captioning, you want to know which objects relate to which words, or, in a video, which elements or which events relate to which phrase.
|
|
|
|||
|
00:31 |
Part of a video caption. Modality transcription, like text to speech or even generating gestures from speech, will also require this alignment, because gestures are not aligned.
|
|
|
|||
|
00:48 |
And in many of these new applications, like navigation and question answering, you need to handle something like "where's the phone?", so you need to find the phone, but you also need to have some context related to that.
|
|
|
|||
|
23:24 |
(End of video)
|
|
|