|
|||
|
|
|
|
|
|||
|
00:00 |
(Beginning of video)
|
|
|
|||
|
00:01 |
Welcome to this NAACL 2022 tutorial on multimodal machine learning.
|
|
|
|||
|
00:08 |
We will present this tutorial together with Amir Zadeh and Paul Liang. Multimodal is a passion for us, and hopefully we can share this passion with you. We live in a multimodal world; in fact, technologies are becoming more and more multimodal.
|
|
|
|||
|
00:32 |
Robot.
|
|
|
|||
|
00:34 |
Vehicles, and a lot of modalities through our cell phones, and we see other technologies, even wearables and augmented reality, coming up. We spent much of the last two years on video conferencing and are now finally starting to hold in-person meetings, many of them changing to hybrid meetings, so we have to understand [unclear].
|
|
|
|||
|
01:00 |
So, as we know, multimodal is core to human communication and also core to collaboration. When we create a tutorial on this topic, we have to ask: what is multimodal? The word "multimodal" was in fact meant in mathematics as multiple modes, multiple distinct peaks in a probability distribution. But the sense we will use today is multimodal as multiple modalities.
|
|
|
|||
|
01:36 |
Sensory modality.
|
|
|
|||
|
01:39 |
We have multiple of these modalities, and as we have been building these technologies, some new modalities have emerged. Some of the most prominent modalities include language, vision, and speech. Language: the words you say, how you phrase your sentences, and the intent behind the words.
|
|
|
|||
|
02:02 |
Acoustic: the prosody, the intonation behind the spoken words, but also the vocal expressions beyond the words, like laughter, moaning, or pause fillers.
|
|
|
|||
|
02:17 |
Visual is a strong modality: the gestures, the body language, how close people stand (proxemics). Eye gaze is one of the important ones that we often look at, both as a communicative cue and cognitively, and facial expressions as well. These are related to face-to-face communication, but the visual modality also brings a lot more, like the objects around, the light, and the environment, so there is a lot more to this modality. And as things evolve, both in technology and [unclear], we are seeing physiological sensors these days, and mobile devices come with many of these extra modalities, like GPS, accelerometers, and light sensors.
|
|
|
|||
|
03:06 |
So we have to ask about the modalities themselves: what is a modality? In its most general sense, a modality refers to the way in which something is
|
|
|
|||
|
03:18 |
expressed or perceived.
|
|
|
|||
|
03:20 |
And so, when we study modalities, I want to introduce one dimension for these modalities: how close you are to the sensor, from what we would call raw modalities all the way to the more abstract modalities, further from the sensor. For example, something like the speech signal or the image is very close to the sensor, as opposed to something like language or objects.
|
|
|
|||
|
04:00 |
There is a debate internally, within multimodal, about what we can call a modality. For the purpose of this tutorial, we will be inclusive and include all of them as modalities. As you can imagine, the ones closest to the sensor will often be the most interesting, the ones that show the most diversity.
|
|
|
|||
|
04:23 |
So when we study multimodal, what is multimodal? The dictionary definition is just "from multiple" or "with multiple modalities". From a research perspective, we want to suggest that multimodal is the science of heterogeneous and interconnected data, and I will define those two terms. Heterogeneous is probably the first one that comes to mind: the fact that we have heterogeneous modalities. The information present in different modalities will often show diverse qualities, structures, and representations, like modality A and B, for example. You ask yourself: are these heterogeneous or homogeneous? That is another axis we will define for modalities: they can be homogeneous, very similar to each other, all the way to heterogeneous, not very similar to each other. An example could be two images from the same camera; these are very homogeneous.
|
|
|
|||
|
05:31 |
With different cameras you may have very different views, and likewise text from two different languages, then language and vision, and you can imagine these becoming more and more different. One thing to remember is that when we talk about abstract modalities, we will often see these modalities being more homogeneous, closer to each other, while the ones that are closer to the sensor are usually more heterogeneous, because they will have very different qualities. So I want to share with you what we mean by heterogeneity. This is not an exhaustive list; these are six examples of it, and in our work we also bring some other examples. A very key one when we look at heterogeneity is the structure of the data. We have two modalities that are heterogeneous, and they may have very different structure: one of them may have more spatial structure, like an image; some will have a very strong temporal structure and some will not; and some will have a hierarchy. When you look at structure, the important word is often invariance. For example, whether LP, myself, is here or slightly down and to the left, is that really a new LP or the same one? In vision that is called translation invariance. So it is not just that there is a structure, but that there is also invariance along some of the axes of that structure.
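The translation-invariance idea mentioned above can be sketched in a few lines. This is only an illustrative sketch, not something shown in the tutorial: it assumes a toy 1D "image", a circular shift as the translation, and global max pooling as the invariant feature; the names `shift` and `global_max_pool` are hypothetical.

```python
def shift(signal, k):
    """Circularly shift a 1D signal k positions to the right (toy 'translation')."""
    k %= len(signal)
    if k == 0:
        return list(signal)
    return signal[-k:] + signal[:-k]


def global_max_pool(signal):
    """A feature that ignores position: the strongest response anywhere."""
    return max(signal)


# Assumed toy data: a single "object" (the bump 5, 9, 5) on a flat background.
signal = [0, 0, 5, 9, 5, 0, 0, 0]
shifted = shift(signal, 3)  # same object, moved three positions to the right

# The raw inputs differ element by element...
raw_inputs_differ = signal != shifted
# ...but the pooled feature is unchanged by the translation.
pooled_feature_equal = global_max_pool(signal) == global_max_pool(shifted)
```

The raw vectors differ, yet the pooled feature is identical, which is the sense in which the moved object is "the same one" to the model.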
|
|
|
|||
|
07:12 |
The second one is the representation space: do we have very discrete tokens, like words, or do we have a very continuous signal, like speech or even a frequency representation of it?
|
|
|
|||
|
07:30 |
And when you think of representation, there is also how humans interpret it, which can have an effect on the kind of modeling you do.
|
|
|
|||
|
07:39 |
And then, within these modalities with their different structures and representations: how much information is encoded, how much variance is included within each unit, and how much density, range, or overlap do you see in your information? Related to that is the granularity of one unit of information: what is the sampling rate, the resolution, the precision? You could think of sometimes analyzing at the word level and sometimes at the sentence level, which are different granularities. The same modality will often have multiple levels of granularity.
|
|
|
|||
|
08:23 |
And then there is the noise, because there is a lot of uncertainty, sometimes because of noise, as I mentioned, or missing data. You want to be able to model that, and it will become important when you do any kind of fusion. And finally, the relevance, because each of these modalities may be relevant to a certain task or be dependent on the context, the environment around it. So, as you can see, each modality has its own dimensions, and when you do any kind of multimodal modeling you need to take that into consideration.
|
|
|
|||
|
09:00 |
Heterogeneity is part of it. The second one, which we believe is also very important, is interconnection: the fact that modalities have cross-modal interactions between elements of each modality. This interconnection comes at two levels: first, the connection itself.
|
|
|
|||
|
09:20 |
The connection could be because there is a correspondence: there is an object, and the same object exists in the language modality as a word, and it corresponds to it. Or it could be a dependency, because there may be a dependence between one event and another one in the future. These are all connections, another part of the structure.
|
|
|
|||
|
09:44 |
And finally, for each of those connections, you have to ask yourself how, specifically, the connected elements of the two modalities are interacting.
|
|
|
|||
|
09:56 |
And when you start looking at cross-modal interaction, this is one of the core technical challenges. I always like to discuss this aspect a little from a cognitive and behavioral science perspective, because humans have been studied and there is a lot of great work in cognitive science, brain science, and behavioral science on this. I will highlight one of them here, although I could probably give a whole tutorial on what some people call multisensory integration. That is a term often used in cognitive science, related to what we in AI call multimodal fusion.
|
|
|
|||
|
10:39 |
So when you study human multimodal communication, there is a first taxonomy, and this is a good starting point for us as computer scientists. One aspect is how much redundancy there is between the two modalities. You can see it almost from an information-theoretic or computer science perspective: how much information is redundant in both modalities? They suggest having this as the first level of the taxonomy: do the modalities have redundant information or not? If they have redundant information and we merge them together, integrate them, what will happen? In one case there is what is called equivalence: the information in both is redundant, and the interaction between them is such that nothing changes; the response is the same. In the second case there is an enhancement, for example additive, which we will discuss later as additive fusion.
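One way to make "how much information is redundant in both modalities" concrete is mutual information between two discrete signals. The speaker does not name a specific measure, so treating redundancy as mutual information is an assumption of this sketch, as are the toy sequences and the hypothetical helper name `mutual_information`.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Estimate I(A; B) in bits from a list of paired discrete observations (a, b)."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = marg_a[a] / n
        p_b = marg_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Assumed toy data: modality B always repeats modality A (fully redundant).
redundant = [(0, 0), (1, 1), (0, 0), (1, 1)]
# Assumed toy data: every combination is equally likely (no shared information).
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

mi_redundant = mutual_information(redundant)
mi_independent = mutual_information(independent)
```

Under these assumptions the fully redundant pair shares one bit and the independent pair shares none, which mirrors the first split of the taxonomy into redundant and non-redundant modalities.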
|
|
|
|||
|
12:08 |
Subtractive would be another if they were.
|
|
|
|||
|
12:11 |
When we have [unclear] fused together, there is what is called independence, and then you just have the same information again. Sometimes one is dominant over another, and then the other is completely ignored. An interesting one is modulation, which we will discuss later, where one modality has a way of modulating, of changing, of enhancing [unclear], you could say, the other modality. And then the holy grail of multimodal is emergence, where something completely different comes out. In our world it could be a little simpler, maybe just a nonlinearity, something that is more than just additive, which may be a little bit of a start of emergence and then eventually goes much further.
|
|
|
|||
|
13:06 |
So this is the initial taxonomy from the behavioral sciences. We made an effort to try to [unclear] and define six dimensions of cross-modal interaction. The first one is additive, multiplicative, and non-additive. You can see it as an axis: additive has slightly simpler interactions, then multiplicative, and even going to non-additive, maybe even what they call emergence.
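The difference between additive and multiplicative interactions can be sketched with two tiny fusion functions. This is a hedged illustration rather than the tutorial's own model: the weighted-sum and bilinear forms, the function names, and all weights and feature vectors below are assumptions chosen for the example.

```python
def additive_fusion(x_a, x_b, w_a, w_b):
    """y = w_a . x_a + w_b . x_b: each modality contributes on its own."""
    return sum(w * x for w, x in zip(w_a, x_a)) + sum(w * x for w, x in zip(w_b, x_b))


def multiplicative_fusion(x_a, x_b, W):
    """y = x_a^T W x_b: every pair of features across modalities interacts."""
    return sum(
        W[i][j] * x_a[i] * x_b[j]
        for i in range(len(x_a))
        for j in range(len(x_b))
    )


# Assumed toy features for modality A (two settings) and modality B (two settings).
a1, a2 = [1, 0], [0, 1]
b1, b2 = [1, 1], [2, 1]
w_a, w_b = [1, 2], [3, 4]      # hypothetical additive weights
W = [[1, 0], [0, -1]]          # hypothetical bilinear weights

# Effect of changing modality B, measured under each setting of modality A.
add_effect_a1 = additive_fusion(a1, b2, w_a, w_b) - additive_fusion(a1, b1, w_a, w_b)
add_effect_a2 = additive_fusion(a2, b2, w_a, w_b) - additive_fusion(a2, b1, w_a, w_b)
mul_effect_a1 = multiplicative_fusion(a1, b2, W) - multiplicative_fusion(a1, b1, W)
mul_effect_a2 = multiplicative_fusion(a2, b2, W) - multiplicative_fusion(a2, b1, W)
```

In the additive case the effect of changing B is the same whatever A is; in the multiplicative case it depends on A, which is what makes it a cross-modal interaction rather than two independent contributions.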
|
|
|
|||
|
13:43 |
Also, when you look at it, some of the earlier taxonomies were for only two modalities, but in our case we will often have two, three, or a lot more modalities. So when we look at cross-modal interactions, you also have to study how many modalities are involved in the interaction. In fact, you will see that often, [unclear] modalities, a fair amount may be unimodal or bimodal, and some of it would be trimodal, and you want to be able to distinguish these.
|
|
|
|||
|
14:20 |
When you have modalities, [unclear], then you have this aspect of how much of the information is redundant: is it equivalent, is there a correspondence? Equivalence is very hard; it means they are exactly the same, and if I fuse them there is nothing more. Correspondence is when you have two that are equivalent, but when you bring them together they each bring their own aspect of each modality. And then the dependencies: this is probably the most understudied but very important one, how one modality, maybe because of a temporal aspect, will have a dependency on another.
|
|
|
|||
|
39:50 |
(End of video)