|
|||
|
|
|
|
|
|||
|
00:00 |
(Beginning of video)
|
|
|
|||
|
00:02 |
This lecture is going to look at the challenge of transference.
|
|
|
|||
|
00:06 |
Transference aims to transfer knowledge between modalities, usually to help a target modality that we care about, which may be noisy or have limited resources.
|
|
|
|||
|
00:15 |
We'll look at two types of transference: transfer and co-learning.
|
|
|
|||
|
00:23 |
The first paradigm is that of transfer via foundation models.
|
|
|
|||
|
00:27 |
In this case, we're going to assume that we have a large-scale pretrained model on one task involving one modality, for example language.
|
|
|
|||
|
00:36 |
The goal is to adapt this large-scale pretrained model without needing large amounts of data,
|
|
|
|||
|
00:44 |
via some fine-tuning process or some adaptation process.
|
|
|
|||
|
00:48 |
An approach that has been proposed recently asks: how can you take a language model pretrained on language data and adapt it to work on
|
|
|
|||
|
00:59 |
tasks involving vision and language?
|
|
|
|||
|
01:03 |
And one way to do this is via a prefix.
|
|
|
|||
|
01:05 |
Can we take an image, put it through a visual encoder, and present it as a prefix to this large language model, before the text?
|
|
|
|||
|
01:14 |
This has been shown to work pretty well: you fine-tune this model with just the vision encoder trainable, freezing the rest of the language model.
|
|
|
|||
|
01:21 |
You can start getting it to work on certain types of multimodal tasks, tasks that are
|
|
|
|||
|
01:29 |
few-shot or require outside knowledge, because these large language models capture a lot of knowledge from both the visual prefix and the text that comes in.
|
|
|
|||
|
01:44 |
This is known as prefix tuning: adapting these language models where you only need to train the prefix, using a small number of parameters, while keeping most of the language model parameters fixed.
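As an illustration, here is a minimal numpy sketch of the image-as-prefix idea. The sizes (`D`, `n_prefix`, the 512-dim image features) and the stand-in linear map for the frozen language model are all assumptions for the sketch, not any specific published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # language-model embedding width (assumed toy size)
W_frozen = rng.standard_normal((D, D)) * 0.1  # stand-in frozen LM weights, never updated

def frozen_language_model(token_embeddings):
    """Stand-in for the frozen pretrained language model."""
    return token_embeddings @ W_frozen

class VisualPrefixEncoder:
    """The only trainable part: maps image features to a short sequence
    of 'prefix tokens' living in the LM's embedding space."""
    def __init__(self, feat_dim, n_prefix):
        self.n_prefix = n_prefix
        self.W = rng.standard_normal((feat_dim, n_prefix * D)) * 0.01

    def __call__(self, image_feats):
        return (image_feats @ self.W).reshape(self.n_prefix, D)

# hypothetical inputs: one image feature vector and five text-token embeddings
image_feats = rng.standard_normal(512)
text_embeds = rng.standard_normal((5, D))

encoder = VisualPrefixEncoder(feat_dim=512, n_prefix=2)
prefix = encoder(image_feats)                     # (2, D) visual prefix
sequence = np.concatenate([prefix, text_embeds])  # prefix placed before the text
outputs = frozen_language_model(sequence)         # (7, D)
```

Only `encoder.W` would receive gradients during fine-tuning; everything inside `frozen_language_model` stays fixed.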
|
|
|
|||
|
01:57 |
The next line of work goes deeper: instead of just looking at the prefix, it goes deeper into the language model, with approaches that work by adapting the representations.
|
|
|
|||
|
02:07 |
These approaches essentially look at the self-attention layers inside the model and adapt them so that they don't just take language tokens attending to each other, but also audio and visual information.
|
|
|
|||
|
02:20 |
What they do here is take in the audio and visual information.
|
|
|
|||
|
02:25 |
They then do an attention that allows you to dynamically weight the contribution of audio and vision on top of language,
|
|
|
|||
|
02:32 |
and use that to shift language, which allows you to obtain a shifted, multimodally informed representation
|
|
|
|||
|
02:38 |
that is conditioned on the audio-visual data that comes in.
|
|
|
|||
|
02:42 |
So you can think about it as: language by itself, without audio and visual, sits at the center, and then it is shifted in a positive direction
|
|
|
|||
|
02:50 |
given positive audio-visual input, or in a negative direction given negative audio-visual input,
|
|
|
|||
|
02:56 |
to get a shifted representation.
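A minimal sketch of this shifting idea, assuming a simple dot-product gate over the audio and visual vectors and a scalar `beta` controlling the shift magnitude (published attention-gating mechanisms are more elaborate):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def shift_language(h_lang, h_audio, h_visual, beta=0.5):
    """Weight the audio and visual vectors by their (dot-product) relevance
    to the language vector, then displace the language representation by
    the weighted combination — a shifted, multimodally informed vector."""
    weights = softmax(np.array([h_lang @ h_audio, h_lang @ h_visual]))
    shift = weights[0] * h_audio + weights[1] * h_visual
    return h_lang + beta * shift
```

With zero audio-visual input the representation stays at its language-only "center"; non-zero input moves it in the direction the gate selects.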
|
|
|
|||
|
03:01 |
Another way of doing this is to essentially just train one multimodal model on many modalities at the same time.
|
|
|
|||
|
03:06 |
This has been popularized recently because of the feasibility of using Transformers for encoding many different types of input,
|
|
|
|||
|
03:14 |
all in the form of a sequence.
|
|
|
|||
|
03:17 |
One recent method that came out aims to build one unified model, with parameter sharing, able to do multitask and transfer learning across multimodal tasks.
|
|
|
|||
|
03:29 |
The core idea here is: given a series of tasks you'd like to solve, defined over several modalities,
|
|
|
|||
|
03:35 |
the first step is to standardize the input modalities into a common format. The format here is the fact that they are all sequences: language can be seen as a sequence of words, audio as a spectrogram.
|
|
|
|||
|
03:50 |
The disclaimer is that it is not clear whether this is the best way to do this. We know that this works, but it is still unclear whether standardizing everything as a sequence might lose information about the structure.
|
|
|
|||
|
04:03 |
But if you go with the assumption of standardizing everything as a sequence,
|
|
|
|||
|
04:07 |
the sequences are accompanied by a few modality-specific embeddings that allow you to identify these sequences.
|
|
|
|||
|
04:13 |
Now that everything is standardized, can we attempt to put it all through the same shared model?
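A toy sketch of this standardization step, assuming already-featurized inputs and a hypothetical learned `modality_embed` table that tags each sequence with its modality:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # shared embedding width (assumed toy size)
MODALITIES = {"text": 0, "image": 1, "audio": 2}
modality_embed = rng.standard_normal((len(MODALITIES), D))  # learned in practice

def standardize(features, modality):
    """Represent any modality as a sequence of D-dim vectors and add a
    modality-specific embedding so the shared model can identify it."""
    return np.atleast_2d(features) + modality_embed[MODALITIES[modality]]

# toy featurized inputs: 5 word vectors and 9 image-patch vectors
text_seq  = standardize(rng.standard_normal((5, D)), "text")
image_seq = standardize(rng.standard_normal((9, D)), "image")
joint_input = np.concatenate([text_seq, image_seq])  # one sequence, one model
```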
|
|
|
|||
|
04:22 |
This is going to allow us to train a small task-specific classifier for each specific task, and train everything via multimodal, multitask learning, updating the parameters
|
|
|
|||
|
04:32 |
for each task, whatever those modalities are.
|
|
|
|||
|
04:37 |
This essentially allows you to have the same model architecture,
|
|
|
|||
|
04:41 |
and also the same, or almost the same, parameters except for the modality-specific embeddings,
|
|
|
|||
|
04:48 |
to do many multimodal tasks
|
|
|
|||
|
04:51 |
At the same time.
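A sketch of this parameter-sharing setup: one assumed shared linear body plus hypothetical task heads (`sentiment`, `vqa`); only the small heads differ between tasks:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8
W_shared = rng.standard_normal((D, D)) * 0.1  # one shared body for every task
heads = {                                     # tiny task-specific classifiers
    "sentiment": rng.standard_normal((D, 2)) * 0.1,
    "vqa":       rng.standard_normal((D, 10)) * 0.1,
}

def predict(sequence, task):
    """Same architecture and (almost) the same parameters for every task:
    pool the shared representation, then apply the task's output head."""
    pooled = np.tanh(sequence @ W_shared).mean(axis=0)
    return pooled @ heads[task]

x = rng.standard_normal((6, D))  # any standardized input sequence
```

Because `W_shared` is updated by every task, information learned on one task can transfer to the others.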
|
|
|
|||
|
04:53 |
Another concurrent work also found that this is possible, again with a unification of these input modalities.
|
|
|
|||
|
05:01 |
Language is a sequence of words, and they extended this to reinforcement learning tasks as well, with states and actions, all in one single model with the same architecture and the same parameters,
|
|
|
|||
|
05:16 |
applied to quite a few tasks at the same time.
|
|
|
|||
|
05:19 |
This allows you to transfer information, to share information,
|
|
|
|||
|
05:22 |
using one single foundation model across multiple tasks.
|
|
|
|||
|
05:33 |
The second subchallenge is co-learning.
|
|
|
|||
|
05:37 |
Instead of using a pretrained foundation model,
|
|
|
|||
|
05:44 |
we transfer information from a secondary modality to a primary one, with a representation space in the middle of both.
|
|
|
|||
|
05:52 |
The secondary modality is injected during training, providing additional information and helping you to learn this common representation space between A and B.
|
|
|
|||
|
06:02 |
And it is not required during testing.
|
|
|
|||
|
06:04 |
This gives a common embedding space.
|
|
|
|||
|
06:07 |
Can it then be used for subsequent prediction tasks involving only the primary modality?
|
|
|
|||
|
06:17 |
We'll look at two types of this,
|
|
|
|||
|
06:19 |
differing depending on where we add the secondary modality.
|
|
|
|||
|
06:24 |
Is it going to be added at the input level, in which case we'll call it enrichment by fusion?
|
|
|
|||
|
06:29 |
Or it can also be introduced at the prediction level, in which case we'll call it enrichment by translation.
|
|
|
|||
|
06:38 |
The first case is co-learning by fusion. Perhaps the seminal work here was that of using word embedding spaces to help you perform visual classification with no visual resources for some classes.
|
|
|
|||
|
06:49 |
The core idea here is to learn a common embedding space for images and text,
|
|
|
|||
|
06:54 |
where images and text lie in the same space.
|
|
|
|||
|
06:57 |
In this case we want, essentially, for "horse", the word embedding of horse
|
|
|
|||
|
07:02 |
to lie close to all the image embeddings for the various horses that you've seen in your data.
|
|
|
|||
|
07:08 |
And likewise.
|
|
|
|||
|
07:09 |
We want the word embedding for "dog"
|
|
|
|||
|
07:11 |
to be closest to the center of all the image embeddings of dogs.
|
|
|
|||
|
07:15 |
And likewise the embedding for "auto":
|
|
|
|||
|
07:18 |
we want the word embedding for "auto" to lie close to the center of all the image embeddings of cars in the data.
|
|
|
|||
|
07:26 |
How is this done? It is essentially coordinating representations between the word embeddings and the image embeddings.
|
|
|
|||
|
07:42 |
A really nice thing about this is that once
|
|
|
|||
|
07:45 |
you've trained this embedding space, say you have a new test image from an unknown class. We know this is a cat, but the model has never seen any images of cats.
|
|
|
|||
|
07:53 |
If representation coordination worked properly, the image embedding of this new image of a cat should be nearby the word embedding of "cat".
|
|
|
|||
|
08:02 |
This is enabled by the structure of the word embedding space, in which "cat" and "dog" are going to be nearby: they share similar features, as both are animals with four legs.
|
|
|
|||
|
08:11 |
And this will allow you to do zero-shot classification of this image, by looking up its nearest word embedding, "cat".
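A small sketch of this zero-shot lookup in a coordinated space, using made-up 3-dimensional embeddings purely for illustration:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# hypothetical coordinated word embeddings; 'cat' had NO training images
word_embeddings = {
    "dog":  normalize(np.array([1.0, 0.1, 0.0])),
    "cat":  normalize(np.array([0.9, 0.4, 0.0])),
    "auto": normalize(np.array([0.0, 0.1, 1.0])),
}

def zero_shot_classify(image_embedding):
    """Label an image by its nearest word embedding (cosine similarity)."""
    image_embedding = normalize(image_embedding)
    return max(word_embeddings, key=lambda w: word_embeddings[w] @ image_embedding)

# an unseen cat image whose embedding lands near the word 'cat'
label = zero_shot_classify(np.array([0.85, 0.5, 0.05]))  # -> "cat"
```

Because "cat" sits near "dog" in the word space, an image embedding coordinated with that space can be labeled correctly even though the class had no training images.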
|
|
|
|||
|
08:19 |
This is an example of using fusion to learn a common space for images and text.
|
|
|
|||
|
08:27 |
And the co-learning achieved here is that it allows you to use word embedding spaces to help you do visual classification.
|
|
|
|||
|
08:35 |
Note the key point: only images are going to be used at test time; what did the enriching is the word embedding space during training.
|
|
|
|||
|
08:44 |
A second example is to simply learn a joint model: during training, I'm going to use multimodal data, say two or three modalities, and learn a common model to make some predictions.
|
|
|
|||
|
09:00 |
During testing, simply remove whatever the secondary modalities were and just keep language.
|
|
|
|||
|
09:11 |
How does this count as co-learning? It has a supervised prediction objective, and it works by fusion rather than coordination, but still:
|
|
|
|||
|
09:22 |
it uses additional video and audio data during training that is not required during testing, and the video and audio enrich the representation space, so your model does better at classifying language.
|
|
|
|||
|
09:38 |
What they find is that, even when only using text at test time,
|
|
|
|||
|
09:42 |
multimodal training improves upon language-only training.
|
|
|
|||
|
09:46 |
So there is some enrichment going on by introducing video and audio in training, even though they are not used at test time.
|
|
|
|||
|
09:55 |
These were examples of co-learning by fusion.
|
|
|
|||
|
09:58 |
Now here are some examples of co-learning by translation.
|
|
|
|||
|
10:01 |
In this case, the example is that of using language,
|
|
|
|||
|
10:05 |
learning a representation space that is then able to reconstruct the visual modality, as a forward prediction.
|
|
|
|||
|
10:13 |
This joint representation can then be used for predicting labels. And so, instead of using language and vision as input,
|
|
|
|||
|
10:22 |
you enrich the representation by translating from language, through the representation space, to the visual modality.
|
|
|
|||
|
10:31 |
The inspiration for this came from the fact that you could do the same in machine translation, translating from English to French,
|
|
|
|||
|
10:45 |
with the translation target only needed during training, not during testing.
|
|
|
|||
|
10:47 |
So you use cross-modal translation during training; at test time, you take the representation space and then use it to make predictions.
|
|
|
|||
|
11:02 |
The one small problem here is: how can you ensure that both modalities are being used?
|
|
|
|||
|
11:07 |
If you're trying to predict both sentiment and the visual modality, it is easy to just ignore the prediction of the visual modality and predict from text alone.
|
|
|
|||
|
11:17 |
And that's why they use a bimodal, cyclic translation: translating from language to visual and from visual back to language, which makes sure the representation carries as much information about the visual modality as possible, so you can also reconstruct language backwards.
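A sketch of the cyclic objective, assuming simple linear translators in both directions and mean-squared-error reconstruction losses (the actual models use learned encoder-decoders):

```python
import numpy as np

rng = np.random.default_rng(3)
W_fwd = rng.standard_normal((4, 4)) * 0.1  # language -> visual (assumed linear)
W_bwd = rng.standard_normal((4, 4)) * 0.1  # visual -> language (assumed linear)

def cyclic_translation_loss(h_lang, h_vis):
    """Translate language to the visual modality AND translate it back, so
    the intermediate representation cannot simply ignore the visual target."""
    pred_vis   = h_lang @ W_fwd            # forward: language -> visual
    recon_lang = pred_vis @ W_bwd          # backward: visual -> language
    forward_loss  = np.mean((pred_vis - h_vis) ** 2)
    backward_loss = np.mean((recon_lang - h_lang) ** 2)
    return forward_loss + backward_loss
```

At test time, only the language side is needed: you keep the intermediate representation and drop the visual decoder.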
|
|
|
|||
|
11:38 |
So this, again, only requires the cross-modal translation during training.
|
|
|
|||
|
11:42 |
Only language is required during testing. This can be seen as an example of co-learning,
|
|
|
|||
|
11:48 |
with the secondary modality, which you hope enriches the representation space, added at the output level.
|
|
|
|||
|
11:57 |
More recently, people have tried to scale this up. One good example of that is actually using
|
|
|
|||
|
12:03 |
images as a prediction objective to further improve language models.
|
|
|
|||
|
12:14 |
So these folks proposed this vokenization idea, which is: in addition to predicting masked tokens, also try to predict the images at the same time.
|
|
|
|||
|
12:25 |
So when the model is looking at a word, contextualized with the other words, can it predict the image corresponding to that word, for example an image of humans for the word "humans"?
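A sketch of such a joint objective, with an assumed weighting `lam` and hypothetical logits over the vocabulary and over a set of candidate images:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy from raw logits (numerically stabilized)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def voken_style_loss(token_logits, token_targets,
                     image_logits, image_targets, lam=1.0):
    """Assumed form of the joint objective: the usual masked-token
    cross-entropy, plus a second cross-entropy asking each contextualized
    token to also retrieve its corresponding image."""
    return (cross_entropy(token_logits, token_targets)
            + lam * cross_entropy(image_logits, image_targets))

rng = np.random.default_rng(4)
tok_logits = rng.standard_normal((4, 30))  # 4 masked positions, 30-word vocab
img_logits = rng.standard_normal((4, 6))   # 6 candidate images per position
loss = voken_style_loss(tok_logits, np.array([1, 2, 3, 4]),
                        img_logits, np.array([0, 1, 2, 3]))
```

The image branch is only a training signal; at test time the model consumes and predicts text alone.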
|
|
|
|||
|
12:39 |
And again, these images are used during training as a prediction objective; they are not used
|
|
|
|||
|
12:45 |
at test time, where only text is used.
|
|
|
|||
|
12:48 |
What they still find is that training by also predicting images improves upon training on language only.
|
|
|
|||
|
12:55 |
So this is an example of, again, co-learning by translation.
|
|
|
|||
|
13:01 |
There are many, many more dimensions to transfer; we've only scratched the surface. There have been attempts to build a single model that allows you to predict many modalities at the same time.
|
|
|
|||
|
13:11 |
There is a lot of work on multimodal multitask learning,
|
|
|
|||
|
13:15 |
and at the same time there are many open challenges.
|
|
|
|||
|
13:18 |
One big open challenge is that of low-resource settings.
|
|
|
|||
|
13:22 |
Many of these models that work really well require large amounts of data, especially large multimodal data. How can you get them to work for downstream tasks with little data?
|
|
|
|||
|
13:32 |
Another big challenge is going beyond language and vision. We have abundant data in language and vision, but how do we get transfer to work for low-resource modalities, where transfer is actually needed?
|
|
|
|||
|
13:46 |
There are also settings where there are no pretrained encoders at all, in which case it might be much harder to transfer representations from one
|
|
|
|||
|
13:56 |
modality, whose encoder is based on deep learning, to another modality which is not.
|
|
|
|||
|
14:01 |
Finally, most of the work in transfer is based on very large-scale pretrained models, which creates complexity issues and also interpretability issues.
|
|
|
|||
|
14:15 |
Thank you.
|
|
|
|||
|
14:19 |
(End of video)
|
|
|