Multimodality powered by Vector Embeddings

A quick visual primer on how a basic data structure like vectors enables multimodality in machine learning

artificial intelligence
machine learning
deep learning
Author

Pranav Kural

Published

May 29, 2023

Keywords

machine learning, artificial intelligence, multimodal, multimodality

What is Multimodality?

Modality: mode + ability - the ability to represent data in different modes (i.e., types like text, image, audio, etc.).

Unimodal Machine Learning: ML models and frameworks that deal with a single modality. Example: The revolutionary ChatGPT (based on GPT-3.5), which works with textual data only.

Multimodal Machine Learning: ML models and frameworks that work with multiple modalities or mediums of data representation like text, image, audio, video, etc.

Such models are capable of connecting the dots between objects of different kinds, or even of generating contextually related content in different forms given an object of a certain supported kind. For example, by using a text describing the launch of a rocket into space as input to Meta AI’s ImageBind model, we can generate content in five different modalities, i.e., the model will automatically generate an image depicting the launch of a rocket into space, audio in the same context, possibly even a video showing the motion, and more! 🤯

multiple mediums of data representation

How do we create multimodal ML models?

There are multiple approaches that enable the creation of efficient multimodal machine learning models and frameworks. Let’s start from the basics so we don’t leave anyone out.

One True Modality

To compare data of different types and establish some notion of similarity between them, we need a common structure to which all these data types can be converted, so that we have a way of deducing similarity or dissimilarity.

Simple example: the image below shows textual and image data being converted to a bunch of numbers.

conversion from different mediums of data representation to one single medium of data representation

There are several ways of representing this bunch of numbers, and one of the most common is to represent them as vectors.

Vectors

vectors
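To make this concrete, here is a minimal sketch (using NumPy, with completely made-up numbers purely for illustration) of how both a short text and a tiny grayscale image can end up as plain vectors of numbers:

```python
import numpy as np

# A toy "text vector": made-up word counts for the sentence "a rocket launches into space",
# with one slot per word in a tiny vocabulary (the last slot is an unused word).
text_vector = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.0])

# A toy "image vector": a 2x2 grayscale image flattened into a single row of pixel intensities.
image = np.array([[0.1, 0.9],
                  [0.8, 0.2]])
image_vector = image.flatten()  # -> [0.1, 0.9, 0.8, 0.2]

print(text_vector.shape, image_vector.shape)  # both are just 1-D arrays of numbers
```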

Vector Embedding

The conversion of data from one medium to a vector representation is known as the generation of vector embeddings.

vector embeddings generation
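As a hedged sketch of what this looks like in practice: embeddings are typically produced by a pretrained model. The example below assumes the sentence-transformers library and its all-MiniLM-L6-v2 text encoder; any other embedding model would work just as well, and an image encoder would play the same role for images.

```python
# A minimal sketch of generating vector embeddings for text.
# Assumes the sentence-transformers package is installed (pip install sentence-transformers);
# any other embedding model or library could be substituted.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small, general-purpose text encoder

sentences = [
    "A rocket launches into space.",
    "An elephant walks through the savanna.",
]
embeddings = model.encode(sentences)  # one vector per sentence, shape (2, 384) for this model

print(embeddings.shape)
```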

Making the connection

We can use this approach to create models which can generate embeddings for multiple types of data and then use a single model (like the multimodal encoder shown below) to establish a relationship between these embeddings. The distance between two vectors in a vector space helps us establish a measure of similarity or contrast between two objects: the closer the vector representations of two objects are, the more contextually related they are likely to be.

multimodal encoding

Note that the above image is just for demonstration. The underlying process for achieving this could be much more complex. For a more real-world example, you can check the research paper on Meta AI’s FLAVA model: FLAVA: A Foundational Language And Vision Alignment Model
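To give a concrete feel for the distance idea above, here is a minimal sketch (with made-up embedding vectors, not the outputs of any real model) of measuring how close two embeddings are using cosine similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means pointing the same way, lower values mean less related."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings purely for illustration
image_of_dog = np.array([0.9, 0.1, 0.3])
text_about_dog = np.array([0.8, 0.2, 0.25])
text_about_stock_market = np.array([0.05, 0.9, 0.7])

print(cosine_similarity(image_of_dog, text_about_dog))           # high  -> contextually related
print(cosine_similarity(image_of_dog, text_about_stock_market))  # lower -> less related
```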

Contrastive Learning

Contrastive Learning is a common approach used when training models to work with multiple modalities at once.

Simply put: contrastive implies contrast, i.e., we train the model with both positive and negative examples (generally, more negative examples than positive ones).

The main idea is that, given a vector space and the vector embeddings generated by the model, the embeddings of data that are closely related (for example, an image of an elephant and a text containing the word ‘elephant’) should be placed close together, while the embeddings of data that are NOT closely related should be far apart.

The image below gives an idea of how such a 3-D vector space might look:

multimodal encoding

Taking the example of an ML model that works with two modalities, such as text and image, enabling such contrastive learning requires large amounts of data in the form of text-image pairs, which can be used to train the model.

multimodal encoding

For example, let’s say \(I_1\) is an image containing a car, \(T_1\) is a text that mentions a car, and \(T_2\) is another text that doesn’t mention a car anywhere. Ideally, our model should learn over time to create vectors where the distance between the vectors generated for \(I_1\) and \(T_1\) is much lower than the distance between the vectors for \(I_1\) and \(T_2\), therefore letting us deduce (later on) which data objects may be more closely related.
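As a rough sketch of how such an objective can be trained (simplified, and not OpenAI’s exact implementation), a CLIP-style contrastive loss treats each matched pair like \(I_1\) and \(T_1\) as a positive example and every mismatched pair like \(I_1\) and \(T_2\) as a negative one, using a cross-entropy over the batch’s similarity matrix:

```python
# A minimal, simplified sketch of a CLIP-style contrastive loss.
# The embeddings here are random placeholders; in a real setup they would
# come from an image encoder and a text encoder respectively.
import torch
import torch.nn.functional as F

batch_size, dim = 4, 32
image_embeddings = F.normalize(torch.randn(batch_size, dim), dim=-1)  # I_1 ... I_4
text_embeddings = F.normalize(torch.randn(batch_size, dim), dim=-1)   # T_1 ... T_4 (paired by index)

temperature = 0.07
logits = image_embeddings @ text_embeddings.T / temperature  # pairwise similarities

# For image i, the matching text is at index i; every other text in the batch is a negative example.
targets = torch.arange(batch_size)
loss_img_to_text = F.cross_entropy(logits, targets)
loss_text_to_img = F.cross_entropy(logits.T, targets)
loss = (loss_img_to_text + loss_text_to_img) / 2

print(loss.item())
```

Minimizing this loss pulls the embeddings of matched image-text pairs together while pushing mismatched pairs apart, which is exactly the behavior described above.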

CLIP

The above-mentioned approach and image were inspired by OpenAI’s CLIP model. To learn more about it, you can check this wonderful article on Pinecone: Multi-modal ML with OpenAI’s CLIP

Conclusion

There is a lot more you can do with simple things than we might imagine. A very basic data structure like a vector can be used to build mammoth and robust multimodal machine learning models, empowering applications in domains unexplored until now.

I hope that through this quick visual guide you were able to gain some basic familiarity with how things work under the hood when you use a multimodal machine learning tool. Happy exploring! 🫶🏼

Suggested further readings:
