Microsoft’s AI Turns Mona Lisa Into Rapping Sensation, Video Goes Viral | WATCH
This AI can take still photos of people's faces and turn them into animated characters that move and talk just like real humans.
In a stunning display of technological innovation, a video featuring the Mona Lisa rapping has taken the internet by storm. The viral clip, created using Microsoft's AI technology known as VASA-1, showcases the iconic painting lip-syncing "Paparazzi", a rap performed by actress Anne Hathaway.
In this case, the technology transformed the famous Mona Lisa into a lively rapper, complete with synchronised lip movements and expressive facial features.
Microsoft just dropped VASA-1.
— Min Choi (@minchoi) April 18, 2024
This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba
10 wild examples:
1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnD
The video quickly went viral after being shared on social media platforms. One post featuring the singing Mona Lisa clip has already racked up over seven million views and counting.
Users' comments on the viral video ranged from amusement to unease. One user said, "This is crazy, weird, and spooky all together🤯", while another wondered, "I'm curious about what this technology will be like in a year and a half 🤯". Another user commented, "I think this rapping Mona Lisa has really messed with my brain."
The first one is unbelievable haha
— Shushant Lakhyani (@shushant_l) April 18, 2024
Oh man. If only Da Vinci could witness this
— DreamBits (@DreamBits_) April 18, 2024
This is wild, freaky and creepy all at once 🤯
— Nikita Lebedev (@artofnikita) April 18, 2024
What is Microsoft’s VASA?
According to Microsoft's official website, VASA is a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip.
VASA-1, its flagship model, boasts the capability to synchronise lip movements with audio seamlessly while capturing a spectrum of facial nuances and natural head motions, lending authenticity and liveliness to virtual characters.
The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of an expressive and disentangled face latent space trained on videos.
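To make the idea of a disentangled face latent space concrete, here is a purely illustrative toy sketch in Python/NumPy. Every function name, dimension, and computation below is a hypothetical stand-in invented for illustration, not Microsoft's actual model: the point is only the structure, where a static identity code is extracted once from the photo, per-frame motion latents (facial dynamics plus head pose) are generated from the audio, and a decoder combines the two into video frames.

```python
# Toy, illustrative mock-up of a disentangled "identity + motion" pipeline.
# All names and sizes are hypothetical; none of this is VASA-1's real code.
import numpy as np

rng = np.random.default_rng(0)

def encode_identity(image: np.ndarray) -> np.ndarray:
    """Hypothetical encoder: reduce the still photo to a 64-d identity latent."""
    return image.reshape(-1)[:64]

def generate_motion_latents(audio: np.ndarray, n_frames: int) -> np.ndarray:
    """Hypothetical generator: map audio samples to one 32-d motion latent
    (facial dynamics + head pose) per output frame."""
    chunks = np.array_split(audio, n_frames)
    return np.stack([np.resize(c, 32) for c in chunks])

def decode_frame(identity: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Hypothetical decoder: combine the fixed identity with per-frame
    motion to produce one (toy 8x8) video frame."""
    return np.outer(identity[:8], motion[:8])

image = rng.random((64, 64))   # the single still photo
audio = rng.random(16000)      # one second of audio samples
n_frames = 40                  # one second of output at 40 FPS

identity = encode_identity(image)                    # computed once
motions = generate_motion_latents(audio, n_frames)   # one latent per frame
video = np.stack([decode_frame(identity, m) for m in motions])
print(video.shape)  # (40, 8, 8): one toy frame per motion latent
```

The disentanglement is what lets a single painting "perform" any audio clip: the identity latent is fixed by the photo, while only the motion latents vary with the soundtrack.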
"Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively," Microsoft said.
"Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors," it added.
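As a quick sanity check on that real-time claim, generating 512x512 video at 40 FPS means the system has at most 25 ms to produce each frame:

```python
fps = 40
budget_ms = 1000 / fps  # milliseconds available per generated frame
print(budget_ms)  # 25.0 ms per 512x512 frame at 40 FPS
```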