title | venue | paper | code | dataset | keywords
---|---|---|---|---|---
CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior | CVPR(23) | paper | code | BIWI, VOCA | 3D
DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation | CVPR(23) | paper | - | HDTF | Diffusion
AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction | CVPR(23) | paper | - | Multiface | 3D
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert | CVPR(23) | paper | code | LRS2 | -
LipFormer: High-fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook | CVPR(23) | paper | - | LRS2, FFHQ | -
Parametric Implicit Face Representation for Audio-Driven Facial Reenactment | CVPR(23) | paper | - | HDTF | -
Identity-Preserving Talking Face Generation with Landmark and Appearance Priors | CVPR(23) | paper | code | LRS2, LRS3 | -
High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning | CVPR(23) | paper | - | MEAD | emotion
Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks | InterSpeech(23) | paper | - | MEAD | emotion
EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation | ICCV(23) | paper | code (not yet released) | - | emotion
Emotionally Enhanced Talking Face Generation | - | paper | code | CREMA-D | emotion
DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video | AAAI(23) | paper | code | - | -
GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis | ICLR(23) | paper | code | - | NeRF
OPT: One-Shot Pose-Controllable Talking Head Generation | - | paper | - | - | -
LipNeRF: What is the right feature space to lip-sync a NeRF? | - | paper | - | - | NeRF
Audio-Visual Face Reenactment | WACV(23) | paper | code | - | -
Towards Generating Ultra-High Resolution Talking-Face Videos With Lip Synchronization | WACV(23) | paper | - | - | -
StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles | AAAI(23) | paper | code | - | -
Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation | - | paper | proj | - | Diffusion
Speech Driven Video Editing via an Audio-Conditioned Diffusion Model | - | paper | code | - | Diffusion
TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles | - | paper | - | Text-Annotated MEAD | Text
title | venue | paper | code | dataset | keywords
---|---|---|---|---|---
Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors | - | paper | proj | - | Diffusion
SPACE: Speech-driven Portrait Animation with Controllable Expression | ICCV(23) | paper | - | - | Pose, Emotion
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation | CVPR(23) | paper | code | - | -
Compressing Video Calls using Synthetic Talking Heads | BMVC(22) | paper | - | - | application
EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model | SIGGRAPH(22) | paper | - | - | emotion
Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis | ECCV(22) | paper | code | - | NeRF
Expressive Talking Head Generation with Granular Audio-Visual Control | CVPR(22) | paper | - | - | -
Talking Face Generation With Multilingual TTS | CVPR(22) | paper | code | - | -
Deep Learning for Visual Speech Analysis: A Survey | - | paper | - | - | survey
StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN | - | paper | code | - | StyleGAN
Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation | ECCV(22) | paper | code (coming soon) | - | NeRF
Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation | - | paper | - | - | -
SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory | AAAI(22) | paper (temp) | - | LRW, LRS2, BBC News | -
DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering | - | paper | - | - | NeRF
Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos | - | paper | - | - | -
Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions | - | paper | - | - | -
DialogueNeRF: Towards Realistic Avatar Face-to-face Conversation Video Generation | - | paper | - | - | NeRF
Talking Head Generation Driven by Speech-Related Facial Action Units and Audio-Based on Multimodal Representation Fusion | - | paper | - | - | -
StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation | - | paper | - | - | -
AutoLV: Automatic Lecture Video Generator | - | paper | - | - | -
Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement | - | paper | - | - | -
title | venue | paper | code | dataset
---|---|---|---|---
Depth-Aware Generative Adversarial Network for Talking Head Video Generation | - | paper | code | -
Parallel and High-Fidelity Text-to-Lip Generation | - | paper | - | -
[Survey] Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis | - | paper | - | -
FaceFormer: Speech-Driven 3D Facial Animation with Transformers | CVPR(22) | paper | code | |
Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices | - | paper | code | -
FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning | ICCV | paper | code | -
Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis | - | paper | code | -
Audio-Driven Emotional Video Portraits | CVPR | paper | code | MEAD, LRW |
LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization | CVPR | paper | ||
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation | CVPR | paper | code | VoxCeleb2, LRW |
Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset | CVPR | paper | code | HDTF |
MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement | ICCV | paper | code (coming soon) | -
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis | ICCV | paper | code | -
Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation | AAAI | paper | code (coming soon) | Mocap dataset
Visual Speech Enhancement Without A Real Visual Stream | - | paper | - | -
Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary | - | paper | code | -
Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion | IJCAI | paper | code | VoxCeleb, GRID, LRW |
3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head | - | paper | - | -
AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person | - | paper | - | VoxCeleb2, Obama
title | venue | paper | code | dataset
---|---|---|---|---
[Survey] What comprises a good talking-head video generation?: A survey and benchmark | - | paper | code | -
One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing | CVPR(21) | paper | code | |
Speech Driven Talking Face Generation from a Single Image and an Emotion Condition | - | paper | code | CREMA-D
A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild | ACMMM | paper | code | LRS2 |
Talking-head Generation with Rhythmic Head Motion | ECCV | paper | code | CREMA-D, GRID, VoxCeleb, LRS3
MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation | ECCV | paper | code | VoxCeleb2, AffectNet
Neural Voice Puppetry: Audio-driven Facial Reenactment | ECCV | paper | - | -
Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars | ECCV | paper | code | -
HeadGAN: Video-and-Audio-Driven Talking Head Synthesis | - | paper | - | VoxCeleb2
MakeItTalk: Speaker-Aware Talking Head Animation | - | paper | code, code | VoxCeleb2, VCTK
Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose | - | paper | code | ImageNet, FaceWarehouse, LRW
Photorealistic Lip Sync with Adversarial Temporal Convolutional Networks | - | paper | - | -
Speech-Driven Facial Animation Using Polynomial Fusion of Features | - | paper | - | LRW
Animating Face using Disentangled Audio Representations | WACV | paper | - | -
Everybody’s Talkin’: Let Me Talk as You Want | - | paper | - | -
Multimodal Inputs Driven Talking Face Generation With Spatial-Temporal Dependency | - | paper | - | -
title | venue | paper | code | dataset
---|---|---|---|---
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss | CVPR | paper | code | VGG Face, LRW |
- MEAD link
- HDTF link
- CREMA-D link
- VoxCeleb link
- LRS2 link
- LRW link
- GRID link
- SAVEE link
- BIWI(3D) link
- VOCA link
- Multiface(3D) link
- PSNR (peak signal-to-noise ratio)
- SSIM (structural similarity index measure)
- LMD (landmark distance error)
- LRA (lip-reading accuracy)
- FID (Fréchet inception distance)
- LSE-D (Lip Sync Error - Distance)
- LSE-C (Lip Sync Error - Confidence)
- LPIPS (Learned Perceptual Image Patch Similarity)
- NIQE (Natural Image Quality Evaluator)
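
Two of the simplest metrics above, PSNR and LMD, can be computed directly with NumPy. This is a minimal sketch for intuition only; the function names are ours, and the listed papers typically use their own implementations (and dedicated tools such as SyncNet for LSE-D/LSE-C):

```python
import numpy as np

def psnr(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between reference and generated frames."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def lmd(ref_lms: np.ndarray, gen_lms: np.ndarray) -> float:
    """Landmark distance: mean Euclidean distance over (N, 2) landmark arrays."""
    return float(np.mean(np.linalg.norm(ref_lms - gen_lms, axis=-1)))

# Toy example: two 8-bit frames differing by a constant offset of 10.
ref = np.zeros((64, 64), dtype=np.uint8)
gen = np.full((64, 64), 10, dtype=np.uint8)
print(round(psnr(ref, gen), 2))  # → 28.13
```

Higher PSNR and lower LMD indicate better reconstruction; note that PSNR/SSIM reward pixel fidelity rather than lip-sync quality, which is why LSE-D/LSE-C are usually reported alongside them.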