- Motivation & Implementation Goals
- Overview
- Architecture
- Data Pipeline
- Training
- Model Performance
- Directory Structure
- Usage
- Technical Notes
- Model Performance Notes
- Implemented this project to gain a hands-on understanding of how transformers process human motion data
- Focused particularly on how transformer architectures can separately handle:
- Spatial relationships: Understanding joint correlations within each frame
- Temporal patterns: Learning motion dynamics across frame sequences
- Used MotionBERT as inspiration while building a foundation for understanding pose transformers
- Deep dive into the transformer's capability to capture human pose dynamics:
- How self-attention mechanisms can model joint interdependencies
- How temporal attention layers process motion sequences
- The effectiveness of masked training in learning motion patterns
- Practical implementation of pose-specific model components like:
- MPJPE loss with Procrustes alignment for pose evaluation
- Custom data processing for skeletal joint sequences
- Masked training strategy for motion understanding
- Built a flexible training pipeline supporting both:
- Pre-training with masked joint prediction
- Fine-tuning for specific pose estimation tasks
- Implemented efficient data handling for motion sequences:
- Custom Dataset class handling variable-length sequences
- Frame thresholding and sequence splitting
- Batch collation with optional masking
- Structured codebase for experimental iterations:
- Modular architecture design
- Configurable model parameters
- Comprehensive checkpointing system
- This implementation serves as a learning platform for:
- Experimenting with different attention mechanisms for pose data
- Understanding trade-offs in temporal vs spatial feature processing
- Testing various architectural modifications for pose transformers
- Code structure allows easy adaptation for:
- Different pose estimation tasks
- Various data representations
- New model architectures building on transformer fundamentals
DSTFormer, inspired by the MotionBERT paper (Zhu et al., 2022), is a transformer-based architecture for human pose estimation that leverages dual-stream attention mechanisms to capture both spatial and temporal dependencies in human motion sequences. The model implements a novel fusion approach between spatial-temporal (ST) and temporal-spatial (TS) attention streams.
- Dual Stream Processing: Parallel processing of ST and TS attention streams
- Attention Mechanism: Multi-head self-attention with separate spatial and temporal attention computations
- Fusion Module: Learnable fusion mechanism between dual streams
- Position Encoding: Joint-wise positional embeddings and temporal embeddings
```python
DSTFormer(
    dim_in=2,         # Input dimension per joint
    dim_out=2,        # Output dimension per joint
    embed_size=64,    # Embedding dimension
    heads=8,          # Number of attention heads
    max_len=5,        # Maximum sequence length
    num_joints=17,    # Number of joints (H36M format)
    fusion_depth=2,   # Depth of fusion layers
    attn_depth=2,     # Depth of attention layers
    fusion=True       # Enable fusion mechanism
)
```
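A hypothetical forward pass under this configuration; the `[batch, frames, joints, channels]` tensor layout and the call signature are assumptions about the implementation, not verified behavior:

```python
import torch
from DSTFormer import DSTFormer  # assumes source/ is on the Python path

model = DSTFormer(dim_in=2, dim_out=2, embed_size=64, heads=8,
                  max_len=5, num_joints=17,
                  fusion_depth=2, attn_depth=2, fusion=True)

x = torch.randn(8, 5, 17, 2)  # [batch, frames, joints, dim_in]
y = model(x)                  # expected shape: [8, 5, 17, dim_out]
```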
- Input data: AMASS dataset converted to H36M format
- Joint representation: 17 joints in 3D space
- Sequence length: Variable (filtered based on threshold)
- Frame filtering with customizable threshold
- Sequence splitting into fixed-length windows
- Optional masking mechanism (15% probability) for training
- Batch collation with support for masked and unmasked data
- Reference dataset: AMASS
- An unofficial pre-processed version is available on Huggingface
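A minimal sketch of the thresholding and windowing steps above (illustrative only; the actual logic lives in `DataLoaders.py` and its names may differ):

```python
import numpy as np

def split_sequences(sequences, frame_threshold=1000, max_frames=5):
    """Drop clips shorter than the threshold, then cut the remainder
    into fixed-length windows (assumed semantics of the pipeline)."""
    windows = []
    for seq in sequences:                          # seq: [T, 17, 3] array
        if seq.shape[0] < frame_threshold:         # frame filtering
            continue
        for start in range(0, seq.shape[0] - max_frames + 1, max_frames):
            windows.append(seq[start:start + max_frames])
    return np.stack(windows)                       # [N, max_frames, 17, 3]
```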
Architecture Configuration:
- Embedding dimension: 64 with 8 attention heads
- Dual stream processing with 2-layer fusion depth
- Attention depth: 2 layers per stream
- Full parameter training: ~1.2M parameters
Training Protocol:
- Batch size: 32 with gradient accumulation every 4 steps
- Epochs: 201 with early stopping patience of 20
- Optimizer: AdamW (lr=1e-3, β1=0.9, β2=0.999, ε=1e-8)
- Weight decay: 1e-4 with gradient clipping at 1.0
- Loss: MPJPE
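MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions. A minimal version for reference (`loss_function.py` may add details such as the Procrustes-aligned variant mentioned earlier):

```python
import torch

def mpjpe(pred, target):
    """Mean Per Joint Position Error over all joints and frames."""
    # pred, target: [..., num_joints, 3] (or 2 for 2D poses)
    return torch.norm(pred - target, dim=-1).mean()
```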
Data Processing:
- Masked sequence modeling (15% frame masking)
- Gaussian noise injection (μ=0, σ=1) for masked frames
- Dynamic sequence splitting with max_frames=10
- Mixed precision training (FP16/FP32)
The masking mechanism (15% probability) serves a dual purpose in spatiotemporal learning:
Temporal Attention
- Masked frames force the model to:
- Learn continuous motion patterns by reconstructing missing frames
- Build connections between distant frames through attention weights
- Understand motion context from surrounding unmasked frames
Spatial Attention
- Joint masking helps the model:
- Learn relationships between connected joints in the skeleton
- Reconstruct anatomically valid poses using visible joints
- Maintain pose consistency through joint-to-joint attention
The dual stream design (ST→TS and TS→ST) processes these masked inputs differently, allowing the model to learn both pose structure and motion dynamics simultaneously. This creates a strong foundation for transfer learning to downstream tasks.
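A sketch of the corruption step described above, masking whole frames with Gaussian noise (the repository may additionally mask at the joint level, as the spatial-attention notes suggest):

```python
import torch

def mask_frames(seq, mask_prob=0.15, noise_std=1.0):
    """Replace ~15% of frames with N(0, 1) noise; return the corrupted
    sequence and the boolean mask used as the reconstruction target."""
    # seq: [frames, joints, channels]
    mask = torch.rand(seq.shape[0]) < mask_prob
    corrupted = seq.clone()
    corrupted[mask] = torch.randn_like(corrupted[mask]) * noise_std
    return corrupted, mask
```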
The fusion layer in DSTFormer combines the outputs from both spatial-temporal (ST) and temporal-spatial (TS) streams using a learned weighting mechanism:
Architecture
- Input: Two feature streams (ST and TS paths) of shape [B, J, C] each
- Concatenation: Features concatenated along channel dimension to [B, J, 2C]
- Learnable weights: Linear projection to 2D weights via fusion_model
- Softmax normalization: Ensures weights sum to 1
- Weighted combination: α₁·ST + α₂·TS, where α₁ + α₂ = 1
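The same mechanism expressed as code (a paraphrase of the description above, not the repository's exact module):

```python
import torch
import torch.nn as nn

class StreamFusion(nn.Module):
    """Learned weighting between the ST and TS feature streams."""
    def __init__(self, dim):
        super().__init__()
        self.fusion_model = nn.Linear(dim * 2, 2)  # [B, J, 2C] -> two weights

    def forward(self, x_st, x_ts):
        # x_st, x_ts: [B, J, C]
        alpha = self.fusion_model(torch.cat([x_st, x_ts], dim=-1))
        alpha = alpha.softmax(dim=-1)              # alpha_1 + alpha_2 = 1
        return x_st * alpha[..., 0:1] + x_ts * alpha[..., 1:2]
```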
Model Adaptation:
- Feature extraction: frozen backbone (~90% of parameters)
- Trainable parameters: ~8K (head only)
- Head architecture redesign:

```python
nn.Sequential(
    nn.Linear(embed_size, 128),
    nn.ReLU(),
    nn.Linear(128, 1)
)
```
Training Configuration:
- Batch size: 128 with automatic batch size scaling
- Epochs: 50 with validation-based early stopping
- Optimizer: AdamW with parameter-specific learning rates
- Head layers: lr=1e-3
- Layernorm: lr=5e-4
- Gradient accumulation steps: 2
- Loss: MPJPE with focal regularization (γ=2.0)
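A sketch of the parameter-specific optimizer setup; matching parameters by the substrings `'head'` and `'norm'` is an assumption about how the implementation names them:

```python
from torch.optim import AdamW

# Freeze the backbone; leave only the new head and the LayerNorm
# parameters trainable, each with its own learning rate.
for name, param in model.named_parameters():
    param.requires_grad = ('head' in name) or ('norm' in name.lower())

optimizer = AdamW([
    {'params': [p for n, p in model.named_parameters() if 'head' in n],
     'lr': 1e-3},
    {'params': [p for n, p in model.named_parameters()
                if 'norm' in n.lower() and 'head' not in n],
     'lr': 5e-4},
])
```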
Performance Monitoring:
- Validation metrics: MPJPE, PCK@150mm
- Checkpoint management: Top-3 models preserved
- Memory-efficient gradient checkpointing
- Automatic mixed precision for inference
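PCK@150mm counts a joint as correct when its error is below 150 mm; a minimal version, assuming millimeter-scale coordinates:

```python
import torch

def pck(pred, target, threshold=150.0):
    """Percentage of Correct Keypoints within `threshold` mm."""
    # pred, target: [..., num_joints, 3]
    dist = torch.norm(pred - target, dim=-1)
    return (dist < threshold).float().mean()
```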
- Mixed precision training with CUDA support
- AdamW optimizer
- MPJPE (Mean Per Joint Position Error) loss function
- Automatic checkpoint saving every 5 epochs
- Parameters: Configurable based on embedding size and attention heads
- Memory footprint: Varies with batch size and sequence length
- Training time: Dependent on hardware configuration
```
source/
├── DataLoaders.py    # Data loading and processing
├── DSTFormer.py      # Main model architecture
├── pre_train.py      # Pre-training script
├── train.py          # Fine-tuning script
└── loss_function.py  # Loss functions
```
```python
dataset = Dataset(
    data_path='path/to/data.pkl',
    frame_threshold=1000,
    max_frames=5,
    if_train=True
)

model = DSTFormer(
    dim_in=2,
    dim_out=2,
    embed_size=64,
    heads=8,
    max_len=5,
    num_joints=17
)
```
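Wiring the two together with a standard `DataLoader` (illustrative; the repository's collate function may return masked/unmasked pairs rather than a single tensor):

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    pred = model(batch)   # e.g. [32, 5, 17, 2] -> per-joint predictions
    break
```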
- Model uses custom attention mechanisms for both spatial and temporal dimensions
- Implements skip connections and layer normalization
- Supports both training from scratch and transfer learning
- Uses gradient scaling for mixed precision training
- Memory-efficient implementation with batch processing
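The gradient-scaling pattern referenced above, as a generic PyTorch sketch (reusing the illustrative `mpjpe` and `loader` from earlier; the actual loops live in `pre_train.py` and `train.py`):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # FP16 forward pass
        loss = mpjpe(model(inputs), targets)
    scaler.scale(loss).backward()              # FP16-safe scaled backward
    scaler.unscale_(optimizer)                 # recover true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
    scaler.step(optimizer)
    scaler.update()
```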
- Training was performed with limited GPU resources
- Used a smaller batch size (32) and a reduced embedding dimension (64) compared to the paper
- Training time: ~10 hours for 150 epochs on a single GPU (NVIDIA L40S, 48GB)
- Performance impacted by computational constraints
- Dual-stream architecture with spatial and temporal attention
- Fusion mechanism between streams using learnable weights
- Position and temporal embeddings
- Multi-head attention for both spatial and temporal dimensions
- Skip connections and layer normalization
- Transfer learning capability
- Reduced model size (embedding dim 64 vs 256 in paper)
- Fewer attention heads (8 vs 16)
- Shorter sequence length (5-10 frames vs 81)
- Simplified MLP structure
- Focus only on 2D-to-3D pose uplifting rather than full motion prediction
- No curriculum learning strategy
- No data augmentation techniques
- Skipped motion discriminator component
- No velocity prediction branch
- Performance gap due to hardware constraints
- Limited sequence modeling capability from shorter sequences
- Reduced model capacity from smaller architecture
- No motion smoothness enforcement without discriminator
- Training stability issues with small batch sizes