Backend for an interactive dance-learning platform: a multimodal AI model implemented in PyTorch, served with FastAPI, and deployable with Docker, which automatically identifies and labels dance moves in videos. Core components: video preprocessing, pose estimation, audio processing, and a multimodal segmentation model.
- Advanced pose estimation and motion feature extraction using MediaPipe
- Audio feature extraction for enhanced move detection
- Real-time multimodal dance move segmentation model
- User-friendly learning interface with customizable speeds and segment sizes
- Side-by-side webcam/video option with recording functionality
- Similarity score calculation for comparing dance performances
- Python 3.11 or higher
- FFmpeg
- CUDA-compatible GPU (optional, for faster inference)
- Docker (optional, for containerized deployment)
- Clone this repository:

  ```bash
  git clone https://github.com/your-username/dance-bits-api.git
  cd dance-bits-api
  ```
- Create a Conda environment and activate it:

  ```bash
  conda create --name dance-bits-api python
  conda activate dance-bits-api
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

  Note: You can also install via Conda, but some packages may not be available:

  ```bash
  conda install --file requirements.txt
  ```
- Set up environment variables in a `.env` file:

  ```
  WANDB_API_KEY=your_key
  WANDB_ORG=your_org
  WANDB_PROJECT=your_project
  WANDB_MODEL_NAME=your_model
  WANDB_MODEL_VERSION=your_version
  ```
- Install FFmpeg (required for video processing):

  - On Ubuntu/Debian:

    ```bash
    sudo apt-get update
    sudo apt-get install ffmpeg libsm6 libxext6
    ```

  - On macOS:

    ```bash
    brew install ffmpeg
    ```

  - On Windows: Download from the FFmpeg website and add it to your PATH
- Start the FastAPI server:

  ```bash
  uvicorn app.main:app --reload --host 0.0.0.0 --port 8080
  ```
- Access the API:

  - API documentation: http://localhost:8080/docs
  - Alternative API docs: http://localhost:8080/redoc
- Test video segmentation:

  ```bash
  curl -X POST "http://localhost:8080/predict/" \
    -H "accept: application/json" \
    -H "Content-Type: multipart/form-data" \
    -F "video=@path/to/your/dance_video.mp4" \
    -F "min_segmentation_prob=0.5"
  ```
- Test video comparison:

  ```bash
  curl -X POST "http://localhost:8080/compare/" \
    -H "accept: application/json" \
    -H "Content-Type: multipart/form-data" \
    -F "user_video=@path/to/user_video.mp4" \
    -F "teacher_video=@path/to/teacher_video.mp4"
  ```
- Model Loading Issues:

  - Ensure all environment variables are set correctly
  - Check that the model weights have downloaded properly
  - Verify CUDA availability if using a GPU (see the sketch after this list)
- Video Processing Issues:

  - Verify FFmpeg installation:

    ```bash
    ffmpeg -version
    ```

  - Check video format compatibility (MP4, AVI, and MOV are supported)
  - Ensure sufficient disk space for temporary files
- Memory Issues:

  - Reduce video resolution if experiencing out-of-memory (OOM) errors
  - Consider using CPU inference if GPU memory is limited (see the sketch after this list)
  - Monitor system resources during processing
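A minimal sketch covering both checks: confirming that PyTorch can see a GPU, and falling back to CPU inference when GPU memory runs out. The tiny `nn.Linear` model and random input are placeholders standing in for the real segmentation network:

```python
import torch
import torch.nn as nn

# Placeholders for the real multimodal model and its inputs.
model = nn.Linear(10, 2)
inputs = torch.randn(4, 10)

# Verify CUDA availability and pick a device accordingly.
print("CUDA available:", torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

try:
    with torch.no_grad():
        outputs = model(inputs.to(device))
except torch.cuda.OutOfMemoryError:
    # GPU ran out of memory: release cached blocks and retry on CPU.
    torch.cuda.empty_cache()
    device = torch.device("cpu")
    model = model.to(device)
    with torch.no_grad():
        outputs = model(inputs.to(device))

print("Ran inference on:", device)
```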
- Build the Docker image:

  ```bash
  docker build -t dancebits-api .
  ```
- Run the container:

  ```bash
  docker run -d --name dancebits-api \
    -p 8080:8080 \
    -e WANDB_API_KEY=your_key \
    -e WANDB_ORG=your_org \
    -e WANDB_PROJECT=your_project \
    -e WANDB_MODEL_NAME=your_model \
    -e WANDB_MODEL_VERSION=your_version \
    dancebits-api
  ```
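If you already created the `.env` file from the installation steps, you can pass it wholesale instead of repeating each variable (assuming the file sits in the directory where you run the command):

```bash
docker run -d --name dancebits-api \
  -p 8080:8080 \
  --env-file .env \
  dancebits-api
```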
The following environment variables are required for the application:

- `WANDB_API_KEY`: Weights & Biases API key
- `WANDB_ORG`: Weights & Biases organization name
- `WANDB_PROJECT`: Weights & Biases project name
- `WANDB_MODEL_NAME`: Name of the model to use
- `WANDB_MODEL_VERSION`: Version of the model to use
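One way an application like this can read and validate these at startup (a minimal sketch, assuming the `python-dotenv` package is installed; the variable names match the list above):

```python
import os

from dotenv import load_dotenv

# Read key=value pairs from .env into the process environment.
load_dotenv()

REQUIRED = [
    "WANDB_API_KEY",
    "WANDB_ORG",
    "WANDB_PROJECT",
    "WANDB_MODEL_NAME",
    "WANDB_MODEL_VERSION",
]

# Fail fast with a clear message if anything is missing.
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```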
`POST /predict/`

Segments a dance video into individual moves.

Parameters:

- `video`: Video file (MP4, AVI, or MOV)
- `min_segmentation_prob`: Minimum probability threshold for segmentation (default: 0.5)

Response:

```json
{
  "segmented_probs": [...],
  "segmented_percentages": [...]
}
```
`POST /compare/`

Calculates a similarity score between two dance videos.

Parameters:

- `user_video`: User's dance video file
- `teacher_video`: Teacher's reference video file

Response:

```json
{
  "similarity_score": float
}
```
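For orientation, a FastAPI endpoint matching the `/predict/` contract above would look roughly like this (a hedged sketch, not the project's actual source; `run_segmentation` is a hypothetical stub standing in for the real pipeline):

```python
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

def run_segmentation(data: bytes, threshold: float):
    """Hypothetical stand-in for the real multimodal segmentation pipeline."""
    return [0.1, 0.7, 0.9], [0.0, 33.3, 66.6]

@app.post("/predict/")
async def predict(
    video: UploadFile = File(...),             # MP4, AVI, or MOV upload
    min_segmentation_prob: float = Form(0.5),  # probability threshold
):
    contents = await video.read()
    probs, percentages = run_segmentation(contents, min_segmentation_prob)
    return {"segmented_probs": probs, "segmented_percentages": percentages}
```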
- Frame Extraction: Videos are processed frame by frame using OpenCV
- Pose Estimation: MediaPipe Pose is used to extract 35 bone vectors per frame (see the sketch after this list)
- Audio Processing:
  - Audio is extracted from the video using MoviePy
  - A mel spectrogram is generated using Librosa
  - Tempo analysis is performed for beat detection
- Model Inference:
  - Processes both visual (pose) and audio features
  - Returns frame-by-frame segmentation probabilities
- Post-processing:
  - Smoothing of segmentation probabilities
  - Dynamic adjustment based on beat detection
  - Segment identification based on probability thresholds
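To make the pose-estimation step concrete, here is a minimal sketch of per-frame landmark extraction with OpenCV and MediaPipe (assumptions: the `mediapipe.solutions` API; deriving the 35 bone vectors from these landmarks is a separate step not shown here):

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def extract_landmarks(video_path: str):
    """Return per-frame lists of (x, y, z, visibility) pose landmarks."""
    cap = cv2.VideoCapture(video_path)
    all_frames = []
    with mp_pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
            results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                all_frames.append(
                    [(lm.x, lm.y, lm.z, lm.visibility)
                     for lm in results.pose_landmarks.landmark]
                )
            else:
                all_frames.append(None)  # no person detected in this frame
    cap.release()
    return all_frames
```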
- The API supports both CPU and GPU inference
- Video processing is optimized for real-time performance
- Temporary files are automatically cleaned up after processing
- CORS is enabled for all origins by default
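The CORS default mentioned above typically corresponds to FastAPI's standard middleware setup (a sketch of the usual configuration, not necessarily this project's exact code):

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow requests from any origin; restrict allow_origins in production.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)
```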
We welcome contributions! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.