Alexandru Dinu edited this page Jan 3, 2019 · 27 revisions

Welcome to the cae wiki!

These models are inspired by [1, 2].

The input consists of 720p images from the YouTube-8M dataset (credit goes to gsssrao for the downloader and frame-generator scripts). The dataset consists of 121,827 frames.

Training

The images are padded from 1280x720 to 1280x768 (a 24-pixel pad on the top and bottom), so that each one can be split into 60 non-overlapping 128x128 patches. The model sees only a single patch per forward pass, i.e. there are 60 forward passes and optimization steps per image. The loss is computed per patch as MSELoss(orig_patch_ij, out_patch_ij), and the per-image loss is the average over the 60 patches.
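The padding and patching described above can be sketched as follows. This is a minimal illustration, not the repository's actual code; it assumes PyTorch, and `model` is a placeholder (an identity module) standing in for the autoencoder.

```python
import torch
import torch.nn.functional as F

def to_patches(img):
    """Pad a (3, 720, 1280) image to (3, 768, 1280) and split it
    into 60 non-overlapping 128x128 patches."""
    # F.pad takes (left, right, top, bottom) for the last two dims:
    # 24 px on top and bottom brings the height from 720 to 768.
    padded = F.pad(img, (0, 0, 24, 24))
    # unfold height into 6 windows and width into 10 windows of size 128
    patches = padded.unfold(1, 128, 128).unfold(2, 128, 128)  # (3, 6, 10, 128, 128)
    return patches.permute(1, 2, 0, 3, 4).reshape(60, 3, 128, 128)

img = torch.rand(3, 720, 1280)
patches = to_patches(img)

# `model` is a hypothetical stand-in for the autoencoder.
model = torch.nn.Identity()

# Per-patch MSE, averaged into a per-image loss.
loss = torch.stack([F.mse_loss(model(p), p) for p in patches]).mean()
```

In training, each of the 60 per-patch losses would drive its own optimization step; the averaging shown here only produces the reported per-image loss.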

[1] https://arxiv.org/abs/1703.00395
[2] http://arxiv.org/abs/1511.06085
