Summary:
- I used 818 short videos for training. One or more 150-frame sequences were taken from each video at frame rate 30 and resolution 256 (width) x 160 (height), resulting in 7218 sequences in total.
- The contents of these videos are limited to plants moving in the wind. I collected them from YouTube by searching plant names, places ("mountain", "meadow", " arboretum"...), seasons, "wind", "tripod" (to avoid camera motion), "no looping" (to avoid duplicates), etc.
- Optical flow were estimated using pyflow with these parameters. For each sequence, the optical flow from the first frame to the other frames (rather than between consecutive frames) are needed for training the U-Net.
- FFT were normalized for the diffusion model by scaling the coefficients with frequency-specific factors, which are based on training data statistics data/labels/fft_std.npy.
The collected videos were cut into short videos that are single camera shots. These short videos were filtered to remove duplicates and camera motion.
The training code assumes that the estimations are stored as NPY files. I stored them in half-precision to save space.
Sequences with median optical flow magnitude larger than 0.5 were removed (excluded from data/labels/motion_synthesis_train_set.csv), to filter out incorrect optical flow estimations under this setting.
The paper applied "frequency adaptive normalization" to the FFT of optical flow, which is to first divide the coefficients at each frequency by the 95th percentile of the magnitudes and then take square-root while keeping the sign.
I used scaled standard deviations instead of percentiles for scaling and omitted the square-root transformation. At each frequency, FFT coefficients were divided by the standard deviation from zero calculated on the training set and by a scale
parameter. The standard deviations were saved in data/labels/fft_std.npy and scale
was set to 2.82, so that around 3% of the coefficients were out of the range [-1, 1] after normalization.
FrameSpectrumDataset
in utils/dataset.py with methods get_std_from_zero
and test_scales
can be used if you wish to recalculate the standard deviations or adjust the scale
parameter.
I had 818 training videos with at least one sequence that has valid optical flow estimation. They are videos of plants gently moving in the wind. Below are the first frames of 64 videos randomly sampled from the set.