
About music generation with perceiver-ar model #3

Open
feizc opened this issue Jun 29, 2022 · 6 comments

Comments

@feizc

feizc commented Jun 29, 2022

Hi, @lucidrains

Thanks for the implementation of the Perceiver-AR model.
We conducted experiments on pop music generation at https://github.com/feizc/Perceiver-Music-Generation.
The results are encouraging; we are grateful to you :)

@lucidrains
Owner

🎶🤖😄

@lucidrains
Owner

@feizc how are you approaching the problem of generating starting from a length that is less than the prefix?

@feizc
Author

feizc commented Jun 30, 2022

> @feizc how are you approaching the problem of generating starting from a length that is less than the prefix?

Actually, I use a fixed-length conditional context, i.e., a prefix of prior music, and continue writing the next melody from it.

In my opinion, to start from scratch we can either use a special token like [pad] to fill out the prefix length, or use only the decoder to generate an initial sequence and then generate conditioned on the latents.

I read the source code and found that the authors begin with zeros :)


```python
def gen_initial_events():
    events = np.zeros([device_count, batch_size, max_events_length], np.int32)
    events[:, :, 0] = dataset.SOS_ID
    return events
```
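
For illustration, a minimal sketch of the [pad]-supplement idea above; the `pad_prompt` helper and the `PAD_ID` / `SOS_ID` / `prefix_length` values are just placeholders, not code from either repository:

```python
import numpy as np

# hypothetical token ids; the real values depend on your vocabulary
PAD_ID = 0
SOS_ID = 1

def pad_prompt(prompt_tokens, prefix_length):
    """Left-pad a short prompt with [pad] tokens so it reaches the fixed
    prefix length the model expects as conditional context."""
    prompt = np.asarray(prompt_tokens, dtype=np.int32)
    if len(prompt) >= prefix_length:
        return prompt[-prefix_length:]  # keep only the most recent tokens
    padding = np.full(prefix_length - len(prompt), PAD_ID, dtype=np.int32)
    return np.concatenate([padding, prompt])

# e.g. start generation from nothing but an SOS token
initial_prefix = pad_prompt([SOS_ID], prefix_length=1024)
```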

@usryokousha

After reviewing the current implementation (autoregressive_wrapper), it seems you generate each subsequent token one at a time, as would be the case in most architectures. The authors of the perceiver-ar paper outlined a strided approach (typically the size of the self-attention sequence length) where the sampled tokens would be cached up to a certain size and then the buffer would be freed. Have you considered implementing this? The actual released perceiver-ar implementation is relatively easy to follow.
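
For concreteness, a rough sketch of the strided idea; the model interface (returning `(logits, cache)` and accepting a `cache` keyword) and the `stride` handling are assumptions for illustration, not the released API:

```python
import torch

@torch.no_grad()
def generate_strided(model, prompt, total_len, stride):
    """Sample `stride` tokens at a time, reusing a key/value cache within each
    stride and freeing the buffer at stride boundaries. The model interface
    (returning (logits, cache) and taking a cache kwarg) is assumed here."""
    tokens = prompt.clone()
    while tokens.shape[-1] < total_len:
        cache = None  # fresh buffer for this stride
        for _ in range(min(stride, total_len - tokens.shape[-1])):
            logits, cache = model(tokens, cache=cache)  # assumed signature
            probs = torch.softmax(logits[:, -1], dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            tokens = torch.cat([tokens, next_token], dim=-1)
    return tokens
```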

@lucidrains
Owner

> After reviewing the current implementation (autoregressive_wrapper), it seems you generate each subsequent token one at a time, as would be the case in most architectures. The authors of the perceiver-ar paper outlined a strided approach (typically the size of the self-attention sequence length) where the sampled tokens would be cached up to a certain size and then the buffer would be freed. Have you considered implementing this? The actual released perceiver-ar implementation is relatively easy to follow.

noo not yet, i haven't implemented their special caching strategy at inference

but if i keep hearing more positive results, i may implement it! have to admit i was doubtful about the architecture initially

@usryokousha

I’m curious to see how well this would work at inference, particularly when using a VQ-VAE / VQGAN to encode images. If you could decode in only a few steps, that would really speed up generation. I suspect quality would suffer, but the paper’s results seem promising w.r.t. the ImageNet results.
