[dataset] add shuffle at shards tar/raw file level #2424

Merged
merged 3 commits into from
Mar 20, 2024


kakashidan
Contributor

No description provided.

@xingchensong xingchensong requested a review from Mddct March 19, 2024 13:12
@Mddct
Collaborator

Mddct commented Mar 19, 2024

The raw and shard source datasets need a shuffle parameter. They did not shuffle originally, so without it the unit tests will not pass.

@kakashidan
Contributor Author

> The raw and shard source datasets need a shuffle parameter. They did not shuffle originally, so without it the unit tests will not pass.

OK.

@kakashidan
Contributor Author

> The raw and shard source datasets need a shuffle parameter. They did not shuffle originally, so without it the unit tests will not pass.

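Added two parameters: list_shuffle controls shuffling at the tar or raw list level (as distinct from sample-level shuffling), and list_shuffle_size controls the shuffle buffer size, which defaults to 10000. If several data.list files are concatenated, the shuffle size should be large enough to keep the data as fully randomized as possible.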
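
To illustrate why the buffer size matters when several lists are concatenated, here is a minimal sketch of a buffered shuffle. `buffered_shuffle` is a hypothetical helper written for this illustration, not WeNet's implementation; it assumes a simple "fill the buffer, shuffle it, emit it" strategy.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int,
                     seed: int = 0) -> Iterator[T]:
    """Hypothetical helper: shuffle a stream in chunks of `buffer_size`."""
    rng = random.Random(seed)
    buf: List[T] = []
    for item in items:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Entries only mix with others inside the same buffer.
            rng.shuffle(buf)
            yield from buf
            buf = []
    # Flush whatever is left at the end of the stream.
    rng.shuffle(buf)
    yield from buf


# Two concatenated lists, "a*" entries followed by "b*" entries.
entries = [f"a{i}" for i in range(4)] + [f"b{i}" for i in range(4)]

# Small buffer: the first four outputs can only come from the first list.
small = list(buffered_shuffle(entries, buffer_size=4))

# Buffer at least as large as the whole list: entries from both lists can mix.
large = list(buffered_shuffle(entries, buffer_size=10000))
```

With a buffer smaller than one of the concatenated lists, entries from different lists never land in the same buffer, so the output keeps the coarse ordering of the input. A buffer that holds the whole list removes that restriction.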

@Mddct
Collaborator

Mddct commented Mar 19, 2024

The default value could simply be sys.max

@Mddct Mddct merged commit 605384a into wenet-e2e:main Mar 20, 2024
6 checks passed
Comment on lines 398 to +402
 self.dp = TextLineDataPipe(filenames).repeat(cycle).prefetch(
-    prefetch).shard(partition)
+    prefetch)
+if shuffle:
+    self.dp = self.dp.shuffle(buffer_size=shuffle_size)
+self.dp = self.dp.shard(partition)
Member

Shouldn't this shuffle come before prefetch? @Mddct
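
For context on the ordering in the reviewed lines, where the list-level shuffle runs before shard(partition), here is a minimal sketch of shuffle-then-shard on a plain Python list. `shuffle_then_shard` and the file names are hypothetical; the sketch assumes every worker uses the same seed, which the PR does not state, and it does not address where prefetch belongs.

```python
import random
from typing import List, Sequence


def shuffle_then_shard(entries: Sequence[str], rank: int, world_size: int,
                       seed: int) -> List[str]:
    """Hypothetical helper: shuffle the full list, then take this rank's slice."""
    order = list(entries)
    # Assumption: all ranks share the seed, so all see the same permutation.
    random.Random(seed).shuffle(order)
    # Round-robin split of the shuffled list across workers.
    return order[rank::world_size]


entries = [f"shard_{i:03d}.tar" for i in range(10)]
world_size = 4
parts = [shuffle_then_shard(entries, rank, world_size, seed=2024)
         for rank in range(world_size)]
```

Because every rank permutes the list identically before slicing, the partitions are disjoint and together cover every entry. If the ranks shuffled with different seeds before sharding, that guarantee would no longer hold.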

@kakashidan kakashidan deleted the fix-first_stage_shuffle branch March 23, 2024 14:34