
Data processing module issue: train_texts has length 1 #11

Open
LightingFx opened this issue Jan 22, 2024 · 0 comments
Hello, I have a question about the data processing module. My pretraining (pt) data path contains 5 txt files in total, and all of them load fine during the loading stage, as shown below:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:13<00:00, 2.68s/it]
2024-01-22 17:31:05.728 | INFO | component.dataset:load_dataset:120 - Total num of training text: 5
2024-01-22 17:31:05.728 | INFO | component.dataset:load_dataset:123 - Start tokenizing data ...
0%| | 0/1 [00:00<?, ?it/s]
As you can see, once loading finishes and tokenizing begins, the total count is only 1. I looked at component/dataset.py and printed the length of train_texts inside load_dataset: it is indeed 1. In other words, when all the data is added to the train_texts list, an extra layer of [] gets introduced. If that is the case, isn't the stride wrong when tokenizing with `for i in tqdm(range(0, len(train_texts), self.tokenize_batch))`? On top of that, this step frequently runs out of memory in my tests (320 GB of RAM), which looks like the loop going wrong. I don't understand where the extra list layer comes from; at which point in reading the txt files does it go wrong?
If I change the code in load_dataset to `train_texts = train_texts[0]`, the total that tqdm reports looks correct, but will the data still be processed correctly that way? I'd appreciate an answer.
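For reference, a minimal, self-contained sketch of the suspected nesting bug and a flattened alternative. The file paths, the `load_file` helper, and the `tokenize_batch` value below are hypothetical stand-ins, not the repo's actual code:

```python
from tqdm import tqdm

# Hypothetical stand-in for reading one txt file; returns a list of samples.
def load_file(path):
    return [f"{path}: sample {j}" for j in range(1000)]

paths = [f"data/{i}.txt" for i in range(5)]  # stand-in for the 5 txt files

# Suspected buggy pattern: all texts get wrapped in one extra list, so
# len(train_texts) == 1 no matter how many samples were loaded.
train_texts = []
train_texts.append([t for p in paths for t in load_file(p)])
print(len(train_texts))  # 1 -> tqdm shows 0/1

# The batched loop from the issue: with len(train_texts) == 1 the range
# yields a single index, and that one "batch" is the entire nested list,
# which can blow up memory downstream.
tokenize_batch = 256
for i in tqdm(range(0, len(train_texts), tokenize_batch)):
    batch = train_texts[i : i + tokenize_batch]  # [[...all 5000 texts...]]

# Flattened alternative: extend keeps train_texts one level deep, so the
# batch arithmetic matches the real sample count.
train_texts = []
for p in paths:
    train_texts.extend(load_file(p))
print(len(train_texts))  # 5000, batched in chunks of tokenize_batch
```

With a flat list, the `train_texts = train_texts[0]` workaround becomes unnecessary; that workaround only happens to work while every sample sits inside the single inner list.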
