
CAT training dataset question | Issue with datasets utilized in CAT-Protocol #65

Closed
lmdsx opened this issue Jan 17, 2025 · 9 comments
lmdsx commented Jan 17, 2025

Hello, I am trying to reproduce your dataset setup but ran into a problem. In the CAT protocol, does CASIA2.0 contribute only its 5123 tampered images to sampling? The format you provide is as follows:

[
[
"ManiDataset",
"/mnt/data0/public_datasets/IML/CASIA2.0"
],
[
"JsonDataset",
"/mnt/data0/public_datasets/IML/FantasticReality_v1/FantasticReality.json"
],
whereas CAT-Net organizes its training datasets like this:
class SplicingDataset(Dataset):
    def __init__(self, crop_size, grid_crop, blocks=('RGB',), mode="train", DCT_channels=3, read_from_jpeg=False, class_weight=None):
        self.dataset_list = []
        if mode == "train":
            self.dataset_list.append(FantasticReality(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/FR_train_list.txt"))
            self.dataset_list.append(FantasticReality(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/FR_auth_train_list.txt", is_auth_list=True))
            self.dataset_list.append(IMD2020(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/IMD_train_list.txt", read_from_jpeg=read_from_jpeg))
            self.dataset_list.append(CASIA(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/CASIA_v2_train_list.txt", read_from_jpeg=read_from_jpeg))
            self.dataset_list.append(CASIA(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/CASIA_v2_auth_train_list.txt", read_from_jpeg=read_from_jpeg))
            # self.dataset_list.append(tampCOCO(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/cm_COCO_train_list.txt"))
            # self.dataset_list.append(tampCOCO(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/sp_COCO_train_list.txt"))
            # self.dataset_list.append(tampCOCO(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/bcm_COCO_train_list.txt"))
            # self.dataset_list.append(tampCOCO(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/bcmc_COCO_train_list.txt"))
            # self.dataset_list.append(compRAISE(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/compRAISE_train.txt"))
CAT-Net samples authentic and tampered images separately, organizing ten dataset types in total, which leaves me a bit confused. I hope you can clarify this. Thank you!

SunnyHaze (Contributor) commented Jan 17, 2025

Hi, thanks for your interest in our work.

It is true that for CASIAv2 we used only the tampered images, not the authentic ones, which may have caused some ambiguity.

The results reported in the paper indeed do not include the authentic CASIAv2 images, but given the 1800-sample strategy, the difference should not be large. We will make sure to clarify this in future versions.

lmdsx (Author) commented Jan 17, 2025

> Hi, thanks for your interest in our work.
>
> It is true that for CASIAv2 we used only the tampered images, not the authentic ones, which may have caused some ambiguity.
>
> The results reported in the paper indeed do not include the authentic CASIAv2 images, but given the 1800-sample strategy, the difference should not be large. We will make sure to clarify this in future versions.

Thanks for the reply. May I also ask whether the FantasticReality dataset is likewise built from tampered images only? If possible, I would really like to know how many images from each dataset appear in the JSON files. Could you consider open-sourcing the JSON file for each dataset? That would help more people train against the same standard.

lmdsx (Author) commented Jan 20, 2025

If there are constraints that prevent releasing the JSON files, could you at least tell me whether FantasticReality is sampled for training from tampered images only, or whether 1800 images are sampled from both authentic and tampered images? I would like to unify all the standards so I can compare my own model and reproduce related models. Finally, sincere thanks for your contributions to the field of image manipulation detection.

SunnyHaze (Contributor) commented Jan 21, 2025


Hello, over the past few days we have carefully reviewed the training scripts used in the BenCo paper and retrained the models to verify the checkpoint results. We have identified some issues with the CAT-Net protocol and are making a statement here; we apologize for any inconvenience this has caused.

  1. The results we reported do not include the compRAISE real-image dataset. This change was added in an internal version update, but the version actually used for the runs was never updated.
  2. We observed that the original CAT-Net repository uses real (authentic) images, but we never introduced the CASIAv2 real images at any point.
  3. In the balanced_dataset, the number of samples drawn from each sub-dataset per batch is 1840, not 2010. This also differs from the 1869 in the original CAT-Net repository, possibly due to differences in data cleaning. However, the IMD20 dataset we ultimately used contains 2010 usable samples (see the sketch after this list).
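
For anyone reproducing this protocol, here is a minimal sketch of the balanced-sampling idea described in point 3. It is not the actual IMDLBenCo implementation: the function name, the per-round resampling granularity, and the use of torch's Subset/ConcatDataset are assumptions made for illustration.

```python
import random
from torch.utils.data import ConcatDataset, Subset

def build_balanced_round(sub_datasets, samples_per_dataset=1840, seed=0):
    """Draw the same number of random samples from every sub-dataset,
    so that no single dataset dominates one round of training."""
    rng = random.Random(seed)
    subsets = []
    for ds in sub_datasets:
        n = len(ds)
        if n >= samples_per_dataset:
            # Sample without replacement when the dataset is large enough.
            indices = rng.sample(range(n), samples_per_dataset)
        else:
            # Fall back to sampling with replacement for small datasets.
            indices = [rng.randrange(n) for _ in range(samples_per_dataset)]
        subsets.append(Subset(ds, indices))
    return ConcatDataset(subsets)
```

Re-seeding with, say, the epoch index each time this is called would yield a fresh 1840-sample draw per round.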

For the above three issues, we will add emphasis and clarification in updates to the GitHub homepage and the arXiv version. We sincerely apologize for the inconvenience and confusion caused to all researchers.

In particular, after careful review we can confirm that all models and ablation experiments under the CAT-Net protocol were conducted under the above-mentioned unified standards, so the results reported in the paper remain fair and meaningful as a reference.

Additionally, the IMDLBenCo codebase aims to provide an easy, accurate, and efficient way to reproduce previous models and develop your own models in PyTorch. This purpose remains unchanged, and our clarification of these details is meant to ensure that everything aligns with that original intent.

No one is perfect. We should stand firm and learn from criticism. Open-source itself is intended to provide a platform for community oversight and error correction. We hope the community will understand and that future research can also embrace the spirit of open-source, supervision, and iteration.



SunnyHaze (Contributor) commented:

Also, we will include all of these JSON files in a separate area of the repository for reference. However, since they all contain absolute paths, we are not currently planning to have the BenCo code deploy them directly. You can still inspect the files listed in each JSON through the repository to verify that every protocol exactly matches what is reported in our paper, unified and standard.
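
As a side note for anyone reusing those JSON files locally, a hypothetical snippet like the one below could rewrite the absolute-path prefix. The file names, prefixes, and the assumption that entries are strings or lists of strings are illustrative, not the released format.

```python
import json

# Hypothetical prefixes: the released JSONs contain server-side absolute paths
# (e.g. /mnt/data0/public_datasets/IML/...), which need remapping on another machine.
OLD_PREFIX = "/mnt/data0/public_datasets/IML"
NEW_PREFIX = "/path/to/your/local/IML"

def remap(value):
    # Rewrite path strings; recurse into lists; leave everything else untouched.
    if isinstance(value, str):
        return value.replace(OLD_PREFIX, NEW_PREFIX)
    if isinstance(value, list):
        return [remap(v) for v in value]
    return value

with open("FantasticReality_sample.json", "r", encoding="utf-8") as f:  # placeholder file name
    entries = json.load(f)

with open("FantasticReality_local.json", "w", encoding="utf-8") as f:
    json.dump(remap(entries), f, indent=2, ensure_ascii=False)
```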

SunnyHaze (Contributor) commented:

Best wishes. If any new questions come up, you are welcome to discuss them here!

SunnyHaze changed the title from "CAT training dataset question" to "CAT training dataset question | Issue with datasets utilized in CAT-Protocol" on Jan 21, 2025
lmdsx (Author) commented Jan 21, 2025

Thank you very much for your reply. Among all the image-manipulation-detection repositories I have worked with, I have never come across one that documents the training process in such detail and answers every question so quickly and thoroughly. Once the JSON files are open-sourced, I will close this issue. Finally, sincere best wishes for your future research!

SunnyHaze (Contributor) commented Jan 22, 2025

Hi, please check the sample JSON files we used for each dataset here; questions and discussion are welcome:


https://github.com/scu-zjz/IMDLBenCo/tree/main/IMDLBenCo/statics/dataset_samples

Due to their large size, the files are not suitable for storing in the GitHub repository; instead, these sample JSON files are stored on Google Drive:


https://drive.google.com/drive/folders/1EQJT9rkJWbDaoVUqHceIwHzBAF4a3jCm?usp=sharing
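
To double-check the sample counts discussed above (for example the 1840-per-dataset figure), a small script like this could count the entries in each downloaded JSON. The directory name is a placeholder, and the assumption that each file's top level is a list of samples may need adjusting to the actual schema.

```python
import json
from pathlib import Path

sample_dir = Path("dataset_samples")  # wherever the Google Drive JSONs were saved

for json_file in sorted(sample_dir.glob("*.json")):
    with open(json_file, "r", encoding="utf-8") as f:
        entries = json.load(f)
    # Assumes the top-level object is a list of samples.
    print(f"{json_file.name}: {len(entries)} entries")
```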

lmdsx (Author) commented Jan 22, 2025

Thank you for sharing!

lmdsx closed this as completed Jan 22, 2025