
CAT training dataset question | Issue with datasets utilized in CAT-Protocol #65

Closed
lmdsx opened this issue Jan 17, 2025 · 9 comments
lmdsx commented Jan 17, 2025

Hello, I am trying to reproduce your dataset setup but ran into a problem. In the CAT protocol, does CASIA2.0 contribute only its 5123 tampered images to sampling? The format you provide is as follows:

[
[
"ManiDataset",
"/mnt/data0/public_datasets/IML/CASIA2.0"
],
[
"JsonDataset",
"/mnt/data0/public_datasets/IML/FantasticReality_v1/FantasticReality.json"
],
whereas CAT-Net organizes its training datasets like this:
class SplicingDataset(Dataset):
    def __init__(self, crop_size, grid_crop, blocks=('RGB',), mode="train", DCT_channels=3, read_from_jpeg=False, class_weight=None):
        self.dataset_list = []
        if mode == "train":
            self.dataset_list.append(FantasticReality(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/FR_train_list.txt"))
            self.dataset_list.append(FantasticReality(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/FR_auth_train_list.txt", is_auth_list=True))
            self.dataset_list.append(IMD2020(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/IMD_train_list.txt", read_from_jpeg=read_from_jpeg))
            self.dataset_list.append(CASIA(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/CASIA_v2_train_list.txt", read_from_jpeg=read_from_jpeg))
            self.dataset_list.append(CASIA(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/CASIA_v2_auth_train_list.txt", read_from_jpeg=read_from_jpeg))
            # self.dataset_list.append(tampCOCO(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/cm_COCO_train_list.txt"))
            # self.dataset_list.append(tampCOCO(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/sp_COCO_train_list.txt"))
            # self.dataset_list.append(tampCOCO(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/bcm_COCO_train_list.txt"))
            # self.dataset_list.append(tampCOCO(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/bcmc_COCO_train_list.txt"))
            # self.dataset_list.append(compRAISE(crop_size, grid_crop, blocks, DCT_channels, "Splicing/data/compRAISE_train.txt"))
CAT-Net samples authentic and tampered images separately, organizing ten dataset types in total, which leaves me a bit confused. I hope you can clarify this. Thank you!

SunnyHaze (Contributor) commented Jan 17, 2025

Hi, thanks for your interest in our work.

It is true that for CASIAv2 we used only the tampered images, not the authentic ones, which may have caused some ambiguity.

The results reported in the paper indeed do not include the authentic CASIAv2 images, but given the 1800-sample strategy, the difference should not be large. We will make sure to clarify this in future versions.

lmdsx (Author) commented Jan 17, 2025

> Hi, thanks for your interest in our work.
>
> It is true that for CASIAv2 we used only the tampered images, not the authentic ones, which may have caused some ambiguity.
>
> The results reported in the paper indeed do not include the authentic CASIAv2 images, but given the 1800-sample strategy, the difference should not be large. We will make sure to clarify this in future versions.

Thanks for the reply. May I also ask whether the FantasticReality dataset is likewise built from tampered images only? If possible, I would really like to know how many images from each dataset appear in the JSON files. Could you consider open-sourcing the JSON file for each dataset? That would help more people train against the same standard.

lmdsx (Author) commented Jan 20, 2025

If there are constraints that prevent releasing the JSON files, could you at least tell me whether FantasticReality is sampled for training from tampered images only, or whether 1800 images are sampled from both authentic and tampered images? I would like to unify all the standards so I can compare my own model and reproduce related models. Finally, sincere thanks for your contributions to the field of image manipulation detection.

SunnyHaze (Contributor) commented Jan 21, 2025


Hello, over the past few days we have carefully reviewed the training scripts used in the BenCo paper and retrained the models to verify the checkpoint results. We have identified some issues with the CAT-Net protocol and are making a statement here; we apologize for any inconvenience this has caused.

  1. The results we reported do not include the compRAISE real-image dataset. This change was added in an internal version update, but the version actually used for the runs was never updated.
  2. We observed that the original CAT-Net repository uses real (authentic) images, but we never introduced the CASIAv2 real images at any point.
  3. In the balanced_dataset, the number of samples drawn from each sub-dataset per batch is 1840, not 2010. This also differs from the 1869 in the original CAT-Net repository, possibly due to differences in data cleaning. However, the IMD20 dataset we ultimately used contains 2010 usable samples (see the sketch after this list).
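
For anyone reproducing this protocol, here is a minimal sketch of the balanced-sampling idea described in point 3. It is not the actual IMDLBenCo implementation: the function name, the per-round resampling granularity, and the use of torch's Subset/ConcatDataset are assumptions made for illustration.

```python
import random
from torch.utils.data import ConcatDataset, Subset

def build_balanced_round(sub_datasets, samples_per_dataset=1840, seed=0):
    """Draw the same number of random samples from every sub-dataset,
    so that no single dataset dominates one round of training."""
    rng = random.Random(seed)
    subsets = []
    for ds in sub_datasets:
        n = len(ds)
        if n >= samples_per_dataset:
            # Sample without replacement when the dataset is large enough.
            indices = rng.sample(range(n), samples_per_dataset)
        else:
            # Fall back to sampling with replacement for small datasets.
            indices = [rng.randrange(n) for _ in range(samples_per_dataset)]
        subsets.append(Subset(ds, indices))
    return ConcatDataset(subsets)
```

Re-seeding with, say, the epoch index each time this is called would yield a fresh 1840-sample draw per round.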

For the above three issues, we will add emphasis and clarification in updates to the GitHub homepage and the arXiv version. We sincerely apologize for the inconvenience and confusion caused to all researchers.

In particular, after careful review we can confirm that all models and ablation experiments under the CAT-Net protocol were conducted under the above-mentioned unified standards, so the results reported in the paper remain fair and meaningful as a reference.

Additionally, the IMDLBenCo codebase aims to provide an easy, accurate, and efficient way to reproduce previous models and develop your own models in PyTorch. This purpose remains unchanged, and our clarification of these details is meant to ensure that everything aligns with that original intent.

No one is perfect. We should stand firm and learn from criticism. Open-source itself is intended to provide a platform for community oversight and error correction. We hope the community will understand and that future research can also embrace the spirit of open-source, supervision, and iteration.



SunnyHaze (Contributor) commented:

Also, we will include all of these JSON files in a separate area of the repository for reference. However, since they all contain absolute paths, we are not currently planning to have the BenCo code deploy them directly. You can still inspect the files listed in each JSON through the repository to verify that every protocol exactly matches what is reported in our paper, unified and standard.
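
As a side note for anyone reusing those JSON files locally, a hypothetical snippet like the one below could rewrite the absolute-path prefix. The file names, prefixes, and the assumption that entries are strings or lists of strings are illustrative, not the released format.

```python
import json

# Hypothetical prefixes: the released JSONs contain server-side absolute paths
# (e.g. /mnt/data0/public_datasets/IML/...), which need remapping on another machine.
OLD_PREFIX = "/mnt/data0/public_datasets/IML"
NEW_PREFIX = "/path/to/your/local/IML"

def remap(value):
    # Rewrite path strings; recurse into lists; leave everything else untouched.
    if isinstance(value, str):
        return value.replace(OLD_PREFIX, NEW_PREFIX)
    if isinstance(value, list):
        return [remap(v) for v in value]
    return value

with open("FantasticReality_sample.json", "r", encoding="utf-8") as f:  # placeholder file name
    entries = json.load(f)

with open("FantasticReality_local.json", "w", encoding="utf-8") as f:
    json.dump(remap(entries), f, indent=2, ensure_ascii=False)
```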

SunnyHaze (Contributor) commented:

Best wishes. If any new questions come up, you are welcome to discuss them here!

SunnyHaze changed the title from "CAT training dataset question" to "CAT training dataset question | Issue with datasets utilized in CAT-Protocol" on Jan 21, 2025
lmdsx (Author) commented Jan 21, 2025

Thank you very much for your reply. Among all the image-manipulation-detection repositories I have worked with, I have never come across one that documents the training process in such detail and answers every question so quickly and thoroughly. Once the JSON files are open-sourced, I will close this issue. Finally, sincere best wishes for your future research!

SunnyHaze (Contributor) commented Jan 22, 2025

Hi, please check the sample JSON files we used for each dataset here; questions and discussion are welcome:


https://github.com/scu-zjz/IMDLBenCo/tree/main/IMDLBenCo/statics/dataset_samples

Due to their large size, the files are not suitable for storing in the GitHub repository; instead, these sample JSON files are stored on Google Drive:


https://drive.google.com/drive/folders/1EQJT9rkJWbDaoVUqHceIwHzBAF4a3jCm?usp=sharing
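
To double-check the sample counts discussed above (for example the 1840-per-dataset figure), a small script like this could count the entries in each downloaded JSON. The directory name is a placeholder, and the assumption that each file's top level is a list of samples may need adjusting to the actual schema.

```python
import json
from pathlib import Path

sample_dir = Path("dataset_samples")  # wherever the Google Drive JSONs were saved

for json_file in sorted(sample_dir.glob("*.json")):
    with open(json_file, "r", encoding="utf-8") as f:
        entries = json.load(f)
    # Assumes the top-level object is a list of samples.
    print(f"{json_file.name}: {len(entries)} entries")
```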

lmdsx (Author) commented Jan 22, 2025

Thank you for sharing!

lmdsx closed this as completed Jan 22, 2025