ABSA datasets for PyABSA
-
A Stand-alone browser based tool to help process data for the training set. here
Once data saved, 3 files will be created:
- a CSV file training set for classic sentiment analysis
- a TXT file training set for PyABSA
- a JSON file for saving unfinished work
There is an experimental feature which allows you to auto-build APC dataset and ATEPC datasets, see the usage here:
from pyabsa import make_ABSA_dataset
# refer to the comments in this function for detailed usage
make_ABSA_dataset(dataset_name_or_path='integrated_datasets/review', checkpoint='english')
To augment your datasets, please refer to BoostTextAugmentation
-
Format your APC dataset according to our dataset format. (Recommended. Once you finished this step, we can help you to finish other steps)
-
Generate the inference dataset for APC / ATEPC task (Optional. The example is available here)
-
Convert the APC dataset to ATEPC dataset, and move the transformed ATEPC datasets from apc_dataset to corresponding atepc_datasets. (Optional. The example is available here )
-
Register your dataset in PyABSA. (Optional. Register here)
It is recommended to assign an id for your dataset, which will avoid some potential problem (e.g., dataset mis-loading) while PyABSA detects the dataset. Merging your datasets into ABSADatasets, please keep the id remained.
- For a custom APC dataset, its name should be {id}.{dataset name}.{type}.dat.apc,
datasets
├── apc_datasets
│ ├── 101.restaurant
│ │ ├── restaurant.train.dat.apc # train_dataset
│ │ ├── restaurant.test.dat.apc # test_dataset
│ │ └── restaurant.valid.dat.apc # valid_dataset, dev set are not recognized in PyASBA, please rename dev-set to valid-set
│ └── others
├── atepc_datasets
- ATEPC dataset files should be {id}.{dataset name}.{type}.dat.atepc, e.g.,
datasets
├── 101.restaurant
│ ├── restaurant.train.dat.atepc # train_dataset
│ ├── restaurant.test.dat.atepc # test_dataset
│ └── restaurant.valid.dat.atepc # valid_dataset, dev set are not recognized in PyASBA, please rename dev-set to valid-set
└── others
I prepare a demo custom APC/ATEPC dataset which is based on third-party annotated Yelp dataset. Iif you got problem in dataset renaming, please put your data into the prepared dataset files. Check datasets/apc_datasets/100.CustomDataset and datasets/atepc_datasets/100.CustomDatasetto view or rewrite the custom dataset.
Then, use the {id}.{dataset name} to locate your dataset, e.g.,
from pyabsa.functional import APCConfigManager
from pyabsa.functional import Trainer
from autocuda import auto_cuda
config = APCConfigManager.get_apc_config_english() # APC task
dataset = '101.restaurant'
# dataset = '100.CustomDataset'
Trainer(config=config,
dataset=dataset, # train set and test set will be automatically detected
checkpoint_save_mode=1,
auto_device=auto_cuda() # automatic choose CUDA or CPU
)
All datasets provided are for research only, we do not hold any Copyright of any datasets. These datasets follow their original licenses (if any).
MAMS https://github.com/siat-nlp/MAMS-for-ABSA
SemEval 2014: https://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools
SemEval 2015: https://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools
SemEval 2016: https://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools
Chinese: https://www.sciencedirect.com/science/article/abs/pii/S0950705118300972?via%3Dihub
Shampoo: brightgems@GitHub
MOOC: jmc-123@GitHub with GPL License
Twitter: https://dl.acm.org/doi/10.5555/2832415.2832437
Television & TShirt: https://github.com/rajdeep345/ABSA-Reproducibility
Yelp: WeiLi9811@GitHub
SemEval2016Task5: YaxinCui@GitHub
- Arabic Hotel Reviews
- Dutch Restaurant Reviews
- English Restaurant Reviews
- French Restaurant Reviews
- Russian Restaurant Reviews
- Spanish Restaurant Reviews
- Turkish Restaurant Reviews
English-MOOC github: aparnavalli@GitHub
We hope you can share your custom dataset or an available public dataset. If you are willing to, please follow the instruction to process your data and open a PR.