KoDialogBench

This is the official repository for "KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark" accepted at LREC-COLING 2024.

Data

KoDialogBench is a benchmark designed to assess the conversational capabilities of language models in Korean. To this end, we collected native Korean dialogues on daily topics from public sources (e.g., AI Hub) and translated dialogues from other languages such as English and Chinese. We then structured these conversations into diverse test sets, spanning dialogue comprehension and response selection tasks. The benchmark consists of 21 test sets covering various aspects of open-domain colloquial dialogues (e.g., topic, emotion, dialog act).

We uploaded the datasets to the 🤗 Hugging Face Hub.
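For example, a subset can be loaded with the 🤗 datasets library. This is a minimal sketch: the repository ID and configuration name below ("sb-jang/kodialogbench", "dc_topic_ksns") are assumptions, so check the Hub page for the actual identifiers.

# Minimal sketch of loading one KoDialogBench subset.
# NOTE: the repo ID and config name are illustrative assumptions;
# see the Hugging Face Hub page for the actual identifiers.
from datasets import load_dataset

dataset = load_dataset("sb-jang/kodialogbench", "dc_topic_ksns", split="test")
print(dataset[0])  # one multiple-choice example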

Sources

We collected native Korean dialogues from AI Hub:

  • K-SNS stands for Korean SNS (한국어 SNS)
  • K-TDD stands for Thematic Daily Dialogues (주제별 텍스트 일상 대화 데이터)
  • K-ED stands for Emotional Dialogues (감성 대화 말뭉치)
  • K-DS stands for Dialogue Summary (한국어 대화 요약)

We translated public datasets from other languages:

  • DailyDialog (English)
  • Empathetic Dialogues (English)
  • PersonaChat (English)
  • SocialDial (Chinese)

Statistics

The benchmark contains 82,962 examples in total.

| Task | Subtask | Source | # Options | # Examples |
| --- | --- | --- | --- | --- |
| Dialogue Comprehension | Topic Classification | K-SNS | 6 | 1,200 |
| Dialogue Comprehension | Topic Classification | K-TDD | 19 | 1,900 |
| Dialogue Comprehension | Topic Classification | SocialDial | 4 | 400 |
| Dialogue Comprehension | Emotion Recognition | K-ED | 6 | 1,200 |
| Dialogue Comprehension | Emotion Recognition | DailyDialog | 5 | 470 |
| Dialogue Comprehension | Emotion Recognition | Empathetic Dialogues | 2 | 2,000 |
| Dialogue Comprehension | Relation Classification | SocialDial (Distance) | 4 | 524 |
| Dialogue Comprehension | Relation Classification | SocialDial (Relation) | 3 | 330 |
| Dialogue Comprehension | Location Classification | SocialDial | 4 | 376 |
| Dialogue Comprehension | Dialog Act Classification | K-TDD | 4 | 520 |
| Dialogue Comprehension | Dialog Act Classification | DailyDialog | 4 | 1,000 |
| Dialogue Comprehension | Fact Identification | K-DS | 4 | 1,200 |
| Dialogue Comprehension | Fact Identification | PersonaChat | 4 | 1,000 |
| Dialogue Comprehension | Fact Identification | Empathetic Dialogues | 4 | 2,394 |
| Response Selection | | K-SNS | 5 | 10,295 |
| Response Selection | | K-TDD | 5 | 10,616 |
| Response Selection | | K-ED | 5 | 17,818 |
| Response Selection | | PersonaChat | 5 | 7,801 |
| Response Selection | | DailyDialog | 5 | 6,740 |
| Response Selection | | Empathetic Dialogues | 5 | 7,941 |
| Response Selection | | SocialDial | 5 | 7,237 |
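As a quick sanity check, the per-subset counts in the table sum to the reported total:

# Sanity check: per-subset example counts from the table above.
dc = [1200, 1900, 400, 1200, 470, 2000, 524, 330, 376, 520, 1000, 1200, 1000, 2394]
rs = [10295, 10616, 17818, 7801, 6740, 7941, 7237]
assert sum(dc) + sum(rs) == 82962  # matches the reported total
print(sum(dc) + sum(rs))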

Usage

We use lm-evaluation-harness for zero-shot and few-shot evaluation.

TODO: merge the KoDialogBench task to lm-evaluation-harness

Installation

Install lm-evaluation-harness before cloning this repo:

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
python -m venv venv
source venv/bin/activate
pip install -e ".[multilingual]"
pip install sentencepiece

Task registration

After cloning this repo, copy the task configs into lm-evaluation-harness:

cp -r kodialogbench ../lm-evaluation-harness/lm_eval/tasks
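If the copy succeeded, the kodialogbench tasks should appear in the harness's task list (recent lm-evaluation-harness versions accept list as a value for --tasks):

lm_eval --tasks list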

Evaluation

You can evaluate subsets of the benchmark by passing the following values to --tasks:

  • kodialogbench: the full benchmark (all 21 tasks)
  • kodialogbench_dc: 14 dialogue comprehension tasks
  • kodialogbench_rs: 7 response selection tasks
  • kodialogbench_dc_topic: 3 topic classification tasks
  • kodialogbench_dc_emotion: 3 emotion recognition tasks
  • kodialogbench_dc_relation: 2 relation classification tasks
  • kodialogbench_dc_dialog_act: 2 dialog act classification tasks
  • kodialogbench_dc_fact: 3 fact identification tasks
For example, to run a zero-shot evaluation of Polyglot-Ko 1.3B on the full benchmark:

lm_eval --model hf \
    --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
    --tasks kodialogbench \
    --device cuda:0 \
    --batch_size auto \
    --num_fewshot 0

To change the prompts, modify the doc_to_text functions in utils.py.
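In the harness's task format, doc_to_text maps one dataset example to the prompt string shown to the model. A minimal sketch, assuming hypothetical field names ("dialogue", "question") rather than the actual KoDialogBench schema:

# Minimal sketch of a doc_to_text function for lm-evaluation-harness.
# The field names ("dialogue", "question") are hypothetical; the real
# schema is defined by the dataset and the task configs in this repo.
def doc_to_text(doc):
    dialogue = "\n".join(doc["dialogue"])  # render dialogue turns line by line
    return f"{dialogue}\n질문: {doc['question']}\n답:"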

Limitations

Our benchmark may suffer from the chronic problem of benchmark contamination. Due to the scarcity of Korean language resources, the held-out sources used to construct the benchmark may overlap with the training data of some language models.

Ethics Statement

Our benchmark dataset is designed to assess capabilities related to various situations and aspects of conversations in Korean. To achieve this, we used conversational content from publicly available datasets from various sources, either without modification or with translation where necessary. In this process, harmful content or inappropriate biases present in the original data may have been carried over, or may have been introduced by the limitations of translation tools. We reject any form of violence, discrimination, or offensive language, and our benchmark dataset and experimental results do not represent such values. If any harmful content or privacy infringement is identified within the dataset, please notify the authors immediately. In the event of such a report, we will apply the highest ethical standards and take appropriate action.

Citation

@misc{jang2024kodialogbench,
      title={KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark}, 
      author={Seongbo Jang and Seonghyeon Lee and Hwanjo Yu},
      year={2024},
      eprint={2402.17377},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Point of Contact

Seongbo Jang
