Has the annotated dataset mentioned in this part been made public? The section in question:

"Dataset. We randomly sample a subset of 400 queries from the complete ALIGNBENCH dataset. To make sure each category contains enough samples to produce reliable results, smaller categories are upsampled. To cover LLMs with a wider range of capability levels, we adopt answers from 8 LLMs, including GPT-4 (OpenAI, 2023), three versions of the ChatGLM series (Zeng et al., 2022; Du et al., 2022), Sparkdesk, Qwen-plus-v1-search (Bai et al., 2023a), InternLM-7B-Chat (Team, 2023), and Chinese-Llama2-7B-Chat, producing a total of 3,200 question-answer pairs. After the evaluation set is compiled, the question-answer-reference triples are delivered to human annotators, who are tasked with assigning quality ratings to the answers according to the references. Given the inherent limitations of human cognition, annotators are instructed to rate on a scale from 1 to 5, with higher scores indicating higher response quality and greater satisfaction. In particular, a score of 1 marks irrelevant, incorrect, or potentially harmful responses."
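For concreteness, here is a minimal Python sketch of how such an evaluation set could be assembled. The category names and sizes, the equal per-category quota, and the with-replacement upsampling strategy are all assumptions for illustration, not details confirmed by the paper; only the totals (400 queries, 8 models, 3,200 pairs) come from the quoted text:

```python
import random

random.seed(0)

# Hypothetical category pools; real AlignBench category names and
# sizes are assumptions, not taken from the paper.
queries_by_category = {
    "math": [f"math-{i}" for i in range(120)],
    "writing": [f"writing-{i}" for i in range(90)],
    "reasoning": [f"reasoning-{i}" for i in range(25)],  # small category
    "open-qa": [f"open-qa-{i}" for i in range(200)],
}

TOTAL = 400
per_category = TOTAL // len(queries_by_category)  # assumed equal quota

sampled = []
for cat, pool in queries_by_category.items():
    if len(pool) >= per_category:
        sampled += random.sample(pool, per_category)
    else:
        # Upsample: draw with replacement so a small category still
        # fills its quota (one plausible reading of "upsampled").
        sampled += random.choices(pool, k=per_category)

# Each of the 8 LLMs answers every sampled query: 8 * 400 = 3,200 pairs.
models = ["GPT-4", "ChatGLM-v1", "ChatGLM-v2", "ChatGLM-v3", "Sparkdesk",
          "Qwen-plus-v1-search", "InternLM-7B-Chat", "Chinese-Llama2-7B-Chat"]
qa_pairs = [(q, m) for q in sampled for m in models]
assert len(qa_pairs) == 3200
```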