Skip to content

Data for the ACL SRW 2020 paper "Understanding Points of Correspondence between Sentences for Abstractive Summarization"

License

Notifications You must be signed in to change notification settings

ucfnlp/points-of-correspondence

Repository files navigation

Understanding Points of Correspondence between Sentences for Abstractive Summarization

Dataset for our ACL SRW 2020 paper Understanding Points of Correspondence between Sentences for Abstractive Summarization

Citation

@inproceedings{lebanoff-etal-2020-understanding,
    title = "Understanding Points of Correspondence between Sentences for Abstractive Summarization",
    author = "Lebanoff, Logan and Muchovej, John and Dernoncourt, Franck and Kim, Doo Soon and Wang, Lidan and Chang, Walter and Liu, Fei",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-srw.26",
    pages = "191--198",
}

Presentation Video

Watch our presentation given virtually at ACL:

Watch our presentation given virtually at ACL:

Dataset

Fusing sentences containing disparate content is a remarkable human ability that helps create informative and succinct summaries. Such a simple task for humans has remained challenging for modern abstractive summarizers, substantially restricting their applicability in real-world scenarios.

We present a dataset that contains 1,599 sentence fusion examples (taken from 1,174 documents) with fine-grained Points of Correspondence annotations. Points of correspondence (PoC) are cohesive devices that tie two sentences together into a coherent text. The types of points of correspondence are delineated by text cohesion theory, covering pronominal and nominal referencing, repetition and beyond.

A point of correspondence is represented as a span of text from each sentence. Our dataset is in JSON format in the file PoC_dataset.json. Each example has the following attributes:

Attribute Content
Sentence_1 Tokenized input sentence 1
Sentence_2 Tokenized input sentence 2
Sentence_Fused Fused sentence created by merging Sentence_1 and Sentence_2
Sentence_1_Index Position of sentence in Full_Article
Sentence_2_Index Position of sentence in Full_Article
Sentence_Fused_Index Position of fused sentence in Full_Summary
Full_Article Full CNN news article. Each sentence is separated by tabs
Full_Summary Summary of the article. Each sentence is separated by tabs
PoCs List of Points of Correspondence

Each PoC has the following attributes:

Attribute Content
Sentence_1_Selection Token indices for beginning and end of the PoC in input sentence
Sentence_2_Selection Token indices for beginning and end of the PoC in input sentence
Sentence_Fused_Selection Token indices for beginning and end of the PoC in fused sentence
PoC_Type Can be any of Nominal, Pronominal, Common-Noun, Repetition and Event

Example Visualizations

We provide visualizations of every dataset example in the directory PoC_visualizations/, which can be opened in any browser, along with the code used to create them in visualize_poc.py.

The process is easy and can be seen below:

Example visualization

Model Outputs

The outputs of our models can be downloaded here: https://www.dropbox.com/sh/g34aj101oauwlx3/AABIdqbBXMAa8RFpb-I6Auh7a/Understanding%20Points%20of%20Correspondence%20between%20Sentences%20for%20Abstractive%20Summarization?dl=0

*Note: We tested only on the examples that had at least one point of correspondence, so there are 1494 outputs for each model.

About

Data for the ACL SRW 2020 paper "Understanding Points of Correspondence between Sentences for Abstractive Summarization"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages