We need some methods/scripts to evaluate parsing performance. We probably want to do two things: a) replicate previous work that uses parseval so that we can easily report previous results (see table 3 in http://www.cc.gatech.edu/~jeisenst/papers/ji-acl-2014.pdf), and b) implement a more appropriate metric based on precision/recall of relations between spans, not just precision/recall of (labeled or unlabeled) spans as in parseval (a rough sketch follows the requirements below). See discussion from @sagae below.
- The metrics should report unlabeled and labeled performance.
- The metrics should use the 18 coarse relations from Carlson et al.'s (2001) "Building a Discourse-tagged Corpus in the Framework of Rhetorical Structure Theory."
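For point (a), here is a minimal sketch of parseval-style span scoring, assuming each tree has already been flattened into its constituent spans (EDU index ranges, optionally paired with a relation label for the labeled variant). The function and variable names are illustrative only, not part of the existing code:

```python
# Sketch of parseval-style precision/recall over spans (the data shape is an
# assumption, not the repo's API): each tree is given as a collection of
# (start, end) spans, or (start, end, relation) tuples for labeled scoring.
from collections import Counter

def parseval_scores(gold_spans, pred_spans):
    """Return (precision, recall, f1) over (optionally labeled) spans."""
    gold, pred = Counter(gold_spans), Counter(pred_spans)
    matched = sum((gold & pred).values())  # respects duplicate spans
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Unlabeled example over EDU index ranges:
gold = [(1, 3), (1, 5), (4, 5)]
pred = [(1, 3), (1, 5), (3, 5)]
print(parseval_scores(gold, pred))  # roughly (0.67, 0.67, 0.67)
```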
Discussion from @sagae:

Looking at Fig 1 in http://www.isi.edu/~marcu/papers/sigdialbook2002.pdf, there are nine rhetorical relations, represented by the labeled directed arcs (same-unit is just a side effect of the annotation, not a discourse relation). We really should be looking at precision and recall of the relations represented by these labeled arcs. So we would be looking for the relations in that figure, and precision and recall would be computed in the usual way; successful identification of a relation requires the correct spans, the correct direction of the arrow, and the correct label. That list doesn't include 22-23 <- 24-25 : same-unit, but the parser does need to get this right to form the 22-25 span, so it's taken into account implicitly, which I think is the right way.
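A rough sketch of what relation-level scoring could look like under those criteria, assuming each arc is represented as a (source_span, target_span, relation) triple such as ((22, 23), (24, 25), "same-unit"): the direction of the arrow is encoded by the tuple order, and the unlabeled variant simply drops the relation. The names here are assumptions for illustration, not existing code:

```python
# Sketch of relation-level precision/recall over labeled directed arcs.
# Assumed arc format: (source_span, target_span, relation), where spans are
# (start, end) EDU index pairs and tuple order encodes the arrow's direction.
def relation_scores(gold_arcs, pred_arcs, labeled=True):
    strip = (lambda a: a) if labeled else (lambda a: (a[0], a[1]))
    gold = {strip(a) for a in gold_arcs}
    pred = {strip(a) for a in pred_arcs}
    matched = len(gold & pred)  # same spans, same direction, same label
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Per the discussion above, a same-unit arc like ((22, 23), (24, 25), "same-unit")
# can be left out of the arc sets and checked implicitly via the combined (22, 25) span.
```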
Commit 12c5b59 implements the basic functionality for doing parseval, but it's not complete. Some edge cases still need to be dealt with (e.g., same-unit relations). See the TODO comments in the code.
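For the same-unit edge case, one possible (and deliberately simplified) treatment, not necessarily what the TODOs have in mind: collapse spans joined by a same-unit arc into a single combined span and drop the same-unit arcs before scoring, so same-unit is only checked implicitly through the larger span, as suggested above. The names below are made up for illustration:

```python
# Simplified sketch: merge spans linked by same-unit into one combined span
# and drop the same-unit arcs themselves (handles one pair per segment only;
# chains of three or more same-unit segments would need an extra pass).
def merge_same_unit(arcs):
    combined, kept = {}, []
    for src, tgt, rel in arcs:
        if rel.lower() == "same-unit":
            merged = (min(src[0], tgt[0]), max(src[1], tgt[1]))
            combined[src] = combined[tgt] = merged
        else:
            kept.append((src, tgt, rel))
    # Rewrite surviving arcs so they refer to the merged spans.
    return [(combined.get(s, s), combined.get(t, t), r) for s, t, r in kept]
```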