LSX-UniWue/scene-segmentation

Code and Data for the Paper "Assessing the State of the Art in Scene Segmentation"

Code

Code is provided for Sequential Sentence Classification (ssc), LLM prompting (prompting) and LLM fine-tuning (llama).

The prompting code requires an OpenAI API key, optionally an OpenRouter API key (for additional models), and the base URL of a running Ollama server. All of these are set in prompting/classify.py.
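If you prefer not to hard-code credentials, one option is to read them from the environment and pass them on. The following is only a minimal sketch: the function `load_settings` and the environment variable names `OPENAI_API_KEY`, `OPENROUTER_API_KEY` and `OLLAMA_BASE_URL` are assumptions for illustration and are not taken from prompting/classify.py. The fallback URL is the address a local Ollama server listens on by default.

```python
import os
from typing import Mapping, Optional

# Default address of a locally running Ollama server.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the three settings from environment variables.

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    openai_key = env.get("OPENAI_API_KEY")
    if not openai_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "openai_api_key": openai_key,
        # Only needed for the additional models served through OpenRouter.
        "openrouter_api_key": env.get("OPENROUTER_API_KEY"),
        "ollama_base_url": env.get("OLLAMA_BASE_URL", DEFAULT_OLLAMA_URL),
    }
```

The returned values would then still have to be copied or wired into prompting/classify.py by hand.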

Data

The data folder contains fully annotated files for the public-domain texts in our corpus and standoff annotations for the other texts.

Standoff Annotations

The standoff annotations are simple JSON files that contain the character indices of the scene boundaries and of the detected sentence boundaries. Each file also contains an MD5 hash of the text, computed as md5(doc.text.encode("utf-8")).hexdigest(), which you can use to check that your copy of the text matches the one we annotated. The format is as follows:

{
  "scenes": [
    {
      "start": 0,
      "end": 100,
      "reason_for_change": "Zeit, Handlung",
      "scene_type": "Szene"
    },
    ...
  ],
  "sentences": [
    {
      "start": 0,
      "end": 10
    },
    ...
  ],
  "md5": "hash"
}
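The format above could be consumed roughly as follows. This is a minimal sketch and not part of the repository: the helper names (`load_annotations`, `check_text`, `scene_texts`, `sentences_per_scene`) are hypothetical, and it assumes that `start` is inclusive and `end` is exclusive, as in Python slicing, which you should verify against your data. The toy text and offsets are made up for illustration.

```python
import json
from hashlib import md5


def load_annotations(path: str) -> dict:
    """Read one standoff annotation file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def check_text(text: str, annotations: dict) -> bool:
    """Return True if the text's MD5 hash matches the stored one."""
    return md5(text.encode("utf-8")).hexdigest() == annotations["md5"]


def scene_texts(text: str, annotations: dict) -> list:
    """Slice the text into scenes (assumes exclusive end offsets)."""
    return [text[s["start"]:s["end"]] for s in annotations["scenes"]]


def sentences_per_scene(annotations: dict) -> list:
    """Group sentence spans by the scene containing their start offset."""
    grouped = [[] for _ in annotations["scenes"]]
    for sent in annotations["sentences"]:
        for i, scene in enumerate(annotations["scenes"]):
            if scene["start"] <= sent["start"] < scene["end"]:
                grouped[i].append(sent)
                break
    return grouped


# Toy example with made-up text and offsets.
text = "It was night. She left. Morning came."
annotations = {
    "scenes": [
        {"start": 0, "end": 24, "reason_for_change": "", "scene_type": "Szene"},
        {"start": 24, "end": 37, "reason_for_change": "Zeit", "scene_type": "Szene"},
    ],
    "sentences": [
        {"start": 0, "end": 13},
        {"start": 14, "end": 23},
        {"start": 24, "end": 37},
    ],
    "md5": md5(text.encode("utf-8")).hexdigest(),
}

if not check_text(text, annotations):
    raise ValueError("Text does not match the annotated version")

for scene, sents in zip(scene_texts(text, annotations), sentences_per_scene(annotations)):
    print(len(sents), repr(scene))
```

If the hash check fails, the offsets will most likely not line up with your text, for example because of a different edition, normalization, or line-ending convention.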

Fully Annotated Files

The fully annotated files are in UIMA XMI format. The easiest way to view them is to drag them into the editor window of WebATHEN. For automatic processing, the easiest option is the WueNLP Python library:

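from wuenlp.impl.UIMANLPStructs import UIMADocument

# Load a fully annotated document from its XMI file
doc = UIMADocument.from_xmi("path/to/file.xmi")

# Iterate over the annotated scenes and print the text of each one
for scene in doc.scenes:
    print(scene.text)
    ...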
