Code and Data for the Paper "Assessing the State of the Art in Scene Segmentation"

Code

Code is provided for Sequential Sentence Classification (folder ssc), LLM prompting (folder prompting), and LLM fine-tuning (folder llama).

The prompting code requires an OpenAI API key and, for additional models, an OpenRouter API key, as well as the base URL of a running Ollama server. Set these in prompting/classify.py.
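If you prefer not to hard-code secrets in the script, one option is to read them from environment variables and pass them on from there. The following is only a sketch: the variable names OPENAI_API_KEY, OPENROUTER_API_KEY and OLLAMA_BASE_URL, the PromptingConfig class and the load_config helper are assumptions for illustration and are not part of prompting/classify.py. The fallback URL is Ollama's default local address.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class PromptingConfig:
    """Hypothetical container for the three settings the prompting code needs."""
    openai_api_key: str
    openrouter_api_key: Optional[str]  # only needed for the additional models
    ollama_base_url: str


def load_config(env: Mapping[str, str] = os.environ) -> PromptingConfig:
    """Read the settings from environment variables (names are assumptions)."""
    openai_key = env.get("OPENAI_API_KEY")
    if not openai_key:
        # Fail early instead of sending unauthenticated requests later.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return PromptingConfig(
        openai_api_key=openai_key,
        openrouter_api_key=env.get("OPENROUTER_API_KEY"),
        # Ollama listens on port 11434 by default.
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
    )
```

The values returned by load_config() would then still have to be assigned to whatever names prompting/classify.py actually uses.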

Data

The data folder contains fully annotated files for the public domain texts in our corpus and standoff annotations for the other texts.

Standoff Annotations

The standoff annotations are plain JSON files containing the character indices of the scene boundaries and of the detected sentence boundaries. Each file also contains an MD5 hash of the text, computed as md5(doc.text.encode("utf-8")).hexdigest(), which you can use to check that your copy of the text matches the one we annotated. The format is as follows:

{
  "scenes": [
    {
      "start": 0,
      "end": 100,
      "reason_for_change": "Zeit, Handlung",
      "scene_type": "Szene"
    },
    ...
  ],
  "sentences": [
    {
      "start": 0,
      "end": 10
    },
    ...
  ],
  "md5": "hash"
}
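As a minimal sketch of how such a file could be applied to your own copy of a text, the helpers below check the hash, cut the text into scenes, and assign each sentence to a scene. The function names are hypothetical and not part of this repository. The sketch also assumes that "end" is an exclusive index, as in Python slicing; the format description does not state this, so verify it against a fully annotated file.

```python
import json
from hashlib import md5
from typing import Optional


def load_standoff(annotation_path: str, text: str) -> dict:
    """Load a standoff file and verify that `text` matches its stored hash."""
    with open(annotation_path, encoding="utf-8") as f:
        annotations = json.load(f)
    # Same hash computation as described for the "md5" field.
    digest = md5(text.encode("utf-8")).hexdigest()
    if digest != annotations["md5"]:
        raise ValueError("md5 mismatch: text differs from the annotated version")
    return annotations


def scene_texts(annotations: dict, text: str) -> list[str]:
    """Return the text of every scene (assumes exclusive "end" indices)."""
    return [text[s["start"]:s["end"]] for s in annotations["scenes"]]


def sentence_scene_index(annotations: dict) -> list[Optional[int]]:
    """For each sentence, the index of the scene containing its start, or None."""
    indices: list[Optional[int]] = []
    for sent in annotations["sentences"]:
        idx = None
        for i, scene in enumerate(annotations["scenes"]):
            if scene["start"] <= sent["start"] < scene["end"]:
                idx = i
                break
        indices.append(idx)
    return indices
```

A sentence whose scene index differs from that of the preceding sentence marks a scene boundary, which is the kind of per-sentence label a sentence-level classifier needs.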

Full Annotated Files

The fully annotated files are in UIMA XMI format. The easiest way to view them is to drag them into the editor window of WebATHEN. For automatic processing, the easiest option is the WueNLP Python library:

from wuenlp.impl.UIMANLPStructs import UIMADocument

doc = UIMADocument.from_xmi("path/to/file.xmi")

for scene in doc.scenes:
    print(scene.text)
    ...

Repository: LSX-UniWue/scene-segmentation
