Summarization of CTI texts is a typical use case for CTI teams: threat landscape reports, quarterly or monthly reports and similar deliverables all require summarizing relevant information while highlighting the points that matter for the organization. While the latter - the organization-specific highlighting - will probably still need human editing and corrections for some time, the former - pure summarization of vast amounts of CTI reports, blog posts and bulletins - is a use case where NLP / AI can shine. Summarization can usually be done automatically with mere human oversight.
If you plan to create your own CTI report summarization AI model (or fine-tune an existing one), or if you merely want to compare existing models, you will naturally arrive at the problem of how to compare the quality of summarizations. And this is anything but trivial: one summarization might be written in a different style, with different words and a different structure than another, yet both might be equally "good".
Here, we follow the paper Human-like Summarization Evaluation with ChatGPT by Mingqi Gao et al. and attempt to create an LLM summarization evaluator based on GPT-3.5/GPT-4.
The paper nicely states the problem this way:
"Evaluating text summarization, like other text generation tasks, is a challenging problem. While human evaluation is considered the gold standard, it is expensive and time-consuming. As a result, automatic evaluation metrics play a crucial role."
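To make this concrete, here is a minimal sketch of what such an evaluator could look like, assuming the OpenAI Python client. The prompt wording, the 1-5 scale and the four quality dimensions are our own illustration of the paper's human-like evaluation setup, not its exact prompts:

```python
# pip install openai -- sketch only; prompt and scale are illustrative
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def rate_summary(source_text: str, summary: str, model: str = "gpt-4") -> str:
    prompt = (
        "You are evaluating the quality of a summary of a CTI report.\n"
        "Rate the summary from 1 (poor) to 5 (excellent) on coherence, "
        "consistency, fluency and relevance, and answer as JSON.\n\n"
        f"Source text:\n{source_text}\n\nSummary:\n{summary}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the ratings as deterministic as possible
    )
    return response.choices[0].message.content
```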
Before refining that, though, let's look at the existing automatic evaluation metrics for comparing summaries. (This blog post also does an excellent job of explaining them.)
Recall-Oriented Understudy for Gisting Evaluation (ROUGE, https://aclanthology.org/W04-1013.pdf) is a set of simple methods for automatically scoring the quality of a summarization against a reference summary by counting overlapping units (n-grams, word sequences, word pairs).
The obvious downside of ROUGE is that it compares surface overlap rather than the semantics of the summaries.
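For example, with Google's rouge-score package (the texts are made-up toy examples):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "APT29 targeted European ministries with spear-phishing emails in May."
candidate = "In May, APT29 attacked European ministries via spear-phishing."

# ROUGE-1 counts overlapping unigrams, ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, s in scorer.score(reference, candidate).items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```

Note how the paraphrased candidate is penalized simply for using different words ("attacked" vs. "targeted"), even though the meaning is preserved.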
From the BERTScore paper:
"BERTSCORE computes the similarity of two sentences as a sum of cosine similarities between their tokens’ embeddings"
In other words, BERTScore replaces exact n-gram matching with semantic similarity: candidate and reference tokens are embedded with a pretrained BERT model, each token is greedily matched to its most similar counterpart by cosine similarity, and the matches are aggregated into precision, recall and F1 scores. It can therefore reward a summary that expresses the same content in different words - exactly the case where ROUGE falls short.
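The bert-score package implements this; a minimal example with the same toy texts as above:

```python
# pip install bert-score
from bert_score import score

candidates = ["In May, APT29 attacked European ministries via spear-phishing."]
references = ["APT29 targeted European ministries with spear-phishing emails in May."]

# Returns per-pair precision/recall/F1 tensors based on token-level
# cosine similarities of contextual embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```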
(the setup instructions below are taken from https://github.com/aaronkaplan/cti-llm/blob/main/summarization)
- Install a PostgreSQL server (if not done yet). Note: XXX FIXME: this will be replaced by Docker stuff.
- Create a 'ctillm' user in postgres (you might have to do this as the postgres shell user):
  $ createuser ctillm
- Create the initial DB:
  $ psql postgres < db.sql
- Check if the connection works:
  $ psql summarydb
  summarydb=# \d
               List of relations
   Schema |      Name      |   Type   | Owner
  --------+----------------+----------+--------
   public | summary        | table    | ctillm
   public | summary_id_seq | sequence | ctillm
  (2 rows)
Edit the file summarization/test_data/urls.txt and add one URL per line. Then run the summarizer:

$ python summarization/summarizer.py
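For orientation, the overall flow of summarizer.py is roughly the following sketch; the real logic lives in the repo, and the summarize() helper and the column names here are assumptions:

```python
# Rough sketch only -- see summarization/summarizer.py for the real code.
import psycopg2
import requests

def summarize(text: str) -> str:
    raise NotImplementedError  # e.g. the GPT-based summarizer discussed above

conn = psycopg2.connect(dbname="summarydb", user="ctillm")
cur = conn.cursor()
with open("summarization/test_data/urls.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        text = requests.get(url, timeout=30).text  # fetch the report / blog post
        cur.execute(
            "INSERT INTO summary (url, summary_text) VALUES (%s, %s)",
            (url, summarize(text)),  # column names are an assumption
        )
conn.commit()
conn.close()
```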
Then check the contents of the PostgreSQL DB:

$ psql summarydb
summarydb=# SELECT * FROM summary;
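The same check can also be done from Python, e.g. with psycopg2 (connection parameters assume the local setup from above):

```python
# pip install psycopg2-binary
import psycopg2

conn = psycopg2.connect(dbname="summarydb", user="ctillm")
cur = conn.cursor()
cur.execute("SELECT * FROM summary;")
for row in cur.fetchall():
    print(row)
conn.close()
```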