This project aimed to investigate how good modern language models are at summarizing text from application logs. When compared to previous work LogSummary, which used a complex pipeline based on TextRank, language models outperformed LogSummary on their own dataset with only a little fine-tuning. Skim through the accompanying thesis for more details on the datasets, model architecture and results.
- clone the repository via
git clone https://github.com/Impelon/log-rca-summarization.git
- optionally use
virtualenv
to make a virtual environment for python-packages - open
code
-folder in terminal - make sure system dependencies are installed;
the required packages can be found in
code/apt_requirements.txt
andcode/apt_optional_requirements.txt
; if you use theapt
package manager you can directly install these viasudo apt-get update xargs -o sudo apt-get install < apt_requirements.txt xargs -o sudo apt-get install < apt_optional_requirements.txt
- install dependencies via
pip3 install -r requirements.txt pip3 install -r optional_requirements.txt
- if you want to compare with LogSummary,
you also need to install it; you can initialize it here as a submodule:
The dependencies are not well kept for LogSummary, and
git submodule init git submodule update pip3 install -r LogSummary/requirements.txt
requirements.txt
misses some of them. Furthermore some code changes are necessary to include missing functions. For more details check LogSummary's project page. Additionally a word2vec model is required, which can be trained for log-data with their other framework Log2Vec. Note: Paths longer than 100 characters cause buffer overflows when training Log2Vec, if the limit is not changed manually. (LogSummary is only used for the purpose of comparison and accessing their dataset; it is not needed otherwise.)
- install repository
- download a suitable dataset e.g. Hadoop dataset from
- open
code
-folder in terminal - run dataset pre-processing via
e.g.
python3 -m "preprocess_dataset" <dataset-type> "<original-dataset-path>" "<destination-path>"
This also works for preprocessing LogSummary's dataset:python3 -m "preprocess_dataset" hadoop "../data/Hadoop/raw" "../data/Hadoop/processed"
python3 -m "preprocess_dataset" logsummary "../LogSummary/data/summary/logs" "../data/LogSummary/processed"
- install repository and prepare datasets
- open
code
-folder in terminal - run log-analysis via
e.g.
python3 -m "util.loganalysis"
python3 -m "util.loganalysis" ../data/Hadoop/processed --category "Disk full" --group-by-columns "File" --partition-minimum-size 200 --output-type common-events
- optionally prepare a preset for log-analysis, see the different configurations used in the
code/log_analysis_configs
-folder; the analysis directly with such a configuration, e.g.python3 -m "util.loganalysis" ../data/Hadoop/processed @../data/Hadoop/loganalysis_configs/disk-full.json --output-type common-events
- install repository and prepare datasets
- open
code
-folder in terminal - run pre-training via
e.g.
python3 -m "pretrain_model"
python3 -m "pretrain_model" --model_class "BartForConditionalGeneration" --model_name_or_path "facebook/bart-base" --masking_algorithm_preset "text-infilling" --csv_paths "../data/Hadoop/processed/log-instances"/* --input_field_name "SimplifiedMessage" --output_dir "../models/bart-base.hadoop/trained" --do_train
- optionally prepare a preset for pretraining-configuration, see
pretraining_config.py
used in themodel
-folder; the pre-training can then be run directly with such a configuration, e.g.python3 -m "pretrain_model" @../models/bart-base.hadoop/pretraining_config.py --do_train
- under an environment with multiple GPUs you may want to use DDP for training. (DDP referes to PyTorch's DistributedDataParallel);
in that case the pre-training should be started with
torchrun
For example like this:torchrun --nproc_per_node <number_of_gpu_you_have> -m "pretrain_model" [arguments_for_pretrain_model]...
torchrun --nproc_per_node 2 -m "pretrain_model" @../models/bart-base.hadoop/pretraining_config.py --do_train
- install repository, prepare datasets and optionally pre-train the model
- open
code
-folder in terminal - run pre-training via
e.g.
python3 -m "finetune_model"
python3 -m "finetune_model" --model_name_or_path "facebook/bart-base" --dataset_path "../data/Hadoop/processed" --loganalysis_arguments="@../data/Hadoop/loganalysis_configs/disk-full.json" --excluded_events_loganalysis_arguments="@../data/Hadoop/loganalysis_configs/normal.json" --input_field_name "SimplifiedMessage" --output_dir "../models/bart-base.hadoop/finetuned" --do_train
- optionally prepare a preset for finetuning-configuration, see
finetuning_config.py
used in themodel
-folder; the pre-training can then be run directly with such a configuration, e.g.python3 -m "finetune_model" @../models/bart-base.hadoop/finetuning_config.py --do_train
- under an environment with multiple GPUs you may want to use DDP for training. (DDP referes to PyTorch's DistributedDataParallel);
in that case the fine-tuning should be started with
torchrun
For example like this:torchrun --nproc_per_node <number_of_gpu_you_have> -m "finetune_model" [arguments_for_finetune_model]...
torchrun --nproc_per_node 2 -m "finetune_model" @../models/bart-base.hadoop/finetuning_config.py --do_train
Given a suitable platform for monitoring of training is installed
that is compatible with 🤗 Transformers
,
the training-scripts will automatically create logs for those platforms.
The default location is at runs
in the output_dir
, but can be controlled with the --logging_dir
option of training-scripts.
For example with tensorboard
installed,
visualizing previous training-runs could look like this:
tensorboard --logdir models/bart-base.hadoop/trained/runs
Trained models can comfortably be used for making predictions using pipelines.
To use a pre-trained model for mask-filling (the pipeline is limited to a mask-length of 1):
>>> import transformers
>>> mask_filler = transformers.pipeline("fill-mask", "models/bart-base.hadoop/trained")
>>> sentence = "Scheduled snapshot <mask> at 10 second(s)."
>>> for prediction in mask_filler(sentence):
print("{:6.2%} {}".format(prediction["score"], prediction["sequence"]))
9.89% Scheduled snapshot count at 10 second(s).
5.18% Scheduled snapshot snapshot at 10 second(s).
4.78% Scheduled snapshot size at 10 second(s).
2.88% Scheduled snapshot update at 10 second(s).
2.74% Scheduled snapshot start at 10 second(s).
To use a fine-tuned model for summarization:
>>> import transformers
>>> summarizer = transformers.pipeline("summarization", "models/bart-base.hadoop/finetuned")
>>> summary = summarizer("...")
- install repository
- open
code
-folder in terminal - run via
python3 -m "tracesviz"
- open any csv-file that contains structured log-data with traces;
logs can be directly structured into csv via
python3 -m "util.logparsing" structure