diff --git a/.Rbuildignore b/.Rbuildignore index ec84911..702258d 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -5,3 +5,4 @@ ^LICENSE\.md$ ^README\.Rmd$ ^cran-comments\.md$ +^test\.R$ diff --git a/DESCRIPTION b/DESCRIPTION index 33b4080..3f693fe 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: minutemaker Title: GenAI-based meeting and conferences minutes generator -Version: 0.6.0 +Version: 0.8.0 Authors@R: person("Angelo", "D'Ambrosio", , "a.dambrosioMD@gmail.com", role = c("aut", "cre"), comment = c(ORCID = "0000-0002-2045-5155")) diff --git a/NAMESPACE b/NAMESPACE index 3ba4b26..321db63 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -4,6 +4,7 @@ export(add_chat_transcript) export(clean_transcript) export(entity_extractor) export(extract_text_from_transcript) +export(format_agenda) export(format_summary_tree) export(generate_recording_details) export(get_prompts) @@ -19,6 +20,7 @@ export(speech_to_summary_workflow) export(split_audio) export(summarise_full_meeting) export(summarise_transcript) +export(validate_agenda) import(dplyr) importFrom(stats,setNames) importFrom(utils,hasName) diff --git a/NEWS.md b/NEWS.md new file mode 100644 index 0000000..da60704 --- /dev/null +++ b/NEWS.md @@ -0,0 +1,100 @@ +# minutemaker 0.8.0 + +### Enhanced Agenda Management and Utilization + +#### Enhancements: +- Added a new `multipart_summary` argument in `speech_to_summary_workflow()` to allow users to choose between summarizing each agenda item separately (the previous approach, now the default) or as a single summary just using the agenda to focus the model, offering greater flexibility in the summarization process (Commit: 99168d4168e6394e1789d7ae0dadb1e3b37b006d). +- Introduced `format_agenda()` function to convert machine-readable agendas into human-readable text, improving the usability of agenda-driven summarization (Commit: 0d27980f21ea3ab1df2ed289df421b7034b2fd22). 
+- Added `validate_agenda()` function to ensure the validity of agenda structures before summarization, enhancing the reliability of the summarization process (Commit: 5e943af522529edea6166ab62da41cbfa7a08ebb). +- Added the ability for users to proceed with the summarization workflow after agenda generation without re-running the entire workflow function, streamlining the user experience (Commit: 8056ed6763368347efa4c1f79ae9497ca7ff1597). +- Changed the summarization workflow logic to not ask whether the user wants to overwrite the summarization output if `overwrite_formatted_output` is `FALSE` (Commit: 99168d4168e6394e1789d7ae0dadb1e3b37b006d). +- Implemented global configuration for the language model (LLM) provider via `getOption("minutemaker_llm_provider")`, allowing for more flexible and centralized LLM provider management (Commit: 159335d2d413462c93f5dd3f60c1919e1b2f8918). +- Updated `interrogate_llm()` to retrieve the LLM provider setting from global options, providing a more dynamic and user-friendly approach to specifying the LLM provider (Commit: 15723d6f8c03c648deb00b1705837eadef04b609). + +#### Fixes: +- Addressed an issue where the summarization process could fail due to invalid agendas by implementing the `validate_agenda()` function (Commit: 6bdabadbb3bbee66e20907f5f7dec4330842ed39). + +# minutemaker 0.7.0 + +### Manage events without agendas in the summarisation workflow + +This pull request includes a series of enhancements and fixes that improve the transcript summarization workflow, add new functionality for entity extraction, and ensure better support for various transcript formats. The changes also address code quality and documentation to provide a more robust and user-friendly experience. + +#### Breaking: +- Replaced `event_audience` with `audience` as argument name for consistency across the framework. Before, some functions used the first and some the second term (Commit: 644fb2982f8d83420736382c75a89ee231464eef). 
+ +#### Enhancements: +- **Workflow Enhancement**: Added support for summarizing meetings without an agenda in the workflow. Before, the full workflow function was designed to only work with long meetings organized in sub-talks described by an agenda. (Commit: 644fb2982f8d83420736382c75a89ee231464eef). +- **Entity Extraction Functionality**: Introduced the `entity_extractor` function to identify and extract entities such as people, acronyms, organizations, and concepts from a text, which can be particularly useful for building vocabularies for LLMs from event descriptions or transcripts (Commit: ae4fc3cebf025f0331ffd6eeb82e3be47d4cf3af). +- **Agenda Management**: Added the ability to manage deviations from the expected agenda, allowing the LLM to add talks not described in the expected agenda, enhancing the flexibility of the summarization process (Commit: 40f7620a43684ace41b0aa44e20c8ed1dc8eab00). +- **Support for MS Teams VTT Files**: Implemented support for importing transcripts from MS Teams VTT files, which do not follow the standard VTT specification (Commit: cfa96733d86879ca4977c65a8d8b58eace108af2). +- **Output Quality Improvements**: Utilized the `styler` package to enhance the readability of generated agendas and unformatted summary outputs, contributing to better readability and user experience (Commit: 194b8c8c45bf09e1f8f3cabec6c6d362ee950f0f). + +#### Fixes: +- **Agenda Generation Bug**: Resolved an issue where the agenda generation was creating infinite unnamed speaker lists, exhausting the context window (Commit: bfc5597bd453518960b6208268725ee0a3157dba). + +#### Dependencies: +- **Styler Package Addition**: Added the `styler` package and its dependencies to the project, which is used to improve the formatting of the generated outputs (Commit: e88a6bdd76ff0fa51902894d820ff9461addedb9). 
+ +# minutemaker 0.6.0 + +#### Major Enhancements: +- Introduced the `infer_agenda_from_transcript` function to automate the generation of an event agenda by analyzing a given transcript, aiming to identify and extract key segments and construct a structured overview of the event's proceedings. This enhancement significantly streamlines the workflow for summarizing meetings and conferences. (Commit: c458b0d9f9ebe7b20ad1775c44beb69712cfa933) + +#### Minor Improvements and Fixes: +- Enhanced error handling for transcription processes, including managing empty transcription JSON files and transcription files with no speaker information. (Commits: 3c4e877f5d953abd93c54f8e6ae04d94f51843ba, 41b823add34d18f0feeaed54d45a00a601a9b8e0) +- Improved the summarization process by adding checks to handle cases where a transcript subset for a talk is empty and ensuring the final result tree is not empty. (Commit: b66b912cd0c7e802c4b9b33e2d9f279e64db172d) +- Addressed various minor issues, including dependency installation, handling of integers as agenda times, and managing fatal Whisper API errors. (Commits: b1daf88cb68969c1bfb2c14cbfed2736306150ec, 4a2d159575f5a4e1ebb504820f92387e863af8a9, b66b912cd0c7e802c4b9b33e2d9f279e64db172d) + +#### Development and Maintenance: +- Cleaned up unused code and improved the robustness of the LLM prompt function. (Commits: e9afb2d44b4f9de1c3ab3dc30905a9295954edc5, 2e7abbc8249c85d1e3055ac65ad7a679fbfe5628) +- Started using `renv` for dev reproducibility. (Commit: 3b18519190ae185022ee7023ac02aae9b77c7148) + +# minutemaker 0.5.0 + +### Time management improvements + +This release provides many extra features in time management. + +The agenda times now can be inserted in a more natural format, e.g 10:30, 10:30:25, 17:00, 09:00 PM, etc +Also the transcript will have clock times in addition to seconds and time from the event start. 
+Finally, the formatted output will show the start and end times of each talk in a format chosen by the user (defaults to HH:MM). + +The user will need to provide the starting time of the event to harness such functionalities. + +# minutemaker 0.4.0 + +The addition of the "rolling window" summarization method, which, instead of performing summarization of a long transcript in one go, splits it into chunks and each is summarised. Then, those summaries are aggregated into one comprehensive summary. +Inclusion of the audio splitting feature in the workflow function which now really works as a one-stop solution from a source audio file to the formatted summary. Also added a folder argument to the function, so that users can point it to a folder with the right input files and have the function do the rest, with very few lines of code. +Improved prompt templates + +#### New Features +- Added the new "rolling window" summarization method. + +#### Improvements +- Implemented the audio splitting into speech_to_summary_workflow. +- speech_to_summary_workflow now takes a target_dir argument which allows to encapsulate and perform the whole workflow in one folder. Very few arguments are necessary to run the whole workflow now. +- Updated the interrogate_llm function with a log_request parameter to hide/show LLM technical messages. +- Enhanced the default prompt, the prompt fine-tuning and the prompt generation system. +-Updated the outputs0 nomenclature. +- Updated default segment_duration for the split_audio function to 40 minutes. It improves transcriptions. + +# minutemaker 0.3.0 + +#### New Features +- Allows to import transcripts and chat files generated from WebEx. +- Transcript merging uses the Glove text embedding model from the `text2vec` package to match segments between transcripts. +- Implemented a full speech-to-summary workflow, handling audio files to generate human-readable summaries. 
+ +#### Enhancements +- Improved speech-to-text functionality to handle folder inputs and initial prompts. +- Updated summarization functions to manage different speakers. +- Expanded import functionality to support subtitle file formats with speaker information. +- Provided utility functions to handle silent segments in transcripts in a coherent way. +- Implemented a series of heuristics to remove Whisper hallucinations from transcripts. + +Plus various bug fixes. + +# minutemaker 0.2.0 + +First stable version diff --git a/R/LLM_calls.R b/R/LLM_calls.R index 5e946fa..3a034d1 100644 --- a/R/LLM_calls.R +++ b/R/LLM_calls.R @@ -156,7 +156,7 @@ process_messages <- function(messages) { #' interrogate_llm <- function( messages = NULL, - provider = c("openai", "azure", "local"), + provider = getOption("minutemaker_llm_provider"), params = list( temperature = 0 ), @@ -167,6 +167,12 @@ interrogate_llm <- function( messages <- process_messages(messages) provider <- match.arg(provider) + if (is.null(provider)) { + stop("Language model provider is not set. ", + "You can use the following option to set it globally:\n", + "minutemaker_llm_provider.") + } + if (log_request) { check_and_install_dependencies("tictoc") } diff --git a/R/data_management.R b/R/data_management.R index af715db..8410537 100644 --- a/R/data_management.R +++ b/R/data_management.R @@ -580,7 +580,61 @@ format_summary_tree <- function( invisible(output) } +#' Formats an agenda into a human-readable text +#' +#' The agenda functions return a machine-readable agenda. This function takes +#' an agenda in R list format as generated by the `infer_agenda_from_transcript` +#' function and formats it into a human-readable text. +#' +#' @param agenda A list containing the agenda items. It contains several pieces of +#' information about each talk, such as the session, title, speakers, +#' moderators, and start and end times.
+#' @param event_start_time If agenda timings are in seconds, the starting time +#' is needed to convert them to actual clock time. If `NULL` it will use the +#' timing as reported in the agenda. +#' +#' @return A string containing the formatted agenda, invisibly. +#' +#' @export +format_agenda <- function( + agenda, + event_start_time = getOption("minutemaker_event_start_time") +) { + + # Import agenda from file + if (is.character(agenda)) { + agenda <- dget(agenda) + } + + # Initialize the output string + output <- "" + + # Convert times from seconds to clock time if possible + agenda <- convert_agenda_times( + agenda, convert_to = "clock", + event_start_time = event_start_time, conversion_format = "%R" + ) + # Generate a text version of the agenda, with session, title, description, + # speakers, moderators and times, skipping NULL/empty fields + purrr::map_chr(agenda, ~ { + .x$speakers <- stringr::str_flatten_comma(.x$speakers) + .x$moderators <- stringr::str_flatten_comma(.x$moderators) + .x$session <- ifelse(is.null(.x$session), "", .x$session) + .x$title <- ifelse(is.null(.x$title), "", .x$title) + .x$description <- ifelse(is.null(.x$description), "", .x$description) + + stringr::str_glue_data(.x, + "Session: {session}; + Title: {title}; + Description: {description}; + Speakers: {speakers}; + Moderators: {moderators}; + Time: {from} - {to};") |> + stringr::str_remove_all("\\n?\\w+:\\s*;") |> # Remove empty fields + stringr::str_replace_all("\\.;", ";") + }) |> paste(collapse = "\n\n####################\n\n") +} #' Import transcript from subtitle file @@ -1014,16 +1068,22 @@ add_chat_transcript <- function( #' file where the automatically generated agenda will be written. Should be a #' .R file. See `infer_agenda_from_transcript` for more details. #' @param extra_agenda_generation_args Additional arguments passed to the -#' `infer_agenda_from_transcript` function. See `infer_agenda_from_transcript` -#' for more details.
Note that the `diarization_instructions` argument for this -#' function will be taken from the `extra_agenda_generation_args` argument. +#' `infer_agenda_from_transcript` function. See `infer_agenda_from_transcript` +#' for more details. Note that the `diarization_instructions` argument for +#' this function will be taken from the `extra_agenda_generation_args` +#' argument. #' @param summarization_method A string indicating the summarization method to #' use. See `summarise_full_meeting` for more details. +#' @param multipart_summary If a valid agenda is provided, this argument allows +#' the user to specify whether the summarisation should be done in parts, one +#' for each agenda element using the `summarise_full_meeting` function, or in +#' one go using the `summarise_transcript` function. See the respective +#' functions for more details. #' @param event_description A string containing a description of the meeting. #' See `summarise_transcript` for more details. -#' @param audience A string containing a description of the audience of -#' the meeting and what to focus on in the summary. See `summarise_transcript` -#' for more details. +#' @param audience A string containing a description of the audience of the +#' meeting and what to focus on in the summary. See `summarise_transcript` for +#' more details. #' @param vocabulary A character vector of specific vocabulary words, names, #' definitions, to help the LLM recognise misspellings and abbreviations. See #' `summarise_transcript` for more details. @@ -1031,9 +1091,9 @@ add_chat_transcript <- function( #' should take into account the diarization of the transcript. See #' `summarise_transcript` for more details. #' @param summary_structure,extra_diarization_instructions,extra_output_instructions -#' Specific instructions necessary to build the summarisation prompt. See -#' `summarise_transcript` for more details and run `get_prompts()` to see the -#' defaults. See `summarise_transcript` for more details. 
+#' Specific instructions necessary to build the summarisation prompt. See +#' `summarise_transcript` for more details and run `get_prompts()` to see the +#' defaults. See `summarise_transcript` for more details. #' @param llm_provider A string indicating the LLM provider to use for the #' summarization. See `summarise_transcript` for more details. #' @param extra_summarise_args Additional arguments passed to the @@ -1101,6 +1161,8 @@ speech_to_summary_workflow <- function( agenda_generation_output_file = file.path(target_dir, "agenda.R"), extra_agenda_generation_args = NULL, + # Arguments for the actual summarization + multipart_summary = validate_agenda(agenda), event_description = NULL, audience = "An audience with understanding of the topic", vocabulary = NULL, @@ -1108,7 +1170,7 @@ speech_to_summary_workflow <- function( summary_structure = get_prompts("summary_structure"), extra_diarization_instructions = NULL, extra_output_instructions = NULL, - llm_provider = NULL, + llm_provider = getOption("minutemaker_llm_provider"), extra_summarise_args = NULL, summarization_window_size = 15, summarization_output_length = 3, @@ -1204,7 +1266,7 @@ speech_to_summary_workflow <- function( ## Create the transcript file ## # Check if the transcript file doesn't exists or overwrite is TRUE - if (overwrite_transcript || !file.exists(transcript_file)) { + if (isTRUE(overwrite_transcript) || !file.exists(transcript_file)) { # Generate the trascript from the json output data transcript_data <- parse_transcript_json( @@ -1326,8 +1388,20 @@ speech_to_summary_workflow <- function( agenda <- do.call(infer_agenda_from_transcript, agenda_infer_args) + # Ask the user if they want to proceed with the generated agenda or review + # it first message("Agenda generated. 
Please review it before proceeding.") - return(invisible(transcript_data)) + + # Don't ask the user if the process is not interactive, just stop + if (!interactive()) { + return(invisible(transcript_data)) + } + + choice <- utils::menu(c("Yes", "No"), title = "Do you want to proceed?") + + if (choice == 2) { + stop("Aborted by user.", call. = FALSE) + } } } @@ -1338,31 +1412,46 @@ speech_to_summary_workflow <- function( } # Manage situations where the formatted output file exists - if (!is.null(formatted_output_file) && - isFALSE(overwrite_formatted_output) && + if (!purrr::is_empty(formatted_output_file) && file.exists(formatted_output_file)) { - if (interactive()) { - choice <- utils::menu( - choices = c( - "Overwrite the existing formatted summary file", - "Abort the process" - ), - title = "The formatted summary output file already exists and overwrite is FALSE. What do you want to do?" - ) - - if (choice == 2) { - message("Aborted by user.") - return(invisible(transcript_data)) - - } else { - message("Overwriting the existing formatted summary file.") - } - } else { - message("The formatted summary output file already exists and overwrite is FALSE.\nSet overwrite_formatted_output = TRUE to overwrite it or remove it.") + if (isTRUE(overwrite_formatted_output)) { + message("WARNING: Overwriting the existing summary output.\n", + "Stop the process if you want to keep the existing file.") + } else if (isFALSE(overwrite_formatted_output)) { + message( + "The formatted summary output file already exists and ", + "overwrite is FALSE.\n", + "Set overwrite_formatted_output = TRUE to overwrite it or remove it.") return(invisible(transcript_data)) + } else { + stop("The overwrite_formatted_output argument must be TRUE or FALSE") } + # isFALSE(overwrite_formatted_output) && + # file.exists(formatted_output_file)) { + + # if (interactive()) { + # choice <- utils::menu( + # choices = c( + # "Overwrite the existing formatted summary file", + # "Abort the process" + # ), + #
title = "The formatted summary output file already exists and overwrite is FALSE. What do you want to do?" + # ) + # + # if (choice == 2) { + # message("Aborted by user.") + # return(invisible(transcript_data)) + # + # } else { + # message("Overwriting the existing formatted summary file.") + # } + # } else { + # message("The formatted summary output file already exists and overwrite is FALSE.\nSet overwrite_formatted_output = TRUE to overwrite it or remove it.") + # return(invisible(transcript_data)) + # } + } # Common summarization arguments @@ -1385,9 +1474,20 @@ speech_to_summary_workflow <- function( provider = llm_provider ), extra_summarise_args) - if (isFALSE(agenda)) { + if (isFALSE(agenda) || isFALSE(multipart_summary)) { # Summarize as single talk + if (validate_agenda(agenda)) { + agenda <- format_agenda(agenda) + + #TODO: put this prompt in the set_prompts function + summarization_args$summary_structure <- stringr::str_glue(" + {summary_structure} + Here is an agenda of the event to take into account while summarizing: + {agenda} + Strictly follow the agenda to understand which information is worth summarizing.") + } + formatted_summary <- do.call(summarise_transcript, summarization_args) return_vec <- c("transcript_data", "formatted_summary") diff --git a/R/summarization.R b/R/summarization.R index 507dfb3..3dffe9d 100644 --- a/R/summarization.R +++ b/R/summarization.R @@ -358,6 +358,10 @@ summarise_full_meeting <- function( ... ) { + if (!validate_agenda(agenda)) { + stop("The agenda is not valid.") + } + # Import agenda from file if (is.character(agenda)) { agenda <- dget(agenda) diff --git a/R/validation.R b/R/validation.R index 0738dad..3e2ae30 100644 --- a/R/validation.R +++ b/R/validation.R @@ -81,6 +81,59 @@ validate_agenda_element <- function( is_valid } + +#' Validates an agenda +#' +#' Checks if the agenda is a list (or a file path to a list) and if it is not +#' empty and if all its elements are valid.
+#' +#' @param agenda A list containing the agenda or a file path to it. +#' @param ... Additional arguments to be passed to `validate_agenda_element`. +#' +#' @return A boolean indicating whether the agenda is valid. +#' +#' @export +#' +#' @examples +#' validate_agenda(list( +#' list( +#' session = "Session 1", +#' title = "Opening Session", +#' speakers = "John Doe", +#' moderators = "Jane Doe", +#' type = "conference talk", +#' from = "09:00 AM", +#' to = "10:00 AM" +#' ), +#' list() +#' ), session = TRUE, title = TRUE, speakers = TRUE, moderators = TRUE, +#' type = TRUE, from = TRUE, to = TRUE) +#' #> [1] FALSE # Because the second element is empty +#' +validate_agenda <- function( + agenda, + ... +) { + + # Check if the agenda is a file path + if (!purrr::is_empty(agenda) && is.character(agenda) && file.exists(agenda)){ + agenda <- dget(agenda) + } + + # Check if the agenda is a list + if (!is.list(agenda)) { + return(FALSE) + } + + # Check if the agenda is empty + if (length(agenda) == 0) { + return(FALSE) + } + + # Check if the agenda elements are valid + purrr::map_lgl(agenda, ~ validate_agenda_element(.x, ...)) |> all() +} + #' Validate summary tree id consistency #' #' @param summary_tree A list containing the summary tree or a file path to it. diff --git a/README.Rmd b/README.Rmd index 6795020..203c9ee 100644 --- a/README.Rmd +++ b/README.Rmd @@ -103,6 +103,12 @@ options( minutemaker_local_endpoint_gpt = "local-host-path-to-model" ) +# Set the preferred LLM globally + +options( + minutemaker_llm_provider = "***" # E.g. "openai", "azure", "local" or custom +) + ``` These setting can be also passed manually to the various functions, but the @@ -339,9 +345,13 @@ produce a summary of the given length, but the actual length may vary depending on the transcript content (and the LLM idiosyncrasies). 
The following code shows how to set all the arguments for the summarization -process; but it's possible to -use the function with only the transcript data to get a summary. The other -arguments only provide additional contextual information to improve the summary. +process; but it's possible to use the function with only the transcript data to +get a summary. The other arguments only provide additional contextual +information to improve the summary. + +For example, an agenda generated via `infer_agenda_from_transcript()` can be +formatted into text using `format_agenda()` and then added to the +`summary_structure` argument (see example below). ```{r, eval = FALSE} @@ -394,7 +404,7 @@ recording_details <- generate_recording_details( ## The audience of the meeting/conference, which helps the summarisation to ## focus on specific topics and helps setting the tone of the summary. -event_audience <- "An audience with some expertise in the topic +audience <- "An audience with some expertise in the topic of the conference and with a particular interest in this and that." # Not mandatory but may help, for example with hybrid events where the room @@ -410,9 +420,21 @@ context if not reported" # summarisation section with: summary_structure <- paste0( get_prompts("summary_structure"), - "\n- My Extra section" + "\n- My extra summarisation instruction" ) +# The user can also use the summarisation instructions to add an agenda to drive +# the summarisation focus: +agenda <- format_agenda(agenda) +summary_structure <- get_prompts("summary_structure") + +summary_structure <- stringr::str_glue(" + {summary_structure} + Here is an agenda of the event to take into account while summarizing: + {agenda} + Strictly follow the agenda to understand which information is worth summarizing.
+") + # Finally, the user can add extra output instructions to the default ones (check # them using get_prompts("output_summarisation") for the summarisation and # get_prompts("output_rolling_aggregation") for the rolling aggregation). For @@ -427,7 +449,7 @@ talk_summary <- summarise_transcript( method = method, event_description = event_description, recording_details = recording_details, - audience = event_audience, + audience = audience, vocabulary = vocabulary, summary_structure = summary_structure, # Or don't pass it to use the default @@ -466,7 +488,7 @@ conference_summaries <- summarise_full_meeting( output_file = "path-to-output-file.R", event_description = event_description, - event_audience = event_audience, + audience = audience, vocabulary = vocabulary, summary_structure = summary_structure, @@ -519,7 +541,7 @@ used to perform all the steps in one go. ```{r, eval = FALSE} -# initial_prompt, event_description, event_audience, +# initial_prompt, event_description, audience, # vocabulary, diarization_instructions are defined in the previous examples # Perform the whole audio files to formatted summary workflow. Most arguments @@ -560,17 +582,23 @@ speech_to_summary_workflow( agenda_generation_output_file = file.path(target_dir, "agenda.R"), extra_agenda_generation_args = NULL, + # Whether to produce a summary for each agenda item (TRUE) or just an overall + # summary that takes the agenda into account to focus the + # summarization (FALSE). Defaults to TRUE if the agenda exists and is in the + # correct format.
+ multipart_summary = validate_agenda(agenda), + summarization_output_file = "event_summary.R", event_description = event_description, - event_audience = event_audience, + audience = audience, vocabulary = vocabulary, # you can pass summary_sections, diarization_instructions, or # output_instructions to override the default prompts diarization_instructions = diarization_instructions, - llm_provider = "my-LLM-provider-of-choice", + llm_provider = getOption("minutemaker_llm_provider"), overwrite_summary_tree = FALSE, diff --git a/README.md b/README.md index 4e01e8c..4b4a49a 100644 --- a/README.md +++ b/README.md @@ -95,6 +95,12 @@ options( # Local LLM model (for text summary) minutemaker_local_endpoint_gpt = "local-host-path-to-model" ) + +# Set the preferred LLM globally + +options( + minutemaker_llm_provider = "***" # E.g. "openai", "azure", "local" or custom +) ``` These setting can be also passed manually to the various functions, but @@ -335,6 +341,10 @@ summarization process; but it’s possible to use the function with only the transcript data to get a summary. The other arguments only provide additional contextual information to improve the summary. +For example an agenda generated via `infer_agenda_from_transcript()` can +be formatted into text using `format_agenda()` and then added to +`summary_structure` argument (see example below). + ``` r # Construct the path to the transcript file @@ -386,7 +396,7 @@ recording_details <- generate_recording_details( ## The audience of the meeting/conference, which helps the summarisation to ## focus on specific topics and helps setting the tone of the summary. -event_audience <- "An audience with some expertise in the topic +audience <- "An audience with some expertise in the topic of the conference and with a particular interest in this and that." 
# Not mandatory but may help, for example with hybrid events where the room @@ -402,9 +412,21 @@ context if not reported" # summarisation section with: summary_structure <- paste0( get_prompts("summary_structure"), - "\n- My Extra section" + "\n- My extra summarisation instruction" ) +# The user can also use the summarisation instructions to add an agenda to drive +# the summarisation focus: +agenda <- format_agenda(agenda) +summary_structure <- get_prompts("summary_structure") + +summary_structure <- stringr::str_glue(" + {summary_structure} + Here is an agenda of the event to take into account while summarizing: + {agenda} + Strictly follow the agenda to understand which information is worth summarizing. +") + # Finally, the user can add extra output instructions to the default ones (check # them using get_prompts("output_summarisation") for the summarisation and # get_prompts("output_rolling_aggregation") for the rolling aggregation). For @@ -419,7 +441,7 @@ talk_summary <- summarise_transcript( method = method, event_description = event_description, recording_details = recording_details, - audience = event_audience, + audience = audience, vocabulary = vocabulary, summary_structure = summary_structure, # Or don't pass it to use the default @@ -458,7 +480,7 @@ conference_summaries <- summarise_full_meeting( output_file = "path-to-output-file.R", event_description = event_description, - event_audience = event_audience, + audience = audience, vocabulary = vocabulary, summary_structure = summary_structure, @@ -509,7 +531,7 @@ can be used to perform all the steps in one go. ``` r -# initial_prompt, event_description, event_audience, +# initial_prompt, event_description, audience, # vocabulary, diarization_instructions are defined in the previous examples # Perform the whole audio files to formatted summary workflow.
Most arguments @@ -550,17 +572,23 @@ speech_to_summary_workflow( agenda_generation_output_file = file.path(target_dir, "agenda.R"), extra_agenda_generation_args = NULL, + # Whether to produce a summary for each agenda item (TRUE) or just an overall + # summary that takes the agenda into account to focus the + # summarization (FALSE). Defaults to TRUE if the agenda exists and is in the + # correct format. + multipart_summary = validate_agenda(agenda), + summarization_output_file = "event_summary.R", event_description = event_description, - event_audience = event_audience, + audience = audience, vocabulary = vocabulary, # you can pass summary_sections, diarization_instructions, or # output_instructions to override the default prompts diarization_instructions = diarization_instructions, - llm_provider = "my-LLM-provider-of-choice", + llm_provider = getOption("minutemaker_llm_provider"), overwrite_summary_tree = FALSE, diff --git a/man/format_agenda.Rd b/man/format_agenda.Rd new file mode 100644 index 0000000..bf29d71 --- /dev/null +++ b/man/format_agenda.Rd @@ -0,0 +1,28 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/data_management.R +\name{format_agenda} +\alias{format_agenda} +\title{Formats an agenda into a human-readable text} +\usage{ +format_agenda( + agenda, + event_start_time = getOption("minutemaker_event_start_time") +) +} +\arguments{ +\item{agenda}{A list containing the agenda items. It contains several pieces of +information about each talk, such as the session, title, speakers, +moderators, and start and end times.} + +\item{event_start_time}{If agenda timings are in seconds, the starting time +is needed to convert them to actual clock time. If \code{NULL} it will use the +timing as reported in the agenda.} +} +\value{ +A string containing the formatted agenda, invisibly. +} +\description{ +The agenda functions return a machine-readable agenda.
This function takes +an agenda in R list format as generated by the \code{infer_agenda_from_transcript} +function and formats it into a human-readable text. +} diff --git a/man/interrogate_llm.Rd b/man/interrogate_llm.Rd index e6a56be..15de8e5 100644 --- a/man/interrogate_llm.Rd +++ b/man/interrogate_llm.Rd @@ -6,7 +6,7 @@ \usage{ interrogate_llm( messages = NULL, - provider = c("openai", "azure", "local"), + provider = getOption("minutemaker_llm_provider"), params = list(temperature = 0), force_json = FALSE, log_request = getOption("minutemaker_log_requests", TRUE), diff --git a/man/speech_to_summary_workflow.Rd b/man/speech_to_summary_workflow.Rd index 5449836..aae3c6e 100644 --- a/man/speech_to_summary_workflow.Rd +++ b/man/speech_to_summary_workflow.Rd @@ -31,6 +31,7 @@ speech_to_summary_workflow( agenda_generation_window_size = 3600, agenda_generation_output_file = file.path(target_dir, "agenda.R"), extra_agenda_generation_args = NULL, + multipart_summary = validate_agenda(agenda), event_description = NULL, audience = "An audience with understanding of the topic", vocabulary = NULL, @@ -38,7 +39,7 @@ speech_to_summary_workflow( summary_structure = get_prompts("summary_structure"), extra_diarization_instructions = NULL, extra_output_instructions = NULL, - llm_provider = NULL, + llm_provider = getOption("minutemaker_llm_provider"), extra_summarise_args = NULL, summarization_window_size = 15, summarization_output_length = 3, @@ -150,15 +151,22 @@ file where the automatically generated agenda will be written. Should be a \item{extra_agenda_generation_args}{Additional arguments passed to the \code{infer_agenda_from_transcript} function. See \code{infer_agenda_from_transcript} -for more details. Note that the \code{diarization_instructions} argument for this -function will be taken from the \code{extra_agenda_generation_args} argument.} +for more details. 
Note that the \code{diarization_instructions} argument for +this function will be taken from the \code{extra_agenda_generation_args} +argument.} + +\item{multipart_summary}{If a valid agenda is provided, this argument allows +the user to specify whether the summarisation should be done in parts, one +for each agenda element using the \code{summarise_full_meeting} function, or in +one go using the \code{summarise_transcript} function. See the respective +functions for more details.} \item{event_description}{A string containing a description of the meeting. See \code{summarise_transcript} for more details.} -\item{audience}{A string containing a description of the audience of -the meeting and what to focus on in the summary. See \code{summarise_transcript} -for more details.} +\item{audience}{A string containing a description of the audience of the +meeting and what to focus on in the summary. See \code{summarise_transcript} for +more details.} \item{vocabulary}{A character vector of specific vocabulary words, names, definitions, to help the LLM recognise misspellings and abbreviations. See diff --git a/man/validate_agenda.Rd b/man/validate_agenda.Rd new file mode 100644 index 0000000..88b1581 --- /dev/null +++ b/man/validate_agenda.Rd @@ -0,0 +1,37 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/validation.R +\name{validate_agenda} +\alias{validate_agenda} +\title{Validates an agenda} +\usage{ +validate_agenda(agenda, ...) +} +\arguments{ +\item{agenda}{A list containing the agenda or a file path to it.} + +\item{...}{Additional arguments to be passed to \code{validate_agenda_element}.} +} +\value{ +A boolean indicating whether the agenda is valid. +} +\description{ +Checks if the agenda is a list (or a file path to a list) and if it is not +empty and if all its elements are valid. 
+} +\examples{ +validate_agenda(list( + list( + session = "Session 1", + title = "Opening Session", + speakers = "John Doe", + moderators = "Jane Doe", + type = "conference talk", + from = "09:00 AM", + to = "10:00 AM" + ), + list() + ), session = TRUE, title = TRUE, speakers = TRUE, moderators = TRUE, + type = TRUE, from = TRUE, to = TRUE) + #> [1] FALSE # Because the second element is empty + +}