Merge pull request #20 from bakaburg1/Dev

Enhanced Agenda Management and Utilization
bakaburg1 · Apr 16, 2024 · 5fd6494
2 parents b970cd3 + 6cc8b32
commit 5fd6494
Showing 14 changed files with 454 additions and 59 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -5,3 +5,4 @@
 ^LICENSE\.md$
 ^README\.Rmd$
 ^cran-comments\.md$
+^test\.R$
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: minutemaker
 Title: GenAI-based meeting and conferences minutes generator
-Version: 0.6.0
+Version: 0.8.0
 Authors@R: 
     person("Angelo", "D'Ambrosio", , "a.dambrosioMD@gmail.com", role = c("aut", "cre"),
            comment = c(ORCID = "0000-0002-2045-5155"))

diff --git a/NAMESPACE b/NAMESPACE
@@ -4,6 +4,7 @@ export(add_chat_transcript)
 export(clean_transcript)
 export(entity_extractor)
 export(extract_text_from_transcript)
+export(format_agenda)
 export(format_summary_tree)
 export(generate_recording_details)
 export(get_prompts)
@@ -19,6 +20,7 @@ export(speech_to_summary_workflow)
 export(split_audio)
 export(summarise_full_meeting)
 export(summarise_transcript)
+export(validate_agenda)
 import(dplyr)
 importFrom(stats,setNames)
 importFrom(utils,hasName)

diff --git a/NEWS.md b/NEWS.md
@@ -0,0 +1,100 @@
+# minutemaker 0.8.0
+
+### Enhanced Agenda Management and Utilization
+
+#### Enhancements:
+- Added a new `multipart_summary` argument in `speech_to_summary_workflow()` that lets users choose between summarizing each agenda item separately (the previous approach, now the default) and producing a single summary that uses the agenda only to focus the model (Commit: 99168d4168e6394e1789d7ae0dadb1e3b37b006d).
+- Introduced `format_agenda()` function to convert machine-readable agendas into human-readable text, improving the usability of agenda-driven summarization (Commit: 0d27980f21ea3ab1df2ed289df421b7034b2fd22).
+- Added `validate_agenda()` function to ensure the validity of agenda structures before summarization, enhancing the reliability of the summarization process (Commit: 5e943af522529edea6166ab62da41cbfa7a08ebb).
+- Added the ability for users to proceed with the summarization workflow after agenda generation without re-running the entire workflow function, streamlining the user experience (Commit: 8056ed6763368347efa4c1f79ae9497ca7ff1597).
+- Changed the summarization workflow so that it no longer asks whether the user wants to overwrite the summarization output when `overwrite_formatted_output` is `FALSE` (Commit: 99168d4168e6394e1789d7ae0dadb1e3b37b006d).
+- Implemented global configuration for the language model (LLM) provider via `getOption("minutemaker_llm_provider")`, allowing for more flexible and centralized LLM provider management (Commit: 159335d2d413462c93f5dd3f60c1919e1b2f8918).
+- Updated `interrogate_llm()` to retrieve the LLM provider setting from global options, providing a more dynamic and user-friendly approach to specifying the LLM provider (Commit: 15723d6f8c03c648deb00b1705837eadef04b609).
+
+#### Fixes:
+- Addressed an issue where the summarization process could fail due to invalid agendas by implementing the `validate_agenda()` function (Commit: 6bdabadbb3bbee66e20907f5f7dec4330842ed39).
+
+# minutemaker 0.7.0
+
+### Manage events without agendas in the summarisation workflow
+
+This pull request includes a series of enhancements and fixes that improve the transcript summarization workflow, add new functionality for entity extraction, and ensure better support for various transcript formats. The changes also address code quality and documentation to provide a more robust and user-friendly experience.
+
+#### Breaking:
+- Replaced `event_audience` with `audience` as the argument name for consistency across the framework. Previously, some functions used the former and some the latter (Commit: 644fb2982f8d83420736382c75a89ee231464eef).
+
+#### Enhancements:
+- **Workflow Enhancement**: Added support for summarizing meetings without an agenda in the workflow. Previously, the full workflow function was designed to work only with long meetings organized in sub-talks described by an agenda (Commit: 644fb2982f8d83420736382c75a89ee231464eef).
+- **Entity Extraction Functionality**: Introduced the `entity_extractor` function to identify and extract entities such as people, acronyms, organizations, and concepts from a text, which can be particularly useful for building vocabularies for LLMs from event descriptions or transcripts (Commit: ae4fc3cebf025f0331ffd6eeb82e3be47d4cf3af).
+- **Agenda Management**: Added the ability to manage deviations from the expected agenda, allowing the LLM to add talks not described in the expected agenda, enhancing the flexibility of the summarization process (Commit: 40f7620a43684ace41b0aa44e20c8ed1dc8eab00).
+- **Support for MS Teams VTT Files**: Implemented support for importing transcripts from MS Teams VTT files, which do not follow the standard VTT specification (Commit: cfa96733d86879ca4977c65a8d8b58eace108af2).
+- **Output Quality Improvements**: Utilized the `styler` package to improve the readability of generated agendas and unformatted summary outputs (Commit: 194b8c8c45bf09e1f8f3cabec6c6d362ee950f0f).
+
+#### Fixes:
+- **Agenda Generation Bug**: Resolved an issue where agenda generation produced endless lists of unnamed speakers, exhausting the context window (Commit: bfc5597bd453518960b6208268725ee0a3157dba).
+
+#### Dependencies:
+- **Styler Package Addition**: Added the `styler` package and its dependencies to the project, which is used to improve the formatting of the generated outputs (Commit: e88a6bdd76ff0fa51902894d820ff9461addedb9).
+
+# minutemaker 0.6.0
+
+#### Major Enhancements:
+- Introduced the `infer_agenda_from_transcript` function to automate the generation of an event agenda by analyzing a given transcript, aiming to identify and extract key segments and construct a structured overview of the event's proceedings. This enhancement significantly streamlines the workflow for summarizing meetings and conferences. (Commit: c458b0d9f9ebe7b20ad1775c44beb69712cfa933)
+
+#### Minor Improvements and Fixes:
+- Enhanced error handling for transcription processes, including managing empty transcription JSON files and transcription files with no speaker information. (Commits: 3c4e877f5d953abd93c54f8e6ae04d94f51843ba, 41b823add34d18f0feeaed54d45a00a601a9b8e0)
+- Improved the summarization process by adding checks to handle cases where a transcript subset for a talk is empty and ensuring the final result tree is not empty. (Commit: b66b912cd0c7e802c4b9b33e2d9f279e64db172d)
+- Addressed various minor issues, including dependency installation, handling of integers as agenda times, and managing fatal Whisper API errors. (Commits: b1daf88cb68969c1bfb2c14cbfed2736306150ec, 4a2d159575f5a4e1ebb504820f92387e863af8a9, b66b912cd0c7e802c4b9b33e2d9f279e64db172d)
+
+#### Development and Maintenance:
+- Cleaned up unused code and improved the robustness of the LLM prompt function. (Commits: e9afb2d44b4f9de1c3ab3dc30905a9295954edc5, 2e7abbc8249c85d1e3055ac65ad7a679fbfe5628)
+- Started using `renv` for dev reproducibility. (Commit: 3b18519190ae185022ee7023ac02aae9b77c7148)
+
+# minutemaker 0.5.0
+
+### Time management improvements
+
+This release provides many extra features in time management.
+
+Agenda times can now be entered in a more natural format, e.g., 10:30, 10:30:25, 17:00, 09:00 PM.
+The transcript now also includes clock times, in addition to seconds and time from the event start.
+Finally, the formatted output shows the start and end times of each talk in a format chosen by the user (defaults to HH:MM).
+
+The user needs to provide the starting time of the event to use these features.
+
+# minutemaker 0.4.0
+
+Added the "rolling window" summarization method which, instead of summarizing a long transcript in one go, splits it into chunks that are summarised one by one. Those summaries are then aggregated into one comprehensive summary.
+Included the audio splitting feature in the workflow function, which now works as a one-stop solution from a source audio file to the formatted summary. Also added a folder argument to the function, so that users can point it to a folder with the right input files and have the function do the rest with very few lines of code.
+Improved the prompt templates.
+
+#### New Features
+- Added the new "rolling window" summarization method.
+
+#### Improvements
+- Implemented the audio splitting into speech_to_summary_workflow.
+- speech_to_summary_workflow now takes a target_dir argument, which allows the whole workflow to be encapsulated and performed in one folder. Very few arguments are now necessary to run the whole workflow.
+- Updated the interrogate_llm function with a log_request parameter to hide/show LLM technical messages.
+- Enhanced the default prompt, the prompt fine-tuning and the prompt generation system.
+- Updated the outputs nomenclature.
+- Updated default segment_duration for the split_audio function to 40 minutes. It improves transcriptions.
+
+# minutemaker 0.3.0
+
+#### New Features
+- Added support for importing transcripts and chat files generated from WebEx.
+- Transcript merging uses the GloVe text embedding model from the `text2vec` package to match segments between transcripts.
+- Implemented a full speech-to-summary workflow, handling audio files to generate human-readable summaries.
+
+#### Enhancements
+- Improved speech-to-text functionality to handle folder inputs and initial prompts.
+- Updated summarization functions to manage different speakers.
+- Expanded import functionality to support subtitle file formats with speaker information.
+- Provided utility functions to handle silent segments in transcripts in a coherent way.
+- Implemented a series of heuristics to remove Whisper hallucinations from transcripts.
+
+Plus various bug fixes.
+
+# minutemaker 0.2.0
+
+First stable version
diff --git a/R/LLM_calls.R b/R/LLM_calls.R
@@ -156,7 +156,7 @@ process_messages <- function(messages) {
 #'
 interrogate_llm <- function(
     messages = NULL,
-    provider = c("openai", "azure", "local"),
+    provider = getOption("minutemaker_llm_provider"),
     params = list(
       temperature = 0
     ),
@@ -167,6 +167,12 @@ interrogate_llm <- function(
   messages <- process_messages(messages)
   provider <- match.arg(provider)
 
+  if (is.null(provider)) {
+    stop("Language model provider is not set. ",
+         "You can use the following option to set it globally:\n",
+         "minutemaker_llm_provider.")
+  }
+
   if (log_request) {
     check_and_install_dependencies("tictoc")
   }
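
The hunk above moves the provider choice from a hard-coded default to the `minutemaker_llm_provider` global option. A minimal sketch of how a user might set that option, together with one possible way to combine the option lookup with validation against the provider names from the removed default (`"openai"`, `"azure"`, `"local"`). The helper `resolve_provider()` is hypothetical and is not part of the package; it only illustrates the lookup-then-validate pattern.

```r
# Set the provider once per session (this could also live in .Rprofile).
options(minutemaker_llm_provider = "openai")

# Hypothetical helper (not part of minutemaker): read the provider from the
# global option unless one is passed explicitly, then validate it.
resolve_provider <- function(provider = getOption("minutemaker_llm_provider")) {
  # Fail early with an actionable message if nothing was configured.
  if (is.null(provider)) {
    stop("Language model provider is not set. ",
         "Set it with options(minutemaker_llm_provider = ...).")
  }
  # Passing `choices` explicitly means an explicit argument is checked against
  # the full list of providers rather than against the current option value.
  match.arg(provider, choices = c("openai", "azure", "local"))
}

resolve_provider()         # uses the global option
resolve_provider("azure")  # an explicit argument overrides the option
```

Supplying `choices` explicitly is a design choice of this sketch: when `match.arg()` is called without `choices`, it derives them from the argument's default expression, which here would evaluate to only the currently configured option value.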