Implement Agenda Inference from Transcripts and Enhance Summarization Workflow #17
… length This commit introduces additional checks in the summarization process to handle cases where a transcript subset for a talk is empty. It adds a warning to inform the user if the transcript subset is empty due to incorrect event start times or agenda times. Furthermore, it ensures that the final result tree is not empty, and if it is, the process is stopped with an error message indicating that no talks were summarized. This helps in preventing the generation of empty summaries and guides the user to check their input data. Additionally, the commit includes a minor fix to align the indentation of arguments in the `summarise_transcript` function for improved code readability.
they get converted to numeric first
Add a new function `infer_agenda_from_transcript` and related prompt generation functions. This function automates the generation of an event agenda by analyzing a given transcript. It tries to identify and extract key segments from the transcript, which are then used to construct an agenda. The process can be informed by contextual information such as event description, vocabulary, diarization instructions, and an expected agenda to guide the LLM in generating a more accurate and context-aware agenda. Additionally, the function handles JSON parsing errors and adjusts the processing window size dynamically to ensure valid JSON output from the LLM. The agenda inference process is designed to be robust, with the ability to resume from temporary data if the process is interrupted. This enhancement streamlines the workflow for summarizing meetings and conferences by providing a structured overview of the event's proceedings.
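The dynamic window adjustment described in this commit message could look roughly like the sketch below. It is an illustration only, not the package's implementation: `infer_with_retry`, `ask_llm`, `parse_json`, `min_window` and the shrink factor are hypothetical names and values chosen for the example.

```r
# Hypothetical sketch: retry an LLM call with a smaller window whenever the
# returned text cannot be parsed. `ask_llm` and `parse_json` are stand-ins
# supplied by the caller (for example, a wrapper around jsonlite::fromJSON).
infer_with_retry <- function(ask_llm, parse_json, window_size,
                             min_window = 300, shrink = 0.9) {
  while (window_size >= min_window) {
    raw <- ask_llm(window_size)

    # A parsing error is turned into NULL so that the loop can continue
    parsed <- tryCatch(parse_json(raw), error = function(e) NULL)

    if (!is.null(parsed)) {
      return(list(result = parsed, window_size = window_size))
    }

    warning("Invalid JSON; retrying with a smaller window.",
            immediate. = TRUE, call. = FALSE)
    window_size <- window_size * shrink
  }

  stop("Could not obtain valid JSON from the LLM.", call. = FALSE)
}
```

Passing the LLM call and the parser as arguments keeps the retry logic easy to test with fake functions.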
Review Status
Actionable comments generated: 9
Configuration used: CodeRabbit UI
Files ignored due to path filters (7)
- man/generate_agenda_element_prompt.Rd is excluded by: !man/*.Rd
- man/generate_agenda_inference_prompt.Rd is excluded by: !man/*.Rd
- man/infer_agenda_from_transcript.Rd is excluded by: !man/*.Rd
- man/minutemaker.Rd is excluded by: !man/*.Rd
- man/speech_to_summary_workflow.Rd is excluded by: !man/*.Rd
- renv.lock is excluded by: !**/*.lock
- renv/settings.json is excluded by: !**/*.json
Files selected for processing (14)
- .Rprofile (1 hunks)
- DESCRIPTION (2 hunks)
- NAMESPACE (1 hunks)
- R/LLM_calls.R (2 hunks)
- R/data_management.R (8 hunks)
- R/prompts.R (3 hunks)
- R/speech_to_text.R (1 hunks)
- R/summarization.R (3 hunks)
- R/utils.R (2 hunks)
- R/validation.R (1 hunks)
- README.Rmd (3 hunks)
- README.md (3 hunks)
- renv/.gitignore (1 hunks)
- renv/activate.R (1 hunks)
Files skipped from review due to trivial changes (1)
- renv/.gitignore
Additional comments: 25
.Rprofile (1)
- 1-1: The addition of `source("renv/activate.R")` is a good practice for ensuring project-specific R environment setup, which aligns with the PR objectives of adopting `renv` for development reproducibility.
NAMESPACE (1)
- 10-10: Exporting the `infer_agenda_from_transcript` function aligns with the PR objectives of enhancing the package's functionality for automated agenda generation from transcripts. This is a necessary step for making the function available to package users.
DESCRIPTION (4)
- 3-3: Updating the package version to `0.6.0` is appropriate for reflecting the significant enhancements introduced in this update, including the new `infer_agenda_from_transcript` function.
- 7-8: The updated `Description` field now accurately reflects the package's capability of generating meeting minutes from audio recordings or transcripts using speech-to-text and LLMs, aligning with the PR objectives.
- 24-24: Updating `RoxygenNote` to `7.3.1` ensures that the documentation is generated using the latest version of `roxygen2`, which can include improvements and bug fixes.
- 27-32: Adding `devtools` and `usethis` to `Suggests` is a good practice for development and maintenance, as these packages are commonly used for package development tasks but are not required for the package's core functionality.
R/utils.R (1)
- 88-88: The update to `time_to_numeric` to check for "integer" inheritance in addition to other types is a good improvement for handling different time formats more robustly.
R/validation.R (1)
- 44-48: Converting integer times to numeric values in `validate_agenda_element` simplifies the validation process and ensures consistency in handling different time formats. This is a logical improvement.
R/LLM_calls.R (2)
- 38-47: Adding checks for empty messages in `process_messages` and converting a single message into a named list are good improvements for handling different input scenarios more robustly. However, ensure that the conversion logic is clear and well-documented for future maintainability.
  Consider adding a comment explaining the rationale behind converting a single message into a named list for clarity.
- 209-209: The change in `interrogate_llm` to set `call.` to `FALSE` in the warning function call is appropriate for suppressing the function call in the warning message, which can make the warning message cleaner and more user-friendly.
R/speech_to_text.R (1)
- 388-391: Handling the specific HTTP response status code (424) in `use_azure_whisper_stt` with a clear error message is a good practice for improving error handling and user feedback. This enhances the robustness of the function by providing more informative error handling for specific failure scenarios.
README.md (2)
- 300-310: The introduction of the `infer_agenda_from_transcript()` function is a significant enhancement. However, it's crucial to emphasize the potential limitations and accuracy of the inferred agenda. Suggest adding a note about the importance of providing as much contextual information as possible to improve the accuracy of the agenda generation.
  Consider adding a note on the importance of providing detailed contextual information (e.g., event description, expected agenda) to improve the accuracy of the `infer_agenda_from_transcript()` function's output.
- 543-552: The documentation for the `speech_to_summary_workflow()` function mentions the possibility of the LLM inferring the agenda if it doesn't exist. It's important to ensure that users are aware of the need to review and possibly correct the automatically generated agenda for accuracy. Additionally, providing examples or more detailed documentation on how to use the `expected_agenda`, `agenda_generation_window_size`, and other related arguments could enhance user understanding and effectiveness of this feature.
  Enhance the documentation for the `speech_to_summary_workflow()` function, especially regarding the automatic agenda generation feature. Provide examples or detailed explanations for arguments related to agenda generation to help users effectively utilize this feature.
README.Rmd (3)
- 54-55: The suggestion to use an LLM with a >32K long context window for better summarization quality is clear and directly addresses the need for handling large transcripts effectively. This advice is particularly useful for users working with long meetings or talks, ensuring they are aware of the limitations of smaller models and the benefits of larger context windows.
- 310-319: The introduction of the `infer_agenda_from_transcript()` function and the emphasis on reviewing and correcting the inferred agenda are significant enhancements. This functionality automates a previously manual and potentially time-consuming process, improving the package's usability. However, it's crucial that users are reminded to review the automatically generated agenda for accuracy, as the function may not always capture the correct structure of the meeting. This balance between automation and manual verification is well articulated.
- 553-562: The addition of new arguments related to agenda generation in the `speech_to_summary_workflow()` function is a thoughtful integration, allowing users to leverage the new `infer_agenda_from_transcript()` functionality within a comprehensive workflow. This change enhances the package's flexibility and user experience by providing options for automatic agenda generation and customization. It's important to ensure that the documentation for these new arguments is clear and that examples are provided to help users understand how to use them effectively.
R/prompts.R (1)
- 121-126: The `agenda_inference_template` added to the `set_prompts` function provides a template for presenting the transcript in a structured format for agenda inference tasks. This addition is crucial for guiding the LLM in processing transcripts for agenda generation. A few considerations for refinement:
  - Format Consistency: Ensure that the transcript format mentioned ("csv with the start and end time of each segment and the segment text") is consistently used across all functions that process transcripts. This consistency is vital for avoiding confusion and ensuring smooth data handling.
  - Clarification on CSV Format: Given the mention of a CSV format, it might be helpful to include an example or a more detailed description of the expected CSV structure. This clarification can aid developers and users in preparing their data correctly for the agenda inference process.
  Overall, this update is well-aligned with the objectives of enhancing the `minutemaker` package by automating agenda generation. Just ensure that format consistency is maintained and consider providing additional clarification on the expected CSV structure.
R/summarization.R (1)
- 415-424: The addition of a check for an empty transcript subset in the `summarise_full_meeting` function is a crucial improvement for robustness. However, the warning message could be more informative by including suggestions on how to resolve the issue or where to look for potential mistakes in the input data.
  - Consider enhancing the warning message to guide users more effectively.
  - Ensure that this check does not inadvertently skip processing valid data due to edge cases not considered here.
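A more informative warning, along the lines suggested in this comment, might be sketched as follows. The helper name `subset_talk_transcript` and its arguments are hypothetical; the sketch only assumes a transcript data frame with numeric `start` and `end` columns expressed in seconds.

```r
# Hypothetical helper: select the transcript rows that fall inside a talk's
# time range and warn, with a hint on what to check, when none are found.
subset_talk_transcript <- function(transcript, from, to, talk_id = "talk") {
  rows <- transcript[
    transcript$start >= from & transcript$end <= to, , drop = FALSE]

  if (nrow(rows) == 0) {
    warning(
      "The transcript subset for '", talk_id, "' (", from, "-", to,
      " s) is empty. Check the event start time and the agenda times.",
      immediate. = TRUE, call. = FALSE)
  }

  rows
}
```

Naming the talk and its time range in the message should make it easier to spot which agenda entry is misaligned.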
renv/activate.R (7)
- 5-6: The version of `renv` is hardcoded to "1.0.3". Consider making this configurable or automatically updated to ensure users can easily switch to newer versions of `renv` without modifying the script.
- 12-24: The diagnostics block uses `Sys.getenv` to check if diagnostics are enabled and then profiles the startup time. This is a useful feature for performance analysis. However, ensure that the profiling does not inadvertently become enabled in production environments, as it could impact performance.
- 27-51: The logic for determining if the autoloader is enabled is clear and well-structured, checking configuration options and environment variables. However, consider documenting the precedence of these checks for clarity, especially for new users or contributors.
- 69-71: Eager loading of the 'utils' package to ensure `renv` shims come first on the search path is a smart approach. Just ensure that this does not lead to unexpected side effects or conflicts with other packages that might expect the original 'utils' package functions.
- 115-148: The `bootstrap` function is critical for downloading and installing `renv`. It's well-structured and handles errors gracefully. However, consider adding more detailed error messages or suggestions for common issues that might occur during the bootstrap process to improve user experience.
- 848-866: The `renv_bootstrap_load` function attempts to load `renv` from the project library and validates the version. It's important to ensure that the version validation logic is robust and accounts for potential edge cases, such as development versions or custom builds of `renv`.
- 1028-1033: The warning message in the `renv_bootstrap_run` function provides clear instructions for recovery in case `renv` fails to load. This is good practice for error handling. Just ensure that the message remains accurate and helpful as the project evolves.
#'   talk, talk description and start and end times. Alternatively, the path to
#'   an R file containing such a list. See `summarise_full_meeting` for more
#'   details. If NULL, the user will be asked if the system should try to
#'   generate the agenda automatically, using the `infer_agenda_from_transcript`
#'   function.
#' @param expected_agenda A character string. Only used if the `agenda` argument
#'   is `NULL` and the user requests the automatic agenda generation. This
#'   string will be used to drive the LLM while generating the agenda. See
#'   `infer_agenda_from_transcript` for more details.
#' @param agenda_generation_window_size The size of the window in seconds to
#'   analyze at once when generating the agenda. See
#'   `infer_agenda_from_transcript` for more details.
#' @param agenda_generation_output_file A string with the path to the output
#'   file where the automatically generated agenda will be written. Should be a
#'   .R file. See `infer_agenda_from_transcript` for more details.
#' @param extra_agenda_generation_args Additional arguments passed to the
#'   `infer_agenda_from_transcript` function. See `infer_agenda_from_transcript`
#'   for more details. Note that the `diarization_instructions` argument for this
#'   function will be taken from the `extra_agenda_generation_args` argument.
The addition of new parameters in `speech_to_summary_workflow` for automatic agenda generation (`agenda`, `expected_agenda`, `agenda_generation_window_size`, `agenda_generation_output_file`, `extra_agenda_generation_args`) significantly enhances the package's functionality. However, it's crucial to ensure that these parameters are well-documented and validated before use. Specifically:
- Ensure that the documentation for each new parameter is clear and provides enough detail for users to understand its purpose and how to use it.
- Validate the inputs for these parameters to prevent runtime errors. For example, check that `agenda_generation_window_size` is a positive integer and that `agenda_generation_output_file` is a valid file path.
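The input checks suggested here could be sketched as below. The helper `validate_agenda_generation_args` is hypothetical and not part of the package; it assumes the window size is a whole number of seconds and that the output file, when given, is a `.R` path in an existing directory, as the parameter documentation describes.

```r
# Hypothetical validator for the agenda generation arguments.
validate_agenda_generation_args <- function(window_size, output_file = NULL) {
  # The window size must be a single positive whole number (seconds)
  if (!is.numeric(window_size) || length(window_size) != 1 ||
      is.na(window_size) || window_size <= 0 ||
      window_size != round(window_size)) {
    stop("`agenda_generation_window_size` must be a positive integer.",
         call. = FALSE)
  }

  if (!is.null(output_file)) {
    # The output file must be a single path ending in .R
    if (!is.character(output_file) || length(output_file) != 1 ||
        !grepl("\\.R$", output_file, ignore.case = TRUE)) {
      stop("`agenda_generation_output_file` must be a path to a .R file.",
           call. = FALSE)
    }

    # The target directory must already exist
    if (!dir.exists(dirname(output_file))) {
      stop("The directory of `agenda_generation_output_file` does not exist.",
           call. = FALSE)
    }
  }

  invisible(TRUE)
}
```

Calling such a helper at the top of the workflow would surface bad arguments before any transcription or LLM call is made.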
    if (cur_bp == length(breakpoints)) stop <- TRUE

    cur_bp <- cur_bp + 1

  }

  agenda_times <- getOption("minutemaker_temp_agenda", list())

  if (length(agenda_times) == 0) {
    warning("No agenda was inferred from the transcript.",
            immediate. = TRUE, call. = FALSE)
    return(NULL)
  }

  # Remove segments that are too short or that precede the previous one.
  agenda_times <- agenda_times |> purrr::imap(\(x, i) {
    if (i == 1) return(agenda_times[[i]])

    this_time <- agenda_times[[i]]
    prev_time <- agenda_times[[i - 1]]

    # A segment must start at least 150 seconds (2.5 minutes) after the
    # previous one; this also drops negative gaps
    if (this_time - prev_time < 150) return(NULL)

    return(this_time)
  }) |> unlist()

  message("- Extracting agenda items details")

  # Extract the talks' details from the transcript
  agenda <- purrr::imap(agenda_times, \(start, i) {
    # if (i == 1) start <- 1

    # Stop at the end of the transcript if there is no next agenda element
    end <- min(
      c(agenda_times[i + 1], max(transcript_data$end)),
      na.rm = TRUE)

    # Stop at the pause if there is one in the talk segment
    pauses <- pauses[between(pauses, start, end)]
    end <- min(c(end, pauses), na.rm = TRUE)

    element <- list(
      # Sometimes, int are produced, which creates problems when converting to
      # clocktime
      from = as.numeric(start),
      to = as.numeric(end)
    )

    transcript_segment <- transcript_data |>
      filter(
        .data$start >= element$from,
        .data$end <= element$to,
      ) |> readr::format_csv()

    prompt <- generate_agenda_element_prompt(
      transcript_segment,
      # I cannot use mget here because the prompt function is not in the
      # environment of the calling function. Probably there's a way to use mget
      # also here
      args = list(
        event_description = event_description,
        vocabulary = vocabulary,
        diarization_instructions = diarization_instructions)
    )

    # Build the prompt set
    prompt_set <- c(
      system = get_prompts("persona"),
      user = prompt
    )

    result_json <- interrogate_llm(
      prompt_set,
      ..., force_json = TRUE
    )

    # Merge the LLM output with the segment's start and end times
    jsonlite::fromJSON(result_json, simplifyDataFrame = FALSE) |>
      c(element)
  })

  if (!is.null(start_time)) {
    agenda <- agenda |>
      convert_agenda_times(
        convert_to = "clocktime",
        event_start_time = start_time)
  }

  if (!is.null(output_file)) {
    dput(agenda, file = output_file)
  }

  # Clear the temporary data used to resume an interrupted run
  options(
    minutemaker_temp_agenda_last_bp = NULL,
    minutemaker_temp_agenda = NULL,
    minutemaker_temp_agenda_hash = NULL
  )

  agenda
}
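The filtering step in the code above compares each start time with the entry immediately before it in the original vector, whether or not that entry was itself kept. One possible variant, shown here only as a sketch, compares each candidate with the last start time that was actually kept. The function name `drop_close_starts` is hypothetical; the 150-second default is taken from the code above.

```r
# Hypothetical variant of the filtering step: keep a start time only if it
# comes at least `min_gap` seconds after the last start time that was kept.
drop_close_starts <- function(start_times, min_gap = 150) {
  start_times <- as.numeric(unlist(start_times))
  if (length(start_times) == 0) return(numeric(0))

  kept <- start_times[1]

  for (t in start_times[-1]) {
    # Negative gaps (times preceding the last kept one) are dropped as well
    if (t - kept[length(kept)] >= min_gap) kept <- c(kept, t)
  }

  kept
}

drop_close_starts(c(0, 100, 200, 1000, 900, 1200))
```

With this input, 200 is kept because it is compared with 0 rather than with the discarded 100, which is where the two approaches differ.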
The `infer_agenda_from_transcript` function introduces a significant new feature to the `minutemaker` package, automating the generation of event agendas from transcripts. This function is complex, involving multiple steps to process the transcript, infer agenda items, and handle errors. Here are several points for improvement and verification:
- Error Handling and User Feedback: The function attempts to handle various error scenarios, such as invalid transcript formats and JSON parsing errors. Ensure that these errors are communicated clearly to the user, providing actionable advice where possible.
- Performance Considerations: The function processes the transcript in segments based on a window size and additional breakpoints for pauses. Review the choice of default values for `window_size` and `pause_duration` to ensure they are sensible for typical use cases.
- Complexity and Maintainability: The function's complexity is high, with nested loops and conditional branches. Consider refactoring to improve readability and maintainability. Extracting some logic into separate, well-named helper functions could make the code easier to follow and test.
- Documentation and Examples: Ensure that the function's documentation is comprehensive, including descriptions of all parameters and their expected formats. Providing examples of usage, especially for complex inputs like the transcript data frame, would be highly beneficial for users.
- Validation of Inputs: The function performs some validation on the input transcript, but additional checks might be necessary to ensure that the transcript data frame contains all required columns (`start`, `end`, `text`, and optionally `speaker`) in the expected formats.
- Testing and Edge Cases: Given the function's complexity, thorough testing is crucial. Consider adding unit tests covering various scenarios, including edge cases like transcripts with very short talks, overlapping talks, or long pauses that might affect agenda inference.
#' Generate the agenda inference prompt
#'
#' This function is used by `infer_agenda_from_transcript()` to generate a
#' prompt for inferring the agenda from a transcript.
#'
#' @param transcript_segment A segment of the transcript to be used for
#'   inferring the agenda. Can be a character vector representing the data in CSV
#'   format or a data frame.
#' @param args A list of arguments to be passed to the prompt template. They can
#'   include: event_description, vocabulary, diarization_instructions and
#'   expected_agenda.
#'
#' @return A prompt used by `infer_agenda_from_transcript()`.
#'
generate_agenda_inference_prompt <- function(
    transcript_segment,
    args
) {

  if (is.data.frame(transcript_segment)) {
    transcript_segment <- readr::format_csv(transcript_segment)
  }

  if (!is.null(args$vocabulary)) {
    # Format the vocabulary argument if a vector is provided
    args$vocabulary <- paste0(
      "- ",
      args$vocabulary,
      collapse = "\n"
    )
  }

  # Aggregate instructions if length > 1 vectors and convert into the
  # extra_diarization_instructions argument
  if (length(args$diarization_instructions) > 0) {
    args$extra_diarization_instructions <- paste(
      args$diarization_instructions, collapse = "\n"
    )
  }

  long_arguments <- purrr::map_lgl(args, ~ length(.x) > 1)

  if (any(long_arguments)) {
    stop("All arguments in args should have length 1:\n",
         stringr::str_flatten_comma(names(args)[long_arguments]))
  }

  prompt <- paste(
    "Your task is to extract individual talks from a transcript, creating an agenda.",

    if (!is.null(args$event_description)) {
      # Uses the {event_description} argument
      get_prompts("event_description_template")
    },

    if (!is.null(args$vocabulary)) {
      # Uses the {vocabulary} argument
      get_prompts("vocabulary_template")
    },

    # Uses the {extra_diarization_instructions} argument
    if (!is.null(args$diarization_instructions)) {
      get_prompts("diarization_template")
    },

    "This is the transcript of the event/meeting from which you need to infer the agenda items:\n<transcript>\n{transcript_segment}\n</transcript>\n\nThe transcript is formatted as a csv with the start and end time of each segment, the segment text and possibly, the speakers.",

    sep = "\n\n"
  ) |>
    stringr::str_glue_data(.x = args, .null = NULL) |>
    paste(
      'You can identify the talks from a change of speakers and/or a change of topic. Try to detect broad changes of topic so as to avoid splitting the transcript into an excessively large number of small talks; a talk usually lasts from at least 10-15 minutes to one hour, so join very short changes of topic into one talk, even if the speaker changes. Aggregate talks and the related Q&A sessions in the same talk.

You will be FIRST producing an INFORMATION DENSE, step by step reasoning of what could be a good subdivision of the transcript into different talks, considering different competing subdivisions, listing each identified talk start time and topics. THEN you will extract the starting times of each talk.

Take speakers, topics, and timings into consideration in your reasoning. The reasoning doesn\'t have to be human readable. Favor a high information over length ratio.',

      if (!is.null(args$expected_agenda)) {
        stringr::str_glue_data(
          .x = args,
          .null = NULL,
          "The agenda is expected to have the following talks: ###
{expected_agenda}
###
Try to match the agenda you generated to this structure.")
      },

      'Your output will be a JSON object with two components: your reasoning and the start times of each identified talk. Here\'s an example of the output structure:
###
{
  "reasoning": "Your reasoning goes here",
  "start_times": [1, 232, 1242, 2343, 5534, 7023, ...]
}
###

Important: process the whole transcript, do not be lazy: your agenda WILL cover the entirety of the transcript, FROM START TO END WITHOUT TIME HOLES.',

      sep = "\n"
    )
}
The `generate_agenda_inference_prompt` function is a significant addition, facilitating the generation of prompts for inferring the agenda from a transcript. This function is well-structured and aligns with the PR objectives. However, there are areas that could benefit from further refinement:
- Error Handling: Consider adding error handling for the input `transcript_segment`. For instance, verify that the input is either a data frame or a character vector representing CSV data before proceeding with the formatting. This check can prevent unexpected errors during the function's execution.
- Documentation and Examples: Enhance the function's documentation by providing examples of input formats and expected outputs. This addition can help users understand how to use the function effectively and what results to expect.
- Performance Considerations: If the function is expected to handle large transcripts, evaluate its performance and consider optimizations, such as minimizing data transformations or optimizing string operations, to ensure efficient processing.
Overall, this function is a valuable addition to the `minutemaker` package, enhancing its capability to automate agenda generation from transcripts. Implementing the suggested refinements can further improve its usability, reliability, and performance.
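The type check on `transcript_segment` and an example of the CSV input format could be sketched together as below. `as_transcript_csv` is a hypothetical helper; it uses base R `write.csv()` as a stand-in for the `readr::format_csv()` call in the diff, so quoting details may differ from what the package actually produces.

```r
# Hypothetical helper: accept a data frame or CSV text and return CSV text,
# failing early on any other input type.
as_transcript_csv <- function(transcript_segment) {
  if (is.data.frame(transcript_segment)) {
    # Base R stand-in for readr::format_csv()
    lines <- utils::capture.output(
      utils::write.csv(transcript_segment, row.names = FALSE))
    return(paste(lines, collapse = "\n"))
  }

  if (is.character(transcript_segment)) {
    return(paste(transcript_segment, collapse = "\n"))
  }

  stop("`transcript_segment` must be a data frame or a character vector.",
       call. = FALSE)
}

# Example input: one row per transcript segment, times in seconds
segment <- data.frame(
  start = c(0, 12.5),
  end = c(12.5, 30),
  text = c("Welcome everyone", "Let's start with the first talk")
)

cat(as_transcript_csv(segment))
```

The resulting text has one header line with the column names followed by one line per segment, which is the structure the prompt describes to the LLM.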
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (1)
- R/data_management.R (8 hunks)
Files skipped from review as they are similar to previous changes (1)
- R/data_management.R
Major Enhancements:
`infer_agenda_from_transcript` function to automate the generation of an event agenda by analyzing a given transcript, aiming to identify and extract key segments and construct a structured overview of the event's proceedings. This enhancement significantly streamlines the workflow for summarizing meetings and conferences. (Commit: c458b0d)
Minor Improvements and Fixes:
Development and Maintenance:
`renv` for dev reproducibility. (Commit: 3b18519)
Summary by CodeRabbit
New Features
`speech_to_summary_workflow`.
Documentation
`add_chat_transcript`.
Chores
`data_management.R` to include new features and improvements.