From 627d3b3d5ce0450bc4179a26432b177a1deaea2b Mon Sep 17 00:00:00 2001
From: David Osipov
Date: Fri, 28 Feb 2025 18:00:35 +0400
Subject: [PATCH] Update technical_details.md

---
 docs/technical_details.md | 171 +++++++++++++++++++++++---------------
 1 file changed, 103 insertions(+), 68 deletions(-)

diff --git a/docs/technical_details.md b/docs/technical_details.md
index d0ed1b7..9e1a12e 100644
--- a/docs/technical_details.md
+++ b/docs/technical_details.md
@@ -1,3 +1,5 @@
+
+
 # Technical Details: AI Prompt Framework
 
 This document explains the algorithms and processes used in the AI Prompt Framework for generating actionable improvement lists.
@@ -14,122 +16,155 @@ Input → Parsing → Consolidation → Prioritization → Assessment → Output
 
 The parsing stage extracts structured information from unstructured text:
 
-- **Text Segmentation**: Divides feedback into logical units (sentences, paragraphs)
-- **Source Attribution**: Identifies which AI provided each piece of feedback
-- **Information Extraction**:
-  - Issues/problems
-  - Proposed solutions
-  - Code references
-  - Severity assessments
-  - Reasoning
-  - Uncertainties
-- **Meta-Feedback Handling**: Links feedback-on-feedback to the initial feedback it references
+- **Text Segmentation**: Divides feedback into logical units (sentences, paragraphs).
+- **Source Attribution**: Identifies which AI provided each piece of feedback.
+- **Information Extraction**:
+  - Issues/problems.
+  - Proposed solutions.
+  - Code references (file names, line numbers, and optionally code snippets).
+  - Severity assessments.
+  - Reasoning.
+  - Uncertainties.
+  - Potential false positives.
+- **Meta-Feedback Handling**: Links feedback-on-feedback to the initial feedback it references.
 
 ### 2. Consolidation Stage
 
 The consolidation stage groups related feedback:
 
-- **Similarity Detection**: Identifies feedback discussing the same issues
-- **Contradiction Detection**: Identifies conflicting opinions
-- **Evidence Validation**: Flags feedback missing code evidence
+- **Similarity Detection**: Identifies feedback discussing the same issues (using techniques such as keyword matching and semantic similarity analysis).
+- **Contradiction Detection**: Identifies conflicting opinions or suggestions among the initial AIs and between initial AIs and meta-AIs.
+- **Evidence Validation**: Flags feedback missing code evidence or with weak evidence.
+- **False Positive Identification**: Flags potential false positives.
 
 ### 3. Prioritization Stage
 
-The prioritization assigns initial priority levels based on:
+The prioritization stage assigns initial priority levels based on the following factors (a sketch combining them appears after the list):
 
-- **Severity**: Impact on the system (crashes, data loss, security, etc.)
-- **Consensus**: Level of agreement among AIs
-- **Meta-Validation**: Whether feedback-on-feedback confirms or refutes
-- **Evidence Strength**: Direct vs. indirect code references
+- **Severity**: Impact on the system (crashes, data loss, security vulnerabilities, performance issues, maintainability problems).
+- **Weighted Consensus**: Level of agreement among AIs, weighted by their feedback quality scores.
+- **Meta-Validation**: Whether feedback-on-feedback confirms or refutes the initial feedback (and the strength of the confirmation/refutation).
+- **Evidence Strength**: Direct code references (file and line number) are given higher weight than indirect references (file or function name) or no references.
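+
+As a rough illustration only, the sketch below shows one way these four factors could be combined into a priority label. The function name `assign_priority`, the point values, and the thresholds are assumptions for this example, not part of the framework's specification:
+
+```python
+def assign_priority(severity, weighted_consensus, meta_validation, evidence_strength):
+    """Illustrative sketch only: combines the four prioritization factors.
+
+    severity: "Critical", "High", "Medium", or "Low" (hypothetical labels)
+    weighted_consensus: agreement score already weighted by feedback quality (0-1)
+    meta_validation: +1 if confirmed by meta-feedback, 0 if neutral, -1 if refuted
+    evidence_strength: "Direct", "Indirect", or "None"
+    """
+    severity_points = {"Critical": 4, "High": 3, "Medium": 2, "Low": 1}[severity]
+    evidence_points = {"Direct": 1, "Indirect": 0.5, "None": 0}[evidence_strength]
+    score = severity_points + weighted_consensus + meta_validation + evidence_points
+    if score >= 5:
+        return "High Priority"
+    if score >= 3:
+        return "Medium Priority"
+    return "Low Priority"
+
+print(assign_priority("High", 0.8, 1, "Direct"))  # -> High Priority (3 + 0.8 + 1 + 1 = 5.8)
+```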
 
 ### 4. Assessment Stage
 
-The assessment stage evaluates feedback quality:
+The assessment stage evaluates feedback quality for each AI and each improvement:
 
-- **Accuracy**: Factual correctness (using feedback-on-feedback)
-- **Usefulness**: Actionability of the feedback
-- **Completeness**: Presence of all necessary information
-- **Reasoning Quality**: Logical soundness of arguments
-- **Specificity**: Detail level of the feedback
+- **Accuracy**: Factual correctness (using feedback-on-feedback and, if available, code context). Rated on a 1-5 scale, with categorical labels (Accurate, Partially Accurate, Inaccurate, Unclear).
+- **Usefulness**: Actionability of the feedback (how easily the suggestions can be implemented). Rated on a 1-5 scale, with categorical labels (Highly Useful, Moderately Useful, Limited Usefulness, Not Useful).
+- **Completeness**: Presence of all necessary information (problem, solution, evidence, reasoning). Rated on a 1-5 scale, with categorical labels (Complete, Partially Complete, Incomplete).
+- **Reasoning Quality**: Logical soundness of the arguments presented in the feedback. Rated on a 1-5 scale, with categorical labels (Strong, Moderate, Weak, None).
+- **Specificity**: Detail level of the feedback (specific code locations vs. general concepts). Rated on a 1-5 scale, with categorical labels (Highly Specific, Moderately Specific, General, Vague).
+- **False Positive**: Indicates whether the feedback is likely a false positive (Yes/No/Potential).
+- **Conflicting Suggestion Analysis**: If conflicting code suggestions are present, this stage involves:
+  - Listing the conflicting suggestions.
+  - Comparing and contrasting the suggestions.
+  - Assessing the potential impact of each suggestion.
+  - Providing a recommendation (if possible) based on the analysis.
 
 ### 5. Output Generation Stage
 
 The output stage creates a standardized Markdown document with:
 
-- **Summary Statistics**: Totals by priority
-- **Detailed Improvement Entries**: Formatted according to the template
-- **Cross-References**: Dependencies between improvements
-- **Human Review Sections**: Placeholders for reviewer input
+- **Summary Statistics**: Totals by priority and other relevant metrics.
+- **Detailed Improvement Entries**: Formatted according to the template, including all extracted information, assessments, and analysis (a sketch of one rendered entry appears after this list).
+- **Cross-References**: Dependencies between improvements.
+- **Human Review Sections**: Placeholders for reviewer input (approval, modification, comments).
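+
+To make the hand-off between consolidation, assessment, and output concrete, the sketch below shows a possible shape for one consolidated improvement entry and how it might be rendered into a Markdown section. The `ImprovementEntry` fields, the `render_entry` helper, and the exact layout are assumptions for illustration; the actual template defines the canonical format:
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class ImprovementEntry:
+    # Field names are illustrative; the real template defines the canonical set.
+    title: str
+    priority: str                  # e.g. "High", "Medium", "Low"
+    confidence: int                # confidence score, 1-5
+    code_reference: str            # e.g. "parser.py:118" (hypothetical)
+    supporting_ais: list = field(default_factory=list)
+    needs_human_verification: bool = False
+
+def render_entry(entry: ImprovementEntry) -> str:
+    """Render one improvement entry as a Markdown section (layout is illustrative)."""
+    return "\n".join([
+        f"### {entry.title}",
+        f"- **Priority**: {entry.priority}",
+        f"- **Confidence**: {entry.confidence}/5",
+        f"- **Code Reference**: {entry.code_reference}",
+        f"- **Supporting AIs**: {', '.join(entry.supporting_ais) or 'None'}",
+        f"- **Human Verification Required**: {'Yes' if entry.needs_human_verification else 'No'}",
+        "- **Human Review Notes**: _(reviewer input)_",
+    ])
+
+print(render_entry(ImprovementEntry(
+    title="Handle empty feedback files",
+    priority="High",
+    confidence=4,
+    code_reference="parser.py:118",
+    supporting_ais=["AI-1", "AI-3"],
+)))
+```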
 
 ## Confidence Score Algorithm
 
-The confidence score (1-5) is calculated using the following algorithm:
+The confidence score (1-5) is calculated using a weighted sum of several factors:
 
 ```python
-def calculate_confidence(support, opposition, code_evidence, meta_feedback):
+def calculate_confidence(support, opposition, code_evidence, meta_feedback, conflicting_suggestions):
     base_score = 3
-    if len(support) >= 2: base_score += 1
-    avg_strength = sum(s["strength"] for s in support) / len(support) if support else 0
-    if avg_strength >= 2.5: base_score += 1  # Assuming "High"=3, "Medium"=2, "Low"=1
-    if any(o["strength"] == "High" for o in opposition): base_score -= 1
+
+    # Weighted AI Support
+    supporting_weight = sum(ai["quality_score"] for ai in support)
+    opposing_weight = sum(ai["quality_score"] * ai["conflict_modifier"] for ai in opposition)
+    base_score += supporting_weight - opposing_weight
+
+    # Code Evidence
     if code_evidence == "Direct Reference": base_score += 1
     elif code_evidence == "No Reference": base_score -= 1
+
+    # Feedback-on-Feedback Impact
    for meta in meta_feedback:
-        if meta["strongly_refutes"]:
-            base_score -= 1
+        if meta["strongly_refutes"]:  # High negative feedback quality
+            base_score -= 2
             break  # Strong refutation takes precedence
-        elif meta["strongly_confirms"]:
-            base_score += 1
+        elif meta["strongly_confirms"]:  # High positive feedback quality
+            base_score += 2
+
+    # Conflicting Suggestions Penalty
+    base_score -= 0.5 * len(conflicting_suggestions)
+
     return max(1, min(5, base_score))
+
+# Helper functions (example)
+def calculate_quality_score(feedback_quality):
+    return (feedback_quality["Accuracy"] +
+            feedback_quality["Usefulness"] +
+            feedback_quality["Completeness"] +
+            feedback_quality["Reasoning Quality"] +
+            feedback_quality["Specificity"]) / 5
+
+def determine_conflict_modifier(conflict_severity):
+    if conflict_severity == "High": return 1
+    elif conflict_severity == "Medium": return 0.5
+    else: return 0
 ```
 
+* **Weighted AI Support:** The average feedback quality score (1-5) for each AI is calculated and used as a weight. AIs with higher average quality scores have a greater influence on the confidence score.
+* **Conflict Severity Modifier:** Opposition from AIs is weighted by the conflict severity (High=1, Medium=0.5, Low=0).
+* **Code Evidence:** Direct references increase the score; no references decrease it.
+* **Feedback-on-Feedback:** Strong refutations significantly decrease the score; strong confirmations significantly increase it.
+* **Conflicting Suggestions:** Each conflicting code suggestion slightly decreases the score.
+
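+For illustration, here is a hypothetical call to `calculate_confidence`; it assumes the functions from the block above are in scope, and the input values are invented purely to show the expected dictionary keys:
+
+```python
+# Hypothetical inputs; key names follow the function definitions above.
+support = [
+    {"quality_score": calculate_quality_score({
+        "Accuracy": 5, "Usefulness": 4, "Completeness": 4,
+        "Reasoning Quality": 4, "Specificity": 4})},      # quality_score = 4.2
+]
+opposition = [
+    {"quality_score": 3.0, "conflict_modifier": determine_conflict_modifier("Medium")},
+]
+meta_feedback = [
+    {"strongly_confirms": True, "strongly_refutes": False},
+]
+conflicting_suggestions = ["suggestion A vs. suggestion B"]
+
+score = calculate_confidence(support, opposition, "Direct Reference",
+                             meta_feedback, conflicting_suggestions)
+print(score)  # prints 5: the raw total (8.2) is clamped to the 1-5 range
+```
+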
 ## AI Consensus Strength Label Determination
 
-The strength labels ("High," "Medium," "Low") in the AI Consensus section are determined using:
+The strength labels ("High," "Medium," "Low") in the `AI Consensus` section are determined by combining an initial assessment based on the text of the feedback with the AI's feedback quality ratings:
 
 ```python
 def determine_strength_label(initial_ai_feedback, ai_feedback_quality):
-    # Initial assessment based on text (keywords, tone, etc.)
     initial_strength = assess_strength_from_text(initial_ai_feedback)  # Returns "High", "Medium", or "Low"
-
-    # Adjust based on AI Feedback Quality
     quality = ai_feedback_quality
-    if (quality["Accuracy"] == "Accurate" and
-        quality["Usefulness"] == "Highly Useful" and
-        quality["Completeness"] == "Complete" and
-        quality["Reasoning Quality"] == "Strong" and
-        quality["Specificity"] == "Highly Specific"):
-        # High Quality
-        if initial_strength == "Low":
-            return "Medium"  # Consider upgrade
-        return initial_strength  # Keep "High" or "Medium"
-    elif (quality["Accuracy"] == "Inaccurate" or
-          quality["Usefulness"] == "Not Useful" or
-          quality["Completeness"] == "Incomplete" or
-          quality["Reasoning Quality"] == "Weak" or
-          quality["Specificity"] == "Vague"):
-        # Low Quality
+    if (quality["Accuracy"] >= 4 and
+        quality["Usefulness"] >= 4 and
+        quality["Completeness"] >= 4 and
+        quality["Reasoning Quality"] >= 4 and
+        quality["Specificity"] >= 4):
+        if initial_strength == "Low": return "Medium"
+        return initial_strength
+    elif (quality["Accuracy"] <= 2 or
+          quality["Usefulness"] <= 2 or
+          quality["Completeness"] <= 2 or
+          quality["Reasoning Quality"] <= 2 or
+          quality["Specificity"] <= 2):
         return "Low"
     else:
-        # Mixed Quality
-        if initial_strength == "High":
-            return "Medium"  # Downgrade
-        return initial_strength  # Keep "Medium" or "Low"
+        if initial_strength == "High": return "Medium"
+        return initial_strength
 ```
 
+* **Initial Assessment:** The AI analyzes the text of the feedback to make an initial judgment of support/opposition strength ("High," "Medium," "Low").
+* **Quality Adjustment:** The initial assessment is adjusted based on the AI's feedback quality ratings. High-quality feedback strengthens the label; low-quality feedback weakens it.
+
 ## Human Verification Trigger Logic
 
 An improvement is flagged for human verification if any of these conditions are met:
 
-- Code evidence is "No Reference" or "Indirect Reference"
-- Conflict severity is "High"
-- There are unresolved ambiguities
-- Any meta-AI raises significant concerns
+- Code evidence is "No Reference" or "Indirect Reference."
+- Conflict severity is "High."
+- There are unresolved ambiguities.
+- Any meta-AI raises significant concerns (e.g., strong refutation, low feedback quality ratings).
+- There are conflicting code suggestions.
+- There is a potential false positive.
 
 ## Performance Considerations
 
 For large feedback files, the framework employs:
 
-- **Chunking**: Processes feedback in 500-word chunks
-- **Caching**: Caches code locations and issues
-- **Fallback Logic**: Includes raw, unparsed feedback if processing fails
+- **Chunking**: Processes feedback in 500-word chunks to manage memory and processing time.
+- **Caching**: Caches code locations and issue summaries to avoid redundant processing.
+- **Fallback Logic**: Includes raw, unparsed feedback in the `Human Review Notes` section if processing fails for any reason, ensuring no information is lost.
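+
+A minimal sketch of the 500-word chunking step described above is shown below; the function name `chunk_feedback` and the simple whitespace-based splitting rule are assumptions for illustration:
+
+```python
+def chunk_feedback(text, max_words=500):
+    """Split raw feedback into chunks of at most max_words whitespace-separated words."""
+    words = text.split()
+    for start in range(0, len(words), max_words):
+        yield " ".join(words[start:start + max_words])
+
+# Example: a 1,200-word feedback file yields three chunks (500 + 500 + 200 words).
+chunks = list(chunk_feedback("word " * 1200))
+print([len(chunk.split()) for chunk in chunks])  # [500, 500, 200]
+```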