}"><div x-show=!hideTitle x-bind:class="{ 'note-title': true, 'clear-title': isClearPage() }"><div class=title-index style=display:none>Evaluation Guidelines for LLM Applications</div><div class=tags-index style=display:none>tags: llm, evaluation</div><h1 class=pagetitle x-text="isTagPage() ? `#${title.replaceAll('-', ' ')}` : title">Evaluation Guidelines for LLM Applications</h1></div><img class=yggdrasil-tree src=https://memo.d.foundation/img/footer-bg.svg><h2 id=overview>Overview <a href=#overview></a></h2><p>Evaluation is a hard part of building an RAG system, especially for application-integrated LLM solving your business problem. This guide outlines a clear, step-by-step approach to effectively evaluating and optimizing the integration of a third-party Large Language Model (LLM) into your application. By following these articles, you&rsquo;ll make sure the model fits your business goals and technical needs.</p><h2 id=evaluation-checklist>Evaluation checklist <a href=#evaluation-checklist></a></h2><p>The evaluation checklist helps make sure that all important parts of the LLM are reviewed during integration. Each checklist item should address a key part of the system or model to confirm it meets technical, business, and user needs.</p><p>By providing a structured way to assess the system’s performance, the checklist helps we ensure that the model meets both technical and business needs while delivering a positive user experience. For additional insights, you can refer to the following articles: <a href=https://www.linkedin.com/pulse/llm-product-development-checklist-how-make-products-generative-pines/ target=_blank><strong>LLM Product Development Checklist</strong></a> and <a href=https://blog.kore.ai/cobus-greyling/understanding-llm-user-experience-expectation target=_blank><strong>Understanding LLM User Experience Expectations</strong></a>.</p><h3 id=product-evaluation-checklist>Product evaluation checklist <a href=#product-evaluation-checklist></a></h3><p><strong>In case RAG system:</strong></p><ul><li><input disabled type=checkbox> <strong>Search Engine</strong><ul><li>If a user searches for legal clauses related to &ldquo;contract termination&rdquo; the search engine should retrieve documents with high relevance (precision) and not miss any key documents (recall).</li><li><strong>Metric</strong>: Precision = 85%, Recall = 90% in test dataset.</li><li>For a legal query, the system should retrieve and highlight clauses on &ldquo;contract termination&rdquo; and ignore irrelevant sections, like &ldquo;payment terms.&rdquo;</li><li><strong>Task-Specific Accuracy</strong>: 95% task-specific match in legal datasets.</li></ul></li><li><input disabled type=checkbox> <strong>Latency</strong><ul><li>The system should retrieve documents within 2 seconds in a real-time customer support scenario.</li><li><strong>Expected Latency</strong>: &lt;2 seconds for 95% of queries.</li></ul></li><li><input disabled type=checkbox> <strong>Response Generation</strong><ul><li>For a customer query about a &ldquo;refund policy,&rdquo; the LLM should generate a response that directly references the correct clauses in the retrieved refund policy document.</li><li><strong>LLM Evaluation</strong>: Coherence score >80% using a library evaluation metric.</li><li><strong>Human in the loop:</strong> Annotate response of LLM.</li></ul></li><li><input disabled type=checkbox> <strong>Token Usage and Cost Efficiency</strong><ul><li>For a legal document retrieval and summarization task, the system should use fewer than 
10,000 tokens per query to balance cost and performance.</li><li><strong>Max Token Usage</strong>: 10,000 tokens per query to maintain cost-effectiveness. Comparing each model together to find cost effectively.</li></ul></li></ul><pre><code class=language-mermaid>graph TD
}"><div x-show=!hideTitle x-bind:class="{ 'note-title': true, 'clear-title': isClearPage() }"><div class=title-index style=display:none>Evaluation Guidelines for LLM Applications</div><div class=tags-index style=display:none>tags: llm, evaluation</div><h1 class=pagetitle x-text="isTagPage() ? `#${title.replaceAll('-', ' ')}` : title">Evaluation Guidelines for LLM Applications</h1></div><img class=yggdrasil-tree src=https://memo.d.foundation/img/footer-bg.svg><h2 id=overview>Overview <a href=#overview></a></h2><p>Evaluation is a hard part of building an RAG system, especially for application-integrated LLM solving your business problem. This guide outlines a clear, step-by-step approach to effectively evaluating and optimizing the integration of a third-party Large Language Model (LLM) into your application. By following these articles, you&rsquo;ll make sure the model fits your business goals and technical needs.</p><h2 id=evaluation-checklist>Evaluation checklist <a href=#evaluation-checklist></a></h2><p>The evaluation checklist helps make sure that all important parts of the LLM are reviewed during integration. Each checklist item should address a key part of the system or model to confirm it meets technical, business, and user needs.</p><p>By providing a structured way to assess the system’s performance, the checklist helps we ensure that the model meets both technical and business needs while delivering a positive user experience. For additional insights, you can refer to the following articles: <a href=https://www.linkedin.com/pulse/llm-product-development-checklist-how-make-products-generative-pines/ target=_blank><strong>LLM Product Development Checklist</strong></a> and <a href=https://blog.kore.ai/cobus-greyling/understanding-llm-user-experience-expectation target=_blank><strong>Understanding LLM User Experience Expectations</strong></a>.</p><h3 id=product-evaluation-checklist>Product evaluation checklist <a href=#product-evaluation-checklist></a></h3><p><strong>In case RAG system:</strong></p><ul><li><strong>Search Engine</strong><ul><li>If a user searches for legal clauses related to &ldquo;contract termination&rdquo; the search engine should retrieve documents with high relevance (precision) and not miss any key documents (recall).</li><li><strong>Metric</strong>: Precision = 85%, Recall = 90% in test dataset.</li><li>For a legal query, the system should retrieve and highlight clauses on &ldquo;contract termination&rdquo; and ignore irrelevant sections, like &ldquo;payment terms.&rdquo;</li><li><strong>Task-Specific Accuracy</strong>: 95% task-specific match in legal datasets.</li></ul></li><li><strong>Latency</strong><ul><li>The system should retrieve documents within 2 seconds in a real-time customer support scenario.</li><li><strong>Expected Latency</strong>: &lt;2 seconds for 95% of queries.</li></ul></li><li><strong>Response Generation</strong><ul><li>For a customer query about a &ldquo;refund policy,&rdquo; the LLM should generate a response that directly references the correct clauses in the retrieved refund policy document.</li><li><strong>LLM Evaluation</strong>: Coherence score >80% using a library evaluation metric.</li><li><strong>Human in the loop:</strong> Annotate response of LLM.</li></ul></li><li><strong>Token Usage and Cost Efficiency</strong><ul><li>For a legal document retrieval and summarization task, the system should use fewer than 10,000 tokens per query to balance cost and performance.</li><li><strong>Max Token Usage</strong>: 10,000 tokens per query to 
```mermaid
graph TD
A[Retrieval System] --> B[Search Engine]
B --> C[Metric Precision, Recall]
C --> F[How to Test: Compare Retrieved Docs]
%% ...
A --> W[Cost Efficiency]
W --> X[Token Usage per Query]
X --> Y[How to Measure: Track Token Usage in API Calls]
```
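The latency and token-budget items are easiest to enforce by wrapping every model call in a small tracker. The sketch below assumes a hypothetical `client.generate` call that reports token counts on its response object, plus an illustrative per-token price; substitute your provider's real API and rate card:

```python
import time

PRICE_PER_1K_TOKENS = 0.01  # USD -- illustrative assumption, not a real rate

def tracked_generate(client, prompt: str, token_budget: int = 10_000):
    """Call the model and record latency, token usage, and estimated cost."""
    start = time.perf_counter()
    response = client.generate(prompt)  # hypothetical client API
    metrics = {
        "latency_s": time.perf_counter() - start,
        "tokens": response.prompt_tokens + response.completion_tokens,
    }
    metrics["cost_usd"] = metrics["tokens"] / 1000 * PRICE_PER_1K_TOKENS
    if metrics["tokens"] > token_budget:
        print(f"over budget: {metrics['tokens']} > {token_budget} tokens")
    return response, metrics

def p95(latencies: list) -> float:
    """Nearest-rank 95th-percentile latency over a log of queries."""
    ordered = sorted(latencies)
    return ordered[int(0.95 * (len(ordered) - 1))]
```

Logging `metrics` for every query lets you verify the checklist targets directly (p95 latency under 2 seconds, token usage under the 10,000-token budget), and running the same log against several providers gives the cost comparison.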
**For a fine-tuned model:**

- **Fine-Tuning on Task-Specific Data**
  - **Example**: A financial chatbot should correctly identify and respond to "interest rate change" queries 90% of the time on a test set.
  - **Metric**: Fine-tuning loss should decrease steadily, with an accuracy improvement of at least 5% over the base model.
- **Evaluate Performance Post-Fine-Tuning**
  - **Example**: In a legal document retrieval system, the fine-tuned model should correctly identify relevant clauses with 95% task-specific accuracy.
  - **Metric**: Precision = 90%, Recall = 88% in post-fine-tuning tests.
- **Prevent Overfitting**
  - **Example**: If training accuracy is 95%, validation accuracy should be no lower than 93%. If the gap widens, apply early stopping.
  - **Metric**: Validation loss should stay within 2% of the training loss (see the sketch after the diagram below).
- **Optimize Model Efficiency**
  - **Example**: A customer support model should deliver responses in less than 1.5 seconds while using fewer than 8,000 tokens.
  - **Expected Latency**: The fine-tuned model should respond in under 1.5 seconds for 95% of queries.
  - **Max Token Usage**: Limit token usage to under 8,000 tokens per query for cost-efficient operation.
- **Task-Specific Generalization and User Feedback**
  - **Example**: After fine-tuning, a medical chatbot should correctly diagnose 90% of unseen cases, based on user feedback and test cases.
  - **Task-Specific Accuracy**: Achieve 93% accuracy in task-specific domains such as healthcare diagnostics or legal assistance.

```mermaid
graph TD
J[Fine-Tuning Model]
J --> K[Apply Fine-Tuning on Task-Specific Data]
K --> L[How to Measure: Monitor Loss, Accuracy During Fine-Tuning]
%% ...
```
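The "within 2% of training loss" rule under **Prevent Overfitting** can be enforced mechanically during training. Below is a framework-agnostic sketch with toy loss values; in practice the numbers would come from your fine-tuning loop's per-epoch evaluation:

```python
def overfit_gap(train_loss: float, val_loss: float) -> float:
    """Relative gap of validation loss above training loss."""
    return (val_loss - train_loss) / train_loss

# Toy per-epoch (train_loss, val_loss) pairs from a hypothetical run.
history = [(0.90, 0.91), (0.60, 0.61), (0.40, 0.45)]

for epoch, (train, val) in enumerate(history, start=1):
    gap = overfit_gap(train, val)
    print(f"epoch {epoch}: train={train:.2f} val={val:.2f} gap={gap:.1%}")
    if gap > 0.02:  # checklist threshold: val loss within 2% of train loss
        print("early stop: validation loss is drifting away from training loss")
        break
```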
