
Commit 2a348c5

feat(results): included final tuned results, updated figure, table, and numbers in text
1 parent c8d1033 commit 2a348c5

File tree: 8 files changed, +112 -192 lines


notebooks/analysis.ipynb (+69 -175)

Large diffs are not rendered by default.

report/figures/finetune-results.pdf (6 Bytes)

Binary file not shown.

report/main.tex (+1)

@@ -92,6 +92,7 @@
 \bibliographystyle{plainnat}
 \bibliography{literature}
 
+\newpage
 \appendix
 
 \input{sections/appendix}

report/sections/abstract.tex (+1 -1)

@@ -1,6 +1,6 @@
 \thispagestyle{empty} % To prevent number on the first page
 \begin{abstract}
 
-Homepage2Vec~\cite{homepage2vec}, a state-of-the-art open-source model for multilingual, multilabel website classification, has proven powerful in accurately classifying website topics. However, it is limited by its initial training data, which on average only contains a single topic for a website. This study explores the use of Large Language Models (LLMs) for creating a high-quality finetuning dataset that more accurately reflects the topic diversity of a website. We assess various LLM-based labelers and select the best one through comparison to crowdsourced annotations. We generate two variants of a new 10,000-website dataset, \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k}, for finetuning \nobreak{Homepage2Vec}. We show that finetuning \nobreak{Homepage2Vec} with these datasets improves its macro F1 from 38\% to 42\%. We release both LLM-annotated datasets \cite{curlie-gpt-10k} publicly to encourage further research in this area.
+Homepage2Vec~\cite{homepage2vec}, a state-of-the-art open-source model for multilingual, multilabel website classification, has proven powerful in accurately classifying website topics. However, it is limited by its initial training data, which on average only contains a single topic for a website. This study explores the use of Large Language Models (LLMs) for creating a high-quality finetuning dataset that more accurately reflects the topic diversity of a website. We assess various LLM-based labelers and select the best one through comparison to crowdsourced annotations. We generate two variants of a new 10,000-website dataset, \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k}, for finetuning \nobreak{Homepage2Vec}. We show that finetuning \nobreak{Homepage2Vec} with these datasets improves its macro F1 from 39\% to 43\%. We release both LLM-annotated datasets \cite{curlie-gpt-10k} publicly to encourage further research in this area.
 
 \end{abstract}

report/sections/appendix.tex (+25 -6)

@@ -6,15 +6,14 @@ \subsection{Acknowledgements}\label{appendix:acknowledgements}
 
 % -------- Ethical considerations
 \subsection{Ethical Considerations}\label{appendix:ethical-considerations}
-This study employs the Curlie dataset, managed by dedicated volunteers and moderators ensuring its content remains legal and free from marketing schemes.
-To further support these efforts, we are releasing the re-labeled datasets \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} to the public.
 
-Additionally, we employed the \texttt{crowdsourced} dataset, originally created by Amazon Mechanical Turk workers for the homepage2vec paper \cite{homepage2vec}.
-These workers were compensated in accordance with ethical standards and minimum wage requirements set by the Fair Work platform \cite{ethics2}.
+This study employs the Curlie dataset, managed by dedicated volunteers and moderators ensuring its content remains legal and free from marketing schemes. To further support these efforts, we are releasing the re-labeled datasets \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} to the public.
+
+Additionally, we employed the \texttt{crowdsourced} dataset, originally created by Amazon Mechanical Turk workers for the homepage2vec paper~\cite{homepage2vec}.
+These workers were compensated in accordance with ethical standards and minimum wage requirements set by the Fair Work platform~\cite{ethics2}.
 
 The use of LLMs for annotation, while efficient, raises concerns regarding the economic impact on human annotators who depend on such tasks for their livelihood.
-It is imperative to ensure that this process supplements, rather than replaces, human annotators. In this context, providing platforms like Dynamo \cite{ethics1} for Amazon Mechanical Turk workers to communicate and organize is crucial.
-Additionally, it is crirical to maintain these principles and be cautious of influences from large entities that may hinder the efforts of workers to organize and advocate for their rights.
+It is imperative to ensure that this process supplements, rather than replaces, human annotators. In this context, providing platforms like Dynamo~\cite{ethics1} for Amazon Mechanical Turk workers to communicate and organize is crucial. Additionally, it is critical to maintain these principles and be cautious of influences from large entities that may hinder the efforts of workers to organize and advocate for their rights.
 
 Moreover, the extensive datasets training LLMs may contain biases, potentially influencing the labeling process and perpetuating stereotypes or inequalities.
 It's essential to address these biases to maintain fairness and uphold ethical standards in automated systems.

@@ -82,3 +81,23 @@ \subsection{Example for a \texttt{1-shot} model}\label{app:example-1-shot}
 ...
 }
 \end{lstlisting}
+
+\subsection{Best Hyperparameters}\label{app:hyperparameters}
+
+Table~\ref{tab:best-hyperparameters} shows the best hyperparameters found for finetuning Homepage2Vec on labels from the GPT-3.5 and GPT-4 labelers.
+
+\begin{table}[h]
+\centering
+\caption{\textbf{Best Hyperparameters.} Details the
+best hyperparameters found for finetuning Homepage2Vec on labels from the GPT-3.5 and GPT-4 labelers. Notation follows Section~\ref{sec:methodology}.}
+\begin{tabular}{lcccc}
+\toprule
+\textbf{Model} & $\lambda$ & $\beta$ & $\gamma$ & $\delta$ \\
+\midrule
+GPT-3.5 & 1.6e-5 & 6.4e-2 & 3.7e-1 & 64 \\
+GPT-4 & 1.5e-3 & 2.5e-4 & 4.6e-1 & 64 \\
+\bottomrule
+\end{tabular}
+\label{tab:best-hyperparameters}
+\end{table}
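
Editor's note: for orientation, here is a minimal sketch of how these tuned values could be wired into a finetuning run. This is an illustration, not code from this repository: PyTorch, the Adam optimizer, the plateau-style LR scheduler, the 768-dimensional input, and the dummy training step are all assumptions. Only the four values and the reading of lambda/beta/gamma/delta as learning rate, weight decay, scheduler factor, and batch size (per the commented notes in results.tex below) come from the commit.

import torch
from torch import nn

# Tuned values for the GPT-3.5 run, copied from the table above.
# Assumed mapping (per the results.tex comments): lambda = learning rate,
# beta = weight decay, gamma = scheduler factor, delta = batch size.
LR, WEIGHT_DECAY, SCHED_FACTOR, BATCH_SIZE = 1.6e-5, 6.4e-2, 3.7e-1, 64

# Hypothetical stand-in for the Homepage2Vec classification head;
# the 768-dim input is illustrative, the 14 topic classes follow the paper.
model = nn.Linear(768, 14)

optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
# Assumes a plateau scheduler, reading "scheduler factor" as the
# multiplicative LR decay applied when the monitored metric stalls.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=SCHED_FACTOR)

# One illustrative step on random multilabel data.
x = torch.randn(BATCH_SIZE, 768)
y = torch.randint(0, 2, (BATCH_SIZE, 14)).float()
loss = nn.BCEWithLogitsLoss()(model(x), y)
loss.backward()
optimizer.step()
scheduler.step(loss.item())  # would normally receive a validation metric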

report/sections/results.tex (+9 -3)

@@ -53,10 +53,16 @@ \subsection*{Phase 1: Identifying an Optimal LLM Labeler}
 
 \subsection*{Phase 2: Transferring Knowledge via Finetuning}
 
-% TODO: Include best hyperparameters
-% TODO: Update the table and figure 3
+% GPT-3.5:
+% LR / Weight Decay / Scheduler Factor / Batch Size
+% 0.000016 0.064037 0.376673 64
+% 1.6e-05 / 6.40e-02 / 3.77e-01 / 64
 
-Table~\ref{tab:finetune-results} shows the results of the finetuning experiments. We observe that both models increase the recall from 39.4\% to 47.6\% and 49.1\% when finetuned on GPT-3.5 and GPT-4 labels, respectively. However, this comes at cost of a minor decreases in precision. Overall, the macro F1 score increases from 39.2\% to 42.6\% and 42.8\% - an improvement of 3.4 and 3.6 percentage points, respectively.
+% GPT-4:
+% 0.001535 0.000252 0.460896 64
+% 1.5e-03 / 2.52e-04 / 4.61e-01 / 64
+
+Table~\ref{tab:finetune-results} shows the results of the finetuning experiments. We report only the results for the hyperparameter configuration with the best validation macro F1 score; the best hyperparameters are listed in Appendix~\ref{app:hyperparameters}. We observe that both models increase recall from 39.4\% to 51.1\% and 46.4\% when finetuned on GPT-3.5 and GPT-4 labels, respectively. Overall, the macro F1 score increases from 39.2\% to 43.5\% and 43.1\%, an improvement of 4.3 and 3.9 percentage points, respectively.
 This improvement shows that we were able to transfer the superior labeling capabilities of the LLM to Homepage2Vec. Figure~\ref{fig:finetune-results} shows that the increase in macro F1 score is achieved consistently across the classes, with 12 out of the 14 classes improving for both models.
 
 % 0.391610 = 39.2% (Pre-trained Homepage2Vec)

report/sections/summary.tex (+1 -1)

@@ -1,3 +1,3 @@
 \section{Summary}\label{sec:summary}
 
-We have demonstrated that LLMs can provide consistent, cost-effective, and high-quality annotations for the complex task of multilingual, multilabel website topic classification. Our approach, which involved finetuning a pre-trained Homepage2vec model on LLM-generated labels, resulted in a improvement of 3.6 percentage points in the macro F1 score. Additionally, we are releasing the \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} datasets \cite{curlie-gpt-10k} with the intention of supporting further research in the open-source community.
+We have demonstrated that LLMs can provide cost-effective and high-quality annotations in the setting of multilingual, multilabel website topic classification. Our approach, which involved finetuning a pre-trained Homepage2Vec model on LLM-generated labels, resulted in an improvement of 4.3 percentage points in the macro F1 score. Additionally, we are releasing the \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} datasets~\cite{curlie-gpt-10k} with the intention of supporting further research in the open-source community.

report/tables/finetune-results.tex (+6 -6)

@@ -1,15 +1,15 @@
 \begin{table}[!ht]
 \centering
-\caption{\textbf{Finetuning Results.} The table shows the precision, recall, macro F1 and and labels per page when evaluated on \texttt{crowdsourced}. We show results for the pre-trained baseline, as well as both finetuned variants.}
+\caption{\textbf{Finetuning Results.} Precision, recall, macro F1, and labels per page when evaluated on \texttt{crowdsourced}, for the pre-trained baseline and both finetuned variants.}
 \label{tab:finetune-results}
 \begin{tabular}{lrrrr}
 \toprule
-& \textbf{Pr.} & \textbf{Re.} & \textbf{M.-F1} & \textbf{LPP} \\
-& (\%) & (\%) & (\%) & ($\mu$) \\
+& Pr. & Re. & M.-F1 & LPP \\
+& (\%) & (\%) & (\%) & ($\sigma$) \\
 \midrule
-Pretrained & \textbf{40.97} & 39.44 & 39.16 & 1.84 \\
-GPT-3.5 & 40.19 & 47.55 & 42.63 & 1.93 \\
-GPT-4 & 39.92 & \textbf{49.07} & \textbf{42.87} & \textbf{3.07} \\
+Pretrained & 40.97 & 39.44 & 39.16 & 2.36 \\
+GPT-3.5 & 39.16 & \textbf{51.14} & \textbf{43.49} & \textbf{3.33} \\
+GPT-4 & \textbf{42.00} & 46.42 & 43.13 & 2.80 \\
 \bottomrule
 \end{tabular}
 \end{table}
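
Editor's note: as a reading aid for the table's columns, here is a small sketch (not repository code) of how macro-averaged precision, recall, and F1, plus a labels-per-page figure, could be computed with scikit-learn over multilabel indicator matrices. The random matrices are placeholders for the crowdsourced evaluation data, and the LPP line computes a mean; the commit's ($\sigma$) header may instead denote a standard deviation, which is left as committed.

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Placeholder multilabel indicator matrices (n_websites x 14 classes);
# a real evaluation would use predictions and labels for the crowdsourced set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 14))
y_pred = rng.integers(0, 2, size=(1000, 14))

pr = precision_score(y_true, y_pred, average="macro", zero_division=0)
re = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
lpp = y_pred.sum(axis=1).mean()  # mean labels predicted per page

print(f"Pr. {100 * pr:.2f}%  Re. {100 * re:.2f}%  M.-F1 {100 * f1:.2f}%  LPP {lpp:.2f}")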
