
Commit 2a348c5

feat(results): included final tuned results, updated figure, table, and numbers in text
1 parent c8d1033 commit 2a348c5

File tree: 8 files changed, +112 -192 lines


notebooks/analysis.ipynb (+69 -175)

Large diffs are not rendered by default.

report/figures/finetune-results.pdf (6 Bytes)

Binary file not shown.

report/main.tex (+1)

@@ -92,6 +92,7 @@
 \bibliographystyle{plainnat}
 \bibliography{literature}
 
+\newpage
 \appendix
 
 \input{sections/appendix}

report/sections/abstract.tex (+1 -1)

@@ -1,6 +1,6 @@
 \thispagestyle{empty} % To prevent number on the first page
 \begin{abstract}
 
-Homepage2Vec~\cite{homepage2vec}, a state-of-the-art open-source model for multilingual, multilabel website classification, has proven powerful in accurately classifying website topics. However, it is limited by its initial training data, which on average only contains a single topic for a website. This study explores the use of Large Language Models (LLMs) for creating a high-quality finetuning dataset that more accurately reflects the topic diversity of a website. We assess various LLM-based labelers and select the best one through comparison to crowdsourced annotations. We generate two variants of a new 10,000-website dataset, \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k}, for finetuning \nobreak{Homepage2Vec}. We show that finetuning \nobreak{Homepage2Vec} with these datasets improves its macro F1 from 38\% to 42\%. We release both LLM-annotated datasets \cite{curlie-gpt-10k} publicly to encourage further research in this area.
+Homepage2Vec~\cite{homepage2vec}, a state-of-the-art open-source model for multilingual, multilabel website classification, has proven powerful in accurately classifying website topics. However, it is limited by its initial training data, which on average only contains a single topic for a website. This study explores the use of Large Language Models (LLMs) for creating a high-quality finetuning dataset that more accurately reflects the topic diversity of a website. We assess various LLM-based labelers and select the best one through comparison to crowdsourced annotations. We generate two variants of a new 10,000-website dataset, \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k}, for finetuning \nobreak{Homepage2Vec}. We show that finetuning \nobreak{Homepage2Vec} with these datasets improves its macro F1 from 39\% to 43\%. We release both LLM-annotated datasets \cite{curlie-gpt-10k} publicly to encourage further research in this area.
 
 \end{abstract}

report/sections/appendix.tex (+25 -6)

@@ -6,15 +6,14 @@ \subsection{Acknowledgements}\label{appendix:acknowledgements}
 
 % -------- Ethical considerations
 \subsection{Ethical Considerations}\label{appendix:ethical-considerations}
-This study employs the Curlie dataset, managed by dedicated volunteers and moderators ensuring its content remains legal and free from marketing schemes.
-To further support these efforts, we are releasing the re-labeled datasets \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} to the public.
 
-Additionally, we employed the \texttt{crowdsourced} dataset, originally created by Amazon Mechanical Turk workers for the homepage2vec paper \cite{homepage2vec}.
-These workers were compensated in accordance with ethical standards and minimum wage requirements set by the Fair Work platform \cite{ethics2}.
+This study employs the Curlie dataset, managed by dedicated volunteers and moderators ensuring its content remains legal and free from marketing schemes. To further support these efforts, we are releasing the re-labeled datasets \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} to the public.
+
+Additionally, we employed the \texttt{crowdsourced} dataset, originally created by Amazon Mechanical Turk workers for the homepage2vec paper~\cite{homepage2vec}.
+These workers were compensated in accordance with ethical standards and minimum wage requirements set by the Fair Work platform~\cite{ethics2}.
 
 The use of LLMs for annotation, while efficient, raises concerns regarding the economic impact on human annotators who depend on such tasks for their livelihood.
-It is imperative to ensure that this process supplements, rather than replaces, human annotators. In this context, providing platforms like Dynamo \cite{ethics1} for Amazon Mechanical Turk workers to communicate and organize is crucial.
-Additionally, it is crirical to maintain these principles and be cautious of influences from large entities that may hinder the efforts of workers to organize and advocate for their rights.
+It is imperative to ensure that this process supplements, rather than replaces, human annotators. In this context, providing platforms like Dynamo~\cite{ethics1} for Amazon Mechanical Turk workers to communicate and organize is crucial. Additionally, it is critical to maintain these principles and be cautious of influences from large entities that may hinder the efforts of workers to organize and advocate for their rights.
 
 Moreover, the extensive datasets training LLMs may contain biases, potentially influencing the labeling process and perpetuating stereotypes or inequalities.
 It's essential to address these biases to maintain fairness and uphold ethical standards in automated systems.

@@ -82,3 +81,23 @@ \subsection{Example for a \texttt{1-shot} model}\label{app:example-1-shot}
 ...
 }
 \end{lstlisting}
+
+\subsection{Best Hyperparameters}\label{app:hyperparameters}
+
+Table~\ref{tab:best-hyperparameters} shows the best hyperparameters found for finetuning Homepage2Vec on labels from the GPT-3.5 and GPT-4 labelers.
+
+\begin{table}[h]
+\centering
+\caption{\textbf{Best Hyperparameters.} Details the
+best hyperparameters found for finetuning Homepage2Vec on labels from the GPT-3.5 and GPT-4 labelers. Notation follows Section~\ref{sec:methodology}.}
+\begin{tabular}{lcccc}
+\toprule
+\textbf{Model} & $\lambda$ & $\beta$ & $\gamma$ & $\delta$ \\
+\midrule
+GPT-3.5 & 1.6e-5 & 6.4e-2 & 3.7e-1 & 64 \\
+GPT-4 & 1.5e-3 & 2.5e-4 & 4.6e-1 & 64 \\
+\bottomrule
+\end{tabular}
+\label{tab:best-hyperparameters}
+\end{table}
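
Editor's note: for orientation, here is a minimal sketch of how these tuned values could be wired into a finetuning run. This is an illustration, not code from this repository: PyTorch, the Adam optimizer, the plateau-style LR scheduler, the 768-dimensional input, and the dummy training step are all assumptions. Only the four values and the reading of lambda/beta/gamma/delta as learning rate, weight decay, scheduler factor, and batch size (per the commented notes in results.tex below) come from the commit.

import torch
from torch import nn

# Tuned values for the GPT-3.5 run, copied from the table above.
# Assumed mapping (per the results.tex comments): lambda = learning rate,
# beta = weight decay, gamma = scheduler factor, delta = batch size.
LR, WEIGHT_DECAY, SCHED_FACTOR, BATCH_SIZE = 1.6e-5, 6.4e-2, 3.7e-1, 64

# Hypothetical stand-in for the Homepage2Vec classification head;
# the 768-dim input is illustrative, the 14 topic classes follow the paper.
model = nn.Linear(768, 14)

optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
# Assumes a plateau scheduler, reading "scheduler factor" as the
# multiplicative LR decay applied when the monitored metric stalls.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=SCHED_FACTOR)

# One illustrative step on random multilabel data.
x = torch.randn(BATCH_SIZE, 768)
y = torch.randint(0, 2, (BATCH_SIZE, 14)).float()
loss = nn.BCEWithLogitsLoss()(model(x), y)
loss.backward()
optimizer.step()
scheduler.step(loss.item())  # would normally receive a validation metric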

report/sections/results.tex (+9 -3)

@@ -53,10 +53,16 @@ \subsection*{Phase 1: Identifying an Optimal LLM Labeler}
 
 \subsection*{Phase 2: Transferring Knowledge via Finetuning}
 
-% TODO: Include best hyperparameters
-% TODO: Update the table and figure 3
+% GPT-3.5:
+% LR / Weight Decay / Scheduler Factor / Batch Size
+% 0.000016 0.064037 0.376673 64
+% 1.6e-05 / 6.40e-02 / 3.77e-01 / 64
 
-Table~\ref{tab:finetune-results} shows the results of the finetuning experiments. We observe that both models increase the recall from 39.4\% to 47.6\% and 49.1\% when finetuned on GPT-3.5 and GPT-4 labels, respectively. However, this comes at cost of a minor decreases in precision. Overall, the macro F1 score increases from 39.2\% to 42.6\% and 42.8\% - an improvement of 3.4 and 3.6 percentage points, respectively.
+% GPT-4:
+% 0.001535 0.000252 0.460896 64
+% 1.5e-03 / 2.52e-04 / 4.61e-01 / 64
+
+Table~\ref{tab:finetune-results} shows the results of the finetuning experiments. We report only the results for the hyperparameter configuration with the best validation macro F1 score; the best hyperparameters are listed in Appendix~\ref{app:hyperparameters}. We observe that both models increase recall from 39.4\% to 51.1\% and 46.4\% when finetuned on GPT-3.5 and GPT-4 labels, respectively. Overall, the macro F1 score increases from 39.2\% to 43.5\% and 43.1\%, an improvement of 4.3 and 3.9 percentage points, respectively.
 This improvement shows that we were able to transfer the superior labeling capabilities of the LLM to Homepage2Vec. Figure~\ref{fig:finetune-results} shows that the increase in macro F1 score is achieved consistently across the classes, with 12 out of the 14 classes improving for both models.
 
 % 0.391610 = 39.2% (Pre-trained Homepage2Vec)

report/sections/summary.tex (+1 -1)

@@ -1,3 +1,3 @@
 \section{Summary}\label{sec:summary}
 
-We have demonstrated that LLMs can provide consistent, cost-effective, and high-quality annotations for the complex task of multilingual, multilabel website topic classification. Our approach, which involved finetuning a pre-trained Homepage2vec model on LLM-generated labels, resulted in a improvement of 3.6 percentage points in the macro F1 score. Additionally, we are releasing the \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} datasets \cite{curlie-gpt-10k} with the intention of supporting further research in the open-source community.
+We have demonstrated that LLMs can provide cost-effective and high-quality annotations in the setting of multilingual, multilabel website topic classification. Our approach, which involved finetuning a pre-trained Homepage2Vec model on LLM-generated labels, resulted in an improvement of 4.3 percentage points in the macro F1 score. Additionally, we are releasing the \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} datasets~\cite{curlie-gpt-10k} with the intention of supporting further research in the open-source community.

report/tables/finetune-results.tex (+6 -6)

@@ -1,15 +1,15 @@
 \begin{table}[!ht]
 \centering
-\caption{\textbf{Finetuning Results.} The table shows the precision, recall, macro F1 and and labels per page when evaluated on \texttt{crowdsourced}. We show results for the pre-trained baseline, as well as both finetuned variants.}
+\caption{\textbf{Finetuning Results.} Precision, recall, macro F1, and labels per page when evaluated on \texttt{crowdsourced}, for the pre-trained baseline and both finetuned variants.}
 \label{tab:finetune-results}
 \begin{tabular}{lrrrr}
 \toprule
-& \textbf{Pr.} & \textbf{Re.} & \textbf{M.-F1} & \textbf{LPP} \\
-& (\%) & (\%) & (\%) & ($\mu$) \\
+& Pr. & Re. & M.-F1 & LPP \\
+& (\%) & (\%) & (\%) & ($\sigma$) \\
 \midrule
-Pretrained & \textbf{40.97} & 39.44 & 39.16 & 1.84 \\
-GPT-3.5 & 40.19 & 47.55 & 42.63 & 1.93 \\
-GPT-4 & 39.92 & \textbf{49.07} & \textbf{42.87} & \textbf{3.07} \\
+Pretrained & 40.97 & 39.44 & 39.16 & 2.36 \\
+GPT-3.5 & 39.16 & \textbf{51.14} & \textbf{43.49} & \textbf{3.33} \\
+GPT-4 & \textbf{42.00} & 46.42 & 43.13 & 2.80 \\
 \bottomrule
 \end{tabular}
 \end{table}
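
Editor's note: as a reading aid for the table's columns, here is a small sketch (not repository code) of how macro-averaged precision, recall, and F1, plus a labels-per-page figure, could be computed with scikit-learn over multilabel indicator matrices. The random matrices are placeholders for the crowdsourced evaluation data, and the LPP line computes a mean; the commit's ($\sigma$) header may instead denote a standard deviation, which is left as committed.

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Placeholder multilabel indicator matrices (n_websites x 14 classes);
# a real evaluation would use predictions and labels for the crowdsourced set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 14))
y_pred = rng.integers(0, 2, size=(1000, 14))

pr = precision_score(y_true, y_pred, average="macro", zero_division=0)
re = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
lpp = y_pred.sum(axis=1).mean()  # mean labels predicted per page

print(f"Pr. {100 * pr:.2f}%  Re. {100 * re:.2f}%  M.-F1 {100 * f1:.2f}%  LPP {lpp:.2f}")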
