- Context vector is an expected value: a weighted sum of the hidden states, with the attention weights acting as the probability distribution (see the sketch below)
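A minimal sketch of this idea, assuming simple dot-product attention over encoder hidden states (the names `query` and `hidden_states` are illustrative, not from the original notes):

```python
import numpy as np

def context_vector(query, hidden_states):
    """The context vector is the expected value of the encoder hidden
    states under the attention distribution."""
    scores = hidden_states @ query                    # alignment scores, shape (seq_len,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> probability distribution
    return weights @ hidden_states                    # weighted sum = expected hidden state

# toy example: 4 encoder states of dimension 3
hidden_states = np.random.randn(4, 3)
query = np.random.randn(3)
print(context_vector(query, hidden_states))
```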
- Transformer models
  - Examples: T5, GPT-2, BERT
- The closer the score is to 1, the better
- Lower temperature setting: more confident, conservative network (probability mass concentrates on the top tokens)
- Higher temperature setting: more excited, random network (flatter distribution, more diverse samples); see the sketch below
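A small sketch of how temperature reshapes the output distribution, assuming standard softmax sampling over logits (the logit values are illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before the softmax.
    Low temperature sharpens the distribution (more conservative);
    high temperature flattens it (more random)."""
    scaled = np.asarray(logits) / temperature
    scaled = scaled - scaled.max()        # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.5))  # peaked: most mass on the top token
print(softmax_with_temperature(logits, 2.0))  # flatter: closer to uniform
```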
- Problems:
	- Penalizes long sequences, so you should normalize scores by the sentence length (see the sketch after this list)
- Computationally expensive and consumes a lot of memory
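Because a sequence score is a sum of (negative) log probabilities, longer hypotheses accumulate lower totals. A common fix, sketched below under the assumption that these bullets refer to beam-search-style sequence scoring, is to divide the summed log probability by the sequence length:

```python
import numpy as np

def normalized_log_prob(token_log_probs, alpha=1.0):
    """Length-normalized sequence score: sum of token log-probabilities
    divided by (sequence length ** alpha); alpha=1.0 is plain averaging."""
    token_log_probs = np.asarray(token_log_probs)
    return token_log_probs.sum() / (len(token_log_probs) ** alpha)

short = [-0.5, -0.6]                     # 2 tokens
long = [-0.5, -0.6, -0.4, -0.5, -0.3]    # 5 tokens, lower raw sum but similar per-token quality
print(sum(short), sum(long))                                   # raw sums favor the short sequence
print(normalized_log_prob(short), normalized_log_prob(long))   # normalized scores are comparable
```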