
The cache in bpe() may occupy a large amount of memory after running for a long time. #35

Open
binjie09 opened this issue Apr 1, 2023 · 1 comment

Comments


binjie09 commented Apr 1, 2023

I use a large amount of Chinese text with the GPT service, and the cached Chinese phrases end up occupying a significant amount of memory.

After running for one day, the process occupies more than 1 GB of memory, which briefly made me think there was a memory leak in my own code.

GPT-3-Encoder/Encoder.js

Lines 87 to 153 in 9df47fc

function bpe(token) {
  if (cache.has(token)) {
    return cache.get(token)
  }

  let word = token.split('')

  let pairs = get_pairs(word)

  if (!pairs) {
    return token
  }

  while (true) {
    const minPairs = {}
    Array.from(pairs).map(pair => {
      const rank = bpe_ranks[pair]
      minPairs[(isNaN(rank) ? 10e10 : rank)] = pair
    })

    const bigram = minPairs[Math.min(...Object.keys(minPairs).map(x => {
      return parseInt(x)
    }))]

    if (!(bigram in bpe_ranks)) {
      break
    }

    const first = bigram[0]
    const second = bigram[1]
    let new_word = []
    let i = 0

    while (i < word.length) {
      const j = word.indexOf(first, i)
      if (j === -1) {
        new_word = new_word.concat(word.slice(i))
        break
      }
      new_word = new_word.concat(word.slice(i, j))
      i = j

      if (word[i] === first && i < word.length - 1 && word[i + 1] === second) {
        new_word.push(first + second)
        i = i + 2
      } else {
        new_word.push(word[i])
        i = i + 1
      }
    }

    word = new_word
    if (word.length === 1) {
      break
    } else {
      pairs = get_pairs(word)
    }
  }

  word = word.join(' ')
  cache.set(token, word)

  return word
}
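For context, here is a rough sketch of how the growth happens (the loop and its input strings are hypothetical, purely to illustrate): every distinct token string that reaches bpe() adds one entry to the module-level cache, and nothing ever evicts it, so a long-running service that keeps seeing new Chinese phrases accumulates entries indefinitely.

// Hypothetical illustration (not from the issue): each new unique token string
// adds a permanent entry to the internal bpe cache, so memory grows with the
// number of distinct tokens ever seen, not with the current workload.
const { encode } = require('gpt-3-encoder')

for (let i = 0; i < 1_000_000; i++) {
  encode(`用户请求 ${i}`) // every new token caches its bpe() result forever
}
// process.memoryUsage().heapUsed keeps climbing, since the cache is never cleared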

niieani referenced this issue in niieani/gpt-tokenizer Apr 16, 2023
requires providing an additional argument with cache if you want to make it shared

fixes #35

niieani commented Apr 16, 2023

Hi @binjie09. My PR #38 fixes this by requiring the cache to be passed in explicitly. That way you can control it however you like. You could even implement a custom Map that removes old entries once some limit is reached.
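A minimal sketch of what such a bounded cache could look like (the class name and size limit are mine, not part of the PR): a Map subclass that evicts its oldest entry once a limit is reached, relying on Map's insertion-order iteration.

// Sketch only: a size-bounded Map you could supply as the shared cache.
// Map iterates keys in insertion order, so keys().next().value is the oldest entry.
class BoundedCache extends Map {
  constructor(maxSize = 50_000) {
    super()
    this.maxSize = maxSize
  }

  set(key, value) {
    if (!this.has(key) && this.size >= this.maxSize) {
      this.delete(this.keys().next().value) // drop the oldest entry
    }
    return super.set(key, value)
  }
}

You could then pass a `new BoundedCache()` as the explicit cache argument the PR introduces; memory stays bounded at roughly maxSize entries no matter how many distinct tokens the service sees.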

If you can't wait for the PR to get merged, I've published my fork as gpt-tokenizer.
