
The cache in bpe() may occupy a large amount of memory after running for a long time. #35

Open
binjie09 opened this issue Apr 1, 2023 · 1 comment

Comments


binjie09 commented Apr 1, 2023

I use a large amount of Chinese text with the GPT service, and the cached Chinese phrases end up occupying a significant amount of memory.

After running for one day, the process occupies more than 1 GB of memory, which briefly made me think there was a memory leak in my own code.

GPT-3-Encoder/Encoder.js

Lines 87 to 153 in 9df47fc

function bpe(token) {
  if (cache.has(token)) {
    return cache.get(token)
  }

  let word = token.split('')

  let pairs = get_pairs(word)

  if (!pairs) {
    return token
  }

  while (true) {
    const minPairs = {}
    Array.from(pairs).map(pair => {
      const rank = bpe_ranks[pair]
      minPairs[(isNaN(rank) ? 10e10 : rank)] = pair
    })

    const bigram = minPairs[Math.min(...Object.keys(minPairs).map(x => {
      return parseInt(x)
    }))]

    if (!(bigram in bpe_ranks)) {
      break
    }

    const first = bigram[0]
    const second = bigram[1]
    let new_word = []
    let i = 0

    while (i < word.length) {
      const j = word.indexOf(first, i)
      if (j === -1) {
        new_word = new_word.concat(word.slice(i))
        break
      }
      new_word = new_word.concat(word.slice(i, j))
      i = j

      if (word[i] === first && i < word.length - 1 && word[i + 1] === second) {
        new_word.push(first + second)
        i = i + 2
      } else {
        new_word.push(word[i])
        i = i + 1
      }
    }

    word = new_word
    if (word.length === 1) {
      break
    } else {
      pairs = get_pairs(word)
    }
  }

  word = word.join(' ')
  cache.set(token, word)

  return word
}
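For context, here is a rough sketch of how the growth happens (the loop and its input strings are hypothetical, purely to illustrate): every distinct token string that reaches bpe() adds one entry to the module-level cache, and nothing ever evicts it, so a long-running service that keeps seeing new Chinese phrases accumulates entries indefinitely.

// Hypothetical illustration (not from the issue): each new unique token string
// adds a permanent entry to the internal bpe cache, so memory grows with the
// number of distinct tokens ever seen, not with the current workload.
const { encode } = require('gpt-3-encoder')

for (let i = 0; i < 1_000_000; i++) {
  encode(`用户请求 ${i}`) // every new token caches its bpe() result forever
}
// process.memoryUsage().heapUsed keeps climbing, since the cache is never cleared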

niieani referenced this issue in niieani/gpt-tokenizer Apr 16, 2023
requires providing an additional argument with cache if you want to make it shared

fixes #35

niieani commented Apr 16, 2023

Hi @binjie09. My PR #38 fixes this by requiring the cache to be passed in explicitly. That way you can control it however you like. You could even implement a custom Map that removes old entries once some limit is reached.
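A minimal sketch of what such a bounded cache could look like (the class name and size limit are mine, not part of the PR): a Map subclass that evicts its oldest entry once a limit is reached, relying on Map's insertion-order iteration.

// Sketch only: a size-bounded Map you could supply as the shared cache.
// Map iterates keys in insertion order, so keys().next().value is the oldest entry.
class BoundedCache extends Map {
  constructor(maxSize = 50_000) {
    super()
    this.maxSize = maxSize
  }

  set(key, value) {
    if (!this.has(key) && this.size >= this.maxSize) {
      this.delete(this.keys().next().value) // drop the oldest entry
    }
    return super.set(key, value)
  }
}

You could then pass a `new BoundedCache()` as the explicit cache argument the PR introduces; memory stays bounded at roughly maxSize entries no matter how many distinct tokens the service sees.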

If you can't wait for the PR to get merged, I've published my fork as gpt-tokenizer.
