boostcampaitech5/level2_nlp_datacentric-nlp-04

level2_nlp_datacentric-nlp-04

📄 Project Overview

  • ์—ฐํ•ฉ๋‰ด์Šค์˜ ๋‰ด์Šค ์ œ๋ชฉ์„ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ IT๊ณผํ•™, ๊ฒฝ์ œ, ์‚ฌํšŒ, ์ƒํ™œ๋ฌธํ™”, ์„ธ๊ณ„, ์Šคํฌ์ธ , ์ •์น˜ ์ด 7๊ฐœ์˜ ํด๋ž˜์Šค๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ํ”„๋กœ์ ํŠธ๋กœ, ๊ธฐ์กด์— ์ฃผ์–ด์ง„ baseline ์ฝ”๋“œ์—์„œ ๋ชจ๋ธ ๊ตฌ์กฐ์™€ hyperparameter ๋ณ€๊ฒฝ ์—†์ด ๋ฐ์ดํ„ฐ๋ฅผ ์ค‘์‹ฌ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ชฉํ‘œ๋กœ ํ•œ๋‹ค.
  • ์ด ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” 45,678๊ฐœ ์ด๋ฉฐ, Train Data / Validation Data๋ฅผ 7:3๋น„์œจ๋กœ ๋‚˜๋ˆ„์–ด์„œ, ํ•™์Šต์— ์ด์šฉํ•˜์˜€๋‹ค.
  • Train Data์˜ 12%(5,481๊ฐœ) ์—๋Š” g2p(grapheme to phoneme)๊ฐ€ ์ ์šฉ๋˜์–ด ๋‰ด์Šค ์ œ๋ชฉ์— ๋…ธ์ด์ฆˆ๊ฐ€ ์žˆ๊ณ , 3%(1,371๊ฐœ) ์—๋Š” target์ด text์™€๋Š” ๋งž์ง€ ์•Š๋Š” Miss Label ๋…ธ์ด์ฆˆ๊ฐ€ ์žˆ๋‹ค.
  • Test Data๋Š” ์ด 9,107๊ฐœ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค.

🗓️ Development Period

  • 23.05.22 - 23.06.01 (11 days in total)

👨‍👨‍👧‍👧 Members and Roles

곽민석 이인균 임하림 최휘민 황윤기
  • 곽민석
    • Augmentation with Generative Model
    • Build data managing page
  • 이인균
    • Data Filtering
    • Data Augmentation
  • 임하림
    • Data Filtering
    • Data Clearing
  • 최휘민
    • Back Translation Data Augmentation
    • Augmentation Data Filtering
  • 황윤기
    • Data Filtering
    • Prediction, Miss Label Data Analysis Page
    • Synthetic Data Augmentation

👨‍🔬 Experiments

Augmentation

Augmentation with GPT

  • ์œ ๋ฃŒ ์„œ๋น„์Šค๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์—†๋Š” ๊ด€๊ณ„๋กœ โ€œkakaobrain/kogptโ€ ๋ชจ๋ธ์„ ์„œ๋ฒ„์—์„œ ์ž‘๋™์‹œ์ผœ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ด์šฉํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•˜์˜€๋‹ค.
  • ํ”„๋กฌํ”„ํŠธ์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ๋ฌผ์˜ ์งˆ์ด ๋‹ฌ๋ผ ์—ฌ๋Ÿฌ ์‹œํ–‰์ฐฉ์˜ค ๋์— ์•„๋ž˜์™€ ๊ฐ™์€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜์˜€๋‹ค.
    ์•„๋ž˜์™€ ๊ฐ™์ด ํ‚ค์›Œ๋“œ๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ธฐ์‚ฌ ์ œ๋ชฉ์„ ์ œ์ž‘ํ•ด์ค˜. 
    
    ์ž…๋ ฅ: IT๊ณผํ•™
    ์ถœ๋ ฅ: โ€˜AI ์‚ฐ์—…์˜ ์Œ€โ€™ GPU ์‹œ์žฅ ๋…์ ํ•œ ์—”๋น„๋””์•„
    
    ์ž…๋ ฅ: ์ •์น˜
    ์ถœ๋ ฅ: ๊ตฐ, ๋ถํ•œ ๋ฐœ์‚ฌ์ฒด ์ž”ํ•ด ์ธ์–‘์ž‘์ „ ๋ณธ๊ฒฉํ™”โ€ฆ์‹ฌํ•ด์ž ์ˆ˜์‚ฌ ํˆฌ์ž…
    
    ์ž…๋ ฅ: {keyword}
    ์ถœ๋ ฅ:
    
  • ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ด์šฉํ•˜์—ฌ ์ƒ์„ฑํ•œ ๋ฐ์ดํ„ฐ์˜ ์˜ˆ์‹œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
    ์ž…๋ ฅ: IT๊ณผํ•™
    ์ถœ๋ ฅ: KT, '๊ธฐ๊ฐ€ ์ธํ„ฐ๋„ท' ์„œ๋น„์Šค ์ถœ์‹œ
    
  • ์œ„์˜ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ๋ผ๋ฒจ๋ณ„๋กœ 1,200๊ฐœ ์ด 8,400๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜์˜€์œผ๋ฉฐ, ์ด์—๋Œ€ํ•œ ๊ฒฐ๊ณผ๋ฌผ์€ ๋งํฌ๋ฅผ ํ†ตํ•ด ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

Back Translation Data Augmentation

  • We performed Back Translation Data Augmentation with Google Translate and the Papago API, restricted to data that had not been identified as noise.
  • We then filtered the generated data by its similarity to the original, measured with an SBERT model, which is strong at sentence embedding.
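
As a rough sketch of the similarity filter: the code below assumes an `embed` callable that maps a sentence to a vector (in practice an SBERT encoder), a cosine-similarity cutoff `min_sim`, and a rule that drops back-translations identical to the original. The README does not state the threshold or these details, so all three are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    pairs: List[Tuple[str, str]],
    embed: Callable[[str], Vector],
    min_sim: float = 0.8,  # hypothetical cutoff, not taken from the README
) -> List[Tuple[str, str]]:
    """Keep (original, augmented) pairs that are similar but not identical."""
    kept = []
    for original, augmented in pairs:
        if augmented == original:
            continue  # assumption: an unchanged sentence adds nothing new
        if cosine_similarity(embed(original), embed(augmented)) >= min_sim:
            kept.append((original, augmented))
    return kept
```

Passing the encoder in as a parameter keeps the sketch independent of any particular embedding library.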

Data Noise Filtering

  • We built a binary classifier (Accuracy 93%) that takes sentences with and without G2P applied as input and predicts whether a sentence is noisy, and used it to remove part of the data.
  • We also applied g2p once more to the data: if a sentence did not change, we judged it to be already g2p-converted, and therefore noisy, and separated it out.
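
The second check can be sketched as below, assuming `g2p` is any callable that converts a string to its pronounced form (the README does not say which converter was used). `toy_g2p` is a stand-in covering two well-known Korean pronunciation changes so that the example runs without external packages.

```python
from typing import Callable, Iterable, List, Tuple


def is_g2p_noised(text: str, g2p: Callable[[str], str]) -> bool:
    """True when applying g2p leaves the text unchanged."""
    return g2p(text) == text


def split_by_g2p_noise(
    texts: Iterable[str], g2p: Callable[[str], str]
) -> Tuple[List[str], List[str]]:
    """Split texts into (clean, noised) using the unchanged-after-g2p check."""
    clean: List[str] = []
    noised: List[str] = []
    for text in texts:
        (noised if is_g2p_noised(text, g2p) else clean).append(text)
    return clean, noised


# Toy stand-in for a real converter (illustration only).
TOY_RULES = {"국물": "궁물", "같이": "가치"}


def toy_g2p(text: str) -> str:
    """Replace written forms with their pronounced forms using TOY_RULES."""
    for written, spoken in TOY_RULES.items():
        text = text.replace(written, spoken)
    return text
```

Note that a sentence whose spelling already matches its pronunciation also passes through unchanged, so this check flags candidates rather than proving that noise is present.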

Data Clearing

  • 10๋งŒ๊ฐœ์˜ ์›๋ณธ๋ฌธ์žฅ๊ณผ g2p๋œ ๋ฌธ์žฅ์Œ์„ Bart-base ๋ชจ๋ธ๋กœ ํ•™์Šต์‹œ์ผœ ์‚ฌ์šฉํ•˜๋ ค๊ณ  ํ–ˆ์œผ๋‚˜ ์ƒ๊ฐ๋ณด๋‹ค ๋ฒˆ์—ญ๊ฒฐ๊ณผ๊ฐ€ ์ข‹์ง€ ์•Š์•„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•˜๊ณ , 50๋งŒ๊ฐœ๋กœ MT5-large๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ์„ ์ด์šฉํ•ด์„œ noise๋ฅผ ๋˜๋Œ๋ ค dataset์„ ๋งŒ๋“ค์–ด ํ•™์Šต์‹œํ‚จ ๊ฒƒ์„ ์‚ฌ์šฉํ–ˆ๋‹ค.
  • ์›๋ณธ์œผ๋กœ ๋Œ๋ ค์ค€ ๋ฌธ์žฅ์—์„œ ํŠน์ˆ˜๋ฌธ์ž๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ์—๋Š” ๏ฟฝ ๋กœ ํ‘œ์‹œ๋˜์–ด ๊ณต๋ฐฑ์œผ๋กœ ์ œ๊ฑฐํ–ˆ๋‹ค.

Miss Label Filtering

  • ์ž˜๋ชป ์˜ˆ์ธกํ•œ ๊ฒฐ๊ณผ์˜ ํ™•๋ฅ ์„ Class๋ณ„ ์˜ค๋ฆ„์ฐจ์ˆœ์œผ๋กœ ๋‚˜์—ด, ๋ฐฑ๋ถ„์œ„ ๊ธฐ์ค€ Threshold์ด์ƒ์ด ๋˜๋Š” Prediction์„ Target์œผ๋กœ ์žฌ๊ตฌ์„ฑํ•˜์—ฌ, Miss Label์„ Filtering
  • ์˜ˆ์ธก๊ฒฐ๊ณผ ํ™•์ธ๊ณผ, Filtering์„ ์†์‰ฝ๊ฒŒ ์ ์šฉํ•˜๊ณ , ํŒ€์›๋“ค๊ณผ์˜ ๊ณต์œ  ํŽธ์˜์„ฑ์„ ์œ„ํ•ด ์•„๋ž˜ ์ž‘์„ฑํ•œ Data Controll Center Page์— ํ•ด๋‹น ๊ธฐ๋Šฅ ์ถ”๊ฐ€.
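
One possible reading of the percentile rule is sketched below. It assumes that thresholds are computed per predicted class, that the percentile uses the nearest-rank definition, and that each row carries the probability of its predicted class; the `Prediction` record and the default of 90 are hypothetical choices for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Prediction(NamedTuple):
    text: str
    target: int  # label stored in the dataset
    pred: int    # class predicted by the model
    prob: float  # predicted probability of `pred`


def class_thresholds(rows: List[Prediction], percentile: float) -> Dict[int, float]:
    """Per predicted class, the nearest-rank percentile of misclassified probabilities."""
    probs: Dict[int, List[float]] = defaultdict(list)
    for row in rows:
        if row.pred != row.target:
            probs[row.pred].append(row.prob)
    thresholds = {}
    for cls, values in probs.items():
        values.sort()  # ascending order, as described above
        rank = max(1, math.ceil(percentile / 100 * len(values)))
        thresholds[cls] = values[rank - 1]
    return thresholds


def relabel_miss_labels(rows: List[Prediction], percentile: float = 90.0) -> List[Prediction]:
    """Replace the target with the prediction when its probability reaches the class threshold."""
    thresholds = class_thresholds(rows, percentile)
    fixed = []
    for row in rows:
        if row.pred != row.target and row.prob >= thresholds[row.pred]:
            row = row._replace(target=row.pred)
        fixed.append(row)
    return fixed
```

Raising `percentile` relabels fewer, more confident disagreements; lowering it relabels more of them at a higher risk of overwriting correct labels.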

🎛️ Data Controll Center

1. How to Run

  1. Set the FILE_PATH variable in main.py.
  2. Install the dependencies with the following commands.
pip install streamlit
pip install matplotlib
apt-get install fonts-nanum*
  3. Start the app with the following command.
streamlit run main.py --server.port PORT_NUMBER

2. Troubleshooting

a. The font is installed but cannot be found.

  • On the upstage server, run the following command and then start the app again.
rm -rf /opt/ml/.cache/matplotlib

b. The app does not start at step 3 of "1. How to Run".

  • Try starting it with the following command instead.
streamlit run main.py --server.port PORT_NUMBER --server.fileWatcherType none

3. Function

Easy Miss Label Filtering

Miss Label Filtering

  • Adjust the Class Percentile Value and watch how the Miss Labels change!
  • Exclude OOD (Out of Distribution) samples and easily download the data with Miss Labels filtered!

This feature requires the predicted probabilities for the data on which Miss Label Filtering will be run.
‼️ Be sure to run Training.ipynb all the way to the end.

👑 Leaderboard

         f1      accuracy  Rank
Public   0.8815  0.8792    7
Private  0.8650  0.8682    4

About

level2_nlp_datacentric-nlp-04 created by GitHub Classroom
