forked from insightcampus/sesac-nlp
18 Practice - Statistics-Based NLP - Keyword Extraction (TF-IDF)
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"18 Practice - Statistics-Based NLP - Keyword Extraction (TF-IDF)","provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"q1Q9mWDrgMUn"},"source":["# Keyword Extraction"]},{"cell_type":"markdown","metadata":{"id":"PSZIeetL8Yym"},"source":["## 0 Data Preparation"]},{"cell_type":"markdown","metadata":{"id":"hpTxmnbx8v6m"},"source":["### Install Mecab (if needed)"]},{"cell_type":"code","metadata":{"id":"qsWnZkd78qCA"},"source":["!sudo apt-get install g++ openjdk-8-jdk # Install Java 1.8+\n","!sudo apt-get install python3-dev; pip3 install konlpy # Python 3.x\n","!sudo apt-get install curl\n","!bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"fdsHZk1UBtRP"},"source":["import requests\n","from bs4 import BeautifulSoup\n","\n","def get_news_by_url(url):\n"," # Fetch a Naver news article and return its body text as one string.\n"," h = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}\n"," res = requests.get(url, headers=h)\n"," bs = BeautifulSoup(res.content, 'html.parser')\n","\n"," title = bs.select('h3#articleTitle')[0].text # title (extracted but not returned)\n"," content = bs.select('#articleBodyContents')[0].get_text().replace('\\n', \" \") # body\n"," # Remove the inline script text embedded in the body; the Korean literal must match the page verbatim.\n"," content = content.replace(\"// flash 오류를 우회하기 위한 함수 추가 function _flash_removeCallback() {}\", \"\")\n"," return content.strip()\n","\n","docs = []\n","docs.append( get_news_by_url('https://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=105&oid=018&aid=0004430108') )\n","docs.append( get_news_by_url('https://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=101&oid=001&aid=0011614790') )\n","docs.append( get_news_by_url('https://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=102&oid=014&aid=0004424362') )\n","docs.append( get_news_by_url('https://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=101&oid=119&aid=0002402191') )\n","docs.append( get_news_by_url('https://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=101&oid=030&aid=0002882728') )\n","len(docs)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"jxsBfWirA9ao"},"source":["## Keyword Extraction with TF-IDF"]},{"cell_type":"markdown","metadata":{"id":"u8UYUNHDNMVp"},"source":["### Exercise 1. Using sklearn\n"]},{"cell_type":"code","metadata":{"id":"u_YtvNebfpwn"},"source":["docs"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Of0dQ_ALpLw6"},"source":["#### 1) Preprocessing"]},{"cell_type":"code","metadata":{"id":"tfGX-_IxAbgt"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"JqZ9ONgLpOzv"},"source":["#### 2) Computing TF-IDF"]},{"cell_type":"code","metadata":{"id":"Okub07GQ-KJe"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"9oMnmGfp6PAb"},"source":[""]},{"cell_type":"code","metadata":{"id":"em3l3IS5kRP-"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"an6Cngd2Cwjg"},"source":["#### 3) Extracting Keywords"]},{"cell_type":"code","metadata":{"id":"yKrcZ9rh-5Rt"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"u_8wXmex-1gr"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"kUVTXBYrZMiU"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"lwQjPYuxNy4K"},"source":["\n","---\n"]},{"cell_type":"markdown","metadata":{"id":"XTNnFrNhOrA3"},"source":["### Exercise 2. Using gensim\n"]},{"cell_type":"markdown","metadata":{"id":"aE-YzDI6OrA5"},"source":["#### 1) Preprocessing"]},{"cell_type":"code","metadata":{"id":"edVMwQBuOrA6"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"cWGsCRJUOrA8"},"source":["#### 2) Computing TF-IDF"]},{"cell_type":"code","metadata":{"id":"NCUzeqp3OrA9"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"wBT2h57bOrBF"},"source":["#### 3) Extracting Keywords"]},{"cell_type":"code","metadata":{"id":"Xk0Tbo21RddA"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"_8T0QSKVRfUt"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"8gBGnkhCZSlw"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"hZpm1wtTqxL-"},"source":["\n","\n","---\n","\n"]}]}
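
The exercise cells in the notebook are left blank, so the following is only a rough, standard-library sketch of the three steps the headings name (preprocessing, TF-IDF computation, keyword extraction). It is not the notebook's intended sklearn or gensim solution. Several things here are assumptions for illustration: the function names `tokenize`, `tfidf`, and `top_keywords` are hypothetical, whitespace splitting stands in for a morphological analyzer such as Mecab, and `sample_docs` replaces the crawled `docs` list. The weighting uses raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which corresponds to the default settings of sklearn's `TfidfVectorizer`.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace splitting stands in for Mecab noun extraction.
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        scores = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        # L2-normalize so documents of different lengths are comparable.
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        weights.append({t: v / norm for t, v in scores.items()})
    return weights


def top_keywords(weights, k=3):
    # Highest weight first; ties are broken alphabetically.
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


# Hypothetical toy corpus used in place of the crawled news articles.
sample_docs = [
    "tf idf ranks words that are frequent in one document",
    "rare words across the corpus receive a higher idf weight",
    "keyword extraction picks the top weighted words per document",
]

doc_weights = tfidf(sample_docs)
for i, w in enumerate(doc_weights):
    print(i, top_keywords(w))
```

A term that occurs in every document (here `words`) gets the lowest idf, so it ranks below terms that are specific to a single document. With the real `docs` list, replacing `tokenize` with a Mecab-based noun extractor would be the natural next step.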