-
Notifications
You must be signed in to change notification settings - Fork 15
/
12 실습 - 표현(Representation) - 문서 표현 (TF-IDF)
1 lines (1 loc) · 13.8 KB
/
12 실습 - 표현(Representation) - 문서 표현 (TF-IDF)
1
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"12 실습 - 표현(Representation) - 문서 표현 (TF-IDF)","provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"zEFesPBvXe2C"},"source":["# 문서 표현 (Document Representation)"]},{"cell_type":"markdown","metadata":{"id":"TmnWDmSpBwBZ"},"source":["# 3 TF-IDF (Term Frequency-Inverse Document Frequency)"]},{"cell_type":"markdown","metadata":{"id":"NjKCX0atD4rM"},"source":["<img src=\"https://wikimedia.org/api/rest_v1/media/math/render/svg/10109d0e60cc9d50a1ea2f189bac0ac29a030a00\" />\n","\n","\n","\n","* TF(단어 빈도, Term Frequency) : 단어가 문서 내에 등장하는 빈도\n","* IDF(역문서 빈도, Inverse Document Frequency) : 단어가 여러 문서에 공통적으로 등장하는 빈도\n","* 한 문서 내에 자주 등장하고 다른 문서에 자주 등장하지 않는 단어를 주요 단어로 판별할 수 있음\n","\n","\n","https://en.wikipedia.org/wiki/Tf%E2%80%93idf"]},{"cell_type":"markdown","metadata":{"id":"-cwnRnyH5Ox6"},"source":["## 3.1 직접계산하기"]},{"cell_type":"markdown","metadata":{"id":"eBFqmuc-LevJ"},"source":["weighting schema|weight|설명\n","--|--|--\n","term frequency|<img src=\"https://wikimedia.org/api/rest_v1/media/math/render/svg/91699003abf4fe8bdf861bbce08e73e71acf5fd4\" />|=토큰빈도/문서내토큰빈도\n","inverse document frequency|<img src=\"https://wikimedia.org/api/rest_v1/media/math/render/svg/864fcfdc0c16344c11509f724f1aa7081cf9f657\" />|=log(총문서갯수/(토큰이 등장한 문서수))"]},{"cell_type":"code","metadata":{"id":"y56mwVir0L3a","executionInfo":{"status":"ok","timestamp":1636924536711,"user_tz":-540,"elapsed":308,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["d1 = \"The cat sat on my face I hate a cat\"\n","d2 = \"The dog sat on my bed I love a dog\" \n","doc_ls = [d1, d2]"],"execution_count":5,"outputs":[]},{"cell_type":"code","metadata":{"id":"eO1kEEmceE1P","executionInfo":{"status":"ok","timestamp":1636924722774,"user_tz":-540,"elapsed":339,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["import numpy as np\n","from collections import defaultdict\n","\n","def tf(t, d) :\n"," return d.count(t) / len(d)\n","\n","def idf (t, D) :\n"," N = len(D)\n"," n = len([True for d in D if t in d])\n"," return np.log(N/n)\n","\n","def tfidf (t,d,D) :\n"," return tf(t,d) * idf(t,D)\n","\n","def tokenizer(d) :\n"," return d.split()\n","\n","def tfidfScorer(D) :\n"," doc_ls = [tokenizer(d) for d in D]\n"," word2id = defaultdict(lambda:len(word2id))\n","\n"," [word2id[t] for d in doc_ls for t in d]\n","\n"," tfidf_mat = np.zeros((len(doc_ls), len(word2id)))\n"," for i, d in enumerate(doc_ls) :\n"," for t in d :\n"," tfidf_mat[i, word2id[t]] = tfidf(t, d, D)\n"," \n"," return tfidf_mat, word2id.keys()"],"execution_count":16,"outputs":[]},{"cell_type":"code","metadata":{"id":"dfAQm7ZgPYJa","executionInfo":{"status":"ok","timestamp":1636924767912,"user_tz":-540,"elapsed":395,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["mat, vocab = tfidfScorer(doc_ls)"],"execution_count":18,"outputs":[]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":111},"id":"Pa5we95FUTlI","executionInfo":{"status":"ok","timestamp":1636924781952,"user_tz":-540,"elapsed":403,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"03741e61-e5b4-4ba1-af92-39a729a697e5"},"source":["import pandas as pd\n","pd.DataFrame(mat, columns=vocab)"],"execution_count":20,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n"," .dataframe tbody tr th:only-of-type {\n"," vertical-align: middle;\n"," }\n","\n"," .dataframe tbody tr th {\n"," vertical-align: top;\n"," }\n","\n"," .dataframe thead th {\n"," text-align: right;\n"," }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n"," <thead>\n"," <tr style=\"text-align: right;\">\n"," <th></th>\n"," <th>The</th>\n"," <th>cat</th>\n"," <th>sat</th>\n"," <th>on</th>\n"," <th>my</th>\n"," <th>face</th>\n"," <th>I</th>\n"," <th>hate</th>\n"," <th>a</th>\n"," <th>dog</th>\n"," <th>bed</th>\n"," <th>love</th>\n"," </tr>\n"," </thead>\n"," <tbody>\n"," <tr>\n"," <th>0</th>\n"," <td>0.0</td>\n"," <td>0.138629</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.069315</td>\n"," <td>0.0</td>\n"," <td>0.069315</td>\n"," <td>0.0</td>\n"," <td>0.000000</td>\n"," <td>0.000000</td>\n"," <td>0.000000</td>\n"," </tr>\n"," <tr>\n"," <th>1</th>\n"," <td>0.0</td>\n"," <td>0.000000</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.000000</td>\n"," <td>0.0</td>\n"," <td>0.000000</td>\n"," <td>0.0</td>\n"," <td>0.138629</td>\n"," <td>0.069315</td>\n"," <td>0.069315</td>\n"," </tr>\n"," </tbody>\n","</table>\n","</div>"],"text/plain":[" The cat sat on my ... hate a dog bed love\n","0 0.0 0.138629 0.0 0.0 0.0 ... 0.069315 0.0 0.000000 0.000000 0.000000\n","1 0.0 0.000000 0.0 0.0 0.0 ... 0.000000 0.0 0.138629 0.069315 0.069315\n","\n","[2 rows x 12 columns]"]},"metadata":{},"execution_count":20}]},{"cell_type":"markdown","metadata":{"id":"huJ0-b2bKb_8"},"source":["## 3.2 sklearn 활용"]},{"cell_type":"code","metadata":{"id":"kBKXqmVF_URx"},"source":["d1 = \"The cat sat on my face I hate a cat\"\n","d2 = \"The dog sat on my bed I love a dog\" \n","docs = [d1, d2]"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"em3l3IS5kRP-","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636924864975,"user_tz":-540,"elapsed":332,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"52547d7f-3fb9-4d14-c7a5-dd0942c1b512"},"source":["from sklearn.feature_extraction.text import TfidfVectorizer\n","\n","tfidf_vect = TfidfVectorizer()\n","tfidf = tfidf_vect.fit_transform(docs)\n","tfidf.todense()"],"execution_count":21,"outputs":[{"output_type":"execute_result","data":{"text/plain":["matrix([[0. , 0.70600557, 0. , 0.35300279, 0.35300279,\n"," 0. , 0.25116439, 0.25116439, 0.25116439, 0.25116439],\n"," [0.35300279, 0. , 0.70600557, 0. , 0. ,\n"," 0.35300279, 0.25116439, 0.25116439, 0.25116439, 0.25116439]])"]},"metadata":{},"execution_count":21}]},{"cell_type":"code","metadata":{"id":"0wwnv6EdgROS","colab":{"base_uri":"https://localhost:8080/","height":111},"executionInfo":{"status":"ok","timestamp":1636924896445,"user_tz":-540,"elapsed":443,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"33dab9c5-7d22-4541-f204-459a921dae0f"},"source":["import pandas as pd\n","pd.DataFrame(tfidf.todense(), columns=tfidf_vect.get_feature_names())"],"execution_count":22,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n"," .dataframe tbody tr th:only-of-type {\n"," vertical-align: middle;\n"," }\n","\n"," .dataframe tbody tr th {\n"," vertical-align: top;\n"," }\n","\n"," .dataframe thead th {\n"," text-align: right;\n"," }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n"," <thead>\n"," <tr style=\"text-align: right;\">\n"," <th></th>\n"," <th>bed</th>\n"," <th>cat</th>\n"," <th>dog</th>\n"," <th>face</th>\n"," <th>hate</th>\n"," <th>love</th>\n"," <th>my</th>\n"," <th>on</th>\n"," <th>sat</th>\n"," <th>the</th>\n"," </tr>\n"," </thead>\n"," <tbody>\n"," <tr>\n"," <th>0</th>\n"," <td>0.000000</td>\n"," <td>0.706006</td>\n"," <td>0.000000</td>\n"," <td>0.353003</td>\n"," <td>0.353003</td>\n"," <td>0.000000</td>\n"," <td>0.251164</td>\n"," <td>0.251164</td>\n"," <td>0.251164</td>\n"," <td>0.251164</td>\n"," </tr>\n"," <tr>\n"," <th>1</th>\n"," <td>0.353003</td>\n"," <td>0.000000</td>\n"," <td>0.706006</td>\n"," <td>0.000000</td>\n"," <td>0.000000</td>\n"," <td>0.353003</td>\n"," <td>0.251164</td>\n"," <td>0.251164</td>\n"," <td>0.251164</td>\n"," <td>0.251164</td>\n"," </tr>\n"," </tbody>\n","</table>\n","</div>"],"text/plain":[" bed cat dog ... on sat the\n","0 0.000000 0.706006 0.000000 ... 0.251164 0.251164 0.251164\n","1 0.353003 0.000000 0.706006 ... 0.251164 0.251164 0.251164\n","\n","[2 rows x 10 columns]"]},"metadata":{},"execution_count":22}]},{"cell_type":"markdown","metadata":{"id":"uucW3BFQMUAg"},"source":["\n","\n","---\n","\n"]},{"cell_type":"markdown","metadata":{"id":"V0I3IeVIzrzO"},"source":["## 3.3 gensim 활용"]},{"cell_type":"code","metadata":{"id":"kDggrSH5_cHp"},"source":["d1 = \"The cat sat on my face I hate a cat\"\n","d2 = \"The dog sat on my bed I love a dog\" \n","docs = [d1, d2]"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"aIpVmOr0_PKh","executionInfo":{"status":"ok","timestamp":1636925059990,"user_tz":-540,"elapsed":300,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["import gensim\n","from gensim import corpora\n","from gensim.models import TfidfModel\n","\n","doc_ls = [d.split() for d in docs]\n","id2word = corpora.Dictionary(doc_ls)\n","TDM = [id2word.doc2bow(d) for d in doc_ls]\n","model = TfidfModel(TDM)\n"],"execution_count":23,"outputs":[]},{"cell_type":"code","metadata":{"id":"lFlRXagh_PKj","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1636925076416,"user_tz":-540,"elapsed":313,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"e6a8063f-e3e6-467e-c3d8-67fbba65dfcd"},"source":["model[TDM][0]"],"execution_count":25,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[(3, 0.8164965809277261), (4, 0.4082482904638631), (5, 0.4082482904638631)]"]},"metadata":{},"execution_count":25}]},{"cell_type":"code","metadata":{"id":"cLQTM2eNVor0","executionInfo":{"status":"ok","timestamp":1636925255788,"user_tz":-540,"elapsed":405,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}}},"source":["from gensim.matutils import sparse2full\n","\n","TDM_matrix = [ sparse2full(d, len(id2word)).tolist() for d in model[TDM]]"],"execution_count":30,"outputs":[]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":111},"id":"GH01l_msV-HG","executionInfo":{"status":"ok","timestamp":1636925285082,"user_tz":-540,"elapsed":515,"user":{"displayName":"이민호","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiFPPatrtQJJCEfMd6D3DoTVRog9gVm7Ovj5Lex=s64","userId":"15829449822908558555"}},"outputId":"e66f9b13-9287-4158-dd6d-fa3eb088cb7a"},"source":["import pandas as pd\n","pd.DataFrame(TDM_matrix, columns=id2word.values())"],"execution_count":32,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n"," .dataframe tbody tr th:only-of-type {\n"," vertical-align: middle;\n"," }\n","\n"," .dataframe tbody tr th {\n"," vertical-align: top;\n"," }\n","\n"," .dataframe thead th {\n"," text-align: right;\n"," }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n"," <thead>\n"," <tr style=\"text-align: right;\">\n"," <th></th>\n"," <th>I</th>\n"," <th>The</th>\n"," <th>a</th>\n"," <th>cat</th>\n"," <th>face</th>\n"," <th>hate</th>\n"," <th>my</th>\n"," <th>on</th>\n"," <th>sat</th>\n"," <th>bed</th>\n"," <th>dog</th>\n"," <th>love</th>\n"," </tr>\n"," </thead>\n"," <tbody>\n"," <tr>\n"," <th>0</th>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.816497</td>\n"," <td>0.408248</td>\n"," <td>0.408248</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.000000</td>\n"," <td>0.000000</td>\n"," <td>0.000000</td>\n"," </tr>\n"," <tr>\n"," <th>1</th>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.000000</td>\n"," <td>0.000000</td>\n"," <td>0.000000</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.0</td>\n"," <td>0.408248</td>\n"," <td>0.816497</td>\n"," <td>0.408248</td>\n"," </tr>\n"," </tbody>\n","</table>\n","</div>"],"text/plain":[" I The a cat face ... on sat bed dog love\n","0 0.0 0.0 0.0 0.816497 0.408248 ... 0.0 0.0 0.000000 0.000000 0.000000\n","1 0.0 0.0 0.0 0.000000 0.000000 ... 0.0 0.0 0.408248 0.816497 0.408248\n","\n","[2 rows x 12 columns]"]},"metadata":{},"execution_count":32}]},{"cell_type":"markdown","metadata":{"id":"v8WBguvSzulc"},"source":["---"]}]}