<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Hexo</title>
<link href="https://www.cobaltyang.me/atom.xml" rel="self"/>
<link href="https://www.cobaltyang.me/"/>
<updated>2023-03-02T03:25:58.406Z</updated>
<id>https://www.cobaltyang.me/</id>
<author>
<name>John Doe</name>
</author>
<generator uri="https://hexo.io/">Hexo</generator>
<entry>
<title>Begin and reasons</title>
<link href="https://www.cobaltyang.me/2023/03/01/begin/"/>
<id>https://www.cobaltyang.me/2023/03/01/begin/</id>
<published>2023-03-01T07:26:43.000Z</published>
<updated>2023-03-02T03:25:58.406Z</updated>
<content type="html"><![CDATA[<h2 id="Here-I-want-to-talk-about-the-birth-of-this-personal-blog"><a href="#Here-I-want-to-talk-about-the-birth-of-this-personal-blog" class="headerlink" title="Here, I want to talk about the birth of this personal blog."></a>Here, I want to talk about the birth of this personal blog.</h2>]]></content>
<summary type="html">&lt;h2 id="Here-I-want-to-talk-about-the-birth-of-this-personal-blog"&gt;&lt;a href="#Here-I-want-to-talk-about-the-birth-of-this-personal-blog" clas</summary>
<category term="Daily Life" scheme="https://www.cobaltyang.me/categories/Daily-Life/"/>
</entry>
<entry>
<title>A Case Study of Hidden Markov Models in Python</title>
<link href="https://www.cobaltyang.me/2023/02/28/markov/"/>
<id>https://www.cobaltyang.me/2023/02/28/markov/</id>
<published>2023-02-28T13:18:20.000Z</published>
<updated>2023-03-01T08:11:26.049Z</updated>
<content type="html"><![CDATA[<h2 id="问题"><a href="#问题" class="headerlink" title="问题:"></a>The question</h2><p>What is a Markov model, and what is it used for?<br>This Jianshu article is a good starting point:</p><p>&lt;An Introduction to Hidden Markov Models: Principles Only (No Math, Guaranteed)&gt;<br><a class="link" href="https://www.jianshu.com/p/3b4fbcea2744" >https://www.jianshu.com/p/3b4fbcea2744 <i class="fa-regular fa-arrow-up-right-from-square fa-sm"></i></a></p><h2 id="python-实现"><a href="#python-实现" class="headerlink" title="python 实现"></a>Python implementation</h2><p>A hidden Markov model (HMM) has three parameters: the initial state distribution, the state transition matrix, and the emission matrix. Two main problems arise:<br><strong>Given the three parameters and an observation sequence, infer the sequence of hidden states.</strong><br><strong>Given only data, with all parameters unknown, estimate the three parameters.</strong><br>Both are handled here with the hmmlearn package.</p><h3 id="导入需要的库"><a href="#导入需要的库" class="headerlink" title="导入需要的库"></a>Import the required libraries</h3><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> random</span><br><span class="line"><span class="keyword">import</span> datetime <span class="comment"># optional; only used to time model training</span></span><br><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="keyword">from</span> hmmlearn <span class="keyword">import</span> hmm</span><br></pre></td></tr></table></figure></div><h3 id="已知参数和当前的观测序列,求解隐含状态的序列"><a href="#已知参数和当前的观测序列,求解隐含状态的序列" class="headerlink" title="已知参数和当前的观测序列,求解隐含状态的序列"></a>Known parameters and an observation sequence: inferring the hidden state sequence</h3><p>Hidden states:</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">hidden_states = [<span class="string">"A"</span>, <span class="string">"B"</span> ]</span><br><span class="line">n_hidden_states = <span 
class="built_in">len</span>(hidden_states)</span><br></pre></td></tr></table></figure></div><p>Observable values (die faces):</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">observations = [<span class="string">"1"</span>, <span class="string">"2"</span>, <span class="string">"3"</span>, <span class="string">"4"</span>, <span class="string">"5"</span>, <span class="string">"6"</span>]</span><br><span class="line">n_observations = <span class="built_in">len</span>(observations)</span><br></pre></td></tr></table></figure></div><p>Initial state probabilities:</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">start_probability = np.array([<span class="number">0.8</span>, <span class="number">0.2</span>,])</span><br></pre></td></tr></table></figure></div><p>State transition matrix:</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">transition_probability = np.array([[<span class="number">0.2</span>, <span class="number">0.8</span>], </span><br><span class="line"> [<span class="number">0.8</span>, <span class="number">0.2</span>]]</span><br><span class="line"> )</span><br></pre></td></tr></table></figure></div><p>Emission matrix:</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">emission_probability = np.array([ [<span class="number">1</span>/<span 
class="number">6</span>, <span class="number">1</span>/<span class="number">6</span>, <span class="number">1</span>/<span class="number">6</span>, <span class="number">1</span>/<span class="number">6</span>, <span class="number">1</span>/<span class="number">6</span>, <span class="number">1</span>/<span class="number">6</span>,], </span><br><span class="line"> [<span class="number">0.25</span>, <span class="number">0.25</span>, <span class="number">0.25</span>, <span class="number">0.25</span>, <span class="number">0</span>, <span class="number">0</span>] ]</span><br><span class="line"> )</span><br></pre></td></tr></table></figure></div><p>Build the model:</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">model = hmm.MultinomialHMM(n_components=n_hidden_states)</span><br></pre></td></tr></table></figure></div><p>Assign the parameters defined above to the model:</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">model.startprob_=start_probability</span><br><span class="line">model.transmat_=transition_probability</span><br><span class="line">model.emissionprob_=emission_probability</span><br></pre></td></tr></table></figure></div><p>Feed in the observed sequences:</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">observations_sequence = np.array([[<span class="number">0</span>, <span class="number">1</span>, <span class="number">2</span>,<span class="number">1</span>, <span class="number">3</span>, <span class="number">3</span>, <span class="number">4</span>, <span class="number">0</span>, <span class="number">5</span>, <span class="number">3</span>]]).T</span><br><span class="line"><span class="built_in">print</span>(<span class="string">"Observed die faces:"</span>, <span class="string">", "</span>.join(<span class="built_in">map</span>(<span class="keyword">lambda</span> x: observations[<span class="built_in">int</span>(x)], observations_sequence)))</span><br><span class="line">logprob, box = model.decode(observations_sequence, algorithm=<span class="string">"viterbi"</span>)</span><br><span class="line"><span class="built_in">print</span>(<span class="string">"decode result:"</span>, <span class="string">", "</span>.join(<span class="built_in">map</span>(<span class="keyword">lambda</span> x: hidden_states[x], box)))</span><br><span class="line">box = model.predict(observations_sequence)</span><br><span class="line"><span class="built_in">print</span>(<span class="string">"predict result:"</span>, <span class="string">", "</span>.join(<span class="built_in">map</span>(<span class="keyword">lambda</span> x: hidden_states[<span class="built_in">int</span>(x)], box)), <span class="string">'\n'</span>)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">observations_sequence = np.array([[<span class="number">2</span>, <span class="number">1</span>, <span class="number">1</span>,<span class="number">0</span>, <span class="number">2</span>, <span class="number">3</span>, <span 
class="number">0</span>, <span class="number">1</span>, <span class="number">1</span>, <span class="number">2</span>, <span class="number">2</span>,<span class="number">1</span>,<span class="number">0</span>,<span class="number">0</span>,<span class="number">3</span>,<span class="number">3</span>]]).T</span><br><span class="line"><span class="built_in">print</span>(<span class="string">"Die faces:"</span>, <span class="string">", "</span>.join(<span class="built_in">map</span>(<span class="keyword">lambda</span> x: observations[<span class="built_in">int</span>(x)], observations_sequence)))</span><br><span class="line">box = model.predict(observations_sequence)</span><br><span class="line"><span class="built_in">print</span>(<span class="string">"Drawn from box:"</span>, <span class="string">", "</span>.join(<span class="built_in">map</span>(<span class="keyword">lambda</span> x: hidden_states[<span class="built_in">int</span>(x)], box)),<span class="string">'\n'</span>)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">observations_sequence = np.array([[<span class="number">5</span>, <span class="number">4</span>, <span class="number">3</span>, <span class="number">2</span>, <span class="number">2</span>, <span class="number">3</span>, <span class="number">0</span>, <span class="number">4</span>, <span class="number">5</span>, <span class="number">4</span>, <span class="number">1</span>,<span class="number">1</span>,<span class="number">0</span>,<span class="number">0</span>,<span class="number">3</span>,<span class="number">3</span>]]).T</span><br><span class="line"><span class="built_in">print</span>(<span class="string">"Die faces:"</span>, <span class="string">", "</span>.join(<span class="built_in">map</span>(<span class="keyword">lambda</span> x: observations[<span class="built_in">int</span>(x)], observations_sequence)))</span><br><span class="line">box = model.predict(observations_sequence)</span><br><span class="line"><span 
class="built_in">print</span>(<span class="string">"Drawn from box:"</span>, <span class="string">", "</span>.join(<span class="built_in">map</span>(<span class="keyword">lambda</span> x: hidden_states[<span class="built_in">int</span>(x)], box)))</span><br></pre></td></tr></table></figure></div><p>The output is:<br>Observed die faces: 1, 2, 3, 2, 4, 4, 5, 1, 6, 4<br>decode result: A, B, A, B, A, B, A, B, A, B<br>predict result: A, B, A, B, A, B, A, B, A, B</p><p>Die faces: 3, 2, 2, 1, 3, 4, 1, 2, 2, 3, 3, 2, 1, 1, 4, 4<br>Drawn from box: A, B, A, B, A, B, A, B, A, B, A, B, A, B, A, B</p><p>Die faces: 6, 5, 4, 3, 3, 4, 1, 5, 6, 5, 2, 2, 1, 1, 4, 4<br>Drawn from box: A, A, B, A, B, A, B, A, A, A, B, A, B, A, B, A</p><h3 id="所有参数未知,只有数据,如何获得三个参数"><a href="#所有参数未知,只有数据,如何获得三个参数" class="headerlink" title="所有参数未知,只有数据,如何获得三个参数"></a>All parameters unknown, only data: estimating the three parameters</h3><p>Approach: first assume an initial probability distribution, a transition matrix, and an emission matrix, and generate a large amount of data from them.<br>Then fit an HMM to that data and check whether the learned parameters agree with the assumed ones.<br>Generate a series of sequences according to the assumed rules:</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Initial-state threshold: random() below this value gives A, otherwise B</span></span><br><span class="line">a_b_init_critia = <span class="number">0.2</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># state_change</span></span><br><span class="line">state_change = {<span class="string">"A"</span>: <span class="number">0.3</span>, <span class="comment"># current state A: next state is A if random() is below this value, otherwise B</span></span><br><span class="line"> <span class="string">"B"</span>: <span class="number">0.6</span> <span class="comment"># current state B: next state is A if random() is below this value, otherwise B</span></span><br><span class="line"> }</span><br><span class="line"><span class="comment"># possible die faces</span></span><br><span class="line">observations = [<span class="string">"1"</span>, <span class="string">"2"</span>, <span class="string">"3"</span>, <span class="string">"4"</span>, <span class="string">"5"</span>, <span class="string">"6"</span>]</span><br><span class="line"><span class="comment"># indices of the faces each die can show</span></span><br><span class="line">point={<span class="string">"A"</span>: [<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>,<span class="number">4</span>,<span class="number">5</span>], <span class="string">"B"</span>: [<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>]}</span><br><span class="line"></span><br><span class="line">data_size = <span class="number">10000</span></span><br><span class="line">whole_data = []</span><br><span class="line">lengths = []</span><br><span class="line"><span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(data_size):</span><br><span class="line">    dice = <span class="string">"A"</span> <span class="keyword">if</span> random.random() &lt; a_b_init_critia <span class="keyword">else</span> <span class="string">"B"</span></span><br><span class="line">    data = []</span><br><span class="line">    sequence_length = 
random.randint(<span class="number">2</span>, <span class="number">25</span>)</span><br><span class="line">    <span class="keyword">for</span> _ <span class="keyword">in</span> <span class="built_in">range</span>(sequence_length):</span><br><span class="line"><span class="comment"># print(dice, end=" ")</span></span><br><span class="line">        data.append([random.sample(point[dice], <span class="number">1</span>)[<span class="number">0</span>]])</span><br><span class="line">        dice = <span class="string">"A"</span> <span class="keyword">if</span> random.random() &lt; state_change[dice] <span class="keyword">else</span> <span class="string">"B"</span></span><br><span class="line"><span class="comment"># print(f"rolled {sequence_length} times \nface indices {data} \n")</span></span><br><span class="line">    whole_data.append(data)</span><br><span class="line">    lengths.append(sequence_length)</span><br><span class="line">whole_data = np.concatenate(whole_data)</span><br></pre></td></tr></table></figure></div><p>Feed the data to the model and train it:</p><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">print</span>(<span class="string">f"Training started <span class="subst">{datetime.datetime.now()}</span> with <span class="subst">{<span class="built_in">len</span>(lengths)}</span> sequences"</span>)</span><br><span class="line">hmm_model = 
hmm.MultinomialHMM(n_components=<span class="built_in">len</span>(point),</span><br><span class="line"> n_iter=<span class="number">100000</span>, <span class="comment"># Maximum number of iterations to perform.</span></span><br><span class="line"> tol=<span class="number">0.000001</span>, <span class="comment"># Convergence threshold. EM will stop if the gain in log-likelihood is below this value.</span></span><br><span class="line"> verbose = <span class="literal">False</span>, <span class="comment"># When True per-iteration convergence reports are printed to sys.stderr. </span></span><br><span class="line"> )</span><br><span class="line">hmm_model.fit(whole_data, lengths)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Inspect the learned parameters after training</span></span><br><span class="line"><span class="built_in">print</span>(<span class="string">f"Training finished <span class="subst">{datetime.datetime.now()}</span>"</span>)</span><br><span class="line"><span class="built_in">print</span>(<span class="string">'Training is unsupervised, so the model does not fix the order of A and B; the three parameters are permuted consistently, so the order does not matter'</span>)</span><br><span class="line"><span class="built_in">print</span>(<span class="string">'Initial probabilities'</span>)</span><br><span class="line"><span class="built_in">print</span>(hmm_model.startprob_,<span class="string">'\n'</span>)</span><br><span class="line"><span class="built_in">print</span>(<span class="string">'State transition matrix'</span>)</span><br><span class="line"><span class="built_in">print</span>(hmm_model.transmat_,<span class="string">'\n'</span>)</span><br><span class="line"><span class="built_in">print</span>(<span class="string">'Emission matrix (hidden state to observed value)'</span>)</span><br><span class="line"><span class="built_in">print</span>(hmm_model.emissionprob_,<span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure></div><p>The output is:<br>Training started 2020-01-05 10:03:02.832792 with 10000 sequences<br>Training finished 2020-01-05 13:05:53.612298<br>Training is unsupervised, so the model does not fix the order of A and B; the three parameters are permuted consistently, so the order does not matter<br>Initial probabilities<br>[0.20509604 
0.79490396]</p><p>State transition matrix<br>[[0.31460223 0.68539777]<br>[0.6213235 0.3786765 ]]</p><p>Emission matrix (hidden state to observed value)<br>[[1.67834277e-01 1.74886284e-01 1.69078215e-01 1.68723388e-01<br>1.61611529e-01 1.57866306e-01]<br>[2.51185996e-01 2.46793569e-01 2.46239587e-01 2.53539909e-01<br>1.54840968e-06 2.23939182e-03]]</p><p>The model recovers the parameters well, although training took about three hours.<br>The learned values are close to the ones used to generate the data: the initial probability of A is 0.205 against an assumed 0.2, and the probabilities of moving to A are 0.315 and 0.621 against 0.3 and 0.6. The main reason is that the data is clean and free of noise; with noisy data the fit would probably be worse.</p><h1 id="ref"><a href="#ref" class="headerlink" title="ref"></a>Reference</h1><p><a class="link" href="https://links.jianshu.com/go?to=https://hmmlearn.readthedocs.io/en/latest/api.html" >https://hmmlearn.readthedocs.io/en/latest/api.html <i class="fa-regular fa-arrow-up-right-from-square fa-sm"></i></a></p>]]></content>
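<summary type="html">&lt;h2 id="问题"&gt;&lt;a href="#问题" class="headerlink" title="问题:"&gt;&lt;/a&gt;The question&lt;/h2&gt;&lt;p&gt;What is a Markov model, and what is it used for?&lt;br&gt;This Jianshu article is a good starting point:&lt;/p&gt;
&lt;p&gt;&amp;lt;An Introduction to Hidden Markov Models: Principles Only (No Math, Guaranteed)&amp;</summary>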
<category term="Programing" scheme="https://www.cobaltyang.me/categories/Programing/"/>
<category term="Python" scheme="https://www.cobaltyang.me/tags/Python/"/>
<category term="Machine Learning" scheme="https://www.cobaltyang.me/tags/Machine-Learning/"/>
</entry>
<entry>
<title>TensorFlow2: Reading and Writing MATLAB Variables &amp; Tensor Operations</title>
<link href="https://www.cobaltyang.me/2023/02/28/matlab/"/>
<id>https://www.cobaltyang.me/2023/02/28/matlab/</id>
<published>2023-02-28T13:18:20.000Z</published>
<updated>2023-03-01T08:35:45.911Z</updated>
<content type="html"><![CDATA[<blockquote><p>In research, MATLAB is strong at matrix processing, while Python supports the popular neural-network frameworks such as TensorFlow and PyTorch, so moving data between the two is often unavoidable. This post shares a simple approach: exchange data through .mat files, a format that both MATLAB and Python can read and write.</p></blockquote><h4 id="一、读写matlab变量"><a href="#一、读写matlab变量" class="headerlink" title="一、读写matlab变量"></a>1. Reading and writing MATLAB variables</h4><ul><li>Reading and writing .mat files in Python</li></ul><div class="highlight-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> tensorflow <span class="keyword">as</span> tf</span><br><span class="line"><span class="keyword">import</span> scipy.io <span class="keyword">as</span> io</span><br><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"></span><br><span class="line"><span class="comment"># Load the .mat file into the dictionary InputData</span></span><br><span class="line">InputData = io.loadmat(<span class="string">'MatlabData.mat'</span>)</span><br><span class="line"><span class="comment"># Read variables from the dictionary</span></span><br><span class="line">A = tf.constant(InputData[<span class="string">'A'</span>])</span><br><span class="line">A = tf.cast(A, dtype = tf.float32)</span><br><span class="line">B = tf.constant(InputData[<span class="string">'B'</span>]) </span><br><span class="line"><span class="comment"># ......</span></span><br><span class="line">A = A.numpy()</span><br><span class="line">B = B.numpy()</span><br><span class="line"><span 
class="comment"># Save A as X and B as Y, writing both to a .mat file</span></span><br><span class="line">io.savemat(<span class="string">'PythonData.mat'</span>,{<span class="string">'X'</span>:A, <span class="string">'Y'</span>:B})</span><br></pre></td></tr></table></figure></div><ul><li>Reading and writing .mat files in MATLAB</li></ul><div class="highlight-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">load PythonData</span><br><span class="line"></span><br><span class="line">save MatlabData <span class="comment">% save all workspace variables</span></span><br><span class="line">save MatlabData A B; <span class="comment">% save only the variables A and B</span></span><br><span class="line">save (<span class="string">'MatlabData.mat'</span>, <span class="string">'A'</span>,<span class="string">'B'</span>);</span><br></pre></td></tr></table></figure></div><ul><li><a class="link" href="https://links.jianshu.com/go?to=https://www.cnblogs.com/mppp/p/12067442.html" >Reading and writing .mat files in C++ takes more work <i class="fa-regular fa-arrow-up-right-from-square fa-sm"></i></a></li></ul><h4 id="二、张量的理解"><a href="#二、张量的理解" class="headerlink" title="二、张量的理解"></a>2. Understanding tensors</h4><h5 id="2-1-初始化张量"><a href="#2-1-初始化张量" class="headerlink" title="2.1 初始化张量"></a>2.1 Initializing tensors</h5><div class="highlight-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Define a random number (a scalar)</span></span><br><span class="line">random_float = tf.random.uniform(shape=())</span><br><span class="line"><span class="comment"># Define a zero vector with 2 elements</span></span><br><span 
class="line">zero_vector = tf.zeros(shape=(2))</span><br><span class="line"><span class="comment"># Define a 2×2 constant matrix</span></span><br><span class="line">A = tf.constant([[1., 2.], [3., 4.]])</span><br></pre></td></tr></table></figure></div><h5 id="2-2-输出张量特征"><a href="#2-2-输出张量特征" class="headerlink" title="2.2 输出张量特征"></a>2.2 Inspecting tensor properties</h5><div class="highlight-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Inspect the shape, dtype, and value of matrix A</span></span><br><span class="line"><span class="built_in">print</span>(A.shape) <span class="comment"># prints (2, 2): the matrix has 2 rows and 2 columns</span></span><br><span class="line"><span class="built_in">print</span>(A.dtype) <span class="comment"># prints &lt;dtype: 'float32'&gt;</span></span><br><span class="line"><span class="built_in">print</span>(A.numpy()) <span class="comment"># prints [[1. 2.]</span></span><br><span class="line"> <span class="comment"># [3. 
4.]]</span></span><br></pre></td></tr></table></figure></div><h5 id="2-3-基本张量操作"><a href="#2-3-基本张量操作" class="headerlink" title="2.3 基本张量操作"></a>2.3 Basic tensor operations</h5><div class="highlight-container" data-rel="Csharp"><figure class="iseeu highlight csharp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">C = tf.<span class="keyword">add</span>(A, B) <span class="meta"># sum of matrices A and B</span></span><br><span class="line">D = tf.matmul(A, B) <span class="meta"># matrix product of A and B</span></span><br><span class="line">E = tf.multiply(A, <span class="number">1</span>) <span class="meta"># multiply every element of A by the scalar 1</span></span><br><span class="line">tf.square(x) <span class="meta"># square each element of tensor x; returns a tensor of the same shape</span></span><br><span class="line">tf.math.log(x) <span class="meta"># natural logarithm of each element of x; same shape as above</span></span><br><span class="line"></span><br><span class="line">tf.reduce_sum(x) <span class="meta"># sum over all elements; returns a 0-dimensional tensor (scalar)</span></span><br><span class="line"><span class="meta"># tf.reduce_sum(x,0) # column sums (sum along axis 0)</span></span><br><span class="line"><span class="meta"># tf.reduce_sum(x,1) # row sums (sum along axis 1)</span></span><br><span class="line">tf.reduce_mean(x) <span class="meta"># used the same way as tf.reduce_sum and tf.reduce_max</span></span><br></pre></td></tr></table></figure></div><h6 id="参考链接"><a href="#参考链接" class="headerlink" title="参考链接"></a>References</h6><ol><li><a class="link" href="https://links.jianshu.com/go?to=https://tf.wiki/zh/basic/basic.html" >A Concise Handbook of TensorFlow 2 (简单粗暴TensorFlow2.0) <i class="fa-regular fa-arrow-up-right-from-square fa-sm"></i></a></li><li><a class="link" href="https://links.jianshu.com/go?to=https://tensorflow.google.cn/tutorials/customization/basics" >TensorFlow2.0 tutorials <i class="fa-regular 
fa-arrow-up-right-from-square fa-sm"></i></a></li><li><a class="link" href="https://links.jianshu.com/go?to=https://www.cnblogs.com/chenhuabin/p/11594239.html" >Mathematical operations on tensors <i class="fa-regular fa-arrow-up-right-from-square fa-sm"></i></a></li></ol>]]></content>
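<summary type="html">&lt;blockquote&gt;
&lt;p&gt;In research, MATLAB is strong at matrix processing, while Python supports the popular neural-network frameworks such as TensorFlow and PyTorch, so moving data between the two is often unavoidable. This post shares a simple approach: exchange data through .mat files, since the .mat format</summary>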
<category term="Programing" scheme="https://www.cobaltyang.me/categories/Programing/"/>
<category term="Python" scheme="https://www.cobaltyang.me/tags/Python/"/>
<category term="Machine Learning" scheme="https://www.cobaltyang.me/tags/Machine-Learning/"/>
<category term="Matlab" scheme="https://www.cobaltyang.me/tags/Matlab/"/>
<category term="Tensorflow" scheme="https://www.cobaltyang.me/tags/Tensorflow/"/>
</entry>
<entry>
<title>Ranking Email in R to Build a Smart Inbox</title>
<link href="https://www.cobaltyang.me/2023/02/28/third/"/>
<id>https://www.cobaltyang.me/2023/02/28/third/</id>
<published>2023-02-28T13:18:20.000Z</published>
<updated>2023-03-01T08:36:30.097Z</updated>
<content type="html"><![CDATA[<h1 id="1、数据准备"><a href="#1、数据准备" class="headerlink" title="1、数据准备"></a>1. Data preparation</h1><p>Data download: <a class="link" href="https://links.jianshu.com/go?to=https://spamassassin.apache.org/old/publiccorpus/" >https://spamassassin.apache.org/old/publiccorpus/ <i class="fa-regular fa-arrow-up-right-from-square fa-sm"></i></a></p><p>We follow Google's Gmail service, which groups email features into social, content, thread, and label features. Our data has no detailed timestamps of user actions, so we cannot tell when or how a user responded to a message. We can, however, measure the volume of mail received, and we assume this one-way measure is a reasonable proxy for the social features in the data.<br><strong>Social features.</strong> Importance is judged from the time between messages on the same subject. The natural approach is to measure how long the recipient waits before acting on a message: for a given feature set, the shorter this average time, the more important the message is within its type.<br><strong>Thread features.</strong> Match thread markers such as "RE:"; a message in an active thread is more important than one in an inactive thread.<br><strong>Content features.</strong> Extract terms from message bodies; a new message that contains more of these feature terms is more important.<br><strong>Label features.</strong> Not considered for now.</p><p>We only need the legitimate (ham) messages. All messages are sorted by time and then split into a training set and a test set: the first part trains the ranking algorithm, and the second evaluates the model.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> library<span class="punctuation">(</span>pacman<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> p_load<span class="punctuation">(</span>chinese.misc<span class="punctuation">,</span>stringr<span class="punctuation">,</span>dplyr<span class="punctuation">,</span>ggplot2<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> easy.ham.files <span class="operator"><-</span> dir_or_file<span class="punctuation">(</span><span 
class="string">"./easy_ham"</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> easy.ham2.files <span class="operator"><-</span> dir_or_file<span class="punctuation">(</span><span class="string">"./easy_ham_2"</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> hard.ham.files <span class="operator"><-</span> dir_or_file<span class="punctuation">(</span><span class="string">"./hard_ham"</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> hard.ham2.files <span class="operator"><-</span> dir_or_file<span class="punctuation">(</span><span class="string">"./hard_ham_2"</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> emails <span class="operator"><-</span> <span class="built_in">c</span><span class="punctuation">(</span>easy.ham.files<span class="punctuation">,</span>easy.ham2.files<span class="punctuation">,</span></span><br><span class="line"><span class="operator">+</span> hard.ham.files<span class="punctuation">,</span>hard.ham2.files<span class="punctuation">)</span> <span class="operator">%>%</span> unique<span class="punctuation">(</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><p>Message fields used:<br>From: who sent this message? The volume of mail from that sender serves as the social feature.<br>Date: when was the message received? Used as the time measure.<br>Subj: is this an active thread? If the message belongs to a known thread, the thread's activity level serves as the thread feature.<br>Body: what does the message say? The most frequent terms serve as content features.<br>We write functions that extract these fields from each message as it is read, turning the semi-structured data into a highly structured training set.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span 
class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> pre_fun <span class="operator"><-</span> <span class="keyword">function</span><span class="punctuation">(</span>string<span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> string <span class="operator"><-</span> str_replace_all<span class="punctuation">(</span>string<span class="punctuation">,</span><span class="string">"\\s+"</span><span class="punctuation">,</span><span class="string">" "</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> string <span class="operator"><-</span> tolower<span class="punctuation">(</span>string<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> string <span class="operator"><-</span> str_replace_all<span class="punctuation">(</span>string<span class="punctuation">,</span><span class="string">"[^a-z]"</span><span class="punctuation">,</span><span class="string">" "</span><span 
class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> string <span class="operator"><-</span> str_replace_all<span class="punctuation">(</span>string<span class="punctuation">,</span><span class="string">"\\s+"</span><span class="punctuation">,</span><span class="string">" "</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> string <span class="operator"><-</span> str_trim<span class="punctuation">(</span>string<span class="punctuation">,</span>side <span class="operator">=</span> <span class="string">"both"</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="built_in">return</span><span class="punctuation">(</span>string<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> <span class="comment"># Function that reads one email file</span></span><br><span class="line"><span class="operator">></span> read_fun <span class="operator"><-</span> <span class="keyword">function</span><span class="punctuation">(</span>f<span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> <span class="keyword">if</span> <span class="punctuation">(</span><span class="operator">!</span>str_detect<span class="punctuation">(</span>f<span class="punctuation">,</span><span class="string">"cmds"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> f.txt <span class="operator"><-</span> readr<span class="operator">::</span>read_file<span class="punctuation">(</span>f<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="comment"># 
Extract From</span></span><br><span class="line"><span class="operator">+</span> from <span class="operator"><-</span> str_extract_all<span class="punctuation">(</span>f.txt<span class="punctuation">,</span><span class="string">"From:(.*)"</span><span class="punctuation">)</span> <span class="operator">%>%</span> unlist</span><br><span class="line"><span class="operator">+</span> from <span class="operator"><-</span> ifelse<span class="punctuation">(</span><span class="built_in">length</span><span class="punctuation">(</span>from<span class="punctuation">)</span><span class="operator">></span><span class="number">1</span><span class="punctuation">,</span>from<span class="punctuation">[</span>str_detect<span class="punctuation">(</span>from<span class="punctuation">,</span><span class="string">"@"</span><span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">,</span>from<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="comment"># If the address is wrapped in <>, extract it</span></span><br><span class="line"><span class="operator">+</span> <span class="keyword">if</span><span class="punctuation">(</span>str_detect<span class="punctuation">(</span>from<span class="punctuation">,</span><span class="string">"<"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> from <span class="operator"><-</span> str_extract<span class="punctuation">(</span>from<span class="punctuation">,</span><span class="string">"<+(.*?)+>"</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> str_remove_all<span class="punctuation">(</span><span class="string">"<|>"</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span> <span class="keyword">else</span> 
<span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> <span class="comment"># If there are no angle brackets, strip "From: " and any parenthesized text</span></span><br><span class="line"><span class="operator">+</span> from <span class="operator"><-</span> str_remove_all<span class="punctuation">(</span>from<span class="punctuation">,</span><span class="string">"From: |\\(.*?\\)"</span><span class="punctuation">)</span><span class="punctuation">}</span></span><br><span class="line"><span class="operator">+</span> <span class="comment"># Extract Date</span></span><br><span class="line"><span class="operator">+</span> date <span class="operator"><-</span> str_extract<span class="punctuation">(</span>f.txt<span class="punctuation">,</span><span class="string">"Date:(.*)"</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> str_remove<span class="punctuation">(</span><span class="string">"Date: "</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="comment"># Extract Subject</span></span><br><span class="line"><span class="operator">+</span> subject <span class="operator"><-</span> str_extract<span class="punctuation">(</span>f.txt<span class="punctuation">,</span><span class="string">"Subject:(.*)"</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> str_remove<span class="punctuation">(</span><span class="string">"Subject: "</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="comment"># Split at the first blank line to extract the message body</span></span><br><span class="line"><span class="operator">+</span> message <span class="operator"><-</span> str_split_fixed<span class="punctuation">(</span>f.txt<span class="punctuation">,</span><span class="string">"\n\n"</span><span class="punctuation">,</span><span class="number">2</span><span 
class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> message <span class="operator"><-</span> message<span class="punctuation">[</span><span class="number">1</span><span class="punctuation">,</span><span class="number">2</span><span class="punctuation">]</span> <span class="operator">%>%</span> pre_fun</span><br><span class="line"><span class="operator">+</span> df <span class="operator"><-</span> tibble<span class="punctuation">(</span>from<span class="operator">=</span>from<span class="punctuation">,</span>date<span class="operator">=</span>date<span class="punctuation">,</span>subject<span class="operator">=</span>subject<span class="punctuation">,</span>message<span class="operator">=</span>message<span class="punctuation">,</span>id<span class="operator">=</span>f<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="built_in">return</span><span class="punctuation">(</span>df<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> dt <span class="operator"><-</span> sapply<span class="punctuation">(</span>emails<span class="punctuation">,</span>read_fun<span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> do.call<span class="punctuation">(</span>bind_rows<span class="punctuation">,</span>.<span class="punctuation">)</span> <span class="operator">%>%</span> distinct<span 
class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> head<span class="punctuation">(</span>dt<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Ruby"><figure class="iseeu highlight ruby"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## # A tibble: 6 x 5</span></span><br><span class="line"><span class="comment">## from date subject message id </span></span><br><span class="line"><span class="comment">## <chr> <chr> <chr> <chr> <chr> </span></span><br><span class="line"><span class="comment">## 1 kre<span class="doctag">@munna</span>~ Thu, 22 A~ Re: New Sequen~ date wed aug from ~ D:/R/data_set/sp~</span></span><br><span class="line"><span class="comment">## 2 steve.bur~ Thu, 22 A~ [zzzzteana] RE~ martin a posted ta~ D:/R/data_set/sp~</span></span><br><span class="line"><span class="comment">## 3 timc@2ubh~ Thu, 22 A~ [zzzzteana] Mo~ man threatens expl~ D:/R/data_set/sp~</span></span><br><span class="line"><span class="comment">## 4 monty<span class="doctag">@ros</span>~ Thu, 22 A~ [IRR] Klez: Th~ klez the virus tha~ D:/R/data_set/sp~</span></span><br><span class="line"><span class="comment">## 5 Stewart.S~ Thu, 22 A~ Re: [zzzzteana~ in adding cream to~ D:/R/data_set/sp~</span></span><br><span class="line"><span class="comment">## 6 martin<span class="doctag">@sr</span>~ Thu, 22 A~ Re: [zzzzteana~ i just had to jump~ D:/R/data_set/sp~</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> mice<span class="operator">::</span>md.pattern<span class="punctuation">(</span>dt<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><p><img lazyload src="/images/loading.svg" data-src="//upload-images.jianshu.io/upload_images/20267488-696dcb9844048421.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/700/format/webp" ></p><p>Check for missing values:</p><div class="highlight-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## from date message id subject </span></span><br><span class="line"><span class="comment">## 6944 1 1 1 1 1 0</span></span><br><span class="line"><span class="comment">## 7 1 1 1 1 0 1</span></span><br><span class="line"><span class="comment">## 0 0 0 0 7 7</span></span><br></pre></td></tr></table></figure></div><p>The subject variable has 7 missing values.<br>We also need to check individual variables in more detail.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> <span class="comment"># Check that every from value contains an @ sign</span></span><br><span class="line"><span class="operator">></span> table<span class="punctuation">(</span>str_detect<span class="punctuation">(</span>dt<span class="operator">$</span>from<span class="punctuation">,</span><span class="string">"@"</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## </span></span><br><span class="line"><span class="comment">## TRUE </span></span><br><span class="line"><span class="comment">## 6951</span></span><br></pre></td></tr></table></figure></div><p>All 6951 sender values contain an @ sign, so the sender field at least has the form of an email address.</p><p>Randomly inspect 30 values from the date column:</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> dt<span class="operator">$</span>date<span class="punctuation">[</span>sample<span class="punctuation">(</span>nrow<span class="punctuation">(</span>dt<span class="punctuation">)</span><span class="punctuation">,</span><span class="number">30</span><span class="punctuation">)</span><span class="punctuation">]</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Css"><figure class="iseeu highlight css"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span 
class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line">## <span class="selector-attr">[1]</span> "Tue, <span class="number">27</span> Aug <span class="number">2002</span> <span class="number">21</span>:<span class="number">36</span>:<span class="number">22</span> -<span class="number">0400</span><span class="string">" </span></span><br><span class="line"><span class="string">## [2] "</span>Fri, <span class="number">9</span> Aug <span class="number">2002</span> <span class="number">20</span>:<span class="number">09</span>:<span class="number">02</span> -<span class="number">0700</span><span class="string">" </span></span><br><span class="line"><span class="string">## [3] "</span>Thu, <span class="number">25</span> Jul <span class="number">2002</span> <span class="number">04</span>:<span class="number">56</span>:<span class="number">39</span> -<span class="number">0400</span> (EDT)<span class="string">"</span></span><br><span class="line"><span class="string">## [4] "</span>Sat, <span class="number">24</span> Aug <span class="number">2002</span> <span class="number">10</span>:<span class="number">57</span>:<span class="number">13</span> -<span class="number">0400</span> (EDT)<span class="string">"</span></span><br><span class="line"><span class="string">## [5] "</span>Fri, <span class="number">04</span> Oct <span class="number">2002</span> <span class="number">10</span>:<span class="number">03</span>:<span class="number">14</span> +<span class="number">0300</span><span class="string">" </span></span><br><span class="line"><span class="string">## [6] "</span>Mon, <span class="number">2</span> Sep <span class="number">2002</span> <span class="number">09</span>:<span class="number">33</span>:<span class="number">47</span> -<span class="number">0400</span><span class="string">" </span></span><br><span class="line"><span class="string">## [7] "</span>Mon, <span class="number">09</span> 
Sep <span class="number">2002</span> <span class="number">12</span>:<span class="number">29</span>:<span class="number">51</span> -<span class="number">0400</span><span class="string">" </span></span><br><span class="line"><span class="string">## [8] "</span>Thu, <span class="number">22</span> Aug <span class="number">2002</span> <span class="number">12</span>:<span class="number">39</span>:<span class="number">47</span> -<span class="number">0300</span><span class="string">" </span></span><br><span class="line"><span class="string">## [9] "</span>Wed, <span class="number">28</span> Aug <span class="number">2002</span> <span class="number">07</span>:<span class="number">45</span>:<span class="number">18</span> -<span class="number">0700</span><span class="string">" </span></span><br><span class="line"><span class="string">## [10] "</span>Tue, <span class="number">24</span> Sep <span class="number">2002</span> <span class="number">08</span>:<span class="number">00</span>:<span class="number">11</span> -<span class="number">0000</span><span class="string">" </span></span><br><span class="line"><span class="string">## [11] "</span>Sat, <span class="number">03</span> Aug <span class="number">2002</span> <span class="number">22</span>:<span class="number">31</span>:<span class="number">23</span> -<span class="number">0700</span><span class="string">" </span></span><br><span class="line"><span class="string">## [12] "</span>Mon, <span class="number">07</span> Oct <span class="number">2002</span> <span class="number">08</span>:<span class="number">00</span>:<span class="number">59</span> -<span class="number">0000</span><span class="string">" </span></span><br><span class="line"><span class="string">## [13] "</span><span class="number">20</span> Jul <span class="number">2002</span> <span class="number">10</span>:<span class="number">50</span>:<span class="number">58</span> +<span class="number">1200</span><span class="string">" </span></span><br><span class="line"><span 
class="string">## [14] "</span>Wed, <span class="number">10</span> Jul <span class="number">2002</span> <span class="number">16</span>:<span class="number">34</span>:<span class="number">42</span> -<span class="number">0700</span> (PDT)<span class="string">"</span></span><br><span class="line"><span class="string">## [15] "</span>Tue, <span class="number">08</span> Oct <span class="number">2002</span> <span class="number">13</span>:<span class="number">28</span>:<span class="number">56</span> +<span class="number">0100</span><span class="string">" </span></span><br><span class="line"><span class="string">## [16] "</span>Tue, <span class="number">20</span> Aug <span class="number">2002</span> <span class="number">16</span>:<span class="number">30</span>:<span class="number">38</span> -<span class="number">0300</span><span class="string">" </span></span><br><span class="line"><span class="string">## [17] "</span>Thu, <span class="number">26</span> Sep <span class="number">2002</span> <span class="number">08</span>:<span class="number">01</span>:<span class="number">56</span> -<span class="number">0000</span><span class="string">" </span></span><br><span class="line"><span class="string">## [18] "</span>Tue, <span class="number">1</span> Oct <span class="number">2002</span> <span class="number">14</span>:<span class="number">16</span>:<span class="number">16</span> +<span class="number">0300</span> (EEST)<span class="string">"</span></span><br><span class="line"><span class="string">## [19] "</span>Mon, <span class="number">30</span> Sep <span class="number">2002</span> <span class="number">15</span>:<span class="number">55</span>:<span class="number">47</span> -<span class="number">0400</span><span class="string">" </span></span><br><span class="line"><span class="string">## [20] "</span>Thu, <span class="number">18</span> Jul <span class="number">2002</span> <span class="number">17</span>:<span class="number">20</span>:<span class="number">20</span> -<span 
class="number">0700</span> (PDT)<span class="string">"</span></span><br><span class="line"><span class="string">## [21] "</span>Tue, <span class="number">20</span> Aug <span class="number">2002</span> <span class="number">15</span>:<span class="number">31</span>:<span class="number">17</span> +<span class="number">0100</span><span class="string">" </span></span><br><span class="line"><span class="string">## [22] "</span><span class="number">03</span> Oct <span class="number">2002</span> <span class="number">21</span>:<span class="number">58</span>:<span class="number">55</span> -<span class="number">0400</span><span class="string">" </span></span><br><span class="line"><span class="string">## [23] "</span>Sun, <span class="number">01</span> Dec <span class="number">2002</span> <span class="number">18</span>:<span class="number">03</span>:<span class="number">10</span> -<span class="number">0700</span><span class="string">" </span></span><br><span class="line"><span class="string">## [24] "</span>Thu, <span class="number">18</span> Jul <span class="number">2002</span> <span class="number">13</span>:<span class="number">46</span>:<span class="number">12</span> -<span class="number">0700</span> (PDT)<span class="string">"</span></span><br><span class="line"><span class="string">## [25] "</span>Sun, <span class="number">29</span> Sep <span class="number">2002</span> <span class="number">08</span>:<span class="number">00</span>:<span class="number">02</span> -<span class="number">0000</span><span class="string">" </span></span><br><span class="line"><span class="string">## [26] "</span>Thu, <span class="number">26</span> Sep <span class="number">2002</span> <span class="number">15</span>:<span class="number">32</span>:<span class="number">19</span> -<span class="number">0000</span><span class="string">" </span></span><br><span class="line"><span class="string">## [27] "</span>Wed, <span class="number">31</span> Jul <span class="number">2002</span> <span 
class="number">16</span>:<span class="number">37</span>:<span class="number">42</span> +<span class="number">0100</span><span class="string">" </span></span><br><span class="line"><span class="string">## [28] "</span>Mon, <span class="number">12</span> Aug <span class="number">2002</span> <span class="number">09</span>:<span class="number">29</span>:<span class="number">38</span> +<span class="number">0100</span><span class="string">" </span></span><br><span class="line"><span class="string">## [29] "</span>Sat Sep <span class="number">7</span> <span class="number">04</span>:<span class="number">38</span>:<span class="number">51</span> <span class="number">2002</span><span class="string">" </span></span><br><span class="line"><span class="string">## [30] "</span>Wed, <span class="number">09</span> Oct <span class="number">2002</span> <span class="number">08</span>:<span class="number">00</span>:<span class="number">35</span> -<span class="number">0000</span><span class="string">"</span></span><br></pre></td></tr></table></figure></div><p>Sampling a few more times shows that the date column uses several different formats, for example<br>“Sun, 15 Sep 2002 21:22:52 -0400”,<br>“01 Oct 2002 19:22:16 -0700”,<br>“Wed, 17 Jul 2002 20:58:30 -0700 (PDT)”,<br>“Tue, 24 Sep 2002 08:46:08 EDT”,<br>“Tue Sep 10 10:29:19 2002”,<br>so they need to be normalized to a single format. We build a date conversion function:</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> trans_date <span class="operator"><-</span> <span class="keyword">function</span><span class="punctuation">(</span>string<span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> string <span class="operator"><-</span> str_split<span class="punctuation">(</span>string<span class="punctuation">,</span><span class="string">" "</span><span class="punctuation">)</span> <span class="operator">%>%</span> unlist <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> str_remove_all<span class="punctuation">(</span><span class="string">"Sun|Mon|Tue|Wed|Thu|Fri|Sat|,"</span><span class="punctuation">)</span> <span class="operator">%>%</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> str_remove_all<span class="punctuation">(</span><span class="string">"[+|-](.*)"</span><span class="punctuation">)</span> <span class="operator">%>%</span> str_remove_all<span class="punctuation">(</span><span class="string">"\\(.*\\)"</span><span class="punctuation">)</span> <span class="operator">%>%</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> str_remove_all<span class="punctuation">(</span><span class="string">"[A-Z]{2,}"</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> year <span class="operator"><-</span> string<span class="punctuation">[</span>nchar<span class="punctuation">(</span>string<span class="punctuation">)</span><span class="operator">==</span><span class="number">4</span><span class="punctuation">]</span></span><br><span 
class="line"><span class="operator">+</span> month <span class="operator"><-</span> string<span class="punctuation">[</span>nchar<span class="punctuation">(</span>string<span class="punctuation">)</span><span class="operator">==</span><span class="number">3</span><span class="punctuation">]</span></span><br><span class="line"><span class="operator">+</span> day <span class="operator"><-</span> string<span class="punctuation">[</span>nchar<span class="punctuation">(</span>string<span class="punctuation">)</span><span class="operator">==</span><span class="number">1</span><span class="operator">|</span>nchar<span class="punctuation">(</span>string<span class="punctuation">)</span><span class="operator">==</span><span class="number">2</span><span class="punctuation">]</span></span><br><span class="line"><span class="operator">+</span> time <span class="operator"><-</span> string<span class="punctuation">[</span>nchar<span class="punctuation">(</span>string<span class="punctuation">)</span><span class="operator">==</span><span class="number">8</span><span class="punctuation">]</span></span><br><span class="line"><span class="operator">+</span> string.new <span class="operator"><-</span> paste<span class="punctuation">(</span>day<span class="punctuation">,</span>month<span class="punctuation">,</span>year<span class="punctuation">,</span>time<span class="punctuation">)</span> <span class="operator">%>%</span> lubridate<span class="operator">::</span>dmy_hms<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="built_in">return</span><span class="punctuation">(</span>string.new<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> dt<span class="operator">$</span>date <span 
class="operator"><-</span> dt<span class="operator">$</span>date <span class="operator">%>%</span> trans_date</span><br><span class="line"><span class="operator">></span> dt<span class="operator">$</span>date<span class="punctuation">[</span>sample<span class="punctuation">(</span>nrow<span class="punctuation">(</span>dt<span class="punctuation">)</span><span class="punctuation">,</span><span class="number">10</span><span class="punctuation">)</span><span class="punctuation">]</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Css"><figure class="iseeu highlight css"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">## <span class="selector-attr">[1]</span> "<span class="number">2002</span>-<span class="number">07</span>-<span class="number">15</span> <span class="number">03</span>:<span class="number">00</span>:<span class="number">01</span> UTC<span class="string">" "</span><span class="number">2002</span>-<span class="number">10</span>-<span class="number">08</span> <span class="number">08</span>:<span class="number">01</span>:<span class="number">21</span> UTC<span class="string">"</span></span><br><span class="line"><span class="string">## [3] "</span><span class="number">2002</span>-<span class="number">08</span>-<span class="number">29</span> <span class="number">08</span>:<span class="number">32</span>:<span class="number">08</span> UTC<span class="string">" "</span><span class="number">2002</span>-<span class="number">09</span>-<span class="number">25</span> <span class="number">08</span>:<span class="number">00</span>:<span class="number">22</span> UTC<span class="string">"</span></span><br><span class="line"><span class="string">## [5] "</span><span class="number">2002</span>-<span class="number">10</span>-<span 
class="number">08</span> <span class="number">08</span>:<span class="number">01</span>:<span class="number">05</span> UTC<span class="string">" "</span><span class="number">2002</span>-<span class="number">09</span>-<span class="number">12</span> <span class="number">09</span>:<span class="number">05</span>:<span class="number">50</span> UTC<span class="string">"</span></span><br><span class="line"><span class="string">## [7] "</span><span class="number">2002</span>-<span class="number">09</span>-<span class="number">30</span> <span class="number">22</span>:<span class="number">00</span>:<span class="number">02</span> UTC<span class="string">" "</span><span class="number">2002</span>-<span class="number">08</span>-<span class="number">06</span> <span class="number">16</span>:<span class="number">50</span>:<span class="number">07</span> UTC<span class="string">"</span></span><br><span class="line"><span class="string">## [9] "</span><span class="number">2002</span>-<span class="number">07</span>-<span class="number">10</span> <span class="number">16</span>:<span class="number">05</span>:<span class="number">42</span> UTC<span class="string">" "</span><span class="number">2002</span>-<span class="number">10</span>-<span class="number">08</span> <span class="number">08</span>:<span class="number">00</span>:<span class="number">31</span> UTC<span class="string">"</span></span><br></pre></td></tr></table></figure></div><p>The converted values are uniform and are what we need for sorting. Note that the function discards the time-zone offset, so each value is the sender's local clock time labelled as UTC.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> head<span class="punctuation">(</span>dt<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Ruby"><figure class="iseeu highlight ruby"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## # A tibble: 6 x 5</span></span><br><span class="line"><span class="comment">## from date subject message id </span></span><br><span class="line"><span class="comment">## <chr> <dttm> <chr> <chr> <chr> </span></span><br><span class="line"><span class="comment">## 1 kre<span class="doctag">@munn</span>~ 2002-08-22 18:26:25 Re: New Sequ~ date wed aug fr~ D:/R/data_set~</span></span><br><span class="line"><span class="comment">## 2 steve.bu~ 2002-08-22 12:46:18 [zzzzteana] ~ martin a posted~ D:/R/data_set~</span></span><br><span class="line"><span class="comment">## 3 timc@2ub~ 2002-08-22 13:52:38 [zzzzteana] ~ man threatens e~ D:/R/data_set~</span></span><br><span class="line"><span class="comment">## 4 monty<span class="doctag">@ro</span>~ 2002-08-22 09:15:25 [IRR] Klez: ~ klez the virus ~ D:/R/data_set~</span></span><br><span class="line"><span class="comment">## 5 Stewart.~ 2002-08-22 14:38:22 Re: [zzzztea~ in adding cream~ D:/R/data_set~</span></span><br><span class="line"><span class="comment">## 6 martin<span class="doctag">@s</span>~ 2002-08-22 14:50:31 Re: [zzzztea~ i just had to j~ D:/R/data_set~</span></span><br></pre></td></tr></table></figure></div><p>The data is now mostly in the shape we need, so only a few more transformations remain. We convert from and subject to lowercase, sort the whole data frame by the date column, and finally split the data into a training set and a test set.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span 
class="line"><span class="operator">></span> dt<span class="operator">$</span>from <span class="operator"><-</span> tolower<span class="punctuation">(</span>dt<span class="operator">$</span>from<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> dt<span class="operator">$</span>subject <span class="operator"><-</span> tolower<span class="punctuation">(</span>dt<span class="operator">$</span>subject<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> dt <span class="operator"><-</span> arrange<span class="punctuation">(</span>dt<span class="punctuation">,</span>date<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> set.seed<span class="punctuation">(</span><span class="number">123</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> ind <span class="operator"><-</span> sample<span class="punctuation">(</span><span class="number">1</span><span class="operator">:</span>nrow<span class="punctuation">(</span>dt<span class="punctuation">)</span><span class="punctuation">,</span>nrow<span class="punctuation">(</span>dt<span class="punctuation">)</span><span class="operator">*</span><span class="number">0.8</span><span class="punctuation">,</span>replace <span class="operator">=</span> <span class="built_in">T</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> train <span class="operator"><-</span> dt<span class="punctuation">[</span>ind<span class="punctuation">,</span><span class="punctuation">]</span></span><br><span class="line"><span class="operator">></span> test <span class="operator"><-</span> dt<span class="punctuation">[</span><span class="operator">-</span>ind<span class="punctuation">,</span><span 
class="punctuation">]</span></span><br></pre></td></tr></table></figure></div><h1 id="2、邮件发送量权重计算策略"><a href="#2、邮件发送量权重计算策略" class="headerlink" title="2、邮件发送量权重计算策略"></a>2. Weighting strategy for sender email volume</h1><p>The more often emails arrive from the same address (from), the more important emails from that sender are assumed to be. We therefore use the per-sender count of from to build a weight for the importance ranking.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> p_load<span class="punctuation">(</span>ggplot2<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> from.weight <span class="operator"><-</span> train<span class="punctuation">[</span><span class="punctuation">,</span><span class="string">"from"</span><span class="punctuation">]</span> <span class="operator">%>%</span> group_by<span class="punctuation">(</span>from<span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> summarise<span class="punctuation">(</span>freq<span class="operator">=</span>n<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">%>%</span> arrange<span class="punctuation">(</span><span class="operator">-</span>freq<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> head<span class="punctuation">(</span>from.weight<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Ruby"><figure class="iseeu highlight ruby"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## # A tibble: 6 x 2</span></span><br><span class="line"><span class="comment">## from freq</span></span><br><span class="line"><span class="comment">## <chr> <int></span></span><br><span class="line"><span class="comment">## 1 rssfeeds<span class="doctag">@example</span>.com 498</span></span><br><span class="line"><span class="comment">## 2 rssfeeds<span class="doctag">@spamassassin</span>.taint.org 468</span></span><br><span class="line"><span class="comment">## 3 tomwhore<span class="doctag">@slack</span>.net 112</span></span><br><span class="line"><span class="comment">## 4 garym<span class="doctag">@canada</span>.com 104</span></span><br><span class="line"><span class="comment">## 5 pudge<span class="doctag">@perl</span>.org 102</span></span><br><span class="line"><span class="comment">## 6 matthias<span class="doctag">@egwn</span>.net 86</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> <span class="comment"># Plot the 30 accounts with the most emails</span></span><br><span class="line"><span class="operator">></span> from.weight <span class="operator">%>%</span> top_n<span class="punctuation">(</span><span class="number">30</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> ggplot<span class="punctuation">(</span>aes<span class="punctuation">(</span>freq<span class="punctuation">,</span>reorder<span class="punctuation">(</span>from<span class="punctuation">,</span>freq<span 
class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> geom_bar<span class="punctuation">(</span>stat <span class="operator">=</span> <span class="string">"identity"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> theme_bw<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> labs<span class="punctuation">(</span>x<span class="operator">=</span><span class="string">"接收邮件数量"</span><span class="punctuation">,</span>y<span class="operator">=</span><span class="string">""</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><p><img lazyload src="/images/loading.svg" data-src="//upload-images.jianshu.io/upload_images/20267488-07aeffbb2216d9c2.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/700/format/webp" ></p><p>Most frequent senders</p><p>Next, apply a log transform to the email counts.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> from.weight <span class="operator">%>%</span></span><br><span class="line"><span class="operator">+</span> ggplot<span class="punctuation">(</span>aes<span class="punctuation">(</span>x<span class="operator">=</span>freq<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> geom_line<span class="punctuation">(</span>aes<span 
class="punctuation">(</span>y<span class="operator">=</span>log10<span class="punctuation">(</span>freq<span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">,</span>col<span class="operator">=</span><span class="string">"green"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> geom_text<span class="punctuation">(</span>aes<span class="punctuation">(</span><span class="number">500</span><span class="punctuation">,</span><span class="number">2.5</span><span class="punctuation">,</span>label<span class="operator">=</span><span class="string">"对数变换"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> geom_line<span class="punctuation">(</span>aes<span class="punctuation">(</span>y<span class="operator">=</span><span class="built_in">log</span><span class="punctuation">(</span>freq<span class="punctuation">)</span><span class="punctuation">)</span><span class="punctuation">,</span>col<span class="operator">=</span><span class="string">"red"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> geom_text<span class="punctuation">(</span>aes<span class="punctuation">(</span><span class="number">500</span><span class="punctuation">,</span><span class="number">6</span><span class="punctuation">,</span>label<span class="operator">=</span><span class="string">"自然对数变换"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> theme_bw<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> labs<span class="punctuation">(</span>x<span 
class="operator">=</span><span class="string">""</span><span class="punctuation">,</span>y<span class="operator">=</span><span class="string">"接收的邮件量"</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><p><img lazyload src="/images/loading.svg" data-src="//upload-images.jianshu.io/upload_images/20267488-c476e3d9140aaa81.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/700/format/webp" ></p><p>Effect of the log transforms</p><p>After a log transform the curve is much flatter. The natural log compresses the values less than the base-10 log, so it preserves more of the differences in the original data. We therefore use the natural-log value as the weight for the email-volume feature.<br>One caveat: if an observed count is 1, its log is 0, and because the final weight is a product, a factor of 0 wipes out every other weight. To avoid this, 1 is usually added to each count before the transform.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> <span class="comment"># log1p(x) computes log(1 + x)</span></span><br><span class="line"><span class="operator">></span> from.weight<span class="operator">$</span>freq <span class="operator"><-</span> log1p<span class="punctuation">(</span>from.weight<span class="operator">$</span>freq<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> <span class="comment"># Check the transformed data</span></span><br><span class="line"><span class="operator">></span> summary<span class="punctuation">(</span>from.weight<span class="operator">$</span>freq<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Css"><figure class="iseeu highlight css"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">## Min. <span class="number">1s</span>t Qu. Median Mean <span class="number">3</span>rd Qu. Max. 
</span><br><span class="line">## <span class="number">0.6931</span> <span class="number">0.6931</span> <span class="number">1.3863</span> <span class="number">1.4869</span> <span class="number">1.7918</span> <span class="number">6.2126</span></span><br></pre></td></tr></table></figure></div><h1 id="3、邮件线程活跃度权重计算策略"><a href="#3、邮件线程活跃度权重计算策略" class="headerlink" title="3、邮件线程活跃度权重计算策略"></a>3. Weighting strategy for email thread activity</h1><p>We look for “re:” in subject, find the other emails in the same thread, and measure how active the thread is. A thread with more emails sent in a short period is more active and therefore more important.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> <span class="comment"># Keep subjects containing “re:” and use the text after “re:” as the thread topic</span></span><br><span class="line"><span class="operator">></span> threads.train <span class="operator"><-</span> train <span class="operator">%>%</span> filter<span class="punctuation">(</span>str_detect<span class="punctuation">(</span>subject<span class="punctuation">,</span><span class="string">"re:"</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> extract_subject <span class="operator"><-</span> <span class="keyword">function</span><span class="punctuation">(</span>string<span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> string <span class="operator"><-</span> str_split<span 
class="punctuation">(</span>string<span class="punctuation">,</span><span class="string">"re:"</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> unlist <span class="operator">%>%</span> .<span class="punctuation">[</span><span class="number">2</span><span class="punctuation">]</span> <span class="operator">%>%</span> str_trim<span class="punctuation">(</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="built_in">return</span><span class="punctuation">(</span>string<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">></span> threads.train<span class="operator">$</span>subject <span class="operator"><-</span> threads.train<span class="operator">$</span>subject <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> lapply<span class="punctuation">(</span>extract_subject<span class="punctuation">)</span> <span class="operator">%>%</span> unlist</span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> <span class="comment"># Count the emails in each thread (grouped by subject)</span></span><br><span class="line"><span class="operator">></span> threads.freq <span class="operator"><-</span> threads.train <span class="operator">%>%</span> group_by<span class="punctuation">(</span>subject<span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> summarise<span class="punctuation">(</span>freq<span class="operator">=</span>n<span class="punctuation">(</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">%>%</span> arrange<span class="punctuation">(</span>freq<span 
class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><p>Some threads have freq below 2. These threads were started before the collection period began, so their subjects carry the “re:” marker but the message that started the thread is not in the dataset. Such threads need to be removed.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> <span class="comment"># Thread time span: the interval between the first and the last email in the thread</span></span><br><span class="line"><span class="operator">></span> time_span <span class="operator"><-</span> <span class="keyword">function</span><span class="punctuation">(</span>df<span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> max.time <span class="operator"><-</span> <span class="built_in">max</span><span class="punctuation">(</span>df<span class="operator">$</span>date<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> min.time <span class="operator"><-</span> <span class="built_in">min</span><span class="punctuation">(</span>df<span class="operator">$</span>date<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> threads.span <span class="operator"><-</span> difftime<span class="punctuation">(</span>max.time<span class="punctuation">,</span>min.time<span class="punctuation">,</span>units <span class="operator">=</span> <span class="string">"secs"</span><span class="punctuation">)</span></span><br><span 
class="line"><span class="operator">+</span> df.new <span class="operator"><-</span> tibble<span class="punctuation">(</span>subject<span class="operator">=</span>df<span class="operator">$</span>subject<span class="punctuation">[</span><span class="number">1</span><span class="punctuation">]</span><span class="punctuation">,</span>threads.span<span class="operator">=</span>threads.span<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="built_in">return</span><span class="punctuation">(</span>df.new<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> <span class="comment"># Split the data frame by subject</span></span><br><span class="line"><span class="operator">></span> threads.train.split <span class="operator"><-</span> split.data.frame<span class="punctuation">(</span>threads.train<span class="punctuation">,</span></span><br><span class="line"><span class="operator">+</span> threads.train<span class="operator">$</span>subject<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> threads.span <span class="operator"><-</span> lapply<span class="punctuation">(</span>threads.train.split<span class="punctuation">,</span>time_span<span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> do.call<span class="punctuation">(</span>rbind.data.frame<span class="punctuation">,</span>.<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><p>Join the two data frames by subject.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> subject.weight <span class="operator"><-</span> left_join<span class="punctuation">(</span>threads.freq<span class="punctuation">,</span>threads.span<span class="punctuation">,</span>by<span class="operator">=</span><span class="string">"subject"</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> <span class="comment"># Convert the time span to numeric</span></span><br><span class="line"><span class="operator">+</span> transform<span class="punctuation">(</span>threads.span<span class="operator">=</span><span class="built_in">as.numeric</span><span class="punctuation">(</span>threads.span<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> filter<span class="punctuation">(</span>freq<span class="operator">>=</span><span class="number">2</span> <span class="operator">&</span> threads.span<span class="operator">!=</span><span class="number">0</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> mutate<span class="punctuation">(</span>weight<span class="operator">=</span>freq<span class="operator">/</span>threads.span<span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> <span class="comment"># Affine transform: shift the log weight by 10</span></span><br><span class="line"><span class="operator">+</span> transform<span class="punctuation">(</span>weight<span class="operator">=</span>log10<span class="punctuation">(</span>weight<span class="punctuation">)</span><span 
class="operator">+</span><span class="number">10</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> arrange<span class="punctuation">(</span>weight<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> head<span class="punctuation">(</span>subject.weight<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Css"><figure class="iseeu highlight css"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">## subject freq threads<span class="selector-class">.span</span> weight</span><br><span class="line">## <span class="number">1</span> activebuddy <span class="number">17</span> <span class="number">820721053</span> <span class="number">2.316253</span></span><br><span class="line">## <span class="number">2</span> <span class="selector-attr">[zzzzteana]</span> <span class="number">6</span> <span class="number">8275106</span> <span class="number">3.860378</span></span><br><span class="line">## <span class="number">3</span> no matter where you go <span class="number">4</span> <span class="number">2672629</span> <span class="number">4.175121</span></span><br><span class="line">## <span class="number">4</span> <span class="selector-attr">[sadev]</span> <span class="number">7</span> <span class="number">3325905</span> <span class="number">4.323188</span></span><br><span class="line">## <span class="number">5</span> <span class="selector-attr">[ilug-social]</span> <span class="number">4</span> <span class="number">1649403</span> <span class="number">4.384733</span></span><br><span class="line">## <span class="number">6</span> <span 
class="selector-attr">[razor-users]</span> <span class="number">16</span> <span class="number">5174599</span> <span class="number">4.490243</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> summary<span class="punctuation">(</span>subject.weight<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Css"><figure class="iseeu highlight css"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">## subject freq threads<span class="selector-class">.span</span> weight </span><br><span class="line">## Length:<span class="number">287</span> Min. : <span class="number">2.000</span> Min. : <span class="number">16</span> Min. 
:<span class="number">2.316</span> </span><br><span class="line">## Class :character <span class="number">1s</span>t Qu.: <span class="number">3.000</span> <span class="number">1s</span>t Qu.: <span class="number">15482</span> <span class="number">1s</span>t Qu.:<span class="number">5.675</span> </span><br><span class="line">## Mode :character Median : <span class="number">5.000</span> Median : <span class="number">49344</span> Median :<span class="number">6.126</span> </span><br><span class="line">## Mean : <span class="number">7.784</span> Mean : <span class="number">3094884</span> Mean :<span class="number">6.115</span> </span><br><span class="line">## <span class="number">3</span>rd Qu.: <span class="number">9.000</span> <span class="number">3</span>rd Qu.: <span class="number">145816</span> <span class="number">3</span>rd Qu.:<span class="number">6.512</span> </span><br><span class="line">## Max. :<span class="number">41.000</span> Max. :<span class="number">820721053</span> Max. :<span class="number">9.097</span></span><br></pre></td></tr></table></figure></div><p>The summary shows that freq averages 7.784 and threads.span averages 3094884, so the raw weight freq/threads.span is tiny, about 2.515118e-06 at these averages. Its log is therefore negative: log10(7.784/3094884) is roughly -5.6. Weights must not be negative, so we apply an affine transform and simply add 10 to every transformed value, which keeps all weights positive.</p><h1 id="4、邮件内容中高频词项的权重策略"><a href="#4、邮件内容中高频词项的权重策略" class="headerlink" title="4、邮件内容中高频词项的权重策略"></a>4. Weighting strategy for frequent terms in email content</h1><p>We assume that high-frequency terms appearing in the subjects of emails from active threads are more important than low-frequency terms and terms from inactive threads.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span 
class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> p_load<span class="punctuation">(</span>text2vec<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> it <span class="operator"><-</span> itoken<span class="punctuation">(</span>threads.dt<span class="operator">$</span>message<span class="punctuation">,</span>ids <span class="operator">=</span> threads.dt<span class="operator">$</span>id<span class="punctuation">,</span>progressbar <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> <span class="comment"># Build the vocabulary from the training set</span></span><br><span class="line"><span class="operator">></span> vocab <span class="operator"><-</span> create_vocabulary<span class="punctuation">(</span>it<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> <span class="comment"># Load the stop-word list used to remove stop words</span></span><br><span class="line"><span class="operator">></span> stopword <span class="operator"><-</span> readr<span class="operator">::</span>read_table<span class="punctuation">(</span><span class="string">"D:/R/dict/english_stopword.txt"</span><span class="punctuation">,</span></span><br><span class="line"><span class="operator">+</span> col_names <span class="operator">=</span> <span class="built_in">F</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> <span class="comment"># Again use a log transform to weight the frequent terms</span></span><br><span class="line"><span class="operator">></span> term.weight <span class="operator"><-</span> anti_join<span class="punctuation">(</span>vocab<span 
class="punctuation">,</span>stopword<span class="punctuation">,</span>by<span class="operator">=</span><span class="built_in">c</span><span class="punctuation">(</span><span class="string">"term"</span><span class="operator">=</span><span class="string">"X1"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> mutate<span class="punctuation">(</span>term.weight<span class="operator">=</span>log10<span class="punctuation">(</span>term_count<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> filter<span class="punctuation">(</span>term.weight<span class="operator">></span><span class="number">0</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><h1 id="5、训练和测试排序算法"><a href="#5、训练和测试排序算法" class="headerlink" title="5、训练和测试排序算法"></a>5. Training and testing the ranking algorithm</h1><p>The overall weight (priority) of an email is the product of the three weights above. When an email arrives, we first parse it, compute its weight, and then rank it by priority.<br>Build the ranking function:</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span 
class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> get_weight <span class="operator"><-</span> <span class="keyword">function</span><span class="punctuation">(</span>newemail<span class="punctuation">)</span><span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> from.new.n <span class="operator"><-</span> left_join<span class="punctuation">(</span>newemail<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">]</span><span class="punctuation">,</span>from.weight<span class="punctuation">,</span>by <span class="operator">=</span> <span class="string">"from"</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> from.new <span class="operator"><-</span> ifelse<span class="punctuation">(</span><span class="built_in">is.na</span><span class="punctuation">(</span>from.new.n<span class="operator">$</span>freq<span class="punctuation">)</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">,</span>from.new.n<span class="operator">$</span>freq<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> <span class="keyword">if</span> <span class="punctuation">(</span><span class="operator">!</span><span 
class="built_in">is.na</span><span class="punctuation">(</span>newemail<span class="operator">$</span>subject<span class="punctuation">)</span> <span class="operator">&</span> str_detect<span class="punctuation">(</span>newemail<span class="operator">$</span>subject<span class="punctuation">,</span><span class="string">"re:"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> newemail<span class="operator">$</span>subject <span class="operator"><-</span> extract_subject<span class="punctuation">(</span>newemail<span class="operator">$</span>subject<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> subject.new.n <span class="operator"><-</span> left_join<span class="punctuation">(</span>newemail<span class="punctuation">,</span>subject.weight<span class="punctuation">,</span>by<span class="operator">=</span><span class="string">"subject"</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> subject.new <span class="operator"><-</span> ifelse<span class="punctuation">(</span><span class="built_in">is.na</span><span class="punctuation">(</span>subject.new.n<span class="operator">$</span>weight<span class="punctuation">)</span><span class="punctuation">,</span><span class="number">1</span><span class="punctuation">,</span></span><br><span class="line"><span class="operator">+</span> subject.new.n<span class="operator">$</span>weight<span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span> <span class="keyword">else</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> subject.new <span class="operator"><-</span> 
1</span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> msg.new <span class="operator"><-</span> <span class="number">1</span> <span class="comment"># default, so msg.new is defined even when no term survives the filtering</span></span><br><span class="line"><span class="operator">+</span> <span class="keyword">if</span> <span class="punctuation">(</span>newemail<span class="operator">$</span>message<span class="operator">!=</span><span class="string">""</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> msg.weight <span class="operator"><-</span> str_split<span class="punctuation">(</span>newemail<span class="operator">$</span>message<span class="punctuation">,</span><span class="string">" "</span><span class="punctuation">)</span> <span class="operator">%>%</span> unlist <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> jiebaR<span class="operator">::</span>freq<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">%>%</span> anti_join<span class="punctuation">(</span>stopword<span class="punctuation">,</span>by<span class="operator">=</span><span class="built_in">c</span><span class="punctuation">(</span><span class="string">"char"</span><span class="operator">=</span><span class="string">"X1"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> filter<span class="punctuation">(</span>char<span class="operator">!=</span><span class="string">""</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> <span class="keyword">if</span> <span class="punctuation">(</span>nrow<span class="punctuation">(</span>msg.weight<span 
class="punctuation">)</span><span class="operator">!=</span><span class="number">0</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> msg.weight.n <span class="operator"><-</span> left_join<span class="punctuation">(</span>msg.weight<span class="punctuation">,</span>term.weight<span class="punctuation">[</span><span class="punctuation">,</span><span class="built_in">c</span><span class="punctuation">(</span><span class="number">1</span><span class="punctuation">,</span><span class="number">4</span><span class="punctuation">)</span><span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line"><span class="operator">+</span> by<span class="operator">=</span><span class="built_in">c</span><span class="punctuation">(</span><span class="string">"char"</span><span class="operator">=</span><span class="string">"term"</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">%>%</span> </span><br><span class="line"><span class="operator">+</span> summarise<span class="punctuation">(</span>msg.new<span class="operator">=</span><span class="built_in">sum</span><span class="punctuation">(</span>freq<span class="operator">*</span>term.weight<span class="punctuation">,</span>na.rm <span class="operator">=</span> <span class="built_in">T</span><span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> msg.new <span class="operator"><-</span> msg.weight.n<span class="operator">$</span>msg.new</span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span> <span class="keyword">else</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> msg.new <span class="operator"><-</span> 
1</span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> </span><br><span class="line"><span class="operator">+</span> <span class="built_in">return</span><span class="punctuation">(</span><span class="built_in">prod</span><span class="punctuation">(</span>from.new<span class="punctuation">,</span>subject.new<span class="punctuation">,</span>msg.new<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br></pre></td></tr></table></figure></div><h2 id="5-1-对训练集进行排序"><a href="#5-1-对训练集进行排序" class="headerlink" title="5.1 Ranking the training set"></a>5.1 Ranking the training set</h2><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> rank.train <span class="operator"><-</span> vector<span class="punctuation">(</span><span class="built_in">length</span> <span class="operator">=</span> nrow<span class="punctuation">(</span>train<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> <span class="keyword">for</span> <span class="punctuation">(</span>i <span class="keyword">in</span> <span class="number">1</span><span class="operator">:</span>nrow<span class="punctuation">(</span>train<span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> rank.train<span 
class="punctuation">[</span>i<span class="punctuation">]</span> <span class="operator"><-</span> get_weight<span class="punctuation">(</span>train<span class="punctuation">[</span>i<span class="punctuation">,</span><span class="punctuation">]</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> train.rank <span class="operator"><-</span> tibble<span class="punctuation">(</span>id<span class="operator">=</span>train<span class="operator">$</span>id<span class="punctuation">,</span>rank<span class="operator">=</span>rank.train<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> summary<span class="punctuation">(</span>train.rank<span class="operator">$</span>rank<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Css"><figure class="iseeu highlight css"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">## Min. <span class="number">1s</span>t Qu. Median Mean <span class="number">3</span>rd Qu. Max. 
</span><br><span class="line">## <span class="number">8.64</span> <span class="number">282.39</span> <span class="number">859.82</span> <span class="number">2300.37</span> <span class="number">2702.20</span> <span class="number">241506.57</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> <span class="comment"># Check the distribution of the rank values</span></span><br><span class="line"><span class="operator">></span> p1 <span class="operator"><-</span> ggplot<span class="punctuation">(</span>train.rank<span class="punctuation">,</span>aes<span class="punctuation">(</span>rank<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> geom_histogram<span class="punctuation">(</span>bins <span class="operator">=</span> <span class="number">1000</span><span class="punctuation">,</span> fill <span class="operator">=</span> <span class="string">"dodgerblue"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> geom_vline<span class="punctuation">(</span>xintercept <span class="operator">=</span> median<span class="punctuation">(</span>rank.train<span class="punctuation">)</span><span class="punctuation">,</span>size<span class="operator">=</span><span class="number">1</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> xlim<span class="punctuation">(</span><span class="built_in">c</span><span class="punctuation">(</span><span 
class="number">0</span><span class="punctuation">,</span><span class="number">25000</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> theme_bw<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> labs<span class="punctuation">(</span>y<span class="operator">=</span><span class="string">""</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><h1 id="6、对测试集进行排序"><a href="#6、对测试集进行排序" class="headerlink" title="6. Ranking the test set"></a>6. Ranking the test set</h1><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> rank <span class="operator"><-</span> vector<span class="punctuation">(</span><span class="built_in">length</span> <span class="operator">=</span> nrow<span class="punctuation">(</span>test<span class="punctuation">)</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> <span class="keyword">for</span> <span class="punctuation">(</span>i <span class="keyword">in</span> <span class="number">1</span><span class="operator">:</span>nrow<span class="punctuation">(</span>test<span class="punctuation">)</span><span class="punctuation">)</span> <span class="punctuation">{</span></span><br><span class="line"><span class="operator">+</span> rank<span class="punctuation">[</span>i<span class="punctuation">]</span> <span class="operator"><-</span> get_weight<span class="punctuation">(</span>test<span 
class="punctuation">[</span>i<span class="punctuation">,</span><span class="punctuation">]</span><span class="punctuation">)</span></span><br><span class="line"><span class="operator">+</span> <span class="punctuation">}</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> test.rank <span class="operator"><-</span> tibble<span class="punctuation">(</span>id<span class="operator">=</span>test<span class="operator">$</span>id<span class="punctuation">,</span>rank<span class="operator">=</span>rank<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> </span><br><span class="line"><span class="operator">></span> summary<span class="punctuation">(</span>test.rank<span class="operator">$</span>rank<span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Css"><figure class="iseeu highlight css"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">## Min. <span class="number">1s</span>t Qu. Median Mean <span class="number">3</span>rd Qu. Max. 
</span><br><span class="line">## <span class="number">8.64</span> <span class="number">237.50</span> <span class="number">716.02</span> <span class="number">1913.47</span> <span class="number">2301.08</span> <span class="number">36852.14</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> <span class="comment"># Check the distribution of the rank values</span></span><br><span class="line"><span class="operator">></span> p2 <span class="operator"><-</span> ggplot<span class="punctuation">(</span>test.rank<span class="punctuation">,</span>aes<span class="punctuation">(</span>rank<span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> geom_histogram<span class="punctuation">(</span>bins <span class="operator">=</span> <span class="number">1000</span><span class="punctuation">,</span> fill <span class="operator">=</span> <span class="string">"red"</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> geom_vline<span class="punctuation">(</span>xintercept <span class="operator">=</span> median<span class="punctuation">(</span>rank<span class="punctuation">)</span><span class="punctuation">,</span>size<span class="operator">=</span><span class="number">1</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> xlim<span class="punctuation">(</span><span class="built_in">c</span><span class="punctuation">(</span><span 
class="number">0</span><span class="punctuation">,</span><span class="number">25000</span><span class="punctuation">)</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> theme_bw<span class="punctuation">(</span><span class="punctuation">)</span> <span class="operator">+</span></span><br><span class="line"><span class="operator">+</span> labs<span class="punctuation">(</span>y<span class="operator">=</span><span class="string">""</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><p>Compare the rank distributions of the training set and the test set.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> p_load<span class="punctuation">(</span>patchwork<span class="punctuation">)</span></span><br><span class="line"><span class="operator">></span> p1 <span class="operator">+</span> p2 <span class="operator">+</span> plot_layout<span class="punctuation">(</span>nrow <span class="operator">=</span> <span class="number">2</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><p><img lazyload src="/images/loading.svg" data-src="//upload-images.jianshu.io/upload_images/20267488-9b981b5001f799fb.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/700/format/webp" ></p><p>Rank value distributions of the training set and the test set</p><p>The rank distributions of the training set and the test set are almost identical. Both are long-tailed, which means that most emails receive a low priority rank, as one would expect.<br>Next, look at the 20 highest-ranked rows of the test set.</p><div class="highlight-container" data-rel="R"><figure class="iseeu highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="operator">></span> test<span class="punctuation">[</span><span class="punctuation">,</span><span class="number">3</span><span class="punctuation">]</span> <span class="operator">%>%</span> 
cbind<span class="punctuation">(</span>rank<span class="operator">=</span>rank<span class="punctuation">)</span> <span class="operator">%>%</span> arrange<span class="punctuation">(</span><span class="operator">-</span>rank<span class="punctuation">)</span> <span class="operator">%>%</span> head<span class="punctuation">(</span><span class="number">20</span><span class="punctuation">)</span></span><br></pre></td></tr></table></figure></div><div class="highlight-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## subject rank</span></span><br><span class="line"><span class="comment">## 1 re: apple sauced...again 36852.14</span></span><br><span class="line"><span class="comment">## 2 re: apple sauced...again 36852.14</span></span><br><span class="line"><span class="comment">## 3 sed /s/united states/roman empire/g 33739.40</span></span><br><span class="line"><span class="comment">## 4 re: selling wedded bliss (was re: ouch...) 25141.77</span></span><br><span class="line"><span class="comment">## 5 re: selling wedded bliss (was re: ouch...) 
25141.77</span></span><br><span class="line"><span class="comment">## 6 re: new sequences window 22509.49</span></span><br><span class="line"><span class="comment">## 7 [lockergnome windows daily] fraud wipes 21055.61</span></span><br><span class="line"><span class="comment">## 8 [lockergnome windows daily] fraud wipes 21023.90</span></span><br><span class="line"><span class="comment">## 9 [lockergnome windows daily] brilliant mistakes 21004.60</span></span><br><span class="line"><span class="comment">## 10 [lockergnome penguin shell] recursive metaphor 20823.06</span></span><br><span class="line"><span class="comment">## 11 [lockergnome windows daily] cranky beats 20449.50</span></span><br><span class="line"><span class="comment">## 12 re: comrade communism (was re: crony capitalism (was re: sed 20073.10</span></span><br><span class="line"><span class="comment">## 13 [lockergnome windows daily] deeper uplink 20043.92</span></span><br><span class="line"><span class="comment">## 14 bush covers the waterfront 19548.67</span></span><br><span class="line"><span class="comment">## 15 [lockergnome digital media] clever ritual 19499.38</span></span><br><span class="line"><span class="comment">## 16 [lockergnome digital media] clever ritual 19467.67</span></span><br><span class="line"><span class="comment">## 17 [lockergnome windows daily] dignity shakedown 19325.42</span></span><br><span class="line"><span class="comment">## 18 [lockergnome penguin shell] good hearts 19295.45</span></span><br><span class="line"><span class="comment">## 19 [lockergnome windows daily] dignity shakedown 19293.71</span></span><br><span class="line"><span class="comment">## 20 [lockergnome windows daily] sticker courtesy 19132.87</span></span><br></pre></td></tr></table></figure></div><p>Well over half of these top-ranked emails (14 of 20) come from inactive threads, because their subject does not contain "re:". This shows that the ranking algorithm also applies the weights other than the subject weight to the data.<br>Although the accuracy of this unsupervised ranking algorithm cannot be measured, the result is still very encouraging.</p>]]></content>
<summary type="html"><h1 id="1、数据准备"><a href="#1、数据准备" class="headerlink" title="1、数据准备"></a>1、数据准备</h1><p>数据下载:<a class="link" href="https://links.jianshu.com</summary>
<category term="Programing" scheme="https://www.cobaltyang.me/categories/Programing/"/>
<category term="R" scheme="https://www.cobaltyang.me/tags/R/"/>
</entry>
</feed>