forked from astrohome/RI-Project
OK, guys, we've got plenty to do.
First, congrats on dealing with Git. It's not that easy ;)
Next, we'll divide up our work. Check here this evening for further instructions!
Our formula to work with:
w_{t,d}=\frac{\log(1+tf_{t,d})}{\sqrt{\sum_{i=1}^{N} w_{i,d}^{2}}}\cdot\frac{N}{df_{t}}
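To make the formula above concrete, here is a minimal Python sketch of it for a single document. It rests on assumptions that are not in this README: the function name tf_idf_weights and its arguments are made up for illustration, the normalizer is taken to be the Euclidean length of the document's log(1 + tf) vector, and Python itself is just a guess at the language we port to.

```python
import math


def tf_idf_weights(term_counts, doc_freq, n_docs):
    """Return {term: weight} for one document.

    term_counts: {term: tf of the term in this document}
    doc_freq:    {term: number of documents containing the term (df_t)}
    n_docs:      total number of documents (N)
    """
    # Dampened term frequencies: log(1 + tf).
    log_tf = {t: math.log(1 + tf) for t, tf in term_counts.items()}
    # Assumption: normalize by the Euclidean length of the log-tf vector.
    norm = math.sqrt(sum(v * v for v in log_tf.values()))
    if norm == 0:
        return {}  # empty document, nothing to weight
    # The N / df_t factor is used exactly as written in the formula (no log).
    return {t: (v / norm) * (n_docs / doc_freq[t]) for t, v in log_tf.items()}
```

A term that appears in fewer documents gets the larger weight when the term frequencies are equal, which is the behaviour we want from the N / df_t factor.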
BM25: do we need it??
w_{t,d}=\frac{tf_{t,d}\cdot(k_{1}+1)}{k_{1}\cdot((1-b)+b\cdot\frac{dl}{avdl})+tf_{t,d}}\cdot\log(\frac{N-df_{t}+0.5}{df_{t}+0.5})
The formulas are written in LaTeX; you can visualize them here:
http://www.codecogs.com/latex/eqneditor.php
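If we do end up needing BM25, a rough Python sketch of the weight for one term in one document could look like this. It follows the standard BM25 form, in which k_1 multiplies the length-normalization term. The function name bm25_weight and the defaults k1=1.2 and b=0.75 are assumptions (commonly used values, not something decided in this README); dl is the document length and avdl the average document length.

```python
import math


def bm25_weight(tf, df, n_docs, dl, avdl, k1=1.2, b=0.75):
    """BM25 weight of a term with frequency tf in a document of length dl.

    k1 and b are tuning parameters; the defaults are assumed, not project values.
    """
    # Inverse document frequency part: log((N - df + 0.5) / (df + 0.5)).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: 1.0 for an average-length document.
    length_norm = (1 - b) + b * dl / avdl
    # Saturating term-frequency part.
    return tf * (k1 + 1) / (k1 * length_norm + tf) * idf
```

Note that the log factor becomes negative when a term occurs in more than half of the documents, so very common words may need special handling.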
So, first: load all documents and build a dictionary of all words.
I've got some code in C# that works fine; someone should port it:
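// Needs: using System; using System.IO; using System.Linq; using System.Collections.Generic;
// Read the file line by line; "using" closes the reader even if an exception is thrown.
using (var fs = new StreamReader(Path.GetFileName(url) + ".txt"))
{
    while (!fs.EndOfStream)
    {
        ProceedOneLine(fs.ReadLine());
    }
}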
/// <summary>
/// Processes one line: parses it and fills the dictionary with word statistics.
/// </summary>
/// <param name="line">Line of the file to process.</param>
private void ProceedOneLine(String line)
{
    String current = line.Trim(); // trim spaces
    if (current.Length == 0) return; // the line may contain only spaces - skip it
    // A document header looks like: <doc><docno>12345</docno>
    // ...
    if (current[0] == '<')
    {
        if (current.Contains("</doc>")) return; // end of document
        int start = current.IndexOf("no>") + 3; // first character of the document number
        int end = current.IndexOf("</do"); // position right after the number
        if (end < start) return; // a tag without a document number - skip it
        // extract the document number and convert it to Int32
        Int32.TryParse(current.Substring(start, end - start), out _currentDocument.number);
        DocumentsNumber++;
    }
    else
    {
        // lock for thread safety
        lock (GlobalStatistic)
        {
            var words = current.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries);
            // trim punctuation to get the real words and convert everything to lowercase
            foreach (var w in words.Select(word => word.Trim(new[] { ',', '.', ':', ';', '!', '?', '"', ')', '(', '\'' }).ToLower()))
            {
                // if it was only punctuation, skip it
                if (w == "") continue;
                // the word is already in the dictionary
                if (GlobalStatistic.ContainsKey(w))
                {
                    // check whether the current document is already listed for this word
                    if (GlobalStatistic[w].ContainsKey(_currentDocument.number))
                    {
                        // if so, increment the count for that document
                        GlobalStatistic[w][_currentDocument.number]++;
                    }
                    else // otherwise add the document with count 1 - first occurrence of the word here
                    {
                        GlobalStatistic[w].Add(_currentDocument.number, 1);
                    }
                }
                else // no statistics for this word yet -> create them
                {
                    GlobalStatistic.Add(w, new Dictionary<int, int> {{_currentDocument.number, 1}});
                }
            }
        }
    }
}
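For whoever picks up the port, here is one possible shape of the same logic in Python. Treat it as a sketch under assumptions, not the agreed design: Python as the target language, the name build_index, and returning the statistics instead of updating a shared GlobalStatistic are all choices made for illustration. It also skips header lines whose document number is not made of digits, which the C# version does not do.

```python
from collections import defaultdict

# Same punctuation the C# code trims from both ends of a word.
PUNCTUATION = ",.:;!?\"()'"


def build_index(lines):
    """Return (stats, n_docs), where stats[word][doc_number] is a count."""
    stats = defaultdict(lambda: defaultdict(int))
    current_doc = None  # number of the document being read
    n_docs = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank line
        if line.startswith("<"):
            if "</doc>" in line:
                continue  # end of document
            # Header such as <doc><docno>12345</docno>
            start = line.find("no>")
            end = line.find("</do")
            if start != -1 and end >= start + 3:
                number = line[start + 3:end]
                if number.isdigit():
                    current_doc = int(number)
                    n_docs += 1
            continue
        for token in line.split():
            word = token.strip(PUNCTUATION).lower()
            if word:  # skip tokens that were only punctuation
                stats[word][current_doc] += 1
    return stats, n_docs
```

With this layout, len(stats[word]) is the document frequency df_t of a word and n_docs is N, so both weighting formulas can be computed straight from the result.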