allaouiamine/RI-Project

OK, guys, we've got plenty to do.
First, congrats on getting to grips with Git. It's not that easy ;)
Next, we'll divide up the work. Check back here this evening for further instructions!

Our formula to work with:
w_{t,d}=\frac{\log(1+tf_{t,d})}{\sqrt{\sum_{i=1}^{N} w_{i,d}^{2}}}\cdot\frac{N}{df_{t}}
BM25: do we need it?
w_{t,d}=\frac{tf_{t,d}\cdot(k_{1}+1)}{k_{1}\cdot((1-b)+b\cdot\frac{dl}{avdl})+tf_{t,d}}\cdot\log(\frac{N-df_{t}+0.5}{df_{t}+0.5})
The formulas are written in LaTeX; you can render them here:
http://www.codecogs.com/latex/eqneditor.php
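
To make the two weightings easier to compare, here is a rough Python sketch of both. It is only an illustration: the function names are made up, the defaults k1=1.2 and b=0.75 are commonly used values rather than anything we agreed on, and for the first formula the square-root term is assumed to be the Euclidean norm of the document's log(1 + tf) values. BM25 is written in its standard form, where k_1 multiplies the length-normalization term.

```python
import math


def tfidf_weights(term_freqs, doc_freqs, n_docs):
    """Weights of every term of one document under the first formula.

    term_freqs: {term: tf_{t,d}} for the document
    doc_freqs:  {term: df_t} over the whole collection
    n_docs:     N, the number of documents
    Assumption: the denominator is the Euclidean norm of the
    log(1 + tf) values of this document.
    """
    log_tf = {t: math.log(1 + tf) for t, tf in term_freqs.items()}
    norm = math.sqrt(sum(v * v for v in log_tf.values()))
    if norm == 0:
        return {}
    return {t: (v / norm) * (n_docs / doc_freqs[t]) for t, v in log_tf.items()}


def bm25_weight(tf, df, n_docs, dl, avdl, k1=1.2, b=0.75):
    """BM25 weight of one term in one document (standard form).

    dl is the document length, avdl the average document length.
    k1 and b are tuning parameters; the defaults are assumptions.
    """
    length_norm = (1 - b) + b * dl / avdl
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf * (k1 + 1) / (k1 * length_norm + tf) * idf
```

Note that the BM25 idf part goes negative once a term appears in more than half of the documents, which is worth keeping in mind when deciding whether we need it.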

So, first: load all the documents and build a dictionary of all the words.
I've got some code in C# that works fine; someone should port it:
        // "using" closes the reader even if a line fails to parse
        using (var fs = new StreamReader(Path.GetFileName(url) + ".txt"))
        {
            while (!fs.EndOfStream)
            {
                ProceedOneLine(fs.ReadLine());
            }
        }
													  
        /// <summary>
        /// Process one line: analyze it and fill the dictionary with statistics.
        /// </summary>
        /// <param name="line">Line of the file to process.</param>
        private void ProceedOneLine(String line)
        {
            String current = line.Trim(); // trim surrounding whitespace
            if (current.Length == 0) return; // the line held only whitespace: skip it

            // Tag lines look like:
            //<doc><docno>12345</docno>
            //...
            if (current[0] == '<') // test the trimmed line, so indented tags are recognized too
            {
                if (current.Contains("</doc>")) return; // end of document

                int start = current.IndexOf("no>") + 3; // just past the opening <docno>
                int end = current.IndexOf("</do"); // start of the closing </docno>
                if (end < start) return; // tag line without a document number
                // parse the document number found between the two tags
                Int32.TryParse(current.Substring(start, end - start), out _currentDocument.number);
                DocumentsNumber++;
            }
            else
            {
                // lock the dictionary for thread safety
                lock (GlobalStatistic)
                {
                    var words = current.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries);

                    // trim punctuation from each token to get the real word, and lower-case it
                    var punctuation = new[] { ',', '.', ':', ';', '!', '?', '"', ')', '(', '\'' };
                    foreach (var w in words.Select(word => word.Trim(punctuation).ToLower()))
                    {
                        // the token was punctuation only: skip it
                        if (w == "") continue;

                        // the word is already in the dictionary
                        if (GlobalStatistic.ContainsKey(w))
                        {
                            // check whether the current document is already listed for it
                            if (GlobalStatistic[w].ContainsKey(_currentDocument.number))
                            {
                                // if so, increment the count for that document
                                GlobalStatistic[w][_currentDocument.number]++;
                            }
                            else // first occurrence of the word in this document: start at 1
                            {
                                GlobalStatistic[w].Add(_currentDocument.number, 1);
                            }
                        }
                        else // no statistics for this word yet: create the entry
                        {
                            GlobalStatistic.Add(w, new Dictionary<int, int> {{_currentDocument.number, 1}});
                        }
                    }
                }
            }
        }
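
As a starting point for the port, here is a minimal Python sketch of the same idea. Python is just an assumption, since we haven't picked a target language, and the names build_index, PUNCTUATION and DOCNO are made up for this example. It keeps the same word -> {document number: count} layout as GlobalStatistic, but finds the document number with a regular expression instead of index arithmetic, splits on any whitespace rather than on spaces only, and leaves out the locking (single-threaded).

```python
import re
from collections import defaultdict

# Same punctuation set that the C# code trims from each token.
PUNCTUATION = ",.:;!?\"()'"
# Matches header lines such as "<doc><docno>12345</docno>".
DOCNO = re.compile(r"<docno>\s*(\d+)\s*</docno>")


def build_index(lines):
    """Return (index, n_docs); index maps word -> {doc number: count}."""
    index = defaultdict(lambda: defaultdict(int))
    current_doc = None
    n_docs = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank line
        if line.startswith("<"):
            match = DOCNO.search(line)
            if match:  # document header: remember its number
                current_doc = int(match.group(1))
                n_docs += 1
            continue  # "</doc>" and other tag lines carry no words
        for token in line.split():
            word = token.strip(PUNCTUATION).lower()
            if word:  # skip tokens that were punctuation only
                index[word][current_doc] += 1
    return index, n_docs
```

Since an open file iterates over its lines, something like build_index(open(path, encoding="utf-8")) should work for one collection file; len(index[word]) then gives df for that word, and n_docs gives N.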
		
		
