This code has been developed on a Debian-based system, so to use it on a Windows machine just look at the places where filenames are mentioned. If you're short on time, replace '/' with '\'; if you have enough time, import the os module, use os.path.join(), and push it back for the greater good. If you're using a Mac, you're on your own.
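For example, a quick sketch of the os.path.join() approach, using the Out/linkSub.csv path mentioned below:
import os
# Produces 'Out/linkSub.csv' on Linux/Mac and 'Out\linkSub.csv' on Windows
path = os.path.join('Out', 'linkSub.csv')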
Use the following command to install the dependencies:
pip install -r req.txt
Now you need to download some data packages for NLTK. You can simply run these commands in a Python console; the exact package names depend on the scripts, but 'punkt' and 'stopwords' fit the preprocessing steps described below:
import nltk
nltk.download('punkt')      # tokenizer data (assumed package; adjust as needed)
nltk.download('stopwords')  # stop-word lists used in preprocessing
There will be an Out/linkSub.csv file that maps each subtitle or tutorial filename to its link, which is helpful in interpreting the results.
Run gettingSub.py using the following command and give keywords like "Html Tutorials" when asked:
python gettingSub.py
You can also use this feature in other Python scripts by importing the class ScrapSubs.
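A hypothetical usage sketch; only the class name ScrapSubs is given here, so the constructor argument is an assumption:
# The search-keywords argument below is an assumption; check gettingSub.py for the real interface
from gettingSub import ScrapSubs
subs = ScrapSubs("Html Tutorials")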
Run gettingTut.py using the following command and give a base URL like "https://www.tutorialspoint.com/javascript/javascript_overview.htm" when asked:
python gettingTut.py
You can also use this feature in other Python scripts by importing the class scrapTutorials.
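Again a hypothetical usage sketch; the constructor argument is an assumption:
# The base-URL argument below is an assumption; check gettingTut.py for the real interface
from gettingTut import scrapTutorials
tut = scrapTutorials("https://www.tutorialspoint.com/javascript/javascript_overview.htm")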
In preprocessing I go through the following steps (a code sketch follows the list):
1) Also considering subtopics from an HTML file from tutorialspoint.com
2) Converting the string to lowercase
3) Removing non-alphanumeric characters
4) Removing stopwords
5) Removing some parts of speech which have little impact
6) Removing the most common YouTube filler words like "guys", "yeah", etc.
7) You can get the most common YouTube words by downloading subtitles from many tutorials and getting their distribution from the function commonAll()
8) Stemming each word to its root so that "function" and "functions" are not treated as different words
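Here is a minimal sketch of steps 2-8 with NLTK; the repo's preProcess() may differ, and the POS tags and filler-word set below are assumptions (pos_tag also needs the 'averaged_perceptron_tagger' package):
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess_sketch(text):
    text = text.lower()                                # step 2
    text = re.sub(r'[^a-z0-9\s]', ' ', text)           # step 3
    tokens = nltk.word_tokenize(text)
    stops = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stops]     # step 4
    # Step 5: which POS tags have "no impact" is an assumption here
    tokens = [w for w, tag in nltk.pos_tag(tokens)
              if not tag.startswith(('DT', 'IN', 'CC'))]
    filler = {'guys', 'yeah', 'okay'}                  # step 6 (example set)
    tokens = [t for t in tokens if t not in filler]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens]         # step 8
    return ' '.join(tokens), tokens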
The file gettingTut.py contains many relevant functions.
The function preProcess() returns the preprocessed text in string format as well as list format.
You can import the functions like this:
from langProcessing import docParse, preProcess, cwords, subsParse, get_cosine, column
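A hypothetical call, based on the description above (the exact signature is an assumption):
from langProcessing import preProcess
raw_text = "Okay guys, in this video we look at JavaScript functions"
clean_string, clean_tokens = preProcess(raw_text)  # string form and token-list form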
This is the most common algorithm to compare two documents: build a bag-of-words vector for each document, then compute the cosine similarity between them.
In this script I have also compared heading to heading and haven't normalised the score, so don't freak out if you see a score of more than 1 :p
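For reference, a minimal self-contained sketch of bag-of-words cosine similarity; the repo's get_cosine() may be implemented differently:
import math
from collections import Counter

def cosine_sketch(text1, text2):
    # Word counts act as the bag-of-words vectors
    v1, v2 = Counter(text1.split()), Counter(text2.split())
    # Dot product over the words both documents share
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0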
Run docSimilarity.py using the following command. It will save the results in a CSV file where the first column contains our tutorialspoint filename and subsequent columns represent the closest YouTube videos:
python docSimilarity.py
In topic modeling we get the topics of a given document. It uses the LDA model: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
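A minimal LDA sketch with gensim; the choice of gensim is an assumption, since the README doesn't name the library the script uses:
# Each document is a list of preprocessed tokens (toy data below)
from gensim import corpora
from gensim.models import LdaModel

docs = [['html', 'tag', 'element'], ['function', 'variable', 'loop']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())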
As with docSimilarity.py, heading-heading comparisons are included and the score isn't normalised, so scores of more than 1 are expected here too.
Run topicSim.py using the following command. It will save the results in a CSV file where the first column contains our tutorialspoint filename and subsequent columns represent the closest YouTube videos:
python topicSim.py
Just check the filename against its link using ref.csv.