try to generate candidates for comparison #86
A caching mechanism is important so that we do not re-compute everything from scratch every time. |
I now tried a different approach:
1. I query sentences with the following query: "text:(<object> AND vs)", where <object> is "python", for example.
2. I take the nouns (NN) where the following pattern matches: ( (vs|vs.) candidate | candidate (vs|vs.) )
These two steps alone deliver quite good comparison candidates (for python):
[('perl', 22), ('java', 15), ('php', 13), ('ruby', 9), ('alligator', 7), ('c', 6), ('lua', 4), ('r', 3), ('julia', 2), ('c++', 2), ('haskell', 2), ('crocodile', 1), ('tiger', 1), ('cat', 1), ('deer', 1), ('kruger', 1), ('gator', 1), ('qml', 1), ('ptrace', 1), ('jlizard', 1), ('visual', 1), ('dog', 1), ('kc', 1), ('scheme', 1), ('javascript', 1)]
I still need to filter out some candidates like "kruger"; do you think common hypernyms (WordNet) could be helpful for this? Are there standard functions to get common hypernyms for words? I only found one for synsets...
Another approach could be to query the sentences belonging to the object and a candidate and count the sentences classified as "BETTER" or "WORSE", but that is very costly. What do you think about the first two steps? |
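The two extraction steps in the comment above can be sketched roughly like this. This is a token-level sketch: retrieving the sentences from the index and restricting candidates to nouns (NN) via a POS tagger are assumed to happen elsewhere, and the tiny corpus is invented for illustration.

```python
import re
from collections import Counter

def extract_vs_candidates(obj, sentences):
    """Count words that occur directly next to 'vs'/'vs.' together with
    the object word. A real version would additionally keep only nouns
    (NN) via a POS tagger."""
    # matches "<candidate> vs <obj>" or "<obj> vs <candidate>"
    left = re.compile(r"(\w[\w+#]*)\s+vs\.?\s+" + re.escape(obj), re.IGNORECASE)
    right = re.compile(re.escape(obj) + r"\s+vs\.?\s+(\w[\w+#]*)", re.IGNORECASE)
    counts = Counter()
    for sentence in sentences:
        for pattern in (left, right):
            for match in pattern.finditer(sentence):
                counts[match.group(1).lower()] += 1
    return counts.most_common()

# tiny invented corpus standing in for the Elasticsearch query results
sentences = [
    "Python vs Perl for scripting",
    "I compared perl vs python last week",
    "java vs python: which is faster?",
]
print(extract_vs_candidates("python", sentences))  # [('perl', 2), ('java', 1)]
```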
WordNet seems not to be useful for the python example. After filtering out all candidates that do not share a common hypernym with python, only these are left:
[('java', 13), ('alligator', 4), ('ruby', 4), ('scheme', 1), ('crocodile', 1), ('tiger', 1), ('cat', 1), ('gator', 1), ('kc', 1), ('deer', 1)]
For many candidates there was no hypernym at all (e.g. perl, lua, c++), as can also be verified here: http://wordnetweb.princeton.edu/perl/webwn |
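The hypernym filter described above can be sketched as follows. The hypernym mapping here is a hand-made toy stand-in, NOT real WordNet data; in practice it would be filled from WordNet (e.g. via nltk.corpus.wordnet, collecting the hypernyms of every synset of a word), which is also why candidates like perl with no WordNet entry drop out entirely.

```python
def filter_by_common_hypernym(obj, candidates, hypernyms):
    """Keep only the (candidate, count) pairs whose candidate shares at
    least one hypernym with `obj`. `hypernyms` maps a word to the union
    of the hypernyms of all of its senses."""
    obj_hypernyms = hypernyms.get(obj, set())
    return [(cand, n) for cand, n in candidates
            if hypernyms.get(cand, set()) & obj_hypernyms]

# invented hypernym sets for illustration only
hypernyms = {
    "python": {"reptile", "snake"},
    "alligator": {"reptile", "crocodilian"},
    "perl": set(),   # no hypernyms found at all, as noted above
    "kruger": set(),
}
candidates = [("alligator", 7), ("perl", 22), ("kruger", 1)]
print(filter_by_common_hypernym("python", candidates, hypernyms))  # [('alligator', 7)]
```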
I would expect that WordNet is not useful - its coverage is quite limited.
However, distributional models can be useful.
Here you will find the word similarities (AKA the distributional thesaurus, JoBimText) computed exactly from our corpus:
http://ltdata1.informatik.uni-hamburg.de/depcc/distributional-models/dependency_lemz-true_cooc-false_mxln-110_semf-true_sign-LMI_wpf-1000_fpw-1000_minw-5_minf-5_minwf-2_minsign-0.0_nnn-200/SimPruned/
You can try to use them to generate more candidates.
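Using the distributional thesaurus to generate additional candidates could look roughly like this. The line format 'word1&lt;TAB&gt;word2&lt;TAB&gt;score' is an assumption about the SimPruned files, and the miniature DT fragment is invented for illustration.

```python
import io
from collections import defaultdict

def load_dt(fileobj):
    """Parse distributional-thesaurus lines, assumed to be of the form
    'word1<TAB>word2<TAB>score'; adjust the parsing if the actual
    SimPruned files are laid out differently."""
    neighbours = defaultdict(list)
    for line in fileobj:
        word1, word2, score = line.rstrip("\n").split("\t")
        neighbours[word1].append((word2, float(score)))
    for word in neighbours:
        # most similar neighbours first
        neighbours[word].sort(key=lambda pair: -pair[1])
    return neighbours

def similar_candidates(neighbours, word, n=10):
    """Return the n most similar words as additional candidates."""
    return [w for w, _ in neighbours.get(word, [])[:n]]

# miniature invented DT fragment for illustration
raw = "python\tperl\t120.5\npython\truby\t98.0\npython\tsnake\t45.2\n"
dt = load_dt(io.StringIO(raw))
print(similar_candidates(dt, "python", n=2))  # ['perl', 'ruby']
```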
|
For the second filter approach, the following comparison candidates were selected. They are sorted by the number of comparative sentences found for python and the candidate. If only candidates with more than 40 comparative sentences are shown, probably the best are presented: only r, c++, matlab and go are missed, so 60% are found, and in addition some more candidates are found which could also be interesting. |
I find this approach very interesting. It would be really great to show more examples of these…
Use the DT JoBimText to filter in step 3 (see my other mail for details).
|
Looks quite good. Actually, I think it is already worth deploying (to see how it works more realistically and to be able to play with it…)
|
Thanks for the hint about JoBimText; I hope it is easy to figure out how to use it. About deploying: at the moment it is not really operating in real time; it takes about 15 seconds to process steps 1 and 2, and the filtering (step 3) using the BoW classifier feature set takes minutes. Maybe it is something we have to do beforehand: taking seed words from different domains and searching for comparison candidates, then continuing with the candidates, and so on. We could then save the results to a DB or the file system. |
On Oct 29, 2018, at 5:29 PM, Matthias Schildwächter wrote:
> Thanks for the hint about JoBimText; I hope it is easy to figure out how to use it.
The files are all big, but you can trim them considerably by sorting all the values by their scores, keeping some 20% of the top entries, and removing the remaining 80% of the word pairs.
> About deploying: at the moment it is not really operating in real time; it takes about 15 seconds to process steps 1 and 2, and the filtering (step 3) using the BoW classifier feature set takes minutes.
Ok, maybe later then.
> Maybe it is something we have to do beforehand: taking seed words from different domains and searching for comparison candidates. Afterwards, we continue with the candidates, and so on. We could then save the results to a DB or the file system.
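The suggested pruning - sort all pairs by score and keep only the top 20% - can be sketched as below; the triple layout and the toy pairs are assumptions for illustration.

```python
def trim_dt(entries, keep_fraction=0.2):
    """Sort (word1, word2, score) triples by score, descending, and keep
    only the top `keep_fraction` of them, discarding the rest."""
    ranked = sorted(entries, key=lambda triple: -triple[2])
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

# invented similarity pairs for illustration
pairs = [
    ("python", "perl", 120.5),
    ("python", "dog", 1.2),
    ("python", "ruby", 98.0),
    ("python", "kc", 0.4),
    ("python", "java", 77.1),
]
print(trim_dt(pairs))  # 20% of 5 entries -> only the highest-scored pair
```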
|
Do you maybe have an example of how to use the JoBimText DT? That would be great, since that is the last part of the thesis I need to write about (and create content to write about), and I would like to send you the first draft including this part at the end of this week. I have to set up a local database to use it, right? The trim operation can be achieved using this, right? http://ltmaggie.informatik.uni-hamburg.de/jobimtext/documentation/pruning/ |
No, in your case just download the files I linked to (a bunch of archives). You will get a huge set of triples word1:word2:similarity. I would index them using Elasticsearch and use them at stage 3. The JoBimText model includes many more parts that you do not need; the part you need is the one called the DT.
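Indexing the triples could go through Elasticsearch's _bulk endpoint. The sketch below only builds the newline-delimited JSON request body; the index name and field names are assumptions, not fixed by the thread, and the resulting payload would be POSTed to the cluster's /_bulk URL with Content-Type: application/x-ndjson.

```python
import json

def bulk_payload(triples, index="dt"):
    """Build the NDJSON body for Elasticsearch's _bulk endpoint from
    (word1, word2, similarity) triples: one action line followed by one
    document line per triple."""
    lines = []
    for word1, word2, similarity in triples:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(
            {"word1": word1, "word2": word2, "similarity": similarity}))
    return "\n".join(lines) + "\n"

body = bulk_payload([("python", "perl", 120.5)])
print(body)
```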
|
Alright, thank you very much for the clarification. I thought I had to understand how to set up and use JoBimText now. |
Given a word like 'python', generate the list of candidates, as in Google's 'python vs ...':
1. Get all sentences containing the target word (python).
2. Classify them (first word = python, second word = the last/first noun in the sentence, text = the input sentence), OR classify them iterating over all nouns in the sentence (first word = python, second word = the i-th noun in the sentence, text = the input sentence).
3. Rank the nouns found in the sentences by the total number of comparative sentences (with a high threshold). No normalization is needed - just take the raw sentence counts.
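The classification and ranking steps above can be sketched like this. Here `classify` is a trivial stand-in for the BoW comparative-sentence classifier (returning True for BETTER/WORSE-style sentences) and `tagged_sentences` stands in for POS-tagged retrieval results; both, and the toy data, are simplifications for illustration.

```python
from collections import Counter

def rank_candidates(obj, tagged_sentences, classify, min_count=2):
    """For every sentence containing `obj`, iterate over its nouns,
    classify each (obj, noun, sentence) triple, and count per noun how
    many sentences were classified as comparative. Return raw counts
    above an absolute threshold, with no normalization."""
    counts = Counter()
    for text, nouns in tagged_sentences:
        for noun in nouns:
            if noun != obj and classify(obj, noun, text):
                counts[noun] += 1
    return [(word, c) for word, c in counts.most_common() if c >= min_count]

# toy data and a trivial stand-in classifier
sentences = [
    ("python is faster than perl", ["python", "perl"]),
    ("python vs perl", ["python", "perl"]),
    ("the python slept at the zoo", ["python", "zoo"]),
]
is_comparative = lambda obj, cand, text: " vs " in text or " than " in text
print(rank_candidates("python", sentences, is_comparative))  # [('perl', 2)]
```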