try to generate candidates for comparison #86

alexanderpanchenko · 2018-08-08T14:22:32Z

Given a word such as 'python', generate a list of comparison candidates, as Google does for 'python vs ...'.

Get all sentences containing the target word (python).
Classify them (first word = python, second word = last / first noun in the sentence, text = input sentence), OR classify them by iterating over all nouns in the sentence (first word = python, second word = i-th noun in the sentence, text = input sentence).
Rank the nouns found in the sentences by the total number of comparative sentences (with a high threshold). No normalization is needed; just take the raw sentence counts.
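The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `nouns_of` and `is_comparative` are hypothetical stand-ins for a real POS tagger and the project's sentence classifier, and the toy sentences are made up.

```python
from collections import Counter

def rank_candidates(target, sentences, nouns_of, is_comparative, min_count=1):
    """Rank nouns by the raw number of comparative sentences shared with `target`."""
    counts = Counter()
    for sentence in sentences:
        if target not in sentence.lower().split():
            continue  # step 1: keep only sentences containing the target word
        # step 2: iterate over all nouns, counting each noun once per sentence
        for noun in set(nouns_of(sentence)) - {target}:
            if is_comparative(target, noun, sentence):
                counts[noun] += 1
    # step 3: raw counts, no normalization, with a minimum-count threshold
    return [(noun, c) for noun, c in counts.most_common() if c >= min_count]

# Hypothetical stand-ins (assumptions, not the project's components).
TOY_NOUNS = {"python", "java", "ruby"}
CUE_WORDS = {"better", "worse", "faster", "slower"}

def toy_nouns_of(sentence):
    return [t for t in sentence.lower().split() if t in TOY_NOUNS]

def toy_is_comparative(first, second, sentence):
    return any(t in CUE_WORDS for t in sentence.lower().split())

toy_sentences = [
    "python is better than java",
    "java is slower than python",
    "python and ruby are nice",
    "python is faster than ruby",
    "i like java",
]
ranking = rank_candidates("python", toy_sentences, toy_nouns_of, toy_is_comparative)
high_threshold = rank_candidates("python", toy_sentences, toy_nouns_of,
                                 toy_is_comparative, min_count=2)
```

Raising `min_count` corresponds to the "high threshold" mentioned above.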

alexanderpanchenko · 2018-08-30T12:43:27Z

A caching mechanism is important so that we do not recompute everything from scratch every time.
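One simple way to get such a cache is a JSON file keyed by the query word. The sketch below is an assumption about how this could look, not the project's actual mechanism; `CandidateCache` and `compute` are hypothetical names, and the cached values must be JSON-serializable.

```python
import json
from pathlib import Path

class CandidateCache:
    """File-backed cache: candidates for a word are computed at most once."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier results if the cache file already exists.
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get_or_compute(self, word, compute):
        if word not in self.data:
            self.data[word] = compute(word)  # the expensive call
            self.path.write_text(json.dumps(self.data))
        return self.data[word]
```

A second `CandidateCache` opened on the same path reuses the stored results, so a restart of the service does not trigger recomputation.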

mschildw · 2018-10-29T10:43:18Z

I now tried a different approach:

I query sentences with the following query: "text:(<object> AND vs)", where <object> is, for example, "python".
I take the nouns (NN) that match the following pattern: ( (vs|vs.) candidate | candidate (vs|vs.) )
These two steps alone deliver quite good comparison candidates (for python):
[('perl', 40), ('java', 23), ('ruby', 22), ('php', 19), ('boa', 16), ('alligator', 15), ('julia', 14), ('net', 9), ('c++', 6), ('visual', 5), ('javascript', 4), ('gatoroid', 2), ('crocodile', 2), ('ruby ruby', 2), ('matlab gc', 2), ('brython', 2), ('cat', 2), ('lua', 2), ('qml', 2), ('jython', 1), ('lisp', 1), ('arc', 1), ('tiger', 1), ('rhinoscript', 1), ("print 'weave", 1), ('matlab/eeglab', 1), ('node', 1), ('python programs', 1), ('aqueon', 1), ('africanized honeybee', 1), ('gator', 1), ('gql', 1), ('profiling pypy', 1), ('scheme', 1), ('alligator watch', 1), ('deer', 1), ('octave', 1), ('nspr', 1), ('stones', 1), ('jlizard', 1), ('thinking upside down ruby', 1), ('ruby deathmatch', 1), ('kruger', 1), ('ruby performance', 1), ('cockatoo photos', 1), ('python-novaclient', 1), ('prothon', 1), ('film boa', 1), ('cython', 1), ('sas', 1), ("print 'f2py", 1), ('pycuda', 1)]
I still need to filter out some candidates such as "kruger". Do you think common hypernyms (WordNet) could be helpful for this? Are there standard functions to get common hypernyms for words? I only found one for synsets...

Another approach could be to query the sentences containing both the object and a candidate and count the sentences classified as "BETTER" or "WORSE", but that is very costly.

What do you think about the first 2 steps?
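For reference, the first two steps could look roughly like this once the sentences have been retrieved. It is a simplified sketch: the index query is replaced by a plain list of sentences, and the noun (NN) check is omitted, so any token directly next to "vs" / "vs." is counted. The function name and the example sentences are made up for illustration.

```python
from collections import Counter

def vs_candidates(obj, sentences):
    """Count tokens that appear directly before or after 'vs' / 'vs.'."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and strip surrounding punctuation ("vs." becomes "vs").
        tokens = [t.strip(".,;:!?()\"'").lower() for t in sentence.split()]
        if obj not in tokens:
            continue  # mirrors the "<object> AND vs" query
        for i, tok in enumerate(tokens):
            if tok != "vs":
                continue
            for j in (i - 1, i + 1):  # candidate vs | vs candidate
                if 0 <= j < len(tokens) and tokens[j] not in ("", obj):
                    counts[tokens[j]] += 1
    return counts.most_common()

example = vs_candidates("python", [
    "Python vs Perl: which one is faster?",
    "I compared Java vs. Python yesterday.",
    "Perl vs Python for scripting",
    "Ruby vs Java",
])
```

Because the sketch skips POS tagging, it would also count non-nouns next to "vs"; a tagger would be needed to reproduce the NN restriction.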

mschildw · 2018-10-29T12:48:09Z

WordNet does not seem to be useful for the python example. After filtering out all candidates that do not share a common hypernym with python, only these are left:
[('java', 23), ('ruby', 22), ('boa', 16), ('alligator', 15), ('net', 9), ('cat', 2), ('crocodile', 2), ('sas', 1), ('tiger', 1), ('lisp', 1), ('arc', 1), ('node', 1), ('stones', 1), ('octave', 1), ('deer', 1), ('gator', 1), ('scheme', 1)]

For many candidates there was no hypernym at all (e.g. perl, lua, c++), as can also be seen here: http://wordnetweb.princeton.edu/perl/webwn

alexanderpanchenko · 2018-10-29T15:58:41Z

I would expect WordNet not to be useful: its coverage is quite limited. However, distributional models can be useful. Here you will find the word similarities (a.k.a. the JoBimText distributional thesaurus) computed from exactly our corpus: http://ltdata1.informatik.uni-hamburg.de/depcc/distributional-models/dependency_lemz-true_cooc-false_mxln-110_semf-true_sign-LMI_wpf-1000_fpw-1000_minw-5_minf-5_minwf-2_minsign-0.0_nnn-200/SimPruned/ You can try to use them to generate more candidates.


mschildw · 2018-10-29T16:00:55Z

With the second filtering approach, the following comparison candidates were selected:
['lisp', 'lua', 'scheme', 'perl', 'visual', 'jython', 'net', 'cython', 'ruby', 'java', 'javascript', 'php', 'node', 'boa', 'julia', 'alligator', 'qml', 'python programs', 'cat', 'deer', 'crocodile', 'octave', 'tiger', 'arc', 'sas', 'gator', 'aqueon', 'prothon', 'ruby ruby', 'stones', 'brython', 'ruby performance', 'gql', 'nspr', 'pycuda']

They are sorted by the number of comparative sentences found for python and the candidate. If only candidates with more than 40 comparative sentences are shown, probably the best ones are presented:
['lisp', 'lua', 'scheme', 'perl', 'visual', 'jython', 'net', 'cython', 'ruby', 'java', 'javascript', 'php', 'node', 'boa', 'julia']

Comparing with Google:

Only r, c++, matlab and go are not found, so 60% are found; in addition, some further candidates are found that could also be interesting.
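The 60% figure is a recall: the share of the reference suggestions that also appear among the found candidates. A small sketch of that calculation follows. The `reference` list is hypothetical: only its four missing entries (r, c++, matlab, go) come from the comment above, and the other six are placeholders picked from the found list, not the actual Google suggestions.

```python
def recall(found, reference):
    """Return (share of reference items that were found, list of missing items)."""
    found_set = {w.lower() for w in found}
    missing = [w for w in reference if w.lower() not in found_set]
    return (len(reference) - len(missing)) / len(reference), missing

# Candidates with more than 40 comparative sentences, as listed above.
found = ['lisp', 'lua', 'scheme', 'perl', 'visual', 'jython', 'net', 'cython',
         'ruby', 'java', 'javascript', 'php', 'node', 'boa', 'julia']

# HYPOTHETICAL reference list: six placeholders plus the four known misses.
reference = ['java', 'r', 'c++', 'javascript', 'php',
             'ruby', 'matlab', 'go', 'perl', 'julia']

score, missing = recall(found, reference)
```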

alexanderpanchenko · 2018-10-29T16:01:18Z

I find this approach very interesting. It would be really great to show more examples of these… Use the JoBimText DT to filter in step 3 (see my other mail for details).


alexanderpanchenko · 2018-10-29T16:08:12Z

Looks quite good. Actually, I think it is already worth deploying it (to see how it works more realistically and to be able to play with it…)


mschildw · 2018-10-29T16:24:46Z

Thanks for the hint about JoBimText. I hope it is easy to figure out how to use it.

About deploying: at the moment it is not really operating in real time. It takes about 15 seconds to process steps 1 and 2. The filtering (step 3) using the BoW classifier feature set takes minutes.

Maybe it is something we have to do beforehand:
Take seed words from different domains and search for their comparison candidates. Afterwards, we continue with the candidates, and so on. We could then save the results to a DB or the file system.
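The offline precomputation idea could be sketched as a breadth-first expansion from the seed words. Everything here is an assumption for illustration: `get_candidates` stands for the slow candidate search (steps 1 to 3), and `max_words` / `top_k` are made-up limits to keep the expansion bounded.

```python
import json
from collections import deque

def precompute(seeds, get_candidates, max_words=1000, top_k=5):
    """Expand from seed words, storing the top candidates for every word visited."""
    table = {}
    queue = deque(seeds)
    while queue and len(table) < max_words:
        word = queue.popleft()
        if word in table:
            continue  # already processed
        table[word] = list(get_candidates(word))[:top_k]
        # Continue with the newly found candidates in the next rounds.
        queue.extend(c for c in table[word] if c not in table)
    return table

def save_table(table, path):
    """Persist the precomputed table so lookups at query time are instant."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(table, fh)

# Toy stand-in for the slow candidate search (hypothetical data).
toy = {"python": ["perl", "java"], "perl": ["python", "ruby"],
       "java": [], "ruby": ["python"]}
table = precompute(["python"], lambda w: toy.get(w, []))
```

At query time the service would then only read from the saved table instead of running the 15-second search.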

alexanderpanchenko · 2018-10-29T16:31:58Z

> Thanks for the hint with JoBim text, I hope it is easy to get, how to use it.

The files are all big, but you can trim them considerably by sorting all entries by score, keeping roughly the top 20% and removing the remaining 80% of the word pairs.

> About deploying, at the moment it is not really operating in real time, it takes about 15 seconds to process step 1 and 2. The filtering (step 3) using the BoW classifier features set takes minutes.

OK, maybe later then.
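The trimming suggested above (sort by score, keep about the top 20%) could be sketched like this for triples that fit in memory. The function name and the toy triples are hypothetical; for the full archives an external sort or a streaming approach would likely be needed.

```python
def prune_top_fraction(triples, keep=0.2):
    """Keep the highest-scoring fraction of (word1, word2, score) triples."""
    ranked = sorted(triples, key=lambda t: t[2], reverse=True)
    if not ranked:
        return []
    n_keep = max(1, int(len(ranked) * keep))  # always keep at least one entry
    return ranked[:n_keep]

# Toy triples with made-up integer scores 1..10.
toy_triples = [("python", f"w{i}", i) for i in range(1, 11)]
pruned = prune_top_fraction(toy_triples)
```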


mschildw · 2018-10-29T16:47:02Z

Do you maybe have an example of how to use the JoBimText DT?
That would be great, since this is the last part of the thesis I need to write about (and create content for), and I would like to send you the first draft including this part at the end of this week.

I have to set up a local database to use it, right?
The trim operation can be achieved using this: http://ltmaggie.informatik.uni-hamburg.de/jobimtext/documentation/pruning/ , right?

alexanderpanchenko · 2018-10-29T17:07:35Z

No, in your case just download the files I linked to (a bunch of archives). You will get a huge set of triples word1:word2:similarity. I would index them using Elasticsearch and use them at stage 3. The JoBimText model includes many more parts that you do not need. This is the part called DT.
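To show how such triples could be used at stage 3, here is a small in-memory stand-in for the Elasticsearch index. The file layout is an assumption (one tab-separated `word1 word2 similarity` triple per line; the real archives may differ), the function names are hypothetical, and the similarity values below are invented toy numbers.

```python
from collections import defaultdict

def load_dt(lines, sep="\t"):
    """Build word1 -> {word2: similarity} from triple lines; skip malformed ones."""
    dt = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) != 3:
            continue
        w1, w2, score = parts
        try:
            dt[w1][w2] = float(score)
        except ValueError:
            continue  # non-numeric score
    return dt

def filter_by_dt(target, candidates, dt, min_sim=0.0):
    """Keep (candidate, count) pairs whose candidate is a DT neighbour of target."""
    neighbours = dt.get(target, {})
    return [(c, n) for c, n in candidates
            if c in neighbours and neighbours[c] >= min_sim]

# Invented toy similarities, not taken from the real model.
toy_lines = ["python\tperl\t0.9", "python\tjava\t0.8",
             "python\tboa\t0.1", "malformed line"]
dt = load_dt(toy_lines)
kept = filter_by_dt("python",
                    [("perl", 40), ("java", 23), ("kruger", 1), ("boa", 16)],
                    dt, min_sim=0.5)
```

With a real index, `dt.get(target)` would become a lookup query for all triples whose first word is the target.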


mschildw · 2018-10-29T17:17:25Z

Alright, thank you very much for the clarification. I thought I had to understand how to set up and use JoBimText now.
I will have a look at how well that works for filtering the candidates, thanks!

alexanderpanchenko assigned mschildw Aug 30, 2018

alexanderpanchenko added the enhancement label Aug 30, 2018

alexanderpanchenko closed this as completed Jan 29, 2020

alexanderpanchenko mentioned this issue Jan 31, 2020

Deployment of the query suggestion feature #99

Open
