try to generate candidates for comparison #86

Closed
alexanderpanchenko opened this issue Aug 8, 2018 · 12 comments
Labels
enhancement New feature or request

Comments

@alexanderpanchenko
Contributor

alexanderpanchenko commented Aug 8, 2018

Given a word such as 'python', generate a list of comparison candidates, as Google does for 'python vs ...'.

  1. Get all sentences containing the target word (python).

  2. Classify them (first word = python, second word = last / first noun in the sentence, text = input sentence), OR classify them while iterating over all nouns in the sentence (first word = python, second word = i-th noun in the sentence, text = input sentence).

  3. Rank the nouns found in the sentences by their total number of comparative sentences (with a high threshold). No normalization is needed; just take the raw sentence counts.
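The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `rank_candidates`, `toy_nouns` and `toy_classifier` are hypothetical names, the noun extractor is a fixed vocabulary lookup, and the classifier is a cue-word check standing in for the project's real comparative-sentence classifier.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def rank_candidates(
    target: str,
    sentences: Iterable[str],
    nouns_of: Callable[[str], List[str]],
    is_comparative: Callable[[str, str, str], bool],
    min_count: int = 2,
) -> List[Tuple[str, int]]:
    """Rank nouns by the raw number of comparative sentences shared with target."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Deduplicate so that one sentence counts at most once per noun.
        for noun in set(nouns_of(sentence)):
            if noun != target and is_comparative(target, noun, sentence):
                counts[noun] += 1
    # No normalization: keep raw counts and apply the threshold.
    return [(noun, n) for noun, n in counts.most_common() if n >= min_count]


# Hypothetical stand-ins for a POS tagger and the sentence classifier.
def toy_nouns(sentence: str) -> List[str]:
    vocabulary = {"python", "java", "perl", "code"}
    words = [w.strip(".,").lower() for w in sentence.split()]
    return [w for w in words if w in vocabulary]


def toy_classifier(first: str, second: str, text: str) -> bool:
    cues = {"better", "worse", "faster", "slower"}
    return any(word in cues for word in text.lower().split())


SENTENCES = [
    "Python is faster than Java.",
    "Java is slower than Python for scripting.",
    "Python is better than Perl.",
    "I wrote some Python code today.",
]

print(rank_candidates("python", SENTENCES, toy_nouns, toy_classifier))
```

With a real corpus, `min_count` would be set much higher, as suggested in step 3.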

@alexanderpanchenko
Contributor Author

A caching mechanism is important so that we do not re-compute everything from scratch every time.
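One possible shape for such a cache is a small JSON file keyed by the query word. The class name `CandidateCache`, the file layout and the `compute` callback are assumptions made for this sketch, not part of the project.

```python
import json
import os
from typing import Callable, Dict, List


class CandidateCache:
    """File-backed cache: candidates for a word are computed at most once."""

    def __init__(self, path: str, compute: Callable[[str], List[str]]):
        self.path = path
        self.compute = compute  # hypothetical expensive candidate search
        self.store: Dict[str, List[str]] = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.store = json.load(handle)

    def get(self, word: str) -> List[str]:
        key = word.lower()
        if key not in self.store:
            # Cache miss: compute once and persist for later runs.
            self.store[key] = self.compute(key)
            with open(self.path, "w", encoding="utf-8") as handle:
                json.dump(self.store, handle)
        return self.store[key]
```

Rewriting the whole file on every miss is fine for a sketch; a larger deployment would more likely use a database.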

@mschildw
Collaborator

mschildw commented Oct 29, 2018

I now tried a different approach:

  1. I query sentences with the following query: "text:(<object> AND vs)", where <object> is "python", for example.

  2. I take the nouns (NN) where the following pattern matches: ( (vs|vs.) candidate | candidate (vs|vs.) )
    These two steps alone already deliver quite good comparison candidates (for python):
    [('perl', 40), ('java', 23), ('ruby', 22), ('php', 19), ('boa', 16), ('alligator', 15), ('julia', 14), ('net', 9), ('c++', 6), ('visual', 5), ('javascript', 4), ('gatoroid', 2), ('crocodile', 2), ('ruby ruby', 2), ('matlab gc', 2), ('brython', 2), ('cat', 2), ('lua', 2), ('qml', 2), ('jython', 1), ('lisp', 1), ('arc', 1), ('tiger', 1), ('rhinoscript', 1), ("print 'weave", 1), ('matlab/eeglab', 1), ('node', 1), ('python programs', 1), ('aqueon', 1), ('africanized honeybee', 1), ('gator', 1), ('gql', 1), ('profiling pypy', 1), ('scheme', 1), ('alligator watch', 1), ('deer', 1), ('octave', 1), ('nspr', 1), ('stones', 1), ('jlizard', 1), ('thinking upside down ruby', 1), ('ruby deathmatch', 1), ('kruger', 1), ('ruby performance', 1), ('cockatoo photos', 1), ('python-novaclient', 1), ('prothon', 1), ('film boa', 1), ('cython', 1), ('sas', 1), ("print 'f2py", 1), ('pycuda', 1)]

  3. I still need to filter out some candidates like "kruger". Do you think common hypernyms (WordNet) could be helpful for this? Are there standard functions to get common hypernyms for words? I only found one for synsets...

  4. Another approach could be to query the sentences belonging to the object and a candidate and count the sentences classified as "BETTER" or "WORSE", but that is very costly.

    What do you think about the first 2 steps?
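A rough sketch of steps 1 and 2, assuming the sentences have already been retrieved: the regular expressions below approximate the "(vs|vs.) candidate | candidate (vs|vs.)" pattern. Unlike the approach described above, this sketch does no POS tagging and captures only a single token next to "vs", so multi-word nouns are not handled; `extract_candidates` and the sample sentences are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# One token directly after "vs" / "vs." or directly before it.
AFTER_VS = re.compile(r"\bvs\.?\s+([A-Za-z][\w+#-]*)", re.IGNORECASE)
BEFORE_VS = re.compile(r"\b([A-Za-z][\w+#-]*)\s+vs\.?(?!\w)", re.IGNORECASE)


def extract_candidates(target: str, sentences: Iterable[str]) -> List[Tuple[str, int]]:
    """Count, per candidate, the sentences in which it appears next to 'vs'."""
    target = target.lower()
    counts: Counter = Counter()
    for sentence in sentences:
        found = AFTER_VS.findall(sentence) + BEFORE_VS.findall(sentence)
        # A set ensures each sentence counts once per candidate.
        for candidate in {token.lower() for token in found}:
            if candidate != target:
                counts[candidate] += 1
    return counts.most_common()


SAMPLE_SENTENCES = [
    "Python vs Perl: which one is faster?",
    "Perl vs. Python for text processing",
    "python vs. java benchmarks",
    "Ball python vs boa",
]

print(extract_candidates("python", SAMPLE_SENTENCES))
```

As in the list above, the output mixes programming languages and animals, which is why a later filtering step is needed.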

@mschildw
Collaborator

mschildw commented Oct 29, 2018

WordNet does not seem to be useful for the python example. After filtering out all candidates that do not share a common hypernym with python, only these are left:
[('java', 23), ('ruby', 22), ('boa', 16), ('alligator', 15), ('net', 9), ('cat', 2), ('crocodile', 2), ('sas', 1), ('tiger', 1), ('lisp', 1), ('arc', 1), ('node', 1), ('stones', 1), ('octave', 1), ('deer', 1), ('gator', 1), ('scheme', 1)]

For many candidates there was no hypernym at all (e.g. perl, lua, c++), as can also be seen here: http://wordnetweb.princeton.edu/perl/webwn
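The word-level check (as opposed to the synset-level one) can be illustrated by looping over all sense pairs of two words and intersecting their hypernym closures. The sketch below uses a tiny hand-made taxonomy (`SENSES`, `HYPERNYMS`), which is purely hypothetical and not WordNet data. With NLTK, the same loop would presumably run over `wordnet.synsets(word)` and use `Synset.common_hypernyms`, but that is not exercised here.

```python
from typing import Dict, List, Set

# Hypothetical mini-taxonomy, not real WordNet content.
SENSES: Dict[str, List[str]] = {
    "python": ["python.snake"],
    "boa": ["boa.snake"],
    "alligator": ["alligator.reptile"],
    "java": ["java.language", "java.coffee"],
}
HYPERNYMS: Dict[str, List[str]] = {
    "python.snake": ["snake"],
    "boa.snake": ["snake"],
    "snake": ["reptile"],
    "alligator.reptile": ["reptile"],
    "reptile": ["animal"],
    "java.language": ["programming_language"],
    "java.coffee": ["beverage"],
}


def ancestors(sense: str) -> Set[str]:
    """All direct and indirect hypernyms of one sense."""
    seen: Set[str] = set()
    stack = list(HYPERNYMS.get(sense, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(HYPERNYMS.get(node, []))
    return seen


def common_hypernyms(word_a: str, word_b: str) -> Set[str]:
    """Union of shared hypernyms over every pair of senses of the two words."""
    shared: Set[str] = set()
    for sense_a in SENSES.get(word_a, []):
        for sense_b in SENSES.get(word_b, []):
            shared |= ancestors(sense_a) & ancestors(sense_b)
    return shared


print(common_hypernyms("python", "boa"))
print(common_hypernyms("python", "perl"))  # word unknown to the taxonomy
```

A word that is missing from the taxonomy yields an empty set and is filtered out, which matches the behaviour reported above for perl, lua and c++.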

@alexanderpanchenko
Contributor Author

alexanderpanchenko commented Oct 29, 2018 via email

@mschildw
Collaborator

For the second filter approach the following comparison candidates were selected:
['lisp', 'lua', 'scheme', 'perl', 'visual', 'jython', 'net', 'cython', 'ruby', 'java', 'javascript', 'php', 'node', 'boa', 'julia', 'alligator', 'qml', 'python programs', 'cat', 'deer', 'crocodile', 'octave', 'tiger', 'arc', 'sas', 'gator', 'aqueon', 'prothon', 'ruby ruby', 'stones', 'brython', 'ruby performance', 'gql', 'nspr', 'pycuda']

They are sorted by the number of comparative sentences found for python and the candidate. If only candidates with more than 40 comparative sentences are shown, probably the best ones remain:
['lisp', 'lua', 'scheme', 'perl', 'visual', 'jython', 'net', 'cython', 'ruby', 'java', 'javascript', 'php', 'node', 'boa', 'julia']

Comparing with Google:
[image]

Only r, c++, matlab and go are not found, so 60% are found; in addition, some more candidates are found that could also be interesting.

@alexanderpanchenko
Contributor Author

alexanderpanchenko commented Oct 29, 2018 via email

@alexanderpanchenko
Contributor Author

alexanderpanchenko commented Oct 29, 2018 via email

@mschildw
Collaborator

Thanks for the hint about JoBimText; I hope it is easy to figure out how to use it.

About deploying: at the moment it does not really operate in real time. It takes about 15 seconds to process steps 1 and 2. The filtering (step 3) using the BoW classifier feature set takes minutes.

Maybe it is something we have to do beforehand:
take seed words from different domains and search for their comparison candidates. Afterwards, we continue with the candidates found, and so on. We could then save the results to a DB or the file system.
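The offline expansion described here amounts to a breadth-first crawl starting from the seed words. A minimal sketch, assuming a hypothetical `find_candidates(word)` function that wraps the slow steps 1 to 3, and a JSON file as the storage format:

```python
import json
from collections import deque
from typing import Callable, Dict, Iterable, List


def precompute(
    seeds: Iterable[str],
    find_candidates: Callable[[str], List[str]],
    max_words: int = 100,
) -> Dict[str, List[str]]:
    """Expand from seed words breadth-first, storing candidates per word."""
    table: Dict[str, List[str]] = {}
    queue = deque(seed.lower() for seed in seeds)
    while queue and len(table) < max_words:
        word = queue.popleft()
        if word in table:
            continue  # already processed via another path
        table[word] = find_candidates(word)
        # Continue with the newly found candidates.
        queue.extend(c for c in table[word] if c not in table)
    return table


def save_table(table: Dict[str, List[str]], path: str) -> None:
    """Persist the precomputed table so lookups at query time are instant."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(table, handle, indent=2, sort_keys=True)
```

The `max_words` cap keeps the crawl bounded, since every new word costs a full candidate search.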

@alexanderpanchenko
Contributor Author

alexanderpanchenko commented Oct 29, 2018 via email

@mschildw
Collaborator

Do you maybe have an example of how to use the JoBimText DT?
That would be great, since it is the last part of the thesis I need to write about (and create content to write about), and I would like to send you the first draft including this part at the end of this week.

I have to set up a local database to use it, right?
The trim operation can be achieved using this http://ltmaggie.informatik.uni-hamburg.de/jobimtext/documentation/pruning/ , right?

@alexanderpanchenko
Contributor Author

alexanderpanchenko commented Oct 29, 2018 via email

@mschildw
Collaborator

Alright, thank you very much for the clarification. I thought I had to understand how to set up and use JoBimText now.
I will have a look at how well that works for filtering the candidates, thanks!


2 participants