Code and data of the AAAI-20 paper "Multi-channel Reverse Dictionary Model" [pdf]
- Python 3.x
- Pytorch 1.x
- Other requirements: numpy, tqdm, nltk, gensim, thulac
Download the code and data from Google Drive or Tsinghua Cloud; the code there is the same as in this repository.
Unzip data.zip under the English and Chinese paths respectively. All files under EnglishReverseDictionary and ChineseReverseDictionary should then be arranged as follows:
ReverseDictionary
|- EnglishReverseDictionary
| |- data
| | |- data_train.json
| | |- data_dev.json
| | |- data_test_500_rand1_seen.json
| | |- data_test_500_rand1_unseen.json
| | |- data_defi_c.json [definitions of the target words in 200 descriptions]
| | |- data_desc_c.json [testset of 200 descriptions]
| | |- vec_inuse.json [Only embeddings used in this model are included.]
| | |- lexname_all.txt
| | |- root_affix_freq.txt
| | |- sememes_all.txt
| | |- target_words.txt
| |- code
| | |- main.py
| | |- model.py
| | |- data.py
| | |- evaluate.py
| | |- evaluate_result.py
| | |- analyse_result.py
| | |- result_analysis_En_1200.py
|- ChineseReverseDictionary
| |- data
| | |- Cx.json [x=1,2,3,4]
| | |- description_sense.json [train & dev dataset]
| | |- description_idio_locu.json [testset of Question]
| | |- description_byHand.json [testset of description]
| | |- hownet.json
| | |- sememe.json
| | |- word_cilinClass.json
| | |- word_index.json
| | |- word_vector.npy [Only embeddings used in this model are included.]
| |- code
| | |- main.py
| | |- model.py
| | |- data.py
| | |- evaluate.py
| | |- evaluate_result.py
|- PrepareYourOwnDataset
| |- <See below.>
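Before training, it can help to confirm that the unzipped data matches the layout above. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and it assumes that `Cx.json [x=1,2,3,4]` means the four files `C1.json` to `C4.json`.

```python
from pathlib import Path

# File names copied from the directory tree above.
# Assumption: "Cx.json [x=1,2,3,4]" expands to C1.json ... C4.json.
EXPECTED = {
    "EnglishReverseDictionary/data": [
        "data_train.json", "data_dev.json",
        "data_test_500_rand1_seen.json", "data_test_500_rand1_unseen.json",
        "data_defi_c.json", "data_desc_c.json", "vec_inuse.json",
        "lexname_all.txt", "root_affix_freq.txt", "sememes_all.txt",
        "target_words.txt",
    ],
    "ChineseReverseDictionary/data": [
        "C1.json", "C2.json", "C3.json", "C4.json",
        "description_sense.json", "description_idio_locu.json",
        "description_byHand.json", "hownet.json", "sememe.json",
        "word_cilinClass.json", "word_index.json", "word_vector.npy",
    ],
}


def missing_files(root):
    """Return the expected data files that are absent under `root`."""
    root = Path(root)
    return [
        str(Path(folder) / name)
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).is_file()
    ]
```

Calling `missing_files("ReverseDictionary")` from the parent directory returns an empty list when every data file is in place.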
To train the English model, execute this command under the code path of EnglishReverseDictionary:
python main.py -b [batch_size] -e [epoch_num] -g [gpu_num] -sd [random_seed] -f [freq_mor] -m [rsl, r, s, l, b] -v
The -m option selects the mode from [rsl, r, s, l, b]:
-m r uses morpheme information, including roots and affixes; you can filter morphemes by frequency with -f, usually set to 15~35;
-m s uses the sememe predictor;
-m l uses WordNet lexnames, which are word category information (lexical name and POS tag);
-m b uses no additional information, just the basic BiLSTM model;
-m rsl uses all the information, which is our multi-channel model.
-e is usually set to 10~20; -g indicates which GPU to use; -v shows a progress bar.
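If you want to launch several modes in a row (for example, to compare r, s, l and rsl), the command line can be assembled programmatically. This is only a sketch: `build_train_command` is a hypothetical helper, the flag names are taken from the command above, and the numeric values in the usage note are placeholders rather than recommended settings.

```python
ENGLISH_MODES = ("rsl", "r", "s", "l", "b")


def build_train_command(mode, batch_size, epochs, gpu, seed, freq_mor, verbose=True):
    """Build the argument list for the English main.py command shown above."""
    if mode not in ENGLISH_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {ENGLISH_MODES}")
    args = [
        "python", "main.py",
        "-b", str(batch_size),
        "-e", str(epochs),
        "-g", str(gpu),
        "-sd", str(seed),
        "-f", str(freq_mor),
        "-m", mode,
    ]
    if verbose:
        args.append("-v")  # show a progress bar
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)` from inside the code directory, e.g. `build_train_command("rsl", batch_size=256, epochs=10, gpu=0, seed=1, freq_mor=20)`.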
After training, you will get two new files, xxx_label_list.json and xxx_pred_list.json, where "xxx" is the mode you set with -m. For example, with -m rsl the label file will be rsl_label_list.json.
To evaluate, execute this command under the same code path:
python evaluate_result.py -m [mode]
Here, mode is the same as above.
You will then get the median rank, accuracy@1/10/100 and rank variance on 3 test sets: seen, unseen and description.
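For readers who want to see what these metrics measure, here is a minimal sketch. It is not the code in evaluate_result.py, and it makes assumptions that the files themselves do not confirm here: each prediction is taken to be a ranked list of candidate word ids, each label a single target id, and ranks are counted from 0. The function names are hypothetical.

```python
import statistics


def rank_of(label, ranked_candidates):
    """0-based position of `label` in a ranked list; list length if absent."""
    try:
        return ranked_candidates.index(label)
    except ValueError:
        return len(ranked_candidates)


def summarize(labels, predictions):
    """Median rank, accuracy@1/10/100 and rank variance for one test set."""
    ranks = [rank_of(label, pred) for label, pred in zip(labels, predictions)]
    if not ranks:
        raise ValueError("no examples to evaluate")
    n = len(ranks)
    return {
        "median_rank": statistics.median(ranks),
        "acc@1": sum(r < 1 for r in ranks) / n,
        "acc@10": sum(r < 10 for r in ranks) / n,
        "acc@100": sum(r < 100 for r in ranks) / n,
        # Population variance of the ranks; take the square root
        # if a standard deviation is wanted instead.
        "rank_variance": statistics.pvariance(ranks),
    }
```

Treat the output of evaluate_result.py as the reference; this sketch only illustrates the definitions.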
You can evaluate model performance with prior knowledge:
python analyse_result.py
python result_analysis_En_1200.py -m [mode]
To train the Chinese model, execute this command under the code path of ChineseReverseDictionary:
python main.py -b [batch_size] -e [epoch_num] -g [gpu_num] -sd [random_seed] -u/-s -m [CPsc, C, P, s, c, b] -v
Unlike in English model training, we use -u or -s to select the Unseen or Seen test mode. In fact, there is no need to use the test mode on the Seen Definition test set.
The -m option selects the mode from [CPsc, C, P, s, c, b]:
-m C uses Cilin word category information (we use 4 word classes in Cilin);
-m P uses the POS predictor;
-m s uses the sememe predictor;
-m c uses the morpheme predictor, where morphemes are Chinese characters;
-m b uses no additional information, just the basic BiLSTM model;
-m CPsc uses all the information, as in our multi-channel model.
-e, -g and -v are the same as in English model training.
python evaluate_result.py -m [mode]
Here, mode is the prefix of xxx_label_list.json.
You will then get the median rank, accuracy@1/10/100 and rank variance on 4 test sets: seen, unseen, Description and Question.
You can evaluate model performance with prior knowledge:
python result_analysis_Ch.py -m [mode]
Here is some code for reference. The data format is shown below, and you can build your own data set.
ReverseDictionary
|- EnglishReverseDictionary
|- ChineseReverseDictionary
|- PrepareYourOwnDataset
| |- proc_allFeatures.py
| |- get_wordnet_lexname.py
| |- get_wordnet_500sample.py
| |- process_googleVec_checkAllData.py
| |- readHowNet_to_word_sememe.py
| |- wordnik_get_defi.py
| |- check_root_affix.py
The data_xxx.json files are in JSON format. Each entry looks like this:
{
    "word": "fatalism",
    "lexnames": [
        "noun.cognition"
    ],
    "root_affix": [
        "fatal",
        "ism"
    ],
    "sememes": [
        "knowledge",
        "believe",
        "experience",
        "Fate"
    ],
    "definitions": "the doctrine that all events are predetermined by fate and are therefore unalterable"
}
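The entry format above can be read with the standard json module. The sketch below assumes that each data_xxx.json file holds a JSON array of such entries, which is an assumption rather than something stated here; `load_entries` and `build_label_vocab` are hypothetical helper names, not functions from data.py.

```python
import json


def load_entries(path):
    """Load a data_xxx.json file (assumed to be a JSON array of entries)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_label_vocab(entries, field):
    """Map every distinct value of a list-valued field to an integer index.

    `field` is one of "lexnames", "root_affix" or "sememes".
    """
    values = sorted({value for entry in entries for value in entry[field]})
    return {value: index for index, value in enumerate(values)}
```

For example, `build_label_vocab(load_entries("data_train.json"), "sememes")` gives an index for each sememe that occurs in the training set.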
Word embeddings are in vec_inuse.json, which contains all target words and words in definitions. Only used words are included. The format is {word: [vector], ...}.
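Given the {word: [vector], ...} format, the embeddings can be turned into an index plus a list of rows. This is a minimal sketch using only the standard library; `load_embeddings` is a hypothetical name, and the sorted word order is an arbitrary choice made here for reproducibility, not the ordering used by the model code.

```python
import json


def load_embeddings(path):
    """Read a {word: [vector]} JSON file into (word_to_index, rows)."""
    with open(path, encoding="utf-8") as f:
        word_to_vec = json.load(f)
    dims = {len(vec) for vec in word_to_vec.values()}
    if len(dims) > 1:
        raise ValueError(f"inconsistent vector sizes: {sorted(dims)}")
    words = sorted(word_to_vec)
    word_to_index = {word: i for i, word in enumerate(words)}
    rows = [word_to_vec[word] for word in words]  # rows[i] belongs to words[i]
    return word_to_index, rows
```

The `rows` list can then be passed to `numpy.array` or `torch.tensor` to build an embedding matrix.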
lexname_all.txt contains all 45 lexnames from WordNet.
sememes_all.txt contains 1400 sememes from HowNet.
Morphemes (roots and affixes) are in root_affix_freq.txt, which lists each morpheme with its count, separated by spaces.
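A frequency file of this kind can be parsed and thresholded in a few lines. The sketch below assumes one "morpheme count" pair per line and treats the threshold as a minimum count (keep morphemes seen at least that many times); both points are assumptions, and the exact semantics of -f are defined by the repository code. The function names are hypothetical.

```python
def parse_morpheme_counts(lines):
    """Parse 'morpheme count' lines into a dict, skipping malformed lines."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank or unexpected line
        morpheme, count = parts
        counts[morpheme] = int(count)
    return counts


def filter_by_frequency(counts, min_count):
    """Keep only morphemes whose count is at least `min_count`."""
    return {m: c for m, c in counts.items() if c >= min_count}
```

With an open file handle `f` on root_affix_freq.txt, `filter_by_frequency(parse_morpheme_counts(f), 20)` keeps the morphemes that occur at least 20 times.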
In English experiments, we use the Description dataset from (Hill et al. 2016).
Word embeddings are from GoogleNews-vectors-negative300.
Sememes can be obtained using OpenHowNet.
Lexnames are from WordNet, which you can easily access through NLTK.
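As an illustration of the NLTK route, the sketch below collects the lexnames of all WordNet synsets of a word. It is not the code in get_wordnet_lexname.py; `lexnames_for` is a hypothetical helper, and the optional `synsets_fn` parameter exists only so the function can be exercised without the WordNet corpus installed.

```python
def lexnames_for(word, synsets_fn=None):
    """Return the sorted, distinct WordNet lexnames of `word`."""
    if synsets_fn is None:
        # Needs the WordNet corpus: run nltk.download("wordnet") once.
        from nltk.corpus import wordnet
        synsets_fn = wordnet.synsets
    return sorted({synset.lexname() for synset in synsets_fn(word)})
```

Each NLTK synset exposes `lexname()`, which returns strings such as "noun.cognition", the same form used in the "lexnames" field of the data files.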
We get morphemes with the Morfessor tool, using the dataset from morpho.aalto.fi. You should train a Morfessor model first, and then use it to process the target words to get the corresponding roots and affixes.
morfessor-train --encoding=ISO_8859-15 --traindata-list --logfile=log.log -s model.bin -d ones wordlist-2010.eng
morfessor-segment -l ../morfessor_data/model.bin target_words.txt -o word_root_affix.txt
Unfortunately, the morphemes obtained by this method are not accurate. It is recommended that you use the standard root-affix dictionary.
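If you do use the Morfessor output, the segmentation file can be converted into the "root_affix" lists used in the data files. This sketch assumes the default morfessor-segment output format, one word per line with its morphs separated by spaces; `parse_segmentation` is a hypothetical helper, and the sample segmentations in the usage note are illustrative, not actual Morfessor output.

```python
def parse_segmentation(lines):
    """Map each word to its list of morphs from morfessor-segment output."""
    word_to_morphs = {}
    for line in lines:
        morphs = line.split()
        if morphs:
            # The original word is recovered by concatenating its morphs.
            word_to_morphs["".join(morphs)] = morphs
    return word_to_morphs
```

For instance, a line reading `fatal ism` becomes the entry `"fatalism": ["fatal", "ism"]`.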
If you use any code or data, please cite this paper:
@article{zhang2019multi,
title={Multi-channel Reverse Dictionary Model},
author={Zhang, Lei and Qi, Fanchao and Liu, Zhiyuan and Wang, Yasheng and Liu, Qun and Sun, Maosong},
journal={arXiv preprint arXiv:1912.08441},
year={2019}
}
You can visit our online reverse dictionary website, where we have optimized our methods and datasets; see WantWords on GitHub. You can post issues if you have any questions.