StarSpace embedding
We trained new embeddings directly on the medical notes from the NYU Langone Hospital EHR system, because the style and abbreviations found in clinical notes are distinct from those of the medical publications available on PubMed.
We adopt StarSpace (https://github.com/facebookresearch/StarSpace), a general-purpose neural model for efficient learning of entity embeddings. In particular, we label the notes from each encounter with the ICD-10 diagnosis codes of the same encounter, as shown in Figure (a). Under StarSpace's bag-of-words approach, the encounter is represented by aggregating the embeddings of its individual words (we used the default aggregation method, in which the encounter embedding is the sum of the embeddings of all words divided by the square root of the number of words). Both the word embeddings and the diagnosis code embeddings are trained so that the cosine similarity between an encounter and its own diagnoses is ranked higher than that between the encounter and a set of different diagnoses. As a result, words related to the same symptom are placed close to each other in the embedding space. For example, Figure (b) shows the neighbours of the word "inhale" in a t-SNE projection of the embeddings onto a 2-dimensional space.
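The aggregation and ranking objective described above can be written out as a short sketch. The symbols are our own notation, not taken from the StarSpace code: \(\mathbf{w}_i\) is the embedding of the \(i\)-th of \(n\) words in the encounter, \(\mathbf{e}\) is the encounter embedding, \(\mathbf{d}^{+}\) is the embedding of a diagnosis code attached to the encounter, \(N\) is a set of sampled negative diagnosis codes, and \(\mu\) is the margin. We assume the hinge loss takes the usual margin ranking form.

```latex
\[
\mathbf{e} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mathbf{w}_i,
\qquad
\mathcal{L} = \sum_{\mathbf{d}^{-} \in N}
  \max\!\bigl(0,\; \mu - \cos(\mathbf{e}, \mathbf{d}^{+}) + \cos(\mathbf{e}, \mathbf{d}^{-})\bigr)
\]
```

Under this reading, a negative diagnosis contributes to the loss only when its cosine similarity to the encounter comes within \(\mu\) of the similarity of the true diagnosis, which is what pushes the true diagnoses to rank higher.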
The following are the parameters we used to train the StarSpace embeddings:
echo "Start to train the StarSpace embeddings:"
starspace train \
  -trainFile "${DATADIR}/${TrainFile}" -model "${MODELDIR}/${MODELNAME}" \
  -ngrams 1 -adagrad True -thread 40 -dropoutRHS 0.8 -dim 300 \
  -lr 0.01 -epoch 20 -margin 0.05 -verbose true -loss hinge \
  -initRandSd 0.01 -trainMode 0 -similarity "cosine" \
  -negSearchLimit 100 -minCount 5 -label "<diag>" -maxNegSamples 100 \
  -dropoutLHS 0.0 -minCountLabel 5 -fileFormat fastText
echo "Start to evaluate trained model:"
starspace test \
  -testFile "${DATADIR}/${TestFile}" -model "${MODELDIR}/${MODELNAME}" \
  -ngrams 1 -dim 300 -thread 20 -verbose true -label "<diag>" \
  -similarity "cosine" -trainMode 0 -fileFormat fastText
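The commands rely on shell variables and an input file layout that are not spelled out here. The following is a minimal sketch of one possible setup, not the authors' actual configuration: all paths, file names, and the two example encounters are made up for illustration. It assumes that, with `-fileFormat fastText` and `-label "<diag>"`, each line holds the words of one encounter followed by its diagnosis codes, each prefixed with `<diag>`.

```shell
# Hypothetical paths and names: the source does not give the real values.
DATADIR="./data"
MODELDIR="./models"
TrainFile="train.txt"
TestFile="test.txt"
MODELNAME="starspace_notes"

mkdir -p "${DATADIR}" "${MODELDIR}"

# Two made-up encounters, one per line (assumed layout):
# the note's words first, then one "<diag>"-prefixed ICD-10 code per diagnosis.
cat > "${DATADIR}/${TrainFile}" <<'EOF'
patient reports wheezing and uses inhaler daily <diag>J45.909
chest pain radiating to left arm <diag>I20.9 <diag>I10
EOF

# Sanity check: list how often each label token appears in the file.
grep -o '<diag>[^ ]*' "${DATADIR}/${TrainFile}" | sort | uniq -c
```

A toy file like this only illustrates the layout. With `-minCount 5` and `-minCountLabel 5`, words and labels that appear fewer than five times would be dropped, so real training needs a much larger corpus.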