This work first encodes ("embeds") each sequence into a dense, low-dimensional numeric vector space. We use Skip-Gram word2vec to embed k-mers obtained from 16S rRNA amplicon surveys, and then apply an existing sentence embedding technique to embed all sequences belonging to a given sample. We show that these representations are meaningful, so the embedding space can serve as a form of feature extraction for exploratory analysis. Sequence embeddings preserve relevant information about the sequencing data, such as k-mer context, sequence taxonomy, and sample class. The embeddings are also versatile features for downstream tasks such as taxonomic and sample classification.
Code demonstrating the workflow with working examples can be found at github.com/EESI/16s_embeddings.
Complete datasets used in the manuscript can be found at gitlab.com/sw1/embeddings_data.
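To make the shape of the pipeline concrete, the following is a minimal sketch of its two levels: splitting reads into overlapping k-mers, then pooling k-mer vectors into sequence and sample vectors. It is not the code from the repositories above. The function names (`kmers`, `toy_kmer_vectors`, `embed_sequence`, `embed_sample`) are hypothetical, the k-mer vectors are random placeholders standing in for trained Skip-Gram word2vec vectors, and a plain unweighted average stands in for the sentence embedding technique, which the text does not specify.

```python
import random


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_kmer_vectors(vocab, dim, seed=0):
    """Placeholder for trained Skip-Gram vectors: one random vector per k-mer.

    Assumption: a real workflow would train word2vec on k-mer "sentences";
    random vectors are used here only so the sketch runs on its own.
    """
    rng = random.Random(seed)
    return {km: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for km in sorted(vocab)}


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def embed_sequence(sequence, k, vectors):
    """Average the vectors of a sequence's k-mers, skipping unknown k-mers."""
    found = [vectors[km] for km in kmers(sequence, k) if km in vectors]
    if not found:
        raise ValueError("sequence contains no k-mers from the vocabulary")
    return mean_vector(found)


def embed_sample(sequences, k, vectors):
    """Average the sequence embeddings of all reads in one sample.

    Assumption: unweighted averaging is a stand-in for the sentence
    embedding technique referred to in the text.
    """
    return mean_vector([embed_sequence(s, k, vectors) for s in sequences])


# Two made-up reads standing in for one sample's 16S amplicon sequences.
reads = ["ACGTACGTGA", "TTGACGTACC"]
K, DIM = 4, 8

vocab = {km for read in reads for km in kmers(read, K)}
vectors = toy_kmer_vectors(vocab, DIM)
sample_vec = embed_sample(reads, K, vectors)
```

The resulting `sample_vec` is a fixed-length feature vector regardless of how many reads the sample contains, which is what allows embeddings of this kind to be passed to a downstream classifier.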