Generalized Correspondence LDA (Python implementation). Deprecated in favor of NiMARE.

tsalo/gclda

 
 

gclda

This is a Python implementation of the Generalized Correspondence-LDA model (gcLDA).

Generalized Correspondence-LDA Model (GC-LDA)

The gcLDA model is a generalization of the correspondence-LDA model (Blei & Jordan, 2003, "Modeling annotated data"), which is an unsupervised learning model used for modeling multiple data-types, where one data-type describes the other. The gcLDA model was introduced in the following paper:

Generalized Correspondence-LDA Models (GC-LDA) for Identifying Functional Regions in the Brain

where the model was applied to the Neurosynth corpus of fMRI publications. Each publication in this corpus consists of a set of word tokens and a set of reported peak activation coordinates (x, y, and z spatial coordinates corresponding to brain locations).

When applied to fMRI publication data, the gcLDA model identifies a set of T topics, where each topic captures a 'functional region' of the brain. More formally: each topic is associated with (1) a spatial probability distribution that captures the extent of a functional neural region, and (2) a probability distribution over linguistic features that captures the cognitive function of the region.
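To make the two-part structure of a topic concrete, here is a minimal sketch. It is not code from the gclda package: the class name `ToyTopic`, the isotropic single-Gaussian spatial distribution, and all numeric values are assumptions chosen for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyTopic:
    """Hypothetical container for one topic (not a gclda class)."""
    mean: tuple        # (x, y, z) centre of the region, illustrative values
    sigma: float       # isotropic standard deviation (a simplifying assumption)
    word_probs: dict   # word -> probability, summing to 1

    def spatial_density(self, xyz):
        """Density of an isotropic 3-D Gaussian evaluated at coordinate xyz."""
        sq_dist = sum((a - b) ** 2 for a, b in zip(xyz, self.mean))
        norm = (2.0 * math.pi * self.sigma ** 2) ** -1.5
        return norm * math.exp(-sq_dist / (2.0 * self.sigma ** 2))

    def top_words(self, n=3):
        """The n most probable words for this topic."""
        return sorted(self.word_probs, key=self.word_probs.get, reverse=True)[:n]


# Made-up example topic: a spatial part and a linguistic part.
topic = ToyTopic(
    mean=(-40.0, 20.0, 10.0),
    sigma=8.0,
    word_probs={"language": 0.5, "speech": 0.3, "motor": 0.2},
)
near = topic.spatial_density((-38.0, 22.0, 11.0))  # close to the centre
far = topic.spatial_density((40.0, -60.0, 30.0))   # far from the centre
```

A coordinate near the topic's centre receives a higher density than a distant one, while `top_words` summarizes the topic's linguistic side.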

The gcLDA model can additionally be directly applied to other types of data. For example, Blei & Jordan presented correspondence-LDA for modeling annotated images, where pre-segmented images were represented by vectors of real-valued image features. The code provided here should be directly applicable to these types of data, provided that they are appropriately formatted. Note however that this package has only been tested on the Neurosynth dataset; some modifications may be needed for use with other datasets.

Installation

Dependencies for this package are: scipy, numpy and matplotlib. If you don't have these installed, the easiest way to do so may be to use Anaconda. Alternatively, this page provides a tutorial on installing them (note that the line "brew install gfortran" now must be replaced by "brew install gcc").

Additionally, some of the example scripts rely on gzip and cPickle (for saving compressed model instances to disk).
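As a rough sketch of what saving a compressed model instance can look like, the snippet below combines gzip with pickle. It is not taken from the example scripts: the helper names `save_compressed` and `load_compressed`, the file name, and the dictionary standing in for a model are all hypothetical, and the standard `pickle` module is used in place of `cPickle`, which exists only on Python 2.

```python
import gzip
import os
import pickle  # Python 3 counterpart of the cPickle module named above
import tempfile


def save_compressed(obj, path):
    """Pickle obj and write it to a gzip-compressed file."""
    with gzip.open(path, "wb") as fo:
        pickle.dump(obj, fo, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Read a gzip-compressed pickle back into memory."""
    with gzip.open(path, "rb") as fi:
        return pickle.load(fi)


# A plain dict stands in for a trained model instance (hypothetical content).
state = {"n_topics": 2, "iteration": 10}
path = os.path.join(tempfile.mkdtemp(), "model.pklz")
save_compressed(state, path)
restored = load_compressed(path)
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.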

This code can be installed as a python package using:

> python setup.py install

The classes needed to run a gclda model can then be imported into python using:

> from gclda.dataset import Dataset

> from gclda.model import Model

Summary of gclda package

The repository consists of:

  • two python classes (contained within the subdirectory 'gclda')
  • several scripts and a tutorial that illustrate how to use these classes to train and export a gcLDA model (contained within the subdirectory 'examples')
  • formatted versions of the Neurosynth dataset that can be used to train a gclda model (contained within the subdirectory datasets/neurosynth)
  • some examples of results from trained gcLDA models under different parameter settings (contained within subdirectories of 'example_results')

Dataset formatting

The Dataset class requires four .txt files containing all dataset features that the gcLDA model needs to operate. Please see the datasets/neurosynth subdirectory for examples of properly formatted data. For additional details about these files, please see README.txt in the documentation subdirectory.

Tutorial usage examples

For a simple tutorial illustrating usage of the gclda package, see the following file:

This tutorial demonstrates how to (1) build a Dataset object (using a small subset of the Neurosynth dataset), (2) train a gcLDA Model on the Dataset object, and (3) export figures illustrating the trained Model to files for viewing.

There is also a version of this same tutorial in the following Jupyter notebook:

Code usage examples

For additional examples of how to use the code, please see the following scripts in the 'examples' subdirectory:

  • script_run_gclda.py: Illustrates how to build a dataset object from a version of the Neurosynth dataset, and then train a gcLDA model (using the dataset object and several hyper-parameter settings that get passed to the model constructor).
  • script_export_gclda_figs.py: Illustrates how to export model data and png files illustrating each topic from a trained gcLDA model object.
  • script_predict_holdout_data.py: Illustrates how to compute the log-likelihood for a hold-out dataset.

Note that these scripts operate on the following version of the Neurosynth dataset: "2015Filtered2_TrnTst1P1", a training dataset from which a subset of the document data has been removed for testing (the test data are in the dataset "2015Filtered2_TrnTst1P2"). The complete Neurosynth dataset, without any test data removed, is the version labeled "2015Filtered2".
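For readers unfamiliar with hold-out evaluation, the following is a generic sketch of a log-likelihood computation for held-out word tokens under a topic mixture. It is not the computation performed by script_predict_holdout_data.py, which may differ (for example, by also scoring peak coordinates); the function name, arguments, and numbers are hypothetical.

```python
import math


def heldout_word_loglik(doc_tokens, topic_weights, topic_word_probs):
    """Sum of log p(word) over held-out tokens, with
    p(word) = sum_t topic_weights[t] * topic_word_probs[t][word].

    Assumes every held-out word has non-zero probability under at least
    one topic (e.g. because the word distributions are smoothed).
    """
    total = 0.0
    for word in doc_tokens:
        p = sum(w * probs.get(word, 0.0)
                for w, probs in zip(topic_weights, topic_word_probs))
        total += math.log(p)
    return total


# Toy values: two topics and a two-word held-out document.
weights = [0.75, 0.25]
word_probs = [
    {"memory": 0.5, "pain": 0.5},
    {"memory": 0.25, "pain": 0.75},
]
loglik = heldout_word_loglik(["memory", "pain"], weights, word_probs)
```

Higher (less negative) values indicate that the trained parameters explain the held-out tokens better.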

Additional details about the gcLDA code, gcLDA hyper-parameter settings, and about these scripts are provided in the README.txt in the documentation subdirectory, as well as in the comments of the script_run_gclda.py file. Note that all three models presented in the source paper ('no subregions', 'unconstrained subregions' and 'constrained subregions') can be trained by modifying the model hyper-parameters appropriately.

Example results for trained models

Results for some example trained models (including .png files illustrating all topics for the models) are included in the 'example_results' subdirectories.

Using alternative spatial distributions

As described in our paper, the gcLDA model allows one to associate topics with any valid probability distribution for modeling the observed 'x' data. The package can currently train gcLDA models using Gaussian mixture models with any number of components, as well as Gaussian mixture models with spatial constraints. If you wish to modify the code to train a model using an alternative distribution, you will need to modify the following in model.py: (1) the _update_regions method, (2) the _get_peak_probs method, and (3) the lines of the __init__ method that allocate memory for storing the distributional parameters.
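The sketch below illustrates the three pieces an alternative distribution typically needs: parameter storage, a parameter update from the peaks assigned to a topic, and a density evaluation for a peak. It is a standalone toy, not a patch for model.py. The class `LaplaceRegion` and its method names are hypothetical, the choice of an independent Laplace distribution per axis is arbitrary, and the mapping of its methods onto _update_regions and _get_peak_probs is an assumption based on those methods' names.

```python
import math
from statistics import median


class LaplaceRegion:
    """Hypothetical per-topic spatial distribution: independent Laplace per axis."""

    def __init__(self, n_dims=3):
        # Parameter storage (the role of the allocation lines in __init__).
        self.loc = [0.0] * n_dims
        self.scale = [1.0] * n_dims

    def update(self, points):
        """Maximum-likelihood update from the peaks assigned to this topic:
        location = median, scale = mean absolute deviation from the median."""
        for d in range(len(self.loc)):
            values = [p[d] for p in points]
            self.loc[d] = median(values)
            mad = sum(abs(v - self.loc[d]) for v in values) / len(values)
            self.scale[d] = max(mad, 1e-6)  # guard against a zero scale

    def peak_prob(self, point):
        """Density of a single peak under the current parameters."""
        dens = 1.0
        for x, mu, b in zip(point, self.loc, self.scale):
            dens *= math.exp(-abs(x - mu) / b) / (2.0 * b)
        return dens


region = LaplaceRegion()
region.update([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0), (4.0, 2.0, 3.0)])
at_centre = region.peak_prob(region.loc)
off_centre = region.peak_prob((10.0, 10.0, 10.0))
```

Whatever distribution is chosen, the update and the density must agree on the same parameters, since the sampler relies on the density to reassign peaks to topics.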

Citing the code and data

To cite this module directly from the code, please use DueCredit. For example, if you have a script named run_gclda.py that uses functions from this package, simply run:

> python -m duecredit run_gclda.py

This will print a DueCredit report with relevant citations to your terminal. It will also create a file (.duecredit.p) containing those citations in the folder you run the script from.

If you prefer to compile citations by hand, please cite the following paper when referencing this code:

Additionally, the following paper demonstrates a variety of cool applications for gcLDA models trained on Neurosynth (such as "brain decoding"):

To reference any of the datasets contained in this repository, or Neurosynth itself:

Additionally, the complete Neurosynth datasets can be accessed at http://github.com/neurosynth/neurosynth-data (note, however, that those datasets need to be reformatted to work with the gclda package).

For additional details about Neurosynth please visit neurosynth.org.

Documentation

To generate documentation files:

sphinx-apidoc --separate -M -f -o doc/source/ gclda/ gclda/due.py gclda/version.py gclda/tests/
make html
