Large Scale Topic Exploitation for Decision Support Systems


CASE is a Solr-based exploitation tool designed to efficiently index metadata and topic information. It is optimized for calculating aggregated indicators and semantic similarities, and for serving web service requests.

The Solr-powered service is a multi-container application built around a Solr search engine for data storage and retrieval. A Python-based RESTful API (case-tm) acts as an intermediary between Solr and the user (or frontend). It relies on two additional services: case-inferencer, for text inference using the indexed models, and case-classifier, for classification. The case-tm API also provides endpoints for indexing collections and topic models.

🚀 Deployment Steps

  1. Prepare the Data Source
    Create a folder named data/source and place all the corpus and model information you wish to index into this directory.

  2. Create Docker Network
    Set up a Docker network named case_net using the following command:

    docker network create -d bridge case_net --subnet X.X.X.X/X --attachable
  3. Start the Services
    Launch the services using Docker Compose:

    docker-compose up -d
  4. Verify the Setup
    Ensure that the system is correctly initialized:

    • Access the Solr service, which should be available at http://your_server_name:20003/solr/#/.
    • Create a test collection using the case_config configset via the Solr interface. If the setup is successful, you may delete the test collection and proceed with indexing.
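You can also verify the deployment from the command line via Solr's standard Collections API; a minimal check, assuming the port mapping (20003) shown above:

    # List all collections hosted by the Solr instance (empty on a fresh install)
    curl "http://your_server_name:20003/solr/admin/collections?action=LIST"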

📂 Indexing

📚 Corpus Indexing

To index a corpus, the raw corpus must be present in the mounted volume "/data/source".

Then, to index the HFRI corpus, stored as "HFRI.parquet", proceed as depicted in the following image:

Logical corpus indexing example

This process creates a corpus collection named "hfri" in Solr. The collection includes all the metadata available in the Parquet file, as well as the lemmas used for topic modeling calculations ("all_lemmas"). To maintain consistency across all corpora indexed into Solr, the fields holding each corpus's identifier, title, and date are renamed to "id", "title", and "date", regardless of their original names. These field equivalences must be specified in the "case_config/config.cf" file prior to indexing, as illustrated below.
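As an illustration only, a field-equivalence block could look like the following INI-style sketch; the section and key names here are hypothetical, so check "case_config/config.cf" for the actual syntax expected by case-tm:

    # Hypothetical sketch of a field mapping (not the actual config.cf syntax)
    [hfri]
    id = projectID        # original identifier field, renamed to "id"
    title = projectTitle  # original title field, renamed to "title"
    date = startDate      # original date field, renamed to "date"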

During corpus indexing, an entry is also created in the "corpora" collection. This collection stores information about all the corpus collections indexed in the Solr instances, along with their indexed models.
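For reference, corpus indexing can also be triggered directly against the RESTful API. A minimal sketch, assuming the case-tm service is reachable at your_server_name on its mapped port; the parameter name corpus_logical_path is illustrative, so consult the case-tm API documentation for the exact signature:

    # Index the raw corpus placed at data/source/HFRI.parquet
    curl -X POST "http://your_server_name:<case-tm-port>/corpora/indexCorpus/?corpus_logical_path=HFRI.parquet"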

🧠 Model Indexing

To index a model, the following requirements must be met:

  1. The topic model must be present in the mounted volume "/data/source". This means a folder named after the model, containing at least the "TMmodel" folder and the training configuration file ("trainconfig.json").

  2. The model to be indexed must be associated with a corpus that has already been indexed into the Solr instance.

To index a model (e.g., a model named "HFRI-30"), follow the steps illustrated in the image below:

Model indexing example

This process creates a model collection named "hfri-30" in Solr. The collection includes all the metadata available in the model's "TMmodel" folder, namely, for each topic in the model: word distribution, size, entropy, coherence, number of active documents, topic description, labels, vocabulary, and coordinates in a 2D space.

Additionally, the corpus collection associated with the model is modified by adding two fields to each document that has a topical representation for that model:

  • "doctpc_{model_name}" contains the document-topic distribution given by the model with the name "model_name".
  • "sim_{model_name}" contains a list of the 50 most similar documents to the given document, according to the model with the name "model_name".

These additional fields are included in the corpus information within the "corpora" collection. Furthermore, the name of the model collection is added to the list of models associated with that corpus, as shown in the example below:

Collection corpora after indexing corpus and model
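As with corpora, model indexing can be invoked directly through the API. A minimal sketch under the same assumptions as before (host, port, and the parameter name model_name are illustrative):

    # Index the topic model placed at data/source/HFRI-30
    curl -X POST "http://your_server_name:<case-tm-port>/models/indexModel/?model_name=HFRI-30"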

🔗 Endpoints

📦 Collections

The endpoints in this category refer to generic Solr-related operations that, in principle, will only be used internally:

  • /collections/createCollection/: Creates a Solr collection.
  • /collections/deleteCollection/: Deletes a Solr collection.
  • /collections/listCollections/: Lists all collections available in the Solr instance.
  • /collections/query/: Performs a generic Solr query.
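For example, to inspect the running instance (host and port are placeholders, and the HTTP method is an assumption):

    # List every collection currently hosted in Solr
    curl "http://your_server_name:<case-tm-port>/collections/listCollections/"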

📚 Corpora

These endpoints perform corpora-related operations, that is, those related to the management, indexing, and listing of the linguistic datasets, or corpora, hosted in Solr:

  • /corpora/deleteCorpus/: Deletes an entire corpus collection from the system.
  • /corpora/indexCorpus/: Indexes a corpus in a Solr collection, using the logical corpus name as the collection identifier.
  • /corpora/listAllCorpus/: Lists all available corpus collections in the Solr instance.
  • /corpora/listCorpusModels/: Lists all models associated with a specific corpus previously indexed in Solr.
  • /corpora/listCorpusEWBdisplayed/: Lists the corpus metadata fields that will be displayed in the EWB frontend.
  • /corpora/listCorpusSearcheableFields/: Lists the corpus metadata fields enabled for semantic search.
  • /corpora/addSearcheableFields/: Adds metadata fields to a corpus, enabling them for semantic search.
  • /corpora/deleteSearcheableFields/: Removes specific metadata fields from a corpus, disabling them from semantic search.
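For example, to see which models have been indexed for the "hfri" corpus (the parameter name corpus_collection is illustrative):

    # List the models associated with the "hfri" corpus collection
    curl "http://your_server_name:<case-tm-port>/corpora/listCorpusModels/?corpus_collection=hfri"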

🧠 Models

These endpoints perform model-related operations, that is, those related to the management, indexing, and listing of topic models:

  • /models/deleteModel/: Deletes a model collection.
  • /models/indexModel/: Indexes the model information in a model collection and its corresponding corpus collection.
  • /models/listAllModels/: Lists all model collections available in the Solr instance.
  • /models/addRelevantTpcForUser/: Adds a topic's relevance information for a user to a model collection.
  • /models/removeRelevantTpcForUser/: Removes a topic's relevance information for a user from a model collection.
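For example (placeholders as before):

    # List every model collection available in the Solr instance
    curl "http://your_server_name:<case-tm-port>/models/listAllModels/"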

🛠️ Queries

| Endpoint | Description | Returns |
| --- | --- | --- |
| getThetasDocById | Retrieves the document-topic distribution of a selected document in a corpus collection for a given topic model | {"thetas": thetas} |
| getCorpusMetadataFields | Gets the available metadata fields for a specific corpus collection | {"metadata_fields": meta_fields} |
| getNrDocsColl | Gets the number of documents in a collection | {"ndocs": ndocs} |
| getDocsWithThetasLargerThanThr | Gets the documents with a topic proportion larger than a threshold according to a selected topic model | [{"id": id1, "doctpc_{model_name}": doctpc1}, {"id": id2, "doctpc_{model_name}": doctpc2}, ...] |
| getDocsWithHighSimWithDocByid | Retrieves the documents with a high semantic relationship to a selected document, i.e., its most similar documents | [{"id": id1, "score": score1}, {"id": id2, "score": score2}, ...] |
| getMetadataDocById | Gets the metadata of a selected document in a corpus collection | {"metadata1": metadata1, "metadata2": metadata2, "metadata3": metadata3, ...} |
| getDocsWithString | Retrieves the IDs of the documents whose title contains a given string in a corpus collection | [{"id": id1}, {"id": id2}, ...] |
| getTopicsLabels | Gets the labels associated with each topic in a given model | [{"id": id1, "tpc_labels": label1}, {"id": id2, "tpc_labels": label2}, ...] |
| getTopicTopDocs | Gets the top documents for a given topic in a model collection, ranked first by the thematic representation of the requested topic and second by the number of words in the document | [{"id": id1, "thetas": thetas1, "num_words_per_doc": num_words_per_doc1}, {"id": id2, "thetas": thetas2, "num_words_per_doc": num_words_per_doc2}, ...] |
| getModelInfo | Gets the information (topic description, label, statistics, top docs, etc.) for each topic in a model collection | [{"id": id1, "betas": betas1, "alphas": alphas1, "topic_entropy": entropies1, "topic_coherence": cohrs1, "ndocs_active": active1, "tpc_descriptions": desc1, "tpc_labels": labels1, "coords": coords1, "top_words_betas": top_words_betas1}, {"id": id2, ...}, ...] |
| getBetasTopicById | Gets the word distribution of a selected topic in a model collection | {"betas": betas} |
| getMostCorrelatedTopics | Gets the topics most correlated with a given topic in a selected model | [{"id": id1, "betas": betas1}, {"id": id2, "betas": betas2}, ...] |
| getPairsOfDocsWithHighSim | Retrieves pairs of documents with a semantic similarity larger than a given threshold in a given topic model, filtered by year | [{"id_1": id1, "id_2": id2, "score": score1}, {"id_1": id3, "id_2": id4, "score": score2}, ...] |
| getDocsSimilarToFreeText | Gets the documents semantically similar to a free text according to a given topic model | [{"id": id1, "score": score1}, {"id": id2, "score": score2}, ...] |
| getLemmasDocById | Retrieves the lemmas of a selected document in a corpus collection | {"lemmas": lemmas} |
| getThetasAndDateAllDocs | Gets the date and the document-topic representation associated with a given model for all documents in a corpus collection | [{"id": id1, "date": date1, "doctpc_{model_name}": doctpc1}, {"id": id2, "date": date2, "doctpc_{model_name}": doctpc2}, ...] |
| getBetasByWordAndTopicId | Gets the topic-word distribution of a given word in a given topic associated with a given model | {"betas": betas} |
| getBOWbyDocsIDs | Gets the BoW counts of given words in a document | {"id": id, "payload(bow,word)": count} |
| getUserRelevantTopics | Gets the topics marked as relevant by a given user in a model collection | Same format as getModelInfo |
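As an end-to-end illustration, a query such as getThetasDocById can be issued from any HTTP client. The route prefix and parameter names below are illustrative, since the table above only lists endpoint names; consult the case-tm API documentation for the exact routes:

    # Document-topic distribution of document "doc1" in corpus "hfri" under model "hfri-30"
    curl "http://your_server_name:<case-tm-port>/queries/getThetasDocById/?corpus_collection=hfri&doc_id=doc1&model_name=hfri-30"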

✍️ Cite us

If you find CASE useful in your work, we'd greatly appreciate it if you could cite us!

@inproceedings{calvo-bartolome-etal-2025-case,
    title = "{CASE}: Large Scale Topic Exploitation for Decision Support Systems",
    author = "Calvo Bartolom{\'e}, Lorena  and
      Arenas-Garc{\'i}a, Jer{\'o}nimo  and
      P{\'e}rez Fern{\'a}ndez, David",
    editor = "Rambow, Owen  and
      Wanner, Leo  and
      Apidianaki, Marianna  and
      Al-Khalifa, Hend  and
      Eugenio, Barbara Di  and
      Schockaert, Steven  and
      Mather, Brodie  and
      Dras, Mark",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-demos.15/",
    pages = "151--162"
}

If you run into any issues or have questions, don't hesitate to reach out!

🤝 Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101004870 (IntelComp project), and from Grant TED2021-132366B-I00 funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.
