Docker image packaging for TextnConceptsRetrieval.
TextnConceptsRetrieval is a retrieval system based on SOLR search platform. It indexes raw text, as well as concepts derived from it. For retrieval, it uses text (monolingual retrieval) and concepts (crosslingual retrieval). It is a RESTful web service using Spring. It uses IXA pipes and DBpedia Spotlight to get concepts from the text.
The contents of the module are the following:
+ COPYING License file
+ Dockerfile File containing all the commands to build a Docker image
+ docs4indexing Empty directory in which to place your documents to be indexed (this directory will be mounted in the docker image)
+ index Directory to store your Solr cores. It contains a already created 'textnconcepts' empty index.
+ ixa-pipes Directory to place ixa-pipes tools and their models
+ pom.xml Maven pom file which deals with everything related to compilation and execution of the web service module
+ scripts Different scripts that will be executed when running a docker container
+ src/ Java source code of the web service module
+ Furthermore, the installation process, as described in the README.md, will generate another directory:
target/ it contains a jar executable file and other directories
- 1.1. Install JDK 1.8
- 1.2. Install MAVEN 3
- 1.3. Get and compile module source code
- 1.4. Get the IXA pipes tools
- 1.5. Install Docker
- 1.6. Build Docker image
- 2.1. Run the Docker image
- 2.2. Check API documentaion
- 2.3. Index a single document
- 2.4. Index all the documents in a directory
- 2.5. Retrieve documents
- 2.6. Get concepts
If you want to build the application from source, follow the next steps.
If not, you can use the already built image in docker hub. Pull the docker image and jump to the USAGE section.
docker pull antxa/textnconceptsretrieval
If you do not install JDK 1.8 in a default location, you will probably need to configure the PATH in .bashrc or .bash_profile:
export JAVA_HOME=/yourpath/local/java18
export PATH=${JAVA_HOME}/bin:${PATH}
If you use tcsh you will need to specify it in your .login as follows:
setenv JAVA_HOME /usr/java/java18
setenv PATH ${JAVA_HOME}/bin:${PATH}
If you re-login into your shell and run the command
java -version
you should now see that your JDK is 1.8.
Download MAVEN 3 from
wget http://www.apache.org/dyn/closer.cgi/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz
Now you need to configure the PATH. For Bash Shell:
export MAVEN_HOME=/home/ragerri/local/apache-maven-3.0.4
export PATH=${MAVEN_HOME}/bin:${PATH}
For tcsh shell:
setenv MAVEN3_HOME ~/local/apache-maven-3.0.4
setenv PATH ${MAVEN3}/bin:{PATH}
If you re-login into your shell and run the command
mvn -version
git clone https://github.com/antxa/TextnConceptsRetrieval.git
cd TextnConceptsRetrieval
mvn clean package
In order to get the concepts from the query and documents, IXA pipes tools are used; namely ixa-pipe-tok, ixa-pipe-pos and ixa-pipe-wikify.
Place the jar files (and models or other files if needed) of these tools in the corresponding subdirectory under 'ixa-pipes' folder. Follow the instructions in their respectives github readmes.
Update the name of the jar and models' files in all the scripts used for running these tools located in 'scripts/ixapipes/' folder.
Follow the instructions in https://www.docker.com/
docker build . -t antxa/textnconceptsretrieval
Use one of the following commands:
docker run -v $PWD/index:/opt/solr/server/solr/mycores -v $PWD/docs4indexing:/docs4indexing -it -p 8080:8080 -p 8983:8983 antxa/textnconceptsretrieval start-bash
or
docker run -v $PWD/index:/opt/solr/server/solr/mycores -v $PWD/docs4indexing:/docs4indexing -it -p 8080:8080 -p 8983:8983 antxa/textnconceptsretrieval start
Using one of the above commands, we create a container and start some processes on it. Note that:
- We mount the 'index' and 'docs4indexing' folders in the Docker container. The owner of the 'index' folder has to be 8983 to be accesible by Solr.
- As two DBpedia Spotlight servers are launched inside the container (one for English and another one for Spanish), at least 14GB of RAM memory are needed.
The difference between these two commands is the last argument ('start-bash' or 'start'). Using the first one, after initializing the container, a bash prompt into a running docker will be available.
Now Solr server is up listening on port 8983. You can access the Solr admin interface and the browse interface of the already created 'textnconcepts' index using your browser following these links:
http://localhost:8983/solr/#/textnconcepts
http://localhost:8983/solr/textnconcepts/browse
Also RESTful web service is running listening on port 8080.
The RESTful API is documented using Swagger.
Once the docker image is running, you can check the API documentation using the user interface or the JSON API following the links below:
http://localhost:8080/swagger-ui.html
http://localhost:8080/v2/api-docs
index
POST HTTP request indexes a single document sent in the
request body.
To index a document (text and concepts will be indexed), you just need to pass a plain text file in the body of the request ('doc' parameter), and specify the document id and the language ('en' or 'es') as follows:
curl -F "doc=@file.txt" localhost:8080/index?"id=id1&lang=en"
Check all the available parameters for index
request in the Swagger
UI.
Remember that you can check the updated index information on the Solr admin interface: http://localhost:8983/solr/#/textnconcepts
indexdir
POST HTTP request indexes documents located in the
'docs4indexing' directory.
IMPORTANT The documents to be indexed (files under 'docs4indexing' directory) have to have read permissions (rw-rw-r--).
You can index all the documents (text and concepts will be indexed) under 'docs4indexing' directory like this:
curl localhost:8080/indexdir?"docsDir=docs4indexing&lang=en"
'docsDir' and 'lang' parameters are mandatory (optional values for
language are 'en' or 'es'). Check all the rest parameters for
indexdir
request in the Swagger
UI.
Remember that you can check the updated index information on the Solr admin interface: http://localhost:8983/solr/#/textnconcepts
query
POST HTTP request retrieves documents given a query. For
monolingual retrieval, the query will be the content of the plain text
file sent in the request body. For crosslingual retrieval, concepts
extracted from the text will be added to the text query.
To retrieve documents, you just need to pass a plain text file in the body of the request ('qfile' parameter), and specify the language ('en' or 'es') and the type of retrieval ('mono' or 'cross') as follows:
curl -F "qfile=@query.txt" localhost:8080/query?"lang=en&type=cross"
If you want, you can specify the number of documents to be retrieved using 'ndocs' parameter as follows:
curl -F "qfile=@query.txt" localhost:8080/query?"lang=en&type=cross&ndocs=100"
Check all the available parameters for query
request in the Swagger
UI.
concepts
POST HTTP request returns a list of concepts derived from
the document sent in the request body.
To get concepts, you just need to pass a plain text file in the body of the request ('doc' parameter) and specify the language ('en' or 'es') as follows:
curl -F "doc=@file.txt" localhost:8080/concepts?lang=en
Check all the available parameters for concepts
request in the Swagger
UI.
Arantxa Otegi
IXA NLP Group
University of the Basque Country (UPV/EHU)
arantza.otegi@ehu.eus