Duplicated / Missing index entries #1133

Closed
marcosmro opened this issue Jun 15, 2021 · 3 comments

@marcosmro (Member)

I've noticed that some CDE fields are duplicated in our Elasticsearch index (e.g., @id = https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a), while others are missing. I'm trying to find the cause of this issue.

@marcosmro marcosmro added the bug label Jun 15, 2021
@marcosmro marcosmro self-assigned this Jun 15, 2021
@marcosmro marcosmro added the CDE label Jun 15, 2021
@marcosmro (Member Author)

It seems that the missing entries were caused by the disk running low on space. When that happens, Elasticsearch puts its indices into read-only mode and CEDAR logs the following warning:

WARN  [2021-06-15 10:13:52,866] org.metadatacenter.cedar.util.dw.CedarCedarExceptionMapper: :CCEM:msg :blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
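
For anyone hitting the same FORBIDDEN/12 error, here is a quick way to check which indices carry the block. This is a minimal diagnostic sketch using the elasticsearch-py client; the host URL is an assumption:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# On a disk-full cluster, affected indices typically show
# "read_only_allow_delete": "true" under their "blocks" settings.
settings = es.indices.get_settings(index="_all")
for index_name, cfg in settings.items():
    blocks = cfg["settings"]["index"].get("blocks", {})
    if blocks:
        print(index_name, blocks)
```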

@marcosmro (Member Author)

We fixed the missing-entries issue by truncating the log_cypher table and clearing the read_only_allow_delete flag in Elasticsearch as follows (setting it to null resets the block):

```
PUT /_all/_settings
{
    "index.blocks.read_only_allow_delete": null
}
```

Our short-term plan to address the disk space issue is described at #1134.

@marcosmro (Member Author)

marcosmro commented Sep 27, 2021

Here is a brief description of the issue that caused duplicated index entries and the approach used to fix it:

The caDSR CDEs ingestion tool uses two different endpoints to upload CDEs to CEDAR:

  • E1: Create a CDE. It creates a CDE in MongoDB, Neo4j, and the associated Elasticsearch index entry.
  • E2: Attach the CDE created in E1 to one or several categories. It creates the relationships between the CDE and the categories in Neo4j, and updates the existing Elasticsearch index entry created in E1 to include the corresponding category identifiers.

This issue was related to the index update performed in E2. Now, focusing on the Elasticsearch index, here are the actions taken to create a CDE:

a. Create a CDE index entry (without any associated categories) (done by E1).
b. Find the existing index entry for the CDE (search by CEDAR id) (done by E2).
c. Delete the existing index entry for the CDE using the index identifier (done by E2).
d. Create a new index entry for the CDE, including the category identifiers (done by E2).
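
For illustration, here is a minimal sketch of steps b) through d) using the elasticsearch-py client; the host, index name, and field names are assumptions, not CEDAR's actual code:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster
INDEX = "cedar-search"                       # index name is an assumption

def find_index_entry(cedar_id):
    """Step b: look up the CDE's index entry by its CEDAR id."""
    resp = es.search(index=INDEX, body={
        "query": {"term": {"cid": cedar_id}}  # field name 'cid' is an assumption
    })
    hits = resp["hits"]["hits"]
    return hits[0] if hits else None

def update_entry_with_categories(cedar_id, categories):
    """Steps c and d: delete the old entry, then create a new one
    that includes the category identifiers."""
    entry = find_index_entry(cedar_id)
    if entry is not None:                        # None if not yet searchable
        es.delete(index=INDEX, id=entry["_id"])  # step c
    es.index(index=INDEX, body={                 # step d
        "cid": cedar_id,
        "categories": categories,
    })
```

Note that if the entry created in a) is not yet searchable, find_index_entry returns None, step c) deletes nothing, and step d) still creates a second document: exactly the duplication described next.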

The index ends up with duplicate entries for a CDE when there is not enough time between a) and b). These steps happen sequentially but, in Elasticsearch, there is a delay (by default, 1 second) between the time a document is created and the time it becomes visible (searchable). Therefore, when b) happens within 1 second of a), the index entry associated with the CDE won't be found and won't be deleted. Consequently, after E2, the index will contain two indexed documents for a given CDE: one without the categories and one with them.

Elasticsearch offers ways both to force an index refresh and to decrease the refresh interval, but refreshing an index consumes considerable resources, so neither action is recommended.
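
For reference, both knobs are standard Elasticsearch index APIs. A hedged sketch with the elasticsearch-py client (host and index name are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Force an immediate refresh so just-indexed documents become searchable:
es.indices.refresh(index="cedar-search")

# Or lower the per-index refresh interval (the default is "1s"):
es.indices.put_settings(index="cedar-search",
                        body={"index": {"refresh_interval": "100ms"}})
```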

The approach used to solve this issue is twofold:

  • Create in batches: Instead of creating CDEs one by one and immediately associating categories with them, they are now processed in fixed-size batches (100 CDEs/batch). Since creating 100 CDEs always takes more than 1 second, we can be sure that the CDEs are findable (and deletable) in the index before associating them with their categories.
  • Retry the deletion: In rare cases an index entry may still not be searchable in time, for example when the total number of CDEs leaves a final batch of size 1 (e.g., 301 CDEs yield batches of 100, 100, 100, and 1). To ensure that no stale entry is ever left behind, we wait and retry the deletion several times. Both parts of the approach are sketched below.
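
A minimal sketch of both mitigations, continuing the hypothetical sketch above (es, INDEX, and find_index_entry as defined there); create_cde and attach_to_categories are hypothetical helpers standing in for calls to E1 and E2, and the retry parameters are assumptions:

```python
import time

BATCH_SIZE = 100       # the batch size used in the fix
MAX_RETRIES = 5        # hypothetical retry budget
RETRY_WAIT_SECS = 2    # longer than the 1s default refresh interval

def delete_index_entry_with_retry(cedar_id):
    """Retry the find-and-delete of steps b/c until the entry is searchable."""
    for _ in range(MAX_RETRIES):
        entry = find_index_entry(cedar_id)           # step b (sketch above)
        if entry is not None:
            es.delete(index=INDEX, id=entry["_id"])  # step c
            return
        time.sleep(RETRY_WAIT_SECS)                  # wait for the next refresh
    raise RuntimeError(f"Index entry for {cedar_id} never became searchable")

def ingest(cdes):
    """Create CDEs in fixed-size batches, then attach categories per batch."""
    for start in range(0, len(cdes), BATCH_SIZE):
        batch = cdes[start:start + BATCH_SIZE]
        for cde in batch:
            create_cde(cde)            # E1 (hypothetical helper)
        # Creating a full batch takes well over 1 second, so its entries
        # are searchable (and deletable) by the time E2 runs for it.
        for cde in batch:
            attach_to_categories(cde)  # E2 (hypothetical helper)
```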

An obvious alternative to the described approach would be to develop a new endpoint that does the work of E1 and E2 in a single call, that is, creates the CDE, associates it with its categories, and creates the corresponding index entry once, at the end of the process.
