Duplicated / Missing index entries #1133

Closed
marcosmro opened this issue Jun 15, 2021 · 3 comments

@marcosmro (Member)

I've noticed that some CDE fields are duplicated in our Elasticsearch index (e.g., @id = https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a), while others are missing. I'm trying to find the cause of this issue.

@marcosmro marcosmro added the bug label Jun 15, 2021
@marcosmro marcosmro self-assigned this Jun 15, 2021
@marcosmro marcosmro added the CDE label Jun 15, 2021
@marcosmro (Member Author)

It seems that the missing entries were caused by the disk running low on space. When that happens, Elasticsearch puts its indices into read-only mode and CEDAR logs the following warning:

WARN  [2021-06-15 10:13:52,866] org.metadatacenter.cedar.util.dw.CedarCedarExceptionMapper: :CCEM:msg :blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
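
For anyone hitting the same FORBIDDEN/12 error, here is a quick way to check which indices carry the block. This is a minimal diagnostic sketch using the elasticsearch-py client; the host URL is an assumption:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# On a disk-full cluster, affected indices typically show
# "read_only_allow_delete": "true" under their "blocks" settings.
settings = es.indices.get_settings(index="_all")
for index_name, cfg in settings.items():
    blocks = cfg["settings"]["index"].get("blocks", {})
    if blocks:
        print(index_name, blocks)
```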

@marcosmro (Member Author)

We fixed the missing-entries issue by truncating the log_cypher table and clearing the read_only_allow_delete flag in Elasticsearch as follows (setting it to null resets the block):

```
PUT /_all/_settings
{
    "index.blocks.read_only_allow_delete": null
}
```

Our short-term plan to address the disk space issue is described at #1134.

@marcosmro (Member Author)

marcosmro commented Sep 27, 2021

Here is a brief description of the issue that caused duplicated index entries and the approach used to fix it:

The caDSR CDEs ingestion tool uses two different endpoints to upload CDEs to CEDAR:

  • E1: Create a CDE. It creates a CDE in MongoDB, Neo4j, and the associated Elasticsearch index entry.
  • E2: Attach the CDE created in E1 to one or several categories. It creates the relationships between the CDE and the categories in Neo4j, and updates the existing Elasticsearch index entry created in E1 to include the corresponding category identifiers.

This issue was related to the index update performed in E2. Now, focusing on the Elasticsearch index, here are the actions taken to create a CDE:

a. Create a CDE index entry (without any associated categories) (done by E1).
b. Find the existing index entry for the CDE (search by CEDAR id) (done by E2).
c. Delete the existing index entry for the CDE using the index identifier (done by E2).
d. Create a new index entry for the CDE, including the category identifiers (done by E2).
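
For illustration, here is a minimal sketch of steps b) through d) using the elasticsearch-py client; the host, index name, and field names are assumptions, not CEDAR's actual code:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster
INDEX = "cedar-search"                       # index name is an assumption

def find_index_entry(cedar_id):
    """Step b: look up the CDE's index entry by its CEDAR id."""
    resp = es.search(index=INDEX, body={
        "query": {"term": {"cid": cedar_id}}  # field name 'cid' is an assumption
    })
    hits = resp["hits"]["hits"]
    return hits[0] if hits else None

def update_entry_with_categories(cedar_id, categories):
    """Steps c and d: delete the old entry, then create a new one
    that includes the category identifiers."""
    entry = find_index_entry(cedar_id)
    if entry is not None:                        # None if not yet searchable
        es.delete(index=INDEX, id=entry["_id"])  # step c
    es.index(index=INDEX, body={                 # step d
        "cid": cedar_id,
        "categories": categories,
    })
```

Note that if the entry created in a) is not yet searchable, find_index_entry returns None, step c) deletes nothing, and step d) still creates a second document: exactly the duplication described next.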

The index ends up with duplicate entries for a CDE when there is not enough time between a) and b). These steps happen sequentially but, in Elasticsearch, there is a delay (by default, 1 second) between the time a document is created and the time it becomes visible (searchable). Therefore, when b) happens within 1 second of a), the index entry associated with the CDE won't be found and won't be deleted. Consequently, after E2, the index will contain two indexed documents for a given CDE: one without the categories and one with them.

Elasticsearch offers ways both to force an index refresh and to decrease the refresh interval, but refreshing an index consumes considerable resources, so neither action is recommended.
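
For reference, both knobs are standard Elasticsearch index APIs. A hedged sketch with the elasticsearch-py client (host and index name are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Force an immediate refresh so just-indexed documents become searchable:
es.indices.refresh(index="cedar-search")

# Or lower the per-index refresh interval (the default is "1s"):
es.indices.put_settings(index="cedar-search",
                        body={"index": {"refresh_interval": "100ms"}})
```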

The approach used to solve this issue is twofold:

  • Create in batches: Instead of creating CDEs one by one and immediately associating categories with them, they are now processed in fixed-size batches (100 CDEs/batch). Since creating 100 CDEs always takes more than 1 second, we can be sure that the CDEs are findable (and deletable) in the index before associating them with their categories.
  • Retry the deletion: In rare cases an index entry may still not be searchable in time, for example when the total number of CDEs leaves a final batch of size 1 (e.g., 301 CDEs yield batches of 100, 100, 100, and 1). To ensure that no stale entry is ever left behind, we wait and retry the deletion several times. Both parts of the approach are sketched below.
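
A minimal sketch of both mitigations, continuing the hypothetical sketch above (es, INDEX, and find_index_entry as defined there); create_cde and attach_to_categories are hypothetical helpers standing in for calls to E1 and E2, and the retry parameters are assumptions:

```python
import time

BATCH_SIZE = 100       # the batch size used in the fix
MAX_RETRIES = 5        # hypothetical retry budget
RETRY_WAIT_SECS = 2    # longer than the 1s default refresh interval

def delete_index_entry_with_retry(cedar_id):
    """Retry the find-and-delete of steps b/c until the entry is searchable."""
    for _ in range(MAX_RETRIES):
        entry = find_index_entry(cedar_id)           # step b (sketch above)
        if entry is not None:
            es.delete(index=INDEX, id=entry["_id"])  # step c
            return
        time.sleep(RETRY_WAIT_SECS)                  # wait for the next refresh
    raise RuntimeError(f"Index entry for {cedar_id} never became searchable")

def ingest(cdes):
    """Create CDEs in fixed-size batches, then attach categories per batch."""
    for start in range(0, len(cdes), BATCH_SIZE):
        batch = cdes[start:start + BATCH_SIZE]
        for cde in batch:
            create_cde(cde)            # E1 (hypothetical helper)
        # Creating a full batch takes well over 1 second, so its entries
        # are searchable (and deletable) by the time E2 runs for it.
        for cde in batch:
            attach_to_categories(cde)  # E2 (hypothetical helper)
```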

An obvious alternative to the described approach would be to develop a new endpoint that does the work of E1 and E2 in a single call, that is, creates the CDE, associates it with its categories, and creates the corresponding index entry once, at the end of the process.
