[Bug]: Standalone SIGABRT: tantivy OpenReadError (FileDoesNotExist("meta.json")) #39585

Open
1 task done
ThreadDao opened this issue Jan 24, 2025 · 8 comments
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-ci
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): go sdk
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The PR's Go SDK CI test run failed:

[2025-01-24T10:45:39Z INFO  tantivy::directory::file_watcher] Meta file "/var/lib/milvus/data/indexnode/index_files/455531513553853686/1/meta.json" was modified
[2025-01-24T10:45:39Z INFO  tantivy::directory::file_watcher] Meta file "/var/lib/milvus/data/indexnode/index_files/455531513553853646/1/meta.json" was modified
[2025-01-24T10:45:39Z ERROR tantivy::reader] Error while loading searcher after commit was detected. OpenReadError(FileDoesNotExist("meta.json"))
terminate called after throwing an instance of 'milvus::SegcoreError'
  what():   => remove local directory:/var/lib/milvus/data/indexnode/index_files/455531513553853646/1/ failed, error: Directory not empty, files: /var/lib/milvus/data/indexnode/index_files/455531513553853646/1/.tantivy-meta.lock at /workspace/source/internal/core/src/storage/LocalChunkManager.cpp:228

SIGABRT: abort
PC=0x7f2df9e419fc m=662 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 245337 gp=0xc001dbbdc0 m=662 mp=0xc0209df808 [syscall, locked to thread]:
non-Go function
pc=0x7f2df9e419fc
non-Go function
pc=0x7f2df9ded475
non-Go function
pc=0x7f2df9dd37f2
non-Go function
pc=0x7f2df9c21b9d
non-Go function
pc=0x7f2df9c2d20b

  • server pods:
ms-39579-3-go-pr-etcd-0                                     1/1     Running            0               15m     10.104.29.87    4am-node35   <none>           <none>

2025-01-24T10:47:45Z {container="step-check-status"} ms-39579-3-go-pr-milvus-standalone-5ccf8b647b-dnpqk         1/1     Running            3 (2m ago)      15m     10.104.31.192   4am-node34   <none>           <none>

2025-01-24T10:47:45Z {container="step-check-status"} ms-39579-3-go-pr-minio-85454bb9f-znjpp                      1/1     Running            0               15m     10.104.29.85    4am-node35   <none>           <none>

Loki logs

Expected Behavior

No response

Steps To Reproduce

Milvus Log

No response

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2025
@ThreadDao ThreadDao added this to the 2.6.0 milestone Jan 24, 2025
@xiaofan-luan
Collaborator

It seems to be a file-not-found error?

@xiaofan-luan
Collaborator

/assign @SpadeA-Tang
Please help with it.

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 26, 2025
@SpadeA-Tang
Contributor

It seems there's a race condition between LocalChunkManager::RemoveDir and Tantivy's InnerIndexReader::reload.
Tantivy has a watcher on "meta.json"; when the file changes, Tantivy reloads the index reader.
So, my conjecture is that:
t1: Tantivy: meta.json is changed -> the file watcher detects it, so an index reader reload is triggered.
t2: Segcore: LocalChunkManager::RemoveDir is called concurrently.
t3: Tantivy: for the index reader reload, it acquires the meta lock (creating a file named ".tantivy-meta.lock"). By this time, "meta.json" has been deleted by LocalChunkManager, so it logs:
[2025-01-24T10:45:39Z ERROR tantivy::reader] Error while loading searcher after commit was detected. OpenReadError(FileDoesNotExist("meta.json"))
t4: Segcore: RemoveDir returns the error "Directory not empty", I guess because of the concurrent creation of ".tantivy-meta.lock", which triggers the panic.
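
A minimal sketch of the t1-t4 interleaving, replayed deterministically in a single thread. This is only an illustration of the conjecture, not Milvus or Tantivy code: `remove_dir`, `reader_reload`, and `simulate_race` are hypothetical names, and the removal is assumed to delete the listed entries first and the directory itself last.

```python
import errno
import os
import tempfile

# File name taken from the "Directory not empty" error in the log above.
LOCK_NAME = ".tantivy-meta.lock"


def remove_dir(path, before_rmdir=None):
    """Hypothetical stand-in for a directory removal: delete the entries
    seen in one listing, then remove the directory itself."""
    for name in os.listdir(path):            # snapshot of the current entries
        os.remove(os.path.join(path, name))  # meta.json disappears here
    if before_rmdir is not None:
        before_rmdir()                       # t3: the concurrent reader runs
    os.rmdir(path)                           # t4: fails if a new file appeared


def simulate_race():
    index_dir = tempfile.mkdtemp()
    meta_path = os.path.join(index_dir, "meta.json")
    lock_path = os.path.join(index_dir, LOCK_NAME)
    with open(meta_path, "w") as f:
        f.write("{}")

    observed = {"errno": None}

    def reader_reload():
        # The reload takes its lock by creating the lock file ...
        open(lock_path, "w").close()
        # ... and then finds that meta.json is already gone.
        observed["meta_exists"] = os.path.exists(meta_path)

    try:
        remove_dir(index_dir, before_rmdir=reader_reload)
    except OSError as err:
        observed["errno"] = err.errno        # "Directory not empty"
    finally:
        # Clean up whatever the failed removal left behind.
        if os.path.exists(lock_path):
            os.remove(lock_path)
        if os.path.isdir(index_dir):
            os.rmdir(index_dir)
    return observed
```

Calling `simulate_race()` shows both symptoms from the log at once: the reader no longer sees "meta.json", and the final `rmdir` fails because the lock file was created after the directory listing was taken.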

If my conjecture is right, the question becomes why LocalChunkManager removes the directory while the index reader is still alive.
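
One way to picture the ordering that would avoid this, as a rough sketch only: defer the directory removal until every reader has been released. `GuardedIndexDir` and its methods are hypothetical and are not part of Milvus or Tantivy; the actual fix may look quite different.

```python
import os
import shutil
import tempfile
import threading


class GuardedIndexDir:
    """Hypothetical guard: the directory is removed only after every
    reader that was opened on it has been released."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()
        self._readers = 0
        self._remove_requested = False

    def acquire_reader(self):
        with self._lock:
            if self._remove_requested:
                raise RuntimeError("directory is scheduled for removal")
            self._readers += 1

    def release_reader(self):
        with self._lock:
            self._readers -= 1
            self._remove_if_idle()

    def request_remove(self):
        with self._lock:
            self._remove_requested = True
            self._remove_if_idle()

    def _remove_if_idle(self):
        # The caller must hold self._lock.
        idle = self._remove_requested and self._readers == 0
        if idle and os.path.isdir(self.path):
            shutil.rmtree(self.path)


def demo():
    index_dir = tempfile.mkdtemp()
    open(os.path.join(index_dir, "meta.json"), "w").close()

    guard = GuardedIndexDir(index_dir)
    guard.acquire_reader()
    guard.request_remove()                    # deferred: a reader is alive
    exists_while_open = os.path.isdir(index_dir)
    guard.release_reader()                    # last reader gone -> removed
    exists_after_release = os.path.exists(index_dir)
    return exists_while_open, exists_after_release
```

With this ordering, no reload can create a lock file in the directory while it is being removed, because the removal only starts once no reader is left.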

@xiaofan-luan
Collaborator

@SpadeA-Tang
maybe related to #39471

@SpadeA-Tang
Contributor

What's the commit hash of the cluster that panicked? @ThreadDao

@SpadeA-Tang
Contributor

I just noticed that this panic was recorded after the fix in #39471. I think the root cause is similar to that one.

@ThreadDao
Contributor Author

@SpadeA-Tang I will try again after PR #39253 is merged.

@ThreadDao
Contributor Author

It seems that the issue has not reappeared.
