
Better understanding of function names/class names #709

Open
last-partizan opened this issue Oct 9, 2024 · 3 comments

@last-partizan
Contributor

last-partizan commented Oct 9, 2024

Thanks for this project, it looks really promising.

I just started using it, and here's what I found; the example is this repo:

> gt 'data_file_path' --context 0 --max-results 3
─────────────────────────
File: seagoat/utils/server.py
─────────────────────────
def _get_server_data_file_path() -> Path:
    path = _get_server_data_file_path()
    write_to_json_file(_get_server_data_file_path(), servers_info)

But when I split the name into words, it cannot find the function.

> gt 'get server data file path' --context 0 --max-results 3
────────────────────────────────────────────────
File: seagoat/server.py
────────────────────────────────────────────────
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
────────────────────────────────────────────────
File: docs/server.md
───────────────────────────────────────────
called `cacheLocation` which contains the path to the cache directory for
each different type of cache associated with that project.

That's probably my main use case: to find something without knowing the exact name. I'm happy to help with fixing this, doing some research, or writing patches.

Maybe you have an idea how to improve this? I see there's issue #354, about trying different models. Probably some other code-search-oriented model can improve this.

My absolutely non-AI-related guess: when encountering snake_case or SomeOtherCase names, convert them to normal words and let the index use those. But code-search-related models are probably already doing this...
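To illustrate the idea, here is a minimal sketch of such an identifier splitter. The function name and regex are my own for illustration, not anything from SeaGoat's codebase:

```python
import re

def split_identifier(name: str) -> str:
    """Turn snake_case / CamelCase identifiers into space-separated words."""
    # Insert a space at CamelCase boundaries: "SomeOtherCase" -> "Some Other Case"
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Replace underscores with spaces, collapse whitespace, normalize case
    return " ".join(spaced.replace("_", " ").split()).lower()

print(split_identifier("_get_server_data_file_path"))  # -> "get server data file path"
print(split_identifier("SomeOtherCase"))               # -> "some other case"
```

Indexing this normalized form alongside the raw identifier would let the query "get server data file path" match `_get_server_data_file_path` even without a smarter embedding model.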

@last-partizan
Contributor Author

Oh, well.

It can use other embedding functions. I tried some ollama models with different levels of success, and then WordLlama.

chroma-core/chroma#2925

It looks promising, and it's lightning-fast even on a laptop with an AMD Ryzen 5 4500U.

@kantord
Owner

kantord commented Oct 16, 2024

I think what is going on here is that the sorting mechanism is not perfect; it is not based on an actual understanding of the query and the results.

Like you say, using a better embedding model can help with this, as the semantic distance is one of the main criteria for sorting. So we should definitely experiment with that.
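As a rough illustration of why the embedding model matters for ranking, here is a toy sketch of distance-based sorting. The vectors are made up; a real model would produce high-dimensional embeddings, and SeaGoat combines this with other sorting criteria:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: smaller means semantically closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Toy embeddings: a better model places the query closer to the right chunk.
query = [1.0, 0.2, 0.0]
chunks = {
    "_get_server_data_file_path": [0.9, 0.3, 0.1],
    "logging.basicConfig(...)":   [0.1, 0.9, 0.4],
}
ranked = sorted(chunks, key=lambda name: cosine_distance(query, chunks[name]))
print(ranked[0])  # the semantically closest chunk sorts first
```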

Another thing to add is that I believe there are different potential ways of using the tool. For instance, in your use case the result would probably have shown up somewhere towards the top of the list, but not at the very top. This could be improved by also using an LLM to understand the query: for instance, with a RAG workflow we could gather a list of results that fits into the context limit of a local ollama model, and use that model to formulate the final answer. The upside would be that you don't have to "manually" peruse several lines of results to find what you are looking for. The downside is that the model could hallucinate or format the answer incorrectly, which could be addressed by some validation.
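A sketch of the context-packing step of such a RAG workflow, with the actual model call omitted. The function name and character-based budget are my own simplifications (a real implementation would count tokens, not characters):

```python
def build_rag_prompt(query: str, results: list[str], max_chars: int = 2000) -> str:
    """Pack as many top-ranked results as fit into the model's context
    budget, producing a prompt for a local LLM (sketch only)."""
    header = f"Question: {query}\nRelevant code snippets:\n"
    packed = []
    used = len(header)
    for snippet in results:            # results are already sorted by relevance
        if used + len(snippet) + 1 > max_chars:
            break                      # stay inside the local model's context limit
        packed.append(snippet)
        used += len(snippet) + 1
    return header + "\n".join(packed)

# The returned prompt would then be sent to a local model, e.g. via ollama;
# only the packing logic is shown here.
```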

Yet another thing is to improve the chunking: currently we use the actual code lines (based on some heuristic to ignore irrelevant lines) as well as the file names to create the embeddings. Instead of this, we could use a generative model to actually understand the function of each code line and add additional context to the embedding. This should be fairly simple to do, but would greatly slow down the chunking process. But if we actually have a faster model now, it would be a good time to experiment with it.
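The enrichment step could look roughly like this. The generative model is stubbed out with a plain function here, since the point is only the shape of the pipeline, not any specific model:

```python
from typing import Callable

def enrich_chunk(code_line: str, describe: Callable[[str], str]) -> str:
    """Prepend a model-generated description to a code line, so the
    embedding captures intent as well as syntax (sketch only)."""
    return f"{describe(code_line)}\n{code_line}"

# Stub standing in for a (slow) generative model call:
fake_describe = lambda line: "returns the path of the server data file"

text_to_embed = enrich_chunk("def _get_server_data_file_path() -> Path:", fake_describe)
print(text_to_embed)
```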

@last-partizan
Copy link
Contributor Author

> Instead of this, we could use a generative model to actually understand the function of the code line and add additional context to the embedding.

I was thinking that embedding function is supposed to do this.

Maybe it could be achieved by using larger chunks (probably functions/classes or some other top-level structures) with a model like this?

https://huggingface.co/jinaai/jina-embeddings-v2-base-code
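For Python sources, splitting a module into function/class-level chunks is straightforward with the standard-library `ast` module; a minimal sketch (the embedding step itself is omitted, and SeaGoat may chunk differently):

```python
import ast

def top_level_chunks(source: str) -> list[str]:
    """Extract top-level functions and classes as whole-unit chunks."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)  # exact source text of the node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

source = '''\
def _get_server_data_file_path():
    return "path"

class Server:
    pass
'''
chunks = top_level_chunks(source)
print(len(chunks))  # 2 chunks: the function and the class
```

Each chunk would then be passed to the embedding model as one unit, giving the model the whole function body as context instead of individual lines.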
