The biggest interest I have in this is that I would like to have the ability to ask questions about large code-bases. I think being able to generate small functions or explain single code sections is nice, but being able to ask bigger architectural questions would be really helpful for all kinds of engineers (in particular in a large company).
I have seen approaches with merging context across multiple levels. But that can only do so much. Is it viable to fine-tune a model on a specific code-base so it has knowledge across all files? Does anyone have more info on this kind of problem space?
Steve Yegge's recent blog posts claim that Sourcegraph is getting pretty good results by using embeddings created from their knowledge graph of the code structure. That's still the usual [create embeddings, search against an embedding of the query, retrieve results and use them as the prompt] schlep, so yeah, it isn't really understanding architecture well yet.
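For anyone who hasn't seen that loop spelled out: here's a toy sketch of the schlep. The bag-of-words "embedding" here is just a stand-in for a real learned embedding model, and the chunk strings are made-up examples, not anything from Sourcegraph's actual pipeline.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: create embeddings for each code chunk, once, at index time.
chunks = [
    "def connect(host, port): open a tcp socket to the database",
    "class UserRepo: crud operations backed by the users table",
    "def render_sidebar(items): build the html for the navigation sidebar",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(query, k=2):
    """Step 2: search the index against the embedding of the query, keep top-k."""
    q = embed(query)
    return [c for c, v in sorted(index, key=lambda cv: cosine(q, cv[1]), reverse=True)[:k]]

def build_prompt(query):
    """Step 3: stuff the retrieved chunks into the prompt as context."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The model only ever sees whatever chunks won the similarity search, which is exactly why this doesn't add up to architectural understanding: nothing relates the chunks to each other.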
I too have a job where almost every question is about structural understanding and improvement of a large existing codebase. I'd love to have AI help, but I think it's going to take another iteration or three of model architecture to get there.
I also want this and was excited by that blog post too
So far the Sourcegraph product ("Cody") is rather underwhelming; it doesn't seem to make deep use of the project context, whereas the blog post seemed to say that Sourcegraph's special sauce would make that possible.
The results seem quite similar to Copilot Chat's. In both cases they seem to basically stuff the currently focused file as context to your prompt, and the results are no better than if you did the same with ChatGPT, and look worse because they're cramped into a VS Code sidebar.
Hey, I'm the Sourcegraph CTO. Appreciate the critical feedback here. I suspect that Cody is using "keyword context", which is our fallback when we don't have an index on Sourcegraph that Cody can use (we need to do a better job of conveying when this happens to users). Would you mind sending me a screenshot of Cody not doing a good job of answering a question / fetching the wrong context? You can email me at beyang@sourcegraph.com or DM me on Twitter (https://twitter.com/beyang).
I was hoping it would just tap into VS Code's knowledge of my project structure and index what it needed to automatically
From what I saw on the Discord after getting my invite, people were making requests in the chat for specific github repos to get indexed... is that how it works? So for projects with dependencies which have been indexed I might get better results? And I need to get my own project indexed too?
Fine-tuning usually means updating the weights of what's called a foundation model with well-structured and numerous data. It can be expensive, but more importantly it can disturb the usefulness of having all the generalizations baked in from the original training data [1].
While LLMs can generate code based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. It's just very lossy. Perhaps fine-tuning on a single code base wouldn't be the best use of them right now.
Can you please share more about the merging context across levels? This sounds interesting!
Right now the solution is vector databases; however, we could envision a different state representation in the transformer decoder, which is the main component of a GPT. For example, you could summarize your architecture, tests, and implementation with compressed / smaller vectors for each piece and organize that stuff in a tree structure, then just concatenate the tree to the context and user query. It'd require you to rewrite the multi-head attention function or make a wrapper, and it'd add an ETL step to create the tree, but then you could have that whole compressed representation of your codebase available when you ask a question. It would necessarily be an abstraction and not a verbatim copy of the code, otherwise you'd run out of room. Funny how everything runs into Kolmogorov complexity eventually.
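The ETL-and-tree part of that idea can be sketched without any transformer changes: build a tree of per-piece summaries and serialize it in front of the query. Everything here is hypothetical: `summarize` just truncates as a stand-in for a real learned compression step, and the `codebase` entries are invented examples.

```python
def summarize(text, budget=40):
    """Stand-in for a real compression/summarization step: just truncate."""
    return text[:budget]

# Hypothetical tree: architecture, tests, and implementation summarized per piece.
codebase = {
    "architecture": "Service layer calls repositories; repositories wrap SQL",
    "tests": {
        "unit": "Pure-function tests for parsing and validation",
        "integration": "Spin up a temp database and exercise the repositories",
    },
    "implementation": {
        "auth": "JWT issuance and verification in auth.py",
        "billing": "Payment webhooks handled in billing.py",
    },
}

def flatten(node, depth=0):
    """Serialize the summary tree depth-first so it can be prepended to a prompt."""
    lines = []
    for name, child in node.items():
        if isinstance(child, dict):
            lines.append("  " * depth + f"{name}/")
            lines.extend(flatten(child, depth + 1))
        else:
            lines.append("  " * depth + f"{name}: {summarize(child)}")
    return lines

def build_context(query):
    """Concatenate the whole compressed tree with the user query."""
    return "\n".join(flatten(codebase)) + "\n\nQuestion: " + query
```

The interesting (and hard) part the comment points at is making the summaries compressed *vectors* rather than text, which is where you'd actually have to touch the attention mechanism; this sketch only shows the plain-text version of the same shape.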