The biggest interest I have in this is that I would like to have the ability to ask questions about large code-bases. I think being able to generate small functions or explain single code sections is nice, but being able to ask bigger architectural questions would be really helpful for all kinds of engineers (in particular in a large company).
I have seen approaches with merging context across multiple levels. But that can only do so much. Is it viable to fine-tune a model on a specific code-base so it has knowledge across all files? Does anyone have more info on this kind of problem space?
Steve Yegge's recent blog posts claim that Sourcegraph is getting pretty good results by using embeddings created from their knowledge graph of the code structure. That's still the usual [create embeddings, search against an embedding of the query, retrieve results and use them as the prompt] schlep, so yeah, it isn't really understanding architecture well yet.
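For anyone who hasn't seen that loop spelled out: here's a toy sketch of the schlep. The bag-of-words "embedding" here is just a stand-in for a real learned embedding model, and the chunk strings are made-up examples, not anything from Sourcegraph's actual pipeline.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: create embeddings for each code chunk, once, at index time.
chunks = [
    "def connect(host, port): open a tcp socket to the database",
    "class UserRepo: crud operations backed by the users table",
    "def render_sidebar(items): build the html for the navigation sidebar",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(query, k=2):
    """Step 2: search the index against the embedding of the query, keep top-k."""
    q = embed(query)
    return [c for c, v in sorted(index, key=lambda cv: cosine(q, cv[1]), reverse=True)[:k]]

def build_prompt(query):
    """Step 3: stuff the retrieved chunks into the prompt as context."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The model only ever sees whatever chunks won the similarity search, which is exactly why this doesn't add up to architectural understanding: nothing relates the chunks to each other.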
I too have a job where almost every question is about structural understanding and improvement of a large existing codebase. I'd love to have AI help, but I think it's going to take another iteration or three of model architecture to get there.
I also want this and was excited by that blog post too
So far the Sourcegraph product ("Cody") is rather underwhelming; it doesn't seem to make deep use of the project context, whereas the blog post seemed to say that Sourcegraph's special sauce would make that possible.
The results seem quite similar to Copilot Chat's. In both cases they seem to basically stuff the currently focused file as context to your prompt, and the results are no better than if you did the same with ChatGPT, and look worse because they're cramped into a VS Code sidebar.
Hey, I'm the Sourcegraph CTO. Appreciate the critical feedback here. I suspect that Cody is using "keyword context", which is our fallback when we don't have an index on Sourcegraph that Cody can use (we need to do a better job of conveying when this happens to users). Would you mind sending me a screenshot of Cody not doing a good job of answering a question / fetching the wrong context? You can email me at beyang@sourcegraph.com or DM me on Twitter (https://twitter.com/beyang).
I was hoping it would just tap into VS Code's knowledge of my project structure and index what it needed to automatically
From what I saw on the Discord after getting my invite, people were making requests in the chat for specific github repos to get indexed... is that how it works? So for projects with dependencies which have been indexed I might get better results? And I need to get my own project indexed too?
Fine-tuning usually means updating the weights of what's called a foundation model with well-structured and numerous data. It can be expensive, but more importantly it can disturb the usefulness of having all the generalizations baked in from the original training data [1].
While LLMs can generate code based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. It's just very lossy. Perhaps fine-tuning on a single code base wouldn't be the best use of them right now.
Can you please share more about the merging context across levels? This sounds interesting!
Right now the solution is vector databases; however, we could envision a different state representation in the transformer decoder, which is the main component of a GPT. For example, you could summarize your architecture, tests, and implementation with compressed / smaller vectors for each piece and organize that stuff in a tree structure, then just concatenate the tree to the context and user query. It'd require you to rewrite the multi-head attention function or make a wrapper, and it'd add an ETL step to create the tree, but then you could have that whole compressed representation of your codebase available when you ask a question. It would necessarily be an abstraction and not a verbatim copy of the code, otherwise you'd run out of room. Funny how everything runs into Kolmogorov complexity eventually.
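The ETL-and-tree part of that idea can be sketched without any transformer changes: build a tree of per-piece summaries and serialize it in front of the query. Everything here is hypothetical: `summarize` just truncates as a stand-in for a real learned compression step, and the `codebase` entries are invented examples.

```python
def summarize(text, budget=40):
    """Stand-in for a real compression/summarization step: just truncate."""
    return text[:budget]

# Hypothetical tree: architecture, tests, and implementation summarized per piece.
codebase = {
    "architecture": "Service layer calls repositories; repositories wrap SQL",
    "tests": {
        "unit": "Pure-function tests for parsing and validation",
        "integration": "Spin up a temp database and exercise the repositories",
    },
    "implementation": {
        "auth": "JWT issuance and verification in auth.py",
        "billing": "Payment webhooks handled in billing.py",
    },
}

def flatten(node, depth=0):
    """Serialize the summary tree depth-first so it can be prepended to a prompt."""
    lines = []
    for name, child in node.items():
        if isinstance(child, dict):
            lines.append("  " * depth + f"{name}/")
            lines.extend(flatten(child, depth + 1))
        else:
            lines.append("  " * depth + f"{name}: {summarize(child)}")
    return lines

def build_context(query):
    """Concatenate the whole compressed tree with the user query."""
    return "\n".join(flatten(codebase)) + "\n\nQuestion: " + query
```

The interesting (and hard) part the comment points at is making the summaries compressed *vectors* rather than text, which is where you'd actually have to touch the attention mechanism; this sketch only shows the plain-text version of the same shape.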