Ask HN: AI to study my DSL and then output it?
70 points by onesphere on April 19, 2023 | 24 comments
Ideally I want to contain and run LLM output written in my domain-specific language, but it seems I would need to fine-tune existing models. What’s the easiest online or local solution?

How can I automatically generate: a broad array of security tests; the most efficient code; the most readable and extensible code?



There are a couple different approaches:

- Use multi-shot prompting with something like guardrails to try prompting a commercial model until it works. [1]

- Use a local model with a final layer that steers token selection towards syntactically valid tokens [2]

[1] https://github.com/ShreyaR/guardrails

[2] "Structural Alignment: Modifying Transformers (like GPT) to Follow a JSON Schema" @ https://github.com/newhouseb/clownfish (full disclosure: this is my work)


Regarding [2], dang, I am working on exactly this! I mean, it's not that novel of a technique once you start controlling the sampling process directly, but you beat me to the punch.

This technique generalizes to pretty much any grammar one can specify. I weakly hypothesize that by making it impossible for the LM to output syntactically invalid text, the model's task performance improves not just because all of its outputs are valid, but also because part of the model's "processing power" gets "rerouted" from tracking the grammar it's writing toward reasoning about the task itself.


Nice! I've been wondering whether you could eke out more intelligence through methods like these; to quote the end of my write-up:

> Does structured decoding increase the observability of emergent world models in these models? To make an analogy: I may not represent an opinion of how something works if I am not confident in it, but if I am forced to present an opinion we might find out that I in fact have (or have not) grasped something.

In practice, however, without tight integration with beam search, the autoregressive nature of these models means that syntactic steering can cause a model to rabbit-hole itself, since it lacks forward-looking visibility into what the defined grammar will require. E.g., if it were forced to choose between "Don't jump" and "Do run" in some hypothetical grammar, the tokens it would actually be deciding between are "Don't" and "Do", with no idea what will end up being syntactically required after them.
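
A toy way to see the fix: instead of greedily committing at the first token, score each complete grammar-legal continuation by its total log-probability. This is a crude stand-in for beam search, and only workable when the candidates are enumerable:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def sequence_logprob(text: str) -> float:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Log-probability of each token given the preceding ones.
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        picked = logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
        return picked.sum().item()

    # Greedy constrained decoding would commit to "Don't" vs "Do" blindly;
    # scoring the full candidates compares where each actually ends up.
    candidates = ["Don't jump.", "Do run."]
    print(max(candidates, key=sequence_logprob))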


There are actually a few papers already on constrained decoding. I won’t link them, but if you go on arXiv and really look you will find a couple from the past year.


What if the output schema were something like instruction code? Just get rid of the need for programming languages altogether.


Spends years working on an AI solution to problems caused by using postgres as a KV store. That's quite a branch.


I like that you use a local model to start off; why switch to OpenAI for tokenization?


The code supports both local models and OpenAI as backends. I added OpenAI because their models are still miles better than anything I can run locally (even 65B LLaMA).


Honestly ChatGPT has worked well for things like this in my experience. If you can fit enough examples within a prompt, you may not need anything special.
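
For example, a minimal few-shot setup with the OpenAI chat API as it existed at the time; the DSL example pairs are invented for illustration:

    import openai

    # Hypothetical (English request, DSL output) pairs for your language.
    EXAMPLES = [
        ("draw a red box", "shape(box, color=red)"),
        ("draw two circles", "shape(circle, count=2)"),
    ]

    def generate(request: str) -> str:
        messages = [{"role": "system",
                     "content": "You translate English into MyDSL. "
                                "Respond with MyDSL code only."}]
        for english, dsl in EXAMPLES:
            messages.append({"role": "user", "content": english})
            messages.append({"role": "assistant", "content": dsl})
        messages.append({"role": "user", "content": request})
        response = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                                messages=messages)
        return response.choices[0].message.content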


LLMs like GPT-4 'natively' speak certain syntaxes very well - e.g. Python, JSON. I'd suggest you take advantage of that if at all possible, rather than embark on training or fine-tuning your own LLM.

If you want the LLM to generate or manipulate a particular data structure that isn't well represented in the training set, consider writing a translator: convert it into a format the LLM natively 'speaks', use the LLM on that, and then translate back into your DSL.

Combining this with examples in some sort of vector store, as others have suggested, could work well.
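
A sketch of what that round-trip might look like; to_json, from_json, and ask_llm are all hypothetical stand-ins you'd implement for your own DSL and LLM backend:

    import json

    def to_json(dsl_source: str) -> dict:
        ...  # parse your DSL into a plain dict

    def from_json(doc: dict) -> str:
        ...  # serialize the dict back into DSL source

    def ask_llm(prompt: str) -> str:
        ...  # e.g. an OpenAI chat completion

    def edit_program(dsl_source: str, instruction: str) -> str:
        # Let the LLM work in JSON, a format it speaks natively.
        doc = to_json(dsl_source)
        prompt = (f"Here is a program as JSON:\n{json.dumps(doc, indent=2)}\n"
                  f"Apply this change and return only the updated JSON:\n"
                  f"{instruction}")
        updated = json.loads(ask_llm(prompt))
        return from_json(updated)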


The best answer, by far, would be ChatGPT and GPT4 with some well-written prompts.

I'd be super impressed if any other approach worked as well and would fall under the category of "easy". Keep us updated on what you go with!


See https://huggingface.co/blog/codeparrot for some idea of how to train a code generator.
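
For a much smaller-scale version of the same idea, here's roughly what fine-tuning an off-the-shelf causal LM on a corpus of DSL samples looks like with the Hugging Face Trainer. dsl_corpus.txt is a hypothetical file of example programs; the CodeParrot post trains from scratch on far more data:

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # One DSL program per line in a hypothetical text file.
    dataset = load_dataset("text", data_files="dsl_corpus.txt")["train"]
    dataset = dataset.map(
        lambda x: tokenizer(x["text"], truncation=True, max_length=512),
        remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="dsl-model", num_train_epochs=3),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()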


On https://flowchart.fun I found that I got better overall results by asking GPT for an intermediate syntax that it was less likely to mess up (and that was easier for me to parse), and then parsing and transforming that syntax into my DSL. The relevant code: https://github.com/tone-row/flowchart-fun/blob/main/api/prom...
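
The rough shape of the trick, with an invented "A -> B" line format standing in for the real intermediate syntax (this is not flowchart.fun's actual code):

    def intermediate_to_dsl(intermediate: str) -> str:
        # Mechanically transform the simple format the model is unlikely
        # to get wrong into the target DSL; "edge(...)" is hypothetical.
        out = []
        for raw in intermediate.splitlines():
            if "->" in raw:
                source, target = (p.strip() for p in raw.split("->", 1))
                out.append(f"edge({source}, {target})")
        return "\n".join(out)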


We have a similar issue - we have a domain-specific schema that we want GPT4 to author SQL for. The challenge for us is that a full explanation of everything in the schema absolutely blows out the token limits.

Right now, we are playing around with the idea of using a classification layer to detect which schema elements are likely involved, and then dynamically including explanations for those elements in the final prompt.
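
One plausible way to implement that layer is embedding similarity rather than a trained classifier: embed each schema element's description once, then include only the closest matches in the prompt. The snippet uses the OpenAI embeddings endpoint of the time, and SCHEMA_DOCS is an invented stand-in for the real schema documentation:

    import numpy as np
    import openai

    SCHEMA_DOCS = {"orders": "Table of customer orders...",
                   "users": "Table of registered users..."}

    def embed(text: str) -> np.ndarray:
        result = openai.Embedding.create(model="text-embedding-ada-002",
                                         input=text)
        return np.array(result["data"][0]["embedding"])

    # Precompute one vector per schema element.
    DOC_VECTORS = {name: embed(doc) for name, doc in SCHEMA_DOCS.items()}

    def relevant_schema(question: str, k: int = 5) -> str:
        # ada-002 vectors are normalized, so dot product ~ cosine similarity.
        q = embed(question)
        ranked = sorted(SCHEMA_DOCS,
                        key=lambda name: -np.dot(DOC_VECTORS[name], q))
        return "\n".join(SCHEMA_DOCS[name] for name in ranked[:k])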

Our attempts at fine-tuning ended after about two weeks of struggling. I don't think it's viable for a certain range of domain-specific tasks.


I've had good success teaching GPT4 a language interactively: provide documentation and examples, then ask it to generate examples of increasing complexity and correct it when it's wrong.

See previous comment here: https://news.ycombinator.com/item?id=35447368


This DSL might suspend instead of halt. Your comment got me thinking about using LLMs to generate new language grammars.

EDIT: is suspension halting?


LangChain with a vector store of examples of your DSL. https://python.langchain.com/en/latest/modules/indexes/vecto...
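
Concretely, with LangChain's API of the time, that might look like the following (the example pairs are invented):

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.prompts import FewShotPromptTemplate, PromptTemplate
    from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
    from langchain.vectorstores import FAISS

    examples = [
        {"english": "draw a red box", "dsl": "shape(box, color=red)"},
        {"english": "connect a to b", "dsl": "edge(a, b)"},
    ]

    # Store examples in a FAISS vector store; pull the k most similar
    # ones into each prompt.
    selector = SemanticSimilarityExampleSelector.from_examples(
        examples, OpenAIEmbeddings(), FAISS, k=2)

    prompt = FewShotPromptTemplate(
        example_selector=selector,
        example_prompt=PromptTemplate(input_variables=["english", "dsl"],
                                      template="{english}\n{dsl}"),
        prefix="Translate English to MyDSL:",
        suffix="{input}\n",
        input_variables=["input"],
    )
    print(prompt.format(input="draw two circles"))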


What have you tried so far?


I’m somewhere between thinking that a prompt won’t be enough to get it to think deeply/expertly about a limited subject, and realizing I don’t know my weight decays from my gradients.

What I want to do is train on some amount of inputted documentation, sample code, and maybe even the interpreter's implementation source, and then ask it: “Generate lots of instructions to gain elevated access.” Or maybe even: “Generate a social media widget site.” But of course, in the given language.


Maybe I'm looking for too specific a definition. I've been considering https://en.wikipedia.org/wiki/PaLM but am currently trying to find its pretraining dataset. Edit: "The API will first be available to a limited number of developers who join a waitlist before being opened to the public"

Implementation of PaLM in Elemental (I guess?): https://thetaplane.com/ai/palm


This is very interesting.

I’m still noodling on how to send a full page screenshot to a model and get it to return the individual images (or the bounds of them) in the page.



txtai accomplished a similar task by fine-tuning a very small T5 model. Here's a notebook with usage samples (the training code should be somewhere nearby):

https://github.com/neuml/txtai/blob/master/examples/33_Query...


AI today is not intelligent; it is just a sophisticated generator using patterns it was trained on.




