It is kind of sad that Monaco (VS Code's editor as a JS web component) still doe...

debugnik · on May 1, 2023

Tree-sitter grammars take more effort than textmate ones for the most basic highlighting, but provide less accurate parse trees than a compiler could use for more advanced features and highlighting. I understand what they were going for but I don't "get" the trade-off.

At least as long as the parse trees are inherently worse: Even ignoring context-dependent grammars, last time I checked tree-sitter parsers had many edge cases in which they'd accept invalid syntax but not mark the parse trees as bad, which makes them unsuitable for compiler use.

DannyBee · on May 1, 2023

Using compilers for highlighting basically doesn't happen because they are too slow. They are used for progressive improvement as a result (depending how fast, it can take at least 5-10 seconds post-update to get a result. For slower languages, minutes).

I'm not sure what "more effort" means - for lots of languages, it's less effort than textmate because people spend lots of hacks trying to work around textmate to get it to highlight even slightly complex languages right. It's nesting mechanism is complicated, etc.

Tree-sitter is faster than textmate on initial parse in lots of cases, and update a lot faster than textmate (since they can be incrementally updated and textmate can't).

They are not meant as compiler replacements.

Again, compilers are slow, and rarely incremental. Using tree sitter for folding/highlighting/etc, and compilers for additional progressive type info, is the right path.

Textmate is worse than useless for anything but super simple languages.

I say this as a person who has written about 15 vscode language servers for all sorts of varied languages, from CNC g-code to you name it.

I even implemented incremental parsing/lexing for ANTLR's typescript port (and then java port), as a comparison to tree-sitter ;) It can be made as fast, but tree-sitter is clearly winning the current popularity contest ;)

I'd say for those languages, textmate work was the worst part of all of them, and getting textmate to highlight properly required more work than writing proper lexers/parsers. I still fix more bugs in textmate grammars than anywhere else.

Part of the reason is that it requires fairly complex regexps. Part of it is getting it to stop trigger on things it shouldn't, because you have to emulate a lot of state. Again, you start to spend a lot of time working around it, rather than working with it. Meanwhile, yes, the upfront work of a lexer in tree-sitter is sometimes more. But it is much easier to get it to work, to rely on it working, etc.

In the end, all my language servers were easier to implement in tree-sitter, and some don't even have compilers (like g-code), but have control flow/blocks/etc. tree-sitter handled all language duties there.

Hopefully this helps you "get" the tradeoff.

Anyone who has written enough vscode language servers would tell you the same thing. I have yet to meet someone in the community who writes language servers who is excited or happy with textmate vs tree-sitter.

Where do you think the push came from?

debugnik · on May 1, 2023

First, I specifically said that textmate takes less effort for "the most basic highlighting", that would be keywords, literals, comments, etc. Trying to encode a CFG in it would obviously be painful, but why would you? Use an actual parser at that point, but you might as well use a correct one.

Second, there's this idea that compiler parsers are inherently slow and require a full build to use or something. As if tree-sitter parsers were using some magic tech unavailable to compilers, and most new ones weren't modular or didn't offer APIs. It's not even hypothetical, increasingly more modern languages have semantic highlighting in real-time as part of the toolchain.

And in fact, I actually wanted to use that tech in a compiler, I said I tried to use tree-sitter for one, but that's where my big issue with tree-sitter came in: It would return parse trees that blatantly didn't match the grammar I wrote, and not mark them as bad syntax in any way. That's not error recovery, that's swallowing errors. If it does that, I'd rather work on making toolchain parsers better than fighting tree-sitter.

Even the "run the parser on the editor" part is solvable, since tree-sitter parsers are being shipped as WASM. A WASM interface for running possibly incremental parsers (that tree-sitter would also use) sounds more interesting, really.

DannyBee · on May 2, 2023

There's a lot here, and i'mn not sure where to start. You are complaining about a lot of things simultaneously, and it's hard to tell exactly which point you want to tackle.

For correctness - gonna leave this alone, i don't know of issues here and it seems fairly orthogonal to speed which is where we started. I've done parsing and lexing for decades now, including multiple compilers. So i'll try again - for magic tech, tree-sitter does actually have "magic tech" unavailable to most compilers, and i explained exactly what it was - but you simply want to deride it for some reason - not obvious to me.

the magic tech is that

1. tree-sitter ast/lex tokens (for purposes of incremental lexing/parsing) are compiler/language agnostic. it does not care. This is not true of most compilers

2. it allows incremental lexing/parsing when you keep those ast/tokens under the text of your editor. It only needs to be informed of what edit was made (add/change/remove of a range of characters is fine) and have random access to the text. This is not true of most compilers.

3. The incremental lexing/parsing is extremely fast. It is optimal in the amount that has to be re-lexed/parsed, which for both is bounded by the amount of max used lookahead in the grammar, which for most languages, is really small (again, not the theoretical lookahead, but the actual-used-during-lex/parse lookahead). Not true of most compilers.

Can you design compilers that can do this? Sure, i was at IBM when we did visualage, which could do this in other ways, though not optimally. As mentioned, i've even implemented the optimal incremental lexing/parsing algorithms in other parser/lexer generators, and even some compilers.

Is it common? Not a chance.

Your claim that increasingly more modern languages have semantic highlighting in real time "built in" is simply false - most cannot real time semantic highlight even a 100k file on every keystroke. There are a very small number which can, and it's mostly by luck - they fall down on larger files or other complex cases because they have no incrementality. clangd is about the closest you get. rust-analyzer definitely can't. etc Lexing has always been really fast, and doable in a single pass, that has never been the problem.

Meanwhile, what i just described (updated highlighting on ever keystroke) is trivial with tree-sitter for all languages because of its optimality and agnostism. If you want to see your claim in action - turn on semantic highlighting for vscode and type fast - most of the time you will lose syntax highlighting because things can't keep up. see, e.g., https://github.com/microsoft/vscode-extension-samples/issues...

Again, yes, you could make a compiler for every language which supports a mode that does what tree-sitter does. and people could carefully implement parsers/lexers in each of their favorite compilers and languages that support optimal incremental parsing in them. and then pay the cost of integrating 50 language specific ways of transforming these to work with the editor, basically reinventing what tree-sitter already did right. (As per above, the current semantic token and highlighting support does not resolve this, and the interface it provides can't support it, last i looked).

You seem to really otherwise be complaining that you believe it does always generate correct parsers. Like I said, that seems totally orthogonal to anything about the speed issue, and if that's your real concern, have at it.

debugnik · on May 2, 2023

> You are complaining about a lot of things simultaneously, and it's hard to tell exactly which point you want to tackle.

Thanks, that's exactly how I felt about your first reply.

> tree-sitter does actually have "magic tech" unavailable to most compilers

No, not using it doesn't mean it's unavailable, at best it means the toolchain is up for an upgrade.

> the magic tech is that

I too have written compilers and language servers, even tested novel parsing techniques. I don't need tree-sitter explained to me, I understand the theory behind it perfectly well and, as I already said, understand what they're going for in terms of ecosystem. I just don't think it's the sweet-spot you think it is.

In any case, making incremental edits is not novel, it's easy even with good old recursive descent by simply traversing the ast, skipping the prefix of the change, reparsing the inner node and checking that the edit was well-bounded (hopefully your language doesn't have multiline strings, or you'll have to accept that some incremental changes consume the whole file in the worst case anyway). I've done this, and again, if some toolchain doesn't then it's up for an upgrade.

The actually interesting parts of tree-sitter, to me, are the error recovery and the shared parse tree interface. The latter is orthogonal to the parsing technique, yet with tree-sitter it's all or nothing. The former would be great if it actually marked the error in the parse tree, which it didn't when I checked and it wasn't deemed important.

> most cannot real time semantic highlight even a 100k file on every keystroke

Even you concede that the bottleneck here is not usually parsing, and even then I just said it can be made incremental easily, but all the other work that the compiler is choosing to do before answering back to the editor. When this work is unnecessary (at this step anyway), then sure, you can speed past them by doing the work yourself on the tree-sitter parse tree.

> you could make a compiler for every language which supports a mode that does what tree-sitter does

Yes, we could. Maintaining and testing tree-sitter grammars also has a cost, maybe the difference is that compiler maintainers are externalising this cost right now.

> and then pay the cost of integrating 50 language specific ways of transforming these to work with the editor

No! We went through this with LSP. We can have any number of parsers with a shared driver interface, maybe like tree-sitter's, but we don't need to push every grammar through a single parser toolkit that isn't good enough to replace the existing parser in the toolchain.

In fact, tree-sitter grammars are shipped precompiled to editors, so that's kinda the situation already, but their current interface AFAIK expects a particular implementation on the parser (theirs).

> You seem to really otherwise be complaining that you believe it does always generate correct parsers

I guess you mean doesn't. Then yes, but that's not an "otherwise", that's a big deal, because if tree-sitter parse trees were actually sound I could just maintain a single tree-sitter grammar for the entire toolchain, compiler and editor, and this debate would be redundant.

eklitzke · on May 1, 2023

>Using compilers for highlighting basically doesn't happen because they are too slow.

This is not completely accurate, as clangd does support this: https://github.com/llvm/llvm-project/blob/main/clang-tools-e...

Getting a syntax tree from clangd is faster than you might think because it just needs to hook into the lexing/parsing portion of the compiler, which is actually pretty fast and much, much faster than actually compiling code. And generating fixes and warnings using clang-tidy (which can be built as a standalone binary, but is typically just linked into clangd) requires parsing and lexing anyway, so the parse information is basically "free" since it's work clangd needs to do regardless. And at least for C/C++ you can't really do accurate parsing without the compiler anyway due to the fact that preprocessor macros can do nearly anything. Even if your preprocessor macros are generally doing sane things and could be parsed without the compiler, most people want to see things like #if/#else/#endif branches highlighted so they can see which code paths the compiler is actually going to select when it compiles code, and again that requires actually running a full preprocessor with the same flags as the code will use when it is compiled.

That being said, I believe that most editors do a hybrid approach to lexing/parsing because getting syntax information entirely from clangd is too much to ask. So basically the editor will do a best-effort parse in real time of your code using some built-in syntax highlighting engine (possibly tree-sitter, possibly something else) and then once clangd sends syntax information they'll update the document as necessary.

DannyBee · on May 2, 2023

I mean, it depends on how much code it has to lex/parse, but yes, it's pretty fast but not realtime.

That is usually the goal for editors, or at least, what is tried to achieve.

It can't be done every keystroke, it can't be done close to every keystroke. For clangd it's dependent on the size of includes/etc, though it does a reasonable job of caching :)

But it's not hard to out-type it.

"That being said, I believe that most editors do a hybrid approach to lexing/parsing because getting syntax information entirely from clangd is too much to ask. So basically the editor will do a best-effort parse in real time of your code using some built-in syntax highlighting engine (possibly tree-sitter, possibly something else) and then once clangd sends syntax information they'll update the document as necessary."

Which is precisely what i said most of them should do. It's just that starting with "textmate" basically makes no sense when things like tree-sitter are available.

duped · on May 1, 2023

Tree sitter grammars generate a complete/concrete syntax tree with incremental updates, they necessarily have to be able to represent invalid source text (because users will type it!).

If you're using a tree-sitter grammar as the input to a compiler (which is not what its designed for) you need to write a pass to convert from CST to AST, as you normally would.

debugnik · on May 1, 2023

No, I specifically said "accept invalid syntax but not mark the parse trees as bad". Tree-sitter is supposed to accept invalid syntax, yes, but also tag the nodes as such. And at the time it seemed upstream didn't intend to fix the edge cases where it fails to do so, it just gave syntax trees that didn't match the grammar.