The data looks like it should compress pretty well. If you use something like btrfs's transparent compression, I wouldn't be surprised if it all fit in less than 0.75TB of disk space while still being usable to any tool that expects uncompressed data.
Edit: It looks like some of this data is already compressed, so maybe not.
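If you want to sanity-check how compressible a given shard actually is before committing to a filesystem setup, you can compress a sample of one of the uncompressed jsonl files and compare sizes. A minimal sketch - the path is just a placeholder and the ratio will vary a lot by source:

    import zstandard as zstd  # pip install zstandard

    # Placeholder path - substitute any uncompressed .jsonl shard from the dataset.
    path = "wikipedia/wiki.jsonl"

    with open(path, "rb") as f:
        raw = f.read(64 * 1024 * 1024)  # sample the first 64 MiB

    compressed = zstd.ZstdCompressor(level=3).compress(raw)
    print(f"sample compression ratio: {len(raw) / len(compressed):.2f}x")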
Note that you also need about 5TB of disk for the full decompressed dataset. However, only the Common Crawl portions are compressed as jsonl.zst; everything else is uncompressed jsonl.
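You don't necessarily need to decompress the Common Crawl shards to disk, though - the zstandard package can stream-decompress them on the fly. Rough sketch, with the filename as a placeholder and the "text" field name assumed:

    import io
    import json
    import zstandard as zstd  # pip install zstandard

    # Placeholder filename - any of the common_crawl *.jsonl.zst shards.
    path = "common_crawl/2023-06/example.jsonl.zst"

    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            doc = json.loads(line)
            # "text" is assumed to be the document field; adjust to the actual schema.
            print(doc.get("text", "")[:200])
            break  # just peek at the first record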
I am a little concerned that they have only about 60% of the code tokens (GitHub and Stack Exchange). Given that so far the only concrete use case I have for LLMs is coding assistance, I wouldn't want this open-source model to be any lower quality in that area.
In your opinion do you think this will hamper the model at all? Or is it still more than enough to get good coding assistance?
Nice catch! We sampled the GitHub dataset to match the total # of tokens seen by LLaMA during training: ~64B tokens (they only pass through 0.64 of their total GitHub dataset, according to the paper). We have a lot of GitHub data and will make it available soon. Note, we also have not built this for compute-optimal training. We are following LLaMA's lead and are training on more data for longer to optimize for quality, not compute.
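For anyone curious what that kind of subsampling looks like in practice, the idea is just to take documents from shuffled shards until you hit a token budget. This is not the actual RedPajama pipeline - just a toy illustration; the whitespace "tokenizer", file layout, and "text" field are all stand-ins:

    import json
    import random

    TOKEN_BUDGET = 64_000_000_000  # ~64B tokens, matching the figure above

    def count_tokens(text: str) -> int:
        # Stand-in tokenizer: whitespace split. A real pipeline would use the
        # model's BPE tokenizer, which gives different counts.
        return len(text.split())

    def sample_to_budget(paths, budget=TOKEN_BUDGET, seed=0):
        """Yield documents from shuffled jsonl shards until the token budget is hit."""
        random.seed(seed)
        paths = list(paths)
        random.shuffle(paths)
        total = 0
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    doc = json.loads(line)
                    n = count_tokens(doc.get("text", ""))
                    if total + n > budget:
                        return
                    total += n
                    yield doc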
Thank you for developing the pipeline and amassing considerable compute for gathering and preprocessing this dataset!
I'm not sure if this is the right place to ask, but could you consider training an LLM using a more advanced, sparse transformer architecture (specifically, "Terraformer" from this paper https://arxiv.org/abs/2111.12763 and this codebase https://github.com/google/trax/blob/master/trax/models/resea... by Google Brain and OpenAI)? I understand the pressure to focus on training a straightforward LLaMA replication, but of course you can see that it's a legacy dense architecture, which limits its inference performance. This new architecture is not just an academic curiosity but has already been validated at scale by Google, providing a 10x+ inference performance boost on the same hardware.
Frankly, the community's compute budget - for training and for inference - isn't infinite, and neither is the public's interest in models that don't have an advantage (at least in convenience) over closed-source ones; so we should use both of those resources as efficiently as possible. It could be a big step forward if you trained at least LLaMA-Terraformer-7B and 13B foundation models on the whole dataset.
Very good to hear that you are optimizing for inference rather than training.
I've tried llama and its various instruction-tuned siblings and have yet to get performance equivalent to gpt-3.5 on coding tasks. Seeing how the base model performed relative to gpt-3 on the various benchmarks gives me hope that the difference is just in RLHF or other fine-tuning steps. I really hope the community is able to get there, especially if the resulting model can be quantized with minimal loss.
I wonder if it would make sense to create tokens for each emoji so they don't have to be multi-token. Especially considering people have experimented with using them for makeshift compression.
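You can see the multi-token behaviour directly by counting how many tokens a byte-level BPE tokenizer spends on an emoji. A quick sketch with the Hugging Face GPT-2 tokenizer - the exact counts depend on the tokenizer's vocabulary and merges, so treat the output as illustrative:

    from transformers import GPT2TokenizerFast  # pip install transformers

    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    for text in ["hello", "🙂", "🙂🙂🙂"]:
        ids = tok.encode(text)
        # An emoji is several UTF-8 bytes, so byte-level BPE usually spends
        # multiple tokens on it unless the vocabulary has a dedicated merge.
        print(f"{text!r}: {len(ids)} tokens -> {ids}")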
As mentioned in the post, the smaller models are trained well past "compute-optimal" amounts of data and I would expect are well into diminishing returns. On the other hand, large models are good one-shot and few-shot learners, and might be able to pick up enough context from your prompt alone to be useable, even if it wasn't specifically trained on your use case.
In this context compute optimal isn't quite the same as diminishing returns. If you look at the loss graphs in the Llama paper, you can see that even the curves for the smaller models were still going down at the time they stopped training and weren't anywhere near plateauing yet. LLMs are notoriously data hungry and will take a long time to reach convergence.
Compute optimal here means the point at which it makes sense to move from a smaller to a larger model assuming that: (a) you have a fixed compute budget of FLOPs, and (b) you want to train the best model possible. The problem is that this applies only to training and assumes nothing about the cost of inference. If you actually need to deploy these trained models and support them long-term for hundreds, thousands, even millions of people to use, would you rather deploy a 13B model or a 30B model at the same level of quality, even if the 13B model would be more costly to train?
There is going to be a point at which these models plateau and further improvement will not be possible without moving to a larger model, but Llama doesn't get there quite yet.
I agree with this as well. Code performance has been absolutely anemic outside of GPT-3/4. One trick they used was to train on code first, and they also used a lot more code than we see in Llama.
Interesting they're allowed to use stackexchange. I don't know much about the legalities of scraping. Was this an agreement between them, or is it simply ok to scrape and use the data in a model?
The entire purpose of stackexchange was to create a scrapeable index of questions and answers. The scraper they expected was googlebot, not an LLM trainer, and what they expected it to do was build an index of what questions and answers are located on each of their pages.
No. It means that if they are doing something that is prohibited by copyright law without a license, then it needs to be CC-BY-SA.
The only theory under which training this sort of model is remotely legal is that doing so is not prohibited by copyright law in the first place. If that theory is correct they don't need a license, and they don't need to abide by any terms of licenses that they were granted without asking.
If that theory is incorrect, they have to comply with the Stack Overflow license, but they also have to not use any of the massive amounts of unlicensed training data they are using, and comply with the numerous incompatible licenses that other sources of training data are under. In other words, it's impossible to do this.
If ‘reading something and using the content to update your Bayesian priors about the world’ is a breach of copyright, then reading things is a breach of copyright. The tricky new world that the LLM opens up is that it lets you distribute an exact copy of the result of having read the thing. That’s something you can’t do with a human mind (although it’s sort of the job description of a ‘teacher’).
That's not likely to be true. An AI can create a work that is infringing, a picture of a Marvel character, for example. But that doesn't make the AI or its weights or its training an infringement.
Humans and machines are distinct with respect to copyright law. A human memorizing a book is legal. A machine scanning a book is creating a new copy and is (in at least some cases) illegal. It is not obvious that just because humans are allowed to learn from things, machines also are.
I tend to favour the view that in this case it is legal (by way of the de minimis doctrine), but I don't think it's a trivial question.
A human memorizing a book is legal; a human reciting that book aloud for an audience is not (performances of plays require licenses to the performing rights of a work, for example).
Distribution is when the issue arises - not consumption and construction of a mental model.
I acknowledge the parallels are imperfect and this all needs to be worked out in court. But it’s possible that at the pace LLMs are developing, by the time courts start addressing these questions we’ll already be questioning whether the distinction between machines and people is as big as we thought.
CC-BY-SA content needs attribution too, but I don't see the(se) model(s) in their current state being able to provide it.
I imagine we're gonna see the IBM PC BIOS/Unix/ReactOS "tainted code" arguments again in court, except this time it's not a human who is more-or-less knowingly responsible for sneaking in copyrighted code.
By that line of reasoning, GitHub Copilot would have to be GPL. Until somebody fights this out in court we don't really know. But even in the worst case, CC-BY-SA is one of the easier licenses to fulfill, not much worse than the MIT-licensed code contained in the dataset.
Even if the model doesn’t, where does code written with the aid of an llm end up after the various rulings about the output of Stable Diffusion etc. not being copyrightable at all?
Good that they disclosed it. At one of the places where I worked before, I had to sign a statement that I wouldn't copy code from Stack Exchange because of the unclear licensing - that is, the risk that an answer is quoted from or otherwise based on some open-source project, which could, in the worst case, force the company to disclose its code publicly.
I ran a HEAD request against them all to sum up the file sizes, and it comes to 2.67TB in total.
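Roughly the kind of thing that does it - a minimal sketch rather than the exact script used, with the urls.txt filename as a placeholder for wherever you have the list of download URLs:

    import requests  # pip install requests

    # Placeholder: one download URL per line.
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    total = 0
    for url in urls:
        resp = requests.head(url, allow_redirects=True, timeout=30)
        # Some servers may omit Content-Length on HEAD; those count as 0 here.
        total += int(resp.headers.get("Content-Length", 0))

    print(f"{total / 1e12:.2f} TB across {len(urls)} files")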
Here's a Datasette Lite URL that lets you explore the size metadata about those files: https://lite.datasette.io/?json=https://gist.github.com/simo...
And a SQL query that shows the breakdown across the different sources:
https://lite.datasette.io/?json=https://gist.github.com/simo...
Sizes here are in GB:
Common Crawl is in there a few times - they have the following folders:

And then C4 as well, which is "a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset": https://paperswithcode.com/dataset/c4