The data looks like it should compress pretty well. If you use something like btrfs's transparent compression, I wouldn't be surprised if it all fit in less than 0.75TB of disk space while still being usable to any tool that expects uncompressed data.
Edit: It looks like some of this data is already compressed, so maybe not.
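If you want to sanity-check how compressible a given shard actually is before committing to a filesystem setup, you can compress a sample of one of the uncompressed jsonl files and compare sizes. A minimal sketch - the path is just a placeholder and the ratio will vary a lot by source:

    import zstandard as zstd  # pip install zstandard

    # Placeholder path - substitute any uncompressed .jsonl shard from the dataset.
    path = "wikipedia/wiki.jsonl"

    with open(path, "rb") as f:
        raw = f.read(64 * 1024 * 1024)  # sample the first 64 MiB

    compressed = zstd.ZstdCompressor(level=3).compress(raw)
    print(f"sample compression ratio: {len(raw) / len(compressed):.2f}x")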
Note that you also need about 5TB of disk for the full decompressed dataset. However, only the Common Crawl portions are compressed as jsonl.zst; everything else is uncompressed jsonl.
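You don't necessarily need to decompress the Common Crawl shards to disk, though - the zstandard package can stream-decompress them on the fly. Rough sketch, with the filename as a placeholder and the "text" field name assumed:

    import io
    import json
    import zstandard as zstd  # pip install zstandard

    # Placeholder filename - any of the common_crawl *.jsonl.zst shards.
    path = "common_crawl/2023-06/example.jsonl.zst"

    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            doc = json.loads(line)
            # "text" is assumed to be the document field; adjust to the actual schema.
            print(doc.get("text", "")[:200])
            break  # just peek at the first record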
I am a little concerned that they have only about 60% of the code tokens (GitHub and Stack Exchange). Given that so far the only concrete use case I have for LLMs is coding assistance, I wouldn't want this open-source model to be any lower quality in that area.
In your opinion do you think this will hamper the model at all? Or is it still more than enough to get good coding assistance?
Nice catch! We sampled the GitHub dataset to match the total # of tokens seen by LLaMA during training: ~64B tokens (they only pass through 0.64 of their total GitHub dataset, according to the paper). We have a lot of GitHub data and will make it available soon. Note, we also have not built this for compute-optimal training. We are following LLaMA's lead and are training on more data for longer to optimize for quality, not compute.
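For anyone curious what that kind of subsampling looks like in practice, the idea is just to take documents from shuffled shards until you hit a token budget. This is not the actual RedPajama pipeline - just a toy illustration; the whitespace "tokenizer", file layout, and "text" field are all stand-ins:

    import json
    import random

    TOKEN_BUDGET = 64_000_000_000  # ~64B tokens, matching the figure above

    def count_tokens(text: str) -> int:
        # Stand-in tokenizer: whitespace split. A real pipeline would use the
        # model's BPE tokenizer, which gives different counts.
        return len(text.split())

    def sample_to_budget(paths, budget=TOKEN_BUDGET, seed=0):
        """Yield documents from shuffled jsonl shards until the token budget is hit."""
        random.seed(seed)
        paths = list(paths)
        random.shuffle(paths)
        total = 0
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    doc = json.loads(line)
                    n = count_tokens(doc.get("text", ""))
                    if total + n > budget:
                        return
                    total += n
                    yield doc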
Thank you for developing the pipeline and amassing considerable compute for gathering and preprocessing this dataset!
I'm not sure if this is the right place to ask, but could you consider training an LLM using a more advanced, sparse transformer architecture (specifically, "Terraformer" from this paper https://arxiv.org/abs/2111.12763 and this codebase https://github.com/google/trax/blob/master/trax/models/resea... by Google Brain and OpenAI)? I understand the pressure to focus on training a straightforward LLaMA replication, but of course you can see that it's a legacy dense architecture, which limits its inference performance. This new architecture is not just an academic curiosity but has already been validated at scale by Google, providing a 10x+ inference performance boost on the same hardware.
Frankly, the community's compute budget - for training and for inference - isn't infinite, and neither is the public's interest in models that don't have an advantage (at least in convenience) over closed-source ones; so we should use both of those resources as efficiently as possible. It could be a big step forward if you trained at least LLaMA-Terraformer-7B and 13B foundation models on the whole dataset.
Very good to hear that you are optimizing for inference rather than training.
I've tried llama and its various instruction-tuned siblings and have yet to get performance equivalent to gpt-3.5 on coding tasks. Seeing how the base model performed relative to gpt-3 on the various benchmarks gives me hope that the difference is just in RLHF or other fine-tuning steps. I really hope the community is able to get there, especially if the resulting model can be quantized with minimal loss.
I wonder if it would make sense to create tokens for each emoji so they don't have to be multi-token. Especially considering people have experimented with using them for makeshift compression.
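You can see the multi-token behaviour directly by counting how many tokens a byte-level BPE tokenizer spends on an emoji. A quick sketch with the Hugging Face GPT-2 tokenizer - the exact counts depend on the tokenizer's vocabulary and merges, so treat the output as illustrative:

    from transformers import GPT2TokenizerFast  # pip install transformers

    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    for text in ["hello", "🙂", "🙂🙂🙂"]:
        ids = tok.encode(text)
        # An emoji is several UTF-8 bytes, so byte-level BPE usually spends
        # multiple tokens on it unless the vocabulary has a dedicated merge.
        print(f"{text!r}: {len(ids)} tokens -> {ids}")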
As mentioned in the post, the smaller models are trained well past "compute-optimal" amounts of data and I would expect are well into diminishing returns. On the other hand, large models are good one-shot and few-shot learners, and might be able to pick up enough context from your prompt alone to be useable, even if it wasn't specifically trained on your use case.
In this context compute optimal isn't quite the same as diminishing returns. If you look at the loss graphs in the Llama paper, you can see that even the curves for the smaller models were still going down at the time they stopped training and weren't anywhere near plateauing yet. LLMs are notoriously data hungry and will take a long time to reach convergence.
Compute optimal here means the point at which it makes sense to move from a smaller to a larger model assuming that: (a) you have a fixed compute budget of FLOPs, and (b) you want to train the best model possible. The problem is that this applies only to training and assumes nothing about the cost of inference. If you actually need to deploy these trained models and support them long-term for hundreds, thousands, even millions of people to use, would you rather deploy a 13B model or a 30B model at the same level of quality, even if the 13B model would be more costly to train?
There is going to be a point at which these models plateau and further improvement will not be possible without moving to a larger model, but Llama doesn't get there quite yet.
I agree with this as well. Code performance has been absolutely anemic outside of GPT-3/4. One trick they used was to train on code first, and they also used a lot more code than we see in Llama.
Interesting they're allowed to use stackexchange. I don't know much about the legalities of scraping. Was this an agreement between them, or is it simply ok to scrape and use the data in a model?
The entire purpose of stackexchange was to create a scrapeable index of questions and answers. The scraper they expected was googlebot, not an LLM trainer, and what they expected it to do was build an index of what questions and answers are located on each of their pages.
No. It means that if they are doing something that is prohibited by copyright law without a license, then it needs to be CC-BY-SA.
The only theory under which training this sort of model is remotely legal is that doing so is not prohibited by copyright law in the first place. If that theory is correct they don't need a license, and they don't need to abide by any terms of licenses that they were granted without asking.
If that theory is incorrect, they have to comply with the Stack Overflow license, but they also have to not use any of the massive amounts of unlicensed training data they are using, and comply with the numerous incompatible licenses that other sources of training data are under. In other words, it's impossible to do this.
If ‘reading something and using the content to update your Bayesian priors about the world’ is a breach of copyright, then reading things is a breach of copyright. The tricky new world that the LLM opens up is that it lets you distribute an exact copy of the result of having read the thing. That’s something you can’t do with a human mind (although it’s sort of the job description of a ‘teacher’).
That's not likely to be true. An AI can create a work that is infringing, a picture of a Marvel character, for example. But that doesn't make the AI or its weights or its training an infringement.
Humans and machines are distinct with respect to copyright law. A human memorizing a book is legal. A machine scanning a book is creating a new copy and is (in at least some cases) illegal. It is not obvious that just because humans are allowed to learn from things, machines also are.
I tend to favour the view that in this case it is legal (by way of the de minimis doctrine), but I don't think it's a trivial question.
A human memorizing a book is legal; a human reciting that book aloud for an audience is not (performances of plays require licenses to the performing rights of a work, for example).
Distribution is when the issue arises - not consumption and construction of a mental model.
I acknowledge the parallels are imperfect and this all needs to be worked out in court. But it’s possible that at the pace LLMs are developing, by the time courts start addressing these questions we’ll already be questioning whether the distinction between machines and people is as big as we thought.
CC-BY-SA content needs attribution too, but I don't see the(se) model(s) in their current state being able to provide it.
I imagine we're gonna see the IBM PC BIOS/Unix/ReactOS "tainted code" arguments again in court, except this time it's not a human who is more-or-less knowingly responsible for sneaking in copyrighted code.
By that line of reasoning, GitHub Copilot would have to be GPL. Until somebody fights this out in court we don't really know. But even in the worst case, CC-BY-SA is one of the easier licenses to fulfill, not much worse than the MIT-licensed code contained in the dataset.
Even if the model doesn’t, where does code written with the aid of an llm end up after the various rulings about the output of Stable Diffusion etc. not being copyrightable at all?
Good that they disclosed it. At one of the places where I worked before, I had to sign a statement that I wouldn't copy code from Stack Exchange because of the unclear licensing - that is, the risk that an answer is quoted from or otherwise based on some open-source project, which could, in the worst case, force the company to disclose its code publicly.
I ran a HEAD request against them all to sum up the file sizes, and it comes to 2.67TB in total.
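Roughly the kind of thing that does it - a minimal sketch rather than the exact script used, with the urls.txt filename as a placeholder for wherever you have the list of download URLs:

    import requests  # pip install requests

    # Placeholder: one download URL per line.
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    total = 0
    for url in urls:
        resp = requests.head(url, allow_redirects=True, timeout=30)
        # Some servers may omit Content-Length on HEAD; those count as 0 here.
        total += int(resp.headers.get("Content-Length", 0))

    print(f"{total / 1e12:.2f} TB across {len(urls)} files")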
Here's a Datasette Lite URL that lets you explore the size metadata about those files: https://lite.datasette.io/?json=https://gist.github.com/simo...
And a SQL query that shows the breakdown across the different sources:
https://lite.datasette.io/?json=https://gist.github.com/simo...
Sizes here are in GB:
Common Crawl is in there a few times - they have the following folders:

And then C4 as well, which is "a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset": https://paperswithcode.com/dataset/c4