
Doesn't the Stack contain HumanEval? So you're basically comparing numbers on the pretraining data.


Can't find it now, but I'm pretty sure BigCode said somewhere they explicitly looked for it and removed it. Also, subjective experience does match up with the benchmark: our finetuned model scored +50% on HumanEval, and when using it, it felt at least that much improved.


You can view the prompts, solutions, and checks here[0]. See my sibling comment (to yours) where I quote the HumanEval paper and do some more analysis. But if you look at [0], you'll see that these aren't really unique problems and likely have many repetitions in the dataset. I should have added the dataset paper[1] to that comment (too late to edit), where they mention that they scrape all of GitHub (Jan 1 2015 - Mar 31 2022). They do exact and near de-duplication, but near de-duplication is messy.

> We implement near-deduplication in our pre-processing pipeline on top of exact deduplication. We first split the files into words/tokens based on non-alphanumeric characters and remove files with fewer than 10 tokens. Next, we compute the MinHash with 256 permutations of all documents, and use Locality Sensitive Hashing to find clusters of duplicates. We further reduce these clusters by ensuring that each file in the original cluster is similar to at least one other file in the reduced cluster. We consider two files similar when their Jaccard similarity exceeds 0.85.

Near-duplicates are still difficult to measure, so we should expect leftover duplication, roughly proportional to the number of samples we have (and I'd wager the variance grows with heavier duplication).
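To make the "near de-duplication is messy" point concrete, here's a quick sketch (my own, mirroring their described tokenization) showing that two functionally identical solutions can fall well below a 0.85 Jaccard threshold and so survive de-duplication:

```python
import re

def tokens(text):
    # Split on non-alphanumeric characters, as in the Stack's pipeline
    return {t for t in re.split(r"[^0-9a-zA-Z]+", text) if t}

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Two functionally identical MAD solutions, different names and style
f1 = ("def mean_absolute_deviation(numbers):\n"
      "    mean = sum(numbers) / len(numbers)\n"
      "    return sum(abs(x - mean) for x in numbers) / len(numbers)")
f2 = ("def mad(data):\n"
      "    m = float(sum(data)) / len(data)\n"
      "    deviations = [abs(v - m) for v in data]\n"
      "    return sum(deviations) / len(deviations)")

print(jaccard(f1, f2))  # well below the 0.85 threshold
```

So a de-duplicated corpus can still contain many distinct renderings of the same solution; MinHash/LSH only catches the close textual matches.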

[0] https://github.com/openai/code-align-evals-data/tree/97446d9...

[1] https://arxiv.org/abs/2211.15533


My favorite line from the HumanEval paper[0]

> It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.

So to answer your question, yes, the evaluation dataset is spoiled. You can find such unique and never before seen docstrings like

> For a given list of input numbers calculate the Mean Absolute Deviation around the mean of this dataset. Mean Absolute Deviation is the absolute difference between each element and a centerpoint (mean in this case)[1]
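A straightforward reading of that docstring (my own sketch, not OpenAI's reference solution) is only a few lines, which is exactly why GitHub is full of matches:

```python
def mean_absolute_deviation(numbers):
    # Mean Absolute Deviation: average absolute difference
    # between each element and the mean of the list
    mean = sum(numbers) / len(numbers)
    return sum(abs(x - mean) for x in numbers) / len(numbers)

print(mean_absolute_deviation([1.0, 2.0, 3.0, 4.0]))  # 1.0
```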

And here's a repo I found that is 8 years old[2]. But how about a more recent one that is even closer?[3] There are plenty more examples[4] (does anyone know how to actually limit results to dates before 2021? `pushed:<2021` doesn't work, nor does the `created` qualifier; date searching doesn't seem to work well).

In essence, we can still use this evaluation to determine how good our model is at fuzzy search. Which, mind you, is still a useful thing. But I would be careful concluding that the model is good at generalizing to arbitrary descriptions of code or to novel code. That said, one could argue that not many lines of code are truly novel. Still, we need to be careful about our conclusions and understand the limitations of our metrics (something I am currently deeply troubled by).

[0] https://arxiv.org/abs/2107.03374

[1] https://github.com/openai/code-align-evals-data/blob/97446d9...

[2] https://github.com/bertomartin/stat4701/blob/ec2b64f629cbbf6...

[3] https://github.com/danielwatson6/hate-speech-project/blob/64...

[4] https://github.com/search?q=abs%28x+-+mean%29+for+language%3...


(follow-up: Figured this should be a different comment)

I wanted to demonstrate what I said above, so I came up with some examples that I think a human would have an easy time implementing but a model might find hard. BUT a key part is that I expect these to be in the dataset! I just don't expect them to appear in hundreds or thousands of repos, because they are uncommon (but not rare). Also, we'll pretty much ask for few-liners to give the model the biggest advantage we can (errors compound).

Prompt:

    from torch import nn

    class LipSwish(nn.Module):
        """
        The Swish activation function is defined by a gated linear unit,
        where the gate is defined by a sigmoid function and multiplies the input with
        a learnable parameter, beta. Beta is initialized as 0.5.
        The LipSwish function normalizes the output by the upper bound of 1.1.
        """
        def __init__(self):
            super().__init__()
Result: Mostly correct, but missing the division by 1.1. The forward is `return x * F.sigmoid(self.beta * x)`, which is plain Swish (it also assumed we had `import torch` and applied type hints). It did properly set up the beta parameter (this is just a three-liner).

Discussion: The Swish function should be in the dataset; it's a well-known activation function (though beta is not in the PyTorch version). Despite LipSwish being in the dataset (introduced in 2019 by Residual Flows[0]), it is not common. I could get the model to generate the Swish part (initializing beta and performing the gate) but could not get it to divide the output by 1.1. I would not expect a human to have difficulty with this.
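For reference, here is the three-liner I was hoping for (my own sketch of LipSwish as described in the docstring; not the model's output):

```python
import torch
from torch import nn

class LipSwish(nn.Module):
    # Swish with a learnable beta (initialized at 0.5),
    # normalized by the upper bound 1.1
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x) / 1.1
```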

Okay, so let's try something else that might be a bit more common and older. The same paper uses a concatenated activation function, and those aren't uncommon: CReLU was introduced in 2016[1], and there have been plenty of concatenated activations since. The PyTorch documentation even uses it as an example[2]. There are far more examples of CReLU (3k Python results for "class CReLU" vs 58 for "class LipSwish"; take these numbers as weak hints, because GitHub search sucks and isn't always accurate).

Prompt:

    from torch import nn
    from torch.nn import functional as F

    class CReLU(nn.Module):
        """
        Concatenated version of ReLU. The activation is applied to both the positive and
        negative of our input and the result is concatenated.
        """
        def __init__(self):
            super().__init__()

        def forward(self, x):
Result: `return torch.cat([x.clamp(min=0), -x.clamp(min=0)], 1)`. This is correct but not the expected one-liner result.

Discussion: This was a bit surprising: it didn't use functional as we might expect (or hinted). But interestingly, it will if we change the class name to "ConcatenatedReLU". I found exact copies on GitHub with the full name (memorization), but the first page of instances for CReLU I found used functional (I did find one that was exactly the above code, when adding "clamp" to the search, but missing the minus sign; there were plenty of errors in CReLU implementations). Interesting side note: CReLU continues and defines a function CReLU6 which uses the same docstring but clamps the positive input with a max of 6, whereas ConcatenatedReLU starts to define a convolutional block (Conv + BatchNorm + ReLU) called Conv2d.
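For comparison, the one-liner using functional that the prompt was hinting at (my own sketch, matching the docstring rather than any particular GitHub copy):

```python
import torch
from torch import nn
from torch.nn import functional as F

class CReLU(nn.Module):
    # Apply ReLU to both x and -x, then concatenate
    # along the channel dimension
    def forward(self, x):
        return torch.cat([F.relu(x), F.relu(-x)], dim=1)
```

Note that every input channel produces two output channels, so downstream layers need twice the input width.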

So we have kinda mixed results, and in both cases the outputs are rather odd and probably not what we wanted. We can clearly see there are issues where a human would not have much trouble. There's a fundamental tension in these problems: the model needs to memorize a lot of information (otherwise it can't write code or know library calls), but too much memorization prevents creativity. There is a lot of gray area between a _pure_ "stochastic parrot"/"fancy copy machine" and a generalized intelligence (under a broad and flexible definition of intelligence). I'd still call these models stochastic parrots, because to me the evidence suggests we're closer to the memorization end than the creation end.

But that doesn't mean these frameworks aren't useful. We all know a lot of code is boilerplate (otherwise we wouldn't have the joke "copy paste from SO"), and these tools can be very useful for that. The utility, though, is going to depend heavily on what you are coding and how you code. If you're doing standard stuff, this probably has high utility and can save you a lot of time, the same way writing macros does, but FAR more powerful. It can also help novices a lot. On the other hand, if your main errors are reading mistakes (e.g. you're dyslexic; this is my largest problem), then this might make things harder, since you tend to gloss over text and miss minor errors. I also don't think these tools help much if you're a researcher or writing optimized or specialized code. These differences are probably why we see such varied reactions, and they may also hint at what people actually do and how they work, when we see who raves and who rants about these tools.

[0] https://arxiv.org/abs/1906.02735

[1] https://arxiv.org/abs/1603.05201

[2] https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html

Edit: We can also check whether code is in the Stack[3]. We see that [0] is indeed in the dataset, so we know there is information leakage. Interestingly, the exact copy I found in my previous comment[4] isn't! (The repo isn't, though the user is.)

[3] https://huggingface.co/spaces/bigcode/in-the-stack

[4] https://github.com/bertomartin/stat4701/blob/ec2b64f629cbbf6...



