Glad to see your ideas here. Could you clarify a point for me? The W matrix in the paper is d_model x 2d. Does this mean a differential attention model doubles the W matrix of a standard attention model, which is d_model x d? E.g., if llama3 has a W of 8192 x 1024, does the diffattn model with the same architecture have a W of 8192 x (1024 x 2)?
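For concreteness, here is a minimal PyTorch sketch of the shape comparison being asked about, using the numbers from the example above. The dimensions are assumptions taken straight from the comment; whether the differential-attention parameter count actually doubles depends on how d is chosen relative to d_model, which is exactly the question.

  import torch
  import torch.nn as nn

  d_model = 8192   # model width from the example above (assumed)
  d = 1024         # projection width in the standard-attention example (assumed)

  # Standard attention projection: W in R^{d_model x d}
  w_standard = nn.Linear(d_model, d, bias=False)

  # Differential attention projection as written in the paper: W in R^{d_model x 2d},
  # producing two halves (e.g. [Q1; Q2]) that are split before the two softmaxes
  w_diff = nn.Linear(d_model, 2 * d, bias=False)

  x = torch.randn(1, 16, d_model)          # (batch, seq_len, d_model)
  q = w_standard(x)                        # (1, 16, 1024)
  q1, q2 = w_diff(x).chunk(2, dim=-1)      # two (1, 16, 1024) halves
  print(q.shape, q1.shape, q2.shape)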

