karpathy/llm.c
CUDA Errors Hindering Multi-GPU Performance in the karpathy/llm.c Project
The karpathy/llm.c project, aimed at training large language models in plain C/CUDA, is experiencing ongoing challenges with CUDA errors and multi-GPU support, as evidenced by numerous user-reported issues. The project is spearheaded by Andrej Karpathy and provides a lightweight alternative to larger frameworks for LLM training.
Recent activity highlights significant community engagement, with 71 open issues reflecting a mix of feature requests, bug reports, and discussions. Notably, CUDA-related problems such as memory management errors are prevalent, indicating persistent difficulties in optimizing the codebase for diverse hardware configurations. There is also a strong emphasis on educational content, with users seeking better documentation and guidance.
Recent issues and pull requests (PRs) indicate a focus on resolving CUDA compatibility and enhancing model features. Issues like #747 (FP16 training on Turing GPUs) and #727 (MPI run with 8 GPUs failing) underscore ongoing multi-GPU challenges. Meanwhile, PRs such as #757 (RMSNorm implementation) and #756 (RoPE positional encoding) suggest efforts to integrate advanced techniques for improved model performance.
Andrej Karpathy: added rmsnorm support in llmc/layernorm.cuh and switched to uint32_t tokens for Llama 3.
Aleksa Gordić: opened several PRs (RMSNorm, RoPE, SwiGLU) during this period, though none show commits on the main branch yet.
Other team members have shown no recent commit activity, suggesting reliance on Andrej for current developments.
Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 0 | 0 | 0 | 0 | 0 |
30 Days | 3 | 1 | 5 | 3 | 1 |
90 Days | 17 | 5 | 25 | 17 | 1 |
All Time | 132 | 61 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
Developer | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|
Andrej | 2 | 1/0/1 | 9 | 7 | 2893 |
NEWPLAN (NEWPLAN) | 0 | 1/0/0 | 0 | 0 | 0 |
Biao Zhang (zhangpiu) | 0 | 0/1/0 | 0 | 0 | 0 |
Yusong Gao (GaoYusong) | 0 | 0/1/0 | 0 | 0 | 0 |
Jake (Jake-Song) | 0 | 1/0/0 | 0 | 0 | 0 |
Jiahao Tan (KarhouTam) | 0 | 0/0/1 | 0 | 0 | 0 |
Gabriel Castro (saladpalad) | 0 | 1/0/0 | 0 | 0 | 0 |
Aleksa Gordić (gordicaleksa) | 0 | 3/1/0 | 0 | 0 | 0 |
almao (invisiblepancake) | 0 | 0/0/1 | 0 | 0 | 0 |
Gajanan Choudhary (gajanan-choudhary) | 0 | 1/0/0 | 0 | 0 | 0 |
PRs: created by that dev and opened/merged/closed-unmerged during the period
The karpathy/llm.c repository currently has 71 open issues, indicating ongoing community engagement. Recent activity shows a mix of feature requests, bug reports, and discussions about implementation details. CUDA errors and multi-GPU support remain the most prevalent problem areas, suggesting persistent difficulty in optimizing performance across different hardware configurations, and many users ask for clarification on various components of the codebase.
Several issues exhibit common themes, particularly around memory management and CUDA compatibility. For instance, multiple users have reported "out of memory" errors or illegal memory access during training, which could indicate underlying problems with resource allocation or kernel execution in the CUDA implementation. Additionally, there are discussions about improving documentation and providing clearer guidelines for new users.
Issue #752: llm.c for inference
Issue #747: Can't train in FP16 on Turing
Issue #739: Suggestion: Test more Activation Functions
Issue #729: MPI run error
Issue #727: MPI run with 8 GPU fails
Issue #723: TypeError: normal_() got an unexpected keyword argument 'generator'
Overall, the activity within the karpathy/llm.c repository reflects a vibrant community eager to contribute to and learn from this innovative project focused on LLM training in C/CUDA.
The karpathy/llm.c repository has a total of 103 open pull requests (PRs), with a significant focus on enhancing performance, adding new features, and improving the overall architecture. Recent PRs predominantly revolve around advanced techniques such as RMSNorm, RoPE positional encoding, and the SwiGLU activation function, along with various optimizations to existing kernels.
PR #757: RMSNorm - WIP
Created by Aleksa Gordić, this work-in-progress PR aims to add RMSNorm support. It includes several commits related to kernel allocation and refactoring. The addition is significant as RMSNorm is a modern normalization technique that may improve model training stability.
PR #756: Add RoPE positional encoding - llama3 feature branch
Also by Aleksa Gordić, this PR implements rotary position embedding (RoPE) from the RoFormer paper. The author provides experimental results showing improved validation loss with RoPE compared to traditional embeddings.
PR #755: Add SwiGLU support - llama3 feature branch
This PR introduces the SwiGLU activation function based on a recent paper. The author notes an increase in memory footprint but suggests potential performance benefits. However, they express uncertainty about its advantages over existing activation functions.
PR #754: add llama 3 support to llm.c
Initiated by Andrej Karpathy, this draft PR starts integrating Llama 3 support into the codebase, indicating ongoing efforts to enhance compatibility with newer models.
PR #750: implement rmsnorm in C
Created by Jake Song, this PR implements RMSNorm in C for Llama 3. The author expresses some uncertainty regarding its accuracy but notes successful tests against PyTorch.
PR #718: Add SwiGLU support
This earlier PR by Aleksa Gordić also introduces SwiGLU; it appears to have been superseded by the refined #755.
PR #753: Adamw thread coarsening kernel
Gabriel Castro's PR focuses on optimizing the AdamW optimizer through thread coarsening techniques, which could enhance performance during training.
PR #748: Fix sizing typo in train_gpt2_fp32.cu
A simple typo fix by Gajanan Choudhary that corrects a variable name in the codebase.
PR #746: log with LINE and FILE for better addressing
A minor improvement by NEWPLAN that enhances logging clarity by including line numbers and file names.
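The pattern the PR describes can be sketched as a small C helper; the real PR's macro name and message format are likely different, and `log_format` here renders into a caller buffer only so the prefix can be exercised in isolation:

```c
#include <stdio.h>
#include <stdarg.h>

/* Render "[file:line] message" into buf; the macro below captures
 * __FILE__ and __LINE__ at the call site for better addressing. */
int log_format(char* buf, size_t size, const char* file, int line,
               const char* fmt, ...) {
    int n = snprintf(buf, size, "[%s:%d] ", file, line);
    va_list ap;
    va_start(ap, fmt);
    n += vsnprintf(buf + n, size > (size_t)n ? size - (size_t)n : 0, fmt, ap);
    va_end(ap);
    return n;
}

#define LOGF(buf, size, ...) \
    log_format(buf, size, __FILE__, __LINE__, __VA_ARGS__)
```

Tagging each message with its origin makes errors like the CUDA failures discussed above much easier to trace to a specific call site.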
PR #743: Fixed modal script for updated cudnn version, and read errors
Vyom Sharma's PR addresses issues with reading errors caused by recent changes in cuDNN versions.
The current set of open pull requests reflects a strong emphasis on enhancing model performance and stability through advanced techniques and optimizations. Notably, several PRs are focused on integrating new normalization methods (like RMSNorm) and activation functions (such as SwiGLU), which are critical for improving training dynamics in large language models (LLMs).
There are instances where contributors have engaged in discussions about code quality and design decisions (e.g., comments from Andrej Karpathy regarding the use of AutoTokenizer). This reflects an active community dynamic but also points to potential friction regarding codebase direction and standards.
The pull requests currently open in karpathy/llm.c demonstrate a vibrant development environment focused on advancing LLM capabilities through innovative techniques and optimizations. However, attention should be given to streamlining the merging process and resolving outstanding discussions and disputes to maintain momentum in project development.
Recent commits added unfused forward rmsnorm functionality in llmc/layernorm.cuh and modified related files, switched to uint32_t tokens as required for Llama 3, and added a llama3cu phony target to the Makefile. The development team is currently in a phase of concentrated activity led by Andrej Karpathy, primarily focused on advancing the Llama 3 implementation. Other team members appear inactive in terms of recent commits, which may impact collaborative progress if this trend continues.