Conversation
📝 Walkthrough

The PR reorganizes the ESM2 module structure, flattening the codebase.
Sequence Diagram

```mermaid
sequenceDiagram
participant User as User/CLI
participant Main as main()
participant Loader as HF Model Loader
participant Converter as ESM Converter
participant Tokenizer as Tokenizer Handler
participant Config as Config Patcher
participant Template as Template Engine
participant FileOps as File Operations
participant TE as TE Model Loader
User->>Main: Run export pipeline
Main->>Main: Iterate ESM_TAGS
Main->>Loader: Load HF checkpoint
Loader-->>Converter: Return HF model
Converter->>Converter: Align pooler layers
Converter->>Converter: Convert to TE format
Converter-->>FileOps: Serialized TE model
FileOps->>FileOps: Save converted model
Main->>Tokenizer: Load & save tokenizer
Tokenizer-->>FileOps: Tokenizer artifacts
Main->>Config: Patch config auto_map
Config-->>FileOps: Updated config.json
Main->>Template: Render README (benchmark_results, param_count)
Template-->>FileOps: Generated README
FileOps->>FileOps: Copy LICENSE, cleanup
Main->>TE: Smoke test: load TE model
TE-->>Main: Verify successful load
Main->>Main: gc.collect(), torch.cuda.empty_cache()
Main-->>User: Export complete
```
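For orientation, here is a condensed, hypothetical sketch of the loop the diagram describes. Only the names quoted elsewhere in this review (`ESM_TAGS`, `convert_esm_hf_to_te`, the cleanup idiom) come from the source; the tag list contents, HF repo prefix, and helper wiring are illustrative assumptions:

```python
# Hypothetical sketch of the export loop shown in the diagram above.
# ESM_TAGS, convert_esm_hf_to_te, and the cleanup calls are quoted from this
# review; the tag value and "facebook/" repo prefix are assumed placeholders.
import gc
from pathlib import Path

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

from convert import convert_esm_hf_to_te

ESM_TAGS = ["esm2_t6_8M_UR50D"]  # placeholder; the review mentions six models


def main(export_path: Path):
    for tag in ESM_TAGS:
        # Load the HF checkpoint and convert it to the TE format.
        model_hf = AutoModelForMaskedLM.from_pretrained(f"facebook/{tag}")
        model_te = convert_esm_hf_to_te(model_hf)
        model_te.save_pretrained(export_path / tag)

        # Save the tokenizer alongside the converted weights.
        AutoTokenizer.from_pretrained(f"facebook/{tag}").save_pretrained(export_path / tag)

        # Smoke test: reload the converted model with remote code enabled.
        reloaded = AutoModelForMaskedLM.from_pretrained(
            export_path / tag,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
        )
        del model_hf, model_te, reloaded

        # Free memory before the next (possibly 15B-parameter) checkpoint.
        gc.collect()
        torch.cuda.empty_cache()
```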
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Force-pushed from 0156c98 to 9f32261
⚠️ Caution: Review failed. An error occurred during the review process. Please try again later.
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Force-pushed from d99aecd to f82e07c
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
bionemo-recipes/models/esm2/README.md (1)
152-159: ⚠️ Potential issue | 🟡 Minor

Inconsistent CLI command: `huggingface-cli upload` vs `hf upload`.

Line 152 uses `huggingface-cli upload` but Line 159 uses `hf upload`. While `hf` may be a valid alias in some installations, this is inconsistent within the same document and could confuse users.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/esm2/README.md` around lines 152 - 159, The README shows two different CLI invocations—`huggingface-cli upload` and `hf upload`—which is inconsistent; pick one canonical command and update both occurrences so the upload example and the loop use the same CLI invocation (e.g., replace `huggingface-cli upload` with `hf upload` or vice versa) including the examples that reference `checkpoint_export` and the loop over `*/` to ensure the document consistently uses the chosen command name.

bionemo-recipes/models/esm2/tests/conftest.py (1)
105-112: ⚠️ Potential issue | 🟡 Minor

`te_model_checkpoint` is now function-scoped — conversion cost per test.

Without an explicit `scope=`, this fixture runs the full HF download + TE conversion + `save_pretrained` for every test that uses it. The change is a correctness fix (function-scoped `tmp_path` can't be used in session-scoped fixtures), but if multiple tests consume this fixture, `scope="module"` with `tmp_path_factory` would be a reasonable middle ground.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/esm2/tests/conftest.py` around lines 105 - 112, The te_model_checkpoint fixture currently runs per-test and downloads/converts the HF model each time; change its scope to "module" and switch from tmp_path to tmp_path_factory so the expensive setup runs once: update the fixture signature for te_model_checkpoint to accept tmp_path_factory, create a module-level temporary directory via tmp_path_factory.mktemp(...), perform AutoModelForMaskedLM.from_pretrained(...) and convert_esm_hf_to_te(...) once, save_pretrained into that tmp dir, and return the path; keep the fixture name te_model_checkpoint and calls to convert_esm_hf_to_te and model_te.save_pretrained unchanged.
🧹 Nitpick comments (6)
bionemo-recipes/models/esm2/export.py (1)
29-36: Consider deriving benchmark results from `ESM_TAGS` or validating consistency.

`BENCHMARK_RESULTS` keys and `ESM_TAGS` are maintained independently. If a tag is added to `ESM_TAGS` without a corresponding entry in `BENCHMARK_RESULTS`, Line 103 (`BENCHMARK_RESULTS[tag]`) will raise a `KeyError` at runtime.

Proposed fix: add a validation

```diff
+# Validate that all ESM_TAGS have benchmark results
+assert set(ESM_TAGS) == set(BENCHMARK_RESULTS.keys()), (
+    f"ESM_TAGS and BENCHMARK_RESULTS keys must match. "
+    f"Missing: {set(ESM_TAGS) - set(BENCHMARK_RESULTS.keys())}"
+)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/esm2/export.py` around lines 29 - 36, BENCHMARK_RESULTS is maintained separately from ESM_TAGS which can cause a KeyError at runtime when the code does BENCHMARK_RESULTS[tag]; add a validation step (at module import or before use) that iterates over ESM_TAGS and checks each tag exists in BENCHMARK_RESULTS, raising a clear error or filling missing entries with a sensible default; update the check near the code that accesses BENCHMARK_RESULTS[tag] (referenced in the review as the access on line 103) and ensure the validation references the symbols ESM_TAGS and BENCHMARK_RESULTS so missing tags are detected early with an informative message.

bionemo-recipes/models/esm2/requirements.txt (1)
1-11: No version pins on dependencies.

All dependencies except `torchao` are unpinned. Since this runs inside the NGC PyTorch container which provides pre-installed, compatible versions of most of these packages, this is likely intentional. However, for reproducibility outside the container (e.g., local dev), unpinned deps can lead to breakage.

Consider whether adding version lower-bounds (e.g., `transformers>=4.x`) would be worthwhile for the local dev workflow described in the README.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/esm2/requirements.txt` around lines 1 - 11, The requirements file currently lists open-ended dependencies which harms reproducibility; update bionemo-recipes/models/esm2/requirements.txt to add conservative lower-bound version pins for key packages (for example use transformers>=4.x, datasets>=2.x, hydra-core>=1.x, jinja2>=3.x, omegaconf>=2.x, accelerate>=0.x, peft>=0.x) while keeping the existing torchao!=0.14.0 exclusion; choose minimal compatible lower bounds that reflect the versions tested in the README/NGC container and add them to each package line so local dev installs are deterministic and compatible with the container-provided stack.

bionemo-recipes/models/esm2/tests/conftest.py (1)
29-30: Prefer `sys.path.insert(0, ...)` over `sys.path.append` for module lookup priority.

`sys.path.append` adds the local `esm2/` directory at the end of the search path. If any installed package provides a module named `convert`, `collator`, or `modeling_esm_te`, it would shadow the local file. All other files in this PR that manipulate `sys.path` use `sys.path.insert(0, ...)` to guarantee the local directory takes precedence.

🔧 Suggested fix

```diff
-sys.path.append(Path(__file__).parent.parent.as_posix())
-sys.path.append(Path(__file__).parent.as_posix())
+sys.path.insert(0, Path(__file__).parent.parent.as_posix())
+sys.path.insert(0, Path(__file__).parent.as_posix())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/esm2/tests/conftest.py` around lines 29 - 30, Replace sys.path.append calls so the local esm2 directories are prepended to module search path: change the two occurrences that call sys.path.append with the corresponding sys.path.insert(0, ...) using the same Path(__file__).parent.parent.as_posix() and Path(__file__).parent.as_posix() expressions, ensuring the local directories take precedence during import resolution.

bionemo-recipes/models/esm2/tests/test_cp_dataloader.py (1)
33-35: Redundant `import sys` inside `__main__` block after moving it to module level.

`sys` is now imported at module level (line 33). The existing `import sys` in the `if __name__ == "__main__":` block (around line 1104) is now redundant.

🧹 Suggested cleanup

```diff
 if __name__ == "__main__":
-    import sys
     ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/esm2/tests/test_cp_dataloader.py` around lines 33 - 35, Remove the redundant import sys inside the if __name__ == "__main__": block—since sys is already imported at module level (top of file), delete the second import statement near the bottom of the file (in the main block) so there are no duplicate imports; verify no conditional import-specific behavior relies on that inner import.

bionemo-recipes/models/esm2/tests/test_cp_bshd.py (1)
23-26: `sys.path.insert` at module level mutates session-wide `sys.path` during pytest collection.

The insertion runs unconditionally whenever pytest imports the file — not only under `torchrun`. If `conftest.py` already adds this path, a duplicate entry is created at position 0, and any module named `collator`, `convert`, or `modeling_esm_te` in this directory will shadow identically named modules that appear later in `sys.path` for the entire test session.

An idempotency guard avoids the duplicate:

♻️ Proposed fix

```diff
-sys.path.insert(0, Path(__file__).resolve().parent.parent.as_posix())
+_model_dir = Path(__file__).resolve().parent.parent.as_posix()
+if _model_dir not in sys.path:
+    sys.path.insert(0, _model_dir)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/esm2/tests/test_cp_bshd.py` around lines 23 - 26, Avoid unconditionally mutating session-wide sys.path by checking for the target path before inserting: compute the path (Path(__file__).resolve().parent.parent.as_posix()) and only call sys.path.insert(0, ...) if that string is not already present in sys.path; replace the current unconditional sys.path.insert(0, Path(...).as_posix()) with this idempotent guard so tests run under torchrun still get the path without creating duplicate entries or shadowing modules.

bionemo-recipes/models/esm2/tests/test_cp_thd.py (1)
23-26: Same unconditional `sys.path.insert` as in `test_cp_bshd.py` — same idempotency concern applies.

Both files insert the same directory at `sys.path[0]` unconditionally, creating two duplicate entries when pytest collects them in the same session. Consider the same guard:

♻️ Proposed fix

```diff
-sys.path.insert(0, Path(__file__).resolve().parent.parent.as_posix())
+_model_dir = Path(__file__).resolve().parent.parent.as_posix()
+if _model_dir not in sys.path:
+    sys.path.insert(0, _model_dir)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/esm2/tests/test_cp_thd.py` around lines 23 - 26, The unconditional sys.path.insert call in test_cp_thd.py is causing duplicate entries when multiple tests run; change it to compute the directory via Path(__file__).resolve().parent.parent.as_posix() (same expression already used) and only insert it if that string is not already present in sys.path (i.e., check "if model_dir not in sys.path: sys.path.insert(0, model_dir)"); mirror the idempotent guard used in the other test (test_cp_bshd.py) to avoid duplicates while preserving the conftest fallback behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/models/esm2/export.py`:
- Around line 113-115: The cleanup currently deletes model_hf and model_te but
misses freeing model_hf_masked_lm, causing a memory leak; update the teardown in
export.py to explicitly delete model_hf_masked_lm (e.g., call del
model_hf_masked_lm) before gc.collect() and torch.cuda.empty_cache() in the same
block that deletes model_hf and model_te (refer to the symbols
model_hf_masked_lm, model_hf, model_te), ensuring any other large objects tied
to that masked LM are also released prior to collecting and emptying the CUDA
cache.
- Around line 118-122: The from_pretrained call for AutoModelForMaskedLM is
passing dtype instead of the correct HuggingFace kwarg torch_dtype, so bfloat16
is ignored; update the call that constructs model_te
(AutoModelForMaskedLM.from_pretrained) to pass torch_dtype=torch.bfloat16
(keeping trust_remote_code=True and the same path) so the model actually loads
in bfloat16 during the smoke test.
---
Outside diff comments:
In `@bionemo-recipes/models/esm2/README.md`:
- Around line 152-159: The README shows two different CLI
invocations—`huggingface-cli upload` and `hf upload`—which is inconsistent; pick
one canonical command and update both occurrences so the upload example and the
loop use the same CLI invocation (e.g., replace `huggingface-cli upload` with
`hf upload` or vice versa) including the examples that reference
`checkpoint_export` and the loop over `*/` to ensure the document consistently
uses the chosen command name.
In `@bionemo-recipes/models/esm2/tests/conftest.py`:
- Around line 105-112: The te_model_checkpoint fixture currently runs per-test
and downloads/converts the HF model each time; change its scope to "module" and
switch from tmp_path to tmp_path_factory so the expensive setup runs once:
update the fixture signature for te_model_checkpoint to accept tmp_path_factory,
create a module-level temporary directory via tmp_path_factory.mktemp(...),
perform AutoModelForMaskedLM.from_pretrained(...) and convert_esm_hf_to_te(...)
once, save_pretrained into that tmp dir, and return the path; keep the fixture
name te_model_checkpoint and calls to convert_esm_hf_to_te and
model_te.save_pretrained unchanged.
---
Nitpick comments:
In `@bionemo-recipes/models/esm2/export.py`:
- Around line 29-36: BENCHMARK_RESULTS is maintained separately from ESM_TAGS
which can cause a KeyError at runtime when the code does BENCHMARK_RESULTS[tag];
add a validation step (at module import or before use) that iterates over
ESM_TAGS and checks each tag exists in BENCHMARK_RESULTS, raising a clear error
or filling missing entries with a sensible default; update the check near the
code that accesses BENCHMARK_RESULTS[tag] (referenced in the review as the
access on line 103) and ensure the validation references the symbols ESM_TAGS
and BENCHMARK_RESULTS so missing tags are detected early with an informative
message.
In `@bionemo-recipes/models/esm2/requirements.txt`:
- Around line 1-11: The requirements file currently lists open-ended
dependencies which harms reproducibility; update
bionemo-recipes/models/esm2/requirements.txt to add conservative lower-bound
version pins for key packages (for example use transformers>=4.x, datasets>=2.x,
hydra-core>=1.x, jinja2>=3.x, omegaconf>=2.x, accelerate>=0.x, peft>=0.x) while
keeping the existing torchao!=0.14.0 exclusion; choose minimal compatible lower
bounds that reflect the versions tested in the README/NGC container and add them
to each package line so local dev installs are deterministic and compatible with
the container-provided stack.
In `@bionemo-recipes/models/esm2/tests/conftest.py`:
- Around line 29-30: Replace sys.path.append calls so the local esm2 directories
are prepended to module search path: change the two occurrences that call
sys.path.append with the corresponding sys.path.insert(0, ...) using the same
Path(__file__).parent.parent.as_posix() and Path(__file__).parent.as_posix()
expressions, ensuring the local directories take precedence during import
resolution.
In `@bionemo-recipes/models/esm2/tests/test_cp_bshd.py`:
- Around line 23-26: Avoid unconditionally mutating session-wide sys.path by
checking for the target path before inserting: compute the path
(Path(__file__).resolve().parent.parent.as_posix()) and only call
sys.path.insert(0, ...) if that string is not already present in sys.path;
replace the current unconditional sys.path.insert(0, Path(...).as_posix()) with
this idempotent guard so tests run under torchrun still get the path without
creating duplicate entries or shadowing modules.
In `@bionemo-recipes/models/esm2/tests/test_cp_dataloader.py`:
- Around line 33-35: Remove the redundant import sys inside the if __name__ ==
"__main__": block—since sys is already imported at module level (top of file),
delete the second import statement near the bottom of the file (in the main
block) so there are no duplicate imports; verify no conditional import-specific
behavior relies on that inner import.
In `@bionemo-recipes/models/esm2/tests/test_cp_thd.py`:
- Around line 23-26: The unconditional sys.path.insert call in test_cp_thd.py is
causing duplicate entries when multiple tests run; change it to compute the
directory via Path(__file__).resolve().parent.parent.as_posix() (same expression
already used) and only insert it if that string is not already present in
sys.path (i.e., check "if model_dir not in sys.path: sys.path.insert(0,
model_dir)"); mirror the idempotent guard used in the other test
(test_cp_bshd.py) to avoid duplicates while preserving the conftest fallback
behavior.
```python
del model_hf, model_te
gc.collect()
torch.cuda.empty_cache()
```
`model_hf_masked_lm` is not freed — memory leak for large models.
Line 113 deletes `model_hf` and `model_te` but `model_hf_masked_lm` (loaded on Line 71) is still alive. When iterating over 6 models including a 15B-parameter one, this leaks a full model's worth of GPU/CPU memory per iteration.
Proposed fix
```diff
-del model_hf, model_te
+del model_hf, model_hf_masked_lm, model_te
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-del model_hf, model_te
+del model_hf, model_hf_masked_lm, model_te
 gc.collect()
 torch.cuda.empty_cache()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/models/esm2/export.py` around lines 113 - 115, The cleanup
currently deletes model_hf and model_te but misses freeing model_hf_masked_lm,
causing a memory leak; update the teardown in export.py to explicitly delete
model_hf_masked_lm (e.g., call del model_hf_masked_lm) before gc.collect() and
torch.cuda.empty_cache() in the same block that deletes model_hf and model_te
(refer to the symbols model_hf_masked_lm, model_hf, model_te), ensuring any
other large objects tied to that masked LM are also released prior to collecting
and emptying the CUDA cache.
```python
model_te = AutoModelForMaskedLM.from_pretrained(
    export_path / tag,
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
```
🧩 Analysis chain
🌐 Web query:
HuggingFace transformers AutoModelForMaskedLM from_pretrained torch_dtype dtype parameter
💡 Result:
In 🤗 Transformers (PyTorch), AutoModelForMaskedLM.from_pretrained(...) uses the keyword torch_dtype (not dtype) to control the floating-point dtype used when loading the model.
torch_dtype (PyTorch)
Supported values include:
- A real torch dtype, e.g. `torch.float16`, `torch.bfloat16`, `torch.float32` (fp32 is the default if you don't set anything). (huggingface.co)
- The string `"auto"`, which tries (1) `torch_dtype` in the model's `config.json`, else (2) infers dtype from the first floating-point weight in the checkpoint. (huggingface.co)
Example:
```python
from transformers import AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained(
    "bert-base-uncased",
    torch_dtype=torch.float16,  # or torch.bfloat16, or "auto"
)
```

What about a `dtype=` parameter?
In the Transformers docs, `dtype` is documented for JAX/Flax model loading/computation (not PyTorch). For PyTorch you should use `torch_dtype`. (huggingface.co)
Citations:
- 1: https://huggingface.co/docs/transformers/v4.31.0/main_classes/model?utm_source=openai
- 2: https://huggingface.co/docs/transformers/v4.31.0/main_classes/model?utm_source=openai
- 3: https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/model?utm_source=openai
- 4: https://huggingface.co/docs/transformers/v4.31.0/main_classes/model?utm_source=openai
🏁 Script executed:
```bash
find . -name "export.py" -path "*/esm2/*" | head -5
```

Repository: NVIDIA/bionemo-framework
Length of output: 107
🏁 Script executed:
```bash
cat -n bionemo-recipes/models/esm2/export.py | sed -n '110,130p'
```

Repository: NVIDIA/bionemo-framework
Length of output: 804
Change `dtype` to `torch_dtype` in the `from_pretrained()` call.
The HuggingFace transformers API uses `torch_dtype`, not `dtype`, for PyTorch models. Using `dtype` will be silently ignored and absorbed by `**kwargs`, causing the model to load in the default precision (fp32) instead of bfloat16, which increases memory usage during the smoke test.
Current code (lines 118-122):
```python
model_te = AutoModelForMaskedLM.from_pretrained(
    export_path / tag,
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
```
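The corrected call, per the finding above (same arguments, only the kwarg name changes):

```python
# Suggested fix: torch_dtype is the PyTorch-specific kwarg recognized by
# from_pretrained; an unknown `dtype` kwarg is silently absorbed by **kwargs.
model_te = AutoModelForMaskedLM.from_pretrained(
    export_path / tag,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```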
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/models/esm2/export.py` around lines 118 - 122, The
from_pretrained call for AutoModelForMaskedLM is passing dtype instead of the
correct HuggingFace kwarg torch_dtype, so bfloat16 is ignored; update the call
that constructs model_te (AutoModelForMaskedLM.from_pretrained) to pass
torch_dtype=torch.bfloat16 (keeping trust_remote_code=True and the same path) so
the model actually loads in bfloat16 during the smoke test.
As we're adding the ESM-C model, we're going to need to install the `esm` package in the devcontainer for running ESM-C model tests. This will conflict with our `esm` pseudo-package I created for the original ESM-2 model folder.

This PR just flattens the model into a single folder with a `requirements.txt`, like llama3 and the other, newer models.
Summary by CodeRabbit
Release Notes
New Features
Chores