⚡️ Speed up function `get_analyzer_for_file` by 32% in PR #1384 (`non-unicode-pytest-fail`) #1385

codeflash-ai · 2026-02-04T17:03:37Z

⚡️ This pull request contains optimizations for PR #1384

If you approve this dependent PR, these changes will be merged into the original PR branch non-unicode-pytest-fail.

This PR will be automatically closed if the original PR is merged.

📄 32% (0.32x) speedup for `get_analyzer_for_file` in `codeflash/languages/treesitter_utils.py`

⏱️ Runtime : 1.10 milliseconds → 834 microseconds (best of 235 runs)

📝 Explanation and details

The optimized code achieves a 32% runtime improvement (1.10 ms → 834 μs) by introducing `@lru_cache` to cache `TreeSitterAnalyzer` instances by file extension, eliminating redundant object creation.

Key Optimization

Added LRU caching: The new `_analyzer_for_suffix()` helper function uses `@lru_cache(maxsize=16)` to cache analyzer instances. When the same file extension is encountered multiple times, the cached analyzer is returned instead of creating a new `TreeSitterAnalyzer` object.
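The diff itself is not shown on this page, so the following is only a minimal sketch of what such a cached helper could look like. `TreeSitterLanguage` and `TreeSitterAnalyzer` here are simplified stand-ins (assumptions), not the real classes from `codeflash/languages/treesitter_utils.py`; the suffix mapping follows the behavior described by the generated tests (`.ts` → TypeScript, `.tsx` → TSX, everything else → JavaScript).

```python
from enum import Enum
from functools import lru_cache
from pathlib import Path


class TreeSitterLanguage(Enum):
    # Stand-in enum (assumption); the real one lives in treesitter_utils.py.
    TYPESCRIPT = "typescript"
    TSX = "tsx"
    JAVASCRIPT = "javascript"


class TreeSitterAnalyzer:
    # Stand-in analyzer (assumption); only the attributes named in the PR text.
    def __init__(self, language: TreeSitterLanguage) -> None:
        self.language = language
        self._parser = None  # placeholder for the real parser object


@lru_cache(maxsize=16)
def _analyzer_for_suffix(suffix: str) -> TreeSitterAnalyzer:
    # One analyzer per lowercased suffix; unknown suffixes fall back to JavaScript.
    if suffix == ".ts":
        return TreeSitterAnalyzer(TreeSitterLanguage.TYPESCRIPT)
    if suffix == ".tsx":
        return TreeSitterAnalyzer(TreeSitterLanguage.TSX)
    return TreeSitterAnalyzer(TreeSitterLanguage.JAVASCRIPT)


def get_analyzer_for_file(file_path: Path) -> TreeSitterAnalyzer:
    # Lowercase before the lookup so ".TS" and ".ts" share one cache entry.
    return _analyzer_for_suffix(file_path.suffix.lower())
```

One consequence of this design is that callers with the same suffix receive the same analyzer object, so it is only safe if callers do not rely on getting a fresh, independently mutable instance per call.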

Why This Improves Runtime

1. Eliminates repeated object instantiation: The original code created a new `TreeSitterAnalyzer` every time `get_analyzer_for_file()` was called, even for the same file type. Line profiler shows that in the original version, `TreeSitterAnalyzer.__init__` was called 1,082 times, consuming 1.15ms. In the optimized version, it is called only 38 times (cache misses), consuming just 55μs, a 95% reduction.
2. Fast dictionary lookup vs object creation: The LRU cache uses a fast dictionary lookup (O(1)) to return cached analyzers. This is significantly faster than the original flow, which required:
   - Creating a new object
   - Running an `isinstance()` check
   - Assigning attributes (`self.language`, `self._parser`)
3. Reduced memory allocation overhead: Each new `TreeSitterAnalyzer` instance requires memory allocation and initialization. Reusing cached instances eliminates this overhead for repeated file extensions.
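The drop in `__init__` calls can be illustrated with a small counting experiment. This is a sketch under assumptions: `CountingAnalyzer`, `uncached`, and `cached` are hypothetical names, and the workload (three suffixes repeated 100 times) is invented for illustration rather than taken from the profiler run above.

```python
from functools import lru_cache


class CountingAnalyzer:
    # Hypothetical stand-in that counts how often it is constructed.
    created = 0

    def __init__(self, suffix: str) -> None:
        CountingAnalyzer.created += 1
        self.suffix = suffix


def uncached(suffix: str) -> CountingAnalyzer:
    # Original-style flow: a new object on every call.
    return CountingAnalyzer(suffix)


@lru_cache(maxsize=16)
def cached(suffix: str) -> CountingAnalyzer:
    # Optimized-style flow: construct once per distinct suffix.
    return CountingAnalyzer(suffix)


def count_constructions(factory, suffixes) -> int:
    # Reset the counter, run the workload, and report constructions.
    CountingAnalyzer.created = 0
    for suffix in suffixes:
        factory(suffix)
    return CountingAnalyzer.created


workload = [".ts", ".tsx", ".js"] * 100
uncached_count = count_constructions(uncached, workload)
cached_count = count_constructions(cached, workload)
print(uncached_count, cached_count)  # 300 3
```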

Impact on Hot Path Usage

The function references show get_analyzer_for_file() is called extensively in test discovery code across multiple test files. The function is invoked within loops for processing JavaScript/TypeScript test files, making it a hot path. For example:

- Processing 100+ files in `test_multiple_ts_files_consistent_results`
- Called repeatedly in test batches and nested loops

Since the same file extensions (.ts, .tsx, .js) are processed repeatedly in these loops, the cache hit rate is very high, maximizing the optimization's benefit.
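The hit rate for a loop like the one in `test_multiple_ts_files_consistent_results` can be checked with `cache_info()`, which every `functools.lru_cache`-wrapped function exposes. The helper `language_for_suffix` below is a hypothetical stand-in that returns a string instead of an analyzer.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=16)
def language_for_suffix(suffix: str) -> str:
    # Hypothetical stand-in for the cached helper; returns a language name.
    return {".ts": "typescript", ".tsx": "tsx"}.get(suffix, "javascript")


# 100 differently named files that all share the ".ts" suffix.
for i in range(100):
    language_for_suffix(Path(f"file_{i}.ts").suffix.lower())

info = language_for_suffix.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
print(info.hits, info.misses, hit_rate)  # 99 1 0.99
```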

Test Case Performance

The annotated tests confirm this optimization excels when:

- Processing the same extension multiple times: Tests like `test_multiple_ts_files_consistent_results` show a 33.8% speedup
- Common extensions (.ts, .tsx, .js): 35-48% faster on individual calls
- Batch operations: Processing lists of files with repeated extensions sees consistent 30-40% improvements

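Uncommon extensions (.txt, .py, .go) and extensionless files may show a slight regression (12-19% slower). In the annotated tests, the slower calls (`notes.txt`, `script.py`, `main.go`, and the first `Makefile` call) all involve suffixes outside the common JavaScript/TypeScript set, which is consistent with these calls being cache misses: the analyzer is still constructed and the cache bookkeeping is added on top. A later call with the same suffix (`example.txt`, the second `Makefile` call) is faster. Such misses are rare in practice given the function's usage for JavaScript/TypeScript file analysis.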
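How a bounded cache behaves with many distinct suffixes can be sketched as follows. The workload loosely mirrors `test_large_number_of_unknown_extensions` (23 suffixes, 10 files each), but the suffix names and the `key_for_suffix` helper are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=16)
def key_for_suffix(suffix: str) -> str:
    # Hypothetical stand-in for the cached helper.
    return suffix.lstrip(".")


# 23 distinct suffixes, each requested 10 times in a row.
suffixes = [f".ext{i}" for i in range(23)]
for suffix in suffixes:
    for _ in range(10):
        key_for_suffix(suffix)

info = key_for_suffix.cache_info()
print(info.misses, info.hits, info.currsize)  # 23 207 16
```

Each new suffix costs one miss, and because the cache holds at most 16 entries, the oldest suffixes are evicted and would miss again if requested later.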

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 1081 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	100.0%

🌀 Generated Regression Tests

from enum import Enum
from pathlib import Path

# imports
import pytest  # used for our unit tests
from codeflash.languages.treesitter_utils import get_analyzer_for_file

# function to test
# (Preserve the original function signature and implementation exactly as provided)
# We also define any minimal supporting types (TreeSitterLanguage, Parser) that the
# original code expects so the function can be exercised in tests.
class TreeSitterLanguage(Enum):
    # values are lowercase so TreeSitterLanguage("typescript") works via value lookup
    TYPESCRIPT = "typescript"
    TSX = "tsx"
    JAVASCRIPT = "javascript"

def test_basic_typescript_suffix_returns_typescript():
    # Basic: .ts should map to TYPESCRIPT
    path = Path("example.ts")
    codeflash_output = get_analyzer_for_file(path); analyzer = codeflash_output # 3.04μs -> 2.22μs (36.9% faster)

def test_basic_tsx_suffix_returns_tsx_case_insensitive():
    # Basic + edge: .tsx (even with mixed case) should map to TSX due to .lower()
    path = Path("Component.TsX")  # mixed-case extension
    codeflash_output = get_analyzer_for_file(path); analyzer = codeflash_output # 3.06μs -> 2.14μs (43.0% faster)

def test_js_variants_default_to_javascript():
    # Basic: several JavaScript-related extensions should map to JAVASCRIPT
    js_extensions = ["app.js", "widget.jsx", "module.mjs", "script.cjs"]
    for name in js_extensions:
        p = Path(name)
        codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 6.53μs -> 4.74μs (37.9% faster)

def test_no_extension_defaults_to_javascript():
    # Edge: files without an extension (suffix == '') should default to JAVASCRIPT
    p = Path("Makefile")  # no dot-suffix
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 2.75μs -> 3.34μs (17.4% slower)

    # Also a filename that ends with a dot yields an empty suffix ('.' is not a suffix)
    p2 = Path("strangefile.")
    codeflash_output = get_analyzer_for_file(p2); a2 = codeflash_output # 1.33μs -> 1.04μs (27.9% faster)

def test_multi_suffix_d_ts_map_to_typescript():
    # Edge: filenames like index.d.ts have suffix '.ts' (Path.suffix is last suffix)
    p = Path("index.d.ts")
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 2.90μs -> 2.12μs (36.8% faster)

def test_uppercase_ts_suffix_still_recognized():
    # Edge: uppercase extension should be lowercased internally and recognized
    p = Path("LIB.TS")
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 2.92μs -> 2.15μs (35.4% faster)

def test_unusual_suffix_defaults_to_javascript():
    # Edge: completely unknown extensions should default to JAVASCRIPT per implementation
    p = Path("notes.txt")
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 2.95μs -> 3.63μs (18.8% slower)

def test_returned_language_is_enum_member_and_not_string():
    # Ensure the analyzer.language is an enum member, not a raw string
    codeflash_output = get_analyzer_for_file(Path("file.ts")); a = codeflash_output # 2.98μs -> 2.07μs (44.0% faster)

def test_path_with_multiple_dots_but_last_suffix_counts():
    # Edge: filenames like 'a.b.c.TS' should consider only the last suffix
    p = Path("a.b.c.TS")
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 3.50μs -> 2.42μs (44.8% faster)

def test_api_contract_does_not_modify_input_path():
    # Ensure the function does not mutate the provided Path object
    original = Path("original.ts")
    copy = Path(str(original))  # create a separate Path with same string
    codeflash_output = get_analyzer_for_file(original); _ = codeflash_output # 3.29μs -> 2.31μs (41.9% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from pathlib import Path

# imports
import pytest
from codeflash.languages.treesitter_utils import (TreeSitterAnalyzer,
                                                  TreeSitterLanguage,
                                                  get_analyzer_for_file)

def test_get_analyzer_for_typescript_file():
    """Test that .ts files return a TypeScript analyzer."""
    file_path = Path("example.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.18μs (37.1% faster)

def test_get_analyzer_for_tsx_file():
    """Test that .tsx files return a TSX analyzer."""
    file_path = Path("example.tsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.08μs -> 2.26μs (35.9% faster)

def test_get_analyzer_for_javascript_file():
    """Test that .js files return a JavaScript analyzer."""
    file_path = Path("example.js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.18μs (37.2% faster)

def test_get_analyzer_for_jsx_file():
    """Test that .jsx files default to JavaScript analyzer."""
    file_path = Path("example.jsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.17μs (37.8% faster)

def test_get_analyzer_for_mjs_file():
    """Test that .mjs files default to JavaScript analyzer."""
    file_path = Path("example.mjs")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.11μs (42.1% faster)

def test_get_analyzer_for_cjs_file():
    """Test that .cjs files default to JavaScript analyzer."""
    file_path = Path("example.cjs")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.95μs -> 2.14μs (37.4% faster)

def test_analyzer_has_language_attribute():
    """Test that the returned analyzer has a language attribute set."""
    file_path = Path("test.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.15μs (39.2% faster)

def test_analyzer_has_parser_attribute():
    """Test that the returned analyzer has a _parser attribute."""
    file_path = Path("test.js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.10μs -> 2.08μs (48.6% faster)

def test_uppercase_typescript_extension():
    """Test that uppercase .TS extension is handled correctly (case-insensitive)."""
    file_path = Path("EXAMPLE.TS")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.98μs -> 2.13μs (39.5% faster)

def test_uppercase_tsx_extension():
    """Test that uppercase .TSX extension is handled correctly (case-insensitive)."""
    file_path = Path("EXAMPLE.TSX")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.10μs (42.4% faster)

def test_mixed_case_javascript_extension():
    """Test that mixed case extensions are handled correctly."""
    file_path = Path("example.Js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.97μs -> 2.17μs (36.4% faster)

def test_mixed_case_jsx_extension():
    """Test that mixed case .Jsx extension defaults to JavaScript."""
    file_path = Path("example.JsX")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.03μs -> 2.10μs (43.9% faster)

def test_file_with_multiple_dots():
    """Test that files with multiple dots in name are handled correctly."""
    file_path = Path("my.module.service.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.95μs -> 2.21μs (33.1% faster)

def test_file_with_multiple_dots_tsx():
    """Test that .tsx files with multiple dots in name are handled correctly."""
    file_path = Path("my.component.container.tsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.07μs -> 2.14μs (43.1% faster)

def test_unknown_extension_defaults_to_javascript():
    """Test that unknown extensions default to JavaScript analyzer."""
    file_path = Path("example.txt")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.94μs -> 2.09μs (40.6% faster)

def test_unknown_extension_py_defaults_to_javascript():
    """Test that Python files default to JavaScript analyzer."""
    file_path = Path("script.py")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.01μs -> 3.73μs (19.3% slower)

def test_unknown_extension_go_defaults_to_javascript():
    """Test that Go files default to JavaScript analyzer."""
    file_path = Path("main.go")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.96μs -> 3.39μs (12.5% slower)

def test_file_without_extension():
    """Test that files without an extension default to JavaScript analyzer."""
    file_path = Path("Makefile")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.71μs -> 1.71μs (57.8% faster)

def test_file_with_only_dot():
    """Test that files that are only a dot extension default to JavaScript."""
    file_path = Path(".ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.60μs -> 1.70μs (53.0% faster)

def test_hidden_typescript_file():
    """Test that hidden TypeScript files are handled correctly."""
    file_path = Path(".example.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.94μs -> 2.14μs (36.9% faster)

def test_path_with_directories():
    """Test that file paths with multiple directory components work correctly."""
    file_path = Path("src/components/Button.tsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.04μs -> 2.10μs (44.2% faster)

def test_absolute_path_typescript():
    """Test that absolute file paths work correctly for TypeScript."""
    file_path = Path("/home/user/project/src/main.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.93μs -> 2.11μs (38.5% faster)

def test_absolute_path_javascript():
    """Test that absolute file paths work correctly for JavaScript."""
    file_path = Path("/var/www/app/index.js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.94μs -> 2.07μs (42.0% faster)

def test_windows_style_path_typescript():
    """Test that Windows-style paths work correctly for TypeScript."""
    file_path = Path("C:\\Users\\user\\project\\src\\main.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.88μs -> 2.15μs (33.9% faster)

def test_empty_string_filename():
    """Test that path with empty string and extension defaults to JavaScript."""
    file_path = Path(".js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.65μs -> 1.73μs (53.2% faster)

def test_very_long_filename():
    """Test that very long filenames are handled correctly."""
    long_name = "a" * 200 + ".ts"
    file_path = Path(long_name)
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.25μs (33.3% faster)

def test_special_characters_in_filename():
    """Test that filenames with special characters are handled correctly."""
    file_path = Path("my-file_name@2.0.tsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.12μs -> 2.23μs (39.5% faster)

def test_unicode_characters_in_filename():
    """Test that filenames with unicode characters are handled correctly."""
    file_path = Path("файл_名前.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.20μs -> 2.31μs (38.1% faster)

def test_multiple_ts_files_consistent_results():
    """Test that processing multiple TypeScript files returns consistent results."""
    results = []
    for i in range(100):
        file_path = Path(f"file_{i}.ts")
        codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 93.5μs -> 69.9μs (33.8% faster)
        results.append(analyzer.language)

def test_multiple_different_extension_files():
    """Test that processing various file extensions maintains correct mappings."""
    extensions_and_expected_languages = [
        ("file.ts", TreeSitterLanguage.TYPESCRIPT),
        ("file.tsx", TreeSitterLanguage.TSX),
        ("file.js", TreeSitterLanguage.JAVASCRIPT),
        ("file.jsx", TreeSitterLanguage.JAVASCRIPT),
        ("file.mjs", TreeSitterLanguage.JAVASCRIPT),
        ("file.cjs", TreeSitterLanguage.JAVASCRIPT),
    ]
    
    # Test each extension multiple times to ensure consistency
    for _ in range(20):
        for ext, expected_lang in extensions_and_expected_languages:
            file_path = Path(ext)
            codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output

def test_large_batch_of_mixed_case_extensions():
    """Test handling of large batch of mixed-case extensions."""
    test_cases = [
        ("FILE.TS", TreeSitterLanguage.TYPESCRIPT),
        ("File.Ts", TreeSitterLanguage.TYPESCRIPT),
        ("file.ts", TreeSitterLanguage.TYPESCRIPT),
        ("FILE.TSX", TreeSitterLanguage.TSX),
        ("File.Tsx", TreeSitterLanguage.TSX),
        ("file.tsx", TreeSitterLanguage.TSX),
        ("FILE.JS", TreeSitterLanguage.JAVASCRIPT),
        ("File.Js", TreeSitterLanguage.JAVASCRIPT),
        ("file.js", TreeSitterLanguage.JAVASCRIPT),
    ]
    
    # Test each case multiple times
    for _ in range(50):
        for file_name, expected_lang in test_cases:
            file_path = Path(file_name)
            codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output

def test_large_number_of_unknown_extensions():
    """Test that large number of unknown extensions consistently default to JavaScript."""
    unknown_extensions = [
        ".txt", ".md", ".py", ".go", ".rb", ".java", ".cpp", ".c",
        ".h", ".swift", ".kt", ".rs", ".sh", ".bash", ".json", ".xml",
        ".html", ".css", ".scss", ".less", ".sql", ".yaml", ".toml",
    ]
    
    results = []
    for ext in unknown_extensions:
        for i in range(10):
            file_path = Path(f"file_{i}{ext}")
            codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output
            results.append(analyzer.language)

def test_various_path_components_with_tsx():
    """Test TSX files with various path component depths."""
    for depth in range(1, 20):
        path_parts = ["component"] * depth + ["Button.tsx"]
        file_path = Path("/".join(path_parts))
        codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 21.2μs -> 17.3μs (22.6% faster)

def test_batch_of_complex_filenames():
    """Test batch of complex filenames with multiple dots and extensions."""
    complex_names = [
        "my.service.module.ts",
        "button.component.tsx",
        "util.helper.function.js",
        "index.page.router.mjs",
        "bootstrap.config.cjs",
        "styles.theme.module.jsx",
    ]
    
    for name in complex_names:
        for i in range(20):
            file_path = Path(f"dir_{i}/{name}")
            codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-pr1384-2026-02-04T17.03.31` and push.


codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels on Feb 4, 2026

codeflash-ai bot mentioned this pull request Feb 4, 2026

Prevent pytest fail with non unicode chars #1384

Open

claude bot mentioned this pull request Feb 4, 2026

Multiple fixes around vitest #1386

Merged
