codeflash-ai bot commented Feb 4, 2026

⚡️ This pull request contains optimizations for PR #1384

If you approve this dependent PR, these changes will be merged into the original PR branch non-unicode-pytest-fail.

This PR will be automatically closed if the original PR is merged.


📄 32% (0.32x) speedup for get_analyzer_for_file in codeflash/languages/treesitter_utils.py

⏱️ Runtime: 1.10 milliseconds → 834 microseconds (best of 235 runs)

📝 Explanation and details

The optimized code runs about 32% faster (1.10 ms down to 834 μs) by using @lru_cache to cache TreeSitterAnalyzer instances per file extension, which eliminates redundant object creation.

Key Optimization

Added LRU caching: The new _analyzer_for_suffix() helper function uses @lru_cache(maxsize=16) to cache analyzer instances. When the same file extension is encountered multiple times, the cached analyzer is returned instead of creating a new TreeSitterAnalyzer object.
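A minimal sketch of the caching pattern described above. The TreeSitterAnalyzer and TreeSitterLanguage classes here are simplified stand-ins, and the suffix-to-language mapping is inferred from the generated tests, so the actual code in codeflash/languages/treesitter_utils.py may differ.

```python
from enum import Enum
from functools import lru_cache
from pathlib import Path


class TreeSitterLanguage(Enum):
    TYPESCRIPT = "typescript"
    TSX = "tsx"
    JAVASCRIPT = "javascript"


class TreeSitterAnalyzer:
    """Stand-in for the real analyzer; it only stores the language."""

    def __init__(self, language: TreeSitterLanguage) -> None:
        self.language = language


@lru_cache(maxsize=16)
def _analyzer_for_suffix(suffix: str) -> TreeSitterAnalyzer:
    # The body runs only on a cache miss; later calls with the same
    # suffix return the instance built the first time.
    if suffix == ".ts":
        return TreeSitterAnalyzer(TreeSitterLanguage.TYPESCRIPT)
    if suffix == ".tsx":
        return TreeSitterAnalyzer(TreeSitterLanguage.TSX)
    # Every other suffix falls back to JavaScript (assumed from the tests).
    return TreeSitterAnalyzer(TreeSitterLanguage.JAVASCRIPT)


def get_analyzer_for_file(file_path: Path) -> TreeSitterAnalyzer:
    # Lowercasing first lets ".TS" and ".ts" share one cache entry.
    return _analyzer_for_suffix(file_path.suffix.lower())
```

Note that callers asking for the same suffix receive the same object, so this pattern presumably relies on analyzers not carrying per-file mutable state.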

Why This Improves Runtime

  1. Eliminates repeated object instantiation: The original code created a new TreeSitterAnalyzer every time get_analyzer_for_file() was called, even for the same file type. Line profiler shows that in the original version, TreeSitterAnalyzer.__init__ was called 1,082 times, consuming 1.15ms. In the optimized version, it's only called 38 times (cache misses), consuming just 55μs - a 95% reduction.

  2. Fast dictionary lookup vs object creation: The LRU cache uses a fast dictionary lookup (O(1)) to return cached analyzers. This is significantly faster than the original flow which required:

    • Creating a new object
    • Running isinstance() check
    • Assigning attributes (self.language, self._parser)

  3. Reduced memory allocation overhead: Each new TreeSitterAnalyzer instance requires memory allocation and initialization. Reusing cached instances eliminates this overhead for repeated file extensions.
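The drop in constructor calls can be illustrated with a small counting experiment. This is a sketch only: CountingAnalyzer, make_uncached, and make_cached are hypothetical names, and the workload is invented for illustration rather than taken from the profiler run quoted above.

```python
from functools import lru_cache


class CountingAnalyzer:
    """Hypothetical stand-in that counts how often it is constructed."""

    instances_created = 0

    def __init__(self, language: str) -> None:
        CountingAnalyzer.instances_created += 1
        self.language = language


def make_uncached(suffix: str) -> CountingAnalyzer:
    # Mirrors the original behavior: a new object on every call.
    return CountingAnalyzer("typescript" if suffix == ".ts" else "javascript")


@lru_cache(maxsize=16)
def make_cached(suffix: str) -> CountingAnalyzer:
    # Mirrors the optimized behavior: a new object only on a cache miss.
    return CountingAnalyzer("typescript" if suffix == ".ts" else "javascript")


def count_constructions(factory, suffixes) -> int:
    """Return how many analyzers `factory` builds while serving `suffixes`."""
    before = CountingAnalyzer.instances_created
    for suffix in suffixes:
        factory(suffix)
    return CountingAnalyzer.instances_created - before


# 100 lookups over only two distinct suffixes.
workload = [".ts", ".js", ".ts", ".ts", ".js"] * 20
uncached_count = count_constructions(make_uncached, workload)
cached_count = count_constructions(make_cached, workload)
```

With this workload the uncached factory constructs one object per lookup, while the cached factory constructs one per distinct suffix.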

Impact on Hot Path Usage

The function references show get_analyzer_for_file() is called extensively in test discovery code across multiple test files. The function is invoked within loops for processing JavaScript/TypeScript test files, making it a hot path. For example:

  • Processing 100+ files in test_multiple_ts_files_consistent_results
  • Called repeatedly in test batches and nested loops

Since the same file extensions (.ts, .tsx, .js) are processed repeatedly in these loops, the cache hit rate is very high, maximizing the optimization's benefit.
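To make the hit-rate argument concrete, functools exposes cache statistics through cache_info(). The sketch below uses a hypothetical language_for_suffix helper that returns a plain label instead of an analyzer, and an invented batch of file names.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=16)
def language_for_suffix(suffix: str) -> str:
    # Hypothetical stand-in for the cached helper; returns a label only.
    return {".ts": "typescript", ".tsx": "tsx"}.get(suffix, "javascript")


def hit_rate(paths) -> float:
    """Run the lookups and return the fraction served from the cache."""
    language_for_suffix.cache_clear()
    for path in paths:
        language_for_suffix(Path(path).suffix.lower())
    info = language_for_suffix.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0


# 100 TypeScript files plus one TSX and one JS file: three distinct suffixes.
batch = [f"file_{i}.ts" for i in range(100)] + ["Button.TSX", "index.js"]
rate = hit_rate(batch)
```

Only the first lookup of each distinct suffix misses, so the hit rate approaches 1 as the batch grows.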

Test Case Performance

The annotated tests confirm this optimization excels when:

  • Processing the same extension multiple times: Tests like test_multiple_ts_files_consistent_results show 33.8% speedup
  • Common extensions (.ts, .tsx, .js): 35-48% faster on individual calls
  • Batch operations: Processing lists of files with repeated extensions sees consistent 30-40% improvements

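A few annotated cases run 12-19% slower (.py, .go, and one run each of .txt and an extensionless file), which this report attributes to cache lookup overhead. The same .txt and extensionless inputs run 40-58% faster in other tests, so these measurements appear noisy or order-dependent. Such extensions are also rare in practice given the function's use for JavaScript/TypeScript file analysis.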
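One plausible contributor, offered as an assumption rather than something the report measures, is that a lookup for a suffix not currently in the cache pays the cache bookkeeping on top of construction. With more distinct suffixes than maxsize, LRU eviction makes such misses recur. The sketch below uses a deliberately tiny cache and a hypothetical lookup function to show the eviction behavior.

```python
from functools import lru_cache


@lru_cache(maxsize=2)  # deliberately tiny; the PR uses maxsize=16
def lookup(suffix: str) -> str:
    # Hypothetical stand-in: every unknown suffix maps to JavaScript.
    return "javascript"


# Three distinct suffixes cycle through a two-entry cache, so ".txt"
# has already been evicted by the time it is requested again.
for suffix in [".txt", ".py", ".go", ".txt"]:
    lookup(suffix)

info = lookup.cache_info()
```

Here all four calls miss, because the least recently used entry is dropped each time a third suffix arrives. An immediate repeat of the last suffix would hit.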

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 1081 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests
from enum import Enum
from pathlib import Path

# imports
import pytest  # used for our unit tests
from codeflash.languages.treesitter_utils import get_analyzer_for_file

# The function under test is imported above; the enum below is a minimal local
# stand-in for the language values the analyzer exposes.
class TreeSitterLanguage(Enum):
    # values are lowercase so TreeSitterLanguage("typescript") works via value lookup
    TYPESCRIPT = "typescript"
    TSX = "tsx"
    JAVASCRIPT = "javascript"

def test_basic_typescript_suffix_returns_typescript():
    # Basic: .ts should map to TYPESCRIPT
    path = Path("example.ts")
    codeflash_output = get_analyzer_for_file(path); analyzer = codeflash_output # 3.04μs -> 2.22μs (36.9% faster)

def test_basic_tsx_suffix_returns_tsx_case_insensitive():
    # Basic + edge: .tsx (even with mixed case) should map to TSX due to .lower()
    path = Path("Component.TsX")  # mixed-case extension
    codeflash_output = get_analyzer_for_file(path); analyzer = codeflash_output # 3.06μs -> 2.14μs (43.0% faster)

def test_js_variants_default_to_javascript():
    # Basic: several JavaScript-related extensions should map to JAVASCRIPT
    js_extensions = ["app.js", "widget.jsx", "module.mjs", "script.cjs"]
    for name in js_extensions:
        p = Path(name)
        codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 6.53μs -> 4.74μs (37.9% faster)

def test_no_extension_defaults_to_javascript():
    # Edge: files without an extension (suffix == '') should default to JAVASCRIPT
    p = Path("Makefile")  # no dot-suffix
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 2.75μs -> 3.34μs (17.4% slower)

    # Also a filename that ends with a dot yields an empty suffix ('.' is not a suffix)
    p2 = Path("strangefile.")
    codeflash_output = get_analyzer_for_file(p2); a2 = codeflash_output # 1.33μs -> 1.04μs (27.9% faster)

def test_multi_suffix_d_ts_map_to_typescript():
    # Edge: filenames like index.d.ts have suffix '.ts' (Path.suffix is last suffix)
    p = Path("index.d.ts")
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 2.90μs -> 2.12μs (36.8% faster)

def test_uppercase_ts_suffix_still_recognized():
    # Edge: uppercase extension should be lowercased internally and recognized
    p = Path("LIB.TS")
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 2.92μs -> 2.15μs (35.4% faster)

def test_unusual_suffix_defaults_to_javascript():
    # Edge: completely unknown extensions should default to JAVASCRIPT per implementation
    p = Path("notes.txt")
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 2.95μs -> 3.63μs (18.8% slower)

def test_returned_language_is_enum_member_and_not_string():
    # Ensure the analyzer.language is an enum member, not a raw string
    codeflash_output = get_analyzer_for_file(Path("file.ts")); a = codeflash_output # 2.98μs -> 2.07μs (44.0% faster)

def test_path_with_multiple_dots_but_last_suffix_counts():
    # Edge: filenames like 'a.b.c.TS' should consider only the last suffix
    p = Path("a.b.c.TS")
    codeflash_output = get_analyzer_for_file(p); a = codeflash_output # 3.50μs -> 2.42μs (44.8% faster)

def test_api_contract_does_not_modify_input_path():
    # Ensure the function does not mutate the provided Path object
    original = Path("original.ts")
    copy = Path(str(original))  # create a separate Path with same string
    codeflash_output = get_analyzer_for_file(original); _ = codeflash_output # 3.29μs -> 2.31μs (41.9% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from pathlib import Path

# imports
import pytest
from codeflash.languages.treesitter_utils import (TreeSitterAnalyzer,
                                                  TreeSitterLanguage,
                                                  get_analyzer_for_file)

def test_get_analyzer_for_typescript_file():
    """Test that .ts files return a TypeScript analyzer."""
    file_path = Path("example.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.18μs (37.1% faster)

def test_get_analyzer_for_tsx_file():
    """Test that .tsx files return a TSX analyzer."""
    file_path = Path("example.tsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.08μs -> 2.26μs (35.9% faster)

def test_get_analyzer_for_javascript_file():
    """Test that .js files return a JavaScript analyzer."""
    file_path = Path("example.js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.18μs (37.2% faster)

def test_get_analyzer_for_jsx_file():
    """Test that .jsx files default to JavaScript analyzer."""
    file_path = Path("example.jsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.17μs (37.8% faster)

def test_get_analyzer_for_mjs_file():
    """Test that .mjs files default to JavaScript analyzer."""
    file_path = Path("example.mjs")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.11μs (42.1% faster)

def test_get_analyzer_for_cjs_file():
    """Test that .cjs files default to JavaScript analyzer."""
    file_path = Path("example.cjs")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.95μs -> 2.14μs (37.4% faster)

def test_analyzer_has_language_attribute():
    """Test that the returned analyzer has a language attribute set."""
    file_path = Path("test.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.15μs (39.2% faster)

def test_analyzer_has_parser_attribute():
    """Test that the returned analyzer has a _parser attribute."""
    file_path = Path("test.js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.10μs -> 2.08μs (48.6% faster)

def test_uppercase_typescript_extension():
    """Test that uppercase .TS extension is handled correctly (case-insensitive)."""
    file_path = Path("EXAMPLE.TS")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.98μs -> 2.13μs (39.5% faster)

def test_uppercase_tsx_extension():
    """Test that uppercase .TSX extension is handled correctly (case-insensitive)."""
    file_path = Path("EXAMPLE.TSX")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.10μs (42.4% faster)

def test_mixed_case_javascript_extension():
    """Test that mixed case extensions are handled correctly."""
    file_path = Path("example.Js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.97μs -> 2.17μs (36.4% faster)

def test_mixed_case_jsx_extension():
    """Test that mixed case .Jsx extension defaults to JavaScript."""
    file_path = Path("example.JsX")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.03μs -> 2.10μs (43.9% faster)

def test_file_with_multiple_dots():
    """Test that files with multiple dots in name are handled correctly."""
    file_path = Path("my.module.service.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.95μs -> 2.21μs (33.1% faster)

def test_file_with_multiple_dots_tsx():
    """Test that .tsx files with multiple dots in name are handled correctly."""
    file_path = Path("my.component.container.tsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.07μs -> 2.14μs (43.1% faster)

def test_unknown_extension_defaults_to_javascript():
    """Test that unknown extensions default to JavaScript analyzer."""
    file_path = Path("example.txt")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.94μs -> 2.09μs (40.6% faster)

def test_unknown_extension_py_defaults_to_javascript():
    """Test that Python files default to JavaScript analyzer."""
    file_path = Path("script.py")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.01μs -> 3.73μs (19.3% slower)

def test_unknown_extension_go_defaults_to_javascript():
    """Test that Go files default to JavaScript analyzer."""
    file_path = Path("main.go")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.96μs -> 3.39μs (12.5% slower)

def test_file_without_extension():
    """Test that files without an extension default to JavaScript analyzer."""
    file_path = Path("Makefile")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.71μs -> 1.71μs (57.8% faster)

def test_file_with_only_dot():
    """Test that files that are only a dot extension default to JavaScript."""
    file_path = Path(".ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.60μs -> 1.70μs (53.0% faster)

def test_hidden_typescript_file():
    """Test that hidden TypeScript files are handled correctly."""
    file_path = Path(".example.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.94μs -> 2.14μs (36.9% faster)

def test_path_with_directories():
    """Test that file paths with multiple directory components work correctly."""
    file_path = Path("src/components/Button.tsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.04μs -> 2.10μs (44.2% faster)

def test_absolute_path_typescript():
    """Test that absolute file paths work correctly for TypeScript."""
    file_path = Path("/home/user/project/src/main.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.93μs -> 2.11μs (38.5% faster)

def test_absolute_path_javascript():
    """Test that absolute file paths work correctly for JavaScript."""
    file_path = Path("/var/www/app/index.js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.94μs -> 2.07μs (42.0% faster)

def test_windows_style_path_typescript():
    """Test that Windows-style paths work correctly for TypeScript."""
    file_path = Path("C:\\Users\\user\\project\\src\\main.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.88μs -> 2.15μs (33.9% faster)

def test_empty_string_filename():
    """Test that path with empty string and extension defaults to JavaScript."""
    file_path = Path(".js")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 2.65μs -> 1.73μs (53.2% faster)

def test_very_long_filename():
    """Test that very long filenames are handled correctly."""
    long_name = "a" * 200 + ".ts"
    file_path = Path(long_name)
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.00μs -> 2.25μs (33.3% faster)

def test_special_characters_in_filename():
    """Test that filenames with special characters are handled correctly."""
    file_path = Path("my-file_name@2.0.tsx")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.12μs -> 2.23μs (39.5% faster)

def test_unicode_characters_in_filename():
    """Test that filenames with unicode characters are handled correctly."""
    file_path = Path("файл_名前.ts")
    codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 3.20μs -> 2.31μs (38.1% faster)

def test_multiple_ts_files_consistent_results():
    """Test that processing multiple TypeScript files returns consistent results."""
    results = []
    for i in range(100):
        file_path = Path(f"file_{i}.ts")
        codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 93.5μs -> 69.9μs (33.8% faster)
        results.append(analyzer.language)

def test_multiple_different_extension_files():
    """Test that processing various file extensions maintains correct mappings."""
    extensions_and_expected_languages = [
        ("file.ts", TreeSitterLanguage.TYPESCRIPT),
        ("file.tsx", TreeSitterLanguage.TSX),
        ("file.js", TreeSitterLanguage.JAVASCRIPT),
        ("file.jsx", TreeSitterLanguage.JAVASCRIPT),
        ("file.mjs", TreeSitterLanguage.JAVASCRIPT),
        ("file.cjs", TreeSitterLanguage.JAVASCRIPT),
    ]
    
    # Test each extension multiple times to ensure consistency
    for _ in range(20):
        for ext, expected_lang in extensions_and_expected_languages:
            file_path = Path(ext)
            codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output

def test_large_batch_of_mixed_case_extensions():
    """Test handling of large batch of mixed-case extensions."""
    test_cases = [
        ("FILE.TS", TreeSitterLanguage.TYPESCRIPT),
        ("File.Ts", TreeSitterLanguage.TYPESCRIPT),
        ("file.ts", TreeSitterLanguage.TYPESCRIPT),
        ("FILE.TSX", TreeSitterLanguage.TSX),
        ("File.Tsx", TreeSitterLanguage.TSX),
        ("file.tsx", TreeSitterLanguage.TSX),
        ("FILE.JS", TreeSitterLanguage.JAVASCRIPT),
        ("File.Js", TreeSitterLanguage.JAVASCRIPT),
        ("file.js", TreeSitterLanguage.JAVASCRIPT),
    ]
    
    # Test each case multiple times
    for _ in range(50):
        for file_name, expected_lang in test_cases:
            file_path = Path(file_name)
            codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output

def test_large_number_of_unknown_extensions():
    """Test that large number of unknown extensions consistently default to JavaScript."""
    unknown_extensions = [
        ".txt", ".md", ".py", ".go", ".rb", ".java", ".cpp", ".c",
        ".h", ".swift", ".kt", ".rs", ".sh", ".bash", ".json", ".xml",
        ".html", ".css", ".scss", ".less", ".sql", ".yaml", ".toml",
    ]
    
    results = []
    for ext in unknown_extensions:
        for i in range(10):
            file_path = Path(f"file_{i}{ext}")
            codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output
            results.append(analyzer.language)

def test_various_path_components_with_tsx():
    """Test TSX files with various path component depths."""
    for depth in range(1, 20):
        path_parts = ["component"] * depth + ["Button.tsx"]
        file_path = Path("/".join(path_parts))
        codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output # 21.2μs -> 17.3μs (22.6% faster)

def test_batch_of_complex_filenames():
    """Test batch of complex filenames with multiple dots and extensions."""
    complex_names = [
        "my.service.module.ts",
        "button.component.tsx",
        "util.helper.function.js",
        "index.page.router.mjs",
        "bootstrap.config.cjs",
        "styles.theme.module.jsx",
    ]
    
    for name in complex_names:
        for i in range(20):
            file_path = Path(f"dir_{i}/{name}")
            codeflash_output = get_analyzer_for_file(file_path); analyzer = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
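Several of the edge cases above depend on how pathlib derives a suffix. The sketch below records the normalized suffix for a few of the tested file names; normalized_suffix is a hypothetical helper that mirrors the suffix.lower() step mentioned in the test comments.

```python
from pathlib import Path


def normalized_suffix(name: str) -> str:
    """Return the last suffix of `name`, lowercased (hypothetical helper)."""
    return Path(name).suffix.lower()


examples = {
    name: normalized_suffix(name)
    for name in [
        "index.d.ts",                 # only the last suffix counts
        "a.b.c.TS",                   # lowercased after extraction
        "Component.TsX",
        "Makefile",                   # no suffix at all
        ".ts",                        # a leading-dot name has no suffix
        ".example.ts",                # hidden file with a real suffix
        "src/components/Button.tsx",  # directories do not matter
    ]
}
```

Names such as Makefile and .ts therefore reach the lookup with an empty suffix and take the JavaScript fallback that the tests describe.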

To edit these changes, run git checkout codeflash/optimize-pr1384-2026-02-04T17.03.31 and push.

codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) on Feb 4, 2026
claude bot mentioned this pull request Feb 4, 2026