Skip to content

Add AGENTS.md with project conventions for AI coding agents#15529

Open
RussellSpitzer wants to merge 4 commits intoapache:mainfrom
RussellSpitzer:AddAgentsMD
Open

Add AGENTS.md with project conventions for AI coding agents#15529
RussellSpitzer wants to merge 4 commits intoapache:mainfrom
RussellSpitzer:AddAgentsMD

Conversation

@RussellSpitzer
Copy link
Member

@RussellSpitzer RussellSpitzer commented Mar 8, 2026

Using my Anthropic Dollars Appropriately

I've been working recently using my own personal review skill to help me go over PR's and quickly get up to date on new requests before I take my own look at the PR and I realized this is probably useful to far more folks than just me. I decided to get ambitious and just analyze absolutely everything in the repo and try to get a good amalgam skill based on everything that's happened so far. Below is the result.

As noted below this was produced by Cursor and I used claude-4.6-opus-high for the work.

To avoid anyone else having to wait for all the comments to download I made a copy here for you to pull if you want to do your own analysis.

The raw review comment dataset is available as a public gist for reproducibility: https://gist.github.com/RussellSpitzer/8dddd1915d0c9fb9e027ab5fd5331c87

--- Bot Text Follows --

Summary

Adds an AGENTS.md file at the repository root following the AGENTS.md open standard — a convention adopted by 60,000+ repositories for providing AI coding agents with project-specific context. This file is automatically discovered by tools like Cursor, Copilot, Claude Code, Codex, and others.

The conventions were synthesized from 58,381 review comments across 4,309 merged PRs spanning the full history of the Apache Iceberg project. Comments were collected via the GitHub GraphQL API from every merged PR with 3+ reviews, filtered to PMC members and committers.

Creation Process

  1. Data collection: Fetched all review comments from merged PRs using gh api graphql with pagination across the full PR history.
  2. Topic clustering: Classified comments into ~17 topic buckets (API design, naming, testing, performance, serialization, REST/spec, error handling, configuration, code style, module boundaries, etc.) using keyword-based pattern matching.
  3. Sampling: Within each bucket, selected the most substantive comments (>80 chars, diverse file paths, de-duplicated).
  4. Rule synthesis: Extracted concrete, actionable conventions from each topic cluster, focusing on patterns that appeared repeatedly and consistently across different reviewers and time periods.
  5. Depersonalization: All rules are generic — no attribution to specific reviewers or behavioral profiles.

What's Covered

  • Architecture: Module boundaries, high-sensitivity areas
  • Design patterns: Refinement, CloseableIterable, null-over-Optional, builder, boolean-threw, Tasks, immutable metadata, etc.
  • Coding conventions: API design, naming, code style, placement, serialization, error handling, performance, configuration, testing, REST/OpenAPI

What's NOT included

  • No reviewer behavioral profiles or personal attributions // Russ note - It really wanted to make comments about what specific folks like to say and I thought that was inappropriate

Test plan

  • Verify file renders correctly on GitHub
  • Verify conventions are accurate against existing codebase patterns
  • Solicit feedback from PMC members on coverage and accuracy

Made with Cursor

Introduces an AGENTS.md file at the repository root following the
open standard for providing AI coding agents with project-specific
context. The conventions were synthesized from analysis of 58,000+
review comments across 4,300+ merged PRs spanning the project history.

Covers module boundaries, high-sensitivity areas, design patterns,
coding style, naming, serialization, error handling, performance,
testing, and REST/OpenAPI conventions.

Made-with: Cursor
Following the pattern established by apache/airflow's AGENTS.md,
add the three remaining recommended sections: executable build/test
commands, PR/commit conventions, and explicit safety boundaries.

Made-with: Cursor
@RussellSpitzer RussellSpitzer marked this pull request as ready for review March 8, 2026 02:13
Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to add link to https://www.apache.org/legal/generative-tooling.html as the ASF is working on guidelines for the projects. There are several efforts at foundation level, so this page will probably evolve (and maybe define policies). So I think it's important to be aligned at project level.

Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the mention to ASF guidelines are already included in the corresponding Iceberg documentation.

@jbonofre
Copy link
Member

jbonofre commented Mar 8, 2026

I would suggest to add link to https://www.apache.org/legal/generative-tooling.html as the ASF is working on guidelines for the projects. There are several efforts at foundation level, so this page will probably evolve (and maybe define policies). So I think it's important to be aligned at project level.

Nevermind, already done in the corresponding Iceberg documentation.

AGENTS.md Outdated
- Compute expected values, don't hardcode. Tests belong in the module that owns the code.
- Write the most direct test for the bug. Parameterized tests for type variations.
- JUnit 5 + AssertJ: `@Test` (no `test` prefix), `assertThat`, `assertThatThrownBy`.
- `waitUntilAfter` for time-dependent tests. `boolean threw` for cleanup. Separate tests over combined.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "boolean threw" for cleanup means?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pattern of

Do a whole lot of stuff

boolean threw = false
try {
something
} Catch {
  cleanup try stuff
}
if (threw) {
   cleanup stuff from outside try block
}

I don't think we actually do this that often as I mentioned to @nastra , I think it's just overrepresented in our comments. I think we should just remove it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for removal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

AGENTS.md Outdated
- **Builder pattern**: For complex creation. Never require passing `null` for optional parameters.
- **Package-private by default**: Only make things public with demonstrated need.
- **Postel's Law**: Accept case-insensitive input, produce canonical output.
- **`boolean threw`**: `boolean threw = true; try { ...; threw = false; } finally { if (threw) cleanup(); }`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a common pattern? We only use it in a very small set of cases

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably just have lots of comments about it in TableOperations modifications so it probably is talked about alot even if it doesn't happen alot. This is based on comments not actual code usage

@ajantha-bhat
Copy link
Member

ajantha-bhat commented Mar 9, 2026

I also have my own local CLAUDE.MD generated recently.

I think we can add some more rules like below to AGENTS.md.
But I think this file will keep on evolving. So, we can do this in follow up PRs too.

Build/Test:
  - ./gradlew build (full build with tests)                                                                                                                                                  
  - ./gradlew spotlessApply -DallModules (format all Spark/Flink versions)                                                                                                                   
  - Multi-version testing: -DsparkVersions=3.4,3.5,4.0,4.1, -DflinkVersions=1.20,2.0,2.1, -DscalaVersion=2.13                                                                                
  - ./gradlew integrationTest (requires Docker)                                                                                                                                              
  - Test logs location: build/testlogs/ per module                                                                                                                                           
                                                                                                                                                                                             
  Code Style (enforced by tooling):                                                                                                                                                          
  - Google Java Format 1.22.0 via Spotless; Scalafmt 3.9.7 for Scala                                                                                                                         
  - Logger field must be named LOG (not log) — error-prone LoggerEnclosingClass check                                                                                                        
  - No wildcard imports (individual imports only); specific allowed static wildcards (Preconditions, Collections, Collectors, AssertJ, Spark functions)
  - Banned packages: repackaged/shaded Guava, sun.*, deprecated commons libs                                                                                                                 
  - Preconditions.checkNotNull() must include a message parameter                                                                                                                            
  - No C-style array declarations (int arr[] → int[] arr)                                                                                                                                    
  - One statement per line, no more than two consecutive blank lines                                                                                                                         
  - @Override required on all overriding methods                                                                                                                                             
  - Inner classes that can be static must be static (ClassCanBeStatic error-prone check)                                                                                                     
  - LF only line endings (no CRLF)                                                                                                                                                           
                                                                                                                                                                                             
  Architecture:                                                                                                                                                                              
  - Explicit module dependency hierarchy (API → Core → Format → Catalog → Cloud → Engine)                                                                                                    
  - iceberg-common and iceberg-data modules not mentioned                                                                                                                                    
  - Catalog abstraction trio: Catalog (tables), ViewCatalog (views), SupportsNamespaces (namespaces)
  - REST client chain: RESTCatalog → RESTSessionCatalog → HTTP client                                                                                                                        
  - FileIO/InputFile/OutputFile abstraction explained                                                                                                                                        
  - Expression system uses visitor pattern                                                                                                                                                   
                                                                                                                                                                                             
  Binary Compatibility:                                                                                                                                                                      
  - RevAPI checks core API modules (api, core, parquet, orc, common, data) against baseline in .palantir/revapi.yml                                                                          
                                                                                                                                                                                             
  Config Locations:
  - .baseline/checkstyle/checkstyle.xml and suppressions                                                                                                                                     
  - baseline.gradle for Spotless/error-prone config                                                                                                                                          
  - .baseline/copyright/copyright-header-java.txt for license template
  - project/scalastyle_config.xml for Scala                                                                                                                                                  
                                                                                                                                                                                             
  Naming:                                                                                                                                                                                    
  - Test classes must use Test prefix (TestNewFeature, NOT NewFeatureTest)                                                                                                                   
  - Type variables: single capital letter or CamelCase ending in T                                                                                                                           
  - Constants: UPPER_SNAKE_CASE           

Copy link

@zheliu2 zheliu2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think many people are using Claude today as well as Cursor, CLAUDE.md is a special filename that Claude Code automatically loads into context at the start of every conversation. Should we also create a file CLAUDE.md and add a one-line in CLAUDE.md that says "Read AGENTS.md first"

@RussellSpitzer
Copy link
Member Author

I think many people are using Claude today as well as Cursor, CLAUDE.md is a special filename that Claude Code automatically loads into context at the start of every conversation. Should we also create a file CLAUDE.md and add a one-line in CLAUDE.md that says "Read AGENTS.md first"

I don't think we can just use a vendor specific file name in this project. It's probably better to pressure Claude Code to support AGENTS or some more neutral standard. For individual users I think you can always make your own claude.md file and add it to a global .gitignore or something like that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants