Feature Request: Add getCodelistFromConceptSet() function for direct database query of concept sets #248
Description
Currently, CodelistGenerator provides codesFromConceptSet() and codesFromCohort() functions that extract codelists from JSON files containing concept set expressions or cohort definitions.
We propose extending this support to concept sets stored in the database. In some OMOP CDM setups (such as those managed by IOMED), concept sets are stored directly in database tables (concept_set, concept_set_item, etc.) within the same database instance as the analysis data.
This feature request proposes adding a new function, getCodelistFromConceptSet(), that queries these database tables directly to build formal codelist objects, similar to how other functions in the package query vocabulary tables directly (e.g., getDrugIngredientCodes(), getICD10StandardCodes()).
Rationale
• Cleaner workflow: Eliminates the need to export/import JSON files when concept sets are already stored natively in the database.
• Consistency: Aligns with the package's philosophy of direct database queries for vocabulary-based codelists.
• Tested workflow: At IOMED, we maintain concept sets in dedicated database tables within the OMOP instance, allowing for streamlined querying without intermediate file handling.
• Efficiency: Reduces overhead of JSON parsing and file I/O when database access is already available.
Proposed Database Schema
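The function would work with the standard OMOP CDM vocabulary tables (concept, concept_class, domain, vocabulary) plus a small extension of two tables, concept_set and concept_set_item: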
erDiagram
    concept_set ||--o{ concept_set_item : "has items"
    concept ||--o{ concept_set_item : "is included in"
    concept_set {
        int concept_set_id PK
        text concept_set_name
    }
    concept {
        int concept_id PK
        varchar concept_name
        varchar domain_id
        varchar vocabulary_id
        varchar concept_class_id
        varchar standard_concept
        varchar concept_code
        date valid_start_date
        date valid_end_date
        varchar invalid_reason
    }
    concept_set_item {
        int concept_set_id PK,FK
        int concept_id PK,FK
    }
    concept_class ||--o{ concept : "classifies"
    domain ||--o{ concept : "belongs to"
    vocabulary ||--o{ concept : "from"
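For anyone who wants to try the extension locally, the two extra tables can be sketched as follows. This is only an illustration under assumptions: it uses an in-memory DuckDB database through DBI, the column types are read off the diagram above, and the foreign key to concept is left out because the vocabulary tables are not created here. The actual DDL used at IOMED may differ.

```r
# Sketch only: create the two extension tables in an in-memory DuckDB database.
# Backend, types and constraints are assumptions based on the ER diagram.
con <- DBI::dbConnect(duckdb::duckdb())

DBI::dbExecute(con, "
  CREATE TABLE concept_set (
    concept_set_id   INTEGER PRIMARY KEY,
    concept_set_name TEXT
  )")

# The foreign key to concept(concept_id) is omitted because the OMOP
# vocabulary tables are not part of this toy database.
DBI::dbExecute(con, "
  CREATE TABLE concept_set_item (
    concept_set_id INTEGER,
    concept_id     INTEGER,
    PRIMARY KEY (concept_set_id, concept_id),
    FOREIGN KEY (concept_set_id) REFERENCES concept_set (concept_set_id)
  )")

# Keep the table names for inspection, then close the connection.
tables <- DBI::dbListTables(con)
DBI::dbDisconnect(con, shutdown = TRUE)
```

The composite primary key on concept_set_item means each concept can appear at most once per concept set, so the unique() call in the proposed function would only matter for databases that do not enforce this constraint.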
Proposed Function Signature and Implementation
See OmopHelpers for the full implementation.
getCodelistFromConceptSet <- function(conceptSetId, con, cdmSchema) {
  # Point to the required tables in the database
  concept_set_tbl <- dplyr::tbl(con, dbplyr::in_schema(cdmSchema, "concept_set"))
  concept_set_item_tbl <- dplyr::tbl(con, dbplyr::in_schema(cdmSchema, "concept_set_item"))

  # Retrieve the name of the concept set to use as the codelist name
  codelistName <- concept_set_tbl |>
    dplyr::filter(.data$concept_set_id == conceptSetId) |>
    dplyr::pull("concept_set_name") |>
    unique()

  # Error handling: check if the concept set ID was found
  if (length(codelistName) == 0) {
    stop(glue::glue("No concept set found for concept_set_id: {conceptSetId}"))
  }

  # Warning if multiple names exist for the same ID
  if (length(codelistName) > 1) {
    warning(glue::glue(
      "Multiple names found for concept_set_id: {conceptSetId}. ",
      "Using the first one: '{codelistName[1]}'"
    ))
    codelistName <- codelistName[1]
  }
  codelistName <- clean_name(codelistName)

  # Retrieve all unique concept IDs associated with the concept set ID
  concept_ids <- concept_set_item_tbl |>
    dplyr::filter(.data$concept_set_id == conceptSetId) |>
    dplyr::pull("concept_id") |>
    unique()

  # Create a named list structure required by newCodelist
  codelist <- list(concept_ids) |>
    magrittr::set_names(codelistName)

  # Return the formal, validated codelist object
  return(omopgenerics::newCodelist(codelist))
}
Implementation Details
The function would:
- Query concept_set table: Retrieve the concept_set_name for the given conceptSetId to use as the codelist name.
- Query concept_set_item table: Get all associated concept_ids for the concept set.
- Name cleaning: Apply name standardization (e.g., via a clean_name() helper function).
- Codelist creation: Build a named list and return an omopgenerics::newCodelist object.
- Error handling: Validate that the concept set exists and handle edge cases like multiple names.
Dependencies
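• Requires omopgenerics package for newCodelist()
• Uses dplyr and dbplyr (in_schema()) for database operations
• Uses glue for error and warning messages and magrittr::set_names() for naming the list
• Assumes clean_name() helper function (could be added or use existing package utilities)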
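Since clean_name() is not defined in this issue, here is one possible base-R sketch of what it could do. The name, the snake_case convention and the treatment of non-ASCII characters are all assumptions made for illustration; an existing utility in the package may well be a better fit.

```r
# Hypothetical helper (not part of CodelistGenerator): turn a free-text
# concept set name into a lower-case, snake_case identifier.
clean_name <- function(x) {
  x <- tolower(trimws(x))
  # Replace every run of characters other than a-z and 0-9 with one underscore.
  # Note that accented letters are replaced too.
  x <- gsub("[^a-z0-9]+", "_", x)
  # Drop underscores left at the start or end of the name.
  gsub("^_+|_+$", "", x)
}

clean_name("Type 2 Diabetes (broad)")
# returns "type_2_diabetes_broad"
```

A stricter version could also reject empty results (for example a name made only of punctuation), since an empty string would not be a usable codelist name.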
Related Functions
• codesFromConceptSet(): Current JSON-based approach
• getDrugIngredientCodes(): Similar direct database querying pattern
• getICD10StandardCodes(): Another vocabulary table query function
Testing Considerations
• Unit tests with mock database containing concept_set tables
• Integration tests with real OMOP CDM databases
• Edge case testing (missing concept sets, empty results, etc.)
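As a starting point for the unit tests, a mock database could look roughly like the sketch below. It assumes testthat, DBI, duckdb, dplyr and dbplyr are available, uses DuckDB's default "main" schema, and uses made-up concept IDs. mockConceptSetDb() is a hypothetical helper name, and the test exercises only the item query from the proposed function rather than the function itself.

```r
library(testthat)

# Hypothetical helper: in-memory DuckDB database holding only the two
# extension tables. Concept IDs are arbitrary; one row is duplicated on
# purpose to exercise the unique() step.
mockConceptSetDb <- function() {
  con <- DBI::dbConnect(duckdb::duckdb())
  DBI::dbWriteTable(con, "concept_set", data.frame(
    concept_set_id = 1L,
    concept_set_name = "Type 2 diabetes"
  ))
  DBI::dbWriteTable(con, "concept_set_item", data.frame(
    concept_set_id = c(1L, 1L, 1L),
    concept_id = c(1001L, 1002L, 1001L)
  ))
  con
}

test_that("concept set items can be read from the mock database", {
  con <- mockConceptSetDb()
  items <- dplyr::tbl(con, dbplyr::in_schema("main", "concept_set_item"))

  # Same query pattern as in the proposed function
  conceptSetId <- 1L
  ids <- items |>
    dplyr::filter(.data$concept_set_id == conceptSetId) |>
    dplyr::pull("concept_id") |>
    unique()
  expect_setequal(ids, c(1001L, 1002L))

  # Edge case: an unknown concept set ID yields no concept IDs
  none <- items |>
    dplyr::filter(.data$concept_set_id == 999L) |>
    dplyr::pull("concept_id")
  expect_length(none, 0)

  DBI::dbDisconnect(con, shutdown = TRUE)
})
```

Once getCodelistFromConceptSet() and clean_name() exist in the package, the same mock could be reused to check the returned codelist name and the error raised for a missing concept set.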