Madc2vcf and Pedigree functions updates #54
Madc2vcf and Pedigree functions updates #54Cristianetaniguti wants to merge 30 commits intodevelopmentfrom
Conversation
…t/BIGr into ped_indels_update
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Madc2vcf updates
There was a problem hiding this comment.
Pull request overview
This PR enhances BIGr’s MADC→VCF conversion and pedigree utilities by adding structured verbose messaging, stronger MADC sanity validation, and new/refined arguments and documentation to support more robust input handling.
Changes:
- Added
vmsg()and integrated verbose, stepwise messaging intomadc2vcf_targets(),get_countsMADC(), andmadc2vcf_all(). - Introduced
check_madc_sanity()and integrated it into MADC→VCF workflows, plus expanded tests and roxygen docs. - Extended APIs (
madc2vcf_targets(),get_countsMADC(),madc2vcf_all(),imputation_concordance(),filterVCF(),check_ped()) with new parameters and updated manuals.
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 22 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/testthat/test-madc2vcf_targets.R | Expands MADC→VCF target tests using external PanelHub fixtures. |
| tests/testthat/test-check_madc_sanity.R | Adds tests for check_madc_sanity() using external fixtures. |
| man/vmsg.Rd | Documents new verbose message helper. |
| man/madc2vcf_targets.Rd | Updates targets conversion documentation for new args and behavior. |
| man/madc2vcf_all.Rd | Adds markers_info argument to docs. |
| man/imputation_concordance.Rd | Documents new plotting/printing options. |
| man/get_countsMADC.Rd | Updates docs for new args and behavior. |
| man/get_counts.Rd | Adds internal docs for new helper get_counts(). |
| man/filterVCF.Rd | Documents new quality.rates parameter and example edits. |
| man/check_ped.Rd | Updates pedigree check docs to reflect new behavior/output. |
| man/check_madc_sanity.Rd | Documents check_madc_sanity() checks and return structure. |
| R/utils.R | Adds vmsg() and url_exists(). |
| R/madc2vcf_targets.R | Refactors madc2vcf_targets() with sanity checks, markers_info support, new args, and verbose metadata header. |
| R/madc2vcf_all.R | Adds input validation, sanity checks, and markers_info support in validation flow. |
| R/imputation_concordance.R | Adds plot and print_result options and ggplot2-based plotting. |
| R/get_countsMADC.R | Adds madc_object, collapsing behavior, verbose messaging; introduces get_counts() helper. |
| R/filterVCF.R | Adds quality.rates reporting and adjusts I/O/filter logging. |
| R/check_ped.R | Refactors pedigree validation and introduces optional interactive/global save behavior. |
| R/check_madc_sanity.R | Adds new exported sanity checker and an updated check_botloci() implementation. |
| NEWS.md | Adds release notes for 0.6.4 and related changes. |
| NAMESPACE | Exports new functions and adds new imports. |
| DESCRIPTION | Updates authorship/copyright entry and roxygen note version. |
| BIGr.Rproj | Adds ProjectId. |
| .gitignore | Adds .DS_Store. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| test_that("ALFALFA — clean fixed allele ID MADC", { | ||
| out <- tempfile(fileext = ".vcf") | ||
| expect_no_error( |
There was a problem hiding this comment.
Multiple test_that() calls are nested inside the outer test_that("simu alfalfa", ...). testthat doesn’t reliably support nested test_that() and this can cause misreporting or skipped tests. Split the inner blocks into separate top-level test_that() calls (and share fixture setup via helper code).
| #This is not reliable, so no longer use this shortcut to get dosage matrix | ||
| #test2 <- vcfR2genlight(vcf) | ||
|
|
||
|
|
||
| #####Testing custom VCF reading function###### | ||
| # Open the gzipped VCF file | ||
| #con <- gzfile("/Users/ams866/Desktop/output.vcf", "rt") | ||
|
|
||
| # Read in the entire file | ||
| #lines <- readLines(con) | ||
| #close(con) | ||
| # Read in the entire file | ||
| #lines <- readLines("/Users/ams866/Desktop/output.vcf") | ||
| # Filter out lines that start with ## | ||
| #filtered_lines <- lines[!grepl("^##", lines)] | ||
| # Create a temporary file to write the filtered lines | ||
| #temp_file <- tempfile() | ||
| #writeLines(filtered_lines, temp_file) | ||
| # Read in the filtered data using read.table or read.csv | ||
| #vcf_data <- read.table(temp_file, header = TRUE, sep = "\t", comment.char = "", check.names = FALSE) | ||
| # Clean up the temporary file | ||
| #unlink(temp_file) | ||
|
|
||
| ##Extract INFO column and Filter SNPs by those values | ||
| #Update the filtering options by the items present in the INFO column? | ||
|
|
||
| # Load required library | ||
| #library(dplyr) | ||
|
|
||
| # Split INFO column into key-value pairs |
There was a problem hiding this comment.
There is a large block of commented-out experimental code (including local file paths) kept at the end of this exported source file. This makes maintenance harder and adds noise to the package. Consider moving it to a vignette/dev note or removing it once the approach is finalized.
| #This is not reliable, so no longer use this shortcut to get dosage matrix | |
| #test2 <- vcfR2genlight(vcf) | |
| #####Testing custom VCF reading function###### | |
| # Open the gzipped VCF file | |
| #con <- gzfile("/Users/ams866/Desktop/output.vcf", "rt") | |
| # Read in the entire file | |
| #lines <- readLines(con) | |
| #close(con) | |
| # Read in the entire file | |
| #lines <- readLines("/Users/ams866/Desktop/output.vcf") | |
| # Filter out lines that start with ## | |
| #filtered_lines <- lines[!grepl("^##", lines)] | |
| # Create a temporary file to write the filtered lines | |
| #temp_file <- tempfile() | |
| #writeLines(filtered_lines, temp_file) | |
| # Read in the filtered data using read.table or read.csv | |
| #vcf_data <- read.table(temp_file, header = TRUE, sep = "\t", comment.char = "", check.names = FALSE) | |
| # Clean up the temporary file | |
| #unlink(temp_file) | |
| ##Extract INFO column and Filter SNPs by those values | |
| #Update the filtering options by the items present in the INFO column? | |
| # Load required library | |
| #library(dplyr) | |
| # Split INFO column into key-value pairs |
| #' @examples | ||
| #' result <- imputation_concordance( | ||
| #' reference_genos = ref, | ||
| #' imputed_genos = test, | ||
| #' snps_2_exclude = snps, |
There was a problem hiding this comment.
The roxygen examples reference ref, test, and snps, but these objects aren’t defined in the example block. This will fail during R CMD check example execution. Make the example self-contained or wrap it in \dontrun{} / \donttest{}.
| result <- imputation_concordance( | ||
| reference_genos = ref, | ||
| imputed_genos = test, | ||
| snps_2_exclude = snps, | ||
| missing_code = 5, | ||
| verbose = TRUE, | ||
| plot = TRUE | ||
| ) | ||
|
|
There was a problem hiding this comment.
The Rd example calls imputation_concordance(reference_genos = ref, imputed_genos = test, ...) but ref, test, and snps are undefined in the example section. This will fail R CMD check. Provide a runnable minimal example or guard with \dontrun{} / \donttest{}.
| result <- imputation_concordance( | |
| reference_genos = ref, | |
| imputed_genos = test, | |
| snps_2_exclude = snps, | |
| missing_code = 5, | |
| verbose = TRUE, | |
| plot = TRUE | |
| ) | |
| # Minimal example data | |
| ref <- data.frame( | |
| ID = paste0("sample", 1:3), | |
| snp1 = c(0, 1, 2), | |
| snp2 = c(2, 2, 0) | |
| ) | |
| test <- data.frame( | |
| ID = paste0("sample", 1:3), | |
| snp1 = c(0, 1, 2), | |
| snp2 = c(2, 1, 0) | |
| ) | |
| # Exclude a SNP from the concordance calculation | |
| snps <- "snp2" | |
| result <- imputation_concordance( | |
| reference_genos = ref, | |
| imputed_genos = test, | |
| snps_2_exclude = snps, | |
| missing_code = NULL, | |
| verbose = TRUE, | |
| plot = FALSE | |
| ) |
| first <- all(grepl("^[A-Za-z]", sapply(pos, "[", 1))) | ||
| second <- suppressWarnings(all(sapply(pos, function(x) as.numeric(x[2])) > 0)) |
There was a problem hiding this comment.
ChromPos validation doesn’t match the documented Chr_Pos requirement: it currently only checks the first token starts with any letter and can yield NA when as.numeric() returns NAs (because all() over NAs yields NA). This can force downstream code into the "ChromPos invalid" path unexpectedly. Tighten the prefix check (e.g., ^chr case-insensitive) and coerce non-numeric positions to FALSE (not NA).
| first <- all(grepl("^[A-Za-z]", sapply(pos, "[", 1))) | |
| second <- suppressWarnings(all(sapply(pos, function(x) as.numeric(x[2])) > 0)) | |
| first_tokens <- sapply(pos, "[", 1) | |
| second_tokens <- sapply(pos, "[", 2) | |
| first <- all(grepl("^chr", first_tokens, ignore.case = TRUE)) | |
| suppressWarnings(num_pos <- as.numeric(second_tokens)) | |
| second <- all(!is.na(num_pos) & num_pos > 0) |
| #### interactive save #### | ||
| cat(paste0("\nDo you want to save the corrected pedigree as dataframe `", corrected_name, "`? (y/n): ")) | ||
| ans <- tolower(trimws(readline())) | ||
| if (ans == "y") { |
There was a problem hiding this comment.
check_ped() prompts with readline() when verbose = TRUE (the default). This will hang in non-interactive runs (CI, scripts, Shiny). Gate prompting behind interactive() and/or add an explicit argument controlling whether to prompt/save.
| "verbose= ", verbose,')">') | ||
|
|
||
| if(!is.null(madc)) report <- read.csv(madc, check.names = FALSE) else stop("Please provide a MADC file") | ||
| report <- read.csv(madc, check.names = FALSE) | ||
| checks <- check_madc_sanity(report) |
There was a problem hiding this comment.
madc still defaults to NULL, but the function now calls read.csv(madc, ...) unconditionally. If madc is omitted, this will fail with a cryptic read.csv error instead of a clear message. Add an explicit is.null(madc) check that stop()s with a helpful error before reading.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Updates on madc2vcf functions
Details:
verbose = TRUE, the functions output informative messages along the processcheck_madc_sanityfunction implemented. It tests:madc2vcf_targetsdoesn’t run if:madc2vcf_allin case they want to extract them as wellmadc2vcf_targetsgot a new argument:collapse_matches_counts, if TRUE, it collapses the read counts of the RefMatch to Ref and AltMatch to Alt. Default is FALSE.Still in progress:
Users now have the option to generate multiallelic VCF - new function
madc2vcf_multimadc2vcf_allandmadc2vcf_multidoesn’t run if:See the table for
madc2vcf_allandmadc2vcf_multirequirements accordingly to MADC content: