Skip to content

Madc2vcf and Pedigree functions updates #54

Open
Cristianetaniguti wants to merge 30 commits intodevelopmentfrom
ped_indels_update
Open

Madc2vcf and Pedigree functions updates #54
Cristianetaniguti wants to merge 30 commits intodevelopmentfrom
ped_indels_update

Conversation

@Cristianetaniguti
Copy link
Copy Markdown
Collaborator

@Cristianetaniguti Cristianetaniguti commented Mar 27, 2026

Updates on madc2vcf functions

Details:

  • If verbose = TRUE, the functions output informative messages along the process
  • both functions (targets and all (targets + off-targets) markers now have check_madc_sanity function implemented. It tests:
    • [Columns] If MADC has the expected columns
    • [allNArow | allNAcol] Presence of columns and rows with all NA (happens often when people open the MADC in excel before loading in R)
    • [IUPACcodes] Presence of IUPAC codes on AlleleSequence
    • [LowerCase] Presence of lower case bases on AlleleSequence
    • [Indels] Presence of Indels
    • [ChromPos] If CloneID follows the format Chr_Pos
    • [RefAltSeqs] If all Ref Allele has corresponding Alt and vice-versa
    • [OtherAlleles] If "Other" exists in the MADC AlleleID
  • madc2vcf_targets doesn’t run if:
    • MADC Column names are not correct
    • Ignore Other alleles - but inform the user if they exist or not and direct them to madc2vcf_all in case they want to extract them as well
  • See the table for madc2vcf_targets requirements accordingly to MADC content:
  check status get_REF_ALT Requires
IUPAC TRUE TRUE markers_info REF/ALT
  TRUE FALSE -
  FALSE TRUE botloci or markers_info REF/ALT
  FALSE FALSE -
Indels TRUE TRUE markers_info REF/ALT
  TRUE FALSE -
  FALSE TRUE botloci or markers_info REF/ALT
  FALSE FALSE -
ChromPos TRUE TRUE botloci or markers_info REF/ALT
  TRUE FALSE -
  FALSE TRUE markers_info CHR/POS/REF/ALT or markers_info CHR/POS/ + botloci
  FALSE FALSE markers_info CHR/POS
FixAlleleIDs TRUE TRUE botloci or markers_info REF/ALT
  TRUE FALSE -
  FALSE TRUE markers_info REF/ALT
  FALSE FALSE -
  • madc2vcf_targets got a new argument: collapse_matches_counts, if TRUE, it collapses the read counts of the RefMatch to Ref and AltMatch to Alt. Default is FALSE.

Still in progress:

  • Users now have the option to generate multiallelic VCF - new function madc2vcf_multi

  • madc2vcf_all and madc2vcf_multi doesn’t run if:

    • MADC Column names are not correct
    • If it is raw MADC
    • If it has IUPAC codes
  • See the table for madc2vcf_all and madc2vcf_multi requirements accordingly to MADC content:

  Check status Requires
Indels TRUE markers_info REF/ALT/IndelPos/IndelLenght + botloci
  FALSE botloci
ChromPos TRUE botloci
  FALSE markers_info CHR/POS + botloci
RefAltSeqs TRUE botloci
  FALSE botloci + microhapdb
 

Cristianetaniguti and others added 22 commits October 3, 2025 15:11
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances BIGr’s MADC→VCF conversion and pedigree utilities by adding structured verbose messaging, stronger MADC sanity validation, and new/refined arguments and documentation to support more robust input handling.

Changes:

  • Added vmsg() and integrated verbose, stepwise messaging into madc2vcf_targets(), get_countsMADC(), and madc2vcf_all().
  • Introduced check_madc_sanity() and integrated it into MADC→VCF workflows, plus expanded tests and roxygen docs.
  • Extended APIs (madc2vcf_targets(), get_countsMADC(), madc2vcf_all(), imputation_concordance(), filterVCF(), check_ped()) with new parameters and updated manuals.

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 22 comments.

Show a summary per file
File Description
tests/testthat/test-madc2vcf_targets.R Expands MADC→VCF target tests using external PanelHub fixtures.
tests/testthat/test-check_madc_sanity.R Adds tests for check_madc_sanity() using external fixtures.
man/vmsg.Rd Documents new verbose message helper.
man/madc2vcf_targets.Rd Updates targets conversion documentation for new args and behavior.
man/madc2vcf_all.Rd Adds markers_info argument to docs.
man/imputation_concordance.Rd Documents new plotting/printing options.
man/get_countsMADC.Rd Updates docs for new args and behavior.
man/get_counts.Rd Adds internal docs for new helper get_counts().
man/filterVCF.Rd Documents new quality.rates parameter and example edits.
man/check_ped.Rd Updates pedigree check docs to reflect new behavior/output.
man/check_madc_sanity.Rd Documents check_madc_sanity() checks and return structure.
R/utils.R Adds vmsg() and url_exists().
R/madc2vcf_targets.R Refactors madc2vcf_targets() with sanity checks, markers_info support, new args, and verbose metadata header.
R/madc2vcf_all.R Adds input validation, sanity checks, and markers_info support in validation flow.
R/imputation_concordance.R Adds plot and print_result options and ggplot2-based plotting.
R/get_countsMADC.R Adds madc_object, collapsing behavior, verbose messaging; introduces get_counts() helper.
R/filterVCF.R Adds quality.rates reporting and adjusts I/O/filter logging.
R/check_ped.R Refactors pedigree validation and introduces optional interactive/global save behavior.
R/check_madc_sanity.R Adds new exported sanity checker and an updated check_botloci() implementation.
NEWS.md Adds release notes for 0.6.4 and related changes.
NAMESPACE Exports new functions and adds new imports.
DESCRIPTION Updates authorship/copyright entry and roxygen note version.
BIGr.Rproj Adds ProjectId.
.gitignore Adds .DS_Store.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +121 to +123
test_that("ALFALFA — clean fixed allele ID MADC", {
out <- tempfile(fileext = ".vcf")
expect_no_error(
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple test_that() calls are nested inside the outer test_that("simu alfalfa", ...). testthat doesn’t reliably support nested test_that() and this can cause misreporting or skipped tests. Split the inner blocks into separate top-level test_that() calls (and share fixture setup via helper code).

Copilot uses AI. Check for mistakes.
Comment on lines +410 to +439
#This is not reliable, so no longer use this shortcut to get dosage matrix
#test2 <- vcfR2genlight(vcf)


#####Testing custom VCF reading function######
# Open the gzipped VCF file
#con <- gzfile("/Users/ams866/Desktop/output.vcf", "rt")

# Read in the entire file
#lines <- readLines(con)
#close(con)
# Read in the entire file
#lines <- readLines("/Users/ams866/Desktop/output.vcf")
# Filter out lines that start with ##
#filtered_lines <- lines[!grepl("^##", lines)]
# Create a temporary file to write the filtered lines
#temp_file <- tempfile()
#writeLines(filtered_lines, temp_file)
# Read in the filtered data using read.table or read.csv
#vcf_data <- read.table(temp_file, header = TRUE, sep = "\t", comment.char = "", check.names = FALSE)
# Clean up the temporary file
#unlink(temp_file)

##Extract INFO column and Filter SNPs by those values
#Update the filtering options by the items present in the INFO column?

# Load required library
#library(dplyr)

# Split INFO column into key-value pairs
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a large block of commented-out experimental code (including local file paths) kept at the end of this exported source file. This makes maintenance harder and adds noise to the package. Consider moving it to a vignette/dev note or removing it once the approach is finalized.

Suggested change
#This is not reliable, so no longer use this shortcut to get dosage matrix
#test2 <- vcfR2genlight(vcf)
#####Testing custom VCF reading function######
# Open the gzipped VCF file
#con <- gzfile("/Users/ams866/Desktop/output.vcf", "rt")
# Read in the entire file
#lines <- readLines(con)
#close(con)
# Read in the entire file
#lines <- readLines("/Users/ams866/Desktop/output.vcf")
# Filter out lines that start with ##
#filtered_lines <- lines[!grepl("^##", lines)]
# Create a temporary file to write the filtered lines
#temp_file <- tempfile()
#writeLines(filtered_lines, temp_file)
# Read in the filtered data using read.table or read.csv
#vcf_data <- read.table(temp_file, header = TRUE, sep = "\t", comment.char = "", check.names = FALSE)
# Clean up the temporary file
#unlink(temp_file)
##Extract INFO column and Filter SNPs by those values
#Update the filtering options by the items present in the INFO column?
# Load required library
#library(dplyr)
# Split INFO column into key-value pairs

Copilot uses AI. Check for mistakes.
Comment on lines +52 to +56
#' @examples
#' result <- imputation_concordance(
#' reference_genos = ref,
#' imputed_genos = test,
#' snps_2_exclude = snps,
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The roxygen examples reference ref, test, and snps, but these objects aren’t defined in the example block. This will fail during R CMD check example execution. Make the example self-contained or wrap it in \dontrun{} / \donttest{}.

Copilot uses AI. Check for mistakes.
Comment on lines +67 to 75
result <- imputation_concordance(
reference_genos = ref,
imputed_genos = test,
snps_2_exclude = snps,
missing_code = 5,
verbose = TRUE,
plot = TRUE
)

Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Rd example calls imputation_concordance(reference_genos = ref, imputed_genos = test, ...) but ref, test, and snps are undefined in the example section. This will fail R CMD check. Provide a runnable minimal example or guard with \dontrun{} / \donttest{}.

Suggested change
result <- imputation_concordance(
reference_genos = ref,
imputed_genos = test,
snps_2_exclude = snps,
missing_code = 5,
verbose = TRUE,
plot = TRUE
)
# Minimal example data
ref <- data.frame(
ID = paste0("sample", 1:3),
snp1 = c(0, 1, 2),
snp2 = c(2, 2, 0)
)
test <- data.frame(
ID = paste0("sample", 1:3),
snp1 = c(0, 1, 2),
snp2 = c(2, 1, 0)
)
# Exclude a SNP from the concordance calculation
snps <- "snp2"
result <- imputation_concordance(
reference_genos = ref,
imputed_genos = test,
snps_2_exclude = snps,
missing_code = NULL,
verbose = TRUE,
plot = FALSE
)

Copilot uses AI. Check for mistakes.
Comment on lines +156 to +157
first <- all(grepl("^[A-Za-z]", sapply(pos, "[", 1)))
second <- suppressWarnings(all(sapply(pos, function(x) as.numeric(x[2])) > 0))
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChromPos validation doesn’t match the documented Chr_Pos requirement: it currently only checks the first token starts with any letter and can yield NA when as.numeric() returns NAs (because all() over NAs yields NA). This can force downstream code into the "ChromPos invalid" path unexpectedly. Tighten the prefix check (e.g., ^chr case-insensitive) and coerce non-numeric positions to FALSE (not NA).

Suggested change
first <- all(grepl("^[A-Za-z]", sapply(pos, "[", 1)))
second <- suppressWarnings(all(sapply(pos, function(x) as.numeric(x[2])) > 0))
first_tokens <- sapply(pos, "[", 1)
second_tokens <- sapply(pos, "[", 2)
first <- all(grepl("^chr", first_tokens, ignore.case = TRUE))
suppressWarnings(num_pos <- as.numeric(second_tokens))
second <- all(!is.na(num_pos) & num_pos > 0)

Copilot uses AI. Check for mistakes.
Comment on lines +220 to +223
#### interactive save ####
cat(paste0("\nDo you want to save the corrected pedigree as dataframe `", corrected_name, "`? (y/n): "))
ans <- tolower(trimws(readline()))
if (ans == "y") {
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_ped() prompts with readline() when verbose = TRUE (the default). This will hang in non-interactive runs (CI, scripts, Shiny). Gate prompting behind interactive() and/or add an explicit argument controlling whether to prompt/save.

Copilot uses AI. Check for mistakes.
Comment on lines 99 to +102
"verbose= ", verbose,')">')

if(!is.null(madc)) report <- read.csv(madc, check.names = FALSE) else stop("Please provide a MADC file")
report <- read.csv(madc, check.names = FALSE)
checks <- check_madc_sanity(report)
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

madc still defaults to NULL, but the function now calls read.csv(madc, ...) unconditionally. If madc is omitted, this will fail with a cryptic read.csv error instead of a clear message. Add an explicit is.null(madc) check that stop()s with a helpful error before reading.

Copilot uses AI. Check for mistakes.
Cristianetaniguti and others added 5 commits March 27, 2026 10:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants