Closed
…t/BIGr into ped_indels_update
Pull request overview
This PR merges updates from indels_support and check_ped_updated into development, expanding BIGr’s DArTag/MADC handling (sanity checks + additional metadata support), improving documentation, and enhancing some utilities (VCF filtering, concordance reporting).
Changes:
- Add `check_madc_sanity()` plus a basic test and documentation; integrate sanity checks into `madc2vcf_targets()`.
- Extend `madc2vcf_targets()` with `markers_info` to populate CHROM/POS/REF/ALT directly (incl. indel metadata).
- Update `check_ped()`, `filterVCF()`, and `imputation_concordance()` behaviors/docs (new options, new checks, additional outputs).
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| tests/testthat/test-check_madc_sanity.R | Adds a test file for check_madc_sanity(). |
| man/madc2vcf_targets.Rd | Expands madc2vcf_targets() documentation for new args and behavior. |
| man/check_ped.Rd | Updates check_ped() documentation to reflect new behavior/output. |
| man/check_madc_sanity.Rd | Adds generated Rd for new check_madc_sanity() function. |
| R/utils.R | Adjusts check_botloci() to rewrite AlleleID after padding changes. |
| R/madc2vcf_targets.R | Adds MADC sanity checks + markers_info branch for CHROM/POS/REF/ALT. |
| R/madc2vcf_all.R | Adds input validation + passes botloci into hap-seq helper; padding checks. |
| R/imputation_concordance.R | Adds plotting/printing options and ggplot2-based plot. |
| R/filterVCF.R | Adds optional pre-filter quality-rate outputs; refactors messages and export logic. |
| R/check_ped.R | Refactors pedigree checks and adds interactive/global-env save behavior. |
| R/check_madc_sanity.R | Implements new MADC sanity-check helper with messages and checks. |
| NEWS.md | Adds 0.6.3 release notes. |
| NAMESPACE | Exports check_madc_sanity. |
| DESCRIPTION | Bumps version/roxygen note; updates cph affiliation. |
| BIGr.Rproj | Adds ProjectId metadata. |
| .gitignore | Adds .DS_Store. |
Comments suppressed due to low confidence (1)
R/filterVCF.R:488
- There is a large block of commented-out experimental code after the end of `filterVCF()` (custom VCF reading tests, hard-coded local paths, etc.). Keeping this in the package source makes maintenance harder and bloats the file. Please remove it or move it to a vignette/dev script under `dev/` if it needs to be kept for reference.
#This is not reliable, so no longer use this shortcut to get dosage matrix
#test2 <- vcfR2genlight(vcf)
#####Testing custom VCF reading function######
# Open the gzipped VCF file
#con <- gzfile("/Users/ams866/Desktop/output.vcf", "rt")
# Read in the entire file
#lines <- readLines(con)
#close(con)
# Read in the entire file
#lines <- readLines("/Users/ams866/Desktop/output.vcf")
# Filter out lines that start with ##
#filtered_lines <- lines[!grepl("^##", lines)]
# Create a temporary file to write the filtered lines
#temp_file <- tempfile()
#writeLines(filtered_lines, temp_file)
# Read in the filtered data using read.table or read.csv
#vcf_data <- read.table(temp_file, header = TRUE, sep = "\t", comment.char = "", check.names = FALSE)
# Clean up the temporary file
#unlink(temp_file)
##Extract INFO column and Filter SNPs by those values
#Update the filtering options by the items present in the INFO column?
# Load required library
#library(dplyr)
# Split INFO column into key-value pairs
#vcf_data_parsed <- vcf_data %>%
# mutate(INFO_PARSED = strsplit(INFO, ";")) %>%
# unnest(INFO_PARSED) %>%
# separate(INFO_PARSED, into = c("KEY", "VALUE"), sep = "=") %>%
# spread(KEY, VALUE)
#Filter by DP
#filtered_vcf_data <- vcf_data_parsed %>%
# filter(as.numeric(DP) > 10)
# View the filtered dataframe
#print(filtered_vcf_data)
##Extracting and filtering by FORMAT column
# Identify the columns that are not sample columns
#non_sample_cols <- c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT")
# Identify the sample columns
#sample_cols <- setdiff(names(vcf_data), non_sample_cols)
# Extract FORMAT keys
#format_keys <- strsplit(as.character(vcf_data$FORMAT[1]), ":")[[1]]
# Split SAMPLE columns based on FORMAT
#vcf_data_samples <- vcf_data %>%
# mutate(across(all_of(sample_cols), ~strsplit(as.character(.), ":"))) %>%
# mutate(across(all_of(sample_cols), ~map(., ~setNames(as.list(.), format_keys)))) %>%
# unnest_wider(all_of(sample_cols), names_sep = "_")
# View the parsed dataframe
#print(head(vcf_data_samples))
# Create separate dataframes for each FORMAT variable
#format_dfs <- lapply(format_keys, function(format_key) {
# vcf_data_samples %>%
# select(ID, ends_with(paste0("_", format_key))) %>%
# column_to_rownames("ID")
#})
# Assign names to the list elements
#names(format_dfs) <- format_keys
# Access the separate dataframes
#gt_df <- format_dfs$GT # Genotype dataframe
#ad_df <- format_dfs$AD # Allelic depths dataframe
#*I think the above method is okay if you only need to filter at the INFO level,
#*But I think if you want to filter for FORMAT, that vcfR is probably best,
#*Will need to explore further if I can easily just filter for MPP by checking if it is above a
#*threshold, and then converting the GT and UD values to NA if so...
#*If that is efficient and works, then I will just use this custom VCF method...
Comment on lines +14 to +21
\value{
A list with:
\describe{
\item{checks}{Named logical vector with entries
\code{Columns}, \code{FixAlleleIDs}, \code{IUPACcodes}, \code{LowerCase}, \code{Indels}.}
\item{indel_clone_ids}{Character vector of \code{CloneID}s where ref/alt lengths differ.
Returns \code{character(0)} if none, or \code{NULL} when required columns are missing.}
}
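The documented return value distinguishes "no indels found" (`character(0)`) from "check could not run" (`NULL`). A minimal sketch of how a caller might branch on that difference follows. `describe_indels()` is a hypothetical helper, and the lists are hand-built mocks shaped like the documented value, not real output of `check_madc_sanity()`.

```r
# Hypothetical helper: summarise the indel part of a result shaped like the
# documented return value of check_madc_sanity().
describe_indels <- function(res) {
  ids <- res$indel_clone_ids
  if (is.null(ids)) {
    "indel check skipped (required columns missing)"
  } else if (length(ids) == 0) {
    "no indels detected"
  } else {
    paste0(length(ids), " indel marker(s): ", paste(ids, collapse = ", "))
  }
}

# Mock results; the CloneIDs are invented for illustration.
res_none    <- list(indel_clone_ids = character(0))
res_skipped <- list(indel_clone_ids = NULL)
res_some    <- list(indel_clone_ids = c("Chr01_000123", "Chr02_004567"))
```

Testing `is.null()` before `length()` matters here, because `length(NULL)` is also 0 and would otherwise hide the "skipped" case.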
Comment on lines +235 to +252
if(!all(rownames(ad_df)%in% df$BI_markerID))
  warning("Not all MADC CloneID was found in the markers_info file. These markers will be removed.")

matched <- df[match(rownames(ad_df), df$BI_markerID),]

new_df <- data.frame(
  CHROM = matched$Chr,
  POS = matched$Pos
)

#Get read count sums
new_df$TotalRef <- rowSums(ref_df)
new_df$TotalAlt <- rowSums(alt_df)
new_df$TotalSize <- rowSums(size_df)

ref_base <- matched$Ref
alt_base <- matched$Alt
}
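The warning in this hunk says unmatched markers "will be removed", but indexing with `match()` keeps them as all-`NA` rows. Whether they are dropped later in the function is not visible here. The sketch below uses invented toy data (only the object and column names come from the hunk) to show that behaviour and one possible way to drop unmatched rows while keeping the count matrix aligned.

```r
# Toy inputs; names mirror the hunk above but the data are invented.
ad_df <- matrix(1:6, nrow = 3,
                dimnames = list(c("m1", "m2", "m3"), c("s1", "s2")))
df <- data.frame(BI_markerID = c("m1", "m3"),
                 Chr = c("chr1", "chr2"),
                 Pos = c(100L, 250L),
                 stringsAsFactors = FALSE)

# match() returns NA for IDs absent from markers_info, so indexing with it
# yields an all-NA row instead of dropping the marker.
idx <- match(rownames(ad_df), df$BI_markerID)
matched <- df[idx, ]

# One way to actually remove unmatched markers: subset both objects with
# the same logical mask so rows stay aligned.
keep <- !is.na(idx)
ad_kept <- ad_df[keep, , drop = FALSE]
matched_kept <- df[idx[keep], ]
```

Applying the same `keep` mask to `ref_df`, `alt_df`, and `size_df` would be needed as well, so that the `rowSums()` columns line up with `CHROM` and `POS`.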
Comment on lines +232 to +234
# Silent automatic mode
assign(corrected_name, data, envir = .GlobalEnv)
assign(report_name, input_ped_report, envir = .GlobalEnv)
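Writing into `.GlobalEnv` from package code can silently overwrite user objects, and the CRAN repository policy asks packages not to modify the user's workspace. A possible alternative, sketched under the assumption that callers can capture a return value, is to hand both objects back in a list. `finish_check()` is a hypothetical stand-in, not part of BIGr.

```r
# Hypothetical stand-in for the end of check_ped(): return both objects
# instead of assigning them into .GlobalEnv.
finish_check <- function(data, input_ped_report) {
  invisible(list(corrected = data, report = input_ped_report))
}

# Invented pedigree and report, for illustration only.
out <- finish_check(
  data = data.frame(id = c("A", "B"), sire = c(NA, "A")),
  input_ped_report = list(missing_parents = character(0))
)
corrected_ped <- out$corrected
```

The caller then chooses the object names, and nothing is created in the workspace unless the result is assigned.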
report$CloneID <- paste0(sub("_(.*)", "", report$CloneID), "_",
  sprintf(paste0("%0", pad_botloci, "d"), as.integer(sub(".*_", "", report$CloneID)))
)
report$AlleleID <- paste0(report$CloneID, "|", sapply(strsplit(report$AlleleID, "[|]"), "[[",2))
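In this snippet the prefix is cut at the first underscore (`sub("_(.*)", "", ...)`), while the position is taken after the last one (`sub(".*_", "", ...)`). If a CloneID ever contains more than one underscore (an assumption; the IDs in real MADC files may never do so), the middle part is lost. The sketch below shows that effect and a variant that splits on the last underscore for both parts. `pad_clone_id()` is a hypothetical helper.

```r
# Hypothetical helper: re-pad the numeric suffix of a CloneID, splitting on
# the LAST underscore for both prefix and position.
pad_clone_id <- function(clone_id, width) {
  prefix <- sub("_[^_]*$", "", clone_id)        # everything before the last "_"
  pos <- as.integer(sub(".*_", "", clone_id))   # digits after the last "_"
  paste0(prefix, "_", sprintf(paste0("%0", width, "d"), pos))
}

ids <- c("Chr01_000123", "chr_01_000123")

# Pattern from the snippet above: cuts at the FIRST underscore.
first_cut <- sub("_(.*)", "", ids)

padded <- pad_clone_id(ids, 9)
```

Here `first_cut` is `"Chr01"` and `"chr"`, so the second ID would lose its `01`, whereas `padded` keeps it: `"Chr01_000000123"` and `"chr_01_000000123"`.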
Comment on lines +137 to +150
plot_df <- data.frame(
  ID = imputed_genos$ID,
  Concordance = percentage_match * 100
)

concordance_plot <- ggplot(plot_df,
                           aes(x = reorder(ID, Concordance),
                               y = Concordance)) +
  geom_bar(stat = "identity") +
  labs(title = "Imputation Concordance by Sample",
       x = "Sample ID",
       y = "Concordance (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
if (!is.null(output.file)) {
  output_name <- paste0(output.file, ".vcf.gz")
  cat("Exporting VCF\n")
  if (!class(vcf.file) == "vcfR"){
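Comparing `class(x)` to a string returns one logical per class entry, and an `if()` condition of length greater than one is an error in R >= 4.2. `inherits()` always returns a single value. The sketch below uses a mock S3 object rather than a real `vcfR` object, so it only illustrates the class test, not how `filterVCF()` handles its input.

```r
# Mock object with two classes; it stands in for any object whose class
# vector has more than one entry. It is not a real vcfR object.
tagged <- structure(list(), class = c("my_vcf", "vcfR"))

# Element-wise comparison: one result per class entry.
cmp <- class(tagged) == "vcfR"

# inherits() collapses the test to a single TRUE/FALSE.
is_vcf <- inherits(tagged, "vcfR")
not_vcf <- inherits("path/to/file.vcf", "vcfR")   # a plain character path
```

With this pattern the condition could be written as `if (!inherits(vcf.file, "vcfR"))`, which stays length one whatever the input type.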
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
`indels_support` and `check_ped_updated` merged and submitted to `development`.