GAMA: Genomic Availability & Metadata Analysis Tool
⚠️ Development Status: Active / Early Stage
GAMA is currently in early development. Interfaces, outputs, and scoring methods may change.
Users are encouraged to validate results independently and report issues.
GAMA is an R-based framework for surveying publicly available sequencing data across NCBI Assembly, SRA, and BioSample. Its aim is to support feasibility assessments for in silico research on underutilised plant species.
GAMA:
- Unifies NCBI database searches
- Computes a data richness score
- Classifies SRA accessions by experimental modality
- Enables strategic parsing of Assembly and SRA results
- Generates publication-ready visuals
Install the development version from GitHub using pak:
install.packages('pak')
pak::pak('JLewis-dev/GAMA')
library(GAMA)
To improve rate limits and ensure responsible use of NCBI services:
options(ENTREZ_EMAIL = 'your.email@example.com')
#rentrez::set_entrez_key('YOUR_API_KEY')
Uncomment and add your API key if you have one.
RESULTS <- query_species(c('Vigna angularis', 'Vigna vexillata'))
SUMMARY <- summarise_availability(RESULTS)
print(SUMMARY)
plot_availability(SUMMARY)
META <- summarise_sra_availability(RESULTS)
print(META)
plot_sra_availability(META)
ASM <- extract_assembly_metadata(RESULTS, best = TRUE)
print(ASM)
SRA <- extract_sra_metadata(RESULTS, species = 'Vigna vexillata', class = 'genomic')
print(SRA)
citation('GAMA')
GAMA includes built-in plotting functions for rapid assessment.
plot_availability() produces stacked bar plots showing:
- Assembly contribution
- SRA contribution
- BioSample contribution
- Overall data richness score
It supports custom ranking, colour palettes, and ggplot2 theming.
plot_sra_availability() visualises:
- Relative abundance of sequencing strategies
- Ontology-classified experiment types
- Cross-species comparisons
Optional GEO overlays and experimental distribution summaries are available via additional functions.
Queries automatically record:
- Tool version
- Timestamp
- Database sources
This metadata is embedded in outputs.
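As a rough illustration of what embedding this kind of provenance can look like in R, the sketch below stores the three recorded fields as an attribute on a result object. The helper name `attach_provenance`, the attribute name `"provenance"`, and the placeholder version string are assumptions made for this example; how GAMA itself stores the metadata is not described here.

```r
# Hypothetical helper (not part of GAMA): attach provenance to any R object
# as a list-valued attribute.
attach_provenance <- function(x, tool_version, sources) {
  attr(x, "provenance") <- list(
    tool_version = tool_version,
    # UTC timestamp in ISO 8601 form
    timestamp = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
    sources = sources
  )
  x
}

# Placeholder result table and placeholder version string.
res <- data.frame(species = c("Vigna angularis", "Vigna vexillata"))
res <- attach_provenance(res, "x.y.z", c("Assembly", "SRA", "BioSample"))

# Inspect the embedded provenance.
str(attr(res, "provenance"))
```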
The data richness score is defined as:
Score = A + S + B
Where A, S, and B are the transformed contributions of Assembly, SRA, and BioSample accession counts.
A = best + ln(1 + total − best), with assemblies weighted as:
- Complete = 10
- Chromosome = 8
- Scaffold = 5
- Contig = 2
Here, best is the weight of the highest-weighted assembly (ties broken by highest N50) and total is the sum of the weights of all assemblies.
S = 2·ln(1 + SRA)
B = ln(1 + BioSample)
This formulation prioritises high-quality assemblies while incorporating diminishing returns for extensively sampled taxa.
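To make the formula concrete, here is a minimal standalone sketch that recomputes the score from a vector of assembly levels and two accession counts. The function name `richness_score` and its arguments are hypothetical and exist only for this illustration; this is not GAMA's internal code. The sketch assumes that a taxon with no assemblies contributes A = 0, which is what the formula gives when best and total are both zero.

```r
# Illustrative re-implementation of the scoring formula (not GAMA's own code).
# Assembly-level weights as listed above.
assembly_weights <- c(Complete = 10, Chromosome = 8, Scaffold = 5, Contig = 2)

# levels:       character vector of assembly levels, one entry per assembly
# n_sra:        number of SRA accessions
# n_biosample:  number of BioSample accessions
richness_score <- function(levels, n_sra, n_biosample) {
  w <- unname(assembly_weights[levels])
  if (anyNA(w)) stop("unknown assembly level")
  if (length(w) == 0) {
    a <- 0  # assumption: no assemblies means no Assembly contribution
  } else {
    best <- max(w)                      # weight of the best assembly
    total <- sum(w)                     # sum of all assembly weights
    a <- best + log(1 + total - best)   # log() is the natural logarithm
  }
  s <- 2 * log(1 + n_sra)
  b <- log(1 + n_biosample)
  c(A = a, S = s, B = b, Score = a + s + b)
}

# Made-up counts: one chromosome-level, one scaffold-level and one
# contig-level assembly, 20 SRA accessions, 10 BioSample accessions.
example <- richness_score(c("Chromosome", "Scaffold", "Contig"),
                          n_sra = 20, n_biosample = 10)
print(round(example, 3))
```

In this made-up case the best assembly contributes its full weight of 8, while the remaining two assemblies (combined weight 7) add only ln(8), which is the diminishing-returns behaviour described above.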
SRA experiments are classified into the following classes, using an ontology derived from large-scale metadata mining and manual curation:
- WGS
- Amplicon-seq
- RAD-seq
- Targeted-Capture
- Clone-based
- RNA-seq
- small-RNA
- Long-read
- Bisulfite-seq
- ChIP-seq
- CUT&RUN
- CUT&Tag
- ATAC-seq
- DNase-seq
- FAIRE-seq
- MNase-seq
- SELEX
- Hi-C
- 3C-based
- ChIA-PET
- TCC
- Other
Fallback rules are applied when primary metadata fields are missing or ambiguous.
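The general pattern of a primary lookup followed by fallback rules can be sketched as below. This is only an illustration: the partial mapping from the SRA `library_strategy` field, the title-keyword fallbacks, and the function name `classify_experiment` are all assumptions made for this example, and GAMA's actual ontology and rules may differ.

```r
# Assumed partial mapping from SRA library_strategy values to class labels.
strategy_map <- c(
  "WGS"           = "WGS",
  "AMPLICON"      = "Amplicon-seq",
  "RAD-Seq"       = "RAD-seq",
  "RNA-Seq"       = "RNA-seq",
  "Bisulfite-Seq" = "Bisulfite-seq",
  "ChIP-Seq"      = "ChIP-seq",
  "ATAC-seq"      = "ATAC-seq",
  "Hi-C"          = "Hi-C"
)

# Hypothetical classifier: direct lookup first, then keyword fallbacks on the
# experiment title, then the catch-all class "Other".
classify_experiment <- function(library_strategy,
                                title = rep(NA_character_, length(library_strategy))) {
  # Primary rule: unmapped or missing strategies yield NA.
  cls <- unname(strategy_map[library_strategy])
  # Fallback rules: only applied where the primary rule gave no class.
  is_tag <- grepl("cut[[:space:]]*&[[:space:]]*tag", title, ignore.case = TRUE)
  cls[is.na(cls) & is_tag] <- "CUT&Tag"
  is_run <- grepl("cut[[:space:]]*&[[:space:]]*run", title, ignore.case = TRUE)
  cls[is.na(cls) & is_run] <- "CUT&RUN"
  # Final fallback.
  cls[is.na(cls)] <- "Other"
  cls
}

# Made-up records for illustration.
classes <- classify_experiment(
  library_strategy = c("RNA-Seq", "OTHER", NA, "OTHER"),
  title = c("leaf transcriptome", "H3K27me3 CUT&Tag in root",
            "CUT & RUN profiling", "unspecified library")
)
print(classes)
```

Applying the fallbacks only where the primary lookup returned no class keeps well-annotated records from being overridden by free-text matches.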
GAMA is designed for:
- Grant and project scoping
- Identification of under-studied taxa
- Strategic prioritisation of existing datasets
It is particularly suited to investigations of underutilised and non-model plant species.
Current limitations:
- Dependent on NCBI metadata quality
- Runtime increases with species list size
- Novel protocols may not be fully captured by the ontology
- Results should be interpreted cautiously during early development
For licence terms, see the LICENSE file.