Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 9 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -397,7 +397,7 @@ The earth sciences folder contain subfolders for different data formats encounte
- 'sk-n-as-severus-ont.svcf': a minimal SVCF file obtained from ONT data using severus and octopusv/correct (SK-N-AS cell line)
- mills_and_1000G: Indels file downsampled based on reference position
- savana
- 'sv_breakpoints.vcf': a minimal VCF file obtained from ONT data using SAVANA
- 'sv_breakpoints.vcf': a minimal VCF file obtained from ONT data using SAVANA
- vcfanno
- 'vcfanno_grch38_module_test.tar.gz': exac.vcf.gz + exac.vcf.gz.tbi and they're reference ExAC vcf used to query
- 'vcfanno.toml': configuration file for vcfanno to operate
Expand Down Expand Up @@ -741,6 +741,14 @@ The earth sciences folder contain subfolders for different data formats encounte
- 1000GP.chr*.chunks.txt: chunks of the chromosome obtain with GLIMPSE_chunk
- AFR.gwas: Study locus file. From [SuShiE](https://github.com/mancusolab/sushie).
- AFR.ld: LD matrix file. From [SuShiE](https://github.com/mancusolab/sushie).
- hdl/reference/chr1.1_toy.bim: Synthetic toy HDL-format BIM sidecar for chunk 1.1, generated by `generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/chr1.1_toy.rda: Synthetic toy HDL-format LD reference payload for chunk 1.1, generated by `generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/chr1.2_toy.bim: Synthetic toy HDL-format BIM sidecar for chunk 1.2, generated by `generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/chr1.2_toy.rda: Synthetic toy HDL-format LD reference payload for chunk 1.2, generated by `generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/toy_snp_counter.RData: Synthetic toy HDL-format SNP count metadata, generated by `generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/toy_snp_list.RData: Synthetic toy HDL-format SNP list metadata, generated by `generate_toy_hdl_data.R` for HDL-compatible inputs.
- sumstats/trait1_canonical.tsv: Synthetic canonical toy summary statistics for trait 1, generated by `hdl/generate_toy_hdl_data.R` for small GWAS-style module inputs.
- sumstats/trait2_canonical.tsv: Synthetic canonical toy summary statistics for trait 2, generated by `hdl/generate_toy_hdl_data.R` for small GWAS-style module inputs.
- popgen/clustering/: Test data for the new custom modules (`custom/clustering`, `custom/clustermetrics`, `custom/clustervisualization`) - `test.eigenvec`, `test_clusters.csv`, `test_features.tsv` + `README.md` (200-sample PCA derived from `popgen/plink_simulated` via `plink2 --pca`, plus KMeans cluster assignments)
- svsig:

Expand Down
41 changes: 41 additions & 0 deletions data/genomics/homo_sapiens/popgen/hdl/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
# HDL Toy Test Dataset

These files are synthetic toy fixtures for HDL module testing in the companion
`nf-core/modules` work for `nf-core/modules#10912`. They are intended to exercise
[HDL](https://github.com/zhenin/HDL) inputs in tests, not to provide a
scientific LD reference panel or redistributed upstream reference bundle.

## Layout

- `reference/`: toy HDL LD reference chunks and metadata sidecars
- `../sumstats/`: canonical toy summary-statistics tables aligned to the toy SNPs

## Regeneration

From this directory:

```bash
Rscript generate_toy_hdl_data.R
```

From the root of the `nf-core/test-datasets` worktree:

```bash
Rscript data/genomics/homo_sapiens/popgen/hdl/generate_toy_hdl_data.R
```

## R Objects

The `.bim` sidecars, both canonical `sumstats/*.tsv` files, and the R binary
payloads are all generated locally by `generate_toy_hdl_data.R` from fully
synthetic constants in this directory.

- `reference/chr1.1_toy.rda` and `reference/chr1.2_toy.rda` each contain
synthetic `LDsc`, `lam`, and `V` objects for one toy HDL chunk.
- `reference/toy_snp_counter.RData` contains `nsnps.list` and
`nsnps.list.imputed`, each as a named one-element list with the toy chunk SNP
counts.
- `reference/toy_snp_list.RData` contains `snps.list.imputed.vector`, the four
synthetic SNP IDs shared by the toy fixtures.
- `../sumstats/trait1_canonical.tsv` and `../sumstats/trait2_canonical.tsv` are
tiny canonical summary-statistics tables keyed to those synthetic SNP IDs.
111 changes: 111 additions & 0 deletions data/genomics/homo_sapiens/popgen/hdl/generate_toy_hdl_data.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
#!/usr/bin/env Rscript

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a (MIT) licensing statement on top of this script.

# SPDX-License-Identifier: MIT

args <- commandArgs(trailingOnly = FALSE)
file_arg <- "--file="
script_path <- sub(file_arg, "", args[grep(file_arg, args)])

if (length(script_path) != 1 || script_path == "") {
stop("Unable to determine the script path from commandArgs().")
}

script_dir <- dirname(normalizePath(script_path))
reference_dir <- file.path(script_dir, "reference")
sumstats_dir <- file.path(script_dir, "..", "sumstats")

dir.create(reference_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(sumstats_dir, recursive = TRUE, showWarnings = FALSE)

writeLines(
c(
"1 rs1 0 101 A G",
"1 rs2 0 102 C T"
),
file.path(reference_dir, "chr1.1_toy.bim")
)

writeLines(
c(
"1 rs3 0 201 A C",
"1 rs4 0 202 G T"
Comment on lines +21 to +30
),
file.path(reference_dir, "chr1.2_toy.bim")
)

lam <- c(1.3, 0.85)
LDsc <- c(1.1, 1.4)
V <- diag(2)
save(
LDsc,
lam,
V,
file = file.path(reference_dir, "chr1.1_toy.rda"),
compress = "gzip"
)

lam <- c(1.25, 0.9)
LDsc <- c(1.2, 1.35)
V <- diag(2)
save(
LDsc,
lam,
V,
file = file.path(reference_dir, "chr1.2_toy.rda"),
compress = "gzip"
)

nsnps.list <- list("1" = c(2, 2))
nsnps.list.imputed <- list("1" = c(2, 2))
save(
nsnps.list.imputed,
nsnps.list,
file = file.path(reference_dir, "toy_snp_counter.RData"),
compress = "gzip"
)

snps.list.imputed.vector <- c("rs1", "rs2", "rs3", "rs4")
save(
snps.list.imputed.vector,
file = file.path(reference_dir, "toy_snp_list.RData"),
compress = "gzip"
)

trait1 <- data.frame(
SNP = c("rs1", "rs2", "rs3", "rs4"),
A1 = c("A", "C", "A", "G"),
A2 = c("G", "T", "C", "T"),
CHR = c(1, 1, 1, 1),
POS = c(101, 102, 201, 202),
RSID = c("rs1", "rs2", "rs3", "rs4"),
EffectAllele = c("A", "C", "A", "G"),
OtherAllele = c("G", "T", "C", "T"),
N = c(10000, 10000, 10000, 10000),
Z = c(0.5, -0.2, 0.4, -0.1)
)
write.table(
trait1,
file.path(sumstats_dir, "trait1_canonical.tsv"),
sep = "\t",
quote = FALSE,
row.names = FALSE
)

trait2 <- data.frame(
SNP = c("rs1", "rs2", "rs3", "rs4"),
A1 = c("A", "C", "A", "G"),
A2 = c("G", "T", "C", "T"),
CHR = c(1, 1, 1, 1),
POS = c(101, 102, 201, 202),
RSID = c("rs1", "rs2", "rs3", "rs4"),
EffectAllele = c("A", "C", "A", "G"),
OtherAllele = c("G", "T", "C", "T"),
N = c(12000, 12000, 12000, 12000),
Z = c(0.3, -0.4, 0.2, -0.2)
)
write.table(
trait2,
file.path(sumstats_dir, "trait2_canonical.tsv"),
sep = "\t",
quote = FALSE,
row.names = FALSE
)
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
1 rs1 0 101 A G
1 rs2 0 102 C T
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
1 rs3 0 201 A C
1 rs4 0 202 G T
Binary file not shown.
Binary file not shown.
Binary file not shown.
31 changes: 31 additions & 0 deletions data/genomics/homo_sapiens/popgen/sumstats/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# Toy Population-Genetics Summary Statistics

These files are tiny synthetic GWAS-style summary-statistics tables intended for
module testing. They are generated from fixed constants by the companion HDL
fixture generator at `../hdl/generate_toy_hdl_data.R`.

## Layout

- `trait1_canonical.tsv`: synthetic canonical summary statistics for trait 1
- `trait2_canonical.tsv`: synthetic canonical summary statistics for trait 2

## Regeneration

From the `hdl/` directory:

```bash
Rscript generate_toy_hdl_data.R
```

From the root of the `nf-core/test-datasets` worktree:

```bash
Rscript data/genomics/homo_sapiens/popgen/hdl/generate_toy_hdl_data.R
```

## Notes

These tables are not HDL-specific at the file-format level. They are kept under
`popgen/sumstats/` so they can be reused by modules that consume small
GWAS-style tabular inputs, while the HDL reference panel assets remain grouped
under `popgen/hdl/reference/`.
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
SNP A1 A2 CHR POS RSID EffectAllele OtherAllele N Z
rs1 A G 1 101 rs1 A G 10000 0.5
rs2 C T 1 102 rs2 C T 10000 -0.2
rs3 A C 1 201 rs3 A C 10000 0.4
rs4 G T 1 202 rs4 G T 10000 -0.1
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
SNP A1 A2 CHR POS RSID EffectAllele OtherAllele N Z
rs1 A G 1 101 rs1 A G 12000 0.3
rs2 C T 1 102 rs2 C T 12000 -0.4
rs3 A C 1 201 rs3 A C 12000 0.2
rs4 G T 1 202 rs4 G T 12000 -0.2