Sample MendelVar input:
User input is a tab-separated (.tsv), comma-separated (.csv) or space-separated table of single 1-indexed positions Accepted chromosome names: 1-22, X, Y. Accepted assemblies: GRCh37 (hg19) or GRCh38 (hg38).
Table needs to be header-less.
We accept two columns with chromosome and position. Examples:
chr1 10177
Or
1 10177
We also accept (and encourage) a third column with unique custom IDs, e.g.:
1 10177 rs367896724
If no custom ID is provided in the input, custom ID is created by MendelVar by concatenating the input chromosome and position, e.g. chr1_10177.
<aside> 💡 If you only have a list of SNP rsIDs and require to obtain their GRCh37 or GRCh38 genomic coordinates to run MendelVar, you can easily do so on the web using Ensembl Biomart GRCh37 or Ensembl Biomart GRCh38. An easy tutorial on how to do it available from 1000 Genomes.
</aside>
Going with the route 3 of input of single genomic positions, we need to first be able to generate genomic intervals which will be subsequently searched for overlap with Mendelian disease-causing genes and variants. In route 3, flexible linkage disequilibrium (LD)-based intervals can be created using LDlink.
MendelVar offers user to query bi-allelic variation in 1000 Genomes populations via LDlink LDproxy app, either through dbSNP rsIDs ('yes' for Use provided rsIDs?) or single positional coordinates ('no' for Use provided rsIDs?) and then generates custom intervals based on LD-dependent boundaries.
<aside> 💡 Warning: LDLink runs on dbSNP151 database. Whenever possible, choosing Use provided rsIDs option is recommended, due to: 1) possibility of multiple variants overlapping the same position; 2) incongruence between user-provided coordinates for a given dbSNP variant and those in the dbSNP database; 3) in particular for indels, 1000 Genomes positions often vary off by a few bases from those in the current release of dbSNP and for that reason, they cannot often be found when looked up by position in the 1000 Genomes dataset. If the user chooses the "Use provided rsIDs" option and provides dbSNP rsIDs in the third column, these IDs will used for lookup in LDLink LDproxy directly, rather than using positional coordinates.
</aside>
<aside> 💡 Warning: LDLink accepts only GRCh37 coordinates. As LDlink accepts only GRCh37 coordinates, we map user-submitted GRCh38 coordinates to GRCh37 coordinates with the UCSC liftOver tool. Using GRCh38 coordinates in the MendelVar pipeline with the LDLink step can result in loss of loci, if inputted genomic regions are absent in GRCh37, if positions lifted over to GRCh37 cannot be found in the 1000 Genomes reference or if LDLink-generated intervals cannot be fully lifted over back to GRCh38 without splits. All the loci lost at any stage are listed in the MendelVar log file.
</aside>
Using an LD statistic of choice, r^2 or D’, LD is calculated within the 1 Mbp window of input variant with LDlink. Boundary around the input variant is generated by finding the most distant SNP in the window with a minimum specified LD value (between 0-1) in the selected target 1000 Genomes population: CEU (Utah residents of northern European descent), EUR (European), EAS (East Asian), SAS (South Asian), AFR (African) or AMR (Ad-mixed American). The boundaries can be optionally extended to the nearest recombination hotspot with recombination rate >3 cM/Mb (Myers et al., 2005) based on HapMap II (Extend to the nearest recombination hotspot? option).