Overview
SEMplR (SNP Effect Matrix Pipeline in R) is an R package that predicts transcription factor (TF) binding. SEMplR can be used to predict the binding affinity of TFs at genomic loci or to predict the effect of genetic variation on TF binding.
SEMplR scores genomic regions or sequences of interest against SNP Effect Matrices (SEMs). An SEM is a position-by-nucleotide matrix generated by integrating information from position weight matrices (PWMs), ChIP-seq, and DNase-seq data. Because binding data are integrated, SEM scores are more indicative of true binding potential than scores from traditional PWM-based motif analyses, which mainly reflect sequence similarity to a consensus motif. You can read more about SEMs and how they are generated in the original SEMpl paper.
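To make the idea of scoring a sequence against a position-by-nucleotide matrix concrete, here is a minimal base R sketch. The matrix values are invented, `score_sequence` is a hypothetical helper, and the simple additive scheme is an assumption; SEMplR's own scoring and normalization may differ.

```r
# Toy 4-position matrix. Values are invented for illustration and are not
# taken from any real SEM.
toy_sem <- matrix(
  c( 0.0, -1.2, -0.8, -1.5,   # position 1
    -0.9,  0.0, -1.1, -0.7,   # position 2
    -1.3, -0.6,  0.0, -1.0,   # position 3
    -0.4, -1.4, -0.9,  0.0),  # position 4
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Hypothetical helper: sum the matrix entry for the observed base at each
# position (assumes an additive score).
score_sequence <- function(sequence, mat) {
  bases <- strsplit(sequence, "")[[1]]
  stopifnot(length(bases) == nrow(mat), all(bases %in% colnames(mat)))
  sum(mat[cbind(seq_along(bases), match(bases, colnames(mat)))])
}

best <- score_sequence("ACGT", toy_sem)      # highest-scoring sequence here
mismatch <- score_sequence("CCGT", toy_sem)  # one change at position 1
```

In this toy setup, `mismatch` is lower than `best` because the C at position 1 carries a negative entry.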
This package extends the functionality of the SEMpl (SNP Effect Matrix pipeline) command line tool developed by the Boyle Lab at the University of Michigan to support data analysis and visualization with SEMs.
Citation
If you use SEMplR in your work, please also cite SEMpl:
Sierra S Nishizaki, Natalie Ng, Shengcheng Dong, Robert S Porter, Cody Morterud, Colten Williams, Courtney Asman, Jessica A Switzenberg, Alan P Boyle, Predicting the effects of SNPs on transcription factor binding affinity, Bioinformatics, Volume 36, Issue 2, 15 January 2020, Pages 364–372, https://doi.org/10.1093/bioinformatics/btz612
Installation
devtools::install_github("grkenney/SEMplR")
Basic Usage
Below are some examples of basic usage. Please see the vignette for more detailed workflow examples.
Predicting transcription factor binding
SEMplR accepts GRanges objects or lists of sequences to score. Here, we analyze two loci with SEMplR’s default set of 223 pre-computed SEMs, stored in the SEMC object. The scoreBinding function produces a data object with information about the ranges analyzed, SEM metadata, and a table with 446 rows (one entry for each locus and SEM combination).
library(SEMplR)
library(BSgenome.Hsapiens.UCSC.hg19)

# load the default set of SEMs
data(SEMC)

# define genomic loci to score
gr <- GenomicRanges::GRanges(seqnames = c("chr12", "chr19"),
                             ranges = c(94136009, 10640062))

# score TF binding at each locus
sb <- scoreBinding(gr,
                   sem = SEMC,
                   genome = Hsapiens)
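The row count of the scoring table follows from pairing every locus with every SEM. The base R sketch below reproduces that arithmetic with placeholder identifiers; the SEM names are hypothetical stand-ins, not the real names in SEMC, and no SEMplR functions are called.

```r
# Placeholder identifiers standing in for the two loci and the 223 SEMs.
loci <- c("chr12:94136009", "chr19:10640062")
sem_ids <- paste0("SEM_", seq_len(223))  # hypothetical names

# One row per locus/SEM pair.
combos <- expand.grid(locus = loci, sem = sem_ids, stringsAsFactors = FALSE)
n_rows <- nrow(combos)  # 2 loci x 223 SEMs
```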
When analyzing large sets of loci, it can be helpful to know whether one or more TFs are predicted to bind more often than we would expect by chance. SEMplR includes enrichment and plotting functions to address this question.
# compute enrichment
e <- enrichSEMs(sb, SEMC)
# plot enrichment results
plotEnrich(e, SEMC)
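For intuition about what "more than expected by chance" can mean, the sketch below runs a one-sided binomial test in base R. All counts and the background rate are made up, and this is only one generic way to frame the question; it is not a description of how enrichSEMs computes enrichment.

```r
# Hypothetical numbers, for illustration only.
n_loci <- 200            # loci scored
n_bound <- 30            # loci where one TF is predicted to bind
background_rate <- 0.05  # assumed chance of predicted binding at a random locus

# Is the observed count higher than the background rate would suggest?
res <- binom.test(n_bound, n_loci, p = background_rate,
                  alternative = "greater")
p_value <- res$p.value
```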
Predicting effect of genetic variation on transcription factor binding
SEMplR accepts both VRanges and GRanges objects that specify a reference and an alternative allele. Every variant is scored against every SEM, and each allele is scored independently.
The resulting object contains three slots holding the variants scored, SEM metadata, and the scoring table. These can be accessed with the variants(), semData(), and scores() functions, respectively.
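# define variants with reference and alternative alleles
vr <- VariantAnnotation::VRanges(seqnames = c("chr12", "chr19"),
                                 ranges = c(94136009, 10640062),
                                 ref = c("G", "T"), alt = c("C", "A"),
                                 id = c("A", "B"))

# score both alleles of each variant against every SEM
sv <- scoreVariants(vr = vr,
                    sem = SEMC,
                    genome = Hsapiens)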
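As a rough illustration of scoring each allele independently, the base R sketch below scores a reference and an alternative sequence against a toy matrix and compares the two. The matrix values, the sequences, and the `score_sequence` helper are all hypothetical, and it considers a single fixed window, which is a simplification; SEMplR's own procedure may differ.

```r
# Toy 4-position matrix with invented values (not a real SEM).
toy_sem <- matrix(
  c( 0.0, -1.2, -0.8, -1.5,
    -0.9,  0.0, -1.1, -0.7,
    -1.3, -0.6,  0.0, -1.0,
    -0.4, -1.4, -0.9,  0.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Hypothetical helper: additive score over positions.
score_sequence <- function(sequence, mat) {
  bases <- strsplit(sequence, "")[[1]]
  stopifnot(length(bases) == nrow(mat), all(bases %in% colnames(mat)))
  sum(mat[cbind(seq_along(bases), match(bases, colnames(mat)))])
}

ref_seq <- "ACGT"  # window carrying the reference allele G at position 3
alt_seq <- "ACTT"  # same window carrying the alternative allele T

ref_score <- score_sequence(ref_seq, toy_sem)
alt_score <- score_sequence(alt_seq, toy_sem)

# A negative difference means the alternative allele scores lower
# than the reference in this toy example.
score_change <- alt_score - ref_score
```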
SEMplR includes two plotting functions to help users predict (1) which TFs change binding with a genetic variant and (2) which variants change the binding of a TF.
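# plot the scored variants for a single SEM (here, IKZF1)
plotSEMVariants(sv, "IKZF1")

# plot the SEMs for a single variant (here, the variant with id "A")
plotSEMMotifs(sv, "A")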
Please see more information on these plots and their interpretation in our vignette.