GRanges-class
of methylation
read counts filtered by coverage.R/uniqueGRfilterByCov.R
uniqueGRfilterByCov.Rd
Given two GRanges-class
objects,
samples '1' and '2', carrying the counts of methylated (mC) and unmethylated
(uC) cytosines in their metacolumns, this function will filter by coverage
each cytosine site from each GRanges-class
object.
uniqueGRfilterByCov(
x,
y = NULL,
min.coverage = 4,
and.min.cov = TRUE,
min.meth = 0,
min.umeth = 0,
min.sitecov = 4,
percentile = 0.9999,
min.percentile = TRUE,
high.coverage = NULL,
columns = c(mC = 1, uC = 2),
ignore.strand = FALSE,
y.centroid = NULL,
num.cores = 1L,
tasks = 0L,
verbose = TRUE,
...
)
An object from the classes 'GRanges', 'InfDiv', or 'pDMP' with
methylated and unmethylated counts in its meta-column. If the argument 'y'
is not given, then it is assumed that the first four columns of the
GRanges-class
metadata 'x' are counts:
methylated and unmethylated counts for samples '1' and '2'. If \(x\) is a
'InfDiv', or 'pDMP', then \('y'\) is not needed, since samples '1' and '2'
are the first four columns of these objects. In the context of MethylIT
methylation analysis, \(x\) is considered a control sample or, more
appropriated (if available) the centroide of control group.
A GRanges-class
or a
GRangesList-class
object with methylated and
unmethylated counts in its meta-column. Default is NULL. If 'y' is a
GRangesList-class
object, then the first two
metacolumns from 'x' are used in the pair-wise filtering toguether with the
first two metacolumns from each GRanges-class
element from 'y'.
An integer or an integer vector of length 2. If 'min.coverage' is an integer vector, then the corresponding min coverage is applied to each sample. Default is 4, i.e., \(min.coverage = c(4,4)\).
Logical. Whether to apply the logical AND to select the cytosine sites based on \(min.coverage\). If FALSE, then a logical OR is applied, and cytosine sites where at least one sample hold the \(min.coverage\) are preserved. Default is TRUE.
An integer or an integer vector of length 2. Cytosine sites where the numbers of read counts of methylated cytosine in both samples, '1' and '2', are less than 'min.meth' are discarded. If 'min.meth' is an integer vector, then the corresponding min number of reads is applied to each sample. That is, \(min.meth\) implement a logical OR.
An integer or an integer vector of length 2. Minimum number of reads to consider cytosine position. Specifically cytosine positions where \(uC \leq min.umeth\) are discarded. Default is \(min.umeth = 0\).
An integer. The minimum total coverage. Only sites where the total coverage \(cov1 + cov2\) is greater than 'min.sitecov' are considered for downstream analysis, where cov1 and cov2 are the coverages for samples 1 and 2, respectively.
Threshold to remove the outliers (PCR bias) from each file and all files stacked. If 'high.coverage = NULL', then the threshold \(q\) will be computed as: $$q1 = quantile(cov1, probs=percentile)$$ $$q2 = quantile(cov2, probs=percentile)$$.
where \(cov1\) and \(cov2\) are the coverage vectors from samples 1 and 2, respectively.
Logical. Each sample yield a percentile value. The user must decide whether to use the minimum or the maximum value from these percentile values. Default is TRUE. Hence, \(min.percentile = FALSE\) implies to use the maximum value.
An integer for read counts. Cytosine sites having higher coverage than this are discarded. Default is NULL.
Vector of integer numbers of the columns (from each GRanges meta-column) where the methylated and unmethylated counts are provided. If not provided, then the methylated and unmethylated counts are assumed to be at columns 1 and 2, respectively.
When set to TRUE, the strand information is ignored in
the overlap of GRanges-class
objects. This is a
parameter passed to uniqueGRanges
function. Default value:
FALSE.
Optional. A GRanges-class
object corresponding to the treatment/individual centroide. This information
is applied to reduce the bias originated by missing cytosine sites. The
centroide can be compute applying function poolFromGRlist
.
Parameters for parallel computation using package
BiocParallel-package
: the number of cores to
use, i.e. at most how many child processes will be run simultaneously (see
bplapply
and the number of tasks per job (only
for Linux OS).
if TRUE, prints the function log to stdout
Additional parameters for uniqueGRanges
function.
if 'x' and 'y' are GRanges-class
object, then a GRanges-class
with the columns
of methylated and unmethylated counts filtered for each cytosine position.
A GRangesList-class
object will returned, if
'y' is a GRangesList-class
object of same
length as length(y) and named as names(y).
Cytosine sites with 'coverage' > 'min.coverage' in at least one of the samples are preserved. Positions with 'coverage' < 'min.coverage' in both samples, 'x' and 'y', are removed. Positions with 'coverage' > 'percentile' (e.g., 99.9 percentile) are removed as well. It is expected that the columns of methylated and unmethylated counts are given.
This function is addressed to create pair-wise
GRanges-class
object with four metacolumns of
count: samples '1' and '2', carrying the counts of methylated (mC) and
unmethylated (uC) cytosines in their metacolumns, respectively. Counts from
sample 1 are typically used as reference counts in computing information
divergences in the downtstream analysis.
The cut-off value to remove PCR bias is computed as:
If high.coverage is NULL, then the cut-off point is: \(q = min(q1, q2)\) (if \(min.percentile\) = TRUE) or \(q = max(q1, q2)\). If high.coverage is not NULL, then $$q = max(q, high.coverage)$$.
Another source of bias is originated by missing cytosine sites. Missing data are frequently found in experimental data sets and, in particular, in bisulfite genomic sequencing data. Typically, in statistical analyses, the bias originated by missing data (for given variable) is mitigated by using the mean of the known values for the corresponding variable. That is, in present case, if the reads for some cytosine site are missed in a sample from a set of, e.g., three individuals, then the means of reads (methylated and unmethylated) for such site are applied as an estimation of the best expected ("guessed") value of missed reads. Obviously, if the reads are missed in all the samples, then the site is discarded (see examples).
The treatment centroide can be compute applying function
poolFromGRlist
. Also notice that, since the centroide
correction is only available for the treat group, it is assumed that sample
\(x\) carries reads for each (or almost all) cytosine sites are provide.
## Create new data. It is assumed that sample 'x' carries reads
## for each cytosine sites are provide.
strands <- c("+","-","+","-", "+","-","+","+","+","+","+")
pos <- c(10,11,11,12,13,13,14,15,16,17,18)
x <- data.frame(chr = 'chr1', start = pos, end = pos,
mC = c(2,3,2,5,10,7,9,11,4,10,7),
uC = c(2,30,20,4,8,0,10,3,0,8,1),
strand = strands)
x <- makeGRangesFromDataFrame(x, keep.extra.columns = TRUE)
x
#> GRanges object with 11 ranges and 2 metadata columns:
#> seqnames ranges strand | mC uC
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr1 10 + | 2 2
#> [2] chr1 11 - | 3 30
#> [3] chr1 11 + | 2 20
#> [4] chr1 12 - | 5 4
#> [5] chr1 13 + | 10 8
#> [6] chr1 13 - | 7 0
#> [7] chr1 14 + | 9 10
#> [8] chr1 15 + | 11 3
#> [9] chr1 16 + | 4 0
#> [10] chr1 17 + | 10 8
#> [11] chr1 18 + | 7 1
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## sample y
y <- data.frame(chr = 'chr1', start = 11:18, end = 11:18,
mC2 = c(4,1,2,1,4,5:7), uC2 = c(0,0,2:7),
strand = c("+","-","-","+","+","+","+","+"))
y <- makeGRangesFromDataFrame(y, keep.extra.columns = TRUE)
y
#> GRanges object with 8 ranges and 2 metadata columns:
#> seqnames ranges strand | mC2 uC2
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr1 11 + | 4 0
#> [2] chr1 12 - | 1 0
#> [3] chr1 13 - | 2 2
#> [4] chr1 14 + | 1 3
#> [5] chr1 15 + | 4 4
#> [6] chr1 16 + | 5 5
#> [7] chr1 17 + | 6 6
#> [8] chr1 18 + | 7 7
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## The default settings. Sites where one of the samples has zero methylation
## calling or min.coverage is lesser than 4 in at least one sample are
## discarded. This setting implies a drastic decision and the amount of
## cytosine removed can be lead to strong biased conclusions in the
## downstream analysis.
uniqueGRfilterByCov(x = x, y = y,
percentile = 1,
ignore.strand = FALSE)
#> *** Building coordinates for the new GRanges object ...
#> *** Strand information is preserved ...
#>
|
| | 0%
|
|=================================== | 50%
|
|======================================================================| 100%
#>
#> *** Sorting by chromosomes and start positions...
#> GRanges object with 4 ranges and 4 metadata columns:
#> seqnames ranges strand | mC uC mC2 uC2
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric>
#> [1] chr1 13 - | 7 0 2 2
#> [2] chr1 15 + | 11 3 4 4
#> [3] chr1 16 + | 4 0 5 5
#> [4] chr1 18 + | 7 1 7 7
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Setting 'and.min.cov = FALSE' undesired cytosine sites are preseved. For
## example, meaningless situations with methylation levels
## p = 1/(1 + 0) = 1.
uniqueGRfilterByCov(x = x, y = y,
and.min.cov = FALSE,
ignore.strand = FALSE,
percentile = 1,
verbose = FALSE)
#> GRanges object with 6 ranges and 4 metadata columns:
#> seqnames ranges strand | mC uC mC2 uC2
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric>
#> [1] chr1 10 + | 2 2 0 0
#> [2] chr1 12 - | 5 4 1 0
#> [3] chr1 13 - | 7 0 2 2
#> [4] chr1 15 + | 11 3 4 4
#> [5] chr1 16 + | 4 0 5 5
#> [6] chr1 18 + | 7 1 7 7
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Setting 'min.coverage = 8' does not solves the previous issue, but still
## it preserves cytosine sites with one of the samples with zero or small
## coverage:
uniqueGRfilterByCov(x = x, y = y,
and.min.cov = FALSE,
min.coverage = 8,
percentile = 1,
ignore.strand = FALSE,
verbose = FALSE)
#> GRanges object with 4 ranges and 4 metadata columns:
#> seqnames ranges strand | mC uC mC2 uC2
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric>
#> [1] chr1 12 - | 5 4 1 0
#> [2] chr1 15 + | 11 3 4 4
#> [3] chr1 16 + | 4 0 5 5
#> [4] chr1 18 + | 7 1 7 7
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## A centroid, a vector of means of methylation read for each cytosine site
## from the treatment group can be used as the best estimation to replace
## missing data: 'mC=0' and 'uC=0' or low coverage sites 'mC=1' and 'uC=0'.
y_centroid <- data.frame(chr = 'chr1',
start = pos, end = pos,
mC2 = c(8,7,6,7,5,8,1:5), uC2 = 0:10,
strand = c("+","-","+","-", "+","-","+","+","+","+","+"))
y_centroid <- makeGRangesFromDataFrame(y_centroid,
keep.extra.columns = TRUE)
y_centroid
#> GRanges object with 11 ranges and 2 metadata columns:
#> seqnames ranges strand | mC2 uC2
#> <Rle> <IRanges> <Rle> | <numeric> <integer>
#> [1] chr1 10 + | 8 0
#> [2] chr1 11 - | 7 1
#> [3] chr1 11 + | 6 2
#> [4] chr1 12 - | 7 3
#> [5] chr1 13 + | 5 4
#> [6] chr1 13 - | 8 5
#> [7] chr1 14 + | 1 6
#> [8] chr1 15 + | 2 7
#> [9] chr1 16 + | 3 8
#> [10] chr1 17 + | 4 9
#> [11] chr1 18 + | 5 10
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## The cytosine sites with missing data or low coverage will be still,
## included, using the centroid of the centroinde of th sample group to
## which 'y' belong to.
uniqueGRfilterByCov(x = x, y = y,
and.min.cov = FALSE,
min.coverage = c(1, 8),
ignore.strand = FALSE,
y.centroid = y_centroid,
min.percentile = FALSE,
percentile = 1,
verbose = FALSE)
#> GRanges object with 11 ranges and 4 metadata columns:
#> seqnames ranges strand | mC uC mC2 uC2
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric>
#> [1] chr1 10 + | 2 2 8 0
#> [2] chr1 11 - | 3 30 7 1
#> [3] chr1 11 + | 2 20 6 2
#> [4] chr1 12 - | 5 4 7 3
#> [5] chr1 13 + | 10 8 5 4
#> [6] chr1 13 - | 7 0 8 5
#> [7] chr1 14 + | 9 10 1 6
#> [8] chr1 15 + | 11 3 4 4
#> [9] chr1 16 + | 4 0 5 5
#> [10] chr1 17 + | 10 8 6 6
#> [11] chr1 18 + | 7 1 7 7
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths