Given two GRanges-class objects, samples '1' and '2', carrying the counts of methylated (mC) and unmethylated (uC) cytosines in their metacolumns, this function will filter by coverage each cytosine site from each GRanges-class object.

uniqueGRfilterByCov(
  x,
  y = NULL,
  min.coverage = 4,
  and.min.cov = TRUE,
  min.meth = 0,
  min.umeth = 0,
  min.sitecov = 4,
  percentile = 0.9999,
  min.percentile = TRUE,
  high.coverage = NULL,
  columns = c(mC = 1, uC = 2),
  ignore.strand = FALSE,
  y.centroid = NULL,
  num.cores = 1L,
  tasks = 0L,
  verbose = TRUE,
  ...
)

Arguments

x

An object from the classes 'GRanges', 'InfDiv', or 'pDMP' with methylated and unmethylated counts in its meta-column. If the argument 'y' is not given, then it is assumed that the first four columns of the GRanges-class metadata 'x' are counts: methylated and unmethylated counts for samples '1' and '2'. If \(x\) is a 'InfDiv', or 'pDMP', then \('y'\) is not needed, since samples '1' and '2' are the first four columns of these objects. In the context of MethylIT methylation analysis, \(x\) is considered a control sample or, more appropriated (if available) the centroide of control group.

y

A GRanges-class or a GRangesList-class object with methylated and unmethylated counts in its meta-column. Default is NULL. If 'y' is a GRangesList-class object, then the first two metacolumns from 'x' are used in the pair-wise filtering toguether with the first two metacolumns from each GRanges-class element from 'y'.

min.coverage

An integer or an integer vector of length 2. If 'min.coverage' is an integer vector, then the corresponding min coverage is applied to each sample. Default is 4, i.e., \(min.coverage = c(4,4)\).

and.min.cov

Logical. Whether to apply the logical AND to select the cytosine sites based on \(min.coverage\). If FALSE, then a logical OR is applied, and cytosine sites where at least one sample hold the \(min.coverage\) are preserved. Default is TRUE.

min.meth

An integer or an integer vector of length 2. Cytosine sites where the numbers of read counts of methylated cytosine in both samples, '1' and '2', are less than 'min.meth' are discarded. If 'min.meth' is an integer vector, then the corresponding min number of reads is applied to each sample. That is, \(min.meth\) implement a logical OR.

min.umeth

An integer or an integer vector of length 2. Minimum number of reads to consider cytosine position. Specifically cytosine positions where \(uC \leq min.umeth\) are discarded. Default is \(min.umeth = 0\).

min.sitecov

An integer. The minimum total coverage. Only sites where the total coverage \(cov1 + cov2\) is greater than 'min.sitecov' are considered for downstream analysis, where cov1 and cov2 are the coverages for samples 1 and 2, respectively.

percentile

Threshold to remove the outliers (PCR bias) from each file and all files stacked. If 'high.coverage = NULL', then the threshold \(q\) will be computed as: $$q1 = quantile(cov1, probs=percentile)$$ $$q2 = quantile(cov2, probs=percentile)$$.

where \(cov1\) and \(cov2\) are the coverage vectors from samples 1 and 2, respectively.

min.percentile

Logical. Each sample yield a percentile value. The user must decide whether to use the minimum or the maximum value from these percentile values. Default is TRUE. Hence, \(min.percentile = FALSE\) implies to use the maximum value.

high.coverage

An integer for read counts. Cytosine sites having higher coverage than this are discarded. Default is NULL.

columns

Vector of integer numbers of the columns (from each GRanges meta-column) where the methylated and unmethylated counts are provided. If not provided, then the methylated and unmethylated counts are assumed to be at columns 1 and 2, respectively.

ignore.strand

When set to TRUE, the strand information is ignored in the overlap of GRanges-class objects. This is a parameter passed to uniqueGRanges function. Default value: FALSE.

y.centroid

Optional. A GRanges-class object corresponding to the treatment/individual centroide. This information is applied to reduce the bias originated by missing cytosine sites. The centroide can be compute applying function poolFromGRlist.

num.cores, tasks

Parameters for parallel computation using package BiocParallel-package: the number of cores to use, i.e. at most how many child processes will be run simultaneously (see bplapply and the number of tasks per job (only for Linux OS).

verbose

if TRUE, prints the function log to stdout

...

Additional parameters for uniqueGRanges function.

Value

if 'x' and 'y' are GRanges-class

object, then a GRanges-class with the columns of methylated and unmethylated counts filtered for each cytosine position.

A GRangesList-class object will returned, if 'y' is a GRangesList-class object of same length as length(y) and named as names(y).

Details

Cytosine sites with 'coverage' > 'min.coverage' in at least one of the samples are preserved. Positions with 'coverage' < 'min.coverage' in both samples, 'x' and 'y', are removed. Positions with 'coverage' > 'percentile' (e.g., 99.9 percentile) are removed as well. It is expected that the columns of methylated and unmethylated counts are given.

This function is addressed to create pair-wise GRanges-class object with four metacolumns of count: samples '1' and '2', carrying the counts of methylated (mC) and unmethylated (uC) cytosines in their metacolumns, respectively. Counts from sample 1 are typically used as reference counts in computing information divergences in the downtstream analysis.

The cut-off value to remove PCR bias is computed as:

If high.coverage is NULL, then the cut-off point is: \(q = min(q1, q2)\) (if \(min.percentile\) = TRUE) or \(q = max(q1, q2)\). If high.coverage is not NULL, then $$q = max(q, high.coverage)$$.

Another source of bias is originated by missing cytosine sites. Missing data are frequently found in experimental data sets and, in particular, in bisulfite genomic sequencing data. Typically, in statistical analyses, the bias originated by missing data (for given variable) is mitigated by using the mean of the known values for the corresponding variable. That is, in present case, if the reads for some cytosine site are missed in a sample from a set of, e.g., three individuals, then the means of reads (methylated and unmethylated) for such site are applied as an estimation of the best expected ("guessed") value of missed reads. Obviously, if the reads are missed in all the samples, then the site is discarded (see examples).

The treatment centroide can be compute applying function poolFromGRlist. Also notice that, since the centroide correction is only available for the treat group, it is assumed that sample \(x\) carries reads for each (or almost all) cytosine sites are provide.

Examples

## Create new data. It is assumed that sample 'x' carries reads
## for each cytosine sites are provide.
strands <- c("+","-","+","-", "+","-","+","+","+","+","+")
pos <- c(10,11,11,12,13,13,14,15,16,17,18)

x <- data.frame(chr = 'chr1', start = pos, end = pos,
                mC = c(2,3,2,5,10,7,9,11,4,10,7),
                uC = c(2,30,20,4,8,0,10,3,0,8,1),
                strand = strands)

x <- makeGRangesFromDataFrame(x, keep.extra.columns = TRUE)
x
#> GRanges object with 11 ranges and 2 metadata columns:
#>        seqnames    ranges strand |        mC        uC
#>           <Rle> <IRanges>  <Rle> | <numeric> <numeric>
#>    [1]     chr1        10      + |         2         2
#>    [2]     chr1        11      - |         3        30
#>    [3]     chr1        11      + |         2        20
#>    [4]     chr1        12      - |         5         4
#>    [5]     chr1        13      + |        10         8
#>    [6]     chr1        13      - |         7         0
#>    [7]     chr1        14      + |         9        10
#>    [8]     chr1        15      + |        11         3
#>    [9]     chr1        16      + |         4         0
#>   [10]     chr1        17      + |        10         8
#>   [11]     chr1        18      + |         7         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## sample y
y <- data.frame(chr = 'chr1', start = 11:18, end = 11:18,
                mC2 = c(4,1,2,1,4,5:7), uC2 = c(0,0,2:7),
                strand = c("+","-","-","+","+","+","+","+"))

y <- makeGRangesFromDataFrame(y, keep.extra.columns = TRUE)
y
#> GRanges object with 8 ranges and 2 metadata columns:
#>       seqnames    ranges strand |       mC2       uC2
#>          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
#>   [1]     chr1        11      + |         4         0
#>   [2]     chr1        12      - |         1         0
#>   [3]     chr1        13      - |         2         2
#>   [4]     chr1        14      + |         1         3
#>   [5]     chr1        15      + |         4         4
#>   [6]     chr1        16      + |         5         5
#>   [7]     chr1        17      + |         6         6
#>   [8]     chr1        18      + |         7         7
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## The default settings. Sites where one of the samples has zero methylation
## calling or min.coverage is lesser than 4 in at least one sample are
## discarded. This setting implies a drastic decision and the amount of
## cytosine removed can be lead to strong biased conclusions in the
## downstream analysis.
uniqueGRfilterByCov(x = x, y = y,
                    percentile = 1,
                    ignore.strand = FALSE)
#>  *** Building coordinates for the new GRanges object ...
#>  *** Strand information is preserved ...
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================================================| 100%
#> 
#>  *** Sorting by chromosomes and start positions...
#> GRanges object with 4 ranges and 4 metadata columns:
#>       seqnames    ranges strand |        mC        uC       mC2       uC2
#>          <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>   [1]     chr1        13      - |         7         0         2         2
#>   [2]     chr1        15      + |        11         3         4         4
#>   [3]     chr1        16      + |         4         0         5         5
#>   [4]     chr1        18      + |         7         1         7         7
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Setting 'and.min.cov = FALSE' undesired cytosine sites are preseved. For
## example, meaningless situations with  methylation levels
## p = 1/(1 + 0) = 1.
uniqueGRfilterByCov(x = x, y = y,
                    and.min.cov = FALSE,
                    ignore.strand = FALSE,
                    percentile = 1,
                    verbose = FALSE)
#> GRanges object with 6 ranges and 4 metadata columns:
#>       seqnames    ranges strand |        mC        uC       mC2       uC2
#>          <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>   [1]     chr1        10      + |         2         2         0         0
#>   [2]     chr1        12      - |         5         4         1         0
#>   [3]     chr1        13      - |         7         0         2         2
#>   [4]     chr1        15      + |        11         3         4         4
#>   [5]     chr1        16      + |         4         0         5         5
#>   [6]     chr1        18      + |         7         1         7         7
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Setting 'min.coverage = 8' does not solves the previous issue, but still
## it preserves cytosine sites with one of the samples with zero or small
## coverage:
uniqueGRfilterByCov(x = x, y = y,
                    and.min.cov = FALSE,
                    min.coverage = 8,
                    percentile = 1,
                    ignore.strand = FALSE,
                    verbose = FALSE)
#> GRanges object with 4 ranges and 4 metadata columns:
#>       seqnames    ranges strand |        mC        uC       mC2       uC2
#>          <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>   [1]     chr1        12      - |         5         4         1         0
#>   [2]     chr1        15      + |        11         3         4         4
#>   [3]     chr1        16      + |         4         0         5         5
#>   [4]     chr1        18      + |         7         1         7         7
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## A centroid, a vector of means of methylation read for each cytosine site
## from the treatment group can be used as the best estimation to replace
## missing data: 'mC=0' and 'uC=0' or low coverage sites 'mC=1' and 'uC=0'.
y_centroid <- data.frame(chr = 'chr1',
                start = pos, end = pos,
                mC2 = c(8,7,6,7,5,8,1:5), uC2 = 0:10,
                strand = c("+","-","+","-", "+","-","+","+","+","+","+"))
y_centroid <- makeGRangesFromDataFrame(y_centroid,
                                      keep.extra.columns = TRUE)
y_centroid
#> GRanges object with 11 ranges and 2 metadata columns:
#>        seqnames    ranges strand |       mC2       uC2
#>           <Rle> <IRanges>  <Rle> | <numeric> <integer>
#>    [1]     chr1        10      + |         8         0
#>    [2]     chr1        11      - |         7         1
#>    [3]     chr1        11      + |         6         2
#>    [4]     chr1        12      - |         7         3
#>    [5]     chr1        13      + |         5         4
#>    [6]     chr1        13      - |         8         5
#>    [7]     chr1        14      + |         1         6
#>    [8]     chr1        15      + |         2         7
#>    [9]     chr1        16      + |         3         8
#>   [10]     chr1        17      + |         4         9
#>   [11]     chr1        18      + |         5        10
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## The cytosine sites with missing data or low coverage will be still,
## included, using the centroid of the centroinde of th sample group to
## which 'y' belong to.
uniqueGRfilterByCov(x = x, y = y,
                    and.min.cov = FALSE,
                    min.coverage = c(1, 8),
                    ignore.strand = FALSE,
                    y.centroid = y_centroid,
                    min.percentile = FALSE,
                    percentile = 1,
                    verbose = FALSE)
#> GRanges object with 11 ranges and 4 metadata columns:
#>        seqnames    ranges strand |        mC        uC       mC2       uC2
#>           <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>    [1]     chr1        10      + |         2         2         8         0
#>    [2]     chr1        11      - |         3        30         7         1
#>    [3]     chr1        11      + |         2        20         6         2
#>    [4]     chr1        12      - |         5         4         7         3
#>    [5]     chr1        13      + |        10         8         5         4
#>    [6]     chr1        13      - |         7         0         8         5
#>    [7]     chr1        14      + |         9        10         1         6
#>    [8]     chr1        15      + |        11         3         4         4
#>    [9]     chr1        16      + |         4         0         5         5
#>   [10]     chr1        17      + |        10         8         6         6
#>   [11]     chr1        18      + |         7         1         7         7
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths