public class MarkovModel extends MotifFinder
MarkovModel has two modes, selected by useVarLen=true|false. When useVarLen==true, the MM is a variable-length Markov model that differs from other MotifFinders only in that it includes a small LAPLACE correction. The other major functionality of MarkovModel is its set of methods that directly compute the probability of a motif or string, as well as the expected number of occurrences.
To use MarkovModel more traditionally, leave useVarLen==false (the default). Markov models of different orders are then generated by calls to generateModel(int), which generates every order of Markov model from the specified order down to 0. From there, every attempt to compute the probability of a string uses the highest available model, or maximum likelihood estimation if the string length is shorter than the highest model.
Another useful property of this class is the ability to compute cross entropies between models or of long sequences. Also, models can be constructed from Motif arrays as well as string arrays, simple Strings, and other MotifFinders.
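The idea of scoring a string with a fixed-order model plus a small correction can be sketched as follows. This is a minimal illustration and not the MarkovModel implementation: the class name SimpleMarkovSketch, the four-letter DNA alphabet, the add-one pseudocount, and the choice to fall back to a shorter context at the start of the string are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for illustration; not the MarkovModel class documented here. */
final class SimpleMarkovSketch {
    private static final String ALPHABET = "ACGT"; // assumed alphabet
    private static final double PSEUDOCOUNT = 1.0; // assumed add-one correction

    private final int order;
    private final Map<String, Integer> counts = new HashMap<>();

    SimpleMarkovSketch(String training, int order) {
        this.order = order;
        // Count every substring of length 1..order+1 so that shorter contexts exist too.
        for (int len = 1; len <= order + 1; len++) {
            for (int i = 0; i + len <= training.length(); i++) {
                counts.merge(training.substring(i, i + len), 1, Integer::sum);
            }
        }
    }

    /** P(next | context), smoothed with the pseudocount. */
    double conditional(char next, String context) {
        double contextTotal = 0.0;
        for (char c : ALPHABET.toCharArray()) {
            contextTotal += counts.getOrDefault(context + c, 0);
        }
        double hit = counts.getOrDefault(context + next, 0);
        return (hit + PSEUDOCOUNT) / (contextTotal + PSEUDOCOUNT * ALPHABET.length());
    }

    /** Natural-log probability of seq; early positions use whatever shorter context is available. */
    double logProbabilityOf(String seq) {
        double logP = 0.0;
        for (int i = 0; i < seq.length(); i++) {
            String context = seq.substring(Math.max(0, i - order), i);
            logP += Math.log(conditional(seq.charAt(i), context));
        }
        return logP;
    }

    public static void main(String[] args) {
        SimpleMarkovSketch mm = new SimpleMarkovSketch("ACGTACGTACGTTTGACA", 2);
        System.out.println(mm.logProbabilityOf("ACGT"));
        System.out.println(Math.exp(mm.logProbabilityOf("ACGT")));
    }
}
```

Because of the pseudocount, the four conditional probabilities for any context sum to 1 and no base ever receives probability zero, which keeps the log probability finite for unseen strings.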
Modifier and Type | Class and Description |
---|---|
protected static class | MarkovModel.KeyComparator: Implements an ordering on markov keys. |

Nested classes inherited from class MotifFinder: MotifFinder.EmptyFindResults, MotifFinder.UncompressedFindResults
Modifier and Type | Field and Description |
---|---|
protected double[] | counts |
protected MotifFinder | finder |
protected static double | LAPLACE |
protected java.util.HashMap<java.lang.String,java.lang.Double> | model |
private static long | serialVersionUID |
protected boolean | useVarLen |

Fields inherited from class MotifFinder: BIPARTITE_INIT_CACHE_SIZE, BIPARTITE_INIT_LEN, BIPARTITE_INIT_N_COUNT_MAX, BIPARTITE_INIT_N_COUNT_MIN, checkMotifFinderResults, countCache, DEFAULT_MAX_HASH_LENGTH, discardOverlaps, doPersisting, ExtraGeneMotifFinderIdentifier, extraGeneSoftCache, geneLength, geneStarts, genomeSoftCache, groupSoftCache, maxHashLength, NO_WILDCARD_OPTIMIZATION, rand, seqID, sequence, simpleFindCache, TemporaryMotifFinderIdentifier, UNEQUAL_GENE_LENGTHS, wildcardOptimizationCutoff
Constructor and Description |
---|
MarkovModel(MotifFinder finder): Creates a new MarkovModel using the finder as the basis. |
MarkovModel(java.lang.String sequence, int geneLength): Calls super(String, int). |
MarkovModel(java.lang.String seqID, java.lang.String sequence, int geneLength): Calls super(String, String, int). |
Modifier and Type | Method and Description |
---|---|
void | clear(): Clears all the stored lookups from this model. |
java.lang.Object | clone(): Returns a deep copy of this Markov model. |
double | computeEntropy() |
double | computeEntropy(int order): Computes the entropy of the given order, as H = -SUM[p lg p] where p = probability of the given motif. |
int | count(Motif m): The expected number of occurrences of the motif. |
int | count(Motif m, int k): Returns the number of times that any motif with a Hamming distance from m of up to k occurs in the sequence. |
int | count(java.lang.String seq): Computes the expected number of occurrences in the given sequence. |
MotifFinder | createNewInstance(java.lang.String seqID, java.lang.String sequence, int geneLength) |
double | crossEntropy(MarkovModel MM): Computes the cross entropy (KL Distance) between the highest shared order of this model vs the given model. |
double | crossEntropy(MarkovModel MM, Motif[] motifs, int start, int stop): Computes the cross entropy (KL Distance) between the highest shared order of this model vs the given model, using only the specified motifs, in the range [start,stop). |
double | crossEntropy(Motif[] motifs): Computes the cross entropy of the motifs versus the background model. |
double | crossEntropy(java.lang.String[] motifs): Computes the cross entropy of the motifs versus the background model. |
int[] | find(Motif m, int k): Returns a sorted array of the indices of where any motif with a Hamming distance from m of up to k occurs in the sequence. |
int[] | find(java.lang.String seq): Returns the result of the finder's find call. |
FindResults | findResults(java.lang.String pattern) |
void | generateModel(int order): Generates a new Markov model of the given order. |
int | getHighestOrder(): Returns the highest order model we have. |
char | getRandomBase(): Returns a random base from a 0th order MM. |
char | getRandomBase(java.lang.String history): Returns a base drawn randomly given the history. |
protected void | initializeDataStructure(): Initializes the model and counts array. |
double | logProbabilityOf(Motif m): The probability of any substring of this motif's length actually being an instance of this motif; returned in natural-log space. |
double | logProbabilityOf(java.lang.String seq): Computes the probability of seq given the highest order model we have. |
static MarkovModel | makeMarkovModel(Motif[] motifs): Constructs a suffix array for a Markov model using the given Motif[]; motifs with R have both orientations added, unless the motif is palindromic. |
static MarkovModel | makeMarkovModel(java.lang.String[] sequences): Constructs a suffix array for a Markov model using the given String[]. |
java.lang.String | makeRandomSequence(int length, int order): Generates a random string from the given order Markov model. |
double | modelProbability(java.lang.String seq): The simple probability of the given sequence. |
int | numMotifs(int len): Computes the total number of motifs of the given length in the finder. |
double | probabilityOf(char x, java.lang.String seq): Computes the probability of x given the sequence. |
double | probabilityOf(Motif m): Converts the log probability into real space. |
double | probabilityOf(java.lang.String seq): Converts the log probability into real space. |
java.lang.String | sampleRandomSequence(int length): Samples a random sequence from the finder's sequence, ensuring that the result doesn't contain any FILLER or SPACER characters. |
protected void | setNums(): Sets numGenes and numBases appropriately from the finder. |
void | setUseVarLen(boolean b) |
java.lang.String | toString(): Prints out a given model in order. |
private double | varLenLogProbabilityOf(java.lang.String seq) |

Methods inherited from class MotifFinder: areOnSameGene, count, countGenes, createFindResults, createNewInstance, find, find, findResults, findResults, getAveragePercentageLengthVariation, getAvgGeneLength, getBaseCount, getCountCacheSize, getDiscardOverlaps, getEndIndexOf, getFindCacheSize, getFindResults, getGeneCount, getGeneIDOf, getGeneSequence, getInternalGeneLength, getPositionInGene, getSequence, getSequenceLength, getWildcardOptimizationCutoff, hasUniformGeneLengths, initialize, isCachingOn, loadBipartiteCaches, resetExtraGeneSoftCache, setDiscardOverlaps, setWildcardOptimization, simpleFind, strandCount, strandCount, turnOffCaching, turnOnCaching, turnOnCaching, turnOnCaching, wildcardFind
protected static double LAPLACE
private static final long serialVersionUID
protected double[] counts
protected MotifFinder finder
protected java.util.HashMap<java.lang.String,java.lang.Double> model
protected boolean useVarLen
public MarkovModel(MotifFinder finder)
public MarkovModel(java.lang.String sequence, int geneLength)
Calls super(String, int).
public MarkovModel(java.lang.String seqID, java.lang.String sequence, int geneLength)
Calls super(String, String, int).
public static MarkovModel makeMarkovModel(Motif[] motifs)
public static MarkovModel makeMarkovModel(java.lang.String[] sequences)
public void clear()
public java.lang.Object clone()
clone
in class MotifFinder
public double computeEntropy()
public double computeEntropy(int order)
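The summary for computeEntropy(int) gives the formula H = -SUM[p lg p]. A minimal sketch of that formula over the empirical distribution of fixed-length substrings is shown below. The class EntropySketch and its method are hypothetical; the sketch takes the substring length k directly, since how an order maps to a motif length in MarkovModel is not stated here, and it reads "lg" as the base-2 logarithm.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of MarkovModel. */
final class EntropySketch {
    /** Shannon entropy, in bits, of the empirical distribution of length-k substrings. */
    static double entropyOfKmers(String sequence, int k) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + k <= sequence.length(); i++) {
            counts.merge(sequence.substring(i, i + k), 1, Integer::sum);
            total++;
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2.0)); // lg p, taken as log base 2
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropyOfKmers("ACGTACGTTTGACA", 2));
    }
}
```

Four equally frequent single bases give the maximum of 2 bits, and a sequence made of one repeated base gives 0.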
public int count(Motif m)
count
in class MotifFinder
public int count(Motif m, int k)
MotifFinder
count
in class MotifFinder
java.lang.IllegalArgumentException - because this method hasn't been implemented.
public int count(java.lang.String seq)
count
in class MotifFinder
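One common way to turn a string probability into an expected number of occurrences is to multiply it by the number of positions at which the string could start. The sketch below illustrates that calculation only; whether count(Motif) and count(String) use exactly this formula, and how they round to int, is not stated in this documentation. The class and parameter names are hypothetical.

```java
/** Hypothetical helper for illustration; not part of MarkovModel. */
final class ExpectedCountSketch {
    /**
     * Expected occurrences = (number of start positions) * P(motif),
     * where the probability is supplied in natural-log space.
     */
    static double expectedOccurrences(double logProbability, int sequenceLength, int motifLength) {
        int windows = sequenceLength - motifLength + 1;
        if (windows <= 0) {
            return 0.0; // the motif does not fit in the sequence
        }
        return windows * Math.exp(logProbability);
    }

    public static void main(String[] args) {
        // A single base with probability 0.25 in a sequence of 100 bases.
        System.out.println(expectedOccurrences(Math.log(0.25), 100, 1));
    }
}
```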
public MotifFinder createNewInstance(java.lang.String seqID, java.lang.String sequence, int geneLength)
createNewInstance
in class MotifFinder
public double crossEntropy(MarkovModel MM)
public double crossEntropy(MarkovModel MM, Motif[] motifs, int start, int stop)
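The two-model crossEntropy methods are described as computing a cross entropy (KL Distance). For two discrete distributions p and q the two quantities are related by H(p, q) = H(p) + KL(p || q). The sketch below computes each term for distributions stored as plain maps, in natural-log units. It is an illustration under those assumptions: the class DivergenceSketch is hypothetical, and which of these quantities the MarkovModel methods actually return is not specified beyond the summary text.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of MarkovModel. */
final class DivergenceSketch {
    /** KL(p || q) = SUM p_i * ln(p_i / q_i); infinite if q lacks mass where p has some. */
    static double klDivergence(Map<String, Double> p, Map<String, Double> q) {
        double kl = 0.0;
        for (Map.Entry<String, Double> e : p.entrySet()) {
            double pi = e.getValue();
            if (pi == 0.0) {
                continue; // a zero-probability term contributes nothing
            }
            Double qi = q.get(e.getKey());
            if (qi == null || qi == 0.0) {
                return Double.POSITIVE_INFINITY;
            }
            kl += pi * Math.log(pi / qi);
        }
        return kl;
    }

    /** H(p) = -SUM p_i * ln(p_i). */
    static double entropy(Map<String, Double> p) {
        double h = 0.0;
        for (double pi : p.values()) {
            if (pi > 0.0) {
                h -= pi * Math.log(pi);
            }
        }
        return h;
    }

    /** H(p, q) = H(p) + KL(p || q). */
    static double crossEntropy(Map<String, Double> p, Map<String, Double> q) {
        return entropy(p) + klDivergence(p, q);
    }
}
```

The KL term is zero when the two distributions are identical, so in that case the cross entropy reduces to the entropy of p.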
public double crossEntropy(Motif[] motifs)
Sum(-log(pr[motif_n])) / (n*ave_motif_len). This comes from Jurafsky & Martin's Speech and Language Processing textbook.
public double crossEntropy(java.lang.String[] motifs)
Sum(-log(pr[motif_n])) / (n*ave_motif_len). This comes from Jurafsky & Martin's Speech and Language Processing textbook.
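The formula given for crossEntropy(String[]) reads as a per-symbol average: the negative log probabilities of the motifs are summed and divided by the total number of symbols, which equals n times the average motif length. A minimal sketch of that reading follows. The class name and the ToDoubleFunction parameter, which stands in for a call such as logProbabilityOf(String), are assumptions made for the example.

```java
import java.util.function.ToDoubleFunction;

/** Hypothetical helper for illustration; not part of MarkovModel. */
final class MotifCrossEntropySketch {
    /**
     * Sum of -ln P(motif) over all motifs, divided by the total symbol count
     * (n * average motif length).
     */
    static double perSymbolCrossEntropy(String[] motifs, ToDoubleFunction<String> logProbability) {
        double negLogSum = 0.0;
        long totalLength = 0;
        for (String m : motifs) {
            negLogSum -= logProbability.applyAsDouble(m);
            totalLength += m.length();
        }
        if (totalLength == 0) {
            throw new IllegalArgumentException("no symbols to score");
        }
        return negLogSum / totalLength;
    }

    public static void main(String[] args) {
        // Assumed uniform background: every base has probability 0.25.
        ToDoubleFunction<String> uniform = s -> s.length() * Math.log(0.25);
        System.out.println(perSymbolCrossEntropy(new String[] {"ACG", "TT"}, uniform));
    }
}
```

Under a uniform four-letter background every symbol costs ln 4, so the result is ln 4 regardless of which motifs are scored.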
public int[] find(Motif m, int k)
MotifFinder
find
in class MotifFinder
java.lang.IllegalArgumentException - because this method hasn't been implemented.
public int[] find(java.lang.String seq)
find
in class MotifFinder
public FindResults findResults(java.lang.String pattern)
findResults
in class MotifFinder
public void generateModel(int order)
public int getHighestOrder()
public char getRandomBase()
public char getRandomBase(java.lang.String history)
protected void initializeDataStructure()
initializeDataStructure
in class MotifFinder
public double logProbabilityOf(Motif m)
public double logProbabilityOf(java.lang.String seq)
public java.lang.String makeRandomSequence(int length, int order)
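Generating a random string from a Markov model of a given order amounts to repeatedly drawing the next base conditioned on the most recent bases, which is also what getRandomBase(String history) describes for a single step. The sketch below shows one way to do this from raw substring counts. It is not the MarkovModel implementation: the class RandomSequenceSketch, the four-letter alphabet, and the add-one weights are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper for illustration; not part of MarkovModel. */
final class RandomSequenceSketch {
    private static final char[] ALPHABET = {'A', 'C', 'G', 'T'}; // assumed alphabet

    /** Counts every substring of length 1..order+1 in the training text. */
    static Map<String, Integer> countContexts(String training, int order) {
        Map<String, Integer> counts = new HashMap<>();
        for (int len = 1; len <= order + 1; len++) {
            for (int i = 0; i + len <= training.length(); i++) {
                counts.merge(training.substring(i, i + len), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Draws one base given the history, using add-one weights so every base stays possible. */
    static char drawBase(Map<String, Integer> counts, String history, Random rng) {
        double[] weights = new double[ALPHABET.length];
        double total = 0.0;
        for (int i = 0; i < ALPHABET.length; i++) {
            weights[i] = counts.getOrDefault(history + ALPHABET[i], 0) + 1.0;
            total += weights[i];
        }
        double r = rng.nextDouble() * total;
        for (int i = 0; i < ALPHABET.length; i++) {
            r -= weights[i];
            if (r < 0) {
                return ALPHABET[i];
            }
        }
        return ALPHABET[ALPHABET.length - 1]; // guard against floating-point rounding
    }

    /** Builds a string of the requested length, conditioning each draw on up to 'order' previous bases. */
    static String generate(Map<String, Integer> counts, int order, int length, Random rng) {
        StringBuilder out = new StringBuilder();
        while (out.length() < length) {
            String history = out.substring(Math.max(0, out.length() - order));
            out.append(drawBase(counts, history, rng));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countContexts("ACGTACGTTGCAACGT", 2);
        System.out.println(generate(counts, 2, 30, new Random(42L)));
    }
}
```

Passing a seeded Random makes the output reproducible, which is convenient when random sequences serve as a background for comparison.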
public double modelProbability(java.lang.String seq)
public int numMotifs(int len)
public double probabilityOf(char x, java.lang.String seq)
public double probabilityOf(Motif m)
public double probabilityOf(java.lang.String seq)
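The probabilityOf methods convert a log probability into real space. The sketch below illustrates why the log-space methods are often the safer choice for long strings: a product of many small probabilities underflows to zero in double precision, while the sum of their logarithms stays finite. The class and method names are hypothetical and the numbers are only an example.

```java
/** Hypothetical helper for illustration; not part of MarkovModel. */
final class LogSpaceSketch {
    /** Multiplies p by itself n times in real space. */
    static double directProduct(double p, int n) {
        double product = 1.0;
        for (int i = 0; i < n; i++) {
            product *= p;
        }
        return product;
    }

    /** Adds ln(p) n times, i.e. the same quantity in natural-log space. */
    static double logSum(double p, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += Math.log(p);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(directProduct(0.25, 1000)); // underflows in real space
        System.out.println(logSum(0.25, 1000));        // stays finite in log space
        System.out.println(Math.exp(logSum(0.25, 3))); // converting back is fine for short strings
    }
}
```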
public java.lang.String sampleRandomSequence(int length)
protected void setNums()
public void setUseVarLen(boolean b)
public java.lang.String toString()
toString
in class MotifFinder
private double varLenLogProbabilityOf(java.lang.String seq)