public abstract class MotifFinder
extends java.lang.Object
implements java.lang.Cloneable, java.io.Serializable
Bipartite lookups can be sped up by calling setWildcardOptimization(int) and by loading serialized bipartite caches using loadBipartiteCaches().
Motif finders are serializable. The variables discardOverlaps and wildcardOptimizationCutoff are transient, and upon deserialization are set to false and NO_WILDCARD_OPTIMIZATION, respectively. The caching variables are also transient. Caching is turned off upon deserialization.
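The deserialization behavior described above follows the standard Java pattern of resetting transient fields in a private readObject method. The following is a minimal sketch of that pattern, not the actual MotifFinder source: the class name TransientResetSketch, the roundTrip helper, and the value -1 for NO_WILDCARD_OPTIMIZATION are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a motif finder; only the transient handling is modelled. */
final class TransientResetSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Assumed value; the real constant's value is not given in this documentation. */
    static final int NO_WILDCARD_OPTIMIZATION = -1;

    private final String sequence;                // serialized normally
    private transient boolean discardOverlaps;    // not written to the stream
    private transient int wildcardOptimizationCutoff = NO_WILDCARD_OPTIMIZATION;

    TransientResetSketch(String sequence) {
        this.sequence = sequence;
    }

    String getSequence() { return sequence; }
    boolean getDiscardOverlaps() { return discardOverlaps; }
    int getWildcardOptimizationCutoff() { return wildcardOptimizationCutoff; }
    void setDiscardOverlaps(boolean discard) { discardOverlaps = discard; }
    void setWildcardOptimization(int cutoff) { wildcardOptimizationCutoff = cutoff; }

    /** Restores non-transient state, then resets the transient fields explicitly. */
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        discardOverlaps = false;
        wildcardOptimizationCutoff = NO_WILDCARD_OPTIMIZATION;
    }

    /** Serializes to memory and reads the object back (helper for demonstration only). */
    static TransientResetSketch roundTrip(TransientResetSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TransientResetSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Without the explicit assignment in readObject, a transient int would come back as 0 rather than as the sentinel, which is why the reset is needed. Callers that rely on overlap discarding or wildcard optimization presumably need to call the setters again after deserializing.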
Each motif finder keeps track of a geneLength, which represents the distance between break characters in the biological sequence. This should be set to UNEQUAL_GENE_LENGTHS if the breaks are not evenly spaced. (The main purpose of forcing genes to be the same length is to allow the use of KS statistics in edu.dartmouth.bglab.score, but having genes of the same length is also useful in motif finders for speeding up wildcard searches.) Note that because break characters are added between genes in the biological sequence, each of the biological genes should be extended or clipped to be geneLength-1 bases long, NOT geneLength bases long.
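The geneLength-1 rule can be illustrated with a small sketch that builds a sequence with evenly spaced breaks. This is an illustration under stated assumptions, not the library's own code: the class GenePaddingSketch, the characters '|' and 'X' (standing in for Alphabet.SEQUENCE_BREAK and Alphabet.FILLER, whose actual values are not given here), padding or clipping at the end of the gene, and a break after every gene are all assumptions.

```java
import java.util.List;

/** Hypothetical helper that lays out genes so breaks fall every geneLength characters. */
final class GenePaddingSketch {
    static final char BREAK = '|';   // assumed stand-in for Alphabet.SEQUENCE_BREAK
    static final char FILLER = 'X';  // assumed stand-in for Alphabet.FILLER

    /** Clips or pads a gene to exactly geneLength - 1 bases. */
    static String fit(String gene, int geneLength) {
        if (geneLength < 2) {
            throw new IllegalArgumentException("geneLength must leave room for at least one base");
        }
        int target = geneLength - 1;
        if (gene.length() >= target) {
            return gene.substring(0, target);
        }
        StringBuilder padded = new StringBuilder(gene);
        while (padded.length() < target) {
            padded.append(FILLER);
        }
        return padded.toString();
    }

    /** Concatenates the fitted genes, each followed by one break character. */
    static String build(List<String> genes, int geneLength) {
        StringBuilder sequence = new StringBuilder();
        for (String gene : genes) {
            sequence.append(fit(gene, geneLength)).append(BREAK);
        }
        return sequence.toString();
    }
}
```

With geneLength 5, each gene contributes 4 bases plus 1 break character, so the break characters land at indices 4, 9, 14, and so on. Fitting the genes to geneLength bases instead would push the breaks geneLength+1 apart and no longer match the declared geneLength.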
Subclasses need to implement the following methods based on the specific data structures they use to perform searches:
public <constructor>(String sequence, int geneLength)
public <constructor>(String seqID, String sequence, int geneLength)
protected void initializeDataStructure()
public MotifFinder createNewInstance(String seqID, String sequence, int geneLength)
protected int count(String pattern)
protected int[] find(String pattern)
public int count(Motif m, int k)
public int[] find(Motif m, int k)
public Object clone()
Subclasses should also deal with any special work that needs to be done in serializing and deserializing their own member data.
Modifier and Type | Class and Description |
---|---|
static class | MotifFinder.EmptyFindResults |
private static class | MotifFinder.FindResultsSoftValueHashMap |
private static class | MotifFinder.GenomeSoftValueHashMap |
static class | MotifFinder.UncompressedFindResults |
Modifier and Type | Field and Description |
---|---|
private int | baseCount - Number of bases in the sequence. |
static int | BIPARTITE_INIT_CACHE_SIZE - When initializing bipartite cache, specifies the initial cache size. |
static int | BIPARTITE_INIT_LEN - When initializing bipartite cache, specifies the number of non-N bases to initialize with. |
static int | BIPARTITE_INIT_N_COUNT_MAX - When initializing bipartite cache, specifies the maximum number of Ns to initialize with. |
static int | BIPARTITE_INIT_N_COUNT_MIN - When initializing bipartite cache, specifies the minimum number of Ns to initialize with. |
private static java.util.regex.Pattern | bipartiteMotifPattern |
static boolean | checkMotifFinderResults |
protected java.util.HashMap<java.lang.String,java.lang.Integer> | countCache - To speed up lookups in count(Motif) when caching is turned on. |
private static long | cumulativeDeltaHybridSearchMinusSimpleFind |
protected static int | DEFAULT_MAX_HASH_LENGTH |
protected boolean | discardOverlaps - True if overlapping motifs are not to be included by count() and find() methods. |
private static boolean | doingFindBipartiteHitsComparisonTiming |
static boolean | doPersisting |
static java.lang.String | ExtraGeneMotifFinderIdentifier |
static MotifFinder.FindResultsSoftValueHashMap | extraGeneSoftCache |
protected int | geneLength - The distance between break characters in the biological sequence; set to UNEQUAL_GENE_LENGTHS if the breaks are not evenly spaced. |
protected int[] | geneStarts - Indices in sequence of gene starts. |
static MotifFinder.FindResultsSoftValueHashMap | genomeSoftCache |
static MotifFinder.FindResultsSoftValueHashMap | groupSoftCache |
protected int | maxHashLength - Maximum size for caches. |
static int | NO_WILDCARD_OPTIMIZATION |
private static java.util.regex.Pattern | partOfGene |
(package private) static java.util.Random | rand |
private static long | regexSearchCounter |
private static long | regexSearchCounterReportInterval |
private static int | regexSearchThreshold |
protected java.lang.String | seqID - The ID of the biological sequence; used for determining the file name for the serialized bipartite caches. |
protected java.lang.String | sequence - The biological sequence. |
(package private) static long | serialVersionUID |
protected java.util.HashMap<java.lang.String,int[]> | simpleFindCache - To speed up lookups in simpleFind(Motif) when caching is turned on. |
static java.lang.String | TemporaryMotifFinderIdentifier |
static int | UNEQUAL_GENE_LENGTHS |
protected int | wildcardOptimizationCutoff - The minimum number of consecutive N's that must be in a motif to cause wildcard optimization to be used in find(Motif); set to NO_WILDCARD_OPTIMIZATION to turn off wildcard optimization completely. |
Modifier | Constructor and Description |
---|---|
protected | MotifFinder() |
protected | MotifFinder(java.lang.String sequence, int geneLength) - Same as MotifFinder(String, String, int), but sets seqID to null. |
protected | MotifFinder(java.lang.String seqID, java.lang.String sequence, int geneLength) - Sets seqID, sequence, geneLength, and geneStarts; sets discardOverlaps to false and wildcardOptimizationCutoff to NO_WILDCARD_OPTIMIZATION, and then calls initializeDataStructure(). |
Modifier and Type | Method and Description |
---|---|
private static void | addHitsUsingBestMethod(java.util.Collection<java.lang.Integer> hitsList, MotifFinder pMotifFinder, Motif motif, java.util.regex.Matcher bipartiteMatcher) |
private static void | addHitsUsingBothEnds(java.util.Collection<java.lang.Integer> hitsList, MotifFinder pMotifFinder, Motif motif, java.util.regex.Matcher bipartiteMatcher) |
private static void | addHitsUsingOneEndAndRegex(java.util.Collection<java.lang.Integer> hitsList, MotifFinder pMotifFinder, Motif motif, java.util.regex.Matcher bipartiteMatcher) |
private static void | addHitsUsingRegex(java.util.Collection<java.lang.Integer> hitsList, MotifFinder pMotifFinder, Motif motif) |
protected boolean | areOnSameGene(int pos1, int pos2) - Returns true if pos1 and pos2 are on the same gene, where pos1 <= pos2. |
java.lang.Object | clone() - Returns a copy of this motif finder. |
private int[] | combineHitsToBipartiteHits(int[] sm1, int[] sm2, int sm2Pos) - Combines sorted arrays of hits for the first and second submotifs of a bipartite motif, returning a sorted array of the bipartite motifs where the distance between the starts of the first and second submotifs equals sm2Pos, and they are located in the same biological gene. |
int | count(Motif m) - Returns the number of times that m occurs in the sequence. |
int | count(Motif m, int k) - Returns the number of times that any motif with a Hamming distance from m of up to k occurs in the sequence. |
int | count(MotifList ml) - Returns the sum of the counts of all motifs in ml; no checks are made for the presence of duplicate motifs. |
int | count(java.lang.String pattern) - Returns the number of times that pattern occurs in the sequence. |
int | countGenes(Motif m) - Returns the number of genes m is in. |
static FindResults | createFindResults(MotifFinder pMotifFinder, Motif motif, java.util.regex.Matcher forwardBipartiteMatcher, boolean useRegexForNonBipartiteSearches) |
private static FindResults | createFindResultsUsingRegex(MotifFinder pMotifFinder, Motif motif) |
MotifFinder | createNewInstance(java.lang.String sequence, int geneLength) - Same as createNewInstance(String, String, int), but sets seqID to null. |
abstract MotifFinder | createNewInstance(java.lang.String seqID, java.lang.String sequence, int geneLength) - Returns a new motif finder with the same runtime type, discardOverlaps status, and wildcardOptimizationCutoff as the calling instance. |
int[] | find(Motif m) - Returns a SORTED array of the indices of where m occurs in the sequence. |
int[] | find(Motif m, int k) - Returns a sorted array of the indices of where any motif with a Hamming distance from m of up to k occurs in the sequence. |
int[] | find(MotifList ml) - Returns the SORTED result of calling find(Motif) on all the motifs of the given MotifList; no checks are made for duplicate motifs. |
abstract int[] | find(java.lang.String pattern) - Returns an array of the indices of where pattern occurs in the sequence. |
FindResults | findResults(Motif m) |
FindResults | findResults(MotifList motifList) |
abstract FindResults | findResults(java.lang.String pattern) |
float | getAveragePercentageLengthVariation() |
double | getAvgGeneLength() - Returns the average gene length, not including Alphabet.FILLER characters or Alphabet.SEQUENCE_BREAK characters. |
int | getBaseCount() - Returns the number of bases in the sequence. |
int | getCountCacheSize() - Returns the number of elements currently in the cache. |
boolean | getDiscardOverlaps() - Gets the value of discardOverlaps. |
int | getEndIndexOf(int geneID) - Returns the index in the sequence of the break character at the end of the specified gene; gene IDs start at 1. |
int | getFindCacheSize() - Returns the number of elements currently in the cache. |
static FindResults | getFindResults(MotifFinder mf, Motif motif) |
int | getGeneCount() - Returns the number of genes in the sequence. |
int | getGeneIDOf(int seqIndex) - Returns the ID of the gene on which the given index in the sequence is located; these IDs start at 1. |
java.lang.String | getGeneSequence(int geneID) |
int | getInternalGeneLength() - For use in creating new MotifFinders; returns the internal geneLength. |
int | getPositionInGene(int seqIndex) |
java.lang.String | getSequence() - Returns the biological sequence. |
int | getSequenceLength() |
private static MotifFinder.FindResultsSoftValueHashMap | getSoftCache(MotifFinder motifFinder) |
private static FindResults | getSoftCachedFindResults(MotifFinder motifFinder, java.lang.String key, boolean usePermanentCache) |
int | getWildcardOptimizationCutoff() - Returns the wildcardOptimizationCutoff. |
boolean | hasUniformGeneLengths() - Returns true if the genes are all the same length. |
protected void | initialize(java.lang.String seqID, java.lang.String sequence, int geneLength) - Sets seqID, sequence, geneLength, and geneStarts; sets discardOverlaps to false and wildcardOptimizationCutoff to NO_WILDCARD_OPTIMIZATION, and then calls initializeDataStructure(). |
private void | initializeBipartiteCaches() - Turns on caching, then computes count and find to initialize the caching appropriately for bipartite searching. |
protected abstract void | initializeDataStructure() - Subclasses should implement to initialize any data structures used to perform searches; this method is called automatically by the MotifFinder constructor. |
boolean | isCachingOn() |
void | loadBipartiteCaches() - Turns on caching, fills the find cache with all of the 2-mers and 3-mers, and fills the count cache with bipartite versions, containing from BIPARTITE_INIT_N_COUNT_MIN to BIPARTITE_INIT_N_COUNT_MAX N's in the middle, of all of the BIPARTITE_INIT_LEN-mers. |
private void | readObject(java.io.ObjectInputStream in) - Calls in.defaultReadObject() and sets discardOverlaps to false and wildcardOptimizationCutoff to NO_WILDCARD_OPTIMIZATION. |
static void | resetExtraGeneSoftCache() |
void | setDiscardOverlaps(boolean discard) - Sets discardOverlaps to discard. |
void | setWildcardOptimization(int cutoff) - Sets wildcardOptimizationCutoff to cutoff. |
protected int[] | simpleFind(Motif m) - Returns a SORTED array of the indices of where m occurs in the sequence. |
private static void | softCacheHits(MotifFinder motifFinder, java.lang.String key, FindResults findResults, boolean usePermanentCache) |
int | strandCount(Motif m) - Counts the number of times m occurs on each strand, if and only if m.useRevComp() is true. |
int | strandCount(MotifList ml) - Returns the sum of the strand counts of all motifs in ml; no checks are made for the presence of duplicate motifs. |
java.lang.String | toString() |
void | turnOffCaching() - Turns off caching. |
void | turnOnCaching() - Starts caching, with an estimate of the number of elements that will be cached. |
void | turnOnCaching(int initSize) |
void | turnOnCaching(int initSize, int maxHashableLength) - Starts caching with an estimate of the number of elements that will be cached. |
protected int[] | wildcardFind(Motif m) - Returns a SORTED array of the indices of where m occurs in the sequence. |
public static final int BIPARTITE_INIT_CACHE_SIZE
public static final int BIPARTITE_INIT_LEN
public static final int BIPARTITE_INIT_N_COUNT_MAX
public static final int BIPARTITE_INIT_N_COUNT_MIN
private static final java.util.regex.Pattern bipartiteMotifPattern
public static boolean checkMotifFinderResults
private static long cumulativeDeltaHybridSearchMinusSimpleFind
protected static final int DEFAULT_MAX_HASH_LENGTH
private static boolean doingFindBipartiteHitsComparisonTiming
public static boolean doPersisting
public static final java.lang.String ExtraGeneMotifFinderIdentifier
public static transient MotifFinder.FindResultsSoftValueHashMap extraGeneSoftCache
public static transient MotifFinder.FindResultsSoftValueHashMap genomeSoftCache
public static transient MotifFinder.FindResultsSoftValueHashMap groupSoftCache
public static final int NO_WILDCARD_OPTIMIZATION
private static final java.util.regex.Pattern partOfGene
static java.util.Random rand
static final long serialVersionUID
public static final java.lang.String TemporaryMotifFinderIdentifier
public static final int UNEQUAL_GENE_LENGTHS
private static final long regexSearchCounterReportInterval
private static long regexSearchCounter
private static final int regexSearchThreshold
private int baseCount
protected transient java.util.HashMap<java.lang.String,java.lang.Integer> countCache
To speed up lookups in count(Motif) when caching is turned on.
protected transient boolean discardOverlaps
protected int geneLength
The distance between break characters in the biological sequence; set to UNEQUAL_GENE_LENGTHS if the breaks are not evenly spaced. Note that because break characters are added between genes in the biological sequence, each of the biological genes should be extended or clipped to be geneLength-1 bases long, NOT geneLength bases long.
protected int[] geneStarts
protected transient int maxHashLength
protected java.lang.String seqID
protected java.lang.String sequence
protected transient java.util.HashMap<java.lang.String,int[]> simpleFindCache
To speed up lookups in simpleFind(Motif) when caching is turned on.
protected transient int wildcardOptimizationCutoff
The minimum number of consecutive N's that must be in a motif to cause wildcard optimization to be used in find(Motif); set to NO_WILDCARD_OPTIMIZATION to turn off wildcard optimization completely.
protected MotifFinder()
protected MotifFinder(java.lang.String sequence, int geneLength)
Same as MotifFinder(String, String, int), but sets seqID to null.
protected MotifFinder(java.lang.String seqID, java.lang.String sequence, int geneLength)
Sets seqID, sequence, geneLength, and geneStarts; sets discardOverlaps to false and wildcardOptimizationCutoff to NO_WILDCARD_OPTIMIZATION, and then calls initializeDataStructure().
private static void addHitsUsingBestMethod(java.util.Collection<java.lang.Integer> hitsList, MotifFinder pMotifFinder, Motif motif, java.util.regex.Matcher bipartiteMatcher)
private static void addHitsUsingBothEnds(java.util.Collection<java.lang.Integer> hitsList, MotifFinder pMotifFinder, Motif motif, java.util.regex.Matcher bipartiteMatcher)
private static void addHitsUsingOneEndAndRegex(java.util.Collection<java.lang.Integer> hitsList, MotifFinder pMotifFinder, Motif motif, java.util.regex.Matcher bipartiteMatcher)
private static void addHitsUsingRegex(java.util.Collection<java.lang.Integer> hitsList, MotifFinder pMotifFinder, Motif motif)
public static FindResults createFindResults(MotifFinder pMotifFinder, Motif motif, java.util.regex.Matcher forwardBipartiteMatcher, boolean useRegexForNonBipartiteSearches)
private static FindResults createFindResultsUsingRegex(MotifFinder pMotifFinder, Motif motif)
public static FindResults getFindResults(MotifFinder mf, Motif motif)
private static MotifFinder.FindResultsSoftValueHashMap getSoftCache(MotifFinder motifFinder)
private static FindResults getSoftCachedFindResults(MotifFinder motifFinder, java.lang.String key, boolean usePermanentCache)
public static void resetExtraGeneSoftCache()
private static void softCacheHits(MotifFinder motifFinder, java.lang.String key, FindResults findResults, boolean usePermanentCache)
protected boolean areOnSameGene(int pos1, int pos2)
public java.lang.Object clone()
Overrides: clone in class java.lang.Object
private int[] combineHitsToBipartiteHits(int[] sm1, int[] sm2, int sm2Pos)
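The summary for combineHitsToBipartiteHits describes merging two sorted hit arrays. The following is one way such a merge could look, written as a hedged sketch rather than the actual private implementation: the class BipartiteCombineSketch is hypothetical, the extra geneLength parameter and the division-based same-gene test assume uniform gene lengths, only the two start positions are checked against the gene boundary, and both arrays are assumed to be strictly increasing.

```java
import java.util.Arrays;

/** Hypothetical illustration of combining submotif hits into bipartite hits. */
final class BipartiteCombineSketch {

    /**
     * Returns the sorted starts from sm1 for which sm2 contains a hit exactly
     * sm2Pos positions later and both starts fall in the same gene.
     * Assumes sm1 and sm2 are sorted with no duplicates and that genes begin
     * at multiples of geneLength.
     */
    static int[] combine(int[] sm1, int[] sm2, int sm2Pos, int geneLength) {
        int[] out = new int[Math.min(sm1.length, sm2.length)];
        int n = 0;
        int j = 0; // cursor into sm2; only ever moves forward
        for (int start : sm1) {
            int wanted = start + sm2Pos;
            while (j < sm2.length && sm2[j] < wanted) {
                j++;
            }
            if (j == sm2.length) {
                break; // no later sm2 hit can match any later sm1 hit
            }
            if (sm2[j] == wanted && start / geneLength == wanted / geneLength) {
                out[n++] = start;
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

Because both inputs are sorted, the cursor into sm2 never moves backwards, so the merge is linear in the combined length of the two arrays and the output is already sorted.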
public int count(Motif m)
Returns the number of times that m occurs in the sequence. If discardOverlaps is set to true, overlapping motifs are not counted. If hashing is on, searches the hash for a previous lookup.
public int count(Motif m, int k)
Throws: java.lang.IllegalArgumentException - If this method hasn't been implemented
public int count(MotifList ml)
public int count(java.lang.String pattern)
public int countGenes(Motif m)
public final MotifFinder createNewInstance(java.lang.String sequence, int geneLength)
Same as createNewInstance(String, String, int), but sets seqID to null.
public abstract MotifFinder createNewInstance(java.lang.String seqID, java.lang.String sequence, int geneLength)
Returns a new motif finder with the same runtime type, discardOverlaps status, and wildcardOptimizationCutoff as the calling instance.
public int[] find(Motif m)
Returns a SORTED array of the indices of where m occurs in the sequence. If discardOverlaps is set to true, overlapping motifs are not included. If wildcard optimization is turned on, lookups are optimized for sequences with substrings of at least wildcardOptimizationCutoff N's in the middle.
public int[] find(Motif m, int k)
Throws: java.lang.IllegalArgumentException - If this method hasn't been implemented
public int[] find(MotifList ml)
Returns the SORTED result of calling find(Motif) on all the motifs of the given MotifList; no checks are made for duplicate motifs.
public abstract int[] find(java.lang.String pattern)
public FindResults findResults(Motif m)
public FindResults findResults(MotifList motifList)
public abstract FindResults findResults(java.lang.String pattern)
public float getAveragePercentageLengthVariation()
public double getAvgGeneLength()
Returns the average gene length, not including Alphabet.FILLER characters or Alphabet.SEQUENCE_BREAK characters.
public int getBaseCount()
public int getCountCacheSize()
public boolean getDiscardOverlaps()
Gets the value of discardOverlaps.
public int getEndIndexOf(int geneID)
public int getFindCacheSize()
public int getGeneCount()
public int getGeneIDOf(int seqIndex)
public java.lang.String getGeneSequence(int geneID)
public int getInternalGeneLength()
public int getPositionInGene(int seqIndex)
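For sequences with uniform gene lengths, the index accessors getGeneIDOf, getEndIndexOf, and getPositionInGene can presumably be reduced to simple arithmetic on geneLength. The sketch below shows that arithmetic under explicit assumptions: the class GeneIndexSketch is hypothetical, gene 1 is assumed to start at sequence index 0, each gene's trailing break character is treated as belonging to that gene, and positions within a gene are assumed to be 0-based. The real methods work from geneStarts and may differ, particularly when geneLength is UNEQUAL_GENE_LENGTHS.

```java
/** Hypothetical index arithmetic for a sequence whose breaks are geneLength apart. */
final class GeneIndexSketch {

    /** Gene IDs start at 1, matching the documented convention. */
    static int geneIdOf(int seqIndex, int geneLength) {
        return seqIndex / geneLength + 1;
    }

    /** Index of the break character that ends the given gene. */
    static int endIndexOf(int geneId, int geneLength) {
        return geneId * geneLength - 1;
    }

    /** Offset of seqIndex from the start of its gene (0-based by assumption). */
    static int positionInGene(int seqIndex, int geneLength) {
        return seqIndex % geneLength;
    }
}
```

For a sequence such as ACGT|ACXX| with geneLength 5 (where | is an illustrative break character), index 7 lies in gene 2 at offset 2, and gene 2 ends with the break character at index 9.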
public java.lang.String getSequence()
public int getSequenceLength()
public int getWildcardOptimizationCutoff()
Returns the wildcardOptimizationCutoff.
public boolean hasUniformGeneLengths()
protected final void initialize(java.lang.String seqID, java.lang.String sequence, int geneLength)
Sets seqID, sequence, geneLength, and geneStarts; sets discardOverlaps to false and wildcardOptimizationCutoff to NO_WILDCARD_OPTIMIZATION, and then calls initializeDataStructure().
private void initializeBipartiteCaches()
protected abstract void initializeDataStructure()
public boolean isCachingOn()
public void loadBipartiteCaches()
Turns on caching, fills the find cache with all of the 2-mers and 3-mers, and fills the count cache with bipartite versions, containing from BIPARTITE_INIT_N_COUNT_MIN to BIPARTITE_INIT_N_COUNT_MAX N's in the middle, of all of the BIPARTITE_INIT_LEN-mers. These caches are serialized and stored to file so that they only need to be created using count() and find() the first time this method is called for the given sequence.
private void readObject(java.io.ObjectInputStream in) throws java.io.IOException, java.lang.ClassNotFoundException
Calls in.defaultReadObject() and sets discardOverlaps to false and wildcardOptimizationCutoff to NO_WILDCARD_OPTIMIZATION.
Throws: java.io.IOException, java.lang.ClassNotFoundException
public void setDiscardOverlaps(boolean discard)
Sets discardOverlaps to discard.
public void setWildcardOptimization(int cutoff)
Sets wildcardOptimizationCutoff to cutoff.
protected int[] simpleFind(Motif m)
public int strandCount(Motif m)
Counts the number of times m occurs on each strand, if and only if m.useRevComp() is true; otherwise the result is that of count(Motif). In the case where m does use the reverse complement, searching on both strands is different from simply calling count(m) in that palindromes are double counted. This is justified by the fact that in a palindrome, the motif really does occur once on each strand, which (theoretically) doubles the probability that a protein will come in contact with and bind to that motif.
public int strandCount(MotifList ml)
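The palindrome double counting described for strandCount(Motif) can be illustrated with plain strings. This is a sketch under assumptions, not the library's implementation: the class StrandCountSketch is hypothetical, motifs are treated as literal A/C/G/T strings with no wildcard handling, overlapping hits are counted, and countSites models the assumption that a reverse-complement-aware count(m) counts each matching position once.

```java
/** Hypothetical comparison of per-site counting and per-strand counting. */
final class StrandCountSketch {

    static String reverseComplement(String motif) {
        StringBuilder sb = new StringBuilder(motif.length());
        for (int i = motif.length() - 1; i >= 0; i--) {
            sb.append(complement(motif.charAt(i)));
        }
        return sb.toString();
    }

    private static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return base; // leave other symbols (such as N) unchanged
        }
    }

    /** Counts (possibly overlapping) occurrences of pattern on the forward strand. */
    static int countForward(String sequence, String pattern) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        int hits = 0;
        for (int i = sequence.indexOf(pattern); i >= 0; i = sequence.indexOf(pattern, i + 1)) {
            hits++;
        }
        return hits;
    }

    /** Counts positions matching the motif or its reverse complement, once per position. */
    static int countSites(String sequence, String motif) {
        String rc = reverseComplement(motif);
        int hits = 0;
        for (int i = 0; i + motif.length() <= sequence.length(); i++) {
            if (sequence.startsWith(motif, i) || sequence.startsWith(rc, i)) {
                hits++;
            }
        }
        return hits;
    }

    /** Counts each strand separately, so a palindromic site contributes twice. */
    static int strandCount(String sequence, String motif) {
        return countForward(sequence, motif)
             + countForward(sequence, reverseComplement(motif));
    }
}
```

For the palindromic motif GAATTC, the two counts differ because the motif equals its own reverse complement; for a non-palindromic motif such as GTTT, they agree.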
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public void turnOffCaching()
public void turnOnCaching()
public void turnOnCaching(int initSize)
public void turnOnCaching(int initSize, int maxHashableLength)
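protected int[] wildcardFind(Motif m)
Returns a SORTED array of the indices of where m occurs in the sequence; used for motifs with at least wildcardOptimizationCutoff N's in the middle.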