DORM is a lightweight web tool for browsing recurrent mutations in human cancers identified from genome-wide screens (data sourced from COSMIC releases). Here is how to harness the power of regular expressions to craft advanced search queries for the DORM database.
Understanding how the search works behind the scenes may help you formulate better, fail-proof queries.
A query is processed by going through the data row by row (line by line) and searching for matching expressions. For example, if you search for BRAF lung, this will show BRAF mutations that are reported at least once in a sample from lung tissue.
A blank space in the query triggers “nesting” of the query parameters from left to right, and is useful when you want to narrow your search space.
Consider the expression:
KRAS G12 lung|pancreas
This search expression is processed by extracting all the rows containing the word KRAS, then the phrase G12 is searched in those results, followed by the regular expression lung|pancreas. In practice, this query shows all the KRAS G12 mutations that are reported in samples derived from either the lung or the pancreas.
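The left-to-right nesting described above can be sketched in a few lines of Python. This is only an illustration, not DORM's actual code: the rows are made-up stand-ins for database records, the function name nested_search is hypothetical, and case-sensitive matching is an assumption.

```python
import re

# Made-up rows standing in for DORM records (gene, mutation, tissue).
ROWS = [
    "KRAS G12D pancreas",
    "KRAS G12C lung",
    "KRAS G13D large_intestine",
    "BRAF V600E skin",
    "BRAF G469A lung",
]

def nested_search(query, rows):
    """Filter the rows once per space-separated term, left to right."""
    results = rows
    for term in query.split():
        regex = re.compile(term)
        # Each term only sees the rows that survived the previous terms.
        results = [row for row in results if regex.search(row)]
    return results

print(nested_search("KRAS G12 lung|pancreas", ROWS))
print(nested_search("BRAF lung", ROWS))
```

Each additional term can only shrink the result set, which is why adding terms narrows the search rather than widening it.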
Blank space is NOT translated to the OR-operator (i.e. |), as it is reserved to “nest” search queries (see the explanation of nesting above). Please use a comma or semicolon to delimit a list of protein names.
Comma (,) and semicolon (;) are provided as delimiters to search sets of proteins. These operators are translated to the OR-operator (i.e. |). The number of spaces after these symbols does not affect the search, i.e. KRAS,EGFR and KRAS, EGFR provide the same results.
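As a rough sketch of this translation step, assuming it amounts to a simple text substitution (the helper name delimiters_to_or is hypothetical and DORM's real implementation may differ):

```python
import re

def delimiters_to_or(query):
    """Replace each comma or semicolon, plus any spaces after it, with '|'."""
    return re.sub(r"[,;]\s*", "|", query)

print(delimiters_to_or("KRAS,EGFR"))
print(delimiters_to_or("KRAS, EGFR"))
print(delimiters_to_or("KRAS;  EGFR, BRAF"))
```

Removing the spaces that follow a delimiter matters because a leftover blank would otherwise be read as a nesting separator.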
The either-or clause:
You can use the pipe operator (i.e. |) for either-or queries, e.g.
KRAS|BRAF - mutations in either KRAS or BRAF proteins.
EGFR|ERBB - mutations in the proteins EGFR, ERBB2, ERBB3, ERBB4, etc.
N.B. ERBB matches ERBB2, 3, 4 and possibly other proteins that start with ERBB.
Specify inclusions and exclusions:
You can restrict matches with the [] operator, which lets you specify individual characters or ranges to include or exclude during searches. For example, searching NRG gives you several proteins, but if you only want the Neuregulin ligands, you can simply search NRG[1-4] to specifically search for the four Neuregulins. The expression [1-4] is evaluated as the set of digits [1234].
Additionally, you can exclude results using the [^] operator, e.g.
ERBB[^4] - matches ERBB2 and ERBB3 among the ERBBs and leaves out ERBB4.
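The bracket expressions above behave the same way in Python's re module, so they can be tried out on a small list of names. The list and the helper name matching are illustrative assumptions, not part of DORM.

```python
import re

# A small, hand-picked list of protein names used only for illustration.
NAMES = ["NRG1", "NRG2", "NRG3", "NRG4", "NRGN", "ERBB2", "ERBB3", "ERBB4"]

def matching(pattern, candidates):
    """Return the names in which the pattern is found anywhere."""
    regex = re.compile(pattern)
    return [name for name in candidates if regex.search(name)]

print(matching("NRG", NAMES))       # every name containing NRG
print(matching("NRG[1-4]", NAMES))  # only NRG followed by 1, 2, 3 or 4
print(matching("ERBB[^4]", NAMES))  # ERBB followed by any character except 4
```

Note that ERBB[^4] still requires some character after ERBB; it excludes the digit 4 rather than making the trailing character optional.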
Set word boundaries:
The word boundary operators \< and \> allow you to fix the start or end of a search term, e.g.
\<NRG - excludes proteins which do not begin with NRG
RAS\> - lists all the proteins ending in RAS
RAS\> [0-9]+C\> - lists the mutations in RAS oncogenes that create a change to Cysteine. Remember, blank space triggers nesting of search operations (see the explanation of nesting above).
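To experiment with these boundary queries outside DORM, one option is Python, with one caveat: Python's re module does not support \< or \>, so the sketch below rewrites both as \b, Python's generic word-boundary assertion. For the examples above the two behave the same. The helper names and sample data are hypothetical.

```python
import re

def to_python_regex(pattern):
    # Assumption: \< and \> can both be approximated by \b in Python.
    return pattern.replace(r"\<", r"\b").replace(r"\>", r"\b")

def matching(pattern, candidates):
    """Return the candidates in which the translated pattern is found."""
    regex = re.compile(to_python_regex(pattern))
    return [item for item in candidates if regex.search(item)]

def nested_search(query, rows):
    """Apply each space-separated term in turn, as in a nested query."""
    results = rows
    for term in query.split():
        results = matching(term, results)
    return results

NAMES = ["KRAS", "HRAS", "NRAS", "RASA1", "RRAS2"]
# Made-up rows (gene, mutation, tissue) for illustration only.
ROWS = ["KRAS G12C lung", "KRAS G12D pancreas", "NRAS Q61K skin", "RASA1 R398C lung"]

print(matching(r"RAS\>", NAMES))                # names ending in RAS
print(matching(r"\<RAS", NAMES))                # names beginning with RAS
print(nested_search(r"RAS\> [0-9]+C\>", ROWS))  # RAS genes, change to C
```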
Wildcards:
Substituting words, numbers etc. is very easy with these:
a. [0-9] or [[:digit:]] - for digits 0-9
b. [A-Za-z] or [[:alpha:]] - for lowercase [a-z] and uppercase [A-Z] letters
c. The dot . - matches any character
d. [[:space:]] - matches blank space
Match length modifiers:
The benefit of using regular expressions is the flexibility these wildcard characters offer when you do not know the exact length of matching text. These are the five main modifiers:
a. * - matches 0 or more times [the matching character can be either present or absent]
b. ? - matches at most 1 time (i.e. 0 or 1 times)
c. + - matches 1 or more times
d. {n} - matches exactly n times
e. {x,y} - matches at least x, and at most y times. Either x or y can be omitted to set only a lower or only an upper bound.
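The modifiers listed above have the same meaning in Python's re module, which makes it a convenient place to check a pattern before using it in a query. The sketch below uses re.fullmatch so that the whole string must match; the helper name whole_match and the sample strings are assumptions chosen for illustration.

```python
import re

def whole_match(pattern, text):
    """True if the pattern matches the entire text, not just a part of it."""
    return re.fullmatch(pattern, text) is not None

checks = [
    ("ERBB[0-9]*", "ERBB"),             # * : the digit may be absent
    ("ERBB[0-9]*", "ERBB2"),            # * : or present
    ("EGFR?", "EGF"),                   # ? : zero occurrences of R
    ("EGFR?", "EGFRR"),                 # ? : two occurrences is too many
    ("G[0-9]+C", "G12C"),               # + : one or more digits
    ("G[0-9]+C", "GC"),                 # + : at least one digit is required
    ("[A-Z][0-9]{3}[A-Z]", "V600E"),    # {n} : exactly three digits
    ("[A-Z][0-9]{3}[A-Z]", "G12D"),     # {n} : two digits do not qualify
    ("[A-Z][0-9]{2,3}[A-Z]", "G12D"),   # {x,y} : two or three digits
    ("[A-Z][0-9]{2,}[A-Z]", "A1234B"),  # {x,} : two or more digits
]

for pattern, text in checks:
    print(pattern, text, whole_match(pattern, text))
```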
When you get the hang of regular expressions you can create very specific expressions like this one:
ERBB[234]|[HB]{0,1}EGF[R]{0,1}\>|NRG[1-4]|\<EP[GR]\>|AREG|BTC|TGFA
This expression searches for all 4 ERBB-family receptors and their 11 ligands, namely: EGFR, ERBB2, ERBB3, ERBB4, AREG, BTC, TGFA, NRG1, NRG2, NRG3, NRG4, EGF, HBEGF, EPG and EPR; and nothing more. Precision is what you want in Science, and regular expressions will get you there :)
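A quick way to sanity-check a long expression like this one is to run it against a list of names you expect to match and a few you expect to miss. The sketch below does that in Python, with \< and \> rewritten as \b because Python's re module lacks those operators. The list of names that should not match is a small hand-picked assumption, not an exhaustive test.

```python
import re

# The expression from the text, with \< and \> replaced by Python's \b.
PATTERN = re.compile(
    r"ERBB[234]|[HB]{0,1}EGF[R]{0,1}\b|NRG[1-4]|\bEP[GR]\b|AREG|BTC|TGFA"
)

# Names taken from the text: 4 receptors followed by 11 ligands.
EXPECTED = [
    "EGFR", "ERBB2", "ERBB3", "ERBB4",
    "AREG", "BTC", "TGFA", "NRG1", "NRG2", "NRG3", "NRG4",
    "EGF", "HBEGF", "EPG", "EPR",
]

# A few names chosen for illustration that should not match.
UNEXPECTED = ["VEGFA", "KRAS", "EPHA2", "NRGN"]

hits = [name for name in EXPECTED + UNEXPECTED if PATTERN.search(name)]
print(hits)
print(len(hits))
```

HBEGF matches through the second alternative because the search is not anchored at the start of the name: the pattern finds BEGF at the end of the word.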
The anchors $ and ^ don’t work at the moment; the reasons remain elusive. Please use the word boundary operators \< and \> to demarcate these bounds instead.