Performing advanced searches on DORM

 

DORM is a lightweight web tool for browsing recurrent mutations in human cancers identified from genome-wide screens (data sourced from COSMIC releases). Here is how to harness the power of regular expressions to craft advanced search queries for the DORM database.

Understanding how the search works behind-the-scenes might help you to formulate better and fail-proof queries.

  1. A query is processed row by row (line by line): each row of the data is checked for a match to the search expression. For example, if you search for BRAF lung, this will show BRAF mutations which are reported at least once in a sample from lung tissue.

  2. A blank space in the query triggers “nesting” of the query parameters from left to right, and is useful when you want to narrow your search space.

    Consider the expression:

    KRAS G12 lung|pancreas

    This search expression is processed by extracting all the rows containing the word KRAS, then the phrase G12 is searched in those results, followed by the regular expression lung|pancreas. In practice, this query shows all the KRAS G12 mutations that are reported in samples derived from either the lung or the pancreas.

  3. Blank space is NOT translated to the OR-operator i.e. |, as it is reserved to “nest” search queries (read point #2 above).

  4. Comma , and semicolon ; are provided as delimiters to search sets of proteins. These operators are translated to the OR-operator, i.e. |. The number of spaces after these symbols does not affect the search, i.e. KRAS,EGFR and KRAS, EGFR provide the same results.
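
The four rules above can be sketched in a few lines of Python. This is only an illustration of the behaviour described here, not DORM's actual code: the function name nested_search, the sample rows and the use of Python's re module are all assumptions.

```python
import re

def nested_search(rows, query):
    """Filter rows the way the rules above describe (hypothetical sketch)."""
    # Rule 4: commas and semicolons (with any surrounding spaces) become "|".
    query = re.sub(r"\s*[,;]\s*", "|", query.strip())
    # Rules 2 and 3: every blank-separated term narrows the previous result.
    for term in query.split():
        pattern = re.compile(term)
        # Rule 1: each row is tested on its own.
        rows = [row for row in rows if pattern.search(row)]
    return rows

# Made-up rows for illustration; the real DORM columns may differ.
rows = [
    "KRAS G12D pancreas",
    "KRAS G12C lung",
    "KRAS G13D large_intestine",
    "BRAF V600E skin",
    "EGFR L858R lung",
]

print(nested_search(rows, "KRAS G12 lung|pancreas"))
print(nested_search(rows, "KRAS, EGFR") == nested_search(rows, "KRAS,EGFR"))
```

The first call keeps only the two KRAS G12 rows from lung or pancreas; the second confirms that spacing after a comma makes no difference in this sketch.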

Using regular expressions

  1. The either-or clause:

    You can use the pipe operator (i.e., | ) for either-or queries, e.g.

    KRAS|BRAF - mutations in either KRAS or BRAF proteins.

    EGFR|ERBB - mutations in the proteins EGFR, ERBB2, ERBB3, ERBB4, etc.

    N.B. ERBB matches the ERBB2, 3, 4 and possibly other proteins that start with ERBB.

  2. Specify inclusions and exclusions:

    You can restrict a match to specific characters with the [] operator. You can list individual characters or ranges to include or exclude during searches, e.g.

    Searching NRG gives you several proteins, but if you only want the Neuregulin ligands, you can simply search NRG[1-4] to specifically search for the four Neuregulins. The expression [1-4] is evaluated as the range of numbers [1234].

    Additionally, you can exclude results using the [^] operator e.g.

    ERBB[^4] - matches ERBB2 and ERBB3 among the ERBBs and leaves out ERBB4.

  3. Set word boundaries:

    The word boundary operators \< and \> allow you to anchor a search term to the start or the end of a word, e.g.

    \<NRG - lists only the proteins that begin with NRG

    RAS\> - lists all the proteins ending in RAS

    RAS\> [0-9]+C\> - lists the mutations in RAS oncogenes that create a change to Cysteine. Remember, blank space triggers nesting of search operations (point #2 above).

  4. Wildcards:

    Substituting words, numbers etc. is very easy with these:

    a. [0-9] or [[:digit:]] - for digits 0-9

    b. [a-zA-Z] or [[:alpha:]] - for lowercase letters [a-z] and uppercase letters [A-Z].

    c. The dot . - matches any character

    d. [[:space:]] - matches blank space

  5. Match length modifiers:

    The benefit of using regular expressions is the flexibility these wildcard characters offer when you do not know the exact length of matching text. These are the five main modifiers:

    a. * - matches 0 or more times [matching character can be either present or absent]

    b. ? - matches 0 or 1 time [matching character is optional]

    c. + - matches 1 or more times

    d. {n} - matches exactly n times

    e. {x,y} - matches at least x, and at most y times. Either bound can be omitted: {x,} matches at least x times and {,y} matches at most y times.
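
A few of the operators above can be tried out offline. The sketch below uses Python's re module, which is an assumption about tooling and not how DORM itself evaluates queries. Python does not understand \< and \> or the [[:digit:]] style classes, so \b (a generic word boundary) and [0-9] stand in for them here. The helper name matching and both lists are made up for illustration.

```python
import re

def matching(pattern, names):
    """Return the names that contain a match for the pattern."""
    return [name for name in names if re.search(pattern, name)]

# Illustrative names only, not an extract of the DORM data.
proteins = ["NRG1", "NRG4", "NRGN", "ERBB2", "ERBB3", "ERBB4",
            "KRAS", "HRAS", "RASA1", "EGF", "EGFR"]
mutations = ["G12C", "G12D", "Q61H", "V600E"]

print(matching(r"NRG[1-4]", proteins))     # range: NRG followed by 1, 2, 3 or 4
print(matching(r"ERBB[^4]", proteins))     # exclusion: ERBB followed by anything but 4
print(matching(r"RAS\b", proteins))        # \b stands in for \> (name ends in RAS)
print(matching(r"\bEGFR?\b", proteins))    # ? makes the trailing R optional
print(matching(r"[0-9]+C\b", mutations))   # + : one or more digits, then C at the end
print(matching(r"[0-9]{3}", mutations))    # {n} : exactly three digits in a row
```

Running it shows, for instance, that ERBB[^4] keeps ERBB2 and ERBB3 but drops ERBB4, and that RAS\b keeps KRAS and HRAS but not RASA1.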

     

Getting crafty

When you get the hang of regular expressions you can create very specific expressions like this one:

ERBB[234]|[HB]{0,1}EGF[R]{0,1}\>|NRG[1-4]|\<EP[GR]\>|AREG|BTC|TGFA

This expression searches for all 4 ERBB-family receptors and their 11 ligands, namely: EGFR, ERBB2, ERBB3, ERBB4, AREG, BTC, TGFA, NRG1, NRG2, NRG3, NRG4, EGF, HBEGF, EPG and EPR; and nothing more. Precision is what you want in science, and regular expressions will get you there :)
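
If you want to sanity-check an expression like this before using it, a rough offline test is possible. The sketch below assumes Python's re module and replaces \< and \> with \b, since Python lacks those two operators; this is an approximation and DORM's own engine may differ in detail. The decoy names are arbitrary examples chosen to probe the boundaries.

```python
import re

DORM_QUERY = r"ERBB[234]|[HB]{0,1}EGF[R]{0,1}\>|NRG[1-4]|\<EP[GR]\>|AREG|BTC|TGFA"

# Assumption: \b is close enough to \< and \> for plain protein names.
pattern = re.compile(DORM_QUERY.replace(r"\<", r"\b").replace(r"\>", r"\b"))

# The 4 receptors and 11 ligands named in the text.
expected = ["EGFR", "ERBB2", "ERBB3", "ERBB4", "AREG", "BTC", "TGFA",
            "NRG1", "NRG2", "NRG3", "NRG4", "EGF", "HBEGF", "EPG", "EPR"]
# Arbitrary look-alike names that should be left out.
decoys = ["VEGFA", "MEGF8", "NRGN", "EPHA2", "KRAS"]

matched = [name for name in expected + decoys if pattern.search(name)]
print(matched == expected)  # True when all 15 names match and no decoy does
```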

Caveats

  1. To repeat: a blank space is NOT translated to the OR-operator i.e. |, as it is reserved to “nest” search queries (read point #2 above). Please use a comma or semicolon to delimit your list of protein names.
  2. The traditional string boundary operators $ and ^ don’t work at the moment; the reasons remain elusive. Please use the word boundary operators \< and \> to demarcate these bounds instead.