How do I use the Search and Display Engine for exploration and data mining:
How do I find all entries with the words glycoprotein AND kinase?
How do I find all entries with the words glycoprotein OR kinase?
How do I find all entries WITHOUT the words glycoprotein OR kinase?
The search engine won't accept more than one word. How do I find "sus scrofa"?
How do I search all sequences for the motif "CAGTGGATAC"?
How do I search for all sequences containing microsatellites (repeats of
motifs of 1-6 bp)?
How do I get the right information to plot in Excel the G+C content versus
sequence length for all sequences in the database?
How do I get a subset of sequences into fasta format into a new file to put
into PAUP?
How do I find all entries with the words glycoprotein AND kinase?
Use the SEARCH ENGINE and multiple searches to act as AND statements:
1. wholefile contains glycoprotein
2. (click MORE button)
3. wholefile contains kinase
4. (click Search & Display)
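The chained searches above act as a logical AND: an entry is kept only if every pattern matches. A minimal Python sketch of that logic follows. It assumes entries are plain strings and that matching is case-sensitive; `entries` and `contains_all` are hypothetical names for illustration, not part of the search engine.

```python
import re

def contains_all(text, patterns):
    """Return True only if every pattern is found somewhere in text."""
    return all(re.search(p, text) for p in patterns)

# Hypothetical entries standing in for database files.
entries = {
    "entry1": "membrane glycoprotein with kinase domain",
    "entry2": "serine kinase",
    "entry3": "surface glycoprotein",
}

# Keep only entries that contain both words.
hits = [name for name, text in entries.items()
        if contains_all(text, ["glycoprotein", "kinase"])]
print(hits)
```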
How do I find all entries with the words glycoprotein OR kinase?
Use the SEARCH ENGINE and the "|" (OR) symbol in your pattern:
1. wholefile contains glycoprotein|kinase
2. (click Search & Display)
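The "|" symbol has the same meaning in most regular expression dialects. As an illustration only, here is the same pattern in Python, run against hypothetical entries (the `entries` data is invented for the example):

```python
import re

# "|" separates alternatives: the pattern matches if either word is present.
pattern = re.compile(r"glycoprotein|kinase")

# Hypothetical entries standing in for database files.
entries = {
    "entry1": "membrane glycoprotein",
    "entry2": "serine kinase",
    "entry3": "zinc finger protein",
}

hits = [name for name, text in entries.items() if pattern.search(text)]
print(hits)
```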
How do I find all entries WITHOUT the words glycoprotein OR kinase?
Use the SEARCH ENGINE and the "does not contain" comparison SEARCH option:
1. wholefile doesnotcontain glycoprotein|kinase
2. (click Search & Display)
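Negating an OR pattern keeps only the entries in which neither word appears. A small Python sketch of that filter, using hypothetical entries (the names `entries` and `lacks_both` are assumptions made for the example):

```python
import re

pattern = re.compile(r"glycoprotein|kinase")

def lacks_both(text):
    """True when neither 'glycoprotein' nor 'kinase' occurs in text."""
    return pattern.search(text) is None

# Hypothetical entries standing in for database files.
entries = {
    "entry1": "membrane glycoprotein",
    "entry2": "serine kinase",
    "entry3": "zinc finger protein",
}

hits = [name for name, text in entries.items() if lacks_both(text)]
print(hits)
```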
The search engine won't accept more than one word. How do I find "sus scrofa"?
Use wildcards to match spaces. The search engine only takes in
the first "word" it sees and stops at any whitespace. You can get
around this by using the "." symbol, which matches any single character,
such as a letter, number, or space. The + symbol means "1 or more" of the
preceding item and the * symbol means "0 or more". Either of the following
will match "sus scrofa".
1. wholefile contains sus(.)+scrofa (match one or more characters of
any kind between sus and scrofa)
2. (click Search & Display)
or
1. wholefile contains sus(.)*scrofa (match zero or more characters of
any kind between sus and scrofa)
2. (click Search & Display)
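To see how the two quantifiers differ, the patterns can be tried in any regular expression engine. The sketch below uses Python's `re` module as a stand-in. Whether the search engine is case-sensitive is not stated here, so the test strings are kept in lower case; `matches` is a hypothetical helper.

```python
import re

one_or_more = re.compile(r"sus(.)+scrofa")   # at least one character in between
zero_or_more = re.compile(r"sus(.)*scrofa")  # any number of characters, including none

def matches(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return pattern.search(text) is not None

print(matches(one_or_more, "sus scrofa"))    # the space counts as one character
print(matches(one_or_more, "susscrofa"))     # nothing between the two words
print(matches(zero_or_more, "susscrofa"))    # "*" also allows zero characters
```

Note that both patterns also match text with many characters between the two words, so they can return more entries than an exact phrase search would.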
How do I search all sequences for the motif "CAGTGGATAC"?
Use the SEARCH ENGINE with the "contains" SEARCH option:
1. seq contains CAGTGGATAC
2. (click Search & Display)
(For more advanced pattern matching with more useful output use the
script Page_Regex.cgi.)
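For a fixed motif, the search amounts to finding a substring. The following Python sketch lists every position at which the motif occurs in one sequence. It is an illustration only: `find_motif` is a hypothetical helper, positions are 0-based, and nothing here describes what Page_Regex.cgi actually reports.

```python
def find_motif(seq, motif):
    """Return the 0-based start positions of every occurrence of motif in seq."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping occurrences are also found.
        start = seq.find(motif, start + 1)
    return positions

# Hypothetical sequence containing the motif twice.
print(find_motif("TTCAGTGGATACGGCAGTGGATAC", "CAGTGGATAC"))
```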
How do I search for all sequences containing microsatellites (repeats of
motifs of 1-6 bp)?
Use the SEARCH ENGINE with the "contains" SEARCH option. You can use
this complex but correct regular expression as your pattern:
1. seq contains (.)\1{7,}|(.{2,3}?)\2{3,}|(.{4,100}?)\3{2,}
2. (click Search & Display)
For more advanced pattern matching using regular expressions, or for
searches for patterns that occur frequently, you'll get much more useful
output with the PATTERN SEARCHING script, Page_Regex.cgi. See 'All
scripts' in the MINE menu for a link to this script. It contains
further instructions and a list of useful regular expressions.
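The pattern above has three alternatives: a single base repeated at least 8 times in total, a unit of 2-3 bases repeated at least 4 times in total, and a unit of 4-100 bases repeated at least 3 times in total. Each backreference (\1, \2, \3) repeats whatever its own group captured. The same expression can be tried outside the search engine; the sketch below uses Python's `re` module, and `find_microsatellites` is a hypothetical helper written for this example.

```python
import re

# Same expression as in the search step above.
MICROSAT = re.compile(r"(.)\1{7,}|(.{2,3}?)\2{3,}|(.{4,100}?)\3{2,}")

def find_microsatellites(seq):
    """Return (start, matched_text) for each non-overlapping repeat found in seq."""
    return [(m.start(), m.group(0)) for m in MICROSAT.finditer(seq)]

# Hypothetical sequences: a dinucleotide repeat and a mononucleotide run.
print(find_microsatellites("GGTTCACACACATTGG"))
print(find_microsatellites("CCAAAAAAAAAGG"))
```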
How do I get the right information to plot in Excel the G+C content versus
sequence length for all sequences in the database?
Use the SEARCH ENGINE with the report format DISPLAY option:
1. Select "Report Format" from the Display options.
2. Tick gc and seq_len
3. File_name contains db (will capture all files, since they all end
in db)
4. (click Search & Display)
5. Copy and paste the results into Excel, importing with the ** symbols
as column separators
6. Use Excel to graph the results
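If importing on the ** separator is awkward, the pasted results can first be converted to comma-separated text. The Python sketch below is a rough illustration that assumes each result row has the form name**gc**seq_len; the actual column order and layout of the report are not documented here, so the default header names and the sample rows are assumptions, and `report_to_csv` is a hypothetical helper.

```python
import csv
import io

def report_to_csv(lines, header=("file_name", "gc", "seq_len")):
    """Convert '**'-separated report rows to CSV text (column names are assumed)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from copy and paste
        writer.writerow([field.strip() for field in line.split("**")])
    return buf.getvalue()

# Hypothetical rows; real report output may look different.
print(report_to_csv(["a.db**52.3**1200", "b.db**48.0**950"]))
```

The resulting text can be saved with a .csv extension and opened directly in Excel.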
How do I get a subset of sequences into fasta format into a new
file to put into PAUP?
Use the SEARCH ENGINE with the fasta format DISPLAY option:
1. Select "Fasta format" from the Display options.
2. File_name contains (your pattern)
3. (click Search & Display)
4. Copy and paste your fasta formatted sequences into a new file, or
save the results of your search using a new name.
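For reference, a fasta record is a header line beginning with ">" followed by one or more lines of sequence. The Python sketch below builds such text from (name, sequence) pairs, which can help when checking that a pasted file is well formed. `to_fasta` and the sample records are hypothetical, and the line width of 60 is a common convention rather than something this search engine is known to use.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as fasta text with sequence lines of at most width characters."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        # Split the sequence into fixed-width chunks.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Hypothetical records.
print(to_fasta([("seq1", "ACGTACGTAC"), ("seq2", "GGCC")], width=4))
```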