Sequence query tutorial
What Queries Are
To query any database is to ask it to deliver all data that
meet a desired logical condition. This condition is itself called
the query. When a user types and clicks on a web page interface
to a database, a formal database query is being created behind the
scenes. When the user clicks "Submit", that query is ultimately
provided to the database program, which responds by returning data
that meet the condition, or match the query.
Querying a database by clicking on a web form is convenient for simple
queries. Complex queries that require many different conditions, such as
"Give me all tat sequences in subtype C isolates sampled in
Australia between 2000 and 2004, from patients who were intravenous
drug users with CD4 counts below 200," may take several screens' worth
of clicking, if they can be made at all. The most convenient way
to make this sort of query is by writing it in the symbolic language
that the database understands. Most modern databases use a dialect of
SQL (sometimes called "structured query language") for this purpose.
Most public databases do not allow the user to use SQL directly, in
order to protect the database's integrity and to hide private
data. This is (currently) the case for the LANL HIV database.
HIVQuery provides the user with a simple query
language that can make complex queries like the example above. The
program essentially converts the user's typed query into clicks and
entries to the LANL web forms, makes the user's request, and then
collects the sequences returned into a simple text format that is easy
to convert to other computer-readable formats.
Writing a query
If you have used the Entrez query system for GenBank, you are already
familiar with HIVQuery's format and syntax. (See the description of
the Entrez query language.)
In a database, data is stored in records. Each record consists
of a group of fields; each field contains data
appropriate to that field. For example, a record containing a subtype
B sequence sampled in South Africa in 1999 would have
fields subtype, country, and isolation_year, containing the
data B, ZA, and 1999, along with a field sequence containing the
sequence data, and many other fields with other data describing this
particular isolate.
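As a rough illustration of the record-and-field idea, the example record can be pictured as a simple mapping from field names to values. This is only a sketch in Python, not how the LANL database stores its data; the sequence string is a made-up placeholder, and `field_value` is a hypothetical helper.

```python
# One record pictured as a mapping from field names to values.
# The sequence value is a made-up placeholder, not real data.
record = {
    "subtype": "B",
    "country": "ZA",
    "isolation_year": 1999,
    "sequence": "NNNNNNNNNN",
}

def field_value(rec, field):
    """Return the data stored in one field, or None if the field is absent."""
    return rec.get(field)

print(field_value(record, "country"))         # ZA
print(field_value(record, "isolation_year"))  # 1999
```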
Writing a query is a process of stringing together elements
that look like this:
(matchdata)[field]
The field indicates the category of data (country, subtype, etc.).
The matchdata is the value of the data in that field that the user
wants each returned record to match. For example, to get all
subtype F sequences in the database, the user would simply type
(F)[subtype]
in the Query box and click Run Query.
If you're using the script hivq.PL, you would enter
hivq> run (F)[subtype]
or
hivq> query (F)[subtype]
hivq> run
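To make the shape of a query element concrete, here is a minimal Python sketch that splits a query string into its (matchdata)[field] parts. It is not HIVQuery's own parser; the names `ELEMENT` and `parse_elements` are hypothetical, and the sketch assumes the matchdata is always wrapped in parentheses.

```python
import re

# Pattern for one "(matchdata)[field]" element.
# Assumption of this sketch: matchdata is always parenthesized.
ELEMENT = re.compile(r"\(([^()]*)\)\[(\w+)\]")

def parse_elements(query):
    """Return a list of (values, field) pairs found in the query string."""
    return [(m.group(1).split(), m.group(2)) for m in ELEMENT.finditer(query)]

print(parse_elements("(F)[subtype]"))
# [(['F'], 'subtype')]
```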
Complex queries
Getting records that meet increasingly restrictive conditions is
sometimes called refining a query. Most users will be familiar with
refining searches by using logical or Boolean operators
like AND and OR. In HIVQuery, to get all
subtype F sequences that were found in Brazil, one could type
(F)[subtype] AND (BR)[country]
In the HIVQuery syntax, AND is assumed between adjacent
query elements, so the following is equivalent to the above:
(F)[subtype] (BR)[country]
In HIVQuery, logical OR is also available. Note that
(F)[subtype] OR (BR)[country]
will return all sequences for which either the subtype is F or
the country is Brazil. So this query returns all subtype F
sequences, plus all Brazilian sequences (see note).
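The difference between AND and OR can be seen with set operations on a handful of records. The following Python sketch uses invented toy records (the ids, and the country code AR, are illustrative only) and a hypothetical helper `matching_ids`; it shows the logic, not what HIVQuery does internally.

```python
# Toy records invented for illustration; not real database entries.
records = [
    {"id": 1, "subtype": "F", "country": "BR"},
    {"id": 2, "subtype": "F", "country": "AR"},
    {"id": 3, "subtype": "B", "country": "BR"},
    {"id": 4, "subtype": "C", "country": "ZA"},
]

def matching_ids(field, value):
    """Ids of all records whose field equals value."""
    return {r["id"] for r in records if r[field] == value}

subtype_f = matching_ids("subtype", "F")
brazil = matching_ids("country", "BR")

and_result = subtype_f & brazil  # records that meet both conditions
or_result = subtype_f | brazil   # records that meet either condition

print(sorted(and_result))  # [1]
print(sorted(or_result))   # [1, 2, 3]
```

AND narrows the result to the overlap of the two conditions, while OR widens it to everything that satisfies at least one of them.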
OR comes into its own when coupled with ANDs. A more useful
example might be
(F)[subtype] OR (C)[subtype] OR (D)[subtype]
which returns all sequences of subtypes F, C, and D. This kind
of OR is so frequent that it has a shorthand:
(F C D)[subtype]
Then to obtain all Brazilian F, C, or D sequences, type
(F C D)[subtype] AND (BR)[country]
or the even shorter
(F C D)[subtype] (BR)[country]
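The relationship between the shorthand and the long OR form can be written out as a small Python function. This is a sketch of the equivalence described in the text, not code from HIVQuery; `expand_shorthand` is a hypothetical name.

```python
def expand_shorthand(values, field):
    """Expand space-separated values for one field into an explicit OR chain."""
    return " OR ".join(f"({v})[{field}]" for v in values.split())

print(expand_shorthand("F C D", "subtype"))
# (F)[subtype] OR (C)[subtype] OR (D)[subtype]
```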
To do the Australian query written in English above, type
(Tat)[gene] (C)[subtype] (AU)[country] (2000 2001 2002 2003 2004)[year] (<200)[cd4count] (PI)[risk_factor]
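Long queries like this one are tedious to type by hand, especially the list of years. One possible way to assemble the string is sketched below in Python; `build_query` and `criteria` are hypothetical names, and the sketch only produces the query text for pasting into the Query box or the hivq prompt.

```python
def build_query(criteria):
    """Join field/value lists into a string of (values)[field] elements."""
    parts = []
    for field, values in criteria.items():
        joined = " ".join(str(v) for v in values)
        parts.append(f"({joined})[{field}]")
    # Adjacent elements are implicitly connected by AND.
    return " ".join(parts)

criteria = {
    "gene": ["Tat"],
    "subtype": ["C"],
    "country": ["AU"],
    "year": range(2000, 2005),  # 2000 through 2004 inclusive
    "cd4count": ["<200"],
    "risk_factor": ["PI"],
}

query = build_query(criteria)
print(query)
```

The printed string is identical to the Australian query shown above.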
Annotation Fields
The user can also specify annotation fields. These do not
restrict the query, but arrange for the return of the associated field
data for each sequence returned. Specify annotation fields between
curly braces, as in:
(B C)[subtype] 2000[year] {country cd4count cd8count}
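The curly-brace part of a query can be separated from the restricting conditions with a regular expression. The Python sketch below is illustrative only and is not taken from HIVQuery; `split_annotations` is a hypothetical helper that works on the query text, not on the returned file.

```python
import re

def split_annotations(query):
    """Split a query into its conditions and its list of annotation fields."""
    fields = []
    for m in re.finditer(r"\{([^{}]*)\}", query):
        fields.extend(m.group(1).split())
    # Remove the curly-brace groups to leave only the restricting conditions.
    conditions = re.sub(r"\{[^{}]*\}", "", query).strip()
    return conditions, fields

conditions, fields = split_annotations(
    "(B C)[subtype] 2000[year] {country cd4count cd8count}"
)
print(conditions)  # (B C)[subtype] 2000[year]
print(fields)      # ['country', 'cd4count', 'cd8count']
```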
To extract annotations from the returned file, see the Perl
code here for one way to do it.
Click here for a table of valid query fields
and their aliases.
Note: The query (F)[subtype] OR (BR)[country] requires two
separate visits to the LANL web page. HIVQuery realizes this,
makes the two visits, and aggregates the data for you.
07 Feb 2009