Sequence query tutorial


What Queries Are

To query any database is to ask it to deliver all data that meet a desired logical condition. This condition is itself called the query. When a user types and clicks on a web page interface to a database, a formal database query is being created behind the scenes. When the user clicks "Submit", that query is ultimately provided to the database program, which responds by returning data that meet the condition, or match the query.

Querying a database by clicking on a web form is convenient for simple queries. Complex queries that require many different conditions, e.g.
"Give me all tat sequences in subtype C isolates sampled in Australia between 2000 and 2004, from patients who were intravenous drug users with CD4 counts below 200,"
may take several screens' worth of clicking, if they can be made at all. The most convenient way to make this sort of query is to write it in the symbolic language that the database understands. Most modern databases use a dialect of SQL (sometimes called "structured query language") for this purpose.

Most public databases do not allow the user to use SQL directly, in order to protect the database's integrity and to hide private data. This is the case (currently) for the LANL HIV database. HIVQuery provides the user with a simple query language that can make complex queries like the example above. The program essentially converts the user's typed query into clicks and entries to the LANL web forms, makes the user's request, and then collects the sequences returned into a simple text format that is easy to convert to other computer-readable formats.

Writing a query

If you have used the Entrez query system for GenBank, you are already familiar with HIVQuery's format and syntax. (See the description of the Entrez query language.)

In a database, data are stored in records. Each record consists of a group of fields; each field contains data appropriate to that field. For example, a record containing a subtype B sequence sampled in South Africa in 1999 would have fields subtype, country, and isolation_year, containing the data B, ZA, and 1999, along with a field sequence containing the sequence data, and many other fields with other data describing this particular isolate.

Writing a query is a process of stringing together elements that look like this:
(matchdata)[field]
The field indicates the category of data (country, subtype, etc.). The matchdata is the value of the data in that field that the user wants each record returned to match. For example, to get all subtype F sequences in the database, the user would simply type
(F)[subtype]
in the Query box and click Run Query.
If you're using the script hivq.PL, you would enter
hivq> run (F)[subtype]
or
hivq> query (F)[subtype]
hivq> run
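
To make the (matchdata)[field] structure concrete, here is a minimal Python sketch that splits a query string into its values and field name. The regular expression and the function name parse_elements are illustrative assumptions, not part of HIVQuery, and the sketch only handles the parenthesized form shown above.

```python
import re

# Matches one query element such as "(F)[subtype]".
# Illustrative only: this is not HIVQuery's own parser.
ELEMENT = re.compile(r"\(([^()]*)\)\[(\w+)\]")

def parse_elements(query):
    """Return a list of (values, field) pairs found in a query string."""
    return [(m.group(1).split(), m.group(2)) for m in ELEMENT.finditer(query)]

print(parse_elements("(F)[subtype]"))
# [(['F'], 'subtype')]
```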

Complex queries

Getting records that meet increasingly restrictive conditions is sometimes called refining a query. Most users will be familiar with refining searches by using logical or Boolean operators like AND and OR. In HIVQuery, to get all subtype F sequences that were found in Brazil, one could type
(F)[subtype] AND (BR)[country]
In the HIVQuery syntax, the AND is assumed to connect the query elements, so the following is equivalent to the above:
(F)[subtype] (BR)[country]
In HIVQuery, logical OR is also available. Note that
(F)[subtype] OR (BR)[country]
will return all sequences for which either the subtype is F, or the country is Brazil. So this query returns all subtype F sequences, plus all Brazilian sequences (see the note at the end of this tutorial).
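
The difference between AND and OR can be seen on a handful of toy records. The following Python sketch uses made-up records whose field names follow this tutorial; it illustrates the logic only and does not contact the LANL database.

```python
# Toy records: the field names follow the tutorial, the data are invented.
records = [
    {"id": 1, "subtype": "F", "country": "BR"},
    {"id": 2, "subtype": "F", "country": "AR"},
    {"id": 3, "subtype": "B", "country": "BR"},
    {"id": 4, "subtype": "C", "country": "ZA"},
]

# (F)[subtype] AND (BR)[country]: both conditions must hold.
and_ids = [r["id"] for r in records
           if r["subtype"] == "F" and r["country"] == "BR"]

# (F)[subtype] OR (BR)[country]: either condition is enough.
or_ids = [r["id"] for r in records
          if r["subtype"] == "F" or r["country"] == "BR"]

print(and_ids)  # [1]
print(or_ids)   # [1, 2, 3]
```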

OR comes into its own when coupled with ANDs. A more useful example might be
(F)[subtype] OR (C)[subtype] OR (D)[subtype]
which returns all sequences of subtypes F, C and D. This kind of OR is so frequent that it has a shorthand:
(F C D)[subtype]
Then to obtain all Brazilian F, C, or D sequences, type
(F C D)[subtype] AND (BR)[country]
or the even shorter
(F C D)[subtype] (BR)[country]
To do the Australian query written in English above, type
(Tat)[gene] (C)[subtype] (AU)[country] (2000 2001 2002 2003 2004)[year] (<200)[cd4count] (PI)[risk_factor]
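
Long queries like this one are tedious to type by hand, especially the list of years. As a rough sketch, a query string in the shorthand form can be assembled from a Python dictionary. The helper name build_query is hypothetical and not part of HIVQuery; the field names and values are the ones used in the query above.

```python
def build_query(criteria):
    """Turn {field: value or list of values} into '(v1 v2)[field]' elements.

    Elements are joined with spaces, which HIVQuery reads as an implicit AND.
    """
    parts = []
    for field, values in criteria.items():
        if isinstance(values, (str, int)):
            values = [values]  # a single value becomes a one-item list
        parts.append("({})[{}]".format(" ".join(str(v) for v in values), field))
    return " ".join(parts)

query = build_query({
    "gene": "Tat",
    "subtype": "C",
    "country": "AU",
    "year": range(2000, 2005),  # 2000 through 2004 inclusive
    "cd4count": "<200",
    "risk_factor": "PI",
})
print(query)
```

The printed string can then be pasted into the Query box or passed to hivq.PL.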

Annotation Fields

The user can also specify annotation fields. These do not restrict the query, but arrange for the return of the associated field data for each sequence returned. Specify annotation fields between curly braces, as in:
(B C)[subtype] 2000[year] {country cd4count cd8count}
To extract annotations from the returned file, see the Perl code here for one way to do it.
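
To see how the curly-brace list sits apart from the restricting part of a query, the following Python sketch separates the two. The function name split_annotations is a hypothetical helper; the sketch works on the query string only and makes no assumption about the format of the returned file.

```python
import re

def split_annotations(query):
    """Separate the {...} annotation list from the restricting part of a query."""
    fields = []
    for group in re.findall(r"\{([^{}]*)\}", query):
        fields.extend(group.split())
    # Remove the brace groups, leaving only the restricting elements.
    restriction = re.sub(r"\s*\{[^{}]*\}\s*", " ", query).strip()
    return restriction, fields

print(split_annotations("(B C)[subtype] 2000[year] {country cd4count cd8count}"))
# ('(B C)[subtype] 2000[year]', ['country', 'cd4count', 'cd8count'])
```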

Click here for a table of valid query fields and their aliases.


Note: This query requires two separate visits to the LANL web page. HIVQuery recognizes this, makes the two visits, and aggregates the data for you.
07 Feb 2009