Building Queries Systematically
- by Jeremy Smyth
The SQL language is a bit like a toolkit for data. It consists of lots of little fiddly bits of syntax that, taken together, allow you to build complex edifices and return powerful results. For the uninitiated, the many tools can be quite confusing, and it's sometimes difficult to decide how to go about the process of building non-trivial queries, that is, queries that are more than a simple SELECT a, b FROM c;
A System for Building Queries
When you're building queries, you could use a system like the following:
Decide which fields contain the values you want to use in our output, and how you wish to alias those fields
Values you want to see in your output
Values you want to use in calculations . For example, to calculate margin on a product, you could calculate price - cost and give it the alias margin.
Values you want to filter with. For example, you might only want to see products that weigh more than 2Kg or that are blue. The weight or colour columns could contain that information.
Values you want to order by. For example you might want the most expensive products first, and the least last. You could use the price column in descending order to achieve that.
Assuming the fields you've picked in point 1 are in multiple tables, find the connections between those tables
Look for relationships between tables and identify the columns that implement those relationships. For example, The Orders table could have a CustomerID field referencing the same column in the Customers table.
Sometimes the problem doesn't use relationships but rests on a different field; sometimes the query is looking for a coincidence of fact rather than a foreign key constraint. For example you might have sales representatives who live in the same state as a customer; this information is normally not used in relationships, but if your query is for organizing events where sales representatives meet customers, it's useful in that query. In such a case you would record the names of columns at either end of such a connection.
Sometimes relationships require a bridge, a junction table that wasn't identified in point 1 above but is needed to connect tables you need; these are used in "many-to-many relationships". In these cases you need to record the columns in each table that connect to similar columns in other tables.
Construct a join or series of joins using the fields and tables identified in point 2 above. This becomes your FROM clause.
Filter using some of the fields in point 1 above. This becomes your WHERE clause.
Construct an ORDER BY clause using values from point 1 above that are relevant to the desired order of the output rows.
Project the result using the remainder of the fields in point 1 above. This becomes your SELECT clause.
A Worked Example
Let's say you want to query the world database to find a list of countries (with their capitals) and the change in GNP, using the difference between the GNP and GNPOld columns, and that you only want to see results for countries with a population greater than 100,000,000.
Using the system described above, we could do the following:
The Country.Name and City.Name columns contain the name of the country and city respectively.
The change in GNP comes from the calculation GNP - GNPOld. Both those columns are in the Country table. This calculation is also used to order the output, in descending order
To see only countries with a population greater than 100,000,000, you need the Population field of the Country table. There is also a Population field in the City table, so you'll need to specify the table name to disambiguate. You can also represent a number like 100 million as 100e6 instead of 100000000 to make it easier to read.
Because the fields come from the Country and City tables, you'll need to join them. There are two relationships between these tables: Each city is hosted within a country, and the city's CountryCode column identifies that country. Also, each country has a capital city, whose ID is contained within the country's Capital column. This latter relationship is the one to use, so the relevant columns and the condition that uses them is represented by the following FROM clause:
FROM Country JOIN City ON Country.Capital = City.ID
The statement should only return countries with a population greater than 100,000,000. Country.Population is the relevant column, so the WHERE clause becomes:
WHERE Country.Population > 100e6
To sort the result set in reverse order of difference in GNP, you could use either the calculation, or the position in the output (it's the third column):
ORDER BY GNP - GNPOld or ORDER BY 3
Finally, project the columns you wish to see by constructing the SELECT clause:
SELECT Country.Name AS Country, City.Name AS Capital,
GNP - GNPOld AS `Difference in GNP`
The whole statement ends up looking like this:
mysql> SELECT Country.Name AS Country, City.Name AS Capital,
-> GNP - GNPOld AS `Difference in GNP`
-> FROM Country JOIN City ON Country.Capital = City.ID
-> WHERE Country.Population > 100e6
-> ORDER BY 3 DESC;
+--------------------+------------+-------------------+
| Country | Capital | Difference in GNP |
+--------------------+------------+-------------------+
| United States | Washington | 399800.00 |
| China | Peking | 64549.00 |
| India | New Delhi | 16542.00 |
| Nigeria | Abuja | 7084.00 |
| Pakistan | Islamabad | 2740.00 |
| Bangladesh | Dhaka | 886.00 |
| Brazil | Brasília | -27369.00 |
| Indonesia | Jakarta | -130020.00 |
| Russian Federation | Moscow | -166381.00 |
| Japan | Tokyo | -405596.00 |
+--------------------+------------+-------------------+
10 rows in set (0.00 sec)
Queries with Aggregates and GROUP BY
While this system might work well for many queries, it doesn't cater for situations where you have complex summaries and aggregation. For aggregation, you'd start with choosing which columns to view in the output, but this time you'd construct them as aggregate expressions.
For example, you could look at the average population, or the count of distinct regions.You could also perform more complex aggregations, such as the average of GNP per head of population calculated as AVG(GNP/Population).
Having chosen the values to appear in the output, you must choose how to aggregate those values. A useful way to think about this is that every aggregate query is of the form X, Y per Z. The SELECT clause contains the expressions for X and Y, as already described, and Z becomes your GROUP BY clause. Ordinarily you would also include Z in the query so you see how you are grouping, so the output becomes Z, X, Y per Z.
As an example, consider the following, which shows a count of countries and the average population per continent:
mysql> SELECT Continent, COUNT(Name), AVG(Population)
-> FROM Country
-> GROUP BY Continent;
+---------------+-------------+-----------------+
| Continent | COUNT(Name) | AVG(Population) |
+---------------+-------------+-----------------+
| Asia | 51 | 72647562.7451 |
| Europe | 46 | 15871186.9565 |
| North America | 37 | 13053864.8649 |
| Africa | 58 | 13525431.0345 |
| Oceania | 28 | 1085755.3571 |
| Antarctica | 5 | 0.0000 |
| South America | 14 | 24698571.4286 |
+---------------+-------------+-----------------+
7 rows in set (0.00 sec)
In this case, X is the number of countries, Y is the average population, and Z is the continent. Of course, you could have more fields in the SELECT clause, and more fields in the GROUP BY clause as you require. You would also normally alias columns to make the output more suited to your requirements.
More Complex Queries
Queries can get considerably more interesting than this. You could also add joins and other expressions to your aggregate query, as in the earlier part of this post. You could have more complex conditions in the WHERE clause. Similarly, you could use queries such as these in subqueries of yet more complex super-queries. Each technique becomes another tool in your toolbox, until before you know it you're writing queries across 15 tables that take two pages to write out. But that's for another day...