Subqueries
- reference
In SQL++, a subquery is a SELECT query that is a constituent part of another SQL++ query or subquery.
Using subqueries, you can create multiple levels of nesting of queries. The outer levels of queries are called outer or parent queries, and inner level subqueries are called inner or child queries, or simply subqueries.
Syntax
subquery-expr ::= '(' select ')'
-
A subquery in SQL++ is limited to a SELECT query, whereas the outermost parent query can be any DML statement such as SELECT, INSERT, UPDATE, DELETE or MERGE.
-
Subquery must be enclosed in parenthesis.
-
When subquery is used in
FROM
clause, it must be aliased. -
Subqueries are SQL++ expressions and SQL++ evaluates them like any other SQL++ language expressions. When an expression is contained in a SQL++ statement, the expression is evaluated once for every input document to the statement, and same applies to a subquery-expression — the inner SELECT or subquery is executed once for every input document considered in the outer query.
-
A subquery expression returns an array every time it is evaluated, and the array contains the results of the inner SELECT subquery.
Subqueries in DML Statement Clauses
A subquery can occur as part of various clauses of the parent query where generic SQL++ expressions can be used. For instance, a SELECT statement can have a subquery in the projection list, WHERE clause, FROM clause, LET clause, and GROUP BY clause. Similarly, an UPDATE, INSERT, DELETE can have a subquery in the WHERE clause. For example, consider the following query which finds all the cities with landmarks that have airports.
SELECT t1.city FROM landmark t1
WHERE t1.city IN (SELECT RAW city FROM airport);
[
{
"city": "Abbeville"
},
{
"city": "Avignon"
}..
]
In the above example, the subquery is used as a part of WHERE condition of the parent query and it returns an array of cities that have airports.
Note that the IN or WITHIN predicate needs an array on right-side expression as the subquery evaluates to.
Further, the usage of keyword RAW
in the subquery projection ensures just the city
attribute values in the result without JSON wrapping.
The outer query browses through each of the landmark cities and returns the city name if it is in the array of cities returned by the inner subquery.
When subquery is used in the FROM clause, that must be aliased, to be able to access and refer to the results documents of the subquery.
In the following example Q2, the subquery is named as t1
, and it returns the aggregated value of airport names by city and country.
The outer query further processes the result of the subquery to find total number of airports by country where each city has more than 5 airports.
SELECT t1.country, array_agg(t1.city), sum(t1.city_cnt) as apnum
FROM (SELECT city, city_cnt, array_agg(airportname) as apnames, country
FROM airport
GROUP BY city, country LETTING city_cnt = count(city) ) AS t1
WHERE t1.city_cnt > 5
GROUP BY t1.country;
[
{
"$1": [
"Paris"
],
"apnum": 9,
"country": "France"
},
{
"$1": [
"London"
],
"apnum": 13,
"country": "United Kingdom"
},
{
"$1": [
"Houston",
"New York",
"San Diego"
],
"apnum": 22,
"country": "United States"
}
]
For more examples, see the Examples section.
Subqueries in SQL++ Expressions
Subqueries can appear as independent SELECT queries or can be combined as a constituent part of other SQL++ language expressions.
For example, in the above queries Q1 and Q2, the subquery is a full independent SELECT query. In the following example Q3, the subquery is part of a SQL++ expression.
SELECT array_max((SELECT count(city) as cnt
FROM airport
GROUP BY city)[*].cnt);
[
{
"$1": 14
}
]
In this example, the subquery is wrapped in a parenthesis ( ) and it occurs as part of the expression [*].cnt
which treats the result of the subquery as an array of JSON documents (with one attribute cnt
) and extracts the cnt
values from all the subquery result array elements.
Finally, the array_max()
function is applied to this to get the maximum number of airports in any of the cities.
Variables in Scope of a Subquery
The scope of subquery is defined as the set of names, identifiers, and variables that are accessible to the subquery. Variables include all keyspace names, aliases, LET or LETTING variables that can be referenced by the subquery. For a multilevel nested subquery, the scope includes scope of the subquery and all the scopes of its parent queries.
A subquery cannot access variables from its sibling subqueries or its parents’ sibling subqueries. For example, in the following nested query structure, the subquery SQ13 can only access the variables from SQ13, SQ1, and Q0, but not from SQ12, or SQ2.
Q0: SELECT ... (SELECT .. (SELECT ...) AS SQ12 (SELECT ...) AS SQ13 ) AS SQ1 (SELECT ... (SELECT ...) AS SQ21 (SELECT ...) AS SQ22 ) AS SQ2
Note that the keyspace name identifiers themselves are NOT variables, whereas any aliases of the keyspace name identifiers are variables in the scope.
For example, in the above query Q1, the keyspace identifier landmark
in the outer query is not considered a variable for the subquery.
In fact, that is the reason the subquery can also use the same keyspace name independently and still be a non-correlated subquery.
However, the alias t1
defined in the outer query is considered a variable in scope for the subquery.
The following example Q4 is a correlated subquery that uses the variable t1
from the outer query:
SELECT t1.city, t1.name
FROM landmark t1
WHERE t1.city IN SPLIT((SELECT RAW t2.name
FROM t1 AS t2)[0]);
[
{
"city": "Aberdour",
"name": "Aberdour Castle"
},
{
"city": "Aberdour",
"name": "Aberdour Railway Station"
},
{
"city": "Aberdulais",
"name": "Aberdulais Falls and Tin Works"
}
// ...
]
This above example Q4 uses the alias t1
from outer query in the FROM clause of subquery.
The subquery aliases t1
to t2
to avoid conflict in same variable name t1
being present in both outer and inner queries scope.
The reference to t2
in the subquery is actually referring to the same document from landmark
that is being processed as t1
in the outer query.
In other words, the same query can be simply rewritten (without using subqueries) as follows:
SELECT t1.city, t1.name
FROM landmark t1
WHERE t1.city IN SPLIT(t1.name);
[
{
"city": "Aberdour",
"name": "Aberdour Castle"
},
{
"city": "Aberdour",
"name": "Aberdour Railway Station"
},
{
"city": "Aberdulais",
"name": "Aberdulais Falls and Tin Works"
}
// ...
]
Typically, subqueries may refer to any variables and aliases available in the scope to build correlated subqueries and to perform subqueries specific to some context of outer query. See Correlated Subqueries for more details and Examples.
FROM clause in Subqueries
Keyspace Identifier versus Expression
As described in the FROM clause, the from-term can be a keyspace name or identifier or a SQL++ expression:
-
Keyspace identifiers are independent sources of data for a query.
-
SQL++ expressions can be constructed using various SQL++ language constructs including subqueries.
-
Constant expressions are independent sources of data for a query.
-
Variable expressions depend on variables in scope and are evaluated to resolve as input data for the query. These are applicable to subqueries.
-
This applies irrespective of whether it is a simple identifier such as
alias
,var
or a nested path identifier such askeyspace.subdoc1.subdoc2.field
,alias.subdoc.field
, orvar.subdoc.field
.
-
An expression can be a simple identifier or variable such as alias
, var
or more complex with various SQL++ language constructs.
Either way, SQL++ evaluates from-term
to resolve to keyspace identifiers or expressions as follows:
-
A
from-term
is considered as an expression if it is not a keyspace name identifier. -
If simple identifier can be considered as identifier or expression depending on various factors. An identifier
var
is considered as an expression if it is variable in scope defined through LET or LETTING, or explicit keyspace alias in parent queries.
In the following equivalent queries, explicit alias t
of airport
or LET variable x
is treated as an expression and hence a nested path like t.geo.lat
is allowed in the subquery FROM clause.
SELECT count(*) FROM airport t
WHERE (SELECT RAW t.geo.alt FROM t t1)[0] > 6000 ;
SELECT count(*) FROM airport t
WHERE (SELECT RAW alt FROM t.geo.alt)[0] > 6000;
SELECT count(*) FROM `travel-sample` .inventory.airport t
LET x = t.geo
WHERE (SELECT RAW y.alt FROM x y)[0] > 6000;
SELECT count(*) FROM airport t
WHERE (SELECT RAW geo.alt FROM t.geo)[0] > 6000;
Implicit Alias
When explicit alias is not defined, every identifier will have an implicit alias predefined with the same name as the identifier.
Implicit alias of a nested path is defined as the last component in the path.
In above example Q5D, the implicit alias of the nested path t.geo
in subquery is geo
, and in example Q5B the implicit alias of t.geo.alt
is alt
.
For example, the following example Q6 has no explicit alias for travel-sample
.
So the airport
used in the subquery FROM clause is considered a keyspace identifier, but not an expression.
That makes the subquery non-correlated by the FROM clause, and the subquery returns all documents from keyspace airport
.
SELECT array_length((SELECT RAW t1.geo.alt
FROM airport t1))
FROM airport LIMIT 4;
[
{
"$1": 1968
},
// ...
]
Contrast Q6 with Q6A below, the subquery is correlated by using the keyspace alias in the FROM clause.
The result is only 1
because the subquery is applicable to only the current document t
being processed in the parent query.
SELECT array_length((SELECT RAW t1.geo.alt FROM t t1))
FROM airport t;
[
{
"$1": 1
},
// ...
]
Further, the subquery is required to alias its from-term
to avoid conflict with the same identifier in both outer and inner queries.
For example, the following example Q6B shows an error:
SELECT array_length((SELECT RAW t1.geo.alt
FROM airport))
FROM airport;
[
{
"code": 4020,
"msg": "Duplicate subquery alias airport",
"query": "SELECT array_length((SELECT RAW t1.geo.alt\nFROM airport))\nFROM airport;"
}
]
An implicit keyspace alias is not considered as an expression, as in the above example Q6.
However, the nested paths are expressions.
In the following example Q6C, the FROM term has the nested path airport.geo
as an expression, where airport
refers to the implicit alias of the keyspace in the parent query.
Hence this is a correlated subquery, and the result is 1
corresponds to the current document in the parent query.
SELECT array_length((SELECT RAW t1.alt
FROM airport.geo t1))
FROM airport
LIMIT 1;
[
{
"$1": 1
}
]
If the expression does not resolve to any of the variables in scope for the subquery, then that is treated as keyspace identifier and subsequently if the keyspace is not found, an error is raised.
For example, in the following query hotel
is not defined in the scope and is treated as a new keyspace identifier.
SELECT * FROM landmark t1
WHERE t1.city IN (SELECT RAW city
FROM hotel
LIMIT 3
);
[
{
"t1": {
"activity": "see",
"address": "Prince Arthur Road, ME4 4UG",
"alt": null,
"city": "Gillingham",
"content": "Adult - £6.99 for an Adult ticket that allows you to come back for further visits within a year (children's and concessionary tickets also available). Museum on military engineering and the history of the British Empire. A quite extensive collection that takes about half a day to see. Of most interest to fans of British and military history or civil engineering. The outside collection of tank mounted bridges etc can be seen for free. There is also an extensive series of themed special event weekends, admission to which is included in the cost of the annual ticket.",
"country": "United Kingdom",
"directions": null,
"email": null,
"geo": {
"accuracy": "RANGE_INTERPOLATED",
"lat": 51.39184,
"lon": 0.53616
},
"hours": "Tues - Fri 9.00am to 5.00pm, Sat - Sun 11.30am - 5.00pm",
"id": 10019,
"image": null,
"name": "Royal Engineers Museum",
"phone": "+44 1634 822839",
"price": null,
"state": null,
"title": "Gillingham (Kent)",
"tollfree": null,
"type": "landmark",
"url": "http://www.remuseum.org.uk"
}
}
// ...
]
Nested Path Expressions in Subqueries
As mentioned in the FROM clause, the from-term
of both parent and subqueries allow nested path expressions over constants and subqueries.
However, only subqueries allow variable expressions (including paths), that are referenced through any variables defined in scope of the subquery.
This is very powerful for language expressibility, simplicity and flexibility to SQL++ queries.
Especially, when combined with subqueries, nested path expressions over variables extend full power of SQL++ syntax to array attributes/sub-documents without losing the structure of the array elements in results, or requiring tricky processing (with UNNEST, GROUP BY, ORDER BY and so).
See the examples below.
The usage of nested paths over keyspace identifiers is NOT allowed in the from-terms, and it results in a syntax error. Nested paths are always considered expressions in SQL++. |
Examples
The following query is valid because the nested path in the subquery is based on the explicit alias variable k1
in scope.
SELECT * FROM keyspace1 AS k1
WHERE (SELECT … FROM k1.subdoc1.subdoc2.field3 …);
The following query is invalid and raises an error because the nested path is in the outermost query:
SELECT * FROM keyspace1.subdoc1.subdoc2.field3 … ;
The subquery is based on the keyspace identifiers.
Note that, the outer query has explicit alias defined, and hence the keyspace1/keyspace2 in subquery from-term
are treated as identifiers.
SELECT * FROM keyspace1 AS k1
WHERE (SELECT … FROM keyspace1.subdoc1.field3 …);
SELECT * FROM keyspace1 AS k1
WHERE (SELECT … FROM keyspace2.subdoc2 …);
The following example shows usage of nested path over subquery expression.
SELECT x.alt
FROM (SELECT geo from airport
)[*].geo AS x
LIMIT 2;
[
{
"alt": 12
},
{
"alt": 295
}
]
The following example shows usage of nested path over constant expression.
SELECT x
FROM [{"a" : 1, "b" : {"c" : 2}},
{"a" : 3, "b" : {"d" : 4}}][*].b AS x
LIMIT 2;
[
{
"x": {
"c": 2
}
},
{
"x": {
"d": 4
}
}
]
The following two queries show valid and invalid examples with the airport
collection.
Note the nested paths used in the FROM clause of the subquery.
SELECT t1.city, t1.geo.alt
FROM airport t1
WHERE (SELECT RAW t2.alt
FROM airport.geo t2)[0] > 4000;
[
{
"code": 3000,
"msg": "Ambiguous reference to field travel-sample.",
"query_from_user": "SELECT t1.city, t1.geo.alt\nFROM `travel-sample` t1\nWHERE t1.type = \"airport\" AND \n(SELECT RAW t2.alt \n FROM `travel-sample`.geo t2)[0] > 4000;"
}
]
SELECT t1.city, t1.geo.alt
FROM airport t1
WHERE (SELECT RAW t2.alt FROM t1.geo t2)[0] > 4000;
[
{
"alt": 6537,
"city": "Grants"
},
{
"alt": 5045,
"city": "Prescott"
},
// ...
]
The following query demonstrates the power of using nested path expressions in correlated subqueries over the array subdocument hotel.reviews
.
The query Q9 finds the top 10 hotels and number of reviewers which have Overall rating at least 4 and rated by minimum 6 people.
SELECT name, cnt_reviewers
FROM hotel AS t
LET cnt_reviewers = (SELECT raw count(*)
FROM t.reviews AS s
WHERE s.ratings.Overall >= 4)[0]
WHERE cnt_reviewers >= 6
ORDER BY cnt_reviewers DESC
LIMIT 10;
[
{
"cnt_reviewers": 9,
"name": "Holiday Inn London Kensington Forum"
},
{
"cnt_reviewers": 9,
"name": "Campanile"
},
{
"cnt_reviewers": 9,
"name": "Drop in Chalets"
},
{
"cnt_reviewers": 9,
"name": "Cadogan Hotel"
},
{
"cnt_reviewers": 9,
"name": "Negresco"
},
{
"cnt_reviewers": 9,
"name": "Suites at Fisherman's Wharf"
},
{
"cnt_reviewers": 9,
"name": "Wyndham Parc 55 Hotel"
},
{
"cnt_reviewers": 8,
"name": "Lochmaddy Hotel"
},
{
"cnt_reviewers": 8,
"name": "Kensington West"
},
{
"cnt_reviewers": 8,
"name": "Ibis Hotel Stratford"
}
]
SELECT name, cnt_reviewers
FROM hotel AS t
LET cnt_reviewers = (SELECT raw count(*)
FROM hotel tmp
USE KEYS meta(t).id
UNNEST tmp.reviews s
WHERE s.ratings.Overall >= 4)[0]
WHERE cnt_reviewers >= 6
ORDER BY cnt_reviewers DESC
LIMIT 10;
The above query Q9A is an equivalent of query Q9 that does not use nested paths in the subquery FROM clause.
Therefore, it requires a USE KEYS clause on the same document as in the outer query, that is meta(t).id
where t
refers to the outer query document.
Nested paths cannot be used in the FROM clause and hence UNNEST is used on nested structures.