Correlated Subqueries

  • Capella Operational
  • concept
    +

    When a subquery is independent of its parent query, it is called a non-correlated subquery. This means the subquery does not depend on any of the variables or aliases defined in the outer query, nor the specific document being processed by the outer query. For instance, in examples Q1, Q2, and Q3 on the Subqueries page, the subquery has no correlation with its outer query.

    When a subquery refers to variables, aliases, attributes, or keyspace in the outer statement, it is called a correlated subquery. These subqueries are special because their performance characteristics can be very different from non-correlated queries. Typically, correlated queries are expensive, and query engines take special care in planning and executing such queries efficiently.

    Examples on this Page

    To use the examples on this page, you must set the query context to the inventory scope in the travel sample dataset. For more information, see Query Context.

    Performance

    For non-correlated queries, a subquery is evaluated once and resulting values are substituted in the parent query for all of the documents at the outer level queries. However, when a query includes a correlated subquery, the evaluation of the subquery depends on the outer query document being processed, because the subquery refers to some values or attributes in the outer query. It is similar to having variables in the subquery whose exact values are provided by the outer query, which might change for each of the documents considered at outer levels. Therefore, the subquery is executed repeatedly, once for each document that is selected in the outer query.

    SQL++ query engine optimizes the correlated subquery processing in multiple ways. For example:

    • Subqueries are prepared only once, though they are executed multiple times. This avoids repetitive costs such as parsing and planning of the subquery.

    • When a subquery is correlated through the FROM clause expression, fetching documents in the subquery is avoided. The query engine smartly reuses the correlated document already fetched in the outer query.

    Types of Correlation

    In SQL++, the way in which a subquery is correlated with its parent queries is very important. That dictates certain behaviors and limitations in writing nested subqueries, and impacts query performance.

    Correlation by Source (or FROM clause-Expression)

    The data source for a query or subquery is specified by its FROM clause. When the FROM clause of a subquery refers to any variables (aliases, keyspace names, LET/LETTING variables, or document attributes) in the scope of parent queries, then the correlation is established using the source keyspace in the FROM expression. Such subquery is called a Source Correlated Subquery and it offers the following benefits:

    Nested Paths in FROM clause

    Couchbase Capella enables you to use correlated nested paths in a subquery FROM clause. This provides powerful language expressibility, simplicity, and flexibility to SQL++ queries, especially when dealing with nested array attributes. See Nested Paths in Subqueries for more details.

    Better Performance

    When correlation is established through the FROM clause in the subquery (with variables in scope), then the SQL++ engine knows that the subquery is referring to the same document that is being processed in one of the outer queries. Therefore, the subquery avoids fetch of the documents used in the subquery. This significantly improves the performance of such subqueries, as shown in example Q6 on the Subqueries page. In contrast, example Q6A on the Subqueries page cannot take advantage of this optimization.

    Correlation by Reference (or non FROM clause-Expression)

    In this case the subquery refers to variables in the scope of the outer level query, in clauses other than the FROM clause of the subquery. In such case, the FROM clause will have an independent keyspace identifier that does not reference any variables in the scope. This kind of subquery execution works like a JOIN query and requires the USE KEYS clause. For more information, see USE KEYS in the Subquery and FROM clause in Subqueries.

    In the following example, the subquery in the LET clause of the parent query introduces correlation in the USE KEYS clause (referencing t1 fields) and the WHERE clause (t2.iata = t1.airline). This query finds the airline and route details of flights that have routes starting from SFO airport.

    Q10
    SELECT airline_details, t1.destinationairport, t1.stops
    FROM route t1
    LET airline_details = (SELECT t2.name, t2.callsign
                           FROM airline t2
                           USE KEYS t1.airlineid
                           WHERE t2.iata = t1.airline)
    WHERE t1.sourceairport = "SFO"
          AND ARRAY_LENGTH(airline_details) > 0
    LIMIT 2;
    Results
    [
      {
        "airline_details": [
          {
            "callsign": "JETBLUE",
            "name": "JetBlue Airways"
          }
        ],
        "destinationairport": "AUS",
        "stops": 0
      },
      {
        "airline_details": [
          {
            "callsign": "JETBLUE",
            "name": "JetBlue Airways"
          }
        ],
        "destinationairport": "BOS",
        "stops": 0
      }
    ]

    FROM Clause and USE KEYS in Correlated Subqueries

    In example Q10, notice the USE KEYS clause used to establish the correlation with the outer query documents. Without this, it is not possible to identify the documents in the subquery that are related to the specific document being considered by the outer query.

    It is important to understand the reasoning to include the USE KEYS clause. It is not always mandatory. It entirely depends on how the FROM clause is formulated, which indicates the source of documents for the subquery and hence the correlation with the parent query.

    When a keyspace name identifier is used in the FROM clause of a subquery, it refers to a collection of documents referenced by the keyspace identifier. However, when an alias of the keyspace is used in the FROM clause (or any other clauses of the query), it refers to an individual document of the keyspace being considered in the outer query.

    FROM Clause with Keyspace Identifier

    The USE KEYS clause is mandatory for the primary keyspace of the subquery when the FROM clause has keyspace identifier that is independent of any of the aliases or variables in scope. This is needed to establish correlation with the documents or keyspace used in the outer query. For example:

    • The FROM clause of the subquery in Q10 is an independent keyspace identifier airline, and hence the correlation with the parent query is established explicitly using the USE KEYS clause through the referential attribute t1.airlineid.

    • Similarly, the subquery in Q9A on the Subqueries page has an independent keyspace identifier airport in the FROM clause, but the correlation is self-referencing to the same document. Therefore, USE KEYS meta(t).id is used.

    This is exactly same as the Primary Key-Foreign Key relationship required to join two documents that are referenced in the outer and inner queries. Note that in the travel-sample data model, the "route" documents refer the "airline" documents using the attribute airlineid. Refer to the Data Model.

    FROM Clause with Expression

    The USE KEYS clause is not required in the subquery when the FROM clause in the subquery has a generic expression as its data source, and not a keyspace name identifier. The FROM clause expression can be:

    • An independent constant expression or subquery expression that does not refer to any variables in scope.

    • A generic SQL++ expression or subquery that refers to any variables in scope.

    In example Q9 on the Subqueries page, the FROM clause is an expression referring to the variable alias t (in fact the nested path t.reviews) that already establishes correlation, and hence the subquery does not need an explicit USE KEYS clause.

    Correlated Subquery versus JOINs

    Correlated subqueries can be alternatively formulated using JOINs, because conceptually, a correlated query execution involves the same steps as JOIN. For instance, evaluating a nested subquery corresponding to each outer query document is equivalent to a nested-loop-join operation. This is one reason for the above-mentioned mandatory requirement of the USE KEYS clause for certain correlated queries using a FROM clause.

    In general, SQL++ recommends the usage of JOIN queries when possible, instead of semantically equivalent correlated subqueries. However, in some cases it may be easier or intuitive to formulate some queries using subqueries (instead of JOINs). In such case, it is recommended to understand the EXPLAIN query plans and performance of both queries.

    Q10A: Earlier Q10 rewritten with JOIN
    SELECT DISTINCT airline.name, airline.callsign, route.destinationairport, route.stops, route.airline
    FROM route
          JOIN airline
          ON KEYS route.airlineid
    WHERE route.sourceairport = "SFO"
    LIMIT 2;