Querying Time-Series data in Couchbase with Window Functions

    +

    Overview

    This tutorial will show a time-series report about user event activity using both window functions and common table expressions (CTEs). The ultimate goal is to build a report that can help show any trends or patterns in user activity. This tutorial will not focus on visualization, but the end result could look something like this:

    Data visualization

    This tutorial is part of the main User Profile tutorial (both .NET and Java). You should be able to follow along if you haven’t gone through those tutorials yet, but reading those tutorials first will give you more context.

    This tutorial contains 5 parts:

    Setup and Prerequisites

    This tutorial uses window functions, which were first introduced to Couchbase Server in version 6.5.

    Additionally, the data being queried in this tutorial comes from the main User Profile tutorial (.NET and Java). If you are following along with those tutorials, you should have a data set that contains user events. For reference, the user profile store should contain many user event documents which look like this:

    {
      "createdDate": "2019-08-11T16:54:03.7227615-04:00",
      "eventType": 1,
      "userId": "user::70231780-3eef-40d6-b689-0a0d6455824e",
      "type": "userEvent"
    }
    
    OR
    
    {
      "createdDate": 1565691905000,
      "eventType": 1,
      "userId": "user::70231780-3eef-40d6-b689-0a0d6455824e",
      "type": "userEvent"
    }

    The rest of this tutorial assumes you have a number of these type of documents in a bucket called user_profile.

    If you don’t, you can load this sample data set: user-profile-sample.json file (large JSON file).

    cbimport json -c localhost -u Administrator -p
    password -b user_profile -d file://user-profile-sample.json -f list -g %id%

    After this is complete, you should see a message like this:

    Json `file://user-profile-sample.json` imported to `http://localhost:8091` successfully
    Documents imported: 25000 Documents failed: 0

    For more details on loading this sample data set, see the indexing user profiles tutorial.

    Introduction to Window Functions

    Window functions are useful for reporting aggregations or cumulative data. The OVER clause defines a window over which these values will be calculated.

    There are a number of window functions available in Couchbase Server 6.5+. This tutorial will only be showing basic usage of OVER, but please refer to the full window functions documentation for details about functions like CUME_DIST, DENSE_RANK, ROW_NUMBER, and much more.

    Within the OVER clause, PARTITION specifies the field that describes the window. For example, to get a sum of the number of activities by a time segment:

    SELECT
        e.segment,
        SUM(e.numEvents) OVER(PARTITION BY e.segment) AS numEvents
    FROM /* ... etc ... */

    Introduction to Common Table Expressions

    A common table expression is a way to define a temporary set of results which can be referenced later in the query. Most often this is used to simplify a query when a result set is used several times.

    Use the WITH clause to create a common table expression. Please refer to the complete documentation on the WITH clause for more information.

    For this tutorial, notice that each user event has a createdDate field which contains a timestamp. However, for reporting purposes, it would be more useful to group these events into some segment (like a 15 minute interval).

    One approach might be to calculate the segment (based on the timestamp) and put those results into a CTE for further querying. For example:

    WITH segmentedUserEvents AS (
        SELECT
            DATE_ADD_STR(DATE_TRUNC_STR(u.createdDate,"hour"), ROUND(DATE_PART_STR(u.createdDate,"minute")/15,0)*15,"minute") as segment,
            u.eventType,
            COUNT(*) AS numEvents
        FROM user_profile u
        WHERE u.type = 'userEvent'
        GROUP BY DATE_ADD_STR(DATE_TRUNC_STR(u.createdDate,"hour"), ROUND(DATE_PART_STR(u.createdDate,"minute")/15,0)*15,"minute"), u.eventType
    )
    SELECT /* ... etc ... */

    After the initial WITH, the remainder of the query can refer to segmentedUserEvents instead of repeating the entire contents. This allows a query writer to simplify the entire query, as well as isolate certain components.

    Note that the combination of DATE_ADD_STR, DATE_TRUNC_STR, and ROUND returns a "segment" date, rounded to 15 minute segments. Examples are: "2019-08-11T15:30:00-04:00", "2019-08-11T15:45:00-04:00", etc.

    If you are storing dates as milliseconds instead of a string, you will need to use a different set of DATE functions. For example: DATE_ADD_MILLIS(DATE_TRUNC_MILLIS(u.createdDate,"hour"), ROUND(DATE_PART_MILLIS(u.createdDate,"minute")/15,0)*15,"minute")

    Querying User Events

    Putting together this CTE with the OVER(PARTITION …​) from earlier, build a query that will find the number of certain types of events that occurred within each segment of time.

    WITH segmentedUserEvents AS (
        SELECT
            DATE_ADD_STR(DATE_TRUNC_STR(u.createdDate,"hour"), ROUND(DATE_PART_STR(u.createdDate,"minute")/15,0)*15,"minute") as segment,
            u.eventType,
            COUNT(*) AS numEvents
        FROM user_profile u
        WHERE u.type = 'userEvent'
        GROUP BY DATE_ADD_STR(DATE_TRUNC_STR(u.createdDate,"hour"), ROUND(DATE_PART_STR(u.createdDate,"minute")/15,0)*15,"minute"), u.eventType
    )
    SELECT
        e.segment,
        SUM(e.numEvents) OVER(PARTITION BY e.segment) AS numEvents
    FROM segmentedUserEvents e
    AND e.segment BETWEEN '2019-08-11' AND '2019-08-12'
    WHERE e.eventType = 1
    ORDER BY e.segment;

    This query uses a CTE to segment each user event. It then uses OVER to sum up the number of events in each segment. The raw data result should look similar to:

    [
      { "numEvents": 5, "segment": "2019-08-11T15:30:00-04:00" },
      { "numEvents": 5, "segment": "2019-08-11T15:45:00-04:00" },
      { "numEvents": 8, "segment": "2019-08-11T16:00:00-04:00" },
    
      /* ... etc ... */
    
    ]

    This query and/or the resulting data can be fed directly into a data visualization tool. This screenshot shows a simple webpage using a JavaScript visualization tool.

    Data visualization

    Upon clicking "Load", the tool makes an HTTP request to an endpoint that executes the above query.

    The full source code for this tutorial is available on GitHub.

    Summary

    In this mini-tutorial, we looked at some of the tools available in N1QL to perform time-series queries.

    For more information about the full capabilities of window functions and common table expressions in N1QL, check out the documentation:

    If you have questions or problems, you can always get help from humans on the N1QL category on the Couchbase Forums.

    Return to the parent User Profile tutorial: