Data Enrichment

    March 16, 2025
    + 12

    Goal: Given a legacy document set containing attributes whose format makes them difficult to search on. In order to correct this search deficiency, new searchable attributes will be added to the document. These new attributes related to and can be calculated from the original attributes. On any mutation (a document creation or modification), the new attributes should also be created (or updated)

    Implementation: Implementation: Create a JavaScript function that contains an OnUpdate handler. The handler listens for mutations or data-changes within a specified "Listen To Location" (or source collection). When any document within the collection is created or modified, the Eventing Function executes a user-defined routine. In this example, if a new or updated document has two specifically named fields with IP addresses (representing the start and end of an address range), the Eventing Function uses get_numip_first_3_octets(ip) to convert those IPs into integers and adds them as new fields in the document.

    • Case 1: A new document is created in a specified, target collection: this new document is identical to the old, except that it has two new additional fields, which contain integers that correspond to the IP addresses. The original document, in the source collection, is not changed.

    • Case 2: The original document, in the source collection, is mutated (or changed) to have two additional fields, which contain integers that correspond to the IP addresses. In this case the Eventing Service automatically suppresses the recursive mutation.

    Preparations (Common):

    For this example, two (2) buckets bulk and rr100 are required where the latter is intended to be 100% resident. Create the buckets with a minimum size of 100MB. For information on buckets, see Create a Bucket. Within the buckets we need three (3) keyspaces bulk.data.source, bulk.data.target, and rr100.eventing.metadata (we loosely follow this organization).

    For the Function Scope or RBAC grouping, we will use the bulk.data, assuming you have the role of either "Full Admin" or "Eventing Full Admin". For standard or non-privileged users, refer to Eventing Role-Based Access Control.

    If you run a version of Couchbase prior to 7.0, you can create the buckets source, target, and metadata and run this example. Furthermore, if your cluster was subsequently upgraded from, say, 6.6.2 to 7.0, your data would be moved to source._default._default, target._default._default, and metadata._default._default and your Eventing Function would be seamlessly upgraded to use the new keyspaces and continue to run correctly.

    For complete details on how to set up your keyspaces, refer to creating buckets and creating scopes and collections.

    The Eventing Storage keyspace, in this case rr100.eventing.metadata, is for the sole use of the Eventing system, do not add, modify, or delete documents from it. In addition, do not drop or flush or delete the containing bucket (or delete this collection) while you have any deployed Eventing functions. In a single tenancy deployment this collection can be shared with other Eventing functions.

    Procedure (Case 1):

    1. Access the Couchbase Web Console  Buckets page and click the Scopes and Collections link of the bulk bucket.

      • Click Documents in the upper right banner for the data scope.

      • Select the keyspace bulk, data, source

      • You should see no user records.

      • Click Add Document in the upper right banner

      • For the ID in the Create New Document dialog, specify SampleDocument

        ID [ SampleDocument         ]
      • For the document body in the Create New Document dialog, the following text is displayed:

        {
        "click": "to edit",
        "with JSON": "there are no reserved field names"
        }
      • replace the above text with the following JSON document via a cut-n-paste

        {
        "country": "AD",
          "ip_start": "5.62.60.1",
          "ip_end": "5.62.60.9"
        }
      • Click Save.

    2. From the Couchbase Web Console  Eventing page, click ADD FUNCTION to add a new Function. The ADD FUNCTION dialog appears.

    3. In the ADD FUNCTION dialog, for individual Function elements provide the below information:

      • For the Function Scope drop-down, select bulk.data as the RBAC grouping.

      • For the Listen To Location drop-down, select bulk, data, source as the keyspace.

      • For the Eventing Storage drop-down, select rr100, eventing, metadata as the keyspace.

      • Enter case_1_enrich_ips as the name of the Function you are creating in the Function Name text-box.

      • Leave the "Deployment Feed Boundary" as Everything.

      • [Optional Step] Enter the text: On mutation create a new document in a different collection with additional fields, in the Description text-box.

      • For the Settings option, use the default values.

      • For the Bindings option, add two bindings.

        • For the first binding, select bucket alias, specify src as the alias name of the collection; select bulk, data, source as the associated keyspace, and select read only for the access mode.

        • For the second binding, select bucket alias, specify tgt as the alias name of the collection, select bulk, data, and target as the associated keyspace, and select read and write for the access mode.

      • After configuring your settings, the ADD FUNCTION dialog should look like this:

        enrichcase1 01 settings
    4. After providing all the required information in the ADD FUNCTION dialog, click Next: Add Code. The case_1_enrich_ips dialog appears.

      • The case_1_enrich_ips dialog initially contains a placeholder code block. You will substitute your actual case_1_enrich_ips code in this block.

        enrichcase1 02 editor with default
      • Copy the following function and paste it in the placeholder code block of the[.ui]case_1_enrich_ips dialog.

        javascript
        function OnUpdate(doc, meta) { log('document', doc); doc["ip_num_start"] = get_numip_first_3_octets(doc["ip_start"]); doc["ip_num_end"] = get_numip_first_3_octets(doc["ip_end"]); tgt[meta.id]=doc; } function get_numip_first_3_octets(ip) { var return_val = 0; if (ip) { var parts = ip.split('.'); //IP Number = A x (256*256*256) + B x (256*256) + C x 256 + D return_val = (parts[0]*(256*256*256)) + (parts[1]*(256*256)) + (parts[2]*256) + parseInt(parts[3]); return return_val; } }

        After pasting, the screen appears as displayed below:

        enrichcase1 03 editor with code
      • Click Save and Return.

    5. The OnUpdate routine specifies that when a change occurs to data within the bucket, the routine get_numip_first_3_octets is run on each document that contains ip_start and ip_end. A new document is created whose data and metadata are based on those of the document on which get_numip_first_3_octets is run; but with the addition of ip_num_start and ip_num_end data-fields, which contain the numeric values returned by get_numip_first_3_octets. The get_numip_first_3_octets routine splits the IP address, converts each fragment to a numeral, and adds the numerals together to form a single value, which it returns.

    6. From the Eventing screen, click the case_1_enrich_ips function to select it, then click Deploy.

      enrichcase1 03a deploy
      • In the Confirm Deploy Function, click Deploy Function.

    7. The Eventing function is deployed and starts running within a few seconds. From this point, the defined Function is executed on all existing documents and will also, more importantly, run on subsequent mutations.

    8. To check the results of the deployed Eventing Function:

      • Access the Couchbase Web Console  Buckets page and click the Scopes and Collections link of the bulk bucket.

      • Click btn[Documents] in the upper right banner for the data scope.

      • Select the keyspace bulk, data, target

      • Edit the document, and you will see a duplicate of the source bucket but with two new calculated fields as follows:

        {
          "country": "AD",
          "ip_end": "5.62.60.9",
          "ip_start": "5.62.60.1",
          "ip_num_start": 87964673,
          "ip_num_end": 87964681
        }
      • Click Cancel to close the editor.

    9. Because our Eventing Function is deployed, it will continue to process all new mutations, let’s test this out.

      • Access the Couchbase Web Console  Buckets page and click the Scopes and Collections link of the bulk bucket.

      • Click Documents in the upper right banner for the data scope.

      • Select the keyspace bulk,data, source

      • You should see one user record (the one we entered at the beginning of this procedure).

      • Click Add Document in the upper right banner

      • For the ID in the Create New Document dialog, specify AnotherSampleDocument

        ID [ AnotherSampleDocument  ]
      • For the document body in the Create New Document dialog, the following text is displayed:

        {
        "click": "to edit",
        "with JSON": "there are no reserved field names"
        }
      • replace the above text with the following JSON document via a cut-n-paste

        {
          "country": "RU",
          "ip_start": "7.12.60.1",
          "ip_end": "7.62.60.9"
        }
      • Click Save.

    10. To check results (which were updated in real time) by the deployed Eventing Function:

      • Access the Couchbase Web Console  Buckets page and click the Scopes and Collections link of the bulk bucket.

      • Click Documents in the upper right banner for the data scope.

      • Select the keyspace bulk, data, target

      • Edit the newly created document, and you will see a duplicate of the source bucket but with two new calculated fields as follows:

        {
          "country": "RU",
          "ip_end": "7.62.60.9",
          "ip_start": "7.12.60.1",
          "ip_num_start": 118242305,
          "ip_num_end": 121519113
        }
      • Click Cancel to close the editor.

    Procedure (Case 2):

    Undeploy the Eventing Function case_1_enrich_ips, if it’s running.
    1. Access the Couchbase Web Console  Eventing page and click the function name case_1_enrich_ips link of the source bucket.

      enrichcase1 03b undeploy
      • Click Undeploy

      • Click Undeploy Function to confirm.

    2. We assume that the two documents from Case 1 above exist in the source collection. If they don’t, please create them in the source collection.

      • Access the Couchbase Web Console  Buckets page and click the Scopes and Collections link of the bulk bucket.

      • Click Documents in the upper right banner for the data scope.

      • Select the keyspace bulk, data, source

      • You should see two user records (as previously created above).

        {
        "country": "AD",
          "ip_start": "5.62.60.1",
          "ip_end": "5.62.60.9"
        }
        {
          "country": "RU",
          "ip_start": "7.12.60.1",
          "ip_end": "7.62.60.9"
        }
    3. From the Couchbase Web Console  Eventing page, click ADD FUNCTION, to add a new Function. The ADD FUNCTION dialog appears.

    4. In the ADD FUNCTION dialog, for individual Function elements provide the below information:

      • For the Function Scope drop-down, select bulk.data as the RBAC grouping.

      • For the Listen To Location drop-down, select bulk, data, source as the keyspace.

      • For the Eventing Storage drop-down, select rr100, eventing, metadata as the keyspace.

      • Enter case_2_enrich_ips as the name of the Function you are creating in the Function Name text-box.

      • Leave the Deployment Feed Boundary as Everything.

      • [Optional Step] Enter the text: On mutation create a new document in the same collection with additional fields in the Description text-box.

      • For the Settings option, use the default values.

      • For the Bindings option, add two bindings.

        • For the only binding, select bucket alias, specify src as the alias name of the collection, select bulk, data, source as the associated keyspace, and select read and write for the access mode.

      • After configuring your settings, the ADD FUNCTION dialog should look like this:

        enrichcase2 01 settings
    5. After providing all the required information in the ADD FUNCTION dialog, click Next: Add Code. The case_2_enrich_ips dialog appears.

      • The case_2_enrich_ips dialog initially contains a placeholder code block. You will substitute your actual case_2_enrich_ips code in this block.

        enrichcase2 02 editor with default
      • Copy the following function and paste it in the placeholder code block of the case_2_enrich_ips dialog.

        javascript
        function OnUpdate(doc, meta) { log('document', doc); doc["ip_num_start"] = get_numip_first_3_octets(doc["ip_start"]); doc["ip_num_end"] = get_numip_first_3_octets(doc["ip_end"]); // !!! write back to the source bucket !!! src[meta.id]=doc; } function get_numip_first_3_octets(ip) { var return_val = 0; if (ip) { var parts = ip.split('.'); //IP Number = A x (256*256*256) + B x (256*256) + C x 256 + D return_val = (parts[0]*(256*256*256)) + (parts[1]*(256*256)) + (parts[2]*256) + parseInt(parts[3]); return return_val; } }

        After pasting, the screen appears as displayed below:

        enrichcase2 03 editor with code
      • Click Save and Return.

    6. The OnUpdate routine specifies that when a change occurs to data within the bucket, the routine get_numip_first_3_octets is run on each document that contains ip_start and ip_end. A new document is created whose data and metadata are based on those of the document on which get_numip_first_3_octets is run; but with the addition of ip_num_start and ip_num_end data-fields, which contain the numeric values returned by get_numip_first_3_octets. The get_numip_first_3_octets routine splits the IP address, converts each fragment to a numeral, and adds the numerals together, to form a single value; which it returns.

    7. From the Eventing screen, click the case_2_enrich_ips function to select it, then click btn:*Deploy].

      enrichcase2 03a deploy
      • In the Confirm Deploy Function, click Deploy Function.

    8. The Eventing function is deployed and starts running within a few seconds. From this point, the defined Function is executed on all existing documents and will also, more importantly, run on subsequent mutations. Unlike our first example, the documents that are the source of the mutations will be updated.

    9. To check results (which were updated in real time) by the deployed Eventing Function:

      • Access the Couchbase Web Console  Buckets page and click the Scopes and Collections link of the bulk bucket.

      • Click Documents in the upper right banner for the data scope.

      • Select the keyspace bulk, data, source

      • Edit the SampleDocument it will have been enriched or modified with two new calculated fields:

        {
          "country": "AD",
          "ip_end": "5.62.60.9",
          "ip_start": "5.62.60.1",
          "ip_num_start": 87964673,
          "ip_num_end": 87964681
        }
      • Edit the AnotherSampleDocument it will also have been enriched or modified with two new calculated fields:

        {
          "country": "RU",
          "ip_end": "7.62.60.9",
          "ip_start": "7.12.60.1",
          "ip_num_start": 118242305,
          "ip_num_end": 121519113
        }
      • Click Cancel to close the editor.

    10. Because our Eventing Function is deployed, it will continue to process all new mutations, let’s test this out.

      • Access the Couchbase Web Console  Buckets page and click the Scopes and Collections link of the bulk bucket.

      • Click Documents in the upper right banner for the data scope.

      • Select the keyspace bulk, data, source

      • Edit at AnotherSampleDocument again BUT change "ip_start" to "6.12.60.1"

        {
          "country": "RU",
          "ip_end": "7.62.60.9",
          "ip_start": "6.12.60.1",
          "ip_num_start": 118242305,
          "ip_num_end": 121519113
        }
      • Click Save to update the document and close the editor.

      • Edit at AnotherSampleDocument again and see the recalculation of "ip_num_start": 118242305 to "ip_num_start": 101465089 happened in real-time.

        {
          "country": "RU",
          "ip_end": "7.62.60.9",
          "ip_start": "6.12.60.1",
          "ip_num_start": 101465089,
          "ip_num_end": 121519113
        }
      • Click Cancel to close the editor.

    Cleanup (both Case 1 and Case 2):

    Go to the Eventing portion of the UI and undeploy the Function(s) case_1_enrich_ips and case_2_enrich_ips, this will remove the 1024 documents for each function from the rr100.eventing.metadata collection (in the Bucket view of the UI). Remember you may only delete the rr100.eventing.metadata keyspace if there are no deployed Eventing Functions.

    Now flush the 'bulk' bucket if you plan to run other examples (you may need to Edit the bucket 'bulk' and enable the flush capability).