Function Author’s Guide
Python UDFs in Analytics are designed first and foremost to give maximum freedom to the user, and as such almost any Python code can be bound as a function. However, to achieve the best performance and security, it is good to keep in mind how Analytics runs and utilizes your Python code.
Security
Python UDFs are run as the same user as the Couchbase Analytics process itself. To enable maximum flexibility for cases where specialized hardware (e.g. GPUs) might be needed for the function, no additional sandboxing or other code and process isolation techniques are used. This is similar to an unfenced external function in a traditional RDBMS.
Therefore, any code that is part of a library should be very carefully vetted for the possibility of misuse or accidental damage to the Couchbase cluster. Once a function is exposed from a library via CREATE ANALYTICS FUNCTION by an administrator with the cluster.analytics!manage privilege, it becomes available for execution to anyone with the cluster.analytics!select privilege.
For the above reasons, library uploads are highly privileged and correspondingly restricted, as is the binding of those library artifacts into N1QL for Analytics via CREATE ANALYTICS FUNCTION. Library uploads can only be performed via the loopback interface by an administrator with the cluster.admin.diag!write privilege.
Execution Model
Analytics queries are deployed across the cluster as Hyracks jobs. A Hyracks job has a lifecycle that, for the purposes of UDFs, can be simplified to:
- A pre-run phase which allocates resources (open)
- The time during which the job has data flowing through it (nextFrame)
- Cleanup and shutdown (close)
If a N1QL for Analytics function is defined as a member of a class in the library, the class will be instantiated during open. The instance will exist in memory for the lifetime of the query. Therefore, if your function needs to reference files or other data that would be costly to load per call, initializing it as a member variable in the object's constructor will greatly improve the performance of the function.
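For example, a minimal sketch of this pattern, where a lookup file is read once in the constructor rather than on every call (the module, file, and function names here are hypothetical):
import os
from typing import Tuple

class lookup_model:
    def __init__(self):
        # Runs once per query partition, when the class is instantiated during open
        path = os.path.join(os.path.dirname(__file__), 'ids.txt')
        with open(path) as f:
            self.known_ids = set(f.read().splitlines())

    def is_known(self, arg: Tuple[str]) -> bool:
        # Runs once per tuple; only performs the cheap membership test
        return arg[0] in self.known_ids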
For each function invoked during a query, there will be an independent instance of the function per data partition. This means the function must not rely on any global state, nor make assumptions about the layout of the data. The function will execute with the same degree of parallelism as the data in the cluster (i.e. the number of Analytics data partitions).
After initialization, the function bound in the N1QL for Analytics function definition is called once per tuple during query execution (i.e. nextFrame). Unless the function specifies null-call in the WITH clause, NULL values will be skipped.
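If null-call is specified, NULL input is passed to the function instead of being skipped, so the function must handle it itself. A minimal sketch, assuming NULL arguments arrive as Python None (the class and function names are hypothetical):
from typing import Optional, Tuple

class length_model:
    def describe(self, arg: Tuple[Optional[str]]) -> str:
        # With null-call, the argument may be None and must be checked explicitly
        if arg[0] is None:
            return 'unknown'
        return 'long' if len(arg[0]) > 10 else 'short'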
At the close of the query, the function is torn down and not re-used in any way. All functions should assume that nothing will persist in memory beyond the lifetime of a query; any behavior contrary to this is undefined. Concretely, this means that the function should be free of side effects such as writing files or modifying remote state. The query optimizer makes no accommodation for such behavior, so any function that does this will behave unpredictably depending on how it is used in a query.
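As an illustration of what to avoid, a hypothetical function like the following, which writes to a file on every call, has no predictable behavior: it runs independently on every data partition, and the optimizer makes no guarantees about when or how often it is invoked.
class audit_model:
    def record(self, arg):
        # Anti-pattern: a side effect per call. Which node this runs on, and how
        # many times, depends entirely on how the query is planned and partitioned.
        with open('/tmp/udf_audit.log', 'a') as f:
            f.write(str(arg[0]) + '\n')
        return arg[0]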
Type Mappings
Currently only a subset of types are supported in Python UDFs. The supported types are as follows:
- Integer types (int8, int16, int32, int64)
- Floating point types (float, double)
- String
- Boolean
- Arrays, Sets (cast to lists)
- Objects (cast to dicts)
Unsupported types can be cast to these in N1QL for Analytics first in order to be passed to a Python UDF.
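On the Python side these values arrive as ordinary Python objects, with arrays and sets appearing as lists and objects appearing as dicts. A minimal sketch of a function consuming an object argument (the class, function, and field names are hypothetical):
from typing import Tuple

class order_model:
    def total(self, arg: Tuple[dict]) -> float:
        # The Analytics object argument arrives as a dict; its nested array as a list
        order = arg[0]
        return float(sum(item.get('price', 0.0) for item in order.get('items', [])))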
Packaging
Python UDFs need to be rolled into a shiv package with all their dependencies.
By default Couchbase Analytics will use its own integrated Python 3.9 interpreter.
This can be changed in the cluster config using the python.path configuration variable.
The purpose of using shiv is to bundle the library together with exactly the dependencies it requires. In this way, every library has its own dependencies residing with it on the server, which greatly reduces the possibility of conflicting versions of different packages.
shiv targets the architecture, platform, and Python version on which it is run. This will lead to issues if a library has dependencies that use native code (e.g. numpy) and the server has a different platform, such as Linux. shiv accepts most pip arguments, so the appropriate flags to retrieve wheels for the correct target platform can be passed.
For example, to target a Linux x86 platform for scikit-learn with Python 3.9, the command would look something like:
shiv -o lib.pyz --site-packages . --platform manylinux2010_x86_64 --python-version 39 --only-binary=:all: scikit-learn
The maximum allowed size of an uploaded library is determined by the maxWebRequestSize parameter in the Analytics Service configuration. If the packaged library is very large (512 MB or more), this limit may need to be raised.
Tutorial
First, devise a function that you would like to use, sentiment_mod.py:
import os
from typing import Tuple

class sent_model:
    def __init__(self):
        good_words = os.path.join(os.path.dirname(__file__), 'good.txt')
        with open(good_words) as f:
            self.whitelist = f.read().splitlines()

    def sentiment(self, arg: Tuple[str]) -> str:
        words = arg[0].split()
        for word in words:
            if word in self.whitelist:
                return 'great'
        return 'eh'
Furthermore, let’s assume 'good.txt' contains the following entries:
spam
eggs
ham
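Since this is plain Python, the module can be sanity-checked locally before packaging. As the signature above suggests, arguments are passed to the function as a tuple:
from sentiment_mod import sent_model

model = sent_model()
print(model.sentiment(('spam and eggs',)))  # great
print(model.sentiment(('toast',)))          # eh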
Now, in the module directory, execute shiv with all the dependencies of the module listed. This function doesn't actually use scikit-learn; it is included only as an example of a real module dependency. This command will wrap up the code that was just written, along with any dependencies it might have, so the server can keep the dependencies of each library separate from one another.
shiv -o lib.pyz --site-packages . scikit-learn
Then, deploy it with the library name pylib in the Default scope, or a scope of your choosing. This requires the cluster.admin.diag!write permission, which is normally only available to the Full Administrator role.
This request must also originate locally from a node that is running Analytics. Upload requests from non-local origins
are blocked for security purposes. See the Analytics Library REST API for complete details.
curl -X POST -u couchbase:couchbase -F "type=python" -F "data=@./lib.pyz" localhost:8095/analytics/library/Default/pylib
This request uploads and extracts the library across all nodes in the cluster, using a two-phase commit protocol to ensure that each node has the complete library. It may therefore take some time to return, depending on the size of the uploaded library. Once the server returns 200 OK, the upload is complete and the library is ready to use.
With the library deployed, you can define a function within it for use. For example, to expose the Python function sentiment, in the class sent_model in the module sentiment_mod, the DDL would be as follows:
CREATE ANALYTICS FUNCTION sentiment(a) AS "sentiment_mod", "sent_model.sentiment" AT pylib;
By default, Analytics will treat all external functions as deterministic. This means the function must return the same result for the same input, irrespective of when or how many times the function is called on that input. This particular function behaves the same way on each input, so it satisfies the deterministic property. This enables better optimization of queries that include this function.
If a function is not deterministic, then it should be declared as such by using a WITH sub-clause:
CREATE ANALYTICS FUNCTION sentiment(text) AS "sentiment_mod", "sent_model.sentiment" AT pylib WITH { "deterministic": false }
With the function now defined, it can then be used as any other scalar Analytics function would be. See User-Defined Functions for details of the N1QL for Analytics syntax for creating external functions.