How to create a new Text-to-Speech cloud function (low-level instructions)

Serhii Dolhopolov 24 Jan 2025
Vladyslav Korchenko

Overview

This documentation describes how to create a Text-to-Speech module as a cloud function. While the example uses Google Cloud Functions, other cloud providers can be used. The cloud function acts as an adapter for a text-to-speech service (e.g., PlayHT) and implements a predefined interface to integrate with the Add-on Mart as a Text-to-Speech module.

Supported cloud functions

The supported cloud function platforms are Oracle Cloud Functions, AWS Lambda, Azure Functions, and Google Cloud Functions.

Oracle cloud function example

Oracle cloud function specific example

How to create a Google Cloud Function

Google cloud function example

Google cloud function specific example

Authentication options for Google Cloud Function

IP-Based Access Control. Allow the function to be triggered without authentication, but only from a specific list of IP addresses.
- To implement IP-based access control, use Google Cloud Armor in conjunction with an HTTP(S) Load Balancer.
Bearer Token Authentication.
- Deploy the function with the --no-allow-unauthenticated flag.
- Assign appropriate roles (e.g., Cloud Functions Invoker) to the service account.
- Generate a key and share it with the PortaOne team so that the dispatcher application can generate a Bearer Token.
- Refer to Secure your Cloud Run function for detailed Google documentation.
Static Token Authentication. This option is used in the example code and may be deprecated in the future.
- Use a predefined token shared between the caller and the Cloud Function.
- Store the token securely as an environment variable in the Cloud Function.
- Validate the incoming request by comparing the Authorization header with the static token.
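
The static token check described above can be sketched as follows. This is a minimal illustration, not the code from the example repository: the helper name tokenMatches is hypothetical, the "Bearer " prefix is assumed from the request examples in this document, and the constant-time comparison is an added precaution rather than a stated requirement.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// tokenMatches reports whether an Authorization header value carries the
// expected static token. The "Bearer " prefix follows the request examples
// in this document; the constant-time comparison is an extra precaution.
func tokenMatches(authHeader, expected string) bool {
	const prefix = "Bearer "
	if expected == "" || !strings.HasPrefix(authHeader, prefix) {
		return false // an unset token must never authorize a request
	}
	token := strings.TrimPrefix(authHeader, prefix)
	return subtle.ConstantTimeCompare([]byte(token), []byte(expected)) == 1
}

func main() {
	// In the deployed function, expected would come from
	// os.Getenv("STATIC_AUTH_TOKEN") and authHeader from
	// r.Header.Get("Authorization").
	fmt.Println(tokenMatches("Bearer s3cret", "s3cret")) // true
	fmt.Println(tokenMatches("Bearer wrong", "s3cret"))  // false
}
```

A handler would typically answer a failed check with 401 Unauthorized before doing any other work.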

Code for the Google Cloud Function

Below is the example implementation of the required endpoints for the Text-to-Speech adapter using Golang. This code serves as a reference and may be adapted to suit other implementations or platforms; however, it must strictly adhere to the predefined interface to ensure compatibility.

Link to the repository: https://gitlab.portaone.com:8949/read-only/playht-adapter

General Guidelines

Project Structure: Organize the project based on the cloud function provider and chosen programming language. Refer to Write Cloud Run functions for detailed Google documentation.
HTTP Invocation: Ensure the function is invoked via an HTTP(S) request.
Error Handling: Return an appropriate error code along with a clear and descriptive error message. Use standard HTTP status codes (e.g. 4xx for client errors, 5xx for server errors).
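
One possible way to follow the error handling guideline is to map internal errors to HTTP status codes in a single place. The sketch below assumes two hypothetical sentinel errors (errInvalidRequest and errUpstream) and reuses the "success" and "error" fields from the response definitions; the choice of 502 for upstream failures is an assumption, not a requirement of the interface.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinel errors used to classify failures.
var (
	errInvalidRequest = errors.New("invalid request")
	errUpstream       = errors.New("text-to-speech service error")
)

// errorResponse carries the "success" and "error" fields shared by all
// response definitions in the interface description.
type errorResponse struct {
	Success bool   `json:"success"`
	Error   string `json:"error"`
}

// statusFor maps an error to an HTTP status: 4xx for client mistakes,
// 5xx for everything that went wrong on the server side.
func statusFor(err error) int {
	switch {
	case errors.Is(err, errInvalidRequest):
		return http.StatusBadRequest
	case errors.Is(err, errUpstream):
		return http.StatusBadGateway
	default:
		return http.StatusInternalServerError
	}
}

// writeError sends the status code together with a descriptive JSON body.
func writeError(w http.ResponseWriter, err error) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(statusFor(err))
	_ = json.NewEncoder(w).Encode(errorResponse{Success: false, Error: err.Error()})
}

func main() {
	err := fmt.Errorf("%w: field \"voice\" is required", errInvalidRequest)
	fmt.Println(statusFor(err), err) // 400 invalid request: field "voice" is required
}
```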

Required Endpoints

The following endpoints must be implemented to meet the requirements for the T2S subsystem:

/getLanguages - Returns a list of supported languages.
/getVoices - Returns a list of available voices for the specified language.
/synthesizeSpeech - Synthesizes speech from the given text using the specified voice.

Configure the function to serve all endpoints under a single URL. For example, in Go, use a switch statement on the request path to dispatch to the handler for each endpoint. See the provided example code.
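
As an illustration of serving all endpoints under a single URL, the sketch below dispatches on the last segment of the request path. The handler bodies are placeholders, handlerFor is a hypothetical helper, and matching on the last path segment is an assumption made so that the routing works whether or not the platform keeps the function name in the path. Registration with the platform's functions framework is omitted.

```go
package main

import (
	"fmt"
	"net/http"
	"path"
)

// Placeholder handlers; real ones would call the external service.
func getLanguages(w http.ResponseWriter, r *http.Request)     { fmt.Fprintln(w, "getLanguages") }
func getVoices(w http.ResponseWriter, r *http.Request)        { fmt.Fprintln(w, "getVoices") }
func synthesizeSpeech(w http.ResponseWriter, r *http.Request) { fmt.Fprintln(w, "synthesizeSpeech") }

// handlerFor selects a handler by the last path segment, or returns nil
// when the path does not name a known endpoint.
func handlerFor(urlPath string) http.HandlerFunc {
	switch path.Base(urlPath) {
	case "getLanguages":
		return getLanguages
	case "getVoices":
		return getVoices
	case "synthesizeSpeech":
		return synthesizeSpeech
	default:
		return nil
	}
}

// adapter is the single HTTP entry point for all three endpoints.
func adapter(w http.ResponseWriter, r *http.Request) {
	h := handlerFor(r.URL.Path)
	if h == nil {
		http.NotFound(w, r)
		return
	}
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	h(w, r)
}

func main() {
	for _, p := range []string{"/getVoices", "/playhtAdapter/synthesizeSpeech", "/unknown"} {
		fmt.Printf("%s -> routed: %t\n", p, handlerFor(p) != nil)
	}
}
```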

Request handlers

The function acts as an adapter to an external Text-to-Speech service: it implements the PortaOne interfaces and forwards requests to that service. The function must accept input as JSON in an HTTP POST request and return output as JSON. See the example request handler implementation.

If the external service does not provide an API for supported languages, maintain a static list locally. Validate incoming payloads to ensure all required fields are present. Respond with 400 Bad Request for missing or malformed data.
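
Payload validation for /synthesizeSpeech might look like the sketch below. Treating "input" and "voice" as required is an assumption (the interface description does not mark required fields), the function name parseSynthesizeRequest is hypothetical, and configuration_info is left out for brevity.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// synthesizeSpeechRequest holds only the fields checked in this sketch.
type synthesizeSpeechRequest struct {
	Input string `json:"input"`
	Voice string `json:"voice"`
}

// parseSynthesizeRequest decodes the JSON body and rejects incomplete
// payloads. A handler would answer any returned error with 400 Bad Request.
func parseSynthesizeRequest(body []byte) (synthesizeSpeechRequest, error) {
	var req synthesizeSpeechRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, fmt.Errorf("malformed JSON: %w", err)
	}
	if req.Input == "" {
		return req, errors.New(`field "input" is required`)
	}
	if req.Voice == "" {
		return req, errors.New(`field "voice" is required`)
	}
	return req, nil
}

func main() {
	_, err := parseSynthesizeRequest([]byte(`{"input": "Hello"}`))
	fmt.Println(err) // field "voice" is required

	req, err := parseSynthesizeRequest([]byte(`{"input": "Hello", "voice": "ruby"}`))
	fmt.Println(req.Voice, err) // ruby <nil>
}
```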

Data structures

To meet the requirements of the PortaOne interface and integrate with external Text-to-Speech services, the function must adhere to predefined data structures across all endpoints.

Define request and response structures for all endpoints to ensure compatibility. Example: /getVoices should receive a GetVoicesListRequest and return a GetVoicesListResponse.

Translate data from external APIs into the internal structures required by PortaOne. Example: convert voice data from an external API into the GetVoicesListResponse format.

Refer to the example data structure definitions.
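
To make the translation step concrete, here is a small sketch for /getVoices. The externalVoice type is a hypothetical stand-in and not the actual PlayHT schema; the string layout of each voice is copied from the /getVoices response example in this document, and the example repository may build it differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// getVoicesListResponse follows the GetVoicesListResponse definition.
type getVoicesListResponse struct {
	Success bool     `json:"success"`
	Error   string   `json:"error"`
	Voices  []string `json:"voices"`
}

// externalVoice is a hypothetical shape for a voice returned by an external
// API; it is not the actual PlayHT schema.
type externalVoice struct {
	ID, Name, Language, LanguageCode string
}

// toVoicesResponse keeps the voices for languageCode and renders each one in
// the string layout shown in the /getVoices response example.
func toVoicesResponse(voices []externalVoice, languageCode string) getVoicesListResponse {
	// Start from an empty slice so that "voices" serializes as [] and not null.
	resp := getVoicesListResponse{Success: true, Voices: []string{}}
	for _, v := range voices {
		if v.LanguageCode != languageCode {
			continue
		}
		resp.Voices = append(resp.Voices, fmt.Sprintf(
			"ID: %s, Name: %s, Language: %s, LanguageCode: %s",
			v.ID, v.Name, v.Language, v.LanguageCode))
	}
	return resp
}

func main() {
	voices := []externalVoice{
		{ID: "voice-1", Name: "Ruby", Language: "English (NZ)", LanguageCode: "en-NZ"},
		{ID: "voice-2", Name: "Arthur", Language: "English (GB)", LanguageCode: "en-GB"},
	}
	out, _ := json.MarshalIndent(toVoicesResponse(voices, "en-NZ"), "", "  ")
	fmt.Println(string(out))
}
```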

Logging

The log format is not strictly prescribed, but each log message should contain at least the following information:

timestamp when this particular log message is printed;
log level (info, warn, error, debug);
"trace_id" - extracted from the "x-portaone-trace-id" request header.

The main purpose of the "trace_id" property is to group the log entries related to a single user-triggered action, which makes log analysis easier. Write all logs to stdout in JSON format.
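
The logging requirements above can be met with the standard log/slog package (Go 1.21 or later), as in the sketch below. The helper names traceIDFrom and newLogger are hypothetical. The JSON handler emits "time", "level", and "msg" keys on its own and writes level names in upper case (INFO, WARN, ERROR, DEBUG); whether that casing is acceptable is an assumption to verify with the log consumers.

```go
package main

import (
	"io"
	"log/slog"
	"net/http"
	"os"
)

// traceIDFrom returns the value of the x-portaone-trace-id request header,
// or an empty string when the header is absent.
func traceIDFrom(h http.Header) string {
	return h.Get("x-portaone-trace-id")
}

// newLogger builds a JSON logger whose every record carries a timestamp,
// a level, and the given trace_id.
func newLogger(w io.Writer, traceID string) *slog.Logger {
	handler := slog.NewJSONHandler(w, &slog.HandlerOptions{Level: slog.LevelDebug})
	return slog.New(handler).With("trace_id", traceID)
}

func main() {
	// Simulate the headers of an incoming request.
	hdr := http.Header{}
	hdr.Set("x-portaone-trace-id", "abc-123")

	logger := newLogger(os.Stdout, traceIDFrom(hdr))
	logger.Info("voices requested", "language_code", "en-NZ")
	logger.Error("upstream call failed", "status", 502)
}
```

Creating the logger once per request, right after the trace ID is read, keeps "trace_id" on every message without passing it to each log call.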

Deployment example

To deploy the sample function, ensure the following environment variables are configured:

Mandatory environment variables:

PLAYHT_API_URL - The base URL for the PlayHT API
STATIC_AUTH_TOKEN - A static token used for authentication

Optional environment variables:

API_TIMEOUT - The timeout for API calls (default: 30 seconds)
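
Reading these variables at startup could be sketched as follows. The config type and loadConfig function are hypothetical, and the format of API_TIMEOUT is not specified in this document: the sketch assumes a Go duration string such as "45s", which may differ from what the example code accepts.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// config holds the settings read from the environment.
type config struct {
	PlayHTAPIURL    string
	StaticAuthToken string
	APITimeout      time.Duration
}

// loadConfig reads the variables through getenv (os.Getenv in production),
// fails when a mandatory one is missing, and defaults API_TIMEOUT to 30s.
// Assumption: API_TIMEOUT is a Go duration string such as "45s".
func loadConfig(getenv func(string) string) (config, error) {
	cfg := config{
		PlayHTAPIURL:    getenv("PLAYHT_API_URL"),
		StaticAuthToken: getenv("STATIC_AUTH_TOKEN"),
		APITimeout:      30 * time.Second,
	}
	if cfg.PlayHTAPIURL == "" || cfg.StaticAuthToken == "" {
		return cfg, errors.New("PLAYHT_API_URL and STATIC_AUTH_TOKEN must be set")
	}
	if raw := getenv("API_TIMEOUT"); raw != "" {
		d, err := time.ParseDuration(raw)
		if err != nil {
			return cfg, fmt.Errorf("invalid API_TIMEOUT %q: %w", raw, err)
		}
		cfg.APITimeout = d
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	fmt.Println("API timeout:", cfg.APITimeout)
}
```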

An example of a function deployment command:

gcloud functions deploy playhtAdapter \
--gen2 \
--region=europe-west1 \
--runtime=go122 \
--entry-point=adapter \
--trigger-http \
--allow-unauthenticated \
--set-env-vars PLAYHT_API_URL=https://api.play.ht/api/v2,STATIC_AUTH_TOKEN=******************

Text-to-Speech interface OpenAPI description

api_interface.swagger.json

{
	"swagger": "2.0",
	"info": {
		"title": "cpe/dispatcherService.proto",
		"version": "version not set"
	},
	"consumes": [
	  "application/json"
	],
	"produces": [
	  "application/json"
	],
	"paths": {
	  "/getLanguages": {
		"post": {
		  "summary": "Get a list of available languages",
		  "description": "This method returns a list of supported languages",
		  "responses": {
			"200": {
				"description": "A successful response.",
				"schema": {
				  "$ref": "#/definitions/GetLanguagesListResponse"
				}
			  }
		  },
		  "parameters": [
			{
			  "name": "body",
			  "in": "body",
			  "required": true,
			  "schema": {
				"$ref": "#/definitions/GetLanguagesListRequest"
			  }
			}
		  ]
		}
	  },
	  "/getVoices": {
		"post": {
		  "summary": "Get a list of available voices",
		  "description": "This method returns a list of available voices for the specified language",
		  "responses": {
			"200": {
				"description": "A successful response.",
				"schema": {
				  "$ref": "#/definitions/GetVoicesListResponse"
				}
			  }
		  },
		  "parameters": [
			{
			  "name": "body",
			  "description": "ListVoicesForLanguage request.",
			  "in": "body",
			  "required": true,
			  "schema": {
				"$ref": "#/definitions/GetVoicesListRequest"
			  }
			}
		  ]
		}
	  },
	  "/synthesizeSpeech": {
		"post": {
		  "summary": "Synthesize speech from text using a specific voice",
		  "description": "This method synthesizes speech from the given text using the specified voice",
		  "responses": {
			"200": {
				"description": "A successful response.",
				"schema": {
				  "$ref": "#/definitions/SynthesizeSpeechResponse"
				}
			}
		  },
		  "parameters": [
			{
			  "name": "body",
			  "description": "The synthesize speech request.",
			  "in": "body",
			  "required": true,
			  "schema": {
				"$ref": "#/definitions/SynthesizeSpeechRequest"
			  }
			}
		  ]
		}
	  }
	},
	"definitions": {
	  "AuthInfo": {
		"type": "object",
		"properties": {
		  "apiKey": {
			"type": "string",
			"description": "An external Text-to-Speech service token."
		  }
		}
	  },
	  "ConfigurationInfo": {
		"type": "object",
		"properties": {
		  "authInfo": {
			"$ref": "#/definitions/AuthInfo"
		  }
		},
		"description": "Configuration info about the external service."
	  },
	  "GetLanguagesListRequest": {
		"type": "object",
		"properties": {
		  "configurationInfo": {
			"$ref": "#/definitions/ConfigurationInfo"
		  }
		}
	  },
	  "GetLanguagesListResponse": {
		"type": "object",
		"properties": {
		  "success": {
			"type": "boolean",
			"description": "The flag shows whether the request was successfully processed."
		  },
		  "error": {
			"type": "string",
			"description": "The error message."
		  },
		  "languages": {
			"type": "array",
			"items": {
			  "type": "string"
			},
			"description": "The list of supported languages."
		  }
		},
		"description": "The response for the /getLanguages method."
	  },
	  "GetVoicesListRequest": {
		"type": "object",
		"properties": {
		  "configurationInfo": {
			"$ref": "#/definitions/ConfigurationInfo"
		  },
		  "languageCode": {
			"type": "string",
			"description": "BCP-47 language tag (e.g., 'en-NZ', 'fr-FR')."
		  }
		},
		"description": "/getVoices request."
	  },
	  "GetVoicesListResponse": {
		"type": "object",
		"properties": {
		  "success": {
			"type": "boolean",
			"description": "The flag shows whether the request was successfully processed."
		  },
		  "error": {
			"type": "string",
			"description": "The error message."
		  },
		  "voices": {
			"type": "array",
			"items": {
			  "type": "string"
			},
			"description": "The list of supported voices."
		  }
		},
		"description": "The response for the /getVoices method."
	  },
	  "SynthesizeSpeechRequest": {
		"type": "object",
		"properties": {
		  "configurationInfo": {
			"$ref": "#/definitions/ConfigurationInfo"
		  },
		  "voice": {
			"type": "string",
			"description": "Description of which voice to use for a synthesis request."
		  },
		  "input": {
			"type": "string",
			"description": "Text to synthesize into speech."
		  },
		  "audioConfig": {
			"type": "string",
			"description": "Description of audio data to be synthesized."
		  }
		},
		"description": "The /synthesizeSpeech request."
	  },
	  "SynthesizeSpeechResponse": {
		"type": "object",
		"properties": {
		  "success": {
			"type": "boolean",
			"description": "The flag shows whether the request was successfully processed."
		  },
		  "error": {
			"type": "string",
			"description": "The error message."
		  },
		  "audioContent": {
			"type": "string",
			"description": "A base64-encoded string."
		  }
		},
		"description": "The /synthesizeSpeech response."
	  }
	}
  }

Request Examples

Note that the cloud function example expects PlayHT credentials in the following format: API_KEY::USER_ID
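
Splitting the combined credential into its two parts might be done as in the sketch below. The function name splitCredentials is hypothetical, and rejecting values with an empty key or user ID is an assumption about the desired strictness.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitCredentials separates an "API_KEY::USER_ID" value at the first "::"
// and rejects values where either part is missing.
func splitCredentials(apiKeyField string) (string, string, error) {
	apiKey, userID, found := strings.Cut(apiKeyField, "::")
	if !found || apiKey == "" || userID == "" {
		return "", "", errors.New(`api_key must have the form "API_KEY::USER_ID"`)
	}
	return apiKey, userID, nil
}

func main() {
	key, user, err := splitCredentials("my-key::my-user")
	fmt.Println(key, user, err) // my-key my-user <nil>
}
```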

/getLanguages

curl -s -X POST "https://<your-function-endpoint>/getLanguages" \
-H "Authorization: Bearer ***" \
-H "Content-Type: application/json" \
-d '{
    "configuration_info": {
      "auth_info": {
        "api_key": "<API_KEY::USER_ID>"
      }
    }
}' | jq
{
  "success": true,
  "error": "",
  "languages": [
    "en-US",
    "en-CA",
    "en-IN",
    "en-IE",
    "en-GB",
    "en-AU",
    "en-ZA",
    "en-FI",
    "en-FR",
    "en-IT",
    "en-MX",
    "en-NZ"
  ]
}

/getVoices

curl -s -X POST "https://<your-function-endpoint>/getVoices" \
-H "Authorization: Bearer ***" \
-H "Content-Type: application/json" \
-d '{
    "configuration_info": {
      "auth_info": {
        "api_key": "API_KEY::USER_ID"
      }
    },
    "language_code": "en-NZ"
}' | jq
{
  "success": true,
  "error": "",
  "voices": [
    "ID: s3://voice-cloning-zero-shot/d9ff78ba-d016-47f6-b0ef-dd630f59414e/female-cs/manifest.json, Name: Ruby, Language: English (NZ), LanguageCode: en-NZ"
  ]
}

/synthesizeSpeech

curl -s -X POST "https://<your-function-endpoint>/synthesizeSpeech" \
-H "Authorization: Bearer ***" \
-H "Content-Type: application/json" \
-d '{
    "configuration_info": {
      "auth_info": {
        "api_key": "API_KEY::USER_ID"
      }
    },
    "input": "Test of text to speech Google Cloud Function",
    "voice": "s3://voice-cloning-zero-shot/d9ff78ba-d016-47f6-b0ef-dd630f59414e/female-cs/manifest.json"
}' | jq
{
  "success": true,
  "error": "",
  "audio_content": "<base64-encoded string>"
}
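
On the caller's side, the audio can be recovered from the /synthesizeSpeech response as sketched below. The function name decodeAudio is hypothetical. The "audio_content" field name follows the response example above (the interface description spells it "audioContent"), and standard base64 encoding is assumed.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// synthesizeSpeechResponse uses the field names from the response example.
type synthesizeSpeechResponse struct {
	Success      bool   `json:"success"`
	Error        string `json:"error"`
	AudioContent string `json:"audio_content"`
}

// decodeAudio parses the response body and returns the raw audio bytes,
// or an error when the adapter reported a failure.
func decodeAudio(body []byte) ([]byte, error) {
	var resp synthesizeSpeechResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("malformed response: %w", err)
	}
	if !resp.Success {
		return nil, errors.New("synthesis failed: " + resp.Error)
	}
	return base64.StdEncoding.DecodeString(resp.AudioContent)
}

func main() {
	// "aGVsbG8=" is the base64 encoding of "hello", standing in for audio bytes.
	audio, err := decodeAudio([]byte(`{"success": true, "error": "", "audio_content": "aGVsbG8="}`))
	fmt.Println(len(audio), err) // 5 <nil>
}
```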
