How to build an OpenAI-compatible API | by Saar Berkovich | March 2024
We’ll start by implementing the non-streaming bit. Let’s begin by modeling our request:
from typing import List, Optional

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: str
    content: str


class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False
The Pydantic model represents the request from a client, and is meant to replicate the API reference. For the sake of brevity, this model doesn’t implement the entire spec, but only the bare bones needed for it to work. If you’re missing a parameter that’s part of the API specification (such as top_p), you can simply add it to the model.
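For example, here is a minimal sketch of the request model with top_p added (the default of 1.0 mirrors OpenAI’s documented default; imports and ChatMessage as defined above):

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    top_p: Optional[float] = 1.0  # newly added: nucleus sampling parameter
    stream: Optional[bool] = False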
In ChatCompletionRequest, we model the parameters that OpenAI uses in their chat completions. The chat API spec requires specifying a list of ChatMessage (the chat history; the client is usually in charge of keeping it and sending it back with every request). Each chat message has a role attribute (typically system, assistant, or user) and a content attribute containing the actual text of the message.
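For illustration, a hypothetical chat history that a client might send looks like this:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
    {"role": "user", "content": "And doubled?"},  # the full history is resent with each request
]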
Next, we’ll write our FastAPI chat completions endpoint:
import time

from fastapi import FastAPI

app = FastAPI(title="OpenAI-compatible API")


@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if request.messages and request.messages[0].role == "user":
        resp_content = "As a mock AI Assistant, I can only echo your last message:" + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there were no messages!"
    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }
Straightforward enough.
Testing our implementation
Assuming both code blocks are in a file named main.py, we’ll install two Python libraries in the environment of our choice (it’s always best to create a new one): pip install fastapi openai
and launch the server from a terminal:
uvicorn main:app
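If you prefer launching the server from Python rather than the CLI, uvicorn also exposes a run helper; a minimal sketch (8000 is uvicorn’s default port):

# run_server.py: a hypothetical alternative to the `uvicorn main:app` CLI call
import uvicorn

from main import app  # the FastAPI app defined above

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)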
From another terminal (or by launching the server in the background), we’ll open a Python console and copy-paste the following code, taken straight from OpenAI’s Python client reference:
from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000",  # change the default port if needed
)

# call API
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-1337-turbo-pro-max",
)

# print the top "choice"
print(chat_completion.choices[0].message.content)
If you did everything correctly, you should see the server’s response printed. It’s also worth inspecting the chat_completion object to verify that all the relevant attributes were sent by our server.
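One quick way to inspect it (assuming the v1 OpenAI Python client, whose response objects are Pydantic models):

# dump the full response object as formatted JSON
print(chat_completion.model_dump_json(indent=2))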
As LLM generation tends to be slow (computationally expensive), it’s worth streaming your generated content back to the client, so that the user can see the response as it’s being generated, without having to wait for it to finish. If you recall, we gave ChatCompletionRequest a boolean stream property, which allows the client to request that the data be streamed back to it, rather than sent all at once.
This makes things a bit trickier. We’ll create a generator function that wraps our mock response (in a real-world scenario, we would want a generator that is hooked up to our LLM generation):
import asyncio
import json


async def _resp_async_generator(text_resp: str):
    # let's pretend every word is a token and return it over time
    tokens = text_resp.split(" ")

    for i, token in enumerate(tokens):
        chunk = {
            "id": i,
            "object": "chat.completion.chunk",
            "created": time.time(),
            "model": "blah",
            "choices": [{"delta": {"content": token + " "}}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
        await asyncio.sleep(1)
    yield "data: [DONE]\n\n"
And now, we can modify our original endpoint to return a StreamingResponse when stream==True:
import time

from starlette.responses import StreamingResponse

app = FastAPI(title="OpenAI-compatible API")


@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if request.messages:
        resp_content = "As a mock AI Assistant, I can only echo your last message:" + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there wasn't one!"
    if request.stream:
        return StreamingResponse(_resp_async_generator(resp_content), media_type="application/x-ndjson")

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{"message": ChatMessage(role="assistant", content=resp_content)}]
    }
Testing the streaming implementation
After restarting the uvicorn server, we’ll open a Python console and enter this code (again, taken from OpenAI’s library docs):
from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000",  # change the default port if needed
)

stream = client.chat.completions.create(
    model="mock-gpt-model",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "")
You should see each word in the server’s response slowly printed out, simulating token-by-token generation. We can also inspect the last chunk object to verify that all the relevant attributes were sent.
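If you’re curious about what’s actually going over the wire, you can bypass the OpenAI client and read the raw server-sent events; a sketch using the requests library (an extra dependency, not part of the setup above):

import requests

# stream the raw response instead of letting the OpenAI client parse it
with requests.post(
    "http://localhost:8000/chat/completions",
    json={"messages": [{"role": "user", "content": "Say this is a test"}], "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())  # each event looks like: data: {...chunk JSON...}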
Putting it all together
Finally, below you can see the entire server code, combining the snippets above into a single main.py:
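# main.py: the full mock OpenAI-compatible server, combining the snippets above
import asyncio
import json
import time
from typing import List, Optional

from fastapi import FastAPI
from pydantic import BaseModel
from starlette.responses import StreamingResponse

app = FastAPI(title="OpenAI-compatible API")


class ChatMessage(BaseModel):
    role: str
    content: str


class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False


async def _resp_async_generator(text_resp: str):
    # let's pretend every word is a token and return it over time
    tokens = text_resp.split(" ")

    for i, token in enumerate(tokens):
        chunk = {
            "id": i,
            "object": "chat.completion.chunk",
            "created": time.time(),
            "model": "blah",
            "choices": [{"delta": {"content": token + " "}}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
        await asyncio.sleep(1)
    yield "data: [DONE]\n\n"


@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if request.messages:
        resp_content = "As a mock AI Assistant, I can only echo your last message:" + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there wasn't one!"
    if request.stream:
        return StreamingResponse(_resp_async_generator(resp_content), media_type="application/x-ndjson")

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{"message": ChatMessage(role="assistant", content=resp_content)}]
    }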