The developer’s guide to building AI apps: Part 1

Andrew Tate
Technical writer

Jul 19, 2024

In every quirky-named meeting room, in every Zoom call, in every Slack, one question is almost certainly being asked:

“How are we going to ‘do AI’?”

Every type of business, from big banks to industrial manufacturers to your friendly neighborhood SaaS company, is working out how to leverage AI. AI ideas come at you fast in these meetings—an emoji-only chatbot, a pirate resume generator, an absurdist product copywriter, or maybe even more practical ideas like an enterprise search tool, workflow automation, or code generation that can drive real productivity gains for business users.

But someone still has to go and make the thing happen.

Turning ideas into execution and execution into ROI is no small feat with any product. With AI, there’s the added difficulty of building on shifting sands. The AI landscape is evolving rapidly, with new models, tools, and techniques emerging constantly. Developers have to build the infrastructure around their AI and tune the model to the task—whatever it is—without a whole lot of tried-and-true best practices to work with, and it can feel harder than it ought to be.

Let’s demystify the process of building effective AI apps and talk about how to pull the starting line forward. In this first part of a two-part series, we’ll walk you through the decision points of building AI apps, the tools you’ll need, and considerations to weigh as you build.

The emerging AI stack

One of the reasons that AI has been able to “take over” quickly is the developer experience. For popular models like OpenAI’s GPT models, getting to that “wow” moment is quick. It just takes this:

import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const completion = await openai.chat.completions.create({
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What is retool?" },
    ],
    model: "gpt-3.5-turbo",
  });

  console.log(completion.choices[0]);
}

main();

And you get this completion:

{
  index: 0,
  message: {
    role: 'assistant',
    content: 'Retool is an application development platform that helps teams build internal tools quickly. It provides a drag-and-drop interface to build user interfaces and connect them to data sources such as databases, APIs, and more. Retool enables software developers to create tools for various purposes without having to write code from scratch.'
  },
  logprobs: null,
  finish_reason: 'stop'
}

Sold! You could do this in a minute during an ideas meeting.

But a proof of concept isn’t production-level code, and “here’s an idea” isn’t an ROI-driving app. As readily available generative AI has matured over the past 18 months, a stack has emerged to help developers deploy AI reliably in production.

Building a successful AI-powered app means choosing the right model, getting your data into that model, and creating a user interface that allows someone on the other side to experience the model intuitively. Here’s how we recommend assembling your AI stack for your first AI app:

Data: Going beyond general outputs

Language models are generalists. They are typically trained on vast amounts of data from across the internet—including text, audio, video, and images. You can head to ChatGPT or Claude right now to chat with the model and get a reasonable answer if that answer is readily available on the web.

However, an off-the-shelf model on its own will rarely work in a business environment. The point of building AI apps is to build something people need, even if that is a pirate resume generator (someone should make this).

You’ll need to add your own data to the model. There are three ways to think about doing this.

1. Adding data via the context window

For some implementations, all the data you need can be added to the “context window” at run time. The context window is the pretty name for the prompt. Modern models have huge context windows—for example, Google Gemini boasts a context window of 1M tokens (if you’re brave enough to use it via Google Vertex). That means you could copy and paste War and Peace into the prompt and still have room to ask, “Who did it?” (spoiler alert: it was Napoleon).

This option is good for general tasks, one-off needs, and summarization. For instance, if you need to analyze a PDF or single document, you can just paste the whole document into the prompt and ask the model to summarize it or extract key information. But this isn’t a terribly reusable or reliable option for building AI apps.
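
As a rough sketch of what this looks like in code (using OpenAI’s Python SDK; the file name and prompt are just placeholders), you read the document and drop it straight into the messages:

from openai import OpenAI

client = OpenAI()

# Hypothetical example: read a document and paste it into the prompt.
# "contract.txt" is a placeholder; it just needs to fit in the context window.
with open("contract.txt") as f:
    document = f.read()

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that summarizes documents."},
        {"role": "user", "content": f"Summarize the key points of this document:\n\n{document}"},
    ],
)
print(completion.choices[0].message.content)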

2. Fine-tuning a model

Fine-tuning is taking a pre-trained model and further training it on a smaller dataset specific to your use case. This allows you to adapt the model to your domain and improve its performance on your task.

To fine-tune a model, you’ll need:

  • A pre-trained model (like Claude 3 or GPT-4o—we were feeling retro and used GPT-3.5 Turbo in the samples)
  • A dataset specific to your use case
  • A machine with a GPU for training (if you’re fine-tuning a model you host yourself; hosted services like OpenAI’s fine-tuning API handle the compute for you)

The process typically involves the following steps:

  • Prepare your dataset. Collect and clean your data, then format it to be compatible with the model. Usually, this means the JSONL format, with one training example per line:
{"messages": [{"role": "system", "content": "Pegleg is a resume generator that turns regular sentences into pirate speak."}, {"role": "user", "content": "I spent 3 years at Deloitte"}, {"role": "assistant", "content": "Avast, ye landlubbers. 3 years on the seas of big four, aboard the good ship Deloitte did I spend"}]}
{"messages": [{"role": "system", "content": "Pegleg is a resume generator that turns regular sentences into pirate speak."}, {"role": "user", "content": "My role was Senior Consultant"}, {"role": "assistant", "content": "Me duties were that of a Senior, navigatin' the treacherous waters of consultant-seas."}]}
{"messages": [{"role": "system", "content": "Pegleg is a resume generator that turns regular sentences into pirate speak."}, {"role": "user", "content": "I 2X'ed my KPIs during this time"}, {"role": "assistant", "content": "Aye, during me tenure, I doubled me key performance indicators, provin' me worth as a scurvy buccaneer of the corporate world."}]}
...
  • Set up the fine-tuning script. Next, create a script that uploads your dataset and starts a fine-tuning job against the pre-trained model. Here’s an example using the OpenAI API and Python:
from openai import OpenAI

client = OpenAI()

file = client.files.create(
    file=open("pirate.jsonl", "rb"),
    purpose="fine-tune"
)

job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo"
)
  • Monitor the fine-tuning progress. Fine-tuning can take some time, depending on the size of your dataset and the complexity of the task. You can monitor the progress using the OpenAI API or their web interface.
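For example, a quick way to check on a job from the same script is to poll its status. This is just a sketch using the OpenAI Python SDK, where job is the object returned by client.fine_tuning.jobs.create above:

import time

# Poll the fine-tuning job until it reaches a terminal state. In production
# you'd probably check on a schedule rather than block in a loop.
while True:
    job_status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job_status.status}")
    if job_status.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

# Name of the resulting model, used in the next step
print(job_status.fine_tuned_model)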
  • Use the fine-tuned model. Once the fine-tuning is complete, you can use the fine-tuned model just like the base model, passing prompts to the fine-tuned model to get responses tailored to your domain.
completion = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:org:custom_suffix:id",
    messages=[
        {"role": "system", "content": "Pegleg is a resume generator that turns regular sentences into pirate speak."},
        {"role": "user", "content": "I have strong communication and presentation skills"}
    ]
)
print(completion.choices[0].message)

# output
# ChatCompletionMessage(content='Me parley with ease and spin a tale with the gift of gab, me communication and presentation skills be the envy of any crew', role='assistant', function_call=None, tool_calls=None)

Fine-tuning can significantly improve a model’s performance on a specific task but requires substantial domain-specific data and computational resources.

3. Using embeddings to add your data to a model

Embeddings are a way to represent your data in a high-dimensional vector space where similar data points are closer together. By generating embeddings for your data and the user’s query, you can quickly find the most relevant data to use as context for the model.

Let’s run through a specific example: retrieval-augmented generation, or RAG.

RAG combines embeddings and language models to generate more accurate and relevant responses. It allows the model to access external knowledge at runtime without fine-tuning. For example, if you want to include references to a set of documentation, citations from a collection of scientific papers, or specific quotes from a news article in your responses, you would want to retrieve the relevant portions of those documents and give that to a language model as part of the context of your query. This is also the easiest way to “teach” language models information that has been created past their knowledge cutoff date.

Let’s say we wanted to give a model more context based on Retool’s documentation. Here’s how we would do it:

  • Prepare the data. We’d need to collect and preprocess our documentation using NLP techniques. Likely, we’d need to split the data into smaller chunks (e.g., paragraphs or sentences) that can be used as context for the model.
  • Generate embeddings. Here, we’d use a model like OpenAI's text-embedding-3-small to generate embeddings for each chunk of your data. Then, we’d store these embeddings in a vector database like Pinecone, along with the corresponding text chunks.
import os

from openai import OpenAI
from pinecone import Pinecone, PodSpec

client = OpenAI()

# Generate an embedding for a piece of text
text = "To get started with Chat actions for apps, add a new AI Action query. For workflows, add a new AI Action block."
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)

embedding = response.data[0].embedding
print(embedding)

# output
# [-0.01288499, -0.008340877, 0.004680492, -0.014892475, 0.008351787, -0.07519336, 0.006687976, -0.015972588, 0.006246111, -0.014597898, ...]

# Store the embedding in a Pinecone index alongside its text chunk
pc = Pinecone(api_key='<<PINECONE_API_KEY>>')
pc.create_index(
    name="example-index",
    dimension=1536,
    metric="cosine",
    spec=PodSpec(
        environment='us-west-2',
        pod_type='p1.x1'
    )
)

index = pc.Index(host=os.environ.get('INDEX_HOST'))
upsert_response = index.upsert(
    vectors=[("vec1", embedding)],
)
  • Implement the retrieval step. When a user makes a query, generate an embedding for the query using the same embedding model, then use the vector database to find the top-k chunks most similar to the query embedding, based on a similarity metric like cosine similarity.
query = "How do I use AI with Retool?"

# Embed the query with the same model used for the documents
res = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
)
xq = res.data[0].embedding

# Retrieve the most relevant contexts from Pinecone
res = index.query(vector=xq, top_k=2, include_metadata=True)
  • Implement the generation step. Concatenate the retrieved text chunks with the user’s query to form the input for the language model, then pass this input to an LLM to generate a response, as sketched below.
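
Here’s roughly what that generation step could look like, continuing from the retrieval code above. This is a sketch: it assumes each vector was upserted with its source text in a metadata field called "text", which is a naming choice of ours, and the upsert example above would need to include that metadata for this to work.

# Pull the stored text out of the retrieved matches. Assumes each vector was
# upserted with metadata={"text": <chunk text>} when it was indexed.
contexts = [match.metadata["text"] for match in res.matches]

# Concatenate the retrieved chunks with the user's query...
prompt = (
    "Answer the question using the context below.\n\n"
    "Context:\n" + "\n---\n".join(contexts) + "\n\n"
    "Question: " + query
)

# ...and pass the combined input to the language model
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You answer questions about Retool using the provided context."},
        {"role": "user", "content": prompt},
    ],
)
print(completion.choices[0].message.content)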

Using RAG, your AI app can generate more informed and relevant responses to the user's query, as the model has access to pertinent external knowledge at runtime. This approach is handy when you have a large amount of data that can't be easily fine-tuned or when you need your model to handle various queries without retraining.

Though tools like OpenAI's embeddings and Pinecone’s vector database make the process more manageable, implementing RAG requires more setup than simply using a pre-trained model. But, the improved quality of the model’s responses can be well worth the effort.

Model: Choosing an LLM

Choosing a large language model (LLM) for your app is like choosing the right brain. LLMs are transformer-based neural networks trained on vast amounts of text data, allowing them to learn the intricacies of language and develop a deep informational understanding of the world.

They excel at a wide range of natural language tasks, from text generation and summarization to question answering and sentiment analysis, thanks to their ability to capture complex patterns and relationships within data. The most advanced LLMs have hundreds of billions of parameters, enabling them to generate human-like text with remarkable coherence and contextual awareness. When building AI applications, developers can leverage the “world knowledge” embedded in these LLMs and combine it with domain-specific data passed through the context window or retrieved from a vector database, creating powerful and tailored AI experiences.

We can break these down into two general categories:

Proprietary models

These are the models behind the $20 monthly charges you keep finding on all the company credit cards—think OpenAI’s GPT models, Anthropic’s Claude, and Google’s Gemini.

Proprietary models offer state-of-the-art performance, ease of use, reliability, and continuous improvement. They provide developer-friendly APIs and well-documented libraries, making integration into applications straightforward. Backed by robust infrastructure, these models can handle high-volume requests and scale effectively.

However, proprietary models come with usage costs that can quickly add up, especially when processing large volumes of data. Building an application around a proprietary model ties core functionality to a third-party provider, leading to potential issues with downtime, API changes, or the provider's longevity. Using a proprietary model involves sending data to a third-party server, raising privacy and data security concerns.

Open-source models

Open-source models are an alternative to proprietary LLMs, providing developers greater control, flexibility, and cost-effectiveness. These models, such as Meta’s Llama 3 and Mistral’s Mixtral, along with others developed by research institutions and non-profits, have rapidly advanced in recent years, narrowing the performance gap with their proprietary counterparts.

Because open-source models are freely available for fine-tuning and deployment without API usage costs or licensing fees, they can be cost-effective. They also provide complete control over the model's architecture, training data, and hyperparameters, allowing developers to tailor the model to their specific use case. Open-source models also offer transparency and accountability, with full visibility into the model's architecture, training methodology, and decision-making process.

Integrating open-source models into an application may require more technical expertise compared to working with a proprietary API, as tasks like model hosting, scaling, and monitoring need to be handled independently. Training and fine-tuning large open-source models can be computationally intensive, requiring significant GPU resources, and inference also has computational requirements and associated costs to consider.

Ultimately, the choice of proprietary versus open source is entirely up to you, your business and application needs, and the trade-offs you’re willing to make.

User interface: using the model output

The user interface (UI) is the final slice of the AI app layer cake.

So far, most people experience large models via chat interfaces (aka the ubiquitous chatbots). The user interacts with the AI model through a conversational interface, typically by entering text prompts and receiving text responses. Chat interfaces provide a natural and intuitive way for users to engage with AI, mimicking the familiar experience of human-to-human conversation.

Under the hood, chat interfaces send the user’s input to the AI model, along with any relevant context or data, and then display the model’s output back to the user. The AI model processes the input using its trained weights and generates a response based on understanding the conversation and the specific task.

To implement a chat interface, you’ll need a few key components:

  • A frontend interface. This is typically a web-based interface built with HTML, CSS, and JavaScript. The frontend should provide an input field for users to enter messages and a display area for the conversation history.
  • A backend server. The backend server is an intermediary between the frontend and the AI model. It receives the user’s input from the frontend, prepares the input for the model (e.g., adding context or retrieving relevant data), sends the input to the model, and then returns the model's response to the frontend.
  • Asynchronous communication. To provide a realistic and responsive user experience, it’s crucial to implement asynchronous communication between the frontend, backend, and AI model. This allows the model’s response to stream back to the user in real time as it’s generated, creating a more natural conversation flow.

Here’s a simplified example of how the flow might work:

  • The user types a message in the frontend chat interface and hits send.
  • The frontend sends the message to the backend server via a WebSocket connection.
  • The backend server receives the message, prepares the input for the AI model (e.g., adding context), and sends a request to the AI model’s API using a streaming endpoint.
  • As the AI model generates the response, it sends it back to the backend server in chunks.
  • The backend server receives the chunks and forwards them to the frontend via the WebSocket connection.
  • The frontend receives the chunks and displays them in the chat interface in real-time, creating the illusion of a continuous conversation.

By implementing asynchronous streaming, you can create a chat interface that feels responsive and natural, even when dealing with complex queries and large amounts of data.
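
To make that flow concrete, here’s a minimal sketch of the backend piece using FastAPI’s WebSocket support and OpenAI’s async client. The route name, model, and system prompt are placeholders, and real code would also need conversation history, error handling, and an end-of-message signal for the frontend.

from fastapi import FastAPI, WebSocket
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

@app.websocket("/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    while True:
        # 1. Receive the user's message from the frontend
        user_message = await websocket.receive_text()

        # 2. Send it to the model with streaming enabled
        stream = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": user_message},
            ],
            stream=True,
        )

        # 3. Forward each chunk to the frontend as it arrives
        async for chunk in stream:
            token = chunk.choices[0].delta.content
            if token:
                await websocket.send_text(token)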

It's important to note that building a chat interface involves writing a significant amount of API-calling code to facilitate communication between the frontend, backend, and AI model. Unlike using pre-built components that can be easily connected, such as Retool's Vectors, Workflows, and App Builder, creating a chat interface from scratch requires a deeper understanding of APIs and the ability to write code that sends requests and handles responses effectively. This includes managing long-lasting connections, prepping inputs for the AI model, and processing the streaming output before displaying it to the user.

Of course, chat interfaces (whether or not they speak pirate) are just one way to use the output of AI models. Depending on your application’s specific needs, you might use the model’s output to generate reports, control other applications, create images, and more.

Securing the AI stack

Many organizations are wary of the security risks AI poses. If you add your own data or documents to an AI model, for example, what happens with that data?

Many AI model providers are strongly pushing the narrative that they aren’t using private data to train their models, and most reputable providers do have strict data usage policies. But you’re still sending data and documents to another service, so it’s important to understand the risks and take appropriate precautions.

Two must-haves for AI security are the same as any application: data encryption and access control.

For data encryption, when sending data to an AI service, you should use secure communication protocols like HTTPS to encrypt the data as it travels across the network. This helps protect against eavesdropping and man-in-the-middle attacks. If you're storing sensitive data for use with your AI application, such as user information or proprietary content, encrypt it at rest using robust encryption algorithms. Many cloud storage services, such as Amazon S3 or Google Cloud Storage, offer server-side encryption by default, but you can also implement client-side encryption for added security.
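
As a small illustration of client-side encryption at rest, here’s a sketch using the Python cryptography package (key management, i.e. where the key lives and how it’s rotated, is the hard part and is out of scope here):

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, load this from a secrets manager
fernet = Fernet(key)

# Encrypt a document before writing it to S3, GCS, or disk
ciphertext = fernet.encrypt(b"Customer notes to be embedded later")

# Decrypt it when your ingestion workflow needs the plaintext
plaintext = fernet.decrypt(ciphertext)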

Proper access control and authentication mechanisms are essential to prevent unauthorized access to your AI application and the underlying data. Use industry-standard authentication protocols, such as OAuth 2.0, to secure the communication between your frontend, backend, and AI services.

More specific to AI, you’ll need to think about data minimization and anonymization. To reduce the risk of sensitive data exposure, practice data minimization by only sending the necessary information to your AI service. If possible, anonymize or pseudonymize the data before sending it, removing any personally identifiable information (PII) or sensitive details that aren't essential for the AI task.

For example, if you're building an AI-powered customer support chatbot, you might not need to send the customer's full name or contact details to the AI service. Instead, you could use a unique identifier to link the conversation to the customer's record in your database, keeping the sensitive information within your control.
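
As a rough illustration, the sanitization step could be as simple as the function below (the ticket fields and the tagging format are made up for the example):

import uuid

def pseudonymize(ticket: dict) -> tuple[str, dict]:
    """Swap PII for an opaque reference ID before the text leaves your systems."""
    reference_id = str(uuid.uuid4())
    mapping = {
        "reference_id": reference_id,
        "name": ticket["customer_name"],
        "email": ticket["customer_email"],
    }
    sanitized_text = (
        ticket["message"]
        .replace(ticket["customer_name"], f"[customer:{reference_id}]")
        .replace(ticket["customer_email"], "[email redacted]")
    )
    # Store `mapping` in your own database; send only `sanitized_text` to the AI service
    return sanitized_text, mapping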

Deploying AI applications

When deploying AI applications, you'll likely use a microservice architecture, where each component of the AI stack lives separately from the others. This modular approach allows for greater flexibility, scalability, and maintainability, but it also introduces some complexity in deployment and management.

In a typical AI application, you'll have several distinct services:

  • Data ingestion workflow. This service is responsible for collecting, preprocessing, and transforming the data that will be used to train or inform the AI model. It involves the data cleaning, normalization, feature extraction, and embedding generation steps from above. The data ingestion workflow is often implemented as a separate service to allow for independent scaling and maintenance.
  • Database. The database serves as the central repository for storing and managing the data used by the AI application. This may include raw data, preprocessed data, embeddings, and metadata. Above, we’ve talked about vector databases such as Pinecone or Weaviate, but you’ll also need regular databases: either SQL databases such as Postgres or, more common in AI applications with unstructured data, NoSQL databases like MongoDB.
  • User interface. The UI is typically deployed as a separate service, often as a web application or mobile app. This allows for independent development, testing, and scaling of the UI and the ability to support multiple client platforms.

To connect these microservices, you'll need to implement reliable and efficient communication channels between them. Depending on your application's specific requirements, this is often achieved using RESTful APIs, message queues, or gRPC.
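
For instance, the retrieval logic from earlier could live in its own small service that the UI’s backend calls over HTTP. This is a sketch using FastAPI; the route, port, and payload shape are all placeholders:

# retrieval_service.py: a sketch of one microservice exposing the retrieval step as a REST API
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str
    top_k: int = 2

@app.post("/retrieve")
def retrieve(query: Query):
    # In a real service, embed query.text and search the vector database here
    # (see the Pinecone example earlier); this returns a stubbed response.
    return {"chunks": [f"placeholder chunk for: {query.text}"]}

# Elsewhere, in the UI's backend:
# import requests
# chunks = requests.post("http://retrieval-service:8000/retrieve",
#                        json={"text": "How do I use AI with Retool?"}).json()["chunks"]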

Deploying the AI application as a set of microservices offers several advantages:

  • Scalability. Each component can be scaled independently based on its specific resource requirements and usage patterns. This allows you to optimize the performance and cost of your application by allocating resources where they're needed most.
  • Flexibility. Microservices can be developed, tested, and deployed independently, allowing for greater flexibility regarding technology choices and release cycles. This enables you to iterate and improve individual components without affecting the entire application.
  • Resilience. By decoupling the components of your AI application, you can isolate failures and prevent them from cascading across the entire system. If one service goes down, the others can continue functioning, providing a more resilient and fault-tolerant application.

However, deploying an AI application as microservices also introduces some challenges:

  • Complexity. Managing a distributed microservices system can be more complex than a monolithic application. You'll need to consider service discovery, inter-service communication, data consistency, and distributed monitoring and logging.
  • Latency. As data flows between the various microservices, some additional latency may be introduced compared to a monolithic application. You'll need to carefully design your application architecture and communication protocols to minimize latency and ensure a responsive user experience.
  • Debugging. With multiple services involved, debugging and troubleshooting can be more challenging. You'll need robust monitoring, logging, and tracing mechanisms to quickly identify and resolve issues across the distributed system.

The right AI stack will power your ROI-driving apps

The reality of building and deploying AI applications is never just as simple as an API call. You need a way to add your data, tune or build your models, and allow users to use your application. You have to build all the infrastructure you need up front, and keep all of your data secure.

In short, these apps can require a lot of time and resources to build well, which isn’t always the answer the execs asking about AI in meetings want to hear. Identifying ways you can build and iterate more quickly, test and switch between models, and limit the amount of manual setup you need to do before getting to the core of the build is key to enabling progress.

In the second part of this series, we show you a way to do just that—plus, we walk through a complete AI app development tutorial. Sign up for a free Retool account to get started. Ready to build an app today? Go to part two for a full tutorial.

Thank you to Keanan Koppenhaver for his technical review of this article.

Andrew Tate
Technical writer
Andrew is an ex-neuroengineer-turned-developer and technical content marketer.