together.products

Fastest cloud platform for building and running generative AI.

Inference that’s fast, simple, and scales as you grow.

Fast
Run leading open-source models like Llama-2 on the fastest inference stack available, up to 3x faster¹ than TGI, vLLM, or other inference APIs like Perplexity, Anyscale, or Mosaic ML.
Cost-efficient
Together Inference is 6x lower cost² than GPT-3.5-Turbo when using Llama-2-13B. Our optimizations bring you the best performance at the lowest cost.
Scalable
We obsess over system optimization and scaling so you don’t have to. As your application grows, capacity is automatically added to meet your API request volume.

Serverless Endpoints for leading open-source models

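import os, requests

url = "https://api.together.xyz/v1/chat/completions"

payload = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    # Chat endpoints expect a list of messages; this prompt is illustrative.
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50,
    "repetition_penalty": 1
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    # Read the key from the environment instead of hard-coding it.
    "Authorization": "Bearer " + os.environ["TOGETHER_API_KEY"]
}

response = requests.post(url, json=payload, headers=headers)

print(response.text)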

#  Signup to get your API key here: https://api.together.ai/
#  Documentation for API usage: https://docs.together.ai/
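The response body can be parsed rather than printed raw. The following is a minimal sketch, assuming the endpoint returns an OpenAI-style chat completion object (a `choices` list whose items carry a `message` with `content`). That schema is an assumption here, and `extract_reply` is a hypothetical helper, so check the API documentation for the actual response shape.

```python
def extract_reply(body: dict) -> str:
    """Return the first assistant message from a chat completion response.

    Assumes an OpenAI-style schema (an assumption, not confirmed by this page):
    {"choices": [{"message": {"content": ...}}]}
    """
    choices = body.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {body!r}")
    return choices[0]["message"]["content"]


# Hand-written stand-in for response.json(); this is not real API output.
sample_body = {
    "choices": [
        {"message": {"role": "assistant", "content": "Paris."}}
    ]
}
print(extract_reply(sample_body))
```

In the request example above, the call would be `extract_reply(response.json())`.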


Dedicated Instances for any model

Integrate Together Inference Engine into your application

import os, requests

url = 'https://api.together.xyz/inference'
headers = {
    'Authorization': 'Bearer ' + os.environ["TOGETHER_API_KEY"],
    'accept': 'application/json',
    'content-type': 'application/json'
}

data = {
    "model": "togethercomputer/llama-2-70b-chat",
    "prompt": "The capital of France is",
    "max_tokens": 128,
    "stop": ".",
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50,
    "repetition_penalty": 1
}

response = requests.post(url, json=data, headers=headers)
print(response.json())

#  Signup to get your API key here: https://api.together.ai/
#  Documentation for API usage: https://docs.together.ai/

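import os, requests

url = "https://api.together.xyz/api/v1/embeddings"
headers = {
    'Authorization': 'Bearer ' + os.environ["TOGETHER_API_KEY"],
    'accept': 'application/json',
    'content-type': 'application/json'
}

# Illustrative inputs: replace with your own texts and an embedding model
# name from the documentation (the value below is a placeholder, not a real model).
texts = ["Our solar system orbits the Milky Way galaxy"]
model_api_string = "YOUR_EMBEDDING_MODEL"

response = requests.post(
    url,
    headers=headers,
    json={
        "input": texts,
        "model": model_api_string
    }
)

# One embedding vector is returned per input text, in the same order.
embeddings = [item['embedding'] for item in response.json()['data']]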

#  Signup to get your API key here: https://api.together.ai/
#  Documentation for API usage: https://docs.together.ai/

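import os, requests

url = 'https://api.together.xyz/inference'
headers = {
    'Authorization': 'Bearer ' + os.environ["TOGETHER_API_KEY"],
    'accept': 'application/json',
    'content-type': 'application/json'
}

data = {
    "model": "togethercomputer/llama-2-70b-chat",
    "prompt": "The capital of France is",
    "stream_tokens": True
}

# stream=True lets requests yield the body incrementally as tokens arrive.
response = requests.post(url, json=data, headers=headers, stream=True)
for line in response.iter_lines():
    if line:
        # Each non-empty line is one streamed chunk; print it as received.
        print(line.decode("utf-8"))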

#  Signup to get your API key here: https://api.together.ai/
#  Documentation for API usage: https://docs.together.ai/


Perfect for enterprises — performance, privacy, and scalability to meet your needs.

Performance

You get more tokens per second, higher throughput, and lower time to first token³. These efficiencies also let us provide compute at a lower cost.

Speed relative to TGI, vLLM, or other inference services: 3x faster⁴

Llama-2 70B: 117 tokens/sec⁵

Cost relative to GPT-3.5-Turbo: 6x lower cost⁶
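To put the throughput figure in practical terms, the sketch below converts tokens per second into an approximate generation time. It assumes a steady decode rate equal to the 117 tokens/sec quoted above for Llama-2 70B and ignores time to first token and network latency, so treat the result as a rough estimate. `generation_seconds` is a hypothetical helper.

```python
TOKENS_PER_SECOND = 117  # Llama-2 70B figure quoted above


def generation_seconds(num_tokens: int, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Estimated decode time, ignoring time to first token and network latency."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# The max_tokens values used in the request examples on this page.
for n in (128, 512):
    print(f"{n} tokens: ~{generation_seconds(n):.1f} s")
```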

The Together Inference Engine sets us apart.

We built the blazing fast inference engine that we wanted to use. Now, we’re sharing it with you.


The Together Inference Engine deploys the latest inference techniques:

01
Flash-Decoding and FlashAttention
Flash-Decoding dramatically speeds up attention in the Together Inference Engine while FlashAttention helps with the time to first token — up to 8x faster generation for very long sequences⁷. We achieve these by carefully designing how keys and values are loaded in parallel while separately rescaling and combining results to maintain the right attention outputs.
02
CUDA Graphs
Using CUDA graphs greatly reduces the overhead of launching GPU operations in the Together Inference Engine. It saves time by using a mechanism to launch multiple GPU operations through a single CPU operation.
03
Optimizations
Developed by our expert research team, the Together Inference Engine layers on multiple techniques. As we do this, we make painstaking optimizations to ensure that we get unmatched efficiency.
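The "rescale and combine" step described for Flash-Decoding can be illustrated in plain Python. This is a conceptual sketch only, not the Together Inference Engine implementation: it splits the keys and values into chunks, computes attention within each chunk, then rescales the partial results to a shared maximum before combining them. On a GPU the chunks would be processed in parallel; here they run in a loop. All function names and the toy data are illustrative.

```python
import math


def attention_reference(q, keys, values):
    """Standard softmax attention for a single query vector."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_split_kv(q, keys, values, chunk_size):
    """Attention computed per key/value chunk, then rescaled and combined."""
    dim = len(values[0])
    partials = []  # one (chunk_max, chunk_sum, unnormalized_output) per chunk
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_chunk]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        out = [sum(w * v[d] for w, v in zip(weights, v_chunk)) for d in range(dim)]
        partials.append((m, sum(weights), out))

    # Rescale every chunk to a shared maximum so the exponentials are comparable.
    global_max = max(m for m, _, _ in partials)
    total = 0.0
    combined = [0.0] * dim
    for m, s, out in partials:
        scale = math.exp(m - global_max)
        total += s * scale
        for d in range(dim):
            combined[d] += out[d] * scale
    return [c / total for c in combined]


# Toy data: one query, five keys, five two-dimensional values.
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, -1.0, 2.0], [0.2, 0.4, 0.6]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = attention_reference(q, keys, values)
split = attention_split_kv(q, keys, values, chunk_size=2)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(ref, split)))
```

The chunked result matches the reference because multiplying each chunk's partial sums by `exp(chunk_max - global_max)` restores the same exponent base that a single pass over all keys would use.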

Privacy

Control
Privacy settings put you in control of what data is kept, and none of your data will be used by Together AI to train new models unless you explicitly opt in to share it.
Autonomy
When you fine-tune or train a model with Together AI, the resulting model is your own private model. You own it.
Security
Together AI offers flexibility to deploy in a variety of secure clouds for enterprise customers.

Customize leading open-source models with your own private data.
Achieve higher accuracy on your domain tasks.
Start by preparing your dataset — one row per label in a .jsonl file, following the prompt template of the model you are fine-tuning.

{"text": "<s>[INST] <<SYS>>\\n{your_system_message}\\n<</SYS>>\\n\\n{user_message_1} [/INST]"}
{"text": "<s>[INST] <<SYS>>\\n{your_system_message}\\n<</SYS>>\\n\\n{user_message_1} [/INST]"}
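Rows like these can be generated programmatically. The sketch below assumes the Llama-2 chat template shown above and lets `json.dumps` handle the escaping of newlines and quotes. The example messages and the `to_jsonl_row` helper are hypothetical; use the template that matches the model you are fine-tuning.

```python
import json

# Llama-2 chat prompt template, as shown in the sample rows above.
LLAMA2_TEMPLATE = "<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def to_jsonl_row(system_message: str, user_message: str) -> str:
    """Format one example as a single-line JSON object with a "text" field."""
    text = LLAMA2_TEMPLATE.format(system=system_message, user=user_message)
    return json.dumps({"text": text})


# Hypothetical examples; replace with your own data.
examples = [
    ("You are a helpful support agent.", "How do I reset my password?"),
    ("You are a helpful support agent.", "Where can I download my invoice?"),
]

rows = [to_jsonl_row(system, user) for system, user in examples]

# One JSON object per line; write this string to a .jsonl file before uploading.
jsonl_text = "\n".join(rows) + "\n"
print(jsonl_text, end="")
```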

Validate that your dataset has the right format and upload it.

together files check $FILE_NAME
together files upload $FILE_NAME
{
       "filename" : "acme_corp_customer_support.jsonl",
       "id": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
       "object": "file"
}

Begin fine-tuning with a single command, with full control over hyperparameters.

together finetune create --training-file $FILE_ID \
  --model $MODEL_NAME \
  --wandb-api-key $WANDB_API_KEY \
  --suffix v1 \
  --n-epochs 10 \
  --n-checkpoints 5 \
  --batch-size 8 \
  --learning-rate 0.0003
{
    "training_file": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
    "model_output_name": "username/togethercomputer/llama-2-13b-chat",
    "model_output_path": "s3://together/finetune/63e2b89da6382c4d75d5ef22/username/togethercomputer/llama-2-13b-chat",
    "Suffix": "Llama-2-13b 1",
    "model": "togethercomputer/llama-2-13b-chat",
    "n_epochs": 10,
    "batch_size": 8,
    "learning_rate": 0.0003,
    "checkpoint_steps": 5,
    "created_at": 1687982945,
    "updated_at": 1687982945,
    "status": "pending",
    "id": "ft-5bf8990b-841d-4d63-a8a3-5248d73e045f",
    "epochs_completed": 3,
    "events": [
        {
            "object": "fine-tune-event",
            "created_at": 1687982945,
            "message": "Fine tune request created",
            "type": "JOB_PENDING"
        }
    ],
    "queue_depth": 0,
    "wandb_project_name": "Llama-2-13b Fine-tuned 1"
}

Monitor results on Weights & Biases, or deploy checkpoints and test them through the Together Playgrounds.

Together fine-tuning

Fine-tune models with your data.

Host your fine-tuned model for inference when it’s ready.


Together Custom Models is designed to help you train your own state-of-the-art AI model.
Benefit from cutting-edge optimizations in the Together Training stack like FlashAttention-2.
Once done, the model is yours. You retain full ownership of the model that is created, and you can run your model wherever you please.
Together Custom Models helps you through all stages of building your state-of-the-art AI model:
01. Start with data design.
Incorporate quality signals from RedPajama-v2 (30T tokens) into your model to boost its quality.
Choose data based on similarity to Wikipedia, amount of code, or how often the text uses bullets for brevity. For more details on the quality slices in RedPajama-v2 read the blog post.
Leverage advanced data selection tools like DSIR to select data slices and then optimize the amount of each slice used with DoReMi.
02. Select model architecture & training recipe.
We provide proven training recipes for instruction-tuning, long context optimization, conversational chat, and more.
Work in collaboration with our team of experts to determine the optimal architecture and training recipe.
03. Train your model.
Press go. Together Custom Models schedules, orchestrates, and optimizes your training jobs over any number of GPUs.
Up to 9x faster training with FlashAttention-2⁹
Up to 75% lower cost than training on AWS¹⁰
04. Tune and align your model.
Further customize and tailor your model to follow instructions and your business rules.
05. Evaluate model quality.
Evaluate your final model on public benchmarks such as HELM and LM Evaluation Harness, and your own custom benchmark — so you can iterate quickly on model quality.

Together custom models

Build models from scratch

We love to build state-of-the-art models. Use Together Custom Models to train your next generative AI model.

Together GPU Clusters

We offer high-end compute clusters for training and fine-tuning. But premium hardware is just the beginning. Our clusters are ready-to-go with the blazing fast Together Training stack. And our world-class team of AI experts is standing by to help you. Together GPU Clusters has a >95% renewal rate. Come build with us, and see what the hubbub is about.

Cutting-edge hardware

The fastest network for distributed training — 3.2Tbps Infiniband.
State-of-the-art training clusters with the fastest compute available — Nvidia H100 and A100 GPUs.
Directly SSH into the cluster, download your dataset and you’re ready to go.


Software stack ready for distributed training

Train with the Together Training stack, delivering up to nine times faster training with FlashAttention-2.¹¹
Slurm configured out-of-the-box for distributed training and the option to use your own scheduler.

Performance metrics

Training horsepower: exaflops¹²

Relative to AWS: 4x lower cost¹³

Training speed: 9x faster¹⁴
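The page quotes the AWS comparison both as "75% lower cost" and as "4x lower cost". These are the same claim expressed two ways, as the short sketch below shows. The helper names are illustrative.

```python
def percent_lower_to_factor(percent_lower: float) -> float:
    """Convert "P% lower cost" into an "Nx lower cost" factor."""
    if not 0 <= percent_lower < 100:
        raise ValueError("percent_lower must be in [0, 100)")
    return 100.0 / (100.0 - percent_lower)


def factor_to_percent_lower(factor: float) -> float:
    """Convert an "Nx lower cost" factor into "P% lower cost"."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return 100.0 * (1.0 - 1.0 / factor)


print(percent_lower_to_factor(75))  # 4.0
print(factor_to_percent_lower(4))   # 75.0
```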

Benefits

Scale infra – at your pace
Start with as little as 30 days and expand at your own pace. Scale up or down as your needs change, from 16 GPUs to 2,048 GPUs.
Snappy setup. Blazing fast training.
We value your time. Your cluster comes optimized for distributed training, with the high-performance Together Training stack and a preconfigured Slurm cluster out of the box. You focus on your model and we'll ensure everything runs smoothly. SSH in, download your data, and start training.
Expert support
Our team is dedicated to your success and will help unblock you, whether you have AI or system issues. A guaranteed uptime SLA and support are included with every cluster. Additional engineering services are available when needed.

Hardware specs

01
A100 PCIe Cluster Node Specs
- 8x A100 / 80GB / PCIe
- 200Gb non-blocking Ethernet
- 120 vCPU Intel Xeon (Ice Lake)
- 960GB RAM
- 7.68 TB NVMe storage
02
A100 SXM Cluster Node Specs
- 8x NVIDIA A100 80GB SXM4
- 200 Gbps Ethernet or 1.6 Tbps Infiniband configs available
- 120 vCPU Intel Xeon (Sapphire Rapids)
- 960 GB RAM
- 8 x 960GB NVMe storage
03
H100 Cluster Node Specs
- 8x Nvidia H100 / 80GB / SXM5
- 3.2 Tbps Infiniband network
- 2x AMD EPYC 9474F 48 Cores 96 Threads 3.6GHz CPUs
- 1.5TB ECC DDR5 Memory
- 8x 3.84TB NVMe SSDs
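For capacity planning, the node specs above translate directly into node counts and aggregate GPU memory. The sketch below assumes clusters are built from whole 8-GPU nodes with 80GB per GPU, as every spec above lists, and uses the 16 and 2,048 GPU cluster sizes mentioned earlier. `cluster_shape` is a hypothetical helper.

```python
GPUS_PER_NODE = 8   # every node spec above lists 8 GPUs
GPU_MEMORY_GB = 80  # 80GB per A100 or H100 GPU


def cluster_shape(num_gpus: int) -> dict:
    """Return node count and total GPU memory for a cluster of whole nodes."""
    if num_gpus <= 0 or num_gpus % GPUS_PER_NODE:
        raise ValueError(f"num_gpus must be a positive multiple of {GPUS_PER_NODE}")
    return {
        "nodes": num_gpus // GPUS_PER_NODE,
        "gpu_memory_gb": num_gpus * GPU_MEMORY_GB,
    }


for size in (16, 2048):
    print(size, cluster_shape(size))
```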

Customers Love Us

“Together GPU Clusters provided a combination of amazing training performance, expert support, and the ability to scale to meet our rapid growth to help us serve our growing community of AI creators.”


Demi Guo

CEO, Pika Labs

After pre-training a model using Together GPU Clusters, you instruction-tune with Together Fine-tuning and host with Together Inference.

After selecting a model with Together Inference, you can customize it with your own private data using Together Fine-tuning.

After building your model on Together GPU Clusters, you deploy your own Dedicated Instances for your production traffic with Together Inference.

¹ Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and OpenAI. Mosaic ML comparison based on published numbers in Mosaic ML blog. Detailed results and methodology published here.
² Based on published pricing November 8th, 2023, comparing OpenAI GPT-3.5-Turbo to Llama-2-13B on Together Inference using Serverless Endpoints. Assumes equal number of input and output tokens.
³ Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and OpenAI. Mosaic ML comparison based on published numbers in Mosaic ML blog. Detailed results and methodology published here.
⁴ Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and OpenAI. Mosaic ML comparison based on published numbers in Mosaic ML blog. Detailed results and methodology published here.
⁵ Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference. Detailed results and methodology published here.
⁶ Based on published pricing November 8th, 2023, comparing OpenAI GPT-3.5-Turbo to Llama-2-70B on Together Inference using Serverless Endpoints. Assumes equal number of input and output tokens.
⁷ Testing and methodology published in this blog post.
⁸ Testing and methodology published in this blog post.
⁹ Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster. Source.
¹⁰ Based on published pricing November 8th, 2023, comparing AWS Capacity Blocks and AWS p5.48xlarge instances to Together GPU Clusters configured with an equal number of H100 SXM5 GPUs on our 3200 Gbps Infiniband networking configuration.
¹¹ Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster. Source.
¹² Calculated based on published NVIDIA performance specifications in H100 and A100 datasheets.
¹³ Based on published pricing November 8th, 2023, comparing AWS Capacity Blocks and AWS p5.48xlarge instances to Together GPU Clusters configured with an equal number of H100 SXM5 GPUs on our 3200 Gbps Infiniband networking configuration.
¹⁴ Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster. Source.
