together.

The fastest cloud platform
for building and running
generative AI.

Start building now Docs

01 Together INFERENCE
The fastest inference stack available — just an API call away.
02 Together FINE-TUNING
Train your own generative AI model with your private data.
03 Together gpu clusters
Dedicated training clusters ready-to-go with the Together Training stack.
04 Together custom models
Build state-of-the-art models from scratch with ease.

THE FASTEST CLOUD FOR GEN AI.

BUILT ON LEADING AI RESEARCH.

Innovations

Our research team is behind breakthrough AI models, datasets, and optimizations.

See all research

Cocktail SGD

With Cocktail SGD, we’ve addressed a key hindrance to training generative AI models in a distributed environment: networking overhead. Cocktail SGD is a set of optimizations that reduces network overhead by up to 117x.

FlashAttention 2

FlashAttention is now broadly used by all Transformer models. It accelerates training and fine-tuning of LLMs by up to 9x and achieves an unprecedented 72% model FLOPs utilization for training on NVIDIA A100s.

Sub-quadratic model architectures

In close collaboration with Hazy Research, we’re working on the next core architecture for generative AI models that will provide even faster performance with longer context. Our research published in this area includes Striped Hyena, Monarch Mixer, and FlashConv.

RedPajama

Our RedPajama project enables leading generative AI models to be available as fully open-source. The RedPajama models have been downloaded millions of times, and the RedPajama dataset has been used to create over 500 leading models.

100+

Open

MODELS

together.ai

Chat

Llama 2-chat leverages publicly available instruction datasets and over 1 million human annotations. Available in three sizes: 7B, 13B and 70B parameters

Chat

The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.

Chat

A hybrid architecture composed of multi-head, grouped-query attention and gated convolutions arranged in Hyena blocks, different from traditional decoder-only Transformers

Chat

instruct fine-tuned version of Mistral-7B-v0.1

Language

The Yi series models are large language models trained from scratch by developers at 01.AI

Embeddings

An universal English sentence embedding model by WhereIsAI. Its embedding dimension is 1024, it takes up to 512 context length.

Image

A text-to-image generative AI model that excels at creating 1024x1024 images.

Embeddings

An 80M checkpoint of M2-BERT, pretrained with sequence length 32768, and it has been fine-tuned for long-context retrieval.

Chat

Code Llama is a family of large language models for code based on Llama 2 providing infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks.

Language

Designed for few-shot prompts, fine-tuned over the RedPajama-INCITE-Base-7B-v1 base model.

Chat

Vicuna is a chat assistant trained by fine-tuning Llama 2 on user-shared conversations collected from ShareGPT.

Chat

First Nous collection of dataset and models made by fine-tuning mostly on data created by Nous in-house

Language

NexusRaven is an open-source and commercially viable function calling LLM that surpasses the state-of-the-art in function calling capabilities.

Chat

Nous-Hermes-Llama2-13b is a state-of-the-art language model fine-tuned on over 300,000 instructions.

Chat

This model achieves a substantial and comprehensive improvement on coding, mathematical reasoning and open-domain conversation capacities

Code

This model empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code.

Embeddings

An 80M checkpoint of M2-BERT, pretrained with sequence length 8192, and it has been fine-tuned for long-context retrieval.

Embeddings

An 80M checkpoint of M2-BERT, pretrained with sequence length 2048, and it has been fine-tuned for long-context retrieval.

Embeddings

BAAI general embedding - large, english - model v1.5. FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search. And it also can be used in vector databases for LLMs.

Chat

Extending LLaMA-2 to 32K context, built with Meta's Position Interpolation and Together AI's data recipe and system optimizations, instruction tuned by Together

Embeddings

BAAI general embedding - base, english - model v1.5. FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search. And it also can be used in vector databases for LLMs.

Language

7.3B parameter model that outperforms Llama 2 13B on all benchmarks, approaches CodeLlama 7B performance on code, Uses Grouped-query attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at smaller cost

Code

Image

Latent text-to-image diffusion model capable of generating photo-realistic images given any text input.

Chat

Vicuna is a chat assistant trained by fine-tuning Llama 2 on user-shared conversations collected from ShareGPT.

Language

Phind-CodeLlama-34B-v1 trained on additional 1.5B tokens high-quality programming-related data proficient in Python, C/C++, TypeScript, Java, and more.

Image

Fine-tune version of Stable Diffusion focused on photorealism.

Language

Language model trained on 2 trillion tokens with double the context length of Llama 1. Available in three sizes: 7B, 13B and 70B parameters

Code

Language

Extending LLaMA-2 to 32K context, built with Meta's Position Interpolation and Together AI's data recipe and system optimizations.

Chat

This model is a 75/25 merge of Chronos (13B) and Nous Hermes (13B) models resulting in having a great ability to produce evocative storywriting and follow a narrative.

Chat

An instruction fine-tuned LLaMA-2 (70B) model by merging Platypus2 (70B) by garage-bAInd and LLaMA-2 Instruct v2 (70B) by upstage.

Language

This model achieves a substantial and comprehensive improvement on coding, mathematical reasoning and open-domain conversation capacities.

Chat

MythoLogic-L2 and Huginn merge using a highly experimental tensor type merge technique. The main difference with MythoMix is that I allowed more of Huginn to intermingle with the single tensors located at the front and end of a model

Chat

7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Aibaba Cloud. Qwen-7B-Chat is a large-model-based AI assistant, which is trained with alignment techniques.

Language

7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Aibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc.

Chat

Chat model fine-tuned using data from Dolly 2.0 and Open Assistant over the RedPajama-INCITE-Base-7B-v1 base model.

Language

Base model that aims to replicate the LLaMA recipe as closely as possible (blog post).

Embeddings

Pretrained model on English language using a masked language modeling (MLM) objective. The embedding dimension is 768, and the number of model parameters is 110M. This model is uncased: it does not make a difference between english and English.

Embeddings

This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and was designed for semantic search. It has been trained on 500K (query, answer) pairs from the MS MARCO dataset. Its embedding dimension is 768 with 512 max context length.

Language

This model can be used to moderate other chatbot models. Built using GPT-JT model fine-tuned on Ontocord.ai's OIG-moderation dataset v0.1.

Language

Fork of GPT-J instruction tuned to excel at few-shot prompts (blog post).

Chat

Chat model fine-tuned from EleutherAI’s GPT-NeoX with over 40 million instructions on carbon reduced compute.

Language

An auto-regressive language model, based on the transformer architecture. The model comes in different sizes: 7B, 13B, 33B and 65B parameters.

Chat

Falcon-40B-Instruct is a causal decoder-only model built by TII based on Falcon-40B and finetuned on a mixture of Baize.

Language

Falcon-40B is a causal decoder-only model built by TII and trained on 1,000B tokens of RefinedWeb enhanced with curated corpora.

Image

An open source Stable Diffusion model fine tuned model on Midjourney images.

Chat

Chat model based on EleutherAI’s Pythia-7B model, and is fine-tuned with data focusing on dialog-style interactions.

Image

Dreambooth model trained on a diverse set of analog photographs to provide an analog film effect.

Together Inference

The fastest inference on the market¹. Start with an easy-to-use API and immediately benefit from the fastest backend optimizations anywhere.

speed relative
to tgi, vllm or
other inference services

3x faster²

llama-2 70b

117

TOKENS / SEC³

cost relative to gpt-3.5-turbo

6x lower cost⁴

Input

Summarize the following article: "Transformers are the workhorse architecture behind modern foundation models. Our team at Together AI has released optimizations like FlashAttention-2 to help further scale their capabilities. However, the computational requirements for Transformers increase quadratically with sequence length, which limits the extent of knowledge these models can capture from available datasets. We are interested in new, sub-quadratic approaches to increase scale and eventually produce models that learn from data more comprehensively.Today, in partnership with Hazy Research, we’re releasing a first look at Monarch Mixer—an exploration into an alternative, more efficient model architecture.Monarch Mixer (M2) is sub-quadratic in both sequence length and model dimension, enabling it to scale more efficiently while still matching quality. Our first target for M2 is BERT, the most popular model in use today for language tasks. M2-BERT is 25% more parameter-efficient than BERT and trains up to 3x faster.This is just the beginning of research into Monarch Mixer models, and we are excited to share the results. You can access the first set of checkpoints at a standard sequence length today, and expect long-sequence BERT models to be coming soon! "

Output

The article discusses the limitations of Transformers, which are the workhorse architecture behind modern foundation models. The authors introduce Monarch Mixer, a new exploration into an alternative, more efficient model architecture that is sub-quadratic in both sequence length and model dimension. The article highlights the benefits of Monarch Mixer, including its ability to scale more efficiently and match the quality of Transformers. The authors also provide a first look at Monarch Mixer-BERT, which is 25% more parameter-efficient than BERT and trains up to 3x faster. They encourage readers to access the first set of checkpoints and expect long-sequence BERT models to be coming soon.

Input

You are an expert programmer that writes simple, concise code and explanations. Write a python function that accepts a long input string, extracts words in between `` and `<\span>` and returns a list.

Output

Here is a Python function that accepts a long input string, extracts words in between and <\span> and returns a list:

import re
def extract_words(input_string):
 pattern = r"(.*?)<\/span>

 matches = re.findall(pattern, input_string)
 return matches
‍
This function uses the re module to perform a regular expression search on the input string. The regular expression pattern (.*?)<\\/span> matches any text that is enclosed in and <\\span> tags. The findall function returns a list of all the matches found in the input string.Here is an example of how you can use this function:

input_string = "HelloWorld"
words = extract_words(input_string)
print(words) # Output: ['Hello', 'World']
‍
This code will extract the words "Hello" and "World" from the input string and return them in a list.

Input

<s>[INST] <<SYS>>Your task is to extract top 10 keywords from a provided text.<</SYS>>\n\nScaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40\% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2× speedup compared to FlashAttention, reaching 50-73\% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72\% model FLOPs utilization).[/INST]

Output

1. Scaling
2. Transformers
3. Longer sequence lengths
4. Language modeling
5. High-resolution image understanding
6. Code generation
7. Audio generation
8. Video generation
9. Attention layer
10. Memory saving

Input

<s>[INST] <<SYS>>You are a helpful, polite customer service agent for Awesome Cheeses Inc named Sally. If the customer mentions anything about food, tell them you have a 20% off deal on gouda cheese. <</SYS>>\n\nhi who am i speaking to?[/INST]Hello! this is Sally from Awesome Cheeses. How is your day going?</s><s>[INST]im hungry for something cheesy[/INST]

Output

Well, you've come to the right place! I'm happy to help you with that. We're running a special promotion right now - you can get 20% off all gouda cheese products. Would you like me to tell you a bit more about our gouda selection? We have a variety of delicious flavors to choose from, such as Smoked Gouda, Goat Gouda, and Garlic Gouda. Which one sounds the most appealing to you?

Input

Label the sentences as either "positive", "negative", "mixed", or "neutral":

Sentence: I can say that there isn't anything I would change.

Sentence: I'm not sure about this.

Sentence: I liked some parts but I didn't like other parts.

Sentence: I think the background image could have been better.

Sentence: I really like it.

Output

Sentence: I can say that there isn't anything I would change.
Label: positive

Sentence: I'm not sure about this.
Label: neutral

Sentence: I liked some parts but I didn't like other parts.
Label: mixed

Sentence: I think the background image could have been better.
Label: negative

Sentence: I really like it.
Label: positive

Together Inference

Together Fine-tuning

Fine-tune leading open-source models with your data to achieve greater accuracy for your tasks.

together files upload acme_corp_customer_support.jsonl
  
{
  "filename" : "acme_corp_customer_support.json",
  "id": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
  "object": "file"
}
  
  
together finetune create --training-file file-aab9997-bca8-4b7e-a720-e820e682a10a
--model together compute/RedPajama-INCITE-7B-Chat

together finetune create --training-file $FILE_ID 
--model $MODEL_NAME 
--wandb-api-key $WANDB_API_KEY 
--n-epochs 10 
--n-checkpoints 5 
--batch-size 8 
--learning-rate 0.0003
{
    "training_file": "file-aab9997-bca8-4b7e-a720-e820e682a10a",
    "model_output_name": "username/togethercomputer/llama-2-13b-chat",
    "model_output_path": "s3://together/finetune/63e2b89da6382c4d75d5ef22/username/togethercomputer/llama-2-13b-chat",
    "Suffix": "Llama-2-13b 1",
    "model": "togethercomputer/llama-2-13b-chat",
    "n_epochs": 4,
    "batch_size": 128,
    "learning_rate": 1e-06,
    "checkpoint_steps": 2,
    "created_at": 1687982945,
    "updated_at": 1687982945,
    "status": "pending",
    "id": "ft-5bf8990b-841d-4d63-a8a3-5248d73e045f",
    "epochs_completed": 3,
    "events": [
        {
            "object": "fine-tune-event",
            "created_at": 1687982945,
            "message": "Fine tune request created",
            "type": "JOB_PENDING",
        }
    ],
    "queue_depth": 0,
    "wandb_project_name": "Llama-2-13b Fine-tuned 1"
}

Together Fine-tuning

Need a cluster of H100s on Infiniband today?

Reserve your cluster today

Together GPU Clusters

Get your own private GPU cluster for large training and fine-tuning today.

‍

Use our purpose built training clusters with H100 and A100 GPUs connected over fast Infiniband networks. Your cluster comes optimized for distributed training with the high performance Together Training stack out of the box. You focus on your model, and we’ll ensure everything runs smoothly.

01
High end clusters without long-term commits
We offer flexible terms – even with our highest quality hardware. You can commit to just a month or reserve capacity for up to 5 years.
02
Clusters from 16 GPUs to 2048 GPUs
A100 PCIe Clusters Node Specifications:
- 8x A100 / 80GB / PCIe
- 200Gb node-to-node non-blocking Ethernet
- 120 vCPU Intel Xeon (Ice Lake)
- 960GB RAM
- 7.68 TB NVMe storage
‍
A100 SXM Clusters Node Specifications:
- 8x NVIDIA A100 80GB SXM
- 4120 vCPU Intel Xeon (Sapphire Rapids)
- 960 GB RAM
- 8 x 960GB NVMe storage
- 200 Gbps Ethernet or 3200 Gbps Infiniband configs available
‍
H100 Clusters Node Specifications:
- 8x Nvidia H100 / 80GB / SXM5
- 3.2 Tbps Infiniband network
- 2x AMD EPYC 9474F 18 Cores 96 Threads 3.6GHz CPUs
- 1.5TB ECC DDR5 Memory
- 8x 3.84TB NVMe SSDs
03
Snappy setup. Blazing fast training.
We value your time. Clusters are pre-configured for high-speed distributed training, using Slurm and the Together Custom Models stack to get you up and running at lightspeed.

Together GPU Clusters

Together Custom Models

We built the open-source RedPajama models. Now, we’re helping you build models from scratch.

‍

Together Custom Models optimizes model builds end-to-end from data collection and preparation to deployment. And in the end, you own the model.

Together Custom Models

training

faster⁵

network
c0mmunication

117x

lower⁶

relative to aws

lower cost⁷

We believe the future of AI is

Open Source

01 Transparency

Inspect how models are trained and what data is used to increase accuracy and minimize risks

02 no lock-in

You own the model you fine-tune, not your cloud provider. Change providers for whatever reason, including price changes.

03 Privacy

Maintain complete data privacy by storing data locally or in our secure cloud

Build, deploy, and scale. All in a single platform.

01
Build
Whether prompt engineering, fine-tuning, or training, we are ready to meet your business demands.
02
Deploy
Easily integrate your new model into your production application using the Together Inference API.
03
Scale
With the fastest performance available and elastic scaling, Together AI is built to scale with your needs as you grow.

Customer Stories

See how we support leading teams around the world. Our customers are creating innovative generative AI applications, faster.

Pika creates the next gen text-to-video models on Together GPU Clusters

Nexusflow uses Together GPU Clusters to build cybersecurity models

Arcee builds domain adaptive language models with Together Custom Models

Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and Open AI. Mosaic ML comparison based on published numbers in Mosaic ML blog. Detailed results and methodology published here.
Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and Open AI. Mosaic ML comparison based on published numbers in Mosaic ML blog. Detailed results and methodology published here.
Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference. Detailed results and methodology published here.
Based on published pricing November 8th, 2023, comparing Open AI GPT-3.5-Turbo to Llama-2-13B on Together Inference using Serverless Endpoints. Assumes equal number of input and output tokens.
Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster. Source.
Testing methodology and results published in this research paper.
Based on published pricing November 8th, 2023, comparing AWS Capacity Blocks and AWS p5.48xlarge instances to Together GPU Clusters configured with an equal number of H100 SXM5 GPUs on our 3200 Gbps Infiniband networking configuration.

together.

01 Together INFERENCE

02 Together FINE-TUNING

03 Together gpu clusters

04 Together custom models

THE FASTEST CLOUD FOR GEN AI.

BUILT ON LEADING AI RESEARCH.

Innovations

Cocktail SGD

FlashAttention 2

Sub-quadratic model architectures

RedPajama

LLaMA-2 Chat

Mixtral Instruct

Striped Hyena Nous

Mistral Instruct

01-AI Yi

UAE-Large v1

Stable Diffusion XL 1.0

M2-BERT 80M 32K Retrieval

Code Llama Instruct

RedPajama-INCITE Instruct

Vicuna v1.5 16K

Nous Capybara

Nexus Raven

Nous Hermes Llama-2

Wizard LM

WizardCoder Python v1.0

M2-BERT 80M 8K Retrieval

M2-BERT 80M 2K Retrieval

BGE-Large-EN v1.5

LLaMA-2-7B-32K-Instruct

BGE-Base-EN v1.5

Mistral

Code Llama Python

Stable Diffusion 2.1

Vicuna v1.5

Phind Code LLaMA v2

Realistic Vision 3.0

LLaMA-2

Code Llama

LLaMA-2-32K

Chronos Hermes

Platypus2 Instruct

WizardLM v1.0 (70B)

MythoMax-L2

Qwen-Chat

Qwen

RedPajama-INCITE Chat

RedPajama-INCITE

BERT

Sentence-BERT

GPT-JT-Moderation

GPT-JT

GPT-NeoXT-Chat-Base

LLaMA

Falcon Instruct

Falcon

Openjourney v4

Pythia-Chat-Base

Analog Diffusion

Together Inference

speed relative to tgi, vllm orother inference services

llama-2 70b

cost relative to gpt-3.5-turbo

Together Fine-tuning

Need a cluster of H100s on Infiniband today?

Together GPU Clusters

01

02

03

Together Custom Models

training

network c0mmunication

relative to aws

We believe the future of AI is

Open Source

01 Transparency

02 no lock-in

03 Privacy

speed relative
to tgi, vllm or
other inference services

network
c0mmunication