forked from suno-ai/bark

🔊 Text-Prompted Generative Audio Model


iam-akshay/bark

Bark API

The Bark API is a web API that generates waveform prompts from input text. It is built using FastAPI, a modern, fast (high-performance) web framework for building APIs.

Requirements

  • Docker
  • Docker Compose

Running the application

To run the application, follow these steps:

  • Clone the repository to your local machine.
  • In the project directory, run the docker-compose up --build -d command.
  • Wait for the containers to start. You can monitor the logs with the docker-compose logs -f command.
  • Open your web browser and navigate to http://localhost:8000.
  • That's it! The API should now be reachable at that address.

Stopping the application

To stop the application, run the docker-compose down -v command in the project directory. This stops and removes all the containers, networks, and volumes created by docker-compose up.

Endpoints

The Bark API has two endpoints:

GET /

This endpoint returns a welcome message.

POST /api/prompt

This endpoint generates a waveform prompt from the input text.

  • Request Body: The request body must be a JSON object with two keys:

text (string): any text to be used as input for prompt generation

filename (string, optional): the filename used to save the generated prompt. If not provided, a default filename (dummy.npz) is used.

  • Response: If the request is successful, the endpoint returns a JSON object with a message indicating that prompt generation has started.
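As a rough illustration of the request described above, the following sketch builds the POST request with Python's standard library. The base URL is the address from the run instructions; the helper names build_prompt_request and send_prompt_request are hypothetical, and since the exact shape of the response JSON is not documented here, it is returned unmodified.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # address from the run instructions


def build_prompt_request(text: str, filename: Optional[str] = None,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /api/prompt request."""
    payload = {"text": text}
    if filename is not None:
        # Leave the key out entirely so the server can apply its default.
        payload["filename"] = filename
    return urllib.request.Request(
        url=f"{base_url}/api/prompt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_prompt_request(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the containers to be running):
# print(send_prompt_request(build_prompt_request("Hello!", filename="hello.npz")))
```

Separating request construction from sending keeps the payload easy to inspect before anything goes over the network.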


🐶 Bark

Examples | Model Card | Playground Waitlist

Bark is a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.

🔊 Demos

Open in Spaces | Open In Colab

🤖 Usage

from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
     Hello, my name is Suno. And, uh — and I like pizza. [laughs] 
     But I also have other interests such as playing tic tac toe.
"""
audio_array = generate_audio(text_prompt)

# play text in notebook
Audio(audio_array, rate=SAMPLE_RATE)

To save audio_array as a WAV file:

from scipy.io.wavfile import write as write_wav

write_wav("/path/to/audio.wav", SAMPLE_RATE, audio_array)
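If SciPy is not available, a similar result can be sketched with the standard library alone. This assumes audio_array holds mono float samples in the range [-1, 1], which is worth verifying for your version of Bark; the helper name write_pcm16_wav is hypothetical.

```python
import struct
import wave
from typing import Iterable


def write_pcm16_wav(path, sample_rate: int, samples: Iterable[float]) -> int:
    """Write mono float samples in [-1, 1] as 16-bit PCM; return the frame count.

    `path` may be a filename or a writable binary file object.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


# Example: write_pcm16_wav("/path/to/audio.wav", SAMPLE_RATE, audio_array)
```

Converting sample by sample in pure Python is slow for long clips, so treat this as a fallback rather than a replacement for the SciPy call.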

🌎 Foreign Language

Bark supports various languages out-of-the-box and automatically determines language from input text. When prompted with code-switched text, Bark will attempt to employ the native accent for the respective languages. English quality is best for the time being, and we expect other languages to further improve with scaling.

text_prompt = """
    Buenos días Miguel. Tu colega piensa que tu alemán es extremadamente malo. 
    But I suppose your english isn't terrible.
"""
audio_array = generate_audio(text_prompt)

🎶 Music

Bark can generate all types of audio, and, in principle, doesn't see a difference between speech and music. Sometimes Bark chooses to generate text as music, but you can help it out by adding music notes around your lyrics.

text_prompt = """
    ♪ In the jungle, the mighty jungle, the lion barks tonight ♪
"""
audio_array = generate_audio(text_prompt)

🎤 Voice Presets and Voice/Audio Cloning

Bark has the capability to fully clone voices, including tone, pitch, emotion and prosody. The model also attempts to preserve music, ambient noise, etc. from input audio. However, to mitigate misuse of this technology, we restrict the audio history prompts to a set of Suno-provided, fully synthetic options for each language. Specify a preset using the pattern: {lang_code}_speaker_{0-9}.

text_prompt = """
    I have a silky smooth voice, and today I will tell you about 
    the exercise regimen of the common sloth.
"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_1")

Note: since Bark recognizes languages automatically from input text, it is possible to use, for example, a German history prompt with English text. This usually leads to English audio with a German accent.
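The {lang_code}_speaker_{0-9} naming pattern can be sketched as a small helper. The language codes are taken from the Supported Languages table in this README; the function name speaker_preset is hypothetical and is not part of the bark package.

```python
SUPPORTED_LANG_CODES = (
    "en", "de", "es", "fr", "hi", "it", "ja",
    "ko", "pl", "pt", "ru", "tr", "zh",
)


def speaker_preset(lang_code: str, index: int) -> str:
    """Return a preset name following the {lang_code}_speaker_{0-9} pattern."""
    if lang_code not in SUPPORTED_LANG_CODES:
        raise ValueError(f"unsupported language code: {lang_code!r}")
    if not 0 <= index <= 9:
        raise ValueError(f"speaker index must be between 0 and 9, got {index}")
    return f"{lang_code}_speaker_{index}"


# Example: generate_audio(text_prompt, history_prompt=speaker_preset("de", 3))
```

Validating the name up front gives a clearer error than passing a misspelled preset to the model.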

👥 Speaker Prompts

You can provide certain speaker prompts such as NARRATOR, MAN, WOMAN, etc. Please note that these are not always respected, especially if a conflicting audio history prompt is given.

text_prompt = """
    WOMAN: I would like an oatmilk latte please.
    MAN: Wow, that's expensive!
"""
audio_array = generate_audio(text_prompt)

💻 Installation

pip install git+https://github.com/suno-ai/bark.git

or

git clone https://github.com/suno-ai/bark
cd bark && pip install . 

🛠️ Hardware and Inference Speed

Bark has been tested and works on both CPU and GPU (PyTorch 2.0+, CUDA 11.7 and CUDA 12.0). Running Bark requires running transformer models with more than 100M parameters. On modern GPUs and PyTorch nightly, Bark can generate audio in roughly real time. On older GPUs, default Colab, or CPU, inference might be 10-100x slower.

If you don't have new hardware available or if you want to play with bigger versions of our models, you can also sign up for early access to our model playground here.

⚙️ Details

Similar to Vall-E and some other amazing work in the field, Bark uses GPT-style models to generate audio from scratch. Unlike Vall-E, the initial text prompt is embedded into high-level semantic tokens without the use of phonemes. Bark can therefore generalize to arbitrary instructions beyond speech that occur in the training data, such as music lyrics, sound effects or other non-speech sounds. A second model then converts the generated semantic tokens into audio codec tokens to generate the full waveform. To enable the community to use Bark via public code, we used the fantastic EnCodec codec from Facebook as the audio representation.

Below is a list of some known non-speech sounds, but we are finding more every day. Please let us know if you find patterns that work particularly well on Discord!

  • [laughter]
  • [laughs]
  • [sighs]
  • [music]
  • [gasps]
  • [clears throat]
  • — or ... for hesitations
  • ♪ for song lyrics
  • capitalization for emphasis of a word
  • MAN/WOMAN: for bias towards speaker
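Some of the conventions in this list can be applied programmatically. The sketch below is plain string formatting only; as_lyrics and dialogue are hypothetical helper names, and Bark may not always honor the resulting cues.

```python
from typing import Iterable, Tuple


def as_lyrics(text: str) -> str:
    """Wrap text in music notes to nudge Bark toward singing."""
    return f"♪ {text.strip()} ♪"


def dialogue(turns: Iterable[Tuple[str, str]]) -> str:
    """Format (speaker, line) pairs as 'SPEAKER: line', one turn per line."""
    return "\n".join(
        f"{speaker.upper()}: {line.strip()}" for speaker, line in turns
    )


# Example:
# text_prompt = dialogue([("woman", "I would like an oatmilk latte please."),
#                         ("man", "Wow, that's expensive! [laughs]")])
```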

Supported Languages

Language                    Status
English (en)                ✅
German (de)                 ✅
Spanish (es)                ✅
French (fr)                 ✅
Hindi (hi)                  ✅
Italian (it)                ✅
Japanese (ja)               ✅
Korean (ko)                 ✅
Polish (pl)                 ✅
Portuguese (pt)             ✅
Russian (ru)                ✅
Turkish (tr)                ✅
Chinese, simplified (zh)    ✅
Arabic                      Coming soon!
Bengali                     Coming soon!
Telugu                      Coming soon!

🙏 Appreciation

  • nanoGPT for a dead-simple and blazing fast implementation of GPT-style models
  • EnCodec for a state-of-the-art implementation of a fantastic audio codec
  • AudioLM for very related training and inference code
  • Vall-E, AudioLM and many other ground-breaking papers that enabled the development of Bark

© License

Bark is licensed under a non-commercial license: CC-BY-NC 4.0. The Suno models themselves may be used commercially. However, this version of Bark uses EnCodec as a neural codec backend, which is licensed under a non-commercial license.

Please contact us at [email protected] if you need access to a larger version of the model and/or a version of the model you can use commercially.

📱 Community

🎧 Suno Studio (Early Access)

We're developing a playground for our models, including Bark.

If you are interested, you can sign up for early access here.

FAQ

How do I specify where models are downloaded and cached?

Use the XDG_CACHE_HOME environment variable to override where models are downloaded and cached (otherwise this defaults to a subdirectory of ~/.cache).
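The variable can also be set from Python rather than the shell. The sketch below assumes the cache location is resolved when the bark package is first imported, so the helper should be called before that import; set_bark_cache_dir is a hypothetical name, not part of bark.

```python
import os


def set_bark_cache_dir(cache_root: str) -> str:
    """Point XDG_CACHE_HOME at cache_root and return the absolute path."""
    cache_root = os.path.abspath(os.path.expanduser(cache_root))
    os.makedirs(cache_root, exist_ok=True)
    os.environ["XDG_CACHE_HOME"] = cache_root
    return cache_root


# Assumption: call this before `import bark` so the new location is picked up.
# set_bark_cache_dir("~/bark_models")
# from bark import preload_models
```

Note that XDG_CACHE_HOME is shared by other tools, so changing it in a long-lived shell affects more than Bark; setting it per process, as here, keeps the change local.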

Bark's generations sometimes differ from my prompts. What's happening?

Bark is a GPT-style model. As such, it may take some creative liberties in its generations, resulting in higher-variance model outputs than traditional text-to-speech approaches.
