fish-speech-start-local/inference.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fish Speech"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### For Windows User / win用户"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "bat"
    }
   },
   "outputs": [],
   "source": [
    "!chcp 65001"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### For Linux User / Linux 用户"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import locale\n",
    "locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For Chinese users, you probably want to use mirror to accelerate downloading\n",
    "# !set HF_ENDPOINT=https://hf-mirror.com\n",
    "# !export HF_ENDPOINT=https://hf-mirror.com \n",
    "\n",
    "!huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## WebUI Inference\n",
    "\n",
    "> You can use --compile to fuse CUDA kernels for faster inference (10x)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "!python tools/run_webui.py \\\n",
    "    --llama-checkpoint-path checkpoints/fish-speech-1.5 \\\n",
    "    --decoder-checkpoint-path checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth \\\n",
    "    # --compile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Break-down CLI Inference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Encode reference audio: / 从语音生成 prompt: \n",
    "\n",
    "You should get a `fake.npy` file.\n",
    "\n",
    "你应该能得到一个 `fake.npy` 文件."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "## Enter the path to the audio file here\n",
    "src_audio = r\"D:\\PythonProject\\vo_hutao_draw_appear.wav\"\n",
    "\n",
    "!python fish_speech/models/vqgan/inference.py \\\n",
    "    -i {src_audio} \\\n",
    "    --checkpoint-path \"checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth\"\n",
    "\n",
    "from IPython.display import Audio, display\n",
    "audio = Audio(filename=\"fake.wav\")\n",
    "display(audio)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Generate semantic tokens from text: / 从文本生成语义 token:\n",
    "\n",
    "> This command will create a codes_N file in the working directory, where N is an integer starting from 0.\n",
    "\n",
    "> You may want to use `--compile` to fuse CUDA kernels for faster inference (~30 tokens/second -> ~300 tokens/second).\n",
    "\n",
    "> 该命令会在工作目录下创建 codes_N 文件, 其中 N 是从 0 开始的整数.\n",
    "\n",
    "> 您可以使用 `--compile` 来融合 cuda 内核以实现更快的推理 (~30 tokens/秒 -> ~300 tokens/秒)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "!python fish_speech/models/text2semantic/inference.py \\\n",
    "    --text \"hello world\" \\\n",
    "    --prompt-text \"The text corresponding to reference audio\" \\\n",
    "    --prompt-tokens \"fake.npy\" \\\n",
    "    --checkpoint-path \"checkpoints/fish-speech-1.5\" \\\n",
    "    --num-samples 2\n",
    "    # --compile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Generate speech from semantic tokens: / 从语义 token 生成人声:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "!python fish_speech/models/vqgan/inference.py \\\n",
    "    -i \"codes_0.npy\" \\\n",
    "    --checkpoint-path \"checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth\"\n",
    "\n",
    "from IPython.display import Audio, display\n",
    "audio = Audio(filename=\"fake.wav\")\n",
    "display(audio)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}