Synthesizer V Studio Flat: A Practical Guide (v1.35, revised)

Written by POPY
With thanks to W and M.

What Kind of Tool Is SV Flat?

What Is Synthesizer V Studio Flat?

Synthesizer V Studio Flat is an enhanced build of SV1 Pro created by some Synth fans, inspired by Yumekey. In addition to unlocking and bundling voicebanks, it ships with an add-on called Flat Manager that lets you manage voicebanks and edit them far more freely.

Why Use Synthesizer V Studio Flat?

With Synthesizer V Studio Flat and its editors, you can modify existing SV1 Pro voicebanks with almost no practical limits. If your goal is to improve vocal quality and push for more refined results, SV Flat opens up an entirely different workflow.

What New Features Does Synthesizer V Studio Flat Provide?

Removes the limit on the number of Vocal Modes (current versions have no hard limit; for CPU rendering reasons, staying under ~1000 is recommended)
Supports Vocal Mode mixing and transfer (mix/port Vocal Modes between voicebanks, even across different Base Models; results can be unpredictable)
Lets you replace Auto-Pitch defaults (swap Auto-Pitch parameters to change a voicebank's automatic pitch behavior)
Lets you swap Resonance / Articulation models (via built-in model switching; this is an experimental modification and results are not fully predictable)
Lets you swap Articulation and timing models (also via built-in models; experimental modification)
Supports the PLUS library (an SV1-compatible implementation of the SV2 PLUS compatibility library, provided for editing and experimentation)
Extends Vocal Mode gain range (by editing percentages in the editor you can effectively reach -500% to 500%; large values can cause clipping)
Supports mixing a Vocal Mode's base timbre to steer phonation behavior so it can better match other voicebanks' Vocal Modes
Can generate random voicebanks and random Vocal Modes (useful for inspiration; results can be strange)
Unlocks phoneme properties, adds support for languages like Korean/French, and supports custom dictionaries
Supports version management (useful when you start producing many derived voicebank variants)
Uses model compression to significantly reduce SV1 Pro voicebank storage footprint
Supports exporting custom voicebanks
Supports manual refresh of custom voicebanks

How Do I Get Started With Synthesizer V Studio Flat?

Read through this guide end to end, then experiment section by section. That's the fastest way to build an intuition for what each parameter controls and how to iterate safely.

Basics and Background Concepts

SFPK Format

SFPK is the voicebank file format used by Synthesizer V Studio Flat. It can be opened and installed via Flat Manager. Conceptually, an SFPK is an archive; depending on the voicebank, it may contain Base Model files, images, NOFS-JSON, and more.

NOFS-JSON is Flat's lightweight voicebank format. By editing the JSON in Flat Manager, you can change metadata, timbre parameters, Vocal Modes, pitch-related parameters, phoneme tables, and more.

Timbre / Vocal Mode / pitch parameters in the JSON are 256-character HEX strings. Internally they are parsed as 32 fp32 values, forming a 32-dimensional embedding vector (Emb). By performing mathematical operations on these floats, you can blend or reinforce timbre and Vocal Modes.
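
To make the layout concrete, here is a minimal Python sketch of decoding and re-encoding such a string. It assumes the 32 values are stored back to back as little-endian IEEE-754 fp32 words; that assumption is consistent with the sample data in this guide, but Flat Manager's actual parser is not shown here. The function names hex_to_emb and emb_to_hex are hypothetical.

```python
import struct

EMB_DIM = 32  # 256 hex chars = 128 bytes = 32 x 4-byte floats


def hex_to_emb(hex_str: str) -> list[float]:
    """Decode a 256-character HEX string into 32 floats (assumes little-endian fp32)."""
    raw = bytes.fromhex(hex_str)
    if len(raw) != EMB_DIM * 4:
        raise ValueError(f"expected {EMB_DIM * 8} hex characters, got {len(hex_str)}")
    return list(struct.unpack(f"<{EMB_DIM}f", raw))


def emb_to_hex(emb: list[float]) -> str:
    """Encode 32 floats back into an uppercase 256-character HEX string."""
    if len(emb) != EMB_DIM:
        raise ValueError(f"expected {EMB_DIM} values, got {len(emb)}")
    return struct.pack(f"<{EMB_DIM}f", *emb).hex().upper()
```

Under this assumption, the first eight characters of the (base) sample in this guide, C622493E, decode to roughly 0.196, and decoding followed by encoding returns the original string unchanged.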

Note: For batch packaging, you can zip multiple .sfpk files and rename the zip extension to .sfpks.
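
The batch-packaging note can also be scripted. The sketch below uses Python's standard zipfile module; the helper name pack_sfpks is hypothetical, and placing the files at the archive root without compression is an assumption rather than a documented requirement.

```python
import zipfile
from pathlib import Path


def pack_sfpks(sfpk_paths, out_path):
    """Bundle several .sfpk files into one zip archive saved with the .sfpks extension."""
    paths = [Path(p) for p in sfpk_paths]
    for p in paths:
        if p.suffix.lower() != ".sfpk":
            raise ValueError(f"not an .sfpk file: {p}")
    out_path = Path(out_path).with_suffix(".sfpks")
    with zipfile.ZipFile(out_path, "w") as bundle:
        for p in paths:
            # Assumption: files sit at the archive root, without folders.
            bundle.write(p, arcname=p.name)
    return out_path
```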

Base Model (`base_model`): The Common Denominator Behind Voicebanks

In Flat Manager, use the main menu sort option "By model" to group voicebanks by model. That model is the Base Model (base_model). In the voicebank JSON editor you'll also see entries such as:

"base_model": "2c233d8e9b19f1f4dc0276ba3a5542c1"

base_model is the foundational model a voicebank is built on. A practical way to think about it: SV1 Pro was likely trained on large datasets (male and female voices from a training batch), producing a Base Model. In generative-model terms (e.g., VAE), a fixed Base Model defines a region of feature space. Within that space, adjusting embedding vectors (Emb) allows the singing to move smoothly across different voice states.

In short, the Base Model is the lowest-level data source for a voicebank. It stores a range of voice states and largely determines the timbre range the voicebank can reach.

Base Model field format:

"base_model": "2c233d8e9b19f1f4dc0276ba3a5542c1"

Q1: What Can I Do With Base Model Characteristics?

A: Current Flat Manager versions support mixing/transferring/modifying voicebank timbre parameters (base / default / auxiliary). These parameters are rooted in the Base Model's feature space. If you want to mix or transfer Vocal Modes between voicebanks, using the same base_model is strongly recommended; otherwise results may become unpredictable (sometimes surprisingly good, but not controllable).

Q2: Can I Only Replace `base_model`?

A: You can, but it's not recommended. A voicebank's timbre is built on top of its Base Model. Replacing only the Base Model usually behaves like generating a random voicebank and is rarely useful.

Q3: Could Different Base Models Be Similar?

A: Yes, aside from SV2 compatibility libraries / PLUS compatibility libraries. Many Base Models are fine-tuned from the SV1 "general" Base Model (2c233d8e9b19f1f4dc0276ba3a5542c1) with additional data, so they can be somewhat similar. On these fine-tuned Base Models (use release timing as a clue), you may sometimes reuse timbre parameters from the general SV1 Base Model, but the outcome still has significant uncertainty because the Base Model changed.

Timbre Parameters (`base` / `default` / `auxiliary`): Layered Offset Style Vectors

With a Base Model, SV needs a way to point to a specific voicebank. That's the role of timbre parameters.

At a high level, SynthV Pro can be treated as a three-stage "offset model": three types of embedding vectors (Emb) are stacked to shape a specific voicebank:

(base): the foundational timbre vector. It anchors core characteristics such as Articulation, accent, singing style, and overall timbre. In most Base Models, (base) is what identifies the voicebank (with exceptions such as SV2 compatibility libraries / PLUS compatibility libraries). This is hidden in stock SV1 Pro.
(default): the default Vocal Mode. After (base) is set, (default) anchors the voicebank's default singing mode. (default) is effectively a 100%-strength Vocal Mode and can also be reused as a Vocal Mode in other voicebanks. This is also hidden in stock SV1 Pro.
(auxiliary): adjustable Vocal Modes. Thanks to SV1 Pro's Base Model data coverage, (auxiliary) can express not only timbre but also Resonance, vocal technique, breathiness, and more. In SV Flat, you can tweak/transfer/modify/build special (auxiliary) modes for specific dynamic effects.

(default) and (auxiliary) are equivalent categories and can be blended with each other. (base) is different: in general, (base) should only be blended with other (base) vectors.

Practical importance usually looks like this:

$$ \mathtt{(base) > (default) > (auxiliary)} $$

You can think of SV1 Pro as stacking three offset vectors in order: (base), then (default), then (auxiliary). The direction of a vector represents timbre features; its magnitude represents feature weight. Controlling these vectors gives you fine-grained control over Vocal Modes.

Replacing, mixing, or reshaping these three layers can significantly alter timbre, mouth shape behavior, and perceived texture. Adding or replacing auxiliary styles increases the editing range inside the editor (and works best when you stay aligned with the voicebank's base_model).

extra is a scalar correction used only for (auxiliary) styles. It adjusts details like aspiration; deleting it (default to 0) usually causes only small changes. Flat Manager can predict extra for an (auxiliary) style (currently limited to Base Model 2c233d8e9b19f1f4dc0276ba3a5542c1).

Safety note: When editing a voicebank, change the version and save once first. Flat Manager will then create a new version branch, which helps prevent accidental loss of the original.

Format example (styles):

"styles": [
  {
    "name": "(base)",
    "data": "C622493E2C8638BEBE4E3B3EB7331FBE52C4DA3DCE1A21BD667B993D43A82D3E367652BEB53579BC7814C1BDBA9427BDFFB2913B14E9433D232DE03D5660BD3D2407653EBD37F8BB61C15E3D2F8478BDBD0A8E3D4AAE033EDCEDD83D23F5193E002985BD3A3D09BCA6CA46BD1C4D4BBDBDFDC0BDA52B81BC0201E43D7D4A383D"
  },
  {
    "name": "(default)",
    "data": "877D863A8DEEB1BD7ED4BDBDF6BF1FBE1FB8EFBD32474CBE7058D6BCAF8804BE64DED9BCC66E8F3C64F1B1BC6CE880BE263808BE997563BD78C42EBE8D80423D1F2B78BE374A64BB0045F7BDE0CF92BEC45A8EBEE71019BCEEEB45BE6CF7CFBEC8ED133EDA4C19BD951A37BDEAC7733EE8EC98BD4A9C1DBE7DF3013CAC661E3C"
  },
  {
    "name": "Gentle",
    "data": "6F231F3D0F37CEBD7EE0503D0080E53D92C8433E00A47C3C741E7B3EC178103EEC0DB3BC5E0556BD007F2E3E65E4233E3457D7BD62F9023EDC75C73D405DAF3ABF9E21BE0B8A593D003AF5BD4074CE3CC09DD2BBB4F4C33C443A59BDDC3775BE28B1C13C56E05D3C60170D3E54FD11BD6A14A13D409ED93B66ADB73C1E9DEF3D",
    "extra": 0.18505549430847168
  }
]

Q1: How Do I Edit Timbre Parameters?

A0 — Add more Vocal Modes / create a blank template:
Open the voicebank editor, right-click any style and copy it, for example:

{
  "name": "Gentle",
  "data": "6F231F3D0F37CEBD7EE0503D0080E53D92C8433E00A47C3C741E7B3EC178103EEC0DB3BC5E0556BD007F2E3E65E4233E3457D7BD62F9023EDC75C73D405DAF3ABF9E21BE0B8A593D003AF5BD4074CE3CC09DD2BBB4F4C33C443A59BDDC3775BE28B1C13C56E05D3C60170D3E54FD11BD6A14A13D409ED93B66ADB73C1E9DEF3D",
  "extra": 0.18505549430847168
}

Paste it back into the same list and change it into a blank template, for example:

{
  "name": "TimbreStyle1",
  "data": "0000000000000080000000000000008000000000000000800000000000000000000000800000008000000080000000800000000000000000000000000000000000000000000000800000000000000080000000000000000000000000000000000000008000000080000000800000008000000080000000800000000000000000",
  "extra": 0
}

Make sure the JSON stays valid: commas are the most common issue (missing or extra, depending on whether you insert in the middle or at the end). Click Save (or press Ctrl+S) when done. This blank style is useful as a target slot for A2 — Mixing.
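
If you prefer to script this step, the sketch below appends a blank style with Python's json module, which writes the commas for you. The helper name add_blank_style is hypothetical, and the all-zero data string is an assumption: it is numerically equivalent to the template above, which mixes +0.0 and -0.0 words.

```python
import json

BLANK_DATA = "0" * 256  # 32 fp32 zeros


def add_blank_style(voicebank: dict, name: str) -> dict:
    """Append an all-zero auxiliary style to the voicebank's styles list."""
    voicebank["styles"].append({"name": name, "data": BLANK_DATA, "extra": 0})
    return voicebank


# Illustrative voicebank with placeholder embeddings, not real data.
vb = {"styles": [{"name": "(base)", "data": BLANK_DATA},
                 {"name": "(default)", "data": BLANK_DATA}]}
add_blank_style(vb, "TimbreStyle1")
text = json.dumps(vb, indent=2)  # serialized output is always valid JSON
```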

A1 — Magnitude control:
In the voicebank editor, click a style's data value. Press Ctrl+M (or right-click and select "Adjust embedding Magnitude") to view its absolute magnitude. Use the slider to set the magnitude within [-5, 5], or type a number directly, which has no range limit. Negative values invert the vector direction. You can also add an "enfored_length": 0 field (with any number as the value) to set the absolute magnitude directly.

Note: Large magnitudes can easily cause clipping or extreme loudness. Be cautious, inspect rendered waveforms, then audition. Manual input can exceed 5; use that sparingly.
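
For readers who want to see the arithmetic, here is a sketch of what a magnitude adjustment could look like. It assumes that "absolute magnitude" means the Euclidean (L2) norm of the 32 values and that the data is little-endian fp32; neither is confirmed by Flat Manager, and the helper names are hypothetical.

```python
import math
import struct


def decode(hex_str):
    """256-HEX string -> tuple of 32 floats (assumes little-endian fp32)."""
    return struct.unpack("<32f", bytes.fromhex(hex_str))


def encode(values):
    """Iterable of 32 floats -> uppercase 256-HEX string."""
    return struct.pack("<32f", *values).hex().upper()


def magnitude(hex_str):
    """L2 norm of the embedding (assumed to match the editor's absolute magnitude)."""
    return math.sqrt(sum(v * v for v in decode(hex_str)))


def set_magnitude(hex_str, target):
    """Rescale the embedding to |target|; a negative target also flips its direction."""
    current = magnitude(hex_str)
    if current == 0.0:
        raise ValueError("cannot rescale an all-zero embedding")
    factor = target / current
    return encode(v * factor for v in decode(hex_str))
```

Rescaling changes only the length of the vector. Its direction, which this guide associates with timbre features, stays the same unless the target is negative.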

A2 — Vocal Mode mixing:
Click the style's data value, then press Ctrl+E (or right-click "Export embedding to Mixer") to send it to the Mixer (calculator icon in Flat Manager). Mix channels with sliders, then click Export to write an auxiliary embedding back into the JSON you're editing. Alternatively, copy the 256-HEX result and paste it into the blank template created in A0.

Note: In theory you can mix any embeddings (base/default/auxiliary, different Base Models, random vectors, pitch embeddings, etc.). For controlled results, follow the Base Model guidance and keep edits targeted.

A3 — Vocal Mode transfer:
In the source voicebank, copy a style entry. In the destination voicebank, paste it somewhere after (base) and (default) inside styles, then ensure JSON commas are correct. Save (Ctrl+S).
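
The same transfer can be sketched on parsed NOFS-JSON dictionaries. The helper name transfer_style is hypothetical. Appending at the end of styles keeps the new entry after (base) and (default), assuming those two come first, and the base_model check only warns, in line with the guidance above.

```python
import copy
import warnings


def transfer_style(source: dict, dest: dict, style_name: str) -> dict:
    """Copy one style from source into dest, appended after the existing styles."""
    style = next((s for s in source["styles"] if s["name"] == style_name), None)
    if style is None:
        raise KeyError(f"style not found in source: {style_name}")
    if source.get("base_model") != dest.get("base_model"):
        warnings.warn("base_model differs; the transferred mode may be unpredictable")
    dest["styles"].append(copy.deepcopy(style))
    return dest
```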

A4 — Random style:
In the voicebank editor, click "New Random Style" (round button with a plus). Flat Manager appends a random style to the end of styles. Save (Ctrl+S).

A5 — Random voicebank (Reset):
In the voicebank editor, click "Reset" (square button with a plus). This generates a completely random voicebank. Save (Ctrl+S).

Q2: After Editing, How Do I Make Changes Take Effect in the SV Flat Editor?

A: Besides restarting the SV Flat editor, you can uninstall one minor version of the voicebank and use "Refresh" in the SV Flat editor to reload it.

Q3: How Do I Manage Too Many Vocal Modes Inside One Voicebank?

A: Use version branching. Change version to create separate variants so you can switch between them:

"name": "GUMI AI",
"version": "101",
"vendor": "INTERNET Co., Ltd.",
"language": "japanese",
"phoneset": "romaji"

Note: Avoid creating voicebanks that share a name but have different vendors; this can cause problems.

Q4: Can I Mix `(default)` with `(base)`?

A: Not recommended. (default) is equivalent to an (auxiliary) at 100% strength, but (base) is different and is generally safest to mix only with other (base) vectors. Mixing (base) with (default) is often unpredictable.

Auto-Pitch (`f0_model` / `pitch`): The "Secret" Behind Automatic Singing Pitch

In Flat Manager, f0_model and pitch control a voicebank's Auto-Pitch characteristics.

"f0_model": "dcee89442f69984189a5b2aedbf9f090",
"pitch": "9D31DDBD30D22D3D4A195ABDD00D0F3E311D253EF28921BE97631C3EA27055BD62F983BDAC2EDDBC7724243DC266003DF53852BC0699D43DC8119DBDAACDA73E2288033EB2A995BD76A4113EBC717FBD9449BB3E8AFE193E58011BBCD5E7863D763E623DD0FBF83CE814BEBD548B83BE27D7D23DCBF8B23DE4982DBEABB693BC"

Replacing both f0_model and pitch together lets you steer a voicebank's Auto-Pitch behavior.

Q1: Example?

A: If you replace POPY's f0_model & pitch with Minus's f0_model & pitch, POPY's Auto-Pitch behavior will shift toward Minus.
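
As a sketch of that swap on parsed NOFS-JSON dictionaries: the helper name copy_auto_pitch is hypothetical, and the only point it illustrates is that f0_model and pitch are replaced as a pair.

```python
PAIRED_FIELDS = ("f0_model", "pitch")


def copy_auto_pitch(donor: dict, target: dict) -> dict:
    """Return a copy of target whose f0_model and pitch both come from donor."""
    missing = [key for key in PAIRED_FIELDS if key not in donor]
    if missing:
        raise KeyError(f"donor is missing: {missing}")
    updated = dict(target)  # shallow copy; only top-level strings are replaced
    for key in PAIRED_FIELDS:
        updated[key] = donor[key]
    return updated
```

The same pattern would apply to the timing_model / timing pair described later in this guide.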

Q2: Since `pitch` is also a float vector, can it be mixed?

A: In theory, yes, but it's usually not very useful. pitch is trained for a specific purpose, and the quality of a pitch mix is much harder to judge than that of a Vocal Mode mix. Using a high-quality Auto-Pitch set is generally the better option.

Singing Assistant Model (`sing_model`): Optional Plug-ins for Resonance / Articulation Behavior

In Flat Manager, sing_model mainly affects Articulation (oral target / placement) and how the voice behaves in the mixed-voice range, which in turn influences fundamental support. It has relatively smaller impact on pure-voice timbre and overall loudness.

"sing_model": "3f649ae6cb04ee4f7e9a7ed72ee29928"

As a rule of thumb: sing_model changes how the sound is produced, not the timbre itself.

In Resonance terms: primarily affects fundamental support + phonation state
In Articulation terms: can move the "oral focus" forward/more concentrated/more closed, but doesn't strongly change tightness or voiced/unvoiced contrast

sing_model is largely independent from base_model, so it's generally safe to try different sing_model options as an enhancer for a voicebank you like.

That said, its impact is often subtle and the result may not fit your target voicebank. Test carefully.

Q1: How Do I Know What a `sing_model` Will Do?

A: Try it. Or analyze the source voicebank's singing characteristics and decide whether your target voicebank needs that direction.

Timing Model (`timing_model` / `timing`): Controlling Phoneme Durations and Consonant Behavior

In Flat Manager, timing_model and timing control phoneme-level timing. Together they influence duration, consonant phenomena, and the "feel" of transitions.

"timing_model": "7b0ea690ada94b4484b50d9d64a21cae",
"timing": "D8168F3D9147F5BC6D9FA93C9A3E923D530D8F3C88D4883C806D7B3CCA41003D456A56BCD5FFC0BC666A5EBDAAA5E2BCC0082FBDB6A66C3D60BABB3C33CF233D064AD83C6BB9AA3C2DC6633CF3DB16BE3870853C28F2A03BBEAE20BDA1BACEBCF2F48C3C46E6FDBC002B48BD6656D9BAF3D7753D56954BBC0FD731BD97F4D43D253F2EBBE46EFCBD09EC7EBDED411EBD36DB9FBC29C2C8BC66F5413DC79A713B7A5F183D4ABF34BDA58529BBBB7F563B098E52BDC212B1BC26F0B03D4EEE573D4157363D68DB51BCF759A73B131389BBE695923CD48E67BDD4D1FB3DC6B4E5BB222421BD75BB15BC7128883D4D88B1BDB2EFA33C6E18973C14FA673BA3D233BD4C5C3BBDB063D33C399F87BBFEA8763D8B8F353D420E643CF3BB8D3C416F49BC0133CA3CCAF3DABDE99E643DC98626BBC405FA3CE27EA43A2770643D244D10BC7D72A9BD620CCDBC3AC22E3D1EEC693C3D098C3BF50D56BDB5BC073C7D4A5F3C845246BD4A13283E100B7ABD081F723C8CBC96BC77FDF43CE7EDBEBCA93525BAAD1B883D3035D53C6BEA413D61A688BC30832FBD98C3D23CAA551EBD73009F3D8D7B9C3CFCAE3BBD888F343CA4B4C23CA837513C31A228BDD3089DBD6D32E7BB87CC773D7BD9173DA6776EBD001D7E3D1BBD1A3D59C5463BC997B6BCE341153C3E6E0E3D0AD9293D36FF2F3DC251E7BCF6E17E3D4983B4BBCE10DA3D182A8B3C"

Replacing both timing_model and timing together lets you steer phoneme behavior:

timing is more about how to place things (phoneme durations, Articulation tightness/looseness, transition dynamics)
timing_model is more about what rules exist (voicing, whether consonant events like cl/br can appear, and whether they are triggerable)

The combined effect usually looks like this: Articulation won't be rebuilt into another voicebank's timbre or oral target (that's closer to what sing_model does), but it will feel like your voicebank's Articulation is re-timed and re-connected. You may hear tighter/looser Articulation, cleaner/smoother transitions, and noticeable duration changes for certain phonemes (e.g., m, n).

Within the rule boundaries provided by timing_model, voiced/unvoiced behavior or consonant events like cl/br may become easier to trigger or closer to the target's tendency. There is still a limit: even with a full swap, you typically move the trend toward the target rather than perfectly cloning its most extreme traits.

One-sentence summary: timing controls phoneme intensity/duration/connection, timing_model controls available phoneme behaviors and rules. Together they shape phoneme-level Articulation feel and consonant behavior.

This model set is also independent from the Base Model, so swapping is generally safe. Still, test carefully.

Q1: Can I use only one of `timing` or `timing_model`?

A: Yes. The effect may be smaller. Try and adjust it carefully.

Q2: Can I mix and match `timing` / `timing_model` / `sing_model` freely?

A: Yes. Their effects are often subtle and require careful control. If you switch all three together, the overall behavior tends to shift more consistently toward the chosen target.

Phoneme Tables: Switching Languages

Flat Manager supports switching/editing/adding phoneme tables. Open the editor and click the phoneme table (the "A" icon with three dots).

Each language has its own phoneme table, and each entry contains:

Phoneme name (name): the identifier you type into the phoneme field.
Phoneme type (type): e.g., stop, vowel, fricative, nasal, liquid.
System token (token): the real token used in training/synthesis. If two phoneme names map to the same token, they are pronunciation-equivalent within the same language (cross-language equivalence is not guaranteed).

Flat lets you use any SynthV system token in any language. You can also rename phonemes or define new ones as you like. By editing phoneme tables (which apply to all voicebanks) and setting which languages a voicebank can use, you can invent new languages or write your own phoneme system. The system still relies on the original training tokens, so results vary.

Flat has edited the phoneme tables of Cantonese and Spanish for Language Extensions.

{
  "name": "cantonese-xsampa-phones",
  "phonemes": [
    {"name":"a", "type":"vowel", "token":"a"}
  ]
}
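
Because equivalence is defined by shared tokens, a phoneme table in the shape shown above can be scanned for names that map to the same token. This is a small sketch with a hypothetical helper name; the demo table is made up for illustration and is not one of Flat's real tables.

```python
from collections import defaultdict


def equivalent_phonemes(table: dict) -> dict:
    """Group phoneme names by system token, keeping tokens shared by 2+ names."""
    by_token = defaultdict(list)
    for entry in table["phonemes"]:
        by_token[entry["token"]].append(entry["name"])
    return {tok: names for tok, names in by_token.items() if len(names) > 1}


# Made-up table for illustration only.
demo = {
    "name": "demo-phones",
    "phonemes": [
        {"name": "a", "type": "vowel", "token": "a"},
        {"name": "aa", "type": "vowel", "token": "a"},
        {"name": "i", "type": "vowel", "token": "i"},
    ],
}
groups = equivalent_phonemes(demo)
```

Here groups contains a single entry, for token a, listing the names a and aa as pronunciation-equivalent within that table.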

Synthesizer V Studio Flat Editor Overview

Voicebank Manager

The voicebank library includes timbre, Vocal Mode, and pitch data for currently known SV1 singers (including some SV2 compatibility libraries and SV2 PLUS libraries). You can open a voicebank or send embeddings to the Vocal Mode editor (also called the Mixer) for further processing.

Note: To reproduce timbre and Vocal Mode behavior accurately, using data from the same Base Model is recommended.

The style library supports grouping voicebanks by name / vendor / Base Model, and supports refresh, batch install, batch export, and search.

Voicebank Editor

Open: load the NOFS-JSON (.nofs) voicebank you want to edit.
Save: save the modified NOFS-JSON.
Metadata editing: edit name, version, vendor, default language. SV loads the highest version by default.
Timbre & Vocal Mode editing: paste 256-HEX strings. Right-click a HEX string to adjust its magnitude, export the embedding to the Mixer, predict extra, and more.
Reload: reload the NOFS-JSON.
Reset: generate a random voicebank.
Set translation: configure the translation file.
Export .sfpk: save/export a voicebank to .sfpk.
Install .sfpk: install a .sfpk file.
New Random Style: add a random style at the end of styles.

Voicebank editor field template:

{
  "name": "Voicebank display name",
  "version": "Version string",
  "vendor": "Vendor / publisher",
  "language": "Default / primary language",
  "phoneset": "Phoneme table (e.g., xsampa; can be inferred from language, not suggested)",
  "support_languages": [
    "List of supported languages(not suggested)"
  ],
  "base_model": "Base Model hash (locate/verify model files)",
  "sing_model": "Singing assistant model hash",
  "timing_model": "Timing model hash",
  "f0_model": "F0 model hash",
  "styles": [
    {
      "name": "Style label (e.g., (base)/(default))",
      "data": "Embedding vector (serialized 256-HEX string)"
    },
    {
      "name": "Style label",
      "data": "Embedding vector (serialized 256-HEX string)",
      "extra": "Extra scalar correction"
    }
  ],
  "pitch": "Auto-Pitch embedding (serialized 256-HEX string)",
  "timing": "Phoneme timing embedding (serialized 1024-HEX string)",
  "note": "Additional properties are ok for notes"
}

Note 1: You can add your own custom fields to the voicebank JSON (Flat Manager won't read them). This can be used as a "notes" area for recipes and management:

"Memory_1": "Notes: ...",
"Memory_2": "Notes: ..."

Note 2: Flat Manager has a default fill mechanism: if critical fields (like base_model) are missing, it backfills them with defaults (e.g., 2c233d8e9b19f1f4dc0276ba3a5542c1). A voicebank with no (base) or (default) style has the missing style backfilled with 000... (256 zeros). The recommended practice is to never specify phoneset and support_languages manually and to rely on the editor to autofill them. For a minimal example, see the NOFS-JSON of Refresh (you can view it in the Preview Editor).
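
The fill behavior described in Note 2 can be pictured with the sketch below. It illustrates the described outcome only and is not Flat Manager's actual code. The helper name backfill is hypothetical, and placing (base) and (default) at the front of styles is an assumption.

```python
DEFAULT_BASE_MODEL = "2c233d8e9b19f1f4dc0276ba3a5542c1"
ZERO_EMB = "0" * 256
REQUIRED_STYLES = ("(base)", "(default)")


def backfill(voicebank: dict) -> dict:
    """Return a copy with base_model, (base) and (default) filled in when absent."""
    filled = dict(voicebank)
    filled.setdefault("base_model", DEFAULT_BASE_MODEL)
    styles = list(filled.get("styles", []))
    by_name = {s["name"]: s for s in styles}
    # Keep existing foundational styles; create all-zero ones where missing.
    head = [by_name.get(n, {"name": n, "data": ZERO_EMB}) for n in REQUIRED_STYLES]
    tail = [s for s in styles if s["name"] not in REQUIRED_STYLES]
    filled["styles"] = head + tail
    return filled
```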

Mixer

The Mixer supports blending 256-HEX embeddings for timbre, Vocal Modes, and pitch.

Add channel: click "New Channel" and paste a 256-HEX embedding.
Add random channel: click "New Random Channel" to generate a random 256-HEX embedding.
Multi-channel mixing: send embeddings from the Editor or the Preview Editors to the Mixer for mixing.
Mix ratio: each channel has a weight slider.
Negative weights: each channel slider can go down to -100%.
Bus control: the bus ratio ranges from 0% to 500% and scales overall blend strength. Ticking the checkbox on the right switches the bus to forced normalization of the output magnitude.
Export: export the current mix result back into the voicebank Editor.
Copy as JSON: copy the current mix in JSON format.
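
As a rough mental model of the controls above, the sketch below treats the mix as a plain weighted sum of the decoded vectors, followed by either the bus scale or a rescale to a fixed L2 norm. All of this is assumed: the Mixer's real arithmetic is not documented in this guide, the data is taken to be little-endian fp32, and the function name mix and its parameters are hypothetical.

```python
import math
import struct


def mix(channels, bus=1.0, normalize_to=None):
    """Weighted sum of 256-HEX embeddings.

    channels: list of (hex_string, weight), weight in [-1.0, 1.0] (-100% to 100%)
    bus: overall scale in [0.0, 5.0] (0% to 500%)
    normalize_to: if set, rescale the result to this L2 norm instead of applying bus
    """
    out = [0.0] * 32
    for hex_str, weight in channels:
        values = struct.unpack("<32f", bytes.fromhex(hex_str))
        out = [acc + weight * v for acc, v in zip(out, values)]
    if normalize_to is not None:
        norm = math.sqrt(sum(v * v for v in out))
        scale = normalize_to / norm if norm else 0.0
    else:
        scale = bus
    return struct.pack("<32f", *(v * scale for v in out)).hex().upper()
```

In this model, a negative weight subtracts a channel's direction from the blend, which is one way to read the Negative weights control.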

Phoneme Table Editor

Switch/browse phoneme tables. See the Phoneme Tables section in Basics and Background Concepts.

How to Install Language Extensions: Language extensions are essentially user dictionaries, but they unlock hidden phonemes when used with Flat. Because of limitations in the original SynthV R2, a user dictionary can only be installed under one default language (that is, only voicebanks with that default language can use it). The installer therefore first prompts you to choose the default language for the extension. For example, to use the Aver or Asterian voicebank with an extension for any language, select Japanese + English.

Next, the installer prompts you to choose which languages to install extensions for. If you need Russian and French, select Russian + French. After installation, Flat lets voicebanks whose default language is Japanese or English use the Russian and French phonemes.

Note: Having too many user dictionaries can noticeably slow down SynthV's startup.

How to Use Language Extensions: After selecting a voicebank like Aver, and confirming that the extension for its default language (Japanese) is installed, go to the user dictionary tab in the SV sidebar. Select the dictionary, such as "ru(sp)_dict." Here, "ru" refers to the Russian extension, and "(sp)" indicates that the language must be switched to Spanish in the singer tab. In other words, the "ru" extension is used under the "sp" language setting.

Once this is done, you can enter Russian lyrics directly and have the voicebank sing them.

Why Are Some Phonemes Silent, and Why Does Auto-Pitch Sometimes Render Incorrectly? Older Base Models are more likely to have missing phonemes. Flat only unlocks hidden phonemes; it does not add new ones. Auto-Pitch generation is strongly correlated with phonemes, so phoneme errors also cause Auto-Pitch problems.

Advanced Exercises

This section focuses on special workflows and ways of thinking. It's not "better" than basic usage - just different. This guide is also limited by the author's experience and the time spent compiling it, so treat it as a starting point and validate by ear.

Base Model Thinking: Using Offset Models to Guide Voicebank Tuning

Recall the Base Model definition:

base_model is the foundational model. SV1 Pro was likely trained on a large dataset, producing a Base Model that defines a feature space. By adjusting embedding vectors, singing can move continuously across different voice states.

In short, the Base Model is the lowest-level data source and stores a range of voice states.

So, changes in timbre and singing style are changes in embedding vectors inside the Base Model's high-dimensional feature space.

If you want to build or tune a voicebank, start with a suitable Base Model and then confirm the timbre parameters step by step: from (base) to (default) to (auxiliary). This is a layered tuning workflow: foundational timbre → default mode → specific modes.

When creating a voicebank via random search / voiceprint comparison / data mining, keep in mind:

On many SV1 Base Models, randomizing (base) alone can sometimes land near a target voicebank, because (base) carries high weight and strongly determines core characteristics. As a rough heuristic, if the cosine similarity between a candidate (base) embedding and the target embedding is above ~0.7, it may sound relatively close.
For SV2 compatibility libraries / PLUS compatibility libraries, (default) often plays a more central role, so starting analysis from (default) can be more effective.
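
The cosine-similarity heuristic mentioned above can be computed directly from two 256-HEX strings. This sketch assumes little-endian fp32 data, and the function name is hypothetical. The ~0.7 threshold is this guide's rough heuristic, not something the code verifies.

```python
import math
import struct


def cosine_similarity(hex_a, hex_b):
    """Cosine of the angle between two 32-dimensional embeddings."""
    a = struct.unpack("<32f", bytes.fromhex(hex_a))
    b = struct.unpack("<32f", bytes.fromhex(hex_b))
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero embedding")
    return dot / norm
```

The result ranges from -1 (opposite direction) to 1 (same direction) and ignores magnitude, so it compares the direction of two embeddings rather than their strength.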

During tuning:

Since (default) and (auxiliary) are equivalent categories, you can clear (default), rename it, and move it into the auxiliary section to tune the default mode as if it were an adjustable mode.
Some voicebanks' (default) vectors can act as very effective soft/strong Vocal Modes.
Within the same Base Model, mixing two different voicebanks' (default) vectors can create a hybrid embedding with traits from both. This can be a practical "XSY" workflow. You can then reuse that mixed (default) as an adjustable default mode to increase flexibility.
Auxiliary modes are built on top of (base) and (default). If the foundational styles are too far apart, auxiliary transfer may degrade. One workaround is to mix (base) and (default) toward the target first (for example, source:target = 7:3) to reduce feature-space distance. The ratio is not fixed; avoid overly large shifts because they can distort timbre and drift away from the source identity.
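
Written out, and assuming the mix is a plain linear blend with no renormalization (an assumption, since the Mixer's exact arithmetic is not documented here), the 7:3 example corresponds to:

$$ \mathbf{e}_{\text{mixed}} = 0.7\,\mathbf{e}_{\text{source}} + 0.3\,\mathbf{e}_{\text{target}} $$

More generally, with a shift amount $t \in [0, 1]$, $\mathbf{e}_{\text{mixed}} = (1 - t)\,\mathbf{e}_{\text{source}} + t\,\mathbf{e}_{\text{target}}$. Small values of $t$ keep the result close to the source identity. The same blend would be applied separately to (base) and to (default).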

Similarity and Compatibility: Thinking About Relationships Between Base Models and Voicebanks

If different Base Models are "supposed" to be incompatible, why do some cross-model transfers work?

Different-Base-Model mixing can work, but the outcome depends strongly on Base Model similarity, feature-space distance between (base) layers, and how the underlying data overlaps.

Many Base Models are fine-tuned from the general SV1 Base Model, which means there can be partial compatibility. However, feature-space distance is difficult to estimate directly; Base Models trained far apart in time or on very different data can be much harder to transfer between.

An observed workflow: if a cross-model Vocal Mode works well, you can sometimes use a voicebank's (base) as an adapter so that another voicebank on the same Base Model can inherit that cross-model mode more smoothly. This is a practical form of "similarity and compatibility".

Summary: whether transfer works is mostly correlated with Base Model and (base) similarity. "Cross Base Model" does not automatically mean "impossible".

Magnitude and Influence: Is "More" Always Stronger?

In many cases, embedding magnitude affects a Vocal Mode's effective strength. But SV seems to enforce internal limits on some voicebanks: even if you increase the displayed percentage, the effect may saturate.

Large magnitudes can also cause clipping, no audible change, or extreme loudness.

In practice, influence isn't only magnitude—it also depends on distribution. If you cluster all Vocal Modes within a voicebank (high-dimensional analysis), outliers often sound distinctive (either very strong or very subtle), and the relationship with magnitude is not absolute.

When mixing, avoid blending strongly opposing "strong vs. weak" modes unless you're specifically exploring special effects. That kind of mixing can cancel amplitude in feature space and make the result bland.

Inverting Vocal Modes: Getting "Opposite" Effects

Based on a large amount of analysis, many Vocal Modes (notably on Base Models such as 6e5da191faa421a20b529b40c3aa4968) can produce a roughly "opposite" effect via simple inversion.

Some practical ideas:

Create an inverted mode using the magnitude tools described in A1.
Automate between inverted and non-inverted modes to achieve smooth transitions from -500% to 500%.

Takeaway: Where Are the Limits—and What Comes Next?

In the editor, a voicebank doesn't need to be a fixed timbre. You can push one voicebank into different singing states by adjusting oral target placement, Resonance / phonation behavior, Articulation tightness, phoneme transitions, breathiness, and more.

With that level of control, "similarity" becomes more than a subjective judgment. It can be turned into a workflow: align toward a target mode, infer which parameters to move, and iterate with structure. The long-term boundary may not be "timbre cloning", but rather a stable, reproducible space of singing strategies—something closer to a creative and pedagogical vocal design system, where higher freedom does not sacrifice stability.

Special Recipes

Here are some examples to get you started.

Recipe 1: Fine-tune mouth shape and Articulation using the Muxin v200 model

base: 30% shuo + 70% pastel
default: 30% shuo + 70% pastel
singmodel: Muxin
F0: Muxin
timing: Muxin
35% mild
50% shuo (powerful)

Recipe 2: Use "Minus (default)" as an alternative for whisper VM

Minus (default)

to be continued ...

Appendix

Phoneme Name	Pronunciation Type	System Token	Language	Phoneme System
aa	vowel	a	english	arpabet
ae	vowel	ARP_ae	english	arpabet
ah	vowel	A	english	arpabet
ao	vowel	ARP_ao	english	arpabet
aw	diphthong	AU	english	arpabet
ax	vowel	ARP_ax	english	arpabet
ay	diphthong	ARP_ay	english	arpabet
b	stop	p	english	arpabet
ch	affricate	ts`h	english	arpabet
d	stop	t	english	arpabet
dx	stop	ARP_dx	english	arpabet
dr	affricate	ARP_dr	english	arpabet
dw	affricate	ARP_dw	english	arpabet
dh	fricative	ARP_dh	english	arpabet
eh	vowel	e	english	arpabet
er	vowel	ARP_er	english	arpabet
ey	diphthong	ARP_ey	english	arpabet
f	fricative	f	english	arpabet
g	stop	k	english	arpabet
hh	aspirate	x	english	arpabet
ih	vowel	ARP_ih	english	arpabet
iy	vowel	i	english	arpabet
jh	affricate	ts`	english	arpabet
k	stop	kh	english	arpabet
l	liquid	l	english	arpabet
m	nasal	m	english	arpabet
n	nasal	n	english	arpabet
ng	nasal	N	english	arpabet
ow	diphthong	@U	english	arpabet
oy	diphthong	ARP_oy	english	arpabet
p	stop	ph	english	arpabet
q	stop	ARP_q	english	arpabet
r	semivowel	z`	english	arpabet
s	fricative	s	english	arpabet
sh	fricative	s`	english	arpabet
t	stop	th	english	arpabet
tr	affricate	ARP_tr	english	arpabet
tw	affricate	ARP_tw	english	arpabet
th	fricative	ARP_th	english	arpabet
uh	vowel	u	english	arpabet
uw	vowel	y	english	arpabet
v	fricative	ARP_v	english	arpabet
w	semivowel	w	english	arpabet
y	semivowel	j	english	arpabet
z	fricative	ts	english	arpabet
zh	fricative	ARP_zh	english	arpabet
pau	silence	pau	english	arpabet
sil	silence	sil	english	arpabet
cl	stop	cl	english	arpabet
br	breath	br	english	arpabet
a	vowel	a	japanese	romaji
i	vowel	i	japanese	romaji
u	vowel	u	japanese	romaji
e	vowel	e	japanese	romaji
o	vowel	o	japanese	romaji
N	vowel	N	japanese	romaji
cl	stop	cl	japanese	romaji
t	stop	th	japanese	romaji
d	stop	t	japanese	romaji
s	fricative	s	japanese	romaji
sh	fricative	s\	japanese	romaji
j	affricate	ts\	japanese	romaji
z	affricate	ts	japanese	romaji
ts	affricate	tsh	japanese	romaji
k	stop	kh	japanese	romaji
kw	stop	ROM_kw	japanese	romaji
g	stop	k	japanese	romaji
gw	stop	ROM_gw	japanese	romaji
h	aspirate	x	japanese	romaji
b	stop	p	japanese	romaji
p	stop	ph	japanese	romaji
f	fricative	f	japanese	romaji
ch	affricate	ts\h	japanese	romaji
ry	liquid	ROM_ry	japanese	romaji
ky	stop	ROM_ky	japanese	romaji
py	stop	ROM_py	japanese	romaji
dy	stop	ROM_dy	japanese	romaji
ty	stop	ROM_ty	japanese	romaji
ny	nasal	ROM_ny	japanese	romaji
hy	aspirate	ROM_hy	japanese	romaji
my	nasal	ROM_my	japanese	romaji
gy	stop	ROM_gy	japanese	romaji
by	stop	ROM_by	japanese	romaji
n	nasal	n	japanese	romaji
m	nasal	m	japanese	romaji
r	liquid	l	japanese	romaji
w	semivowel	w	japanese	romaji
v	semivowel	ARP_v	japanese	romaji
y	semivowel	j	japanese	romaji
pau	silence	pau	japanese	romaji
sil	silence	sil	japanese	romaji
br	breath	br	japanese	romaji
a	vowel	a	mandarin	xsampa
A	vowel	A	mandarin	xsampa
o	vowel	o	mandarin	xsampa
@	vowel	@	mandarin	xsampa
e	vowel	e	mandarin	xsampa
7	vowel	7	mandarin	xsampa
U	vowel	U	mandarin	xsampa
u	vowel	u	mandarin	xsampa
i	vowel	i	mandarin	xsampa
i\	vowel	i\	mandarin	xsampa
i`	vowel	i`	mandarin	xsampa
y	vowel	y	mandarin	xsampa
AU	diphthong	AU	mandarin	xsampa
@U	diphthong	@U	mandarin	xsampa
ia	diphthong	ia	mandarin	xsampa
iA	diphthong	iA	mandarin	xsampa
iAU	diphthong	iAU	mandarin	xsampa
ie	diphthong	ie	mandarin	xsampa
iE	diphthong	iE	mandarin	xsampa
iU	diphthong	iU	mandarin	xsampa
i@U	diphthong	i@U	mandarin	xsampa
y{	diphthong	y{	mandarin	xsampa
yE	diphthong	yE	mandarin	xsampa
ua	diphthong	ua	mandarin	xsampa
uA	diphthong	uA	mandarin	xsampa
u@	diphthong	u@	mandarin	xsampa
ue	diphthong	ue	mandarin	xsampa
uo	diphthong	uo	mandarin	xsampa
:\i	coda	:\i	mandarin	xsampa
r\`	coda	r\`	mandarin	xsampa
:n	coda	:n	mandarin	xsampa
N	coda	N	mandarin	xsampa
p	stop	p	mandarin	xsampa
ph	stop	ph	mandarin	xsampa
t	stop	t	mandarin	xsampa
th	stop	th	mandarin	xsampa
k	stop	k	mandarin	xsampa
kh	stop	kh	mandarin	xsampa
ts\	affricate	ts\	mandarin	xsampa
ts	affricate	ts	mandarin	xsampa
tsh	affricate	tsh	mandarin	xsampa
ts`	affricate	ts`	mandarin	xsampa
ts`h	affricate	ts`h	mandarin	xsampa
x	aspirate	x	mandarin	xsampa
f	fricative	f	mandarin	xsampa
s	fricative	s	mandarin	xsampa
s`	fricative	s`	mandarin	xsampa
ts\h	fricative	ts\h	mandarin	xsampa
s\	fricative	s\	mandarin	xsampa
m	nasal	m	mandarin	xsampa
n	nasal	n	mandarin	xsampa
l	liquid	l	mandarin	xsampa
z`	semivowel	z`	mandarin	xsampa
w	semivowel	w	mandarin	xsampa
j	semivowel	j	mandarin	xsampa
pau	silence	pau	mandarin	xsampa
sil	silence	sil	mandarin	xsampa
cl	stop	cl	mandarin	xsampa
br	breath	br	mandarin	xsampa
ts	affricate	ts	cantonese	xsampa
tsh	affricate	tsh	cantonese	xsampa
f	fricative	f	cantonese	xsampa
h	fricative	x	cantonese	xsampa
s	fricative	s	cantonese	xsampa
l	liquid	l	cantonese	xsampa
m	nasal	m	cantonese	xsampa
n	nasal	n	cantonese	xsampa
N	nasal	YUE_N	cantonese	xsampa
w	semivowel	w	cantonese	xsampa
j	semivowel	j	cantonese	xsampa
p	stop	p	cantonese	xsampa
ph	stop	ph	cantonese	xsampa
t	stop	t	cantonese	xsampa
th	stop	th	cantonese	xsampa
k	stop	k	cantonese	xsampa
kh	stop	kh	cantonese	xsampa
kw	stop	ROM_gw	cantonese	xsampa
kwh	stop	ROM_kw	cantonese	xsampa
a	vowel	a	cantonese	xsampa
6	vowel	6	cantonese	xsampa
E	vowel	e	cantonese	xsampa
e	vowel	e	cantonese	xsampa
i	vowel	i	cantonese	xsampa
I	vowel	ARP_ih	cantonese	xsampa
O	vowel	o	cantonese	xsampa
o	vowel	o	cantonese	xsampa
u	vowel	u	cantonese	xsampa
U	vowel	U	cantonese	xsampa
9	vowel	9	cantonese	xsampa
8	vowel	8	cantonese	xsampa
y	vowel	y	cantonese	xsampa
m=	vowel	m	cantonese	xsampa
N=	vowel	N	cantonese	xsampa
:i	coda	:\i	cantonese	xsampa
:u	coda	:u	cantonese	xsampa
:m	coda	:m	cantonese	xsampa
:n	coda	:n	cantonese	xsampa
:N	coda	N	cantonese	xsampa
:p_}	coda	:p_}	cantonese	xsampa
:t_}	coda	:t_}	cantonese	xsampa
:k_}	coda	:k_}	cantonese	xsampa
pau	silence	pau	cantonese	xsampa
sil	silence	sil	cantonese	xsampa
cl	stop	cl	cantonese	xsampa
br	breath	br	cantonese	xsampa
a	vowel	a	spanish	xsampa
e	vowel	e	spanish	xsampa
i	vowel	i	spanish	xsampa
o	vowel	o	spanish	xsampa
u	vowel	u	spanish	xsampa
U	semivowel	w	spanish	xsampa
I	semivowel	ES_I	spanish	xsampa
y	semivowel	j	spanish	xsampa
ll	semivowel	ll	spanish	xsampa
b	stop	b	spanish	xsampa
B	stop	B	spanish	xsampa
d	stop	d	spanish	xsampa
D	stop	D	spanish	xsampa
g	stop	g	spanish	xsampa
k	stop	k	spanish	xsampa
p	stop	p	spanish	xsampa
t	stop	t	spanish	xsampa
l	liquid	l	spanish	xsampa
rr	trill	rr	spanish	xsampa
r	liquid	r	spanish	xsampa
m	nasal	m	spanish	xsampa
n	nasal	n	spanish	xsampa
N	nasal	N	spanish	xsampa
J	nasal	ROM_ny	spanish	xsampa
f	fricative	f	spanish	xsampa
s	fricative	s	spanish	xsampa
C	fricative	ARP_th	spanish	xsampa
sh	fricative	s`	spanish	xsampa
ch	affricate	ts`h	spanish	xsampa
x	fricative	x	spanish	xsampa
pau	silence	pau	spanish	xsampa
sil	silence	sil	spanish	xsampa
cl	stop	cl	spanish	xsampa
br	breath	br	spanish	xsampa
4	liquid	l	korean	xsampa
6	vowel	6	korean	xsampa
b	stop	b	korean	xsampa
d	stop	d	korean	xsampa
dz\	affricate	ts\	korean	xsampa
e_o	vowel	e	korean	xsampa
g	stop	g	korean	xsampa
h	fricative	x	korean	xsampa
i	vowel	i	korean	xsampa
j	semivowel	j	korean	xsampa
ts\h	affricate	ts\h	korean	xsampa
k	stop	k	korean	xsampa
k_t	stop	k_t	korean	xsampa
l	liquid	l	korean	xsampa
M	nasal	U	korean	xsampa
m	nasal	m	korean	xsampa
n	nasal	n	korean	xsampa
N	coda	N	korean	xsampa
o	vowel	o	korean	xsampa
p	stop	p	korean	xsampa
p_t	stop	p_t	korean	xsampa
s	fricative	s	korean	xsampa
s_t	fricative	s_t	korean	xsampa
t	stop	t	korean	xsampa
t_t	stop	t_t	korean	xsampa
ts\_h	affricate	ts`h	korean	xsampa
u	vowel	u	korean	xsampa
V	vowel	A	korean	xsampa
w	semivowel	w	korean	xsampa
pau	silence	pau	korean	xsampa
sil	silence	sil	korean	xsampa
cl	stop	cl	korean	xsampa
br	breath	br	korean	xsampa
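
The table above is tab-separated, so it can be loaded into a lookup keyed by language and phoneme name. The sketch below is a minimal Python example using only the standard library. It embeds a few rows copied from the table; the helper names load_phoneme_table and system_token are hypothetical and are not part of SV Flat.

```python
import csv
import io

# A few rows copied from the appendix table, tab-separated like the original.
APPENDIX_EXCERPT = (
    "Phoneme Name\tPronunciation Type\tSystem Token\tLanguage\tPhoneme System\n"
    "aa\tvowel\ta\tenglish\tarpabet\n"
    "ch\taffricate\tts`h\tenglish\tarpabet\n"
    "t\tstop\tth\tenglish\tarpabet\n"
    "t\tstop\tth\tjapanese\tromaji\n"
    "t\tstop\tt\tmandarin\txsampa\n"
)

def load_phoneme_table(text):
    """Return {(language, phoneme_name): row} from tab-separated text."""
    # QUOTE_NONE keeps characters such as quotes and backticks literal.
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {(row["Language"], row["Phoneme Name"]): row for row in reader}

def system_token(table, language, phoneme):
    """Look up the System Token column for one phoneme in one language."""
    return table[(language, phoneme)]["System Token"]

table = load_phoneme_table(APPENDIX_EXCERPT)
print(system_token(table, "english", "t"))   # th
print(system_token(table, "mandarin", "t"))  # t
```

The key includes the language because the same phoneme name can map to different system tokens: "t" is th in the English and Japanese rows but t in the Mandarin row.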