Synthesizer V Studio Flat: A Practical Guide (v1.35, revised)

Written by POPY
With thanks to W and M.

What Kind of Tool Is SV Flat?

What Is Synthesizer V Studio Flat?

Synthesizer V Studio Flat is an enhanced build of SV1 Pro created by Synth fans and inspired by Yumekey. In addition to unlocking and bundling voicebanks, it ships with an add-on called Flat Manager that lets you manage voicebanks and edit them with far more freedom.


Why Use Synthesizer V Studio Flat?

With Synthesizer V Studio Flat and its editors, you can modify existing SV1 Pro voicebanks with almost no practical limits. If your goal is to improve vocal quality and push for more refined results, SV Flat opens up an entirely different workflow.


What New Features Does Synthesizer V Studio Flat Provide?


How Do I Get Started With Synthesizer V Studio Flat?

Read through this guide end to end, then experiment section by section. That's the fastest way to build an intuition for what each parameter controls and how to iterate safely.


Basics and Background Concepts

SFPK Format

SFPK is the voicebank file format used by Synthesizer V Studio Flat. It can be opened and installed via Flat Manager. Conceptually, an SFPK is an archive; depending on the voicebank, it may contain Base Model files, images, NOFS-JSON, and more.

NOFS-JSON is Flat's lightweight voicebank format. By editing the JSON in Flat Manager, you can change metadata, timbre parameters, Vocal Modes, pitch-related parameters, phoneme tables, and more.

Timbre / Vocal Mode / pitch parameters in the JSON are 256-character HEX strings. Internally they are parsed as 32 fp32 values (32 values × 4 bytes × 2 HEX characters per byte = 256 characters), forming a 32-dimensional embedding vector (Emb). By performing mathematical operations on these floats, you can blend or reinforce timbre and Vocal Modes.
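
As a rough illustration of that layout, the sketch below converts between a HEX string and a list of floats using Python's standard struct module. The little-endian byte order is an assumption inferred from the sample strings in this guide (for example, C622493E decodes to about 0.196 when read little-endian); the function names are made up for this example and are not part of Flat Manager.

```python
import struct

EMB_DIM = 32  # 32 fp32 values -> 128 bytes -> 256 HEX characters


def hex_to_floats(hex_str: str) -> list[float]:
    """Decode a HEX string into fp32 values (little-endian is an assumption)."""
    raw = bytes.fromhex(hex_str)
    if len(raw) % 4 != 0:
        raise ValueError("length must be a multiple of 8 HEX characters")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def floats_to_hex(values: list[float]) -> str:
    """Encode fp32 values back into an upper-case HEX string."""
    return struct.pack(f"<{len(values)}f", *values).hex().upper()


# "0000803F" is 1.0 in little-endian fp32; repeating it gives a demo vector.
demo_hex = "0000803F" * EMB_DIM
demo = hex_to_floats(demo_hex)
print(len(demo_hex), len(demo), demo[0])
```

Once the values are plain floats, the blending and scaling operations described later in this guide become ordinary list arithmetic.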

Note: For batch packaging, you can zip multiple .sfpk files and rename the zip extension to .sfpks.
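
The zip-and-rename step can also be scripted. This is a minimal sketch using Python's zipfile module; it assumes the .sfpk files should sit at the root of the archive and that standard deflate compression is acceptable, neither of which is confirmed by this guide. The function name and the placeholder files are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_sfpks(sfpk_paths: list[Path], out_path: Path) -> Path:
    """Zip several .sfpk files into one archive named *.sfpks."""
    out_path = out_path.with_suffix(".sfpks")
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sfpk_paths:
            # Store each file at the archive root (flat layout is an assumption).
            archive.write(path, arcname=path.name)
    return out_path


# Demo with stand-in files; these are not real voicebanks.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    parts = []
    for name in ("a.sfpk", "b.sfpk"):
        part = tmp_dir / name
        part.write_bytes(b"placeholder")
        parts.append(part)
    bundle = bundle_sfpks(parts, tmp_dir / "pack.zip")
    with zipfile.ZipFile(bundle) as zf:
        bundled_names = sorted(zf.namelist())
print(bundle.name, bundled_names)
```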


Base Model (base_model): The Common Denominator Behind Voicebanks

In Flat Manager, use the main menu sort option "By model" to group voicebanks by model. That model is the Base Model (base_model). In the voicebank JSON editor you'll also see entries such as:

"base_model": "2c233d8e9b19f1f4dc0276ba3a5542c1"

base_model is the foundational model a voicebank is built on. A practical way to think about it: SV1 Pro was likely trained on large datasets (male and female voices from a training batch), producing a Base Model. In generative-model terms (e.g., VAE), a fixed Base Model defines a region of feature space. Within that space, adjusting embedding vectors (Emb) lets the singing move smoothly across different voice states.

In short, the Base Model is the lowest-level data source for a voicebank. It stores a range of voice states and largely determines the timbre range the voicebank can reach.

Base Model field format:

"base_model": "2c233d8e9b19f1f4dc0276ba3a5542c1"

Q1: What Can I Do With Base Model Characteristics?

A: Current Flat Manager versions support mixing/transferring/modifying voicebank timbre parameters (base / default / auxiliary). These parameters are rooted in the Base Model's feature space. If you want to mix or transfer Vocal Modes between voicebanks, using the same base_model is strongly recommended; otherwise results may become unpredictable (sometimes surprisingly good, but not controllable).

Q2: Can I Only Replace base_model?

A: You can, but it's not recommended. A voicebank's timbre is built on top of its Base Model. Replacing only the Base Model usually behaves like generating a random voicebank and is rarely useful.

Q3: Could Different Base Models Be Similar?

A: Yes, aside from SV2 compatibility libraries / PLUS compatibility libraries. Many Base Models are fine-tuned from the SV1 "general" Base Model (2c233d8e9b19f1f4dc0276ba3a5542c1) with additional data, so they can be somewhat similar. On these fine-tuned Base Models (use release timing as a clue), you may sometimes reuse timbre parameters from the general SV1 Base Model, but the outcome still has significant uncertainty because the Base Model changed.


Timbre Parameters (base / default / auxiliary): Layered Offset Style Vectors

With a Base Model, SV needs a way to point to a specific voicebank. That's the role of timbre parameters.

At a high level, SynthV Pro can be treated as a three-stage "offset model": three types of embedding vectors (Emb) are stacked to shape a specific voicebank:

  1. (base): the foundational timbre vector. It anchors core characteristics such as Articulation, accent, singing style, and overall timbre. In most Base Models, (base) is what identifies the voicebank (with exceptions such as SV2 compatibility libraries / PLUS compatibility libraries). This is hidden in stock SV1 Pro.
  2. (default): the default Vocal Mode. After (base) is set, (default) anchors the voicebank's default singing mode. (default) is effectively a 100%-strength Vocal Mode and can also be reused as a Vocal Mode in other voicebanks. This is also hidden in stock SV1 Pro.
  3. (auxiliary): adjustable Vocal Modes. Thanks to SV1 Pro's Base Model data coverage, (auxiliary) can express not only timbre but also Resonance, vocal technique, breathiness, and more. In SV Flat, you can tweak/transfer/modify/build special (auxiliary) modes for specific dynamic effects.

(default) and (auxiliary) are equivalent categories and can be blended with each other. (base) is different: in general, (base) should only be blended with other (base) vectors.

Practical importance usually looks like this:

$$ \mathtt{(base) > (default) > (auxiliary)} $$

You can think of SV1 Pro as stacking three offset vectors in order: (base), then (default), then (auxiliary). The direction of a vector represents timbre features; its magnitude represents feature weight. Controlling these vectors gives you fine-grained control over Vocal Modes.

Replacing, mixing, or reshaping these three layers can significantly alter timbre, mouth shape behavior, and perceived texture. Adding or replacing auxiliary styles increases the editing range inside the editor (and works best when you stay aligned with the voicebank's base_model).

extra is a scalar correction used only for (auxiliary) styles. It adjusts details like aspiration; deleting it (so that it defaults to 0) usually causes only small changes. Flat Manager can predict extra for an (auxiliary) style (currently limited to Base Model 2c233d8e9b19f1f4dc0276ba3a5542c1).

Safety note: Before editing a voicebank, change the version and save once. Flat Manager then creates a new version branch, which helps prevent accidental loss of the original.

Format example (styles):

"styles": [ { "name": "(base)", "data": "C622493E2C8638BEBE4E3B3EB7331FBE52C4DA3DCE1A21BD667B993D43A82D3E367652BEB53579BC7814C1BDBA9427BDFFB2913B14E9433D232DE03D5660BD3D2407653EBD37F8BB61C15E3D2F8478BDBD0A8E3D4AAE033EDCEDD83D23F5193E002985BD3A3D09BCA6CA46BD1C4D4BBDBDFDC0BDA52B81BC0201E43D7D4A383D" }, { "name": "(default)", "data": "877D863A8DEEB1BD7ED4BDBDF6BF1FBE1FB8EFBD32474CBE7058D6BCAF8804BE64DED9BCC66E8F3C64F1B1BC6CE880BE263808BE997563BD78C42EBE8D80423D1F2B78BE374A64BB0045F7BDE0CF92BEC45A8EBEE71019BCEEEB45BE6CF7CFBEC8ED133EDA4C19BD951A37BDEAC7733EE8EC98BD4A9C1DBE7DF3013CAC661E3C" }, { "name": "Gentle", "data": "6F231F3D0F37CEBD7EE0503D0080E53D92C8433E00A47C3C741E7B3EC178103EEC0DB3BC5E0556BD007F2E3E65E4233E3457D7BD62F9023EDC75C73D405DAF3ABF9E21BE0B8A593D003AF5BD4074CE3CC09DD2BBB4F4C33C443A59BDDC3775BE28B1C13C56E05D3C60170D3E54FD11BD6A14A13D409ED93B66ADB73C1E9DEF3D", "extra": 0.18505549430847168 } ]

Q1: How Do I Edit Timbre Parameters?

A0 — Add more Vocal Modes / create a blank template:
Open the voicebank editor, right-click any style and copy it, for example:

{ "name": "Gentle", "data": "6F231F3D0F37CEBD7EE0503D0080E53D92C8433E00A47C3C741E7B3EC178103EEC0DB3BC5E0556BD007F2E3E65E4233E3457D7BD62F9023EDC75C73D405DAF3ABF9E21BE0B8A593D003AF5BD4074CE3CC09DD2BBB4F4C33C443A59BDDC3775BE28B1C13C56E05D3C60170D3E54FD11BD6A14A13D409ED93B66ADB73C1E9DEF3D", "extra": 0.18505549430847168 }

Paste it back into the same list and change it into a blank template, for example:

{ "name": "TimbreStyle1", "data": "0000000000000080000000000000008000000000000000800000000000000000000000800000008000000080000000800000000000000000000000000000000000000000000000800000000000000080000000000000000000000000000000000000008000000080000000800000008000000080000000800000000000000000", "extra": 0 }

Make sure the JSON stays valid: commas are the most common issue (missing or extra, depending on whether you insert in the middle or at the end). Click Save (or press Ctrl+S) when done. This blank style is useful as a target slot for A2 — Mixing.
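
If you edit the JSON outside Flat Manager, a quick check before saving can catch stray commas and malformed data strings. The sketch below uses Python's json module; the function name is hypothetical, and the only rule it enforces is the 256-HEX length described earlier in this guide.

```python
import json
import string

HEX_DIGITS = set(string.hexdigits)


def check_styles(json_text: str) -> list[str]:
    """Return a list of problems found in a voicebank JSON's styles list."""
    try:
        bank = json.loads(json_text)
    except json.JSONDecodeError as err:
        # A missing or extra comma typically ends up here.
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    problems = []
    for i, style in enumerate(bank.get("styles", [])):
        data = style.get("data", "")
        if len(data) != 256 or not set(data) <= HEX_DIGITS:
            problems.append(
                f"styles[{i}] ({style.get('name')}): data is not 256 HEX characters"
            )
    return problems


blank = {"name": "TimbreStyle1", "data": "00000000" * 32, "extra": 0}
good = json.dumps({"styles": [blank]})
bad = good.replace("}]}", "},]}")  # simulate a trailing comma after the last style
print(check_styles(good))
print(check_styles(bad))
```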

A1 — Magnitude control:
In the voicebank editor, click a style's data value. Press Ctrl+M (or right-click and select "Adjust embedding Magnitude") to view its absolute magnitude. Use the slider to set the magnitude within [-5, 5], or type a number directly, which has no range limit. Negative values invert the vector direction. You can also add an "enfored_length": 0 field (any number) to set the absolute magnitude directly.

Note: Large magnitudes can easily cause clipping or extreme loudness. Be cautious, inspect rendered waveforms, then audition. Manual input can exceed 5; use that sparingly.
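
To make the idea concrete, here is a small sketch of rescaling a vector to a chosen magnitude. It assumes "absolute magnitude" means the Euclidean (L2) length of the 32 floats, which this guide does not state explicitly, and that the HEX strings are little-endian fp32. The helper names are hypothetical and this is not Flat Manager's own code.

```python
import math
import struct


def decode(hex_str: str) -> list[float]:
    raw = bytes.fromhex(hex_str)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))  # little-endian assumed


def magnitude(values: list[float]) -> float:
    """Euclidean length of the vector (assumed meaning of 'magnitude')."""
    return math.sqrt(sum(v * v for v in values))


def set_magnitude(values: list[float], target: float) -> list[float]:
    """Rescale to |target|; a negative target also flips the direction."""
    current = magnitude(values)
    if current == 0.0:
        raise ValueError("a zero vector has no direction to rescale")
    scale = target / current
    return [v * scale for v in values]


vec = decode("0000803F" * 32)        # 32 values of 1.0
stronger = set_magnitude(vec, 2.0)   # same direction, length 2
flipped = set_magnitude(vec, -2.0)   # opposite direction, length 2
print(magnitude(vec), magnitude(stronger), flipped[0])
```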

A2 — Vocal Mode mixing:
Click the style's data value, then press Ctrl+E (or right-click "Export embedding to Mixer") to send it to the Mixer (calculator icon in Flat Manager). Mix channels with sliders, then click Export to write an auxiliary embedding back into the JSON you're editing. Alternatively, copy the 256-HEX result and paste it into the blank template created in A0.

Note: In theory you can mix any embeddings (base/default/auxiliary, different Base Models, random vectors, pitch embeddings, etc.). For controlled results, follow the Base Model guidance and keep edits targeted.
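
For readers who prefer to see the arithmetic, the sketch below blends embeddings as a plain weighted sum. That formula is an assumption made for illustration; the guide does not document how the Mixer combines its channels internally. The helper names are hypothetical, and little-endian fp32 is assumed for the HEX encoding.

```python
import struct


def decode(hex_str: str) -> list[float]:
    raw = bytes.fromhex(hex_str)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))  # little-endian assumed


def encode(values: list[float]) -> str:
    return struct.pack(f"<{len(values)}f", *values).hex().upper()


def mix(channels: list[tuple[list[float], float]]) -> list[float]:
    """Weighted sum of equally sized embeddings: sum(weight * vector)."""
    dim = len(channels[0][0])
    if any(len(vec) != dim for vec, _ in channels):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(weight * vec[i] for vec, weight in channels) for i in range(dim)]


style_a = decode("0000803F" * 32)  # demo vector, all 1.0
style_b = decode("00000040" * 32)  # demo vector, all 2.0
blend = mix([(style_a, 0.7), (style_b, 0.3)])
blend_hex = encode(blend)          # 256-HEX string ready to paste into a style
print(len(blend_hex), round(blend[0], 3))
```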

A3 — Vocal Mode transfer:
In the source voicebank, copy a style entry. In the destination voicebank, paste it somewhere after (base) and (default) inside styles, then ensure JSON commas are correct. Save (Ctrl+S).
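
The same transfer can be done on parsed JSON, which avoids comma mistakes altogether. A minimal sketch, assuming both voicebanks are loaded as Python dicts with (base) and (default) as the first two styles; the function name and the placeholder embeddings are hypothetical.

```python
import copy


def transfer_style(source: dict, dest: dict, style_name: str) -> dict:
    """Return a copy of dest with the named style from source appended."""
    style = next((s for s in source["styles"] if s["name"] == style_name), None)
    if style is None:
        raise KeyError(f"{style_name!r} not found in source styles")
    if any(s["name"] == style_name for s in dest["styles"]):
        raise ValueError(f"{style_name!r} already exists in destination")
    if source.get("base_model") != dest.get("base_model"):
        print("warning: base_model differs; results may be unpredictable")
    result = copy.deepcopy(dest)
    # Appending at the end keeps the style after (base) and (default),
    # assuming those two entries come first in the list.
    result["styles"].append(copy.deepcopy(style))
    return result


ZERO = "00000000" * 32  # placeholder embedding, not real voicebank data
src = {"base_model": "2c233d8e9b19f1f4dc0276ba3a5542c1",
       "styles": [{"name": "(base)", "data": ZERO},
                  {"name": "(default)", "data": ZERO},
                  {"name": "Gentle", "data": ZERO, "extra": 0.185}]}
dst = {"base_model": "2c233d8e9b19f1f4dc0276ba3a5542c1",
       "styles": [{"name": "(base)", "data": ZERO},
                  {"name": "(default)", "data": ZERO}]}
merged = transfer_style(src, dst, "Gentle")
print([s["name"] for s in merged["styles"]])
```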

A4 — Random style:
In the voicebank editor, click "New Random Style" (round button with a plus). Flat Manager appends a random style to the end of styles. Save (Ctrl+S).

A5 — Random voicebank (Reset):
In the voicebank editor, click "Reset" (square button with a plus). This generates a completely random voicebank. Save (Ctrl+S).

Q2: After Editing, How Do I Make Changes Take Effect in the SV Flat Editor?

A: Besides restarting the SV Flat editor, you can uninstall one minor version of the voicebank and use "Refresh" in the SV Flat editor to reload it.

Q3: How Do I Manage Too Many Vocal Modes Inside One Voicebank?

A: Use version branching. Change version to create separate variants so you can switch between them:

"name": "GUMI AI", "version": "101", "vendor": "INTERNET Co., Ltd.", "language": "japanese", "phoneset": "romaji"

Note: Never create voicebanks that share a name but have different vendors; this may cause problems.

Q4: Can I Mix (default) with (base)?

A: Not recommended. (default) is equivalent to an (auxiliary) at 100% strength, but (base) is different and is generally safest to mix only with other (base) vectors. Mixing (base) with (default) is often unpredictable.


Auto-Pitch (f0_model / pitch): The "Secret" Behind Automatic Singing Pitch

In Flat Manager, f0_model and pitch control a voicebank's Auto-Pitch characteristics.

"f0_model": "dcee89442f69984189a5b2aedbf9f090", "pitch": "9D31DDBD30D22D3D4A195ABDD00D0F3E311D253EF28921BE97631C3EA27055BD62F983BDAC2EDDBC7724243DC266003DF53852BC0699D43DC8119DBDAACDA73E2288033EB2A995BD76A4113EBC717FBD9449BB3E8AFE193E58011BBCD5E7863D763E623DD0FBF83CE814BEBD548B83BE27D7D23DCBF8B23DE4982DBEABB693BC"

Replacing both f0_model and pitch together lets you steer a voicebank's Auto-Pitch behavior.

Q1: Example?

A: If you replace POPY's f0_model & pitch with Minus's f0_model & pitch, POPY's Auto-Pitch behavior will shift toward Minus.
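
On parsed JSON, that replacement is just a copy of two fields. The sketch below keeps them together so the pair never gets split by accident; the function name and the placeholder values are hypothetical, and real entries would be a model hash and a 256-HEX string.

```python
import copy

AUTO_PITCH_FIELDS = ("f0_model", "pitch")


def copy_auto_pitch(source: dict, dest: dict) -> dict:
    """Return a copy of dest that uses source's f0_model and pitch together."""
    missing = [f for f in AUTO_PITCH_FIELDS if f not in source]
    if missing:
        raise KeyError(f"source is missing {missing}")
    result = copy.deepcopy(dest)
    for field in AUTO_PITCH_FIELDS:
        result[field] = source[field]
    return result


# Placeholder values stand in for the real hash and HEX string.
minus = {"name": "Minus", "f0_model": "minus-f0-hash", "pitch": "MINUS-PITCH-HEX"}
popy = {"name": "POPY", "f0_model": "popy-f0-hash", "pitch": "POPY-PITCH-HEX"}
popy_variant = copy_auto_pitch(minus, popy)
print(popy_variant)
```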

Q2: Since pitch is also a float vector, can it be mixed?

A: In theory, yes, but it's usually not very useful. pitch is trained for a specific purpose, and the quality of a mixed pitch vector is harder to judge than that of a mixed Vocal Mode. Using a high-quality Auto-Pitch set is generally the better option.


Singing Assistant Model (sing_model): Optional Plug-ins for Resonance / Articulation Behavior

In Flat Manager, sing_model mainly affects Articulation (oral target / placement) and how the voice behaves in the mixed-voice range, which in turn influences fundamental support. It has a relatively small impact on pure-voice timbre and overall loudness.

"sing_model": "3f649ae6cb04ee4f7e9a7ed72ee29928"

As a rule of thumb: sing_model changes how the voice is produced, not the timbre itself.

sing_model is largely independent from base_model, so it's generally safe to try different sing_model options as an enhancer for a voicebank you like.

That said, its impact is often subtle and the result may not fit your target voicebank. Test carefully.

Q1: How Do I Know What a sing_model Will Do?

A: Try it. Or analyze the source voicebank's singing characteristics and decide whether your target voicebank needs that direction.


Timing Model (timing_model / timing): Controlling Phoneme Durations and Consonant Behavior

In Flat Manager, timing_model and timing control phoneme-level timing. Together they influence duration, consonant phenomena, and the "feel" of transitions.

"timing_model": "7b0ea690ada94b4484b50d9d64a21cae", "timing": "D8168F3D9147F5BC6D9FA93C9A3E923D530D8F3C88D4883C806D7B3CCA41003D456A56BCD5FFC0BC666A5EBDAAA5E2BCC0082FBDB6A66C3D60BABB3C33CF233D064AD83C6BB9AA3C2DC6633CF3DB16BE3870853C28F2A03BBEAE20BDA1BACEBCF2F48C3C46E6FDBC002B48BD6656D9BAF3D7753D56954BBC0FD731BD97F4D43D253F2EBBE46EFCBD09EC7EBDED411EBD36DB9FBC29C2C8BC66F5413DC79A713B7A5F183D4ABF34BDA58529BBBB7F563B098E52BDC212B1BC26F0B03D4EEE573D4157363D68DB51BCF759A73B131389BBE695923CD48E67BDD4D1FB3DC6B4E5BB222421BD75BB15BC7128883D4D88B1BDB2EFA33C6E18973C14FA673BA3D233BD4C5C3BBDB063D33C399F87BBFEA8763D8B8F353D420E643CF3BB8D3C416F49BC0133CA3CCAF3DABDE99E643DC98626BBC405FA3CE27EA43A2770643D244D10BC7D72A9BD620CCDBC3AC22E3D1EEC693C3D098C3BF50D56BDB5BC073C7D4A5F3C845246BD4A13283E100B7ABD081F723C8CBC96BC77FDF43CE7EDBEBCA93525BAAD1B883D3035D53C6BEA413D61A688BC30832FBD98C3D23CAA551EBD73009F3D8D7B9C3CFCAE3BBD888F343CA4B4C23CA837513C31A228BDD3089DBD6D32E7BB87CC773D7BD9173DA6776EBD001D7E3D1BBD1A3D59C5463BC997B6BCE341153C3E6E0E3D0AD9293D36FF2F3DC251E7BCF6E17E3D4983B4BBCE10DA3D182A8B3C"

Replacing both timing_model and timing together lets you steer phoneme behavior:

The combined effect usually looks like this: Articulation won't be rebuilt into another voicebank's timbre or oral target (that's closer to what sing_model does), but it will feel like your voicebank's Articulation is re-timed and re-connected. You may hear tighter/looser Articulation, cleaner/smoother transitions, and noticeable duration changes for certain phonemes (e.g., m, n).

Within the rule boundaries provided by timing_model, voiced/unvoiced behavior or consonant events like cl/br may become easier to trigger or closer to the target's tendency. There is still a limit: even with a full swap, you typically move the trend toward the target rather than perfectly cloning its most extreme traits.

One-sentence summary: timing controls phoneme intensity/duration/connection, timing_model controls available phoneme behaviors and rules. Together they shape phoneme-level Articulation feel and consonant behavior.

This model set is also independent from the Base Model, so swapping is generally safe. Still, test carefully.

Q1: Can I use only one of timing or timing_model?

A: Yes. The effect may be smaller. Try and adjust it carefully.

Q2: Can I mix and match timing / timing_model / sing_model freely?

A: Yes. Their effects are often subtle and require careful control. If you switch all three together, the overall behavior tends to shift more consistently toward the chosen target.
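
When mixing and matching, it helps to know which of these fields actually differ between two voicebanks before swapping anything. This is a small sketch on parsed JSON; the function names and placeholder values are hypothetical.

```python
SWAPPABLE_FIELDS = ("sing_model", "timing_model", "timing")


def differing_fields(a: dict, b: dict, fields=SWAPPABLE_FIELDS) -> list[str]:
    """List the fields whose values differ between two voicebanks."""
    return [f for f in fields if a.get(f) != b.get(f)]


def apply_fields(source: dict, dest: dict, fields) -> dict:
    """Return a copy of dest with the chosen fields taken from source."""
    result = dict(dest)  # shallow copy is enough: the copied values are strings
    for field in fields:
        result[field] = source[field]
    return result


# Placeholder values stand in for real hashes and the 1024-HEX timing string.
target = {"sing_model": "sing-A", "timing_model": "tm-A", "timing": "TIMING-A"}
mine = {"sing_model": "sing-B", "timing_model": "tm-A", "timing": "TIMING-B"}

to_swap = differing_fields(target, mine)
print(to_swap)
swapped = apply_fields(target, mine, to_swap)
print(differing_fields(target, swapped))
```

Swapping only a subset (for example, just "timing") gives the smaller, partial shift described above.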


Phoneme Tables: Switching Languages

Flat Manager supports switching/editing/adding phoneme tables. Open the editor and click the phoneme table (the "A" icon with three dots).

Each language has its own phoneme table, and each entry contains:

  1. Phoneme name name: the identifier you type into the phoneme field.
  2. Phoneme type type: e.g., stop, vowel, fricative, nasal, liquid.
  3. System token token: the real token used in training/synthesis. If two phoneme names map to the same token, they are pronunciation-equivalent within the same language (cross-language equivalence is not guaranteed).

Flat lets you use any SynthV system token in any language. You can also rename phonemes or define new ones as you like. By editing phoneme tables (which apply to all voicebanks) and setting which languages a voicebank can use, you can make up new languages or write your own phoneme system. The system still relies on the original training tokens, so results vary.

Flat has edited the phoneme tables of Cantonese and Spanish for Language Extensions.

{ "name": "cantonese-xsampa-phones", "phonemes": [ {"name":"a", "type":"vowel", "token":"a"} ] }
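
Because two names that map to the same token are pronunciation-equivalent within a language, grouping a table by token shows which names are aliases of each other. The sketch below does this for a table in the format above; the rows are a few Cantonese entries taken from the appendix of this guide, and the function name is hypothetical.

```python
from collections import defaultdict


def aliases_by_token(table: dict) -> dict[str, list[str]]:
    """Map each token to its phoneme names, keeping only shared tokens."""
    groups = defaultdict(list)
    for entry in table["phonemes"]:
        groups[entry["token"]].append(entry["name"])
    return {token: names for token, names in groups.items() if len(names) > 1}


table = {
    "name": "cantonese-xsampa-phones",
    "phonemes": [
        {"name": "a", "type": "vowel", "token": "a"},
        {"name": "E", "type": "vowel", "token": "e"},
        {"name": "e", "type": "vowel", "token": "e"},
        {"name": "O", "type": "vowel", "token": "o"},
        {"name": "o", "type": "vowel", "token": "o"},
    ],
}
print(aliases_by_token(table))
```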

Synthesizer V Studio Flat Editor Overview

Voicebank Manager

The voicebank library includes timbre, Vocal Mode, and pitch data for currently known SV1 singers (including some SV2 compatibility libraries and SV2 PLUS libraries). You can open a voicebank or send embeddings to the Vocal Mode editor (also called the Mixer) for further processing.

Note: To reproduce timbre and Vocal Mode behavior accurately, using data from the same Base Model is recommended.

The style library supports grouping voicebanks by name / vendor / Base Model, and supports refresh, batch install, batch export, and search.


Voicebank Editor

  1. Open: load the NOFS-JSON (.nofs) voicebank you want to edit.
  2. Save: save the modified NOFS-JSON.
  3. Metadata editing: edit name, version, vendor, default language. SV loads the highest version by default.
  4. Timbre & Vocal Mode editing: paste 256-HEX strings. Right-click a HEX string to adjust magnitude, export an embedding to the Mixer, predict extra, and more.
  5. Reload: reload the NOFS-JSON.
  6. Reset: generate a random voicebank.
  7. Set translation: configure the translation file.
  8. Export .sfpk: save/export a voicebank to .sfpk.
  9. Install .sfpk: install a .sfpk file.
  10. New Random Style: add a random style at the end of styles.

Voicebank editor field template:

{ "name": "Voicebank display name", "version": "Version string", "vendor": "Vendor / publisher", "language": "Default / primary language", "phoneset": "Phoneme table (e.g., xsampa; can be inferred from language, not suggested)", "support_languages": [ "List of supported languages(not suggested)" ], "base_model": "Base Model hash (locate/verify model files)", "sing_model": "Singing assistant model hash", "timing_model": "Timing model hash", "f0_model": "F0 model hash", "styles": [ { "name": "Style label (e.g., (base)/(default))", "data": "Embedding vector (serialized 256-HEX string)" }, { "name": "Style label", "data": "Embedding vector (serialized 256-HEX string)", "extra": "Extra scalar correction" } ], "pitch": "Auto-Pitch embedding (serialized 256-HEX string)", "timing": "Phoneme timing embedding (serialized 1024-HEX string)", "note": "Additional properties are ok for notes" }

Note 1: You can add your own custom fields to the voicebank JSON (Flat Manager won't read them). This can be used as a "notes" area for recipes and management:

"Memory_1": "Notes: ...", "Memory_2": "Notes: ..."

Note 2: Flat Manager has a default fill mechanism: if critical fields (like base_model) are missing, it backfills them with defaults (e.g., 2c233d8e9b19f1f4dc0276ba3a5542c1). A voicebank with no (base) or (default) has them backfilled with 000... (256 zeros). The recommended practice is to never specify phoneset and support_languages manually and to let the editor autofill them. For a minimal example, see the NOFS-JSON of Refresh (you can view it in the Preview Editor).
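
To picture what that backfill amounts to, here is an illustrative imitation on a parsed JSON dict. It is not Flat Manager's code: it covers only the two defaults named in this note (the general base_model hash and all-zero (base)/(default) vectors), and the function name is hypothetical.

```python
import copy

DEFAULT_BASE_MODEL = "2c233d8e9b19f1f4dc0276ba3a5542c1"
ZERO_EMB = "0" * 256  # 256 zeros, i.e. an all-zero embedding


def backfill(bank: dict) -> dict:
    """Return a copy of bank with the defaults described in Note 2 filled in."""
    result = copy.deepcopy(bank)
    result.setdefault("base_model", DEFAULT_BASE_MODEL)
    styles = result.setdefault("styles", [])
    names = {style.get("name") for style in styles}
    # Put (base) first and (default) second when either one is missing.
    for position, name in enumerate(("(base)", "(default)")):
        if name not in names:
            styles.insert(position, {"name": name, "data": ZERO_EMB})
    return result


minimal = {"name": "Demo", "version": "100"}
filled = backfill(minimal)
print(filled["base_model"], [s["name"] for s in filled["styles"]])
```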


Mixer

The Mixer supports blending 256-HEX embeddings for timbre, Vocal Modes, and pitch.


Phoneme Table Editor

Switch/browse phoneme tables. See the Phoneme Tables section in Basics and Background Concepts.

How to Install Language Extensions: Language extensions are essentially user dictionaries, but they unlock hidden phonemes when used with Flat. Due to the limitations of the original SynthV R2, a user dictionary can only be installed under one default language, which means only voicebanks with that default language can use the dictionary. Therefore, when you install an extension package, the installer first prompts you to choose the default language for the extension. For example, if you want to use the Aver or Asterian voicebank with an extension for any language, select Japanese + English.

Next, the program prompts you to choose which languages to install the extension for. If you need extensions for Russian and French, select Russian + French. After installation, Flat allows voicebanks whose default language is Japanese or English to be extended with Russian and French phonemes.

Note: Having too many user dictionaries can slow down SynthV startup.

How to Use Language Extensions: After selecting a voicebank like Aver, and confirming that the extension for its default language (Japanese) is installed, go to the user dictionary tab in the SV sidebar. Select the dictionary, such as "ru(sp)_dict." Here, "ru" refers to the Russian extension, and "(sp)" indicates that the language must be switched to Spanish in the singer tab. In other words, the "ru" extension is used under the "sp" language setting.

Once this is done, you can enter Russian lyrics directly and have them sung.

Why Are Some Phonemes Silent / Auto-pitch Rendering Errors? Older base models are more likely to have missing phonemes. Flat only unlocks hidden phonemes and does not forcefully add new ones. The generation of auto-pitch is strongly correlated with phonemes, so if there are phoneme errors, auto-pitch will also encounter issues.


Advanced Exercises

This section focuses on special workflows and ways of thinking. It's not "better" than basic usage - just different. This guide is also limited by the author's experience and the time spent compiling it, so treat it as a starting point and validate by ear.

Base Model Thinking: Using Offset Models to Guide Voicebank Tuning

Recall the Base Model definition:

base_model is the foundational model. SV1 Pro likely trained on a large dataset, producing a Base Model that defines a feature space. By adjusting embedding vectors, singing can move continuously across different voice states.

In short, the Base Model is the lowest-level data source and stores a range of voice states.

So, changes in timbre and singing style are changes in embedding vectors inside the Base Model's high-dimensional feature space.

If you want to build or tune a voicebank, start with a suitable Base Model and then confirm the timbre parameters step by step: from (base) to (default) to (auxiliary). This is a layered tuning workflow: foundational timbre → default mode → specific modes.

When creating a voicebank via random search / voiceprint comparison / data mining, keep in mind:

During tuning:

Similarity and Compatibility: Thinking About Relationships Between Base Models and Voicebanks

If different Base Models are "supposed" to be incompatible, why do some cross-model transfers work?

Different-Base-Model mixing can work, but the outcome depends strongly on Base Model similarity, feature-space distance between (base) layers, and how the underlying data overlaps.

Many Base Models are fine-tuned from the general SV1 Base Model, which means there can be partial compatibility. However, feature-space distance is difficult to estimate directly; Base Models trained far apart in time or on very different data can be much harder to transfer between.

An observed workflow: if a cross-model Vocal Mode works well, you can sometimes use a voicebank's (base) as an adapter so that another voicebank on the same Base Model can inherit that cross-model mode more smoothly. This is a practical form of "similarity and compatibility".

Summary: whether transfer works is mostly correlated with Base Model and (base) similarity. "Cross Base Model" does not automatically mean "impossible".

Magnitude and Influence: Is "More" Always Stronger?

In many cases, embedding magnitude affects a Vocal Mode's effective strength. But SV seems to enforce internal limits on some voicebanks: even if you increase the displayed percentage, the effect may saturate.

Large magnitudes can also cause clipping, no audible change, or extreme loudness.

In practice, influence isn't only magnitude—it also depends on distribution. If you cluster all Vocal Modes within a voicebank (high-dimensional analysis), outliers often sound distinctive (either very strong or very subtle), and the relationship with magnitude is not absolute.

When mixing, avoid blending strongly opposing "strong vs. weak" modes unless you're specifically exploring special effects. That kind of mixing can cancel amplitude in feature space and make the result bland.
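
One way to spot such opposing pairs before mixing is to compare their directions with cosine similarity: values near -1 mean the two vectors point in nearly opposite directions, so an even blend will mostly cancel. The sketch below uses short 4-value toy vectors purely for readability (real embeddings have 32 values), and the helper names are hypothetical.

```python
import math


def norm(values: list[float]) -> float:
    return math.sqrt(sum(v * v for v in values))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    denom = norm(a) * norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / denom


# Toy vectors that point in nearly opposite directions.
mode_a = [1.0, 0.5, -0.25, 0.0]
mode_b = [-0.9, -0.6, 0.2, 0.1]

similarity = cosine_similarity(mode_a, mode_b)
half_mix = [0.5 * x + 0.5 * y for x, y in zip(mode_a, mode_b)]
print(round(similarity, 3), round(norm(mode_a), 3), round(norm(half_mix), 3))
```

Here the 50/50 mix ends up far shorter than either input, which is the cancellation effect described above.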

Inverting Vocal Modes: Getting "Opposite" Effects

Based on a large amount of analysis, many Vocal Modes (notably on Base Models such as 6e5da191faa421a20b529b40c3aa4968) can produce a roughly "opposite" effect via simple inversion.
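
In terms of the stored data, inversion simply negates each of the 32 floats. A minimal sketch, assuming the HEX strings are little-endian fp32; the function name is hypothetical. Note that negating 0.0 yields -0.0, which encodes as 00000080, so groups of 00000000 and 00000080 both represent zero.

```python
import struct


def invert_hex(hex_str: str) -> str:
    """Negate every fp32 value in a HEX embedding (little-endian assumed)."""
    raw = bytes.fromhex(hex_str)
    count = len(raw) // 4
    values = struct.unpack(f"<{count}f", raw)
    return struct.pack(f"<{count}f", *(-v for v in values)).hex().upper()


demo_hex = "0000803F" * 32       # demo vector: 32 values of 1.0
inverted = invert_hex(demo_hex)  # 32 values of -1.0
print(inverted[:8], invert_hex(inverted) == demo_hex)
```

This matches setting a negative magnitude of the same size in the editor, since a negative value inverts the vector direction.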

Some practical ideas:

Takeaway: Where Are the Limits—and What Comes Next?

In the editor, a voicebank doesn't need to be a fixed timbre. You can push one voicebank into different singing states by adjusting oral target placement, Resonance / phonation behavior, Articulation tightness, phoneme transitions, breathiness, and more.

With that level of control, "similarity" becomes more than a subjective judgment. It can be turned into a workflow: align toward a target mode, infer which parameters to move, and iterate with structure. The long-term boundary may not be "timbre cloning", but rather a stable, reproducible space of singing strategies—something closer to a creative and pedagogical vocal design system, where higher freedom does not sacrifice stability.


Special Recipes

Here are some examples to get you started.

Recipe 1: Fine-tune mouth shape and Articulation using the Muxin v200 model

base: 30% shuo + 70% pastel
default: 30% shuo + 70% pastel
singmodel: Muxin
F0: Muxin
timing: Muxin
35% mild
50% shuo (powerful)

Recipe 2: Use "Minus (default)" as an alternative for whisper VM

Minus (default)

to be continued ...

Appendix

Columns (space-separated, one phoneme per row): Phoneme Name, Pronunciation Type, System Token, Language, Phoneme System
aa vowel a english arpabet
ae vowel ARP_ae english arpabet
ah vowel A english arpabet
ao vowel ARP_ao english arpabet
aw diphthong AU english arpabet
ax vowel ARP_ax english arpabet
ay diphthong ARP_ay english arpabet
b stop p english arpabet
ch affricate ts`h english arpabet
d stop t english arpabet
dx stop ARP_dx english arpabet
dr affricate ARP_dr english arpabet
dw affricate ARP_dw english arpabet
dh fricative ARP_dh english arpabet
eh vowel e english arpabet
er vowel ARP_er english arpabet
ey diphthong ARP_ey english arpabet
f fricative f english arpabet
g stop k english arpabet
hh aspirate x english arpabet
ih vowel ARP_ih english arpabet
iy vowel i english arpabet
jh affricate ts` english arpabet
k stop kh english arpabet
l liquid l english arpabet
m nasal m english arpabet
n nasal n english arpabet
ng nasal N english arpabet
ow diphthong @U english arpabet
oy diphthong ARP_oy english arpabet
p stop ph english arpabet
q stop ARP_q english arpabet
r semivowel z` english arpabet
s fricative s english arpabet
sh fricative s` english arpabet
t stop th english arpabet
tr affricate ARP_tr english arpabet
tw affricate ARP_tw english arpabet
th fricative ARP_th english arpabet
uh vowel u english arpabet
uw vowel y english arpabet
v fricative ARP_v english arpabet
w semivowel w english arpabet
y semivowel j english arpabet
z fricative ts english arpabet
zh fricative ARP_zh english arpabet
pau silence pau english arpabet
sil silence sil english arpabet
cl stop cl english arpabet
br breath br english arpabet
a vowel a japanese romaji
i vowel i japanese romaji
u vowel u japanese romaji
e vowel e japanese romaji
o vowel o japanese romaji
N vowel N japanese romaji
cl stop cl japanese romaji
t stop th japanese romaji
d stop t japanese romaji
s fricative s japanese romaji
sh fricative s\ japanese romaji
j affricate ts\ japanese romaji
z affricate ts japanese romaji
ts affricate tsh japanese romaji
k stop kh japanese romaji
kw stop ROM_kw japanese romaji
g stop k japanese romaji
gw stop ROM_gw japanese romaji
h aspirate x japanese romaji
b stop p japanese romaji
p stop ph japanese romaji
f fricative f japanese romaji
ch affricate ts\h japanese romaji
ry liquid ROM_ry japanese romaji
ky stop ROM_ky japanese romaji
py stop ROM_py japanese romaji
dy stop ROM_dy japanese romaji
ty stop ROM_ty japanese romaji
ny nasal ROM_ny japanese romaji
hy aspirate ROM_hy japanese romaji
my nasal ROM_my japanese romaji
gy stop ROM_gy japanese romaji
by stop ROM_by japanese romaji
n nasal n japanese romaji
m nasal m japanese romaji
r liquid l japanese romaji
w semivowel w japanese romaji
v semivowel ARP_v japanese romaji
y semivowel j japanese romaji
pau silence pau japanese romaji
sil silence sil japanese romaji
br breath br japanese romaji
a vowel a mandarin xsampa
A vowel A mandarin xsampa
o vowel o mandarin xsampa
@ vowel @ mandarin xsampa
e vowel e mandarin xsampa
7 vowel 7 mandarin xsampa
U vowel U mandarin xsampa
u vowel u mandarin xsampa
i vowel i mandarin xsampa
i\ vowel i\ mandarin xsampa
i` vowel i` mandarin xsampa
y vowel y mandarin xsampa
AU diphthong AU mandarin xsampa
@U diphthong @U mandarin xsampa
ia diphthong ia mandarin xsampa
iA diphthong iA mandarin xsampa
iAU diphthong iAU mandarin xsampa
ie diphthong ie mandarin xsampa
iE diphthong iE mandarin xsampa
iU diphthong iU mandarin xsampa
i@U diphthong i@U mandarin xsampa
y{ diphthong y{ mandarin xsampa
yE diphthong yE mandarin xsampa
ua diphthong ua mandarin xsampa
uA diphthong uA mandarin xsampa
u@ diphthong u@ mandarin xsampa
ue diphthong ue mandarin xsampa
uo diphthong uo mandarin xsampa
:\i coda :\i mandarin xsampa
r\` coda r\` mandarin xsampa
:n coda :n mandarin xsampa
N coda N mandarin xsampa
p stop p mandarin xsampa
ph stop ph mandarin xsampa
t stop t mandarin xsampa
th stop th mandarin xsampa
k stop k mandarin xsampa
kh stop kh mandarin xsampa
ts\ affricate ts\ mandarin xsampa
ts affricate ts mandarin xsampa
tsh affricate tsh mandarin xsampa
ts` affricate ts` mandarin xsampa
ts`h affricate ts`h mandarin xsampa
x aspirate x mandarin xsampa
f fricative f mandarin xsampa
s fricative s mandarin xsampa
s` fricative s` mandarin xsampa
ts\h fricative ts\h mandarin xsampa
s\ fricative s\ mandarin xsampa
m nasal m mandarin xsampa
n nasal n mandarin xsampa
l liquid l mandarin xsampa
z` semivowel z` mandarin xsampa
w semivowel w mandarin xsampa
j semivowel j mandarin xsampa
pau silence pau mandarin xsampa
sil silence sil mandarin xsampa
cl stop cl mandarin xsampa
br breath br mandarin xsampa
ts affricate ts cantonese xsampa
tsh affricate tsh cantonese xsampa
f fricative f cantonese xsampa
h fricative x cantonese xsampa
s fricative s cantonese xsampa
l liquid l cantonese xsampa
m nasal m cantonese xsampa
n nasal n cantonese xsampa
N nasal YUE_N cantonese xsampa
w semivowel w cantonese xsampa
j semivowel j cantonese xsampa
p stop p cantonese xsampa
ph stop ph cantonese xsampa
t stop t cantonese xsampa
th stop th cantonese xsampa
k stop k cantonese xsampa
kh stop kh cantonese xsampa
kw stop ROM_gw cantonese xsampa
kwh stop ROM_kw cantonese xsampa
a vowel a cantonese xsampa
6 vowel 6 cantonese xsampa
E vowel e cantonese xsampa
e vowel e cantonese xsampa
i vowel i cantonese xsampa
I vowel ARP_ih cantonese xsampa
O vowel o cantonese xsampa
o vowel o cantonese xsampa
u vowel u cantonese xsampa
U vowel U cantonese xsampa
9 vowel 9 cantonese xsampa
8 vowel 8 cantonese xsampa
y vowel y cantonese xsampa
m= vowel m cantonese xsampa
N= vowel N cantonese xsampa
:i coda :\i cantonese xsampa
:u coda :u cantonese xsampa
:m coda :m cantonese xsampa
:n coda :n cantonese xsampa
:N coda N cantonese xsampa
:p_} coda :p_} cantonese xsampa
:t_} coda :t_} cantonese xsampa
:k_} coda :k_} cantonese xsampa
pau silence pau cantonese xsampa
sil silence sil cantonese xsampa
cl stop cl cantonese xsampa
br breath br cantonese xsampa
a vowel a spanish xsampa
e vowel e spanish xsampa
i vowel i spanish xsampa
o vowel o spanish xsampa
u vowel u spanish xsampa
U semivowel w spanish xsampa
I semivowel ES_I spanish xsampa
y semivowel j spanish xsampa
ll semivowel ll spanish xsampa
b stop b spanish xsampa
B stop B spanish xsampa
d stop d spanish xsampa
D stop D spanish xsampa
g stop g spanish xsampa
k stop k spanish xsampa
p stop p spanish xsampa
t stop t spanish xsampa
l liquid l spanish xsampa
rr trill rr spanish xsampa
r liquid r spanish xsampa
m nasal m spanish xsampa
n nasal n spanish xsampa
N nasal N spanish xsampa
J nasal ROM_ny spanish xsampa
f fricative f spanish xsampa
s fricative s spanish xsampa
C fricative ARP_th spanish xsampa
sh fricative s` spanish xsampa
ch affricate ts`h spanish xsampa
x fricative x spanish xsampa
pau silence pau spanish xsampa
sil silence sil spanish xsampa
cl stop cl spanish xsampa
br breath br spanish xsampa
4 liquid l korean xsampa
6 vowel 6 korean xsampa
b stop b korean xsampa
d stop d korean xsampa
dz\ affricate ts\ korean xsampa
e_o vowel e korean xsampa
g stop g korean xsampa
h fricative x korean xsampa
i vowel i korean xsampa
j semivowel j korean xsampa
ts\h affricate ts\h korean xsampa
k stop k korean xsampa
k_t stop k_t korean xsampa
l liquid l korean xsampa
M nasal U korean xsampa
m nasal m korean xsampa
n nasal n korean xsampa
N coda N korean xsampa
o vowel o korean xsampa
p stop p korean xsampa
p_t stop p_t korean xsampa
s fricative s korean xsampa
s_t fricative s_t korean xsampa
t stop t korean xsampa
t_t stop t_t korean xsampa
ts\_h affricate ts`h korean xsampa
u vowel u korean xsampa
V vowel A korean xsampa
w semivowel w korean xsampa
pau silence pau korean xsampa
sil silence sil korean xsampa
cl stop cl korean xsampa
br breath br korean xsampa