The model doesn't seem to learn the vocalist's characteristics well

AI Voice Channel

A channel for sharing information and discussion about deep-learning speech synthesis such as TTS, VITS, and SVC.

삑그리고다음

Upvotes 0 · Downvotes 0 · Comments 13 · Views 1132 · Posted 2023-07-19 08:07:11

https://arca.live/b/aispeech/81449478

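I'm currently training on about 1 hour 40 minutes of vocal data.
Either I got the settings wrong or the dataset is too small, because the singer's characteristics come through worse than I expected when belting high notes.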

The repos I've tried so far are RVC beta v2 and Diffusion SVC 1.0.

RVC sings well enough, but the resemblance is low, to the point where I can't tell whether I'm hearing the trained voice or the original song's singer.

https://arca.live/b/aispeech/80694874

I followed the settings in the post linked above exactly.

Since the target is a singer, I ran about 40 epochs, and the log showed:

loss_disc=3.356, loss_gen=3.863, loss_fm=15.002, loss_mel=23.874, loss_kl=1.319

data:
  f0_extractor: "crepe" # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: "contentvec768l12" # 'hubertsoft', 'hubertbase', 'hubertbase768', 'contentvec', 'contentvec768' or 'contentvec768l12' or 'cnhubertsoftfish'
  cnhubertsoft_gate: 10 # only use with cnhubertsoftfish
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 768 # 256 if using 'hubertsoft'
  encoder_ckpt: pretrain/contentvec/checkpoint_best_legacy_500.pt
  units_forced_mode: "nearest" # Recommended 'nearest',experiment 'rfa512to441' and 'rfa441to512' ; 'left'  only use for compatible with history code
  volume_noise: 0 # if not 0 ,add noise for volume in train ;;;;EXPERIMENTAL FUNCTION, NOT RECOMMENDED FOR USE;;;;
  train_path: data/train # Create a folder named "audio" under this path and put the audio clip in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clip in it
  extensions: # List of extension included in the data collection
    - wav
model:
  type: "Diffusion"
  n_layers: 20
  n_chans: 512
  n_hidden: 256
  use_pitch_aug: true
  n_spk: 1 # max number of different speakers
device: cuda
vocoder:
  type: "nsf-hifigan"
  ckpt: "pretrain/nsf_hifigan/model"
infer:
  speedup: 10
  method: "dpm-solver" # 'ddim', 'pndm', 'dpm-solver' or 'unipc'
env:
  expdir: exp/diffusion-test
  gpu_id: 0
train:
  num_workers: 0 # If your cpu and gpu are both very strong, set to 0 may be faster!
  amp_dtype: fp32 # fp32, fp16 or bf16 (fp16 or bf16 may be faster if it is supported by your gpu)
  batch_size: 64
  cache_all_data: true # Save Internal-Memory or Graphics-Memory if it is false, but may be slow
  cache_device: "cpu" # Set to 'cuda' to cache the data into the Graphics-Memory, fastest speed for strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  interval_force_save: 10000
  lr: 0.0002
  decay_step: 100000
  gamma: 0.5
  weight_decay: 0
  save_opt: false

That's the YAML I used for Diffusion SVC.
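A few values in that config can be sanity-checked with plain arithmetic. The sketch below is not part of Diffusion SVC, and the helper names are made up for illustration. It converts `f0_min` and `f0_max` to the nearest note names (to check the "about C2" and "about G5" comments) and derives the frame length implied by `block_size` and `sampling_rate`. The frame count for a 2-second crop is only an estimate; how the trainer actually slices clips isn't stated in this thread.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_note(freq_hz: float) -> str:
    """Nearest equal-tempered note name, assuming A4 = 440 Hz."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

def frame_ms(block_size: int, sampling_rate: int) -> float:
    """Length of one hop (block) in milliseconds."""
    return 1000.0 * block_size / sampling_rate

def frames_per_clip(duration_s: float, block_size: int, sampling_rate: int) -> int:
    """Whole hops that fit into a crop of the given duration (estimate)."""
    return int(duration_s * sampling_rate) // block_size

# Values copied from the config in the post.
print(hz_to_note(65))                      # C2
print(hz_to_note(800))                     # G5
print(round(frame_ms(512, 44100), 2))      # 11.61
print(frames_per_clip(2, 512, 44100))      # 172
```

Whether 800 Hz actually covers the singer's highest notes depends on the material, which the thread doesn't say; it may be worth comparing against the dataset's real pitch range.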

Neither run is multi-speaker; I trained on this one voice alone,

since I figured that would preserve the original feel better.

I haven't reached 100,000 epochs yet, so I'm attaching the current loss graph along with the CSV and JSON.

https://files.catbox.moe/xnitu4.csv

https://files.catbox.moe/mbwgi8.json

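That said, personally I feel Diffusion SVC might capture the characteristics better.

I paused training partway through and rendered output with method set to dpm-solver and then pndm, one at a time,

and dpm-solver seemed to imitate the voice best.

I couldn't find anyone explaining the difference between those two methods.

Anyway, I'm not sure what I did wrong.

Should I switch repos...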

삑그리고다음

2023-07-19 08:08:08

The config I used for Diffusion SVC rendered a bit oddly in the post, so I'm posting it here as well.

data:
  f0_extractor: 'crepe' # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'contentvec768l12' # 'hubertsoft', 'hubertbase', 'hubertbase768', 'contentvec', 'contentvec768' or 'contentvec768l12' or 'cnhubertsoftfish'
  cnhubertsoft_gate: 10 # only use with cnhubertsoftfish
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 768 # 256 if using 'hubertsoft'
  encoder_ckpt: pretrain/contentvec/checkpoint_best_legacy_500.pt
  units_forced_mode: 'nearest' # Recommended 'nearest',experiment 'rfa512to441' and 'rfa441to512' ; 'left'  only use for compatible with history code
  volume_noise: 0 # if not 0 ,add noise for volume in train ;;;;EXPERIMENTAL FUNCTION, NOT RECOMMENDED FOR USE;;;;
  train_path: data/train # Create a folder named "audio" under this path and put the audio clip in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clip in it
  extensions: # List of extension included in the data collection
    - wav
model:
  type: 'Diffusion'
  n_layers: 20
  n_chans: 512
  n_hidden: 256
  use_pitch_aug: true  
  n_spk: 1 # max number of different speakers
device: cuda
vocoder:
  type: 'nsf-hifigan'
  ckpt: 'pretrain/nsf_hifigan/model'
infer:
  speedup: 10
  method: 'dpm-solver' # 'ddim', 'pndm', 'dpm-solver' or 'unipc'
env:
  expdir: exp/diffusion-test
  gpu_id: 0
train:
  num_workers: 0 # If your cpu and gpu are both very strong, set to 0 may be faster!
  amp_dtype: fp32 # fp32, fp16 or bf16 (fp16 or bf16 may be faster if it is supported by your gpu)
  batch_size: 64
  cache_all_data: true # Save Internal-Memory or Graphics-Memory if it is false, but may be slow
  cache_device: 'cpu' # Set to 'cuda' to cache the data into the Graphics-Memory, fastest speed for strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  interval_force_save: 10000
  lr: 0.0002
  decay_step: 100000
  gamma: 0.5
  weight_decay: 0
  save_opt: false

삑그리고다음

2023-07-19 08:13:03

Isn't diff unmaintained now?
That's why I thought Diffusion SVC, being newer, would be the better choice.

cocoi

2023-07-19 08:20:45

*edited

Try changing to f0_extractor: "harvest" and training again.
Also, did you start training from a pretrained model?
The lr (learning rate) value looks a bit high too.
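For context on the lr remark: the posted config sets `lr: 0.0002`, `decay_step: 100000` and `gamma: 0.5`. Assuming the trainer applies a plain step decay (multiply the rate by `gamma` every `decay_step` steps, the way PyTorch's `StepLR` behaves; the thread doesn't confirm how Diffusion SVC uses these keys), the effective rate would look roughly like this. `stepped_lr` is a hypothetical helper, not a function from the repo.

```python
def stepped_lr(step: int, base_lr: float = 0.0002,
               decay_step: int = 100_000, gamma: float = 0.5) -> float:
    """Learning rate after `step` updates under an assumed step-decay schedule."""
    return base_lr * gamma ** (step // decay_step)

# Defaults are the values from the posted config.
for step in (0, 99_999, 100_000, 250_000):
    print(step, stepped_lr(step))
```

Under that assumption the rate stays at 0.0002 for the whole first 100,000 steps, so a lower rate early in training would have to come from lowering `lr` itself.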

삑그리고다음

2023-07-19 08:35:25

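I didn't really understand what role a pretrained model plays, so I didn't use one.
I trained from scratch with no pretrained weights.

I originally considered harvest, but while searching the channel I saw people saying it has to be crepe no matter what, so I just went with crepe.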

cocoi

2023-07-19 08:36:28

Use a pretrained model to try several options quickly, then go with whichever works best.

삑그리고다음

2023-07-19 08:38:00

ㅇㅇ

2023-07-19 08:21:45

*edited

There's no fixed number of epochs; it depends entirely on the dataset, lr, and batch size.. Who said singers need 40? That makes no sense...
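One way to see why a fixed epoch count doesn't carry over between setups: the number of optimizer updates in an epoch depends on how many clips the dataset contains and on the batch size. A rough sketch follows; the clip counts are made-up numbers, not figures from this thread, and it assumes the last partial batch is kept.

```python
import math

def total_steps(n_clips: int, batch_size: int, epochs: int) -> int:
    """Optimizer updates performed over `epochs` passes through the data."""
    return epochs * math.ceil(n_clips / batch_size)

# The same "40 epochs" means very different amounts of training.
print(total_steps(500, 4, 40))    # 5000
print(total_steps(3000, 4, 40))   # 30000
print(total_steps(3000, 64, 40))  # 1880
```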

ㅇㅇ

2023-07-19 08:23:01

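You have to watch the TensorBoard log values and train the right amount.. If it lacks the feel of the original voice, it's most likely undertrained.. and if it sounds robotic, it's overtrained.. Checking in TensorBoard is the most accurate way.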

좀ㄱ

2023-07-19 08:25:55

Just curious: roughly what value of loss/g/total means training went well? For me, results usually sound natural once it's around 38.

cocoi

2023-07-19 08:30:59

It differs from model to model. Usually the checkpoint where that value is lowest is treated as the best model.

ㅇㅇ

2023-07-19 08:34:39

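*edited

This is based on training with harvest.
When analyzing the graph of a given metric during training, the point where the curve hits its minimum after oscillating up and down is treated as the Best Step. If the curve then stays flat, you can judge that the model has entered overfitting from that point on. Set the graph smoothing coefficient to 0.999 when looking at it.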
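The "smooth, then take the minimum" rule above can be sketched outside TensorBoard as well. The code below uses a bias-corrected exponential moving average, which is similar to what TensorBoard's smoothing slider does; treat it as an approximation rather than TensorBoard's exact code. `ema_smooth` and `best_step` are hypothetical helper names, and the inputs are plain lists of steps and loss values.

```python
def ema_smooth(values, weight=0.999):
    """Bias-corrected exponential moving average of a scalar series."""
    smoothed = []
    last = 0.0
    for i, v in enumerate(values, start=1):
        last = last * weight + (1.0 - weight) * v
        # Divide by (1 - weight^i) so early points aren't pulled toward 0.
        smoothed.append(last / (1.0 - weight ** i))
    return smoothed

def best_step(steps, values, weight=0.999):
    """Return (step, smoothed_value) where the smoothed curve is lowest."""
    smoothed = ema_smooth(values, weight)
    idx = min(range(len(smoothed)), key=smoothed.__getitem__)
    return steps[idx], smoothed[idx]

# Tiny example with smoothing turned off, so the raw minimum is picked.
print(best_step([10, 20, 30], [3.0, 1.0, 2.0], weight=0.0))  # (20, 1.0)
```

With a weight as high as 0.999 the smoothed curve lags well behind the raw values, so the step it picks is a guide to which checkpoints to listen to rather than an exact answer.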

좀ㄱ

2023-07-19 08:35:24

Got it, thanks, I'll keep that in mind.

삑그리고다음

2023-07-19 08:37:34

*edited

I never knew about the smoothing setting until now.
The graph is so much easier to read.
For now the loss is still going down.
