Paper Notes 3 - tokumei_meerkat's Blog

End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning
Similar idea to "VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking". The model is pre-trained on single-speaker data.

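Self-supervised Attention Model for Weakly Labeled Audio Event Classification
For the audio event classification task: when labels with precise timing information (strong labels) are available, a loss term is added so that the attention on the relevant class at the relevant time becomes higher. The model can also be trained in a self-supervised way when no strong labels are available. (In TTS there are also methods that define a loss on the attention itself, such as [1], although the underlying assumptions differ.)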
[1] Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
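
As a rough illustration of "adding a loss so that attention on the labeled frames becomes higher", here is a minimal sketch. It is not the loss from the paper: the function name `attention_supervision_loss`, the binary `strong_mask`, and the negative-log form are all assumptions made for this example.

```python
import math

def attention_supervision_loss(attention, strong_mask, eps=1e-8):
    """Negative log of the attention mass that falls on labeled frames.

    attention:   per-frame attention weights for one class (assumed to sum to 1).
    strong_mask: 1 for frames where the event is labeled active, else 0.
    Both the name and the loss form are hypothetical, for illustration only.
    """
    if len(attention) != len(strong_mask):
        raise ValueError("attention and strong_mask must have equal length")
    mass_on_event = sum(a * m for a, m in zip(attention, strong_mask))
    return -math.log(mass_on_event + eps)

# Attention concentrated on the labeled frames gives a smaller loss.
mask = [0, 1, 1, 0]
focused = attention_supervision_loss([0.05, 0.45, 0.45, 0.05], mask)
diffuse = attention_supervision_loss([0.25, 0.25, 0.25, 0.25], mask)
```

In practice such a term would be added, with some weight, to the usual classification loss.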

Forward-Backward Decoding for Regularizing End-to-End TTS
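In end-to-end TTS, the decoder decodes left-to-right (L2R), so exposure bias becomes a problem. To address this, a decoder that steps right-to-left (R2L) is also trained, and a new loss is added so that the mel spectrograms (or hidden states) generated by the two decoders agree. When the mel spectrograms are the target, this can also be viewed as data augmentation.
ASR research with the same motivation exists as well; one example is [2].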
[2] Forward-Backward Attention Decoder
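
The agreement term between the two decoders could be sketched as follows. This assumes both decoders emit the same number of frames and uses a plain mean squared error after reversing the R2L output in time. The function name `agreement_loss` and the choice of MSE are assumptions for this example, not details taken from the paper.

```python
def agreement_loss(l2r_frames, r2l_frames):
    """Mean squared difference between L2R output and time-reversed R2L output.

    Each argument is a list of frames; each frame is a list of floats
    (e.g. mel bins or hidden units). Hypothetical helper for illustration.
    """
    if len(l2r_frames) != len(r2l_frames) or not l2r_frames:
        raise ValueError("both outputs must be non-empty and equally long")
    # The R2L decoder produces the last frame first, so flip it in time.
    aligned = list(reversed(r2l_frames))
    total = 0.0
    count = 0
    for fwd, bwd in zip(l2r_frames, aligned):
        for x, y in zip(fwd, bwd):
            total += (x - y) ** 2
            count += 1
    return total / count

l2r = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
r2l_same = [[4.0, 5.0], [2.0, 3.0], [0.0, 1.0]]  # same content, reversed order
r2l_off = [[4.0, 5.0], [2.0, 3.0], [0.0, 3.0]]   # one value differs by 2
loss_same = agreement_loss(l2r, r2l_same)
loss_off = agreement_loss(l2r, r2l_off)
```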

Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation
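For end-to-end speech translation, instead of feeding frame-level acoustic features directly, the MT model is trained on phoneme-level information, specifically alignments obtained from a DNN-HMM, as input.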
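
One way to turn a frame-level alignment into a phoneme-level input is to merge consecutive frames that share a phoneme label. The sketch below does this by averaging the frames in each segment. The averaging step and the name `collapse_to_phonemes` are assumptions for illustration; the note above only says that the DNN-HMM alignment is used.

```python
def collapse_to_phonemes(frames, frame_labels):
    """Average consecutive frames that share the same aligned phoneme label.

    frames:       list of feature vectors (lists of floats), one per frame.
    frame_labels: phoneme label per frame, e.g. from a forced alignment.
    Returns a list of (label, mean_feature_vector), one entry per segment.
    """
    if len(frames) != len(frame_labels):
        raise ValueError("frames and frame_labels must have equal length")
    segments = []
    start = 0
    for i in range(1, len(frames) + 1):
        # Close the current segment at the end or when the label changes.
        if i == len(frames) or frame_labels[i] != frame_labels[start]:
            chunk = frames[start:i]
            mean = [sum(col) / len(chunk) for col in zip(*chunk)]
            segments.append((frame_labels[start], mean))
            start = i
    return segments

feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
labels = ["a", "a", "b", "a"]
segments = collapse_to_phonemes(feats, labels)
```

The resulting sequence is much shorter than the frame sequence, which is the point of moving to the phoneme level.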

Very Deep Self-Attention Networks for End-to-End Speech Recognition
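Motivation: applying the transformer to speech recognition as-is gives only limited accuracy gains, so further modifications are made. Proposes a stochastic layer that applies a mask (in the same way as dropout) to the non-skip branch of the residual layer. Very deep: 48 transformer layers in total.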
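
A minimal sketch of a residual connection whose non-skip branch is randomly masked. Two things here are assumptions, not details from the note: the mask drops the whole branch at once (as in stochastic depth), and the surviving branch is rescaled by `1 / keep_prob` as in inverted dropout. `stochastic_residual` and `sublayer` are hypothetical names.

```python
import random

def stochastic_residual(x, sublayer, keep_prob, training, rng=random):
    """Compute x + mask * sublayer(x), where mask is sampled during training.

    x:         list of floats (one hidden vector).
    sublayer:  function mapping a vector to a vector of the same length.
    keep_prob: probability of keeping the non-skip branch while training.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if training and rng.random() >= keep_prob:
        # Branch masked out: only the skip path survives.
        return list(x)
    branch = sublayer(x)
    # Rescale while training so the expected output matches inference.
    scale = 1.0 / keep_prob if training else 1.0
    return [xi + scale * bi for xi, bi in zip(x, branch)]

def double(v):
    return [2.0 * a for a in v]

eval_out = stochastic_residual([1.0, 2.0], double, keep_prob=0.5, training=False)
train_out = stochastic_residual([1.0, 2.0], double, keep_prob=0.5, training=True)
```

During training the output is either the skip path alone or the skip path plus the rescaled branch; at inference the branch is always applied.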

Almost Unsupervised Text to Speech and Automatic Speech Recognition
Focuses on the duality between ASR and TTS to train both models with as little paired data as possible. Single speaker, 200 paired samples on LJSpeech. Cf. speech chain, cycle consistency.

Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping
An end-to-end model using a self-attention network (SAN) for online recognition. 1) The LM is trained with a mask that accounts for blanks and causality. 2) Processing is done chunk by chunk (chunk-hopping mechanism).
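
The two ingredients above can be sketched in a few lines. The chunk layout (a central part plus left and right context, hopping by the central length) and the helper names `chunk_hop` and `causal_mask` are assumptions for this example. The blank handling in the LM mask is omitted; only the causal part is shown.

```python
def chunk_hop(num_frames, center, left, right):
    """Split a sequence into overlapping chunks that hop by `center` frames.

    Returns (start, end, center_start, center_end) index tuples. The model
    would see frames [start, end) but only the outputs for
    [center_start, center_end) would be kept.
    """
    if center <= 0 or left < 0 or right < 0:
        raise ValueError("center must be positive; contexts must be >= 0")
    chunks = []
    for c_start in range(0, num_frames, center):
        c_end = min(c_start + center, num_frames)
        start = max(0, c_start - left)
        end = min(num_frames, c_end + right)
        chunks.append((start, end, c_start, c_end))
    return chunks

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

chunks = chunk_hop(num_frames=10, center=4, left=2, right=2)
mask = causal_mask(3)
```

Latency is then bounded by the chunk length plus the right context, rather than by the full utterance.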

Towards Language-Universal End-to-End Speech Recognition
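Multilingual (3 languages) end-to-end ASR. Output units may be shared across languages. Because the pronunciation of an output unit differs between languages, language information is used to gate the hidden vectors. Output units unrelated to the target language are zero-masked.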
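
A minimal sketch of the two mechanisms, under stated assumptions: the masking is applied inside the softmax so that disallowed units get probability exactly 0, and the gate is an element-wise sigmoid. The names `masked_softmax` and `gate_hidden` are hypothetical, and how the gate values are computed from language information is not modeled here.

```python
import math

def masked_softmax(logits, allowed):
    """Softmax in which units not used by the target language get probability 0."""
    if len(logits) != len(allowed) or not any(allowed):
        raise ValueError("need equal lengths and at least one allowed unit")
    # Subtract the max over allowed units for numerical stability.
    peak = max(l for l, a in zip(logits, allowed) if a)
    exps = [math.exp(l - peak) if a else 0.0 for l, a in zip(logits, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def gate_hidden(hidden, gate_logits):
    """Scale each hidden unit by sigmoid(gate_logit), a value in (0, 1)."""
    return [h / (1.0 + math.exp(-g)) for h, g in zip(hidden, gate_logits)]

# Unit 1 does not belong to the target language, so it is masked out.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
gated = gate_hidden([2.0, 4.0], [0.0, 0.0])
```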