Paper notes - tokumei_meerkat's blog

Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019
An overall review of the 2019 DCASE challenge.
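Intra-Utterance Similarity Preserving Knowledge Distillation for Audio Tagging
An audio tagging task. Similarity-preserving KD measures the similarity of hidden vectors between samples. This paper extends it to intra-utterance similarity preserving KD, which computes the similarity between frames within a single sample. The frame-to-frame similarity is computed for both the teacher model and the student model, and the MSE loss between the two is minimized.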
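To make the frame-level matching concrete, here is a minimal NumPy sketch. It assumes cosine similarity between frames and a plain mean over the squared differences. The note does not say which similarity or normalization the paper uses, so the function names and these choices are illustrative only.

```python
import numpy as np

def frame_similarity(hidden):
    """Cosine similarity between all pairs of frames.

    hidden: array of shape (T, D), one hidden vector per frame.
    Returns an array of shape (T, T).
    """
    norm = np.linalg.norm(hidden, axis=1, keepdims=True)
    unit = hidden / np.maximum(norm, 1e-12)  # avoid division by zero
    return unit @ unit.T

def intra_utterance_kd_loss(teacher_hidden, student_hidden):
    """MSE between teacher and student frame-similarity matrices.

    The two inputs must share the number of frames T, but the hidden
    dimension D may differ, since only (T, T) matrices are compared.
    """
    g_teacher = frame_similarity(teacher_hidden)
    g_student = frame_similarity(student_hidden)
    return float(np.mean((g_teacher - g_student) ** 2))
```

Because only the (T, T) similarity matrices are compared, the student can use a smaller hidden size than the teacher without any projection layer.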
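Feature space Augmentation for Long-Tailed Data
From the image domain: data augmentation in feature space for imbalanced data. CAM is used to split the features into the parts that form the basis of the decision and the parts that do not, and the latter are modified.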
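A rough sketch of the CAM-based split follows. The CAM itself is the standard weighted sum of feature channels. The min-max threshold and the way the non-discriminative region is filled (copied from another sample's features) are assumptions for illustration, since the note only says that this region is modified.

```python
import numpy as np

def class_activation_map(features, class_weights):
    """features: (C, H, W) feature map; class_weights: (C,) classifier
    weights for one class. Returns the (H, W) activation map."""
    return np.tensordot(class_weights, features, axes=1)

def discriminative_mask(cam, threshold=0.5):
    """True where the min-max normalized CAM reaches the threshold.
    The threshold value is a hypothetical choice."""
    lo, hi = cam.min(), cam.max()
    normalized = (cam - lo) / (hi - lo + 1e-12)
    return normalized >= threshold

def mix_features(tail_features, other_features, mask):
    """Keep the tail-class features inside the mask and take the rest
    from another sample (assumed augmentation strategy)."""
    return np.where(mask, tail_features, other_features)
```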
StoRIR: Stochastic Room Impulse Response Generation for Audio Data Augmentation
Generates room impulse responses that are convolved with speech to create reverberant audio for data augmentation. Open source. https://github.com/SRPOL-AUI/storir
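To show how an impulse response is used for augmentation, here is a small sketch. It does not use the StoRIR library or its algorithm. The impulse response is simply exponentially decaying noise, a common toy model, and the function names and parameters are hypothetical.

```python
import numpy as np

def synthetic_rir(rt60, sample_rate=16000, seed=0):
    """Toy room impulse response: white noise under an exponential
    envelope whose amplitude drops by 60 dB at t = rt60 seconds."""
    rng = np.random.default_rng(seed)
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    envelope = np.exp(-np.log(1000.0) * t / rt60)  # 10^-3 at t = rt60
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct path
    return rir / np.max(np.abs(rir))

def add_reverb(signal, rir):
    """Convolve the dry signal with the impulse response and trim the
    result back to the original length."""
    return np.convolve(signal, rir)[: len(signal)]
```

Drawing a different `rt60` and `seed` for each training example yields a different simulated room each time.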
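Open-set Short Utterance Forensic Speaker Verification using Teacher-Student Network with Explicit Inductive Bias
A speaker verification model is first built on long-utterance data, and a model for short utterances is then trained with teacher-student learning. A regularization term is applied to the weights during updates.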
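One way such a weight regularizer could look is sketched below. It assumes the penalty is the squared distance between the student weights and the teacher (starting-point) weights, in the style of L2-SP. The note does not spell out the exact form, so this is a guess at the idea rather than the paper's formula.

```python
import numpy as np

def weight_distance_penalty(student_weights, teacher_weights, alpha=0.01):
    """0.5 * alpha * sum of squared differences between matching
    student and teacher weight arrays (assumed L2-SP style penalty).

    Both arguments are lists of arrays with matching shapes.
    alpha is a hypothetical regularization strength.
    """
    total = 0.0
    for s, t in zip(student_weights, teacher_weights):
        total += np.sum((np.asarray(s) - np.asarray(t)) ** 2)
    return 0.5 * alpha * float(total)
```

This term would be added to the teacher-student loss, so the student is discouraged from drifting far from the long-utterance model.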
Sound Event Localization and Detection based on CRNN using Dense Rectangular Filters and Channel Rotation Data Augmentation
Data augmentation specialized for sound event localization: channel swapping and rotation of the azimuth angle.
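Channel swapping and azimuth rotation go together, as the sketch below illustrates for first-order Ambisonics. It assumes the channels are ordered W, X, Y, Z with X = cos(az)cos(el), Y = sin(az)cos(el), Z = sin(el). Real datasets may use a different channel order or scaling, and the function names are illustrative.

```python
import numpy as np

def encode_foa(signal, azimuth_deg, elevation_deg=0.0):
    """Encode a mono signal as a plane wave in (W, X, Y, Z) order.
    Channel gains are unnormalized for simplicity."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.stack([
        signal,
        signal * np.cos(az) * np.cos(el),
        signal * np.sin(az) * np.cos(el),
        signal * np.sin(el),
    ])

def wrap_degrees(angle):
    """Wrap an angle to the range [-180, 180)."""
    return (angle + 180) % 360 - 180

def rotate_azimuth_90(foa, azimuth_deg):
    """Rotate the scene by +90 degrees: X' = -Y, Y' = X."""
    w, x, y, z = foa
    return np.stack([w, -y, x, z]), wrap_degrees(azimuth_deg + 90)

def flip_azimuth(foa, azimuth_deg):
    """Mirror the scene (azimuth -> -azimuth): Y' = -Y."""
    w, x, y, z = foa
    return np.stack([w, x, -y, z]), -azimuth_deg
```

The key point is that the audio channels and the direction label are transformed together, so the new sample stays consistent.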
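Exploiting Spectral Augmentation for Code-Switched Spoken Language Identification
Data augmentation specialized for the LID (language identification) task on code-switching speech. Following SpecAugment, the spectrum of the portion spoken in a specific language is masked. The masked spectrum, paired with language labels masked over the corresponding portion, is used as a new sample.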
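A minimal sketch of this masking, assuming frame-level language labels, a zero fill value, and `-1` as the label for masked frames. All three are assumptions, since the note does not give these details, and the sketch masks every frame of the chosen language rather than a random sub-span.

```python
import numpy as np

def mask_language_segment(spec, labels, target_lang,
                          mask_value=0.0, mask_label=-1):
    """Mask all frames of one language in a spectrogram and its labels.

    spec: (T, F) spectrogram; labels: (T,) integer language id per frame.
    Returns new arrays and leaves the inputs untouched.
    """
    spec_aug = spec.copy()
    labels_aug = labels.copy()
    frames = labels == target_lang   # boolean mask over time
    spec_aug[frames, :] = mask_value
    labels_aug[frames] = mask_label
    return spec_aug, labels_aug
```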
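Dice Loss for Data-imbalanced NLP Tasks
When a task whose final evaluation metric is the F-score is trained with CE, a gap arises between accuracy and F-score. The paper therefore proposes a loss function based on the Dice coefficient (= F1 score). (Effective for tasks evaluated with the F-score. When accuracy is the final metric, use CE instead.)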
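The link between the Dice coefficient and F1 can be checked with a plain soft Dice loss for binary labels. This is the basic form only, and the paper may use a modified variant. The smoothing constant `eps` is an assumption.

```python
import numpy as np

def soft_dice_loss(probs, targets, eps=1e-8):
    """1 - Dice coefficient for binary targets.

    probs: predicted probabilities of the positive class.
    targets: 0/1 ground-truth labels.
    With hard 0/1 predictions, Dice = 2TP / (2TP + FP + FN) = F1.
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    intersection = np.sum(probs * targets)
    dice = (2.0 * intersection + eps) / (np.sum(probs) + np.sum(targets) + eps)
    return 1.0 - dice
```

With hard predictions `[1, 1, 1, 0, 0]` against labels `[1, 1, 0, 1, 0]`, there are 2 true positives, 1 false positive and 1 false negative, so Dice equals F1 = 4/6.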
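Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
Attention is applied over the frequency bins of the STFT (rather than along the time axis). The authors claim that what sets it apart from other methods is that attention can be applied at an early stage of the model.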
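As a rough picture of attention over frequency bins, the sketch below scores each bin per frame with a single linear layer and applies a softmax along the frequency axis. The scoring function and the parameter shapes are assumptions, and the paper's actual architecture may differ.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = np.exp(x - x.max(axis=axis, keepdims=True))
    return shifted / shifted.sum(axis=axis, keepdims=True)

def frequency_attention(spec, weight, bias):
    """Reweight the frequency bins of each frame.

    spec: (F, T) magnitude STFT; weight: (F, F) and bias: (F,) are
    hypothetical learnable parameters of the scoring layer.
    Returns the reweighted spectrogram and the attention weights.
    """
    scores = weight @ spec + bias[:, None]  # (F, T)
    attn = softmax(scores, axis=0)          # sums to 1 over frequency
    return attn * spec, attn
```

The weights are applied directly to the input spectrogram, before any convolutional or recurrent layers, which is the "early" part of the idea.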