
librosa vs python_speech_features vs tensorflow

Python

青春有我 2022-08-16 16:08:06

I am trying to extract MFCC features from audio (a .wav file). I have tried python_speech_features and librosa, but they give completely different results:

import librosa
from python_speech_features import mfcc

audio, sr = librosa.load(file, sr=None)  # `file` is the path to the .wav file

# librosa
hop_length = int(sr/100)
n_fft = int(sr/40)
features_librosa = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=hop_length, n_fft=n_fft)

# psf
features_psf = mfcc(audio, sr, numcep=13, winlen=0.025, winstep=0.01)

I also tried tf.signal from TensorFlow. The plot itself is closer to the one from librosa, but the scale is closer to python_speech_features. (Note that here I calculated 80 mel bins and took the first 13; if I do the calculation with only 13 bins, the result looks quite different as well.) The code is below:

import numpy as np
import tensorflow as tf

stfts = tf.signal.stft(audio, frame_length=n_fft, frame_step=hop_length, fft_length=512)
spectrograms = tf.abs(stfts)
num_spectrogram_bins = stfts.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sr, lower_edge_hertz, upper_edge_hertz)
mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:]))
log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
features_tf = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)[..., :13]
features_tf = np.array(features_tf).T

I guess my question is: which output is closer to what MFCC actually looks like?


2 Answers

繁星coding

Contributed 1797 experience points, received over 4 likes

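There are at least two factors that explain why you get different results:

There is no single definition of the mel scale. Librosa implements two: Slaney and HTK. Other packages may use different definitions, which leads to different results. That said, the overall picture should be similar. This brings us to the second issue...
python_speech_features by default puts the energy in as the first (index zero) coefficient (appendEnergy is True by default), which means that when you ask for, say, 13 MFCCs, you effectively get 12 + 1.

In other words, you were not comparing 13 librosa coefficients against 13 python_speech_features coefficients, but 13 against 12. The energy can have a different magnitude and therefore, because the colour scale changes, produce a completely different picture.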
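To make the first point concrete, here is a minimal sketch of the two mel-scale formulas, written with the standard library only. The constants follow the commonly cited HTK and Slaney definitions; the function names and the band_edges helper are illustrative and are not part of any of the packages discussed. The sketch places equally spaced points on each mel scale and converts them back to Hz, which is where filter bank edges would end up.

```python
import math

def hz_to_mel_htk(freq_hz):
    """HTK-style mel scale: one logarithmic formula for all frequencies."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Constants of the Slaney-style scale: linear below 1 kHz, logarithmic above.
F_SP = 200.0 / 3.0                    # Hz per mel in the linear region
MIN_LOG_HZ = 1000.0                   # where the logarithmic region starts
MIN_LOG_MEL = MIN_LOG_HZ / F_SP       # the same point expressed in mels (15.0)
LOGSTEP = math.log(6.4) / 27.0        # step size in the logarithmic region

def hz_to_mel_slaney(freq_hz):
    """Slaney-style mel scale."""
    if freq_hz < MIN_LOG_HZ:
        return freq_hz / F_SP
    return MIN_LOG_MEL + math.log(freq_hz / MIN_LOG_HZ) / LOGSTEP

def mel_to_hz_slaney(mel):
    """Inverse of hz_to_mel_slaney."""
    if mel < MIN_LOG_MEL:
        return F_SP * mel
    return MIN_LOG_HZ * math.exp(LOGSTEP * (mel - MIN_LOG_MEL))

def band_edges(n_points, fmin, fmax, to_mel, to_hz):
    """Hypothetical helper: n_points equally spaced on the mel axis, returned in Hz."""
    lo, hi = to_mel(fmin), to_mel(fmax)
    step = (hi - lo) / (n_points - 1)
    return [to_hz(lo + i * step) for i in range(n_points)]

if __name__ == "__main__":
    print("HTK   :", [round(f) for f in band_edges(5, 0.0, 8000.0, hz_to_mel_htk, mel_to_hz_htk)])
    print("Slaney:", [round(f) for f in band_edges(5, 0.0, 8000.0, hz_to_mel_slaney, mel_to_hz_slaney)])
```

The absolute mel values of the two scales are not comparable. What matters is that the same number of bands over the same frequency range ends up with different edges, so the filter bank outputs, and the MFCCs derived from them, differ.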

Now I will demonstrate how the two modules can produce similar results:

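import librosa
import python_speech_features
import matplotlib.pyplot as plt
from scipy.signal.windows import hann
import seaborn as sns

# Shared settings, so that both libraries analyse the signal the same way
n_mfcc = 13
n_mels = 40
n_fft = 512
hop_length = 160
fmin = 0
fmax = None
sr = 16000

# Note: recent librosa releases no longer ship example_audio_file();
# librosa.ex('vibeace') should fetch the same example clip there.
y, sr = librosa.load(librosa.util.example_audio_file(), sr=sr, duration=5, offset=30)

# librosa returns an array of shape (n_mfcc, n_frames)
mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                    n_mfcc=n_mfcc, n_mels=n_mels,
                                    hop_length=hop_length,
                                    fmin=fmin, fmax=fmax, htk=False)

# python_speech_features returns an array of shape (n_frames, numcep);
# pre-emphasis, liftering and energy replacement are switched off to match librosa
mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                          numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                          preemph=0.0, ceplifter=0, appendEnergy=False, winfunc=hann)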

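As you can see, the scale is different, but the overall picture looks very similar. Note that I had to make sure that a number of the parameters passed to the two modules are the same.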
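Part of matching the parameters is converting between units: librosa takes window and hop sizes in samples, while python_speech_features takes winlen and winstep in seconds. The sketch below shows that conversion and also estimates how many frames each framing style produces. The helper names are hypothetical, and the two frame-count formulas are assumptions about the default framing behaviour (centred, padded frames for librosa; frames starting at sample 0 with a zero-padded tail for python_speech_features), so check them against the array shapes you actually get.

```python
import math

def psf_window_args(n_fft, hop_length, sr):
    """Convert sample-based sizes to the second-based winlen/winstep arguments."""
    return {"winlen": n_fft / sr, "winstep": hop_length / sr, "nfft": n_fft}

def frames_centered(n_samples, hop_length):
    """Frame count when the signal is padded so that frames are centred
    on multiples of hop_length (assumed librosa default)."""
    return 1 + n_samples // hop_length

def frames_uncentered(n_samples, frame_len, hop_length):
    """Frame count when the first frame starts at sample 0 and the last
    frame is zero-padded (assumed python_speech_features behaviour)."""
    if n_samples <= frame_len:
        return 1
    return 1 + math.ceil((n_samples - frame_len) / hop_length)

if __name__ == "__main__":
    sr, n_fft, hop_length = 16000, 512, 160
    n_samples = 5 * sr  # five seconds of audio
    print(psf_window_args(n_fft, hop_length, sr))
    print(frames_centered(n_samples, hop_length))
    print(frames_uncentered(n_samples, n_fft, hop_length))
```

Under these assumptions the two libraries return slightly different numbers of frames for the same clip, and the frames are offset by half a window, so the curves can look shifted against each other even when every other parameter agrees.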

2022-08-16

12345678_0001

Contributed 1802 experience points, received over 5 likes

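This is the sort of thing that keeps me awake at night. The other answer is correct (and extremely useful!) but not complete, because it does not explain the wide variance between the two approaches. My answer adds a significant extra detail but still does not achieve an exact match.

What is going on is complicated, and is best explained with the lengthy block of code below, which compares librosa and python_speech_features with yet another package, torchaudio.

First, note that torchaudio's implementation has an argument, log_mels, whose default (False) mimics the librosa implementation, but which mimics python_speech_features if set to True. In both cases the results are still not exact, but the similarities are obvious.
Second, if you dig into the code of torchaudio's implementation, you will find a note that the default is not a "textbook implementation" (torchaudio's words, but I trust them) and is provided for librosa compatibility. The key operation in torchaudio that switches from one to the other is: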

mel_specgram = self.MelSpectrogram(waveform)
if self.log_mels:
    log_offset = 1e-6
    mel_specgram = torch.log(mel_specgram + log_offset)
else:
    mel_specgram = self.amplitude_to_DB(mel_specgram)

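Third, you will quite reasonably wonder whether you can force librosa to act correctly. The answer is yes (or at least, "it looks like it"): take the mel spectrogram directly, take its natural log, and pass that, rather than the raw samples, as the input to the librosa mfcc function. See the code below for details.
Finally, be careful: if you use this code, examine what happens when you look at different features. The 0th feature still has a severe unexplained offset, and the higher features tend to drift away from each other. This may be something as simple as different implementations under the hood or slightly different numerical stability constants, or it may be something that can be fixed with fine tuning, such as the choice of padding or perhaps a reference value in a decibel conversion somewhere. I really don't know.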

Here is some sample code:

import librosa
import python_speech_features
import matplotlib.pyplot as plt
import numpy as np  # needed for np.log below
from scipy.signal.windows import hann
import torchaudio.transforms
import torch

n_mfcc = 13
n_mels = 40
n_fft = 512
hop_length = 160
fmin = 0
fmax = None
sr = 16000

melkwargs = {"n_fft": n_fft, "n_mels": n_mels, "hop_length": hop_length, "f_min": fmin, "f_max": fmax}

# Note: recent librosa releases no longer ship example_audio_file();
# librosa.ex('vibeace') should fetch the same example clip there.
y, sr = librosa.load(librosa.util.example_audio_file(), sr=sr, duration=5, offset=30)

# Default librosa with db mel scale
mfcc_lib_db = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                   n_mfcc=n_mfcc, n_mels=n_mels,
                                   hop_length=hop_length,
                                   fmin=fmin, fmax=fmax, htk=False)

# Nearly identical to above (uses S from the helper below)
# mfcc_lib_db = librosa.feature.mfcc(S=librosa.power_to_db(S), n_mfcc=n_mfcc, htk=False)

# Modified librosa with log mel scale (helper)
# n_fft is passed explicitly so the window matches the other calls
# (librosa's melspectrogram otherwise defaults to n_fft=2048)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=fmin,
                                   fmax=fmax, hop_length=hop_length)

# Modified librosa with log mel scale
mfcc_lib_log = librosa.feature.mfcc(S=np.log(S + 1e-6), n_mfcc=n_mfcc, htk=False)

# Python_speech_features
mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                          numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                          preemph=0.0, ceplifter=0, appendEnergy=False, winfunc=hann)

# Torchaudio 'textbook' log mel scale
mfcc_torch_log = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=n_mfcc,
                                            dct_type=2, norm='ortho', log_mels=True,
                                            melkwargs=melkwargs)(torch.from_numpy(y))

# Torchaudio 'librosa compatible' default dB mel scale
mfcc_torch_db = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=n_mfcc,
                                           dct_type=2, norm='ortho', log_mels=False,
                                           melkwargs=melkwargs)(torch.from_numpy(y))

feature = 1  # <-------- Play with this!!

# Top panel: all five variants of the chosen coefficient
plt.subplot(2, 1, 1)
plt.plot(mfcc_lib_log.T[:, feature], 'k')
plt.plot(mfcc_lib_db.T[:, feature], 'b')
plt.plot(mfcc_speech[:, feature], 'r')
plt.plot(mfcc_torch_log.T[:, feature], 'c')
plt.plot(mfcc_torch_db.T[:, feature], 'g')
plt.grid()

# Bottom left: the log mel variants only
plt.subplot(2, 2, 3)
plt.plot(mfcc_lib_log.T[:, feature], 'k')
plt.plot(mfcc_torch_log.T[:, feature], 'c')
plt.plot(mfcc_speech[:, feature], 'r')
plt.grid()

# Bottom right: the dB mel variants only
plt.subplot(2, 2, 4)
plt.plot(mfcc_lib_db.T[:, feature], 'b')
plt.plot(mfcc_torch_db.T[:, feature], 'g')
plt.grid()

plt.show()

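Honestly, none of these implementations is satisfying:

Python_speech_features takes the inexplicably bizarre approach of replacing the 0th feature with energy rather than augmenting the features with it, and has no commonly used delta implementation.
Librosa is non-standard by default, with no warning, and lacks an obvious way to augment the features with energy, but it has a highly competent delta function elsewhere in the library.
Torchaudio will emulate either one and also has a versatile delta function, but still has no clean, obvious way to get the energy.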
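The difference between replacing the 0th coefficient with energy and augmenting the feature vector with it can be sketched in a few lines. This is an illustration under stated assumptions rather than any library's actual code: it assumes the energy term is the natural log of the summed power spectrum of a frame, with a small floor to avoid log(0), and all three function names are hypothetical.

```python
import math
import sys

def frame_log_energy(power_spectrum):
    """Log of the total power in one frame, floored to avoid log(0)."""
    total = sum(power_spectrum)
    return math.log(total if total > 0.0 else sys.float_info.epsilon)

def replace_c0(cepstra, log_energy):
    """Replace coefficient 0 with log energy: 13 values in, 13 out (12 + 1)."""
    return [log_energy] + list(cepstra[1:])

def augment_with_energy(cepstra, log_energy):
    """Keep all coefficients and append log energy: 13 values in, 14 out."""
    return list(cepstra) + [log_energy]

if __name__ == "__main__":
    cepstra = [float(i) for i in range(13)]   # placeholder coefficients
    energy = frame_log_energy([1.0, 2.0, 3.0])
    print(len(replace_c0(cepstra, energy)), len(augment_with_energy(cepstra, energy)))
```

With replacement, index 0 of one library's output is no longer comparable with index 0 of another's, which is why the comparisons in both answers pass appendEnergy=False.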

2022-08-16
