MADL!AR
Code is cheap, show me the PPT!
Deploying MicroSpeech to an MCU
Category: Hardware
Published: 2025-04-16
## MicroSpeech TFLite

MicroSpeech is an example project in TensorFlow Lite (TFLite) designed for keyword spotting (KWS) on resource-constrained microcontrollers (MCUs) and embedded devices. It detects simple spoken commands ("yes", "no", "on", "off", and so on) in real time, making it a fit for low-power scenarios such as smart homes and wearables. This post records the process from code generation to edge deployment, with a quick performance evaluation at the end.

### Code generation

First, clone the official [tflite-micro repository](https://github.com/tensorflow/tflite-micro). The generation script lives at `tensorflow/lite/micro/tools/project_generation/create_tflm_tree.py`; run it with no arguments to see the help text.

TFLite supports accelerated inference via Arm's CMSIS-NN library, typically 5-10x faster than generic-instruction kernels. CMSIS-NN requires porting work, however, and since this is only a feasibility check, we generate the generic code to keep deployment simple:

```
python tensorflow/lite/micro/tools/project_generation/create_tflm_tree.py --example="micro_speech" ../gen_micro_speech
```

This generates the project, named `gen_micro_speech`, in the directory alongside tflite-micro. The generic code can be compiled and run on x86: add a `CMakeLists.txt` under `gen_micro_speech` with the following content:

```
cmake_minimum_required(VERSION 3.28)
project(gen_micro_speech)

set(CMAKE_CXX_STANDARD 17)
add_compile_options(-fno-rtti -fno-exceptions)

file(GLOB_RECURSE SRC ${CMAKE_CURRENT_SOURCE_DIR}/*.cc)
add_executable(gen_micro_speech ${SRC})

target_include_directories(gen_micro_speech PUBLIC
    ${CMAKE_CURRENT_SOURCE_DIR}/
    ${CMAKE_CURRENT_SOURCE_DIR}/examples
    ${CMAKE_CURRENT_SOURCE_DIR}/examples/micro_speech
    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/flatbuffers/include
    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/gemmlowp
    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/kissfft
    ${CMAKE_CURRENT_SOURCE_DIR}/third_party/ruy
)

target_compile_definitions(gen_micro_speech PUBLIC
    TF_LITE_STATIC_MEMORY
)
```

Besides the include paths for the TFLite kernels and the third-party libraries, this adds a few extra settings:

* `add_compile_options(-fno-rtti -fno-exceptions)` disables run-time type information and C++ exceptions
* the `TF_LITE_STATIC_MEMORY` macro is defined

These settings shrink the binary and improve performance, and they also match the constraints of embedded targets. The macro in particular matters: without it, the code tries to include `tensorflow/lite/array.h`, which appears to belong to a facility that dynamically allocates the model arena, but tflite-micro does not ship that code, so it would have to be ported separately. Defining static memory is the simpler route.
Build and run:

```
mkdir build && cd build && cmake .. && make -j20
```

As of April 2025 the build fails with:

```
/usr/bin/ld: CMakeFiles/gen_micro_speech.dir/tensorflow/lite/micro/hexdump_test.cc.o:(.bss+0x0): multiple definition of `micro_test::tests_passed'; CMakeFiles/gen_micro_speech.dir/examples/micro_speech/micro_speech_test.cc.o:(.bss+0x0): first defined here
...
```

This happens because `tensorflow/lite/micro/hexdump_test.cc` also contains tests built on macros such as `TF_LITE_MICRO_TESTS_BEGIN`, which expand into a `main` function that runs the test logic, so the two test files collide. Delete `tensorflow/lite/micro/hexdump_test.cc` and rebuild; the build now succeeds and produces the executable `gen_micro_speech` in the build directory. Running it prints:

```
Testing NoFeatureTest
AudioPreprocessor model arena size = 9944
Testing YesFeatureTest
AudioPreprocessor model arena size = 9944
Testing NoTest
AudioPreprocessor model arena size = 9944
MicroSpeech model arena size = 7304
MicroSpeech category predictions for
0.0000 silence
0.0547 unknown
0.0000 yes
0.9453 no
Testing YesTest
...
6/6 tests passed
~~~ALL TESTS PASSED~~~
```
### Deploying to an embedded device

#### 1. Creating the project

The target here is an STM32H723VGT6 (1 MB flash). Open CubeIDE and create a project named H723_MicroSpeech. Things to note:

* TFLite is a C++ project, so the project type must also be C++
* choose "copy only the necessary library files", which generates the CMSIS core files under Drivers (CMSIS core, not CMSIS-NN; they are different things)
* configure the maximum clock and a UART, then generate code
* release mode makes it easy to disable debugging and enable high optimization levels to squeeze out performance, so **select release mode**

Create an `External` folder in the project root to hold the TFLite files and user code. Add `inner_main.h` and `inner_main.c` under `External/custom` and define `myMain()` there to run our own logic. The call chain is:

`main.c -> inner_main.c -> tensorflow code`

Now register the `External` folder with the project configuration: right-click the project -> Properties -> C/C++ General -> Paths and Symbols. In the "Includes" tab, add the `External` path for both GNU C and GNU C++; in the "Source Location" tab, add the `External` folder, otherwise the build will not find the source files and hence the function implementations.

#### 2. Porting the TFLite files

This time pass two `makefile_options` parameters so the script generates Cortex-M code:

```
python3 tensorflow/lite/micro/tools/project_generation/create_tflm_tree.py --makefile_options="TARGET=cortex_m_generic TARGET_ARCH=cortex-m7" --example="micro_speech" ../gen_micro_speech_cortex
```

Place the generated `gen_micro_speech_cortex` folder under `External`, **delete `tensorflow/lite/micro/hexdump_test.cc`**, and again under Paths and Symbols, in the "Includes" tab, add the following directories for both GNU C and GNU C++:

* External/gen_micro_speech_cortex
* External/gen_micro_speech_cortex/third_party
* External/gen_micro_speech_cortex/third_party/kissfft
* External/gen_micro_speech_cortex/third_party/ruy
* External/gen_micro_speech_cortex/third_party/gemmlowp
* External/gen_micro_speech_cortex/third_party/flatbuffers/include
* External/gen_micro_speech_cortex/examples/micro_speech
#### 3. Fixing the DCB error

After adding the files, the build fails with:

```
../External/gen_micro_speech_cortex/tensorflow/lite/micro/cortex_m_generic/micro_time.cc:55:5: error: 'DCB' was not declared in this scope
   55 |     DCB->DEMCR |= DCB_DEMCR_TRCENA_Msk;
      |     ^~~
```

The compiler cannot find the definition of the Arm Cortex-M DCB (Debug Control Block) registers. Right-click the project -> Properties -> C/C++ General -> Paths and Symbols, and in the "Symbols" tab define `CMSIS_DEVICE_ARM_CORTEX_M_XX_HEADER_FILE` with the value `"stm32h7xx.h"`. This macro is rendered into an `#include ...` directive in the source, so the double quotes must not be removed. Even so, the build still fails: `DCB->DEMCR` is an older spelling of the Debug Control Block access, while on Arm Cortex-M parts the debug registers are normally reached through the DCB or CoreDebug structs provided by CMSIS.

The fix is to replace `DCB->DEMCR |= DCB_DEMCR_TRCENA_Msk;` in `tensorflow/lite/micro/cortex_m_generic/micro_time.cc` with `CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;`. Rebuild and this error is gone.
#### 4. std::byte not recognized

The build may report another error:

```
../External/gen_micro_speech_cortex/tensorflow/lite/micro/hexdump.h:26:25: error: ISO C++ forbids declaration of 'type name' with no type [-fpermissive]
   26 | void hexdump(Span<const std::byte> region);
      |                         ^~~
```

This happens because the project does not enable the C++17 standard. Fix: right-click the project -> Properties -> C/C++ Build -> Settings -> MCU/MPU G++ Compiler -> Miscellaneous, click Add, and add the flag `-std=c++17`.

#### 5. Fixing the "array.h not found" error

```
../External/gen_micro_speech_cortex/tensorflow/lite/kernels/kernel_util.cc:28:10: fatal error: tensorflow/lite/array.h: No such file or directory
   28 | #include "tensorflow/lite/array.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~
```

As on x86: right-click the project -> Properties -> C/C++ General -> Paths and Symbols, and in the "Symbols" tab define `TF_LITE_STATIC_MEMORY`.

#### 6. Fixing the duplicate definition of main

With all of the above resolved, the build fails once more:

```
micro_speech_test.cc:(.text.startup.main+0x0): multiple definition of `main'; ./Core/Src/main.o:main.c:(.text.startup.main+0x0): first defined here
collect2.exe: error: ld returned 1 exit status
make: *** [makefile:110: H723_MicroSpeech.elf] Error 1
```

The example code uses the `TF_LITE_MICRO_TESTS_BEGIN` macro from the test framework, which expands into a `main` function that collides with the `main` already defined in the project.

Rewrite this logic: create `micro_speech_test.h` as a test interface. Part of the file:

```
#ifndef __MICRO_SPEECH_TEST_H
#define __MICRO_SPEECH_TEST_H

#ifdef __cplusplus
extern "C" {
#endif

void Invoke();

#ifdef __cplusplus
}
#endif

#endif
```

In `micro_speech_test.cc`, implement `Invoke`. First the feature-generation half, abridged:

```
...
#include "..."
#include "custom/debug.h"

namespace {

constexpr size_t kArenaSize = 1024 * 50;  // 28584; // xtensa p6
alignas(16) uint8_t g_arena[kArenaSize];
...

using MicroSpeechOpResolver = tflite::MicroMutableOpResolver<4>;
using AudioPreprocessorOpResolver = tflite::MicroMutableOpResolver<18>;

TfLiteStatus RegisterOps(MicroSpeechOpResolver& op_resolver) { ... }
TfLiteStatus RegisterOps(AudioPreprocessorOpResolver& op_resolver) { ... }

}  // namespace

extern "C" {

TfLiteStatus GenerateSingleFeature(const int16_t* audio_data,
                                   const int audio_data_size,
                                   int8_t* feature_output,
                                   tflite::MicroInterpreter* interpreter) { ... }

TfLiteStatus GenerateFeatures(const int16_t* audio_data,
                              const size_t audio_data_size,
                              Features* features_output) { ... }
```
Then the inference half:

```
TfLiteStatus LoadMicroSpeechModelAndPerformInference(
    const Features& features, const char* expected_label) {
  // Map the model into a usable data structure. This doesn't involve any
  // copying or parsing, it's a very lightweight operation.
  const tflite::Model* model =
      tflite::GetModel(g_micro_speech_quantized_model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    print("schema version error");
    return kTfLiteError;
  }
  ...
  std::copy_n(&features[0][0], kFeatureElementCount,
              tflite::GetTensorData<int8_t>(input));

  uint32_t tick = HAL_GetTick();
  if (interpreter.Invoke() != kTfLiteOk) {
    return kTfLiteError;
  }
  tick = HAL_GetTick() - tick;
  print("micro_speech audio performance cost: %d ms.", tick);
  ...
  return kTfLiteOk;
}

void Invoke() {
  ...
  TfLiteStatus ts = GenerateFeatures(audio_data, audio_data_size, &g_features);
  if (ts != kTfLiteOk) {
    print("gen feature error when test audio");
  }
  LoadMicroSpeechModelAndPerformInference(g_features, label);
  ...
}
```

Flash it to the device; the output is:

```
...
AudioPreprocessor model arena size = 8448
MicroSpeech model arena size = 6788
start invoke!!!
micro_speech audio performance cost: 303 ms.
```

This figure excludes the preprocessing time for extracting MFCC features. In the example, one window is 30 ms with a 20 ms stride, so adjacent windows overlap by 10 ms. To process audio in real time, a single inference must therefore finish within 20 ms, and since the CPU also has other work to do, under 10 ms would be better.

#### 7. Quick optimizations

There are many ways to attack this latency, for example:

* enable -O3 optimization
* place the tensor_arena buffer in tightly coupled SRAM, such as DTCM or AXI SRAM, to reduce latency
* enable the L1 cache
* accelerate with the CMSIS-NN and DSP libraries

Here we only verify the effect of turning on the L1 cache, enabling both the I-Cache and the D-Cache:

```
micro_speech audio performance cost: 51 ms.
```

Nearly a 6x speedup! Next up: porting CMSIS-NN to enable hardware acceleration and try to push inference time down to the 10 ms range.