Audio acquisition with Espressif ESP32 WROOM and WROVER modules


Since 2017, Espressif has been working on integrated hardware and software for audio acquisition and processing, available through various boards and kits plus a software framework. With this, Espressif is targeting smart speaker applications like Alexa and her friends.

The ESP32 is a 2.4 GHz Wi-Fi and Bluetooth combo chip with a 32-bit dual-core CPU running at up to 240 MHz, designed for mobile devices, wearable electronics, and Internet-of-Things (IoT) applications.

It has several peripherals on board, including I2S interfaces for easy integration with dedicated audio chips. These hardware features, together with the Espressif Audio Development Framework (ESP-ADF) software, provide a powerful platform for implementing audio applications with native wireless networking and a capable user interface.

The ESP-ADF provides a range of API components, including Audio Streams, Codecs and Services, organized in an Audio Pipeline, all integrated with the audio hardware through the Media HAL and with the peripherals on board the ESP32.


ESP32-WROOM module series

The modules from the ESP32 WROOM Series are powerful, generic Wi-Fi+BT+BLE MCU modules that target a wide variety of applications, ranging from low-power sensor networks to the most demanding tasks, such as voice encoding, music streaming and MP3 decoding. At the core of this module is the ESP32-D0WDQ6 chip.

Modules in the ESP32 WROOM Series contain the ESP32 SoC, flash memory, precision discrete components and a PCB/IPEX antenna which achieves outstanding RF performance in space-constrained applications.

All modules in the ESP32 WROOM Series are suitable for commercial application development with a robust 4-layer FCC, CE (RED), IC, TELEC, SRRC & KCC-compliant design and a wide operating temperature range of -40°C to 85°C.

ESP32-WROOM Datasheet
ESP32 Modules and Boards » ESP32-WROOM-32D / ESP32-WROOM-32U

ESP32-LyraT board


The ESP32-LyraT development board is a hardware platform designed for dual-core ESP32 audio applications, e.g., Wi-Fi or BT audio speakers, speech-based remote controllers, smart-home appliances with audio functionality, etc.

ESP32-WROVER module

The ESP32-WROVER module contains the ESP32 chip, which provides Wi-Fi / BT connectivity and data processing power, and integrates 32 Mbit SPI flash and 32 Mbit PSRAM for flexible data storage. It is a step up from the ESP32-WROOM-32x modules, with an additional 8 MB SPI PSRAM (pseudo-static RAM).

ESP32-WROVER Datasheet
ESP32 Modules and Boards » ESP32-WROVER

Audio Codec Chip

The Audio Codec Chip, an ES8388, is a low-power stereo audio codec with a headphone amplifier. It consists of a 2-channel ADC, a 2-channel DAC, a microphone amplifier, a headphone amplifier, digital sound effects, and analog mixing and gain functions. It is interfaced with the ESP32-WROVER module over I2S and I2C buses to provide audio processing in hardware, independently of the audio application.
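To get a feeling for how much data such a codec pushes over I2S, here is a small back-of-the-envelope helper. It is only a sketch: the function name is made up for illustration, and it assumes the I2S slot width equals the sample width, which is not necessarily how a given board is configured.

```python
def i2s_rates(sample_rate_hz: int, bits_per_sample: int, channels: int = 2):
    """Return (bit clock in Hz, payload in bytes per second) for a PCM stream.

    Assumption: every slot on the bus is exactly as wide as one sample,
    so the bit clock is simply rate * bits * channels.
    """
    bit_clock_hz = sample_rate_hz * bits_per_sample * channels
    bytes_per_second = bit_clock_hz // 8
    return bit_clock_hz, bytes_per_second


if __name__ == "__main__":
    # 16 kHz / 16 bit / stereo is a common choice for speech acquisition.
    print(i2s_rates(16000, 16, 2))
    # CD-quality audio for comparison.
    print(i2s_rates(44100, 16, 2))
```

The bytes-per-second figure is the one that matters when sizing buffers or estimating how much network bandwidth a raw upload will need.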


ESP32-LyraT V4.3 Board Layout

ESP32-LyraT V4.3 Electrical Block Diagram

ESP32-LyraT Functional Block Diagram

In the press

ESP32-A1S audio development kit

The kit is based on the Ai-Thinker ESP32-A1S module, which is Ai-Thinker’s equivalent to Espressif’s ESP32-WROVER series modules. It is clocked at up to 240 MHz and comes with 520 KB SRAM and 8 MB of on-module PSRAM.

Audio hardware

  • 3.5mm headphone jack for stereo audio output
  • 3.5mm LINE-IN jack
  • 2x 2-pin header for left and right speakers up to 4Ω/3W output
  • 2x built-in microphones


In the press


The hardware is supported by the Espressif Audio Development Framework (ESP-ADF) for ESP32.


The Espressif Audio Development Framework (ESP-ADF).


The API provides a way to develop audio applications using Elements like Codecs, Streams or Filters.
The application is developed by combining the Elements into a Pipeline.

The audio data is typically acquired using an input Stream, processed with Codecs and Filters, and finally output with another Stream. There is an Event Interface to facilitate communication of the application events. Interfacing with specific hardware is done using Peripherals.
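To make the idea of Elements linked into a Pipeline a bit more tangible, here is a host-side sketch in plain Python. It is not the ESP-ADF API (that one is C); all names like `input_stream`, `gain_filter` and `run_pipeline` are hypothetical and only illustrate the source, filter, sink pattern described above.

```python
import struct
from typing import Callable, Iterable, Iterator, List

# An "element" is any callable that turns one stream of byte chunks into
# another. This mirrors the concept only; it is not how ESP-ADF is written.
Element = Callable[[Iterable[bytes]], Iterator[bytes]]


def input_stream(samples: List[int], chunk_size: int = 4) -> Iterator[bytes]:
    """Hypothetical source: emit 16-bit little-endian PCM in small chunks."""
    for i in range(0, len(samples), chunk_size):
        chunk = samples[i:i + chunk_size]
        yield struct.pack("<%dh" % len(chunk), *chunk)


def gain_filter(factor: float) -> Element:
    """Hypothetical filter: scale every sample, clipping to the int16 range."""
    def run(chunks: Iterable[bytes]) -> Iterator[bytes]:
        for chunk in chunks:
            count = len(chunk) // 2
            values = struct.unpack("<%dh" % count, chunk)
            scaled = [max(-32768, min(32767, int(v * factor))) for v in values]
            yield struct.pack("<%dh" % count, *scaled)
    return run


def output_stream(chunks: Iterable[bytes]) -> bytes:
    """Hypothetical sink: collect everything into one buffer."""
    return b"".join(chunks)


def run_pipeline(source: Iterable[bytes], elements: List[Element]) -> bytes:
    """Link the elements in order and pull the data through to the sink."""
    stream = source
    for element in elements:
        stream = element(stream)
    return output_stream(stream)


if __name__ == "__main__":
    pcm = run_pipeline(input_stream([1000, -1000, 20000, -20000, 5]),
                       [gain_filter(2.0)])
    print(struct.unpack("<5h", pcm))
```

Because every stage is a generator, data is pulled through chunk by chunk instead of being buffered as a whole, which is roughly the property that makes pipelines attractive on memory-constrained devices.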

The pipeline-based architecture of the ESP-ADF software framework really feels like a trimmed-down copy of GStreamer:

The [GStreamer] framework is based on plugins that will provide the various codec and other functionality. The plugins can be linked and arranged in a pipeline. This pipeline defines the flow of the data.

What is GStreamer?

Pretty cool.


Elements of the Audio Development Framework


Sample Organization of Elements in Audio Pipeline


The example »Recording WAV file and upload to HTTP Server« already connects I2S to HTTP, which is pretty nifty.
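On the receiving end of such an upload, something has to turn the incoming audio bytes into a playable file. Here is a minimal sketch of that step using only the Python standard library. It assumes the upload arrives as raw PCM and that the format parameters are known; the function name and the defaults (16 kHz, 16 bit, stereo) are assumptions for illustration, not taken from the example itself.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               bits: int = 16, channels: int = 2) -> bytes:
    """Wrap raw PCM bytes in a WAV container and return the file contents.

    The defaults are assumptions; use whatever the device actually sends.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits // 8)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    # Pretend these chunks arrived one by one in the body of an HTTP POST.
    chunks = [b"\x00\x01" * 320, b"\x00\x02" * 320]
    with open("recording.wav", "wb") as f:
        f.write(pcm_to_wav(b"".join(chunks)))
```

A real server would read the chunks from the request body and ideally take the sample rate, bit depth and channel count from request metadata instead of hard-coding them.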

To give you an idea what is currently happening there: Just recently, important example programs including improvements to the underlying machinery have been added or updated to

This feels promising enough that we should at least dip our toes into it for audio acquisition and processing on ESP machines. I will be happy to hear anything from whoever finds time to get into this during cold(?) winter times.



@clemens and other straight Arduino users who are reading this might ask:

Is it possible to use ESP-ADF from Arduino IDE with other Arduino libraries?


Just some non-burning tires ahead: Read about how to configure »Arduino core for ESP32« as a component of ESP-IDF. So, things should be turned inside out from an Arduino perspective to make the other things work. As always ;].

For getting started with everything in general, you might enjoy reading these editors’ picks:

ESP-ADF v1.0 released


Espressif Systems released version 1.0 of its Espressif Audio Development Framework (ESP-ADF) just a few days ago.

ESP-ADF is an open-source platform that can be used for developing a variety of audio applications, ranging from connected speakers to story-telling toys.

ESP-ADF v1.0 Released | Espressif Systems (Shanghai, China on Jan 4, 2019)


ESP-ADF includes a rich set of features, such as codecs, source and sink streams, pipelining support, different services and controls, and even a wake-word engine.

Espressif’s Audio Development Framework:

  • Supports popular audio formats: MP3, AAC, WAV, OGG, AMR, TS, OPUS, SPEEX, etc.
  • Supports the creation of sound effects with tools such as: EQ, Mixer and Resample.
  • Plays music from sources such as: HTTP, HLS (HTTP Live), SD card, Bluetooth A2DP/HFP.
  • Integrates Media services such as: DLNA, Airplay, WeChat, Internet radio.
  • Supports voice recognition and integration with online services: Alexa, DuerOS, Turing, IFLYTEK, TmallGenie, RooBo, etc.

Potential applications deploying ESP-ADF include smart speakers, voice-activated walkie-talkies, broadcasters and other audio-enabled solutions, such as connected story-telling toys and point-reading pens.

Now in 2019, it looks like they’ve released something reasonable.

Espressif’s New Voice Assistant, ESP-Skainet, Released


WakeNet, which is a wake word engine built upon neural network, is specially designed for low-power embedded MCUs. Now, the WakeNet model supports up to 5 wake words.


MultiNet is a lightweight model based on CRNN and CTC, specially designed for multi-command recognition on the ESP32. Now, up to 100 speech commands, including customized commands, are supported.

ESP-ADF v2.0 released

ESP-ADF 2.0 was released an hour ago. It includes many updates and new features; the documentation is available in the Espressif Audio Development Guide.


Hi again,

just for the record: ESP-ADF v2.1 was released on 23 Jul 2020 and ESP-ADF v2.2 on 5 Nov 2020. The release pages on GitHub are a good place to get a glimpse of the respective updates and improvements.

With kind regards,


ESP-ADF v2.3 has just been released.