Investigating core panics on the LoPy4

pascalschaefer · December 30, 2019, 11:38am

Continuing the discussion from Testing the custom "dragonfly" builds on Pycom devices:

thanks a lot for the Dragonfly firmware. I have tested LoPy4-1.20.1.r1-0.7.0-vanilla-dragonfly-onewire with our MicroPython code.

As described by you, the file upload in visual studio has worked very good with your package.
Unfortunately our random core panics still occurs.

They are always happening during a Lora socket read/write/ maybe it is also in the nvram_save function. But only if the device is busy adding events and in parallel via Timer.Alarm a threat is started to synchronize the device time using the following Lora Code.

def sendPayload(self, data, updateTime = True):
    try:
        with self.socketLock:
            self.log("> sending", len(data), "bytes:", binascii.hexlify(data))
            responseData = None
            # create a LoRa socket
            s = socket.socket(socket.AF_LORA, socket.SOCK_RAW)
            try:
                # data rate: 5=SF7 (fastest), 4=SF8, 3=SF9 (longest airtime)
                #s.setsockopt(socket.SOL_LORA, socket.SO_DR, 3)

                # send the data
                s.setblocking(True)
                #bla = 4 / 0
                s.send(data)
                s.setblocking(False)
                # read downlink data
                responseData = s.recv(64)
            except Exception as e:
                self.log("ERROR", "LORA Socket Exception", e)
            s.close()
            if responseData != None:
                responseLen = len(responseData)
                if responseLen > 0:
                    self.log("< received", responseLen, "bytes:", binascii.hexlify(responseData))
                else:
                    self.log("< no downlink")
            # log
            if updateTime == True:
                self.lastUplinkTime = time.time()
            self.log(self.stats())
            time.sleep_ms(10)
            # save frame counters
            self.lora.nvram_save()
            time.sleep_ms(5)
            return responseData
    except Exception as e:
        self.log("ERROR", "Unable to send payload", e.args[0], e)
    return False

Normally during a normal device setup, it takes some days till the core panic occur.
Time gets synchronized each hour, and from time to time a event gets added to the device via badging.

when making automated tests with starting the Time.Alarm each minute and in parallel adding events each minute (writing to a file), it crashes within minutes.

I really don’t know how to go forward with this, because the core panic freezes the devices. If it would restart as desired we could live with it.

Thanks for all your investments

weef · December 30, 2019, 11:29pm

Beim genannten Problem kann ich nicht helfen, daher als workaround:

Ist denn der watchdog aktiviert?

Oder ist der ‘kaputt’, siehe z.B. hier:

Andreas · January 1, 2020, 10:11pm

Dear Pascal,

thanks for writing in and welcome to our little community.

Thank you for the detailed description of the conditions within you are observing the core panic.

Can I ask you for a core dump in order to further investigate this issue?

With kind regards,
Andreas.

pascalschaefer · January 2, 2020, 4:30pm

Hi @Andreas @weef

Thanks a lot for your answers.

The watchdog is activated yet seems to be broken. After 1 minute or longer, the device is not restarting. It is very strange, because we had similar situations earlier this year were 2 threads were running and once one ended in a core panic the device restarted successfully.

But now with 1.20.1.r1 it is no longer restarting and the device seems to be completely freezed.
It seems related to the Lora Modul / ESP32 and Timer.Alarm(), because our Firmware supports the exact same functionality via WLAN.

When the device is in WLAN Mode there are no Core Panics, because the Time Synchronization doesn’t need to be run separately via Timer.Alarm(), because Request/Response are straight forward.

I have attached the log of the Core Panic. It is always the same output. From the log it looks like that even the printing breaks suddenly, and then the esp32 is in a state, were the watchdog / restart doesn’t work.

I had once the idea, to disable the print in the esp32, that without the printing of the exception the device would maybe restart.

However, I was not able to recompile the esp32 part with the pycom as described in following post:
I think you do similar things for the dragonfly package ?

With kind regards,
Pascal.

Here is the log output

TRACE [LORA] Preparing to send CMD = 2 , SEQ_NO = 23
TRACE [LORA] CMD 0x01 [NFC_DETECTED] SEQ# 23 . uid = 02000000 , ts = 1577980530
TRACE [LORA] > sending 11 bytes: b'01170072120e5e02000000'
TRACE [LORA] < no downlink
TRACE [LORA] (rx_timestamp=148130812, rssi=-96, snr=7.0, sfrx=5, sftx=5, tx_trials=0, tx_power=0, tx_time_on_air=62, tx_counter=16186, tx_frequency=867300000)
TRACE [EventSender] pending events: True
TRACE [EventSender] Publishing event # 24  with CMD 2
TRACE [EventSender] Handling event # 24  with CMD 2
TRACE [LORA] Preparing to send CMD = 2 , SEQ_NO = 24
TRACE [LORA] CMD 0x01 [NFC_DETECTED] SEQ# 24 . uid = 03000000 , ts = 1577980531
TRACE [LORA] > sending 11 bytes: b'01180073120e5e03000000'
TRACE [Clock] Starting time sync
TRACE [LORA] CMD 0x04 [TIME_REQUEST] ID# 1 . our_time = 1577980589 (2020, 1, 2, 15, 56, 29, 3, 2)
TRACE [LORA] < no downlink
TRACE [LORA] (rx_timestamp=148130812, rssi=-96, snr=7.0, sfrx=5, sftx=5, tx_trials=0, tx_power=0, tx_time_on_air=62, tx_counter=16187, tx_frequency=868300000)
TRACE [LORA] > sending 7 bytes: b'040100ad120e5e'
Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed)
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
......
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core�Previou

Andreas · January 2, 2020, 7:47pm

Dear Pascal,

sorry, there’s no full core dump. Have you been able to catch one?

With kind regards,
Andreas.

pascalschaefer · January 2, 2020, 9:34pm

Hi @Andreas

Thanks for your response.
Unfortunately there is no core dump printed.
All crashes just result in above messages.
Is there a possibility to extend the log to print the desired dump?

Many thanks and Kind Regards
Pascal

Andreas · January 2, 2020, 9:39pm

Hmm, sorry for that!

There is a compile-time flag to turn off printing core dumps. However, with the dragonfly builds, it should be turned on.

I don’t have a clue why no core dump is printed when your device observes a core panic.

pascalschaefer · January 3, 2020, 1:09pm

Hi @Andreas

Thanks a lot for your message and all your investment trying to help solving this problem.
I am open to help in any way to also improve the long term stability.

Do you see any chance I can support with these two approaches:

Debugging the device? Is there any profound way to do this on my own?
Recompile the pycom firmware and esp32 code with

CONFIG_ESP32_PANIC_SILENT_REBOOT

instead of

CONFIG_ESP32_PANIC_PRINT_REBOOT

to check whether the device restarts, when no message is printed? I know that this is not a solution, because the actual cause needs to be fixed. For us it would just be not so dramatic if the device restarts properly when a core panic happens.

Many thanks and best regards, Pascal

pascalschaefer · January 7, 2020, 8:24am

Hi @Andreas

Short question.
Do you know whether these long term stability issues and core panics is recognized by pycom and the esp32 community?

It seems to be a big topic for many people.
To me it is unclear whether and when we can expect any improvements.

Do you know something about that?

Thanks a lot in advance and Best Regards, Pascal

Thias · January 7, 2020, 11:18am

In my setup using the current Terkin version from the master branch in LoRaWAN mode and sending payloads every 5 minutes I’ve not seen any crashes since I put the LoPy4-1.20.1.r1-0.7.0-vanilla-dragonfly-onewire firmware onto the device a couple of weeks ago. Terkin nowadays also uses the nvram_save/nvram_restore capabilities and everything works as expected and super stable.

Please try giving the device some rest after the s.send call before invoking s.recv, e.g. 5 seconds. Also, setblocking is always set to False in our Terkin routines. I remember people running into trouble with a blocking socket.

pascalschaefer · January 7, 2020, 12:29pm

Dear @Thias

Thank you so much for your suggestion. I have tried it out. Unfortunately the device is still crashing.

I think now, it is more related to a second task run by Timer.Alarm(), which results in a Stack Size or Memory Size problem inside the ESP32. I make now an automated test without synchronizing the Device Time in intervals with Timer.Alarm(). If this works, we restart the device once a day, synchronize the time at boot and I think we are kind of fine.

Thanks for all the help you guys.

Best Regards, Pascal

Andreas · January 7, 2020, 9:34pm

Dear Pascal,

If you are able to share some code we could use to reproduce this error, we might be able to look into this. As we’ve invested a considerable amount of time into getting a stable firmware, I would be interested to investigate this further, if time permits.

Oh, you are not alone. Others might be observing the same thing.

With kind regards,
Andreas.

Andreas · January 9, 2020, 11:37pm

It looks like they succeeded after switching to Dragonfly.

So, we are looking at the next observations of core dumps on LoPy4 devices.

https://forum.pycom.io/topic/5423/core-dump-problem

Andreas · January 9, 2020, 11:44pm

Dear @pascalschaefer,

@sita on the Pycom user forum (see above: https://forum.pycom.io/topic/5560/guru-meditation-error/13) was able to get stable runtime behavior using the Dragonfly firmware on two LoPy4 devices. However, it took him/her two attempts to do so. So, I am humbly asking if you might want to try again.

Would you be able to share more of the corresponding MicroPython code in order to reproduce the crash or to just inspect where the error might be originating from?

Still no luck with core dumps?

With kind regards,
Andreas.

pascalschaefer · January 13, 2020, 9:17am

Dear @Andreas

Thanks for your message.
My company doesn’t allow me to publish the full source.
I am building currently a minimal version to reproduce the issue.

Best Regards, Pascal.

pascalschaefer · January 13, 2020, 2:16pm

Hi @Andreas

I have prepared the source to reproduce the core panic.
Because I can’t upload the zip file, I will send it to your email address.

When you start the device it happens mostly after 2-3 Minutes.
Sometimes not, maybe you need 2-3 tries.

In main.py you need to update the options json properties to insert your lora_app_eui / lora_app_key . Mode is otaa but can be changed to abp.

After 10 seconds, 10 Events are added and 25 Seconds later the time synchronization starts.
After some time the core panic happens.

In the package is also an example log of the core panic.

Please let me know if you have questions or whether it also works on your side.
Used setup:

(sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1-0.3.0-vanilla-psramfix-unicore', version='69dd8b5d-dirty on 2019-11-05', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')

Thanks for all your support
Best Regards, Pascal

Thias · January 13, 2020, 2:22pm

You did not try the dragonfly variant which is supposed to fix the core panics yet?

pascalschaefer · January 13, 2020, 2:24pm

HI @Thias

oops, seems I have installed a wrong version. Will do the retest right away

pascalschaefer · January 13, 2020, 2:37pm

Hi @Thias

I have tested again with (sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1-0.7.0-vanilla-dragonfly-onewire-i2s', version='daf40f36-dirty on 2019-12-04', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')
After the third round (waiting 2-3 minutes, if no corepanic, then restart device)
the core panic happens again.

Are you also interested in the source?

pascalschaefer · January 13, 2020, 2:43pm

Hi @Thias @Andreas

I have published the source in github: