Skip to content

BLEUart data loss #173

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
Nenik opened this issue Aug 30, 2018 · 10 comments
Closed

BLEUart data loss #173

Nenik opened this issue Aug 30, 2018 · 10 comments

Comments

@Nenik
Copy link
Contributor

Nenik commented Aug 30, 2018

When an application reacts to an incoming BLEUart message by trying to send multiple messages, though only the first message makes it through, the rest is silently dropped. If I add delays between the messages, they get delivered just fine, but there is no reasonable deterministic way to resolve optimal delays. Worse yet, the dropped messages are reposted as sent by the framework code.
Sample code where I observe this issue (telemetry-like, the rest of my protocol is simple one-packet request -> one packet response) looks like:

    int ret = sendReply("!S\n");
    DEBUG("Stats: '!S' : %d\n", ret); 

    ret = sendReply("+Bd:" __DATE__ "\n");
    DEBUG("Stats: " "'+Bd:" __DATE__ "' : %d\n", ret); 

    ret = sendReply("+Bt:" __TIME__ "\n");
    DEBUG("Stats: " "'+Bt:" __TIME__ "' : %d\n", ret);

    snprintf(reply, sizeof(reply), "+C:%dmAs\n", _batt_mas);
    ret = sendReply(reply);
    DEBUG("Stats: %d of %s", ret, reply);

where sendReply is just:

int sendReply(const char *msg) {
   return bleuart.write(msg, strlen(msg));
}

and the sample debug output (if enabled) looks like:

Stats: '!S' : 3
Stats: '+Bd:Aug 29 2018' : 16
Stats: '+Bt:23:13:36' : 13
Stats: 10 of +C:290mAs

so the BLEuart reports the data as sent, but only "!S" is received on the remote (central) side.
Adding delay(500) between the writes gets all the messages through.

Now, I know the SoftDevice is capable of queueing a bunch of notifications towards the next connection events - when coding against pure SD and nRF SDK, I have used sd_ble_gatts_hvx() in a loop till I get BLE_ERROR_NO_TX_BUFFERS, then I'd reschedule and retry with the rest of my data later. That worked really well with NUS-like service.
That's neither possible with BLECharacteristic.notify (which erases the full error code to boolean success flag) nor handled behind the scenes by the framework (by code inspection, it looks to me the code would drop any extra data w/o possible recovery).

@hathach
Copy link
Member

hathach commented Aug 30, 2018

thanks for posting, can you also post your complete test sketch so that we could check it later.

@hathach
Copy link
Member

hathach commented Aug 30, 2018

also try to enable debug level = 1 to see if it print out any error messages.

@Nenik
Copy link
Contributor Author

Nenik commented Aug 31, 2018

Getting my project over in a portable way would be more work than setting up a trivial sample. You can think of something as simple as:

void uartRxCallback(void) {
    if (bleuart.read() == '!') {
        bleuart.write("Hello1", strlen("Hello1"));
        bleuart.write("Hello2", strlen("Hello2"));
        bleuart.write("Hello3", strlen("Hello3"));
        bleuart.write("Hello4", strlen("Hello4"));
    }
}

Anyway, I have debugged into the framework, and the issue is easy to describe. BLECharacteristic::notify() is declared returning boolean where true means success, while the code inside tries to return some (non-zero, or course) error codes on problems. Thus a problem is interpreted as success.
Now there are two things that could go wrong in the implementation. The first is Bluefruit's own arbitrary notification packet limit, Bluefruit.Gap.getHvnPacket() will fail silently on the second write with the default configuration. That's easy to workaround by raising that limit through configPrphConn(...),
so I tried setting that to 6.
The second is deeper, sd_ble_gatts_hvx still failing on the 4th packet with NRF_ERROR_RESOURCES. This one is at least turned into a false return value, so the upper layer (BLEUart) properly reports 0 bytes delivered and allows for proper retry. That's the way it should behave, except for giving up earlier than I expected - this was my mistake - I need to call configPrphConn() before begin().
But either way, I now have a semi-reliable workaround.

Now to the utility of scheduling more than 1 notification packet: power consumption. It allows me to go with longer connection interval. If I queue multiple notifications at once, the protocol allows sending them inside a single connection event - Central sends an empty prompt, peripheral replies with the first notification, but signals it has more. This allows the central to keep going till the end of the timeslice, which helps the two devices to exchange 4-6 such notifications in, like 1.25us. Imagine splitting that traffic over multiple connection events - you'd either get bad latency, or have to go with shorter connection interval, wasting battery. How many packets can make it depends on the stacks on both devices, but from my experience nRF to nRF would happily do 6, while cc2541 was only doing 5 and even always discarding the 5th. Still better than 1....

@hathach
Copy link
Member

hathach commented Aug 31, 2018

Thanks for putting effort into debugging. Since we are focusing on nrf52840 and usb now. We will check this out later when we focus on ble features starting with bsp 0.9.x. There is a lot to do with SD v6 from multiple PHYs to extended advertising etc....

PS: if you figure out the fix, please consider to submit a PR to help others

@hathach
Copy link
Member

hathach commented Aug 31, 2018

@Nenik Do you use Arduino IDE if not could you describe a bit about your set-up there. Arduino is not a great IDE but accessible by most people. We have an plan to include segger project later in the repo for adv user to do debugging with jlink. But I would take a bit of time to set it up.

@Nenik
Copy link
Contributor Author

Nenik commented Sep 1, 2018

I'll see if I can post a simple fix later. I'd need to update first, still using 0.8.3...

I currently use Arduino IDE, as it was the fastest way to get me off the ground. Being old school, I don't need much bells and whistles for this relatively small project (~10 source files, one per subsystem) and I valued having a working build and deployment system on Linux out of the box.

But my setup is "interesting" nonetheless - I primarily prototype/test against nRF52 Feather in a breadboard with peripherals next to it, while I also sometimes need to deploy to the actual HW, a custom board based on Taiyo Yuden "EYSHSNZWZ" module. Due to limited I/O and board routing, those two target have wildly different pin assignment, so I have created a custom board definition (in packages/adafruit/hardware/nrf52/0.8.3/boards.txt) and my code has an adaptation layer defining the actual assignments. I could still build for the target board from the Arduino IDE by switching the target board and rebuilding, but I have to deploy manually (no serial port brought out on the current custom board, only SWD pads for pogo-pin probe).
I am also comfortable debugging directly from gdb through JLinkGdbServer whenever needs come.
Once I'll move more development to the target board, I'll invest more time to command line builds and JLink deployment - I have seen few examples here and on the forums...

As for the custom board - it would be great if there was a common, officially sanctioned way to do that besides directly modifying the package, though this is hardly a priority for you :-)

@hathach
Copy link
Member

hathach commented Sep 1, 2018

@Nenik thanks for explaining your setup. I see it is a bit complicated there. Normally, I would like to have a minimal sketch since it is easier for us to reproduce and fix the issue. Don't worry about the fix, we will analyze this later.

For custom board, we don't have any plans to support that, you definitely need to modify variant to make it fit your design. But we will try to make it easier for other to do that later on.

Nenik added a commit to Nenik/Adafruit_nRF52_Arduino that referenced this issue Sep 1, 2018
In case the application is sending more notifications back-to-back
and the framework has no free buffer slots error code was returned
instead of false, which was interpreted as success.
Return false in such a case, in line with SD buffer overflow handling
below.
@Nenik
Copy link
Contributor Author

Nenik commented Sep 1, 2018

Posted a pull request with a trivial, but tested fix.
Attaching a test case

NotifyTestCase.zip

@hajdbo
Copy link
Contributor

hajdbo commented Jan 8, 2019

Link to the PR for the fix: #178

hathach added a commit that referenced this issue Jan 9, 2019
Fix upstream issue #173 - lost UART packets.
@hathach
Copy link
Member

hathach commented Jan 9, 2019

thanks @hajdbo for the link 👍 . closed by PR #178

@hathach hathach closed this as completed Jan 9, 2019
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants