Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Campaign Status "Finished" Before All Emails Were Sent #1931

Closed
subhash-ngowda opened this issue Jul 5, 2024 · 21 comments
Closed

Campaign Status "Finished" Before All Emails Were Sent #1931

subhash-ngowda opened this issue Jul 5, 2024 · 21 comments
Assignees
Labels
bug Something isn't working

Comments

@subhash-ngowda
Copy link

Version:

  • listmonk: [v3.0.0]
  • OS: [Ubuntu LTS]

Description of the bug:
We are encountering an issue where the campaign status is marked as "finished" before all emails have been sent.
Campaign Details

  • Campaign Type: Email
  • Subscriber List: Single opt-in list with 200,000 subscribers
  • Blocklisted Subscribers: Approximately 20,000
  • Batch Size: 200
  • Issue Observed: Campaign status changed to "finished" after sending only 37 emails

Analysis and Findings
Upon further investigation, we identified a potential bug in the query/code that might be causing this issue. Below are the details (Code Snippet from Pipe.go):

// NextSubscribers processes the next batch of subscribers in a given campaign.
// It returns a bool indicating whether any subscribers were processed
// in the current batch or not. A false indicates that all subscribers
// have been processed, or that a campaign has been paused or cancelled.

Campaign Scheduling Mechanism
The campaign scheduling mechanism uses last_subscriber_id and max_subscriber_id to process subscribers
in batches. The last_subscriber_id acts as a pointer until max_subscriber_id is reached.
The subIDs query fetches distinct subscriber IDs and their subscription status from subscriber_lists where the list
ID is among the lists from campLists. Additionally, it filters out subscribers with the status unsubscribed.
If all subscribers that meet the criteria in subIDs are blocklisted, the subs CTE will return zero records due to the condition that excludes blocklisted subscribers.
Observed Behaviour in Our Case
In our scenario, during the second iteration, all 200 records were blocklisted, resulting in zero records being returned.
Consequently, the NextSubscribers process returned false, causing the campaign status to be changed to "finished" without sending emails to the entire list.

Relevant Code in Pipe.go

// Line 85
if len(subs) == 0 {
return false, nil
}

Conclusion
The issue stems from the query filtering out blocklisted subscribers and returning zero records if all remaining subscribers in a batch are blocklisted. This causes the NextSubscribers process to return false prematurely, changing the campaign status to "finished".

Recommendations
To address this issue, we recommend modifying the query and/or logic to ensure that the campaign does not prematurely
finish when encountering batches of blocklisted subscribers. This will ensure that emails are sent to all eligible subscribers in the list.

@subhash-ngowda subhash-ngowda added the bug Something isn't working label Jul 5, 2024
@MaximilianKohler
Copy link
Contributor

This might be the main issue for #1762, and #1802.

@nayanthulkar28
Copy link

Hi @subhash-ngowda , I tried to replicate the above bug with batch_size=2 but didn't notice results as you mention in the solution above. In the CTE subIDs we are filtering out with status != unsubscribe which also indicated that subIDs won't include blocklist subscribers because blocklist unsubscribe lists while creating or updating. The subIDs proceed with all valid subscribers with batch size.

@subhash-ngowda
Copy link
Author

This might be the main issue for #1762, and #1802.
Yes

@subhash-ngowda
Copy link
Author

@nayanthulkar28 We were only able to reproduce this if the users are in multiple lists and they are blocklisted from any of the list. Take a look at this below attached image.
image

@nayanthulkar28
Copy link

Understood!

@nayanthulkar28
Copy link

I couldn't find the case where a subscriber is blocklisted and it has status!='unsubscribed' (Please let me know the when this case happens).

Considering this usecase, here we can fix this query accordingly:

Remove this limit in subIDs

subIDs AS (
    SELECT DISTINCT ON (subscriber_lists.subscriber_id) subscriber_id, list_id, status FROM subscriber_lists
    WHERE
        -- ARRAY_AGG is 20x faster instead of a simple SELECT because the query planner
        -- understands the CTE's cardinality after the scalar array conversion. Huh.
        list_id = ANY((SELECT ARRAY_AGG(list_id) FROM campLists)::INT[]) AND
        status != 'unsubscribed' AND
        subscriber_id > (SELECT last_subscriber_id FROM camps) AND
        subscriber_id <= (SELECT max_subscriber_id FROM camps)
    // ORDER BY subscriber_id LIMIT $2
),

Add it in subs

subs AS (
    SELECT subscribers.* FROM subIDs
    LEFT JOIN campLists ON (campLists.list_id = subIDs.list_id)
    INNER JOIN subscribers ON (
        subscribers.status != 'blocklisted' AND
        subscribers.id = subIDs.subscriber_id AND

        (CASE
            -- For optin campaigns, only e-mail 'unconfirmed' subscribers.
            WHEN (SELECT type FROM camps) = 'optin' THEN subIDs.status = 'unconfirmed' AND campLists.optin = 'double'

            -- For regular campaigns with double optin lists, only e-mail 'confirmed' subscribers.
            WHEN campLists.optin = 'double' THEN subIDs.status = 'confirmed'

            -- For regular campaigns with non-double optin lists, e-mail everyone
            -- except unsubscribed subscribers.
            ELSE subIDs.status != 'unsubscribed'
        END)
    )
    ORDER BY subscribers.id LIMIT $2
),

@MaximilianKohler
Copy link
Contributor

I couldn't find the case where a subscriber is blocklisted and it has status!='unsubscribed' (Please let me know the when this case happens).

I think it happens when a subscriber clicks the "unsubscribe" link in an email and chooses "permanent block list", or it may happen when you manually blocklist someone. I think it used to happen when the system would automatically blocklist them for bounce/complaint, but I think at some point that changed to only blocklisting them and not unsubscribing them.

@knadh knadh self-assigned this Jul 12, 2024
@noxwaste
Copy link

noxwaste commented Aug 6, 2024

I'd like to add to this and state that I received the same problem. 166 of 238 e-mails were sent out. The problem with the "ubsubscribe" or "block list" theory is that this is a brand new installation with the very first campaign I've ever created. Our employees haven't even had the chance to unsubscribe yet because they've never received an e-mail like this from us in the past.

@lucasferreira
Copy link

lucasferreira commented Aug 17, 2024

Hi folks, just for aggregate in this issue, we was facing the same problem with our campaigns and after reading this issue we delete all our blocked users from our lists and now our campaigns do not stop in midway anymore.

Today we could sent 1.5mi e-mails again without problems, so this thing about "import a lot of blocked users from csv" could be really a important fact in this investigation.

@knadh knadh closed this as completed in 139267d Sep 3, 2024
@knadh
Copy link
Owner

knadh commented Sep 3, 2024

Oops, closed this accidentally. Investigating this issue currently.

@knadh knadh reopened this Sep 3, 2024
@knadh
Copy link
Owner

knadh commented Sep 4, 2024

Did quite a few tests on this, and it's trickier than I thought.

subIDs AS (
    SELECT DISTINCT ON (subscriber_lists.subscriber_id) subscriber_id, list_id, status FROM subscriber_lists
    WHERE
        -- ARRAY_AGG is 20x faster instead of a simple SELECT because the query planner
        -- understands the CTE's cardinality after the scalar array conversion. Huh.
        list_id = ANY((SELECT ARRAY_AGG(list_id) FROM campLists)::INT[]) AND
        status != 'unsubscribed' AND
        subscriber_id > (SELECT last_subscriber_id FROM camps) AND
        subscriber_id <= (SELECT max_subscriber_id FROM camps)
        ORDER BY subscriber_id LIMIT $2
)

The max_subscriber_id is computed once when the campaign starts. That's the ID of the last subscriber expected to be processed when the campaign finishes. If there are 200k subscribers, assuming sequential IDs, last_subscriber_id = 0 and max_subscriber_id = 200k when the campaign starts. That's the entire range. With a batch size of 1000 (which is default), no matter how many blocklisted subscriptions exist between 0 and 200k, this query will always return non-blocklisted rows. Unless, max_subscriber_id changes after a campaign's creation, where there are subscribers beyond the max point.

@subhash-ngowda In your case, were subscribers blocklisted/changed after the campaign's creation?

@subhash-ngowda
Copy link
Author

@knadh No, the subscribers were not changed/blocklisted after the campaigns's creation.

@knadh
Copy link
Owner

knadh commented Sep 6, 2024

hmm, @subhash-ngowda then it is not possible that in the subIDs query, the conditions will ever return 0 rows before max_subscriber_id is reached, right? Referring to your observation here:

Additionally, it filters out subscribers with the status unsubscribed. If all subscribers that meet the criteria in subIDs are blocklisted, the subs CTE will return zero records due to the condition that excludes blocklisted subscribers.

@ohaeusler
Copy link

ohaeusler commented Sep 10, 2024

As mentioned in #1762, I've been able to mitigate the issue by increasing the batch size. As my mail server hasn't built up reputation yet and isn't used for newsletter regularly, I've decreased the batch size immediately after setting up listmonk a few months ago. Back at this time, it all worked fine and sending out a campaign to 500ish subscribers.

Now I've got two new lists, both double opt-in as opposed to the last time, where I managed this aside from listmonk.
The first list has 170 Subscribers, where 57 people have confirmed the double option mail. The second list has 343 subscribers, with 79 who have confirmed the double opt-in.

Before increasing the batch size to 100, the campaign finished early at the last confirmed subscriber, if there are no confirmed subscribers in the next batch of subscribers.

I can remember seeing a finished campaign with last_subscriber_id at 1102. After manually increasing the last_subscriber_id to 1119 it worked until finishing on 1150 again. Increased last_subcriber_id to 1168, did 1169, then it finished again. At this time I've increased the batch size to 100 which made it finish the campaign properly.

image

EDIT:

I've deleted any blocklisted subscribers before these steps, so that shouldn't be an issue

@MaximilianKohler
Copy link
Contributor

I couldn't find the case where a subscriber is blocklisted and it has status!='unsubscribed' (Please let me know the when this case happens).

In my case, it happened when I imported the list and selected "blocklisted", which automatically marks them as unsubscribed as well.

Before increasing the batch size to 100, the campaign finished early at the last confirmed subscriber, if there are no confirmed subscribers in the next batch of subscribers.

Ah! Then I think your experience does match mine. #1762 (comment)

Knadh commented here #1762 (comment) about how the process works.

I just checked the list of 150k subscribers that failed and there's a large (4890) number of blocklisted+unsubscribed contacts grouped up right where the campaign ended prematurely. So the last subscriber who was sent an email was the one right before the beginning of the blocklisted group.

My batch size at the time was the default 500. The list of 150k subscribers includes subscribers from 10 to 1,443,624.

It definitely seems to be an issue of "if there are no valid/confirmed subscribers in the current/next batch, the campaign is marked finished".

@ohaeusler
Copy link

ohaeusler commented Sep 10, 2024

Yep, I can confirm that. The statement (with a batch size of 10 and removed condition for running campaign) returns no results, if last_subscriber_id is one below a batch of subscribers without status confirmed:
image

This is the result after increasing the last_subscriber_id for the campaign to one below the next confirmed subscriber.
image

Could a possible solution implement a case switch for opt-in/double-opt-in lists already in subIDs? Something like this (I#ve never used PostgresSQL before, so thanks ChatGPT :D). Oh, and please make sure to double check this. it's just a quick idea and a bit of a guess in the blue.

subIDs AS (
    SELECT DISTINCT ON (subscriber_lists.subscriber_id) subscriber_id, list_id, status FROM subscriber_lists
    WHERE
        list_id = ANY((SELECT ARRAY_AGG(list_id) FROM campLists)::INT[]) AND
        status != 'unsubscribed' AND
        subscriber_id > (SELECT last_subscriber_id FROM camps) AND
        subscriber_id <= (SELECT max_subscriber_id FROM camps) AND

        (CASE
            -- For optin campaigns, only include 'unconfirmed' subscribers.
            WHEN (SELECT type FROM camps) = 'optin' THEN status = 'unconfirmed' AND (SELECT optin FROM campLists) = 'double'

            -- For regular campaigns with double optin lists, only include 'confirmed' subscribers.
            WHEN (SELECT optin FROM campLists) = 'double' THEN status = 'confirmed'

            -- For regular campaigns with non-double optin lists, include all subscribers except unsubscribed.
            ELSE status != 'unsubscribed'
        END)
    ORDER BY subscriber_id LIMIT $2
)

@knadh
Copy link
Owner

knadh commented Sep 13, 2024

After a marathon debugging session with @vividvilla, it looks like we've nailed the issue. Thank you everyone who debugged and shared cues and clues on this thread!

This happens when:

  • There are blocklisted users who have list subscriptions that are not unsubscribed. Tested all paths that blocklists users (checkbox UI, query UI, list -> subscribers UI, users blocklisting themselves, bulk import UI), and they all set all subscriptions to unsubscribed correctly. However, the bounce routine does not do this when it's set to blocklist users. It blocklists users, but does not change the subscriptions to unsubscribed. This is a bug.
  • When, a whole batch is users who are in double opt-in lists, but have not confirmed yet (!) I think this is the main culprit, and a really subtle miss/bug!

Working on fixing both of these. @subhash-ngowda, in this screenshot, you have a blocklisted user with a non-unusbscribed list. Can you recollect how you ended up with that state? Same thing @nayanthulkar28 had pointed out. There's a clue or potentially another subtle bug there either. The only path I've found so far is bounces. Was that the case?

@MaximilianKohler
Copy link
Contributor

In my case, where blocklisted users have list subscriptions that are not unsubscribed:

I have a local CSV list of contacts. I've imported some of them as blocklisted. When I want to send an email out to all my contacts I use the local CSV file, create a new list and campaign, and import the subscribers from the CSV and "overwrite". So none of them are going to be unsubscribed from the new lists I create even though they're all blocklisted already.

One reason I do it this way is because I usually send out 2-3k emails per day (a new list & campaign each day), but my full list is 1.2+ million users. According to various guidelines, I need to gradually increase the number of emails I send per day, so I generally start with ~5k and then double that each day till I get to ~500k per day. I create a new list+campaign for each day.

@subhash-ngowda
Copy link
Author

@knadh
Below is the test environment setup we created to reproduce the issue.

  • Created a list of 1000 subscribers (list A) and imported it to a newly created system which does not have any customer data or campaign history.
  • Once the import was done, manually marked some 100 customers status as “blocklisted” at the end of the list.
  • Created another list of 1000 subscribers (list B) which also includes a few customers from list A with the status as “blocklisted” and imported into the system.
  • Repeated the above step one more time including some subscribers from list A and list B.

Note: We have tested this on both double-optin and single-optin lists and the result are the same. Subscribers are imported as confirmed users.

This is very similar to our real time use case where we do our campaigns daily and also use other systems along with Listmonk.

When we import our daily target segment to a new list, it is possible that we will have duplicate records in the import with respect to the data present in the system. If the duplicate user is already in the blocklisted status (due to the customer’s un-subscription action) then we end up a user who is “blocklisted” but not unsubscribed in one of the lists, he is present.

@knadh
Copy link
Owner

knadh commented Sep 17, 2024

Thanks @subhash-ngowda. I was able to reproduce the issue. Working on a fix.

knadh added a commit that referenced this issue Sep 24, 2024
This has been a hair-pulling rabbit hole of an issue. #1931 and others.
When the `next-campaign-subscribers` query that fetches $n subscribers
per batch for a campaign returns no results, the manager assumes
that the campaign is done and marks as finished.

Marathon debugging revealed fundamental flaws in qyery's logic that
would incorrectly return 0 rows under certain conditions.
- Based on the "layout" of subscribers for eg: a series of blocklisted
  subscribers between confirmed subscribers.
  A series of unconfirmed subscribers in a batch belonging to a double
  opt-in list.
- Bulk import blocklisting users, but not marking their subscriptions
  as 'unsubscribed'.
- Conditions spread across multiple CTEs resulted in returning an
  arbitrary number of rows and $N per batch as the selected $N rows
  would get filtered out elsewhere, possibly even becoming 0.

After fixing this and testing it on our prod instance that has
15 million subscribers and ~70 million subscriptions in the
`subscriber_lists` table, ended up discovered significant inefficiences
in Postgres query planning. When `subscriber_lists` and campaign list IDs
are joined dynamically (CTE or ANY() or any kind of JOIN that involves)
a query, the Postgres query planner is unable to use the right indexes.

After testing dozens of approaches, discovered that statically passing
the values to join on (hardcoding or passing via parametrized $1 vars),
the query uses the right indexes. The difference is staggering.
For the particular scenario on our large prod DB to pull a batch,
~15 seconds vs. ~50ms, a whopping 300x improvement!

This patch splits `next-campaign-subscribers` into two separate queries,
one which fetches campaign metadata and list_ids, whose values are then
passed statically to the next query to fetch subscribers by batch.

In addition, it fixes and refactors broken filtering and counting logic
in `create-campaign` and `next-campaign` queries.

Closes #1931, #1993, #1986.
@knadh
Copy link
Owner

knadh commented Sep 24, 2024

This has been fixed and has been tested in production on our large prod instance. The fix (bf26ec8) was way more complex than I imagined, and while working on it, I stumbled upon significant performance improvements in those queries (30x speed up on subscriber batch pulling on large installations). The changes are all documented in the commit message.

This fix is on the multiuser (upcoming v4.0.0) branch and is available as an RC if you're interested in testing it out along with the new user management and auth system: #543 (comment)

@knadh knadh closed this as completed Sep 30, 2024
knadh added a commit that referenced this issue Oct 13, 2024
This has been a hair-pulling rabbit hole of an issue. #1931 and others.
When the `next-campaign-subscribers` query that fetches $n subscribers
per batch for a campaign returns no results, the manager assumes
that the campaign is done and marks as finished.

Marathon debugging revealed fundamental flaws in qyery's logic that
would incorrectly return 0 rows under certain conditions.
- Based on the "layout" of subscribers for eg: a series of blocklisted
  subscribers between confirmed subscribers.
  A series of unconfirmed subscribers in a batch belonging to a double
  opt-in list.
- Bulk import blocklisting users, but not marking their subscriptions
  as 'unsubscribed'.
- Conditions spread across multiple CTEs resulted in returning an
  arbitrary number of rows and $N per batch as the selected $N rows
  would get filtered out elsewhere, possibly even becoming 0.

After fixing this and testing it on our prod instance that has
15 million subscribers and ~70 million subscriptions in the
`subscriber_lists` table, ended up discovered significant inefficiences
in Postgres query planning. When `subscriber_lists` and campaign list IDs
are joined dynamically (CTE or ANY() or any kind of JOIN that involves)
a query, the Postgres query planner is unable to use the right indexes.

After testing dozens of approaches, discovered that statically passing
the values to join on (hardcoding or passing via parametrized $1 vars),
the query uses the right indexes. The difference is staggering.
For the particular scenario on our large prod DB to pull a batch,
~15 seconds vs. ~50ms, a whopping 300x improvement!

This patch splits `next-campaign-subscribers` into two separate queries,
one which fetches campaign metadata and list_ids, whose values are then
passed statically to the next query to fetch subscribers by batch.

In addition, it fixes and refactors broken filtering and counting logic
in `create-campaign` and `next-campaign` queries.

Closes #1931, #1993, #1986.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants