Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

chat_message_join can indicate that a single message belongs to multiple chats #135

Closed
ReagentX opened this issue Jun 21, 2023 · 2 comments · Fixed by #136
Closed

chat_message_join can indicate that a single message belongs to multiple chats #135

ReagentX opened this issue Jun 21, 2023 · 2 comments · Fixed by #136
Assignees
Labels
bug Something isn't working crate: database Related to the database crate

Comments

@ReagentX
Copy link
Owner

ReagentX commented Jun 21, 2023

image

We iterate over the messages table with this query:

/// Stream messages from the database with optional filters
///
/// # Example:
///
/// ```
/// use imessage_database::util::dirs::default_db_path;
/// use imessage_database::tables::table::{Diagnostic, get_connection};
/// use imessage_database::tables::messages::Message;
/// use imessage_database::util::query_context::QueryContext;
///
/// let db_path = default_db_path();
/// let conn = get_connection(&db_path).unwrap();
/// let context = QueryContext::default();
/// Message::stream_rows(&conn, &context).unwrap();
pub fn stream_rows<'a>(
db: &'a Connection,
context: &'a QueryContext,
) -> Result<Statement<'a>, TableError> {
if !context.has_filters() {
return Self::get(db);
} else {
let filters = context.generate_filter_statement();
// If database has `thread_originator_guid`, we can parse replies, otherwise default to 0
Ok(db.prepare(&format!(
"SELECT
*,
c.chat_id,
(SELECT COUNT(*) FROM {MESSAGE_ATTACHMENT_JOIN} a WHERE m.ROWID = a.message_id) as num_attachments,
(SELECT COUNT(*) FROM {MESSAGE} m2 WHERE m2.thread_originator_guid = m.guid) as num_replies
FROM
message as m
LEFT JOIN {CHAT_MESSAGE_JOIN} as c ON m.ROWID = c.message_id
{filters}
ORDER BY
m.date;
"
))
.unwrap_or(db.prepare(&format!(
"SELECT
*,
c.chat_id,
(SELECT COUNT(*) FROM {MESSAGE_ATTACHMENT_JOIN} a WHERE m.ROWID = a.message_id) as num_attachments,
(SELECT 0) as num_replies
FROM
message as m
LEFT JOIN {CHAT_MESSAGE_JOIN} as c ON m.ROWID = c.message_id
{filters}
ORDER BY
m.date;
"
)).map_err(TableError::Messages)?))
}
}

The query uses a LEFT JOIN, so we see one row for each occurrence of a message in the chat_message_join table.

LEFT JOIN {CHAT_MESSAGE_JOIN} as c ON m.ROWID = c.message_id

This results in duplicated messages in a conversation, since this software successfully deduplicates chats:

/// Given the initial set of duplicated chats, deduplicate them based on the participants
///
/// This returns a new hashmap that maps the real chat ID to a new deduplicated unique chat ID
/// that represents a single chat for all of the same participants, even if they have multiple handles
fn dedupe(duplicated_data: &HashMap<i32, Self::T>) -> HashMap<i32, i32> {
let mut deduplicated_chats: HashMap<i32, i32> = HashMap::new();
let mut participants_to_unique_chat_id: HashMap<Self::T, i32> = HashMap::new();
// Build cache of each unique set of participants to a new identifier:
let mut unique_chat_identifier = 0;
for (chat_id, participants) in duplicated_data {
match participants_to_unique_chat_id.get(participants) {
Some(id) => {
deduplicated_chats.insert(chat_id.to_owned(), id.to_owned());
}
None => {
participants_to_unique_chat_id
.insert(participants.to_owned(), unique_chat_identifier);
deduplicated_chats.insert(chat_id.to_owned(), unique_chat_identifier);
unique_chat_identifier += 1;
}
}
}
deduplicated_chats
}

In my testing, the only messages that exhibited this were iMessages resent as SMS.

@ReagentX ReagentX added bug Something isn't working crate: database Related to the database crate labels Jun 21, 2023
@ReagentX ReagentX self-assigned this Jun 21, 2023
@ReagentX ReagentX changed the title chat_message_join can indicate that a single message can belong to multiple chats chat_message_join can indicate that a single message belongs to multiple chats Jun 21, 2023
@ReagentX ReagentX moved this to Todo in 1.5: Arroyo Lupine Jun 21, 2023
ReagentX added a commit that referenced this issue Jun 23, 2023
@ReagentX ReagentX moved this from Todo to In Progress in 1.5: Arroyo Lupine Jun 24, 2023
@ReagentX
Copy link
Owner Author

Using a subquery solves the problem:

SELECT
    *,
    c.chat_id,
    (SELECT COUNT(*) FROM message_attachment_join a WHERE m.ROWID = a.message_id) as num_attachments,
    (SELECT b.chat_id FROM chat_recoverable_message_join b WHERE m.ROWID = b.message_id) as deleted_from,
    (SELECT COUNT(*) FROM message m2 WHERE m2.thread_originator_guid = m.guid) as num_replies
FROM
    message as m
    LEFT JOIN (SELECT * from chat_message_join GROUP BY message_id) as c ON m.ROWID = c.message_id
ORDER BY
    m.date;

However, this comes with a very large performance impact: a normal export (no attachment management, debug build) nosedived from an average rate of 6,928.03 messages per second to about 202.26.

@ReagentX
Copy link
Owner Author

Since the result set from the query is ordered:

We can just keep track of the last-rendered ROWID and skip rendering if we see the same ID twice. This has negligible performance impact.

ReagentX added a commit that referenced this issue Jun 24, 2023
@ReagentX ReagentX mentioned this issue Jun 24, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in 1.5: Arroyo Lupine Jun 24, 2023
@ReagentX ReagentX mentioned this issue Jun 24, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working crate: database Related to the database crate
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

1 participant