[Misdetection] zip file misdetected as <any type you want> #765

Open
kazet opened this issue Oct 24, 2024 · 0 comments
Labels
misdetection: This issue is about a misdetection on a content type currently supported
needs triage: This issue still needs triage by one of the maintainers

kazet commented Oct 24, 2024

Hello,

This Python script prepends a short byte prefix to any zip file so that it is misdetected as some other file type:

import binascii
import random
from magika import Magika

with open("test.zip", "rb") as f:
    x = f.read()

m = Magika()
prefix = b""

ZIP_FILE_FORMATS = ["zip", "jar", "rpm", "epub", "ods"]

for i in range(10000):
    old_res = m.identify_bytes(prefix + x)
    new_prefix = prefix

    operation = random.choice(
        ["REMOVE_FIRST", "REMOVE_LAST", "ADD_FIRST", "ADD_LAST"] + ["REPLACE"] * 2
    )
    if operation == "REMOVE_FIRST" and len(new_prefix) > 1:
        new_prefix = new_prefix[1:]
    elif operation == "REMOVE_LAST" and len(new_prefix) > 1:
        new_prefix = new_prefix[:-1]
    elif operation == "ADD_FIRST":
        new_prefix = bytes([random.randint(0, 255)]) + new_prefix
    elif operation == "ADD_LAST":
        new_prefix = new_prefix + bytes([random.randint(0, 255)])
    elif operation == "REPLACE" and len(new_prefix) >= 1:
        # Use a separate name so the loop counter `i` is not shadowed.
        pos = random.randint(0, len(new_prefix) - 1)
        new_prefix = (
            new_prefix[:pos] + bytes([random.randint(0, 255)]) + new_prefix[pos + 1 :]
        )
        assert len(new_prefix) == len(prefix)

    new_res = m.identify_bytes(new_prefix + x)
    if (
        new_res.output.ct_label not in ZIP_FILE_FORMATS
        and new_res.dl.ct_label not in ZIP_FILE_FORMATS
    ):
        print("success: prefix=", binascii.hexlify(new_prefix), "result=", new_res)
        break

    # Hill climbing: keep the mutated prefix only if it lowered the score
    # of the reported label.
    if new_res.output.score < old_res.output.score:
        prefix = new_prefix

with open("out.zip", "wb") as f:
    f.write(new_prefix + x)

The script above produces zip files that are misdetected as JPEG, pcap, etc., even though the prepended bytes are not valid JPEG or pcap magic numbers.
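
The misdetection is only interesting if the prefixed file still opens as a zip. A minimal sketch for checking that, assuming Python's standard `zipfile` module as the reader (other zip tools may behave differently). The helper `still_opens_as_zip` and the in-memory archive are hypothetical stand-ins for `out.zip` and are not part of the original script:

```python
import io
import zipfile


def still_opens_as_zip(data: bytes) -> bool:
    """Return True if zipfile can open `data` and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() returns the name of the first corrupt member, or None.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False


# Build a small archive in memory (a stand-in for test.zip).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello")
original = buf.getvalue()

# Prepend one of the prefixes reported by the script.
prefixed = bytes.fromhex("bed801") + original

print(still_opens_as_zip(original), still_opens_as_zip(prefixed))
```

`zipfile` locates the archive through the end-of-central-directory record at the end of the file, so a few extra bytes at the front should not stop it from reading the members.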

Example output from two runs of the script:

success: prefix= b'bed801' result= MagikaResult(path='-', dl=ModelOutputFields(ct_label='jpeg', score=0.8492175936698914, group='image', mime_type='image/jpeg', magic='JPEG image data', description='JPEG image data'), output=MagikaOutputFields(ct_label='unknown', score=0.8492175936698914, group='unknown', mime_type='application/octet-stream', magic='data', description='Unknown binary data'))

success: prefix= b'04b224' result= MagikaResult(path='-', dl=ModelOutputFields(ct_label='pcap', score=0.5953978300094604, group='application', mime_type='application/vnd.tcpdump.pcap', magic='pcap capture file', description='pcap capture file'), output=MagikaOutputFields(ct_label='unknown', score=0.5953978300094604, group='unknown', mime_type='application/octet-stream', magic='data', description='Unknown binary data'))
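
To make the "not proper magic numbers" point concrete, here is a small sketch that compares the two reported prefixes with commonly documented leading signatures. The `KNOWN_MAGIC` table and the `matches_magic` helper are assumptions for illustration only. They are not taken from Magika, and they assume the original archive starts with a local file header (`PK\x03\x04`):

```python
# Commonly documented leading signatures (illustrative, not Magika's own data).
KNOWN_MAGIC = {
    "jpeg": [bytes.fromhex("ffd8ff")],
    "pcap": [
        bytes.fromhex("a1b2c3d4"),  # microsecond timestamps, big-endian
        bytes.fromhex("d4c3b2a1"),  # microsecond timestamps, little-endian
        bytes.fromhex("a1b23c4d"),  # nanosecond timestamps, big-endian
        bytes.fromhex("4d3cb2a1"),  # nanosecond timestamps, little-endian
    ],
}


def matches_magic(data: bytes, label: str) -> bool:
    """Return True if `data` starts with any known signature for `label`."""
    return any(data.startswith(sig) for sig in KNOWN_MAGIC[label])


# Prefixes reported by the script, keyed by the label the model predicted.
REPORTED = {"jpeg": "bed801", "pcap": "04b224"}

for label, hex_prefix in REPORTED.items():
    # Assumed start of the prefixed file: the prefix, then a zip local file header.
    data = bytes.fromhex(hex_prefix) + b"PK\x03\x04"
    print(label, "signature present:", matches_magic(data, label))
```

Neither prefix begins with the signature of the type the model predicted, which is consistent with the observation above.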

kazet added the misdetection and needs triage labels on Oct 24, 2024