'Magic' operation - automatically detect and run operations #239
Merged
…and some encoding types.
…ntities, URL encoding, escaped Unicode, and Quoted Printable encoding.
…anguages by default, to lower false positives and improve performance.
…various simple encodings like XOR or bit rotates.
…e' even though their output cannot be analysed
…s tooltips explaining the properties.
Summary
The 'Magic' operation attempts to automatically detect what format the input data is in and which operations can be used to make more sense of it. It does this through three methods: magic bytes, regular expressions that detect encodings, and byte frequency analysis.
Once a possible encoding has been detected, the 'Magic' operation performs that operation and carries out the same process again. This can continue for several levels, controlled by the 'Depth' argument.
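The detect, run, and repeat loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `detectors` table, the `speculate` function and the result shape are all hypothetical names, and only two detectors are included to keep it short.

```javascript
// Hypothetical detector table: each entry pairs a test with the
// operation that would undo the detected encoding.
const detectors = [
  {
    name: "From Hex",
    matches: (s) => /^(?:[0-9a-fA-F]{2})+$/.test(s),
    run: (s) => Buffer.from(s, "hex").toString("latin1"),
  },
  {
    name: "From Base64",
    matches: (s) =>
      s.length > 0 &&
      /^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$/.test(s),
    run: (s) => Buffer.from(s, "base64").toString("latin1"),
  },
];

// Explore every matching operation, then repeat on its output until
// the depth budget is used up. Returns one entry per branch visited.
function speculate(input, depth, recipe = []) {
  const results = [{ recipe, output: input }];
  if (depth <= 0) return results;
  for (const detector of detectors) {
    if (detector.matches(input)) {
      const next = detector.run(input);
      results.push(...speculate(next, depth - 1, [...recipe, detector.name]));
    }
  }
  return results;
}

// "hello" -> hex -> Base64, then let the sketch unwrap it again.
const encoded = Buffer.from(Buffer.from("hello").toString("hex")).toString("base64");
for (const { recipe, output } of speculate(encoded, 3)) {
  console.log(recipe.join(" -> ") || "(no operations)", "=>", output);
}
```

A smaller depth cuts the search short: with a depth of 1 the sketch stops after 'From Base64' and never tries 'From Hex' on the result.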
Examples
This example shows the 'Magic' operation detecting three levels of encoding. The results are listed in order of likelihood. The first row shows that the three operations 'From Base64', 'Gunzip' and 'From Hex' will result in an output that looks quite likely to be written in English. The second row shows that just running 'From Base64' results in an output that looks like a gzip file. The third row shows that the raw data without any operations applied doesn't look very much like any language, although it is closest to Portuguese.
This example shows a PNG image which has been URL and Base32 encoded. The 'Magic' operation has correctly detected these encodings and has also discovered that the 'Render Image' operation can be used to further improve the recipe.
This example shows the 'Magic' operation correctly discovering Hindi text underneath three levels of encoding and compression.
Details
The three detection methods mentioned above are explained here in further detail.
Magic bytes
This detection method was already available in CyberChef in the form of the 'Detect file type' operation. It has been incorporated into this operation to provide further metadata to make decisions from.
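As a rough illustration of how magic-byte detection works in general, the sketch below compares the start of the data against a few well-known file signatures. The `SIGNATURES` table and `detectFileType` function are hypothetical names chosen for this example; they are not the 'Detect file type' implementation, which covers far more formats.

```javascript
// A handful of well-known file signatures (leading "magic" bytes).
const SIGNATURES = [
  { type: "PNG image", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { type: "gzip", bytes: [0x1f, 0x8b] },
  { type: "ZIP archive", bytes: [0x50, 0x4b, 0x03, 0x04] },
];

// Return the first signature whose bytes match the start of the data,
// or null when nothing matches.
function detectFileType(data) {
  for (const sig of SIGNATURES) {
    const longEnough = data.length >= sig.bytes.length;
    if (longEnough && sig.bytes.every((b, i) => data[i] === b)) {
      return sig.type;
    }
  }
  return null;
}

console.log(detectFileType(Uint8Array.from([0x1f, 0x8b, 0x08, 0x00])));
console.log(detectFileType(Buffer.from("plain text")));
```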
Regular expressions to detect encodings
Patterns have been added to all relevant operations in the OperationConfig.js file. These patterns specify as strictly as possible what the data should look like if it is to match the operation. The 'From Base64' operation, for example, has a pattern of this kind in its configuration.
Alternative patterns can be added for use with different arguments. For example, Base64 encoding using the BinHex alphabet is specified as a separate pattern with its own arguments.
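To make the idea concrete, here is a hypothetical sketch of what a pattern entry with alternatives could look like. The field names (`patterns`, `match`, `flags`, `args`), the `matchingArgs` helper and the regular expressions are assumptions made for illustration and are not copied from OperationConfig.js. The URL-safe alphabet stands in for the alternative here, rather than BinHex.

```javascript
// Hypothetical config shape: each pattern is a strict regex plus the
// arguments the operation should be run with when that pattern matches.
const fromBase64Config = {
  name: "From Base64",
  patterns: [
    {
      // Standard alphabet, padding required.
      match: "^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$",
      flags: "",
      args: ["A-Za-z0-9+/="],
    },
    {
      // URL-safe alphabet, no padding.
      match: "^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-]{2,3})?$",
      flags: "",
      args: ["A-Za-z0-9-_"],
    },
  ],
};

// Return the argument sets of every pattern the input satisfies.
function matchingArgs(config, input) {
  if (input.length === 0) return [];
  return config.patterns
    .filter((p) => new RegExp(p.match, p.flags).test(input))
    .map((p) => p.args);
}

console.log(matchingArgs(fromBase64Config, "aGVsbG8="));    // padded, standard alphabet
console.log(matchingArgs(fromBase64Config, "aGVsbG8"));     // unpadded, URL-safe alphabet
console.log(matchingArgs(fromBase64Config, "not base64!")); // no pattern matches
```

Keeping the patterns strict matters: a loose pattern would match most plain text and send the search down many useless branches.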
Byte frequency analysis
Using Pearson's Chi-Squared test, we can determine how closely a given set of data matches the byte frequency of a certain language. To generate the truth data, I downloaded dumps of Wikipedia in 284 different languages, stripped out the wiki formatting, then measured the frequency of every byte. This gave me a set of data, unique to each language, which shows how common each byte is when the characters are encoded in UTF-8.
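The scoring step can be sketched as below, assuming each language profile is an array of 256 expected byte proportions. The function names, the tiny floor applied to zero-probability bytes, and the toy profile built from a single sentence are all assumptions for illustration; the real truth data comes from the Wikipedia dumps described above. A lower score means a closer match.

```javascript
// Count how often each byte value (0-255) occurs.
function byteCounts(bytes) {
  const counts = new Array(256).fill(0);
  for (const b of bytes) counts[b]++;
  return counts;
}

// Build a toy profile (expected proportion of each byte) from sample text.
function profileFrom(text) {
  const bytes = Buffer.from(text, "utf8");
  return byteCounts(bytes).map((c) => c / bytes.length);
}

// Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected.
// Bytes the profile has never seen get a tiny floor to avoid dividing by zero.
function chiSquared(bytes, expectedProportions) {
  const observed = byteCounts(bytes);
  let score = 0;
  for (let i = 0; i < 256; i++) {
    const expected = Math.max(expectedProportions[i], 1e-6) * bytes.length;
    score += (observed[i] - expected) ** 2 / expected;
  }
  return score;
}

const sample = "the quick brown fox jumps over the lazy dog and then some more english text";
const profile = profileFrom(sample);
const textScore = chiSquared(Buffer.from("some other text in the same language"), profile);
const noiseScore = chiSquared(Buffer.from([0, 255, 17, 200, 131, 9, 250, 3]), profile);
console.log(textScore < noiseScore); // text resembling the profile scores lower
```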
Future improvements