The Fraud Blocker Company - FBC

Preventing fraudulent transactions for good 👊

1.0 The context

1.1 What is fraud?

Fraud is an intentionally deceptive action designed to provide the perpetrator with an unlawful gain or to deny a right to a victim. In addition, it is a deliberate act (or failure to act) with the intention of obtaining an unauthorized benefit, either for oneself or for the institution, by using deception or false suggestions or suppression of truth or other unethical means, which are believed and relied upon by others. Depriving another person or the institution of a benefit to which he/she/it is entitled by using any of the means described above also constitutes fraud.

Types of fraud include tax fraud, credit card fraud, wire fraud, securities fraud, and bankruptcy fraud. Fraudulent activity can be carried out by one individual, multiple individuals or a business firm as a whole.

1.2 Key Facts (as Aug/2020)

Fraud involves deceit with the intention to illegally or unethically gain at the expense of another.
In finance, fraud can take on many forms including making false insurance claims, cooking the books, pump & dump schemes, and identity theft leading to unauthorized purchases.
Fraud costs the economy billions of dollars each and every year, and those who are caught are subject to fines and jail time.
Consumer fraud occurs when a person suffers from a financial loss involving the use of deceptive, unfair, or false business practices.
With identity theft, thieves steal your personal information, assume your identity, open credit cards, bank accounts, and charge purchases.
Mortgage scams are aimed at distressed homeowners to get money from them.
Credit and debit card fraud is when someone takes your information off the card and makes purchases or offers to lower your credit card interest rate.
Fake charities and lotteries prey on peoples' sympathy or greed.
Debt collection fraud tries to collect on unpaid bills whether they are yours or not.
COVID-19 scams are a new type of fraud designed to prey on your fear or financial need.

1.3 The Fraud Blocker Company - FBC

The Fraud Blocker Company (FBC) is a company specialized in detecting fraud in financial transactions made through mobile devices. The company has a service called “Blocker Fraud” in which it guarantees the blocking of fraudulent transactions.

The company's business model is of the Service type with the monetization made for the performance of the service provided, that is, the customer pays a fixed fee on the success in detecting fraud in the customer's transactions. In addition, the FBC is expanding and to acquire customers more quickly, it has adopted a very aggressive strategy, FBC will:

Receive 25% of the value of each transaction that is truly detected as fraud (true positives).

Receive 5% of the value of each transaction detected as fraud, but the transaction is truly legitimate (false positives).

Refund 100% of the value to the customer, for each transaction detected as legitimate, however the transaction is truly a fraud (false negatives).

With this aggressive strategy, the company assumes the risks of failing to detect fraud and is remunerated for assertive fraud detection. For the client, it is an excellent business to hire the FBC. Although the fee charged is very high on success, 25%, the company reduces its costs with fraudulent transactions correctly detected and the damage caused by an error in the anti-fraud service will be covered by the FBC itself.

For FBC, in addition to getting many customers with this risky strategy to guarantee reimbursement in the event of a failure to detect customer fraud, it depends only on the precision and accuracy of the models built by its Data Scientists, that is, the greater the precision of “Fraud Blocker” model, the greater the company's revenue. However, if the model has low precision, the company could have a huge loss.

PS 1: All the references are stated at the end of this README.

PS 2: You can find useful information at section 1 of my notebook.

2.0 The challenges

Given that we have a historical data set containing 6MM+ transactions, how might we solve FBC business challenges in order to maximize its revenue and minizes its risks and losses?

3.0 The solution

In this project, I managed to address the challenge by developing a classification model that can be deployed to FBC. The model has a precision that ranges from 96.52% to 98.38% and can help FBC in calculating the expected revenue taking into account the business model. Thus, the company can experiment different values for charging its clients while knowing the model precision.

3.1 What drove the solution

3.1.1 Exploratory Data Analysis

Descriptive Analysis

Key points:

The min amount for a trasaction is 0.01.
The min balance for both origin and destination is 0.00.
The max amount for a transaction is 92,445,520.00.

Hypothesis Map

This map to help us to decide which variables we need in order to validate the hypotheses.

Univariate Analysis - Target

Number of fraudulent transactions: 8097 (0.13% of the total transactions)

Number of non fraudulent transactions: 6252434 (99.87% of the total transactions)

Univariate Analysis - Numerical

As observed,

There is a higher concentration for amount between 59873.14 and 442412.39.
In general, the major concentration of balance is around zero.

Univariate Analysis - Categorical

Key points:

Cash-out and payment represent the major part of the total transactions.
The great majority of the transactions are not flagged as fraud.
There is only one type of origin: customer.
There are more transactions whose destination is customer.
The origin old balance status is non zero for the majority of transactions.
There are more transactions whose the origin new balance status is zero.
There are more transactions whose the destination old balance status is non zero.
There are more transactions whose the destination new balance status is non zero.
No transactions has the the same origin as destination.
There are more transactions whose origin old balance is higher than the new.
The destination old balance is higher than the new for the majority of the transctions.
The transaction direction C2C is almost the double of C2M.
Number of unique values for name_orig: 6,251,489 (1.0% of the total transactions)
Number of unique values for name_dest: 2,677,202 (0.43% of the total transactions)

3.1.2 Hypothesis validation - Bivariate Analysis

Main Hypothesis

H1. Payment represent 50% of the total fraudulent transactions. (FALSE)

Transaction Type	Percentage of the total
CASH_OUT	50.02
TRANSFER	49.98
PAYMENT	0.00
DEBIT	0.00
CASH_IN	0.00

As observed, there is no fraudulent transaction for payment. Fraudulent transactions occur for cash out and transfer.

Thus, the hypothesis is FALSE.

H2. Transfer represent 50% of the total non fraudulent transactions. (FALSE)

Transaction Type	Percentage of the total
CASH_OUT	35.20
TRANSFER	33.85
PAYMENT	21.99
DEBIT	8.31
CASH_IN	0.65

As observed, transfer represents 8.3% of the total non fraudulent transactions. Cash-out and payment are the majority of transaction types.

Thus, the hypothesis is FALSE.

H3. Customer to Merchant (C2M) represent at least 50% of the total fraudulent transactions. (FALSE)

Transaction direction	Percentage of the total
C2C	100.00
C2M	0.00

As observed, C2C represents 100% of the total fraudulent transactions, while there is no C2M.

Thus, the hypothesis is FALSE.

H8. Destination non zero new balance represents 50% of the total fraudulent transactions. (TRUE)

Destination New Balance status	Percentage of the total
Non zero	50.19
Zero	49.81

As observed, destination non zero new balance represents ~50% of the total fraudulent transactions.

Thus, the hypothesis is TRUE.

H11. At least 95% of the total fraudulent transactions are flagged as fraud. (FALSE)

Flagged as fraud?	Percentage of the total
Yes	99.80
No	0.20

As observed, ~99.8%of the total fraudulent transactions are not flagged as fraud.

Thus, the hypothesis is FALSE.

Hypothesis summary

ID	Hypothesis	Conclusion
H1	Payment represent 50% of the total fraudulent transactions	FALSE
H2	Transfer represent 50% of the total non fraudulent transactions	FALSE
H3	Customer to Merchant (C2M) represent at least 50% of the total fraudulent transactions	FALSE
H4	Customer to Customer (C2C) represent at least 50% of the total non fraudulent transactions	TRUE
H5	The average amount of a non fraudulent transaction is equivalent to 20% of the average amount of a fraudulent	FALSE
H6	Merchant destination represents 40% of the total fraudulent transactions	FALSE
H7	Customer destination represents 40% of the total non fraudulent transactions	FALSE
H8	Destination non zero new balance represents 50% of the total fraudulent transactions	TRUE
H9	Origin non zero old balance represents 80% of the total fraudulent transactions	FALSE
H10	Origin zero new balance represents 40% of the total fraudulent transactions	FALSE
H11	At least 95% of the total fraudulent transactions are flagged as fraud	FALSE
H12	At least 20% of the total non fraudulent transactions are flagged as fraud	FALSE

3.1.3 Machine Learning

Tests were made using different algorithms.

Performance Metrics

The highlighted cells correspond to the max value at each column.

As observed, for the context of our project, we want to maximize the precision and recall scores, so both metric are important for us. Thus, we can look for the highest f1-score. Looking at the results, XGBClassifier is the algorithm that satisfies our need.

Confusion Matrices

As observed, XGBClassifier gives us higher TP and TN rates while keeping the FP and, more importantly, the FN at minimum.

3.1.4 Business Performance

Let's recall the FBC business model.

Receive 25% of the value of each transaction that is truly detected as fraud (true positives).
Receive 5% of the value of each transaction detected as fraud, but the transaction is truly legitimate (false positives).
Refund 100% of the value to the customer, for each transaction detected as legitimate, however the transaction is truly a fraud (false negatives).

Considering the parameters:

Median amount of a transaction: $7,4871.8 (we're using the median because the amount distribution is highly skewed)
Portfolio: 6,362,620 transactions (fraudulent + non fraudulent)

True positive amount = [0.25 x (median amount) x (number of TP transactions)]

False positive amount = [0.05 x (median amount) x (number of FP transactions)]

False negative amount = [1 x (median amount) x (number of FN transactions)]

Total revenue = True positive amount + False positive amount - False negative amount

Doing the calculations, we have:

Median transaction amount: $74,871.80

Best scenario total revenue = $879,336,709.79

Worst scenario total revenue = $1,404,441,971.97

Note that the worst scenario of our ML model gives us higher revenue, this is due to the FBC princing model and the higher number of FP that model generates.

3.1.5 Machine Learning Performance for the chosen algorithm

The chosen algorithm was the XGBoost Classifier. In addition, I made a performance calibration on it.

Precision, Recall, ROC AUC and other metrics

These are the metrics obtained from the test set.

precision	recall	f1-score	roc auc	cohen kappa	accuracy
0.8894	0.9930	0.9383	0.9989	0.9378	0.9989

The summary below shows the metrics comparison after running a cross validation score with stratified K-Fold with 10 splits in the full data set.

Algorithm	avg precision	avg recall	avg f1-score	avg roc auc
XGBoost Classifier	0.9795 (+/- 0.0090)	0.9447 (+/- 0.0087)	0.9618 (+/- 0.0065)	0.9996 (+/- 0.0006)
XGBoost Classifier + Tuned	0.9819 (+/- 0.0075)	0.9329 (+/- 0.0235)	0.9568 (+/- 0.0128)	0.9996 (+/- 0.0007)
XGBoost Classifier + Tuned + Calibrated	0.9745 (+/- 0.0093)	0.9570 (+/- 0.0114)	0.9657 (+/- 0.0070)	0.9995 (+/- 0.0010)

Although the Tuned HP + Calibrated model has a slightly lower precision, it has a higher recall and f1-score which is fair enough for our project needs. In addition, for being calibrated, in the future predictions it will be more stable and confident which is good for both businesses.

4.0 Next Steps

4.1 Develop an app that intakes a portfolio of transactions and assigns for each transaction its respective probability of being fraudulent.

4.2 Run a Design Discovery to uncover facts that could be missing in our analysis in order to enrich the data that we have and improve the model performance.

4.3 Build a model retraining pipeline.

References

Context

Business case

https://sejaumdatascientist.com/crie-uma-solucao-para-fraudes-em-transacoes-financeiras-usando-machine-learning/

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
img		img
.gitignore		.gitignore
README.md		README.md
categorical_report.html		categorical_report.html
pa002_blocker_fraud_company_bruno_katekawa_c01.ipynb		pa002_blocker_fraud_company_bruno_katekawa_c01.ipynb
requirements.txt		requirements.txt

brunokatekawa/fraud_blocker_company

Folders and files

Latest commit

History

Repository files navigation

The Fraud Blocker Company - FBC

Preventing fraudulent transactions for good 👊

1.0 The context

1.1 What is fraud?

1.2 Key Facts (as Aug/2020)

1.3 The Fraud Blocker Company - FBC

2.0 The challenges

3.0 The solution

3.1 What drove the solution

3.1.1 Exploratory Data Analysis

Descriptive Analysis

Hypothesis Map

Univariate Analysis - Target

Univariate Analysis - Numerical

Univariate Analysis - Categorical

3.1.2 Hypothesis validation - Bivariate Analysis

Main Hypothesis

H1. Payment represent 50% of the total fraudulent transactions. (FALSE)

H2. Transfer represent 50% of the total non fraudulent transactions. (FALSE)

H3. Customer to Merchant (C2M) represent at least 50% of the total fraudulent transactions. (FALSE)

H8. Destination non zero new balance represents 50% of the total fraudulent transactions. (TRUE)

H11. At least 95% of the total fraudulent transactions are flagged as fraud. (FALSE)

Hypothesis summary

3.1.3 Machine Learning

Performance Metrics

Confusion Matrices

3.1.4 Business Performance

True positive amount = [0.25 x (median amount) x (number of TP transactions)]

False positive amount = [0.05 x (median amount) x (number of FP transactions)]

False negative amount = [1 x (median amount) x (number of FN transactions)]

Total revenue = True positive amount + False positive amount - False negative amount

3.1.5 Machine Learning Performance for the chosen algorithm

Precision, Recall, ROC AUC and other metrics

4.0 Next Steps

References

Context

Business case

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages