
Detecting Fraudulent Blockchain Accounts on Ethereum with Supervised Machine Learning


PZ808/ETH-scam-ml

 
 


Ethereum-Fraud-Detection (forked and modded)

Detecting Fraudulent Blockchain Accounts on Ethereum with Supervised Machine Learning, using the scikit-learn library


Introduction

Since 2021, more than 46,000 people have lost over $1 billion to cryptocurrency scams, nearly 60 times more than in 2018.[1] The Federal Trade Commission (FTC) found that the top cryptocurrencies used to pay scammers were Bitcoin (70%), Tether (10%) and Ethereum (9%).[1] The recent incident with FTX, a crypto exchange which misused more than $1 billion of its clients' funds, makes it all the more important to stay vigilant when navigating the cryptocurrency world.[2] To help deter such scams, we used supervised machine learning techniques, namely Logistic Regression, Naive Bayes, SVM, XGBoost, LightGBM, MLP, TabNet and Stacking, to detect and predict fraudulent Ethereum accounts. This adds business value by enhancing fraudulent-account detection on crypto exchanges and crypto wallets, helping people navigate the cryptocurrency world confidently and safeguard their personal assets. Our objective was an F1 score above 90% for machine learning models predicting fraudulent accounts on the Ethereum blockchain.
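For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall on the fraud class. The snippet below is a minimal standard-library sketch of that formula, not the evaluation code used in the notebooks; the function name `f1_score_binary` and the toy labels are made up for illustration.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = fraud, 0 = not fraud); illustrative only, not project data.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

score = f1_score_binary(y_true, y_pred)
print(f"F1 = {score:.3f}, meets the 90% objective: {score > 0.90}")
```

F1 is a reasonable target here because it penalises both missed fraudulent accounts (low recall) and legitimate accounts flagged as fraud (low precision).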


Data

There are two data sources: Kaggle and Etherscan.

Kaggle

The Kaggle dataset was downloaded from https://www.kaggle.com/datasets/vagifa/ethereum-frauddetection-dataset and can be found in ./Data/address_data_k.csv

Etherscan

Data were mined from Etherscan at https://etherscan.io/accounts/label/phish-hack (the data has since been taken off Etherscan, but we saved a copy) and can be found in ./Data/address_data_e.csv

Combined without Time Series

Data from Kaggle and Etherscan are combined and can be found in ./Data/address_data_combined.csv
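As a rough illustration of what combining two address-level tables involves, the sketch below concatenates rows, keeps one row per address, and reports the resulting fraud share. It is an assumption-laden example rather than the notebook code: the column names `Address` and `FLAG`, the helper names, and the sample rows are all hypothetical.

```python
def combine_by_address(*tables, key="Address"):
    """Concatenate row lists, keeping the first row seen for each address."""
    seen = set()
    combined = []
    for table in tables:
        for row in table:
            # Compare in lower case so checksum-cased and lower-cased
            # spellings of the same address count as one account.
            address = row[key].lower()
            if address not in seen:
                seen.add(address)
                combined.append(row)
    return combined


def fraud_share(rows, label="FLAG"):
    """Fraction of rows labelled as fraud (label value 1)."""
    return sum(int(row[label]) for row in rows) / len(rows)


# Hypothetical rows standing in for the two CSV files.
kaggle_rows = [
    {"Address": "0xAAA", "FLAG": 0},
    {"Address": "0xbbb", "FLAG": 1},
]
etherscan_rows = [
    {"Address": "0xBBB", "FLAG": 1},  # same account as 0xbbb above
    {"Address": "0xccc", "FLAG": 1},
]

combined_rows = combine_by_address(kaggle_rows, etherscan_rows)
print(len(combined_rows), round(fraud_share(combined_rows), 4))
```

Deduplicating on the address matters because the same flagged account can appear in both sources, which would otherwise inflate the fraud share.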

Time-Series

One key aspect that we realised was missing from the dataset was the time-series element. Although each observation in our data is a user account, the data was generated by aggregating individual transactions, so valuable information could have been “flattened out”. The flow of Ethereum transactions is intrinsically time-series data, and properties such as the seasonality of transactions could be used in our model. This information was extracted using the 'tsfresh' library. The transaction data can be found in ./Data/Transaction_data and the newly extracted features can be found in ./Data/new_ts_features_only.csv.
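To make the idea of time-series features concrete, here is a small hand-rolled sketch that derives a few such features from one account's ordered transactions. It deliberately does not use tsfresh, which computes a far larger feature set automatically; the feature names, the `(timestamp, value)` tuple layout, and the sample data are assumptions made for this example.

```python
from statistics import mean, pstdev


def lag1_autocorrelation(values):
    """Correlation between each value and the next one in the series."""
    if len(values) < 2:
        return 0.0
    mu = mean(values)
    variance_sum = sum((v - mu) ** 2 for v in values)
    if variance_sum == 0:
        return 0.0  # constant series: autocorrelation is undefined
    covariance_sum = sum(
        (a - mu) * (b - mu) for a, b in zip(values, values[1:])
    )
    return covariance_sum / variance_sum


def time_series_features(transactions):
    """transactions: list of (unix_timestamp, ether_value) for one account."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    ordered = sorted(transactions, key=lambda tx: tx[0])
    values = [value for _, value in ordered]
    gaps = [b[0] - a[0] for a, b in zip(ordered, ordered[1:])]
    return {
        "value_mean": mean(values),
        "value_std": pstdev(values),
        "value_lag1_autocorr": lag1_autocorrelation(values),
        "gap_mean_seconds": mean(gaps) if gaps else 0.0,
    }


# Hypothetical account alternating between small and large transfers.
sample_transactions = [(1000, 1.0), (1060, 3.0), (1120, 1.0), (1180, 3.0)]
features = time_series_features(sample_transactions)
print(features)
```

Features like the lag-1 autocorrelation depend on the order of transactions, which is exactly the information that per-account totals and averages discard.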

Combined with Time Series

Data from Kaggle and Etherscan including time series can be found in ./Data/address_data_combined_ts.csv

Data Description

We started with a Kaggle dataset of 9841 observations. Each observation is a unique Ethereum account, and each variable is an aggregate statistic over all transactions performed by that account, such as total Ether value received or average time between transactions. The data also distinguishes between account-to-account transactions and account-to-smart-contract transactions. However, the dataset was highly imbalanced, with only 2179 out of 9841 accounts (22.14%) marked as fraud. To address the imbalance, we used an API provided by Etherscan, a “Block Explorer and Analytics Platform for Ethereum”, which allowed us to retrieve the transactions made by any given account address on the Ethereum blockchain. As a result, the number of fraudulent accounts in our dataset climbed to 4339 observations, making the combined dataset less imbalanced (45.97% fraud).
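The sketch below shows, under stated assumptions, how a list of raw transactions for one address could be turned into the kind of per-account aggregates described above. The request URL follows the Etherscan account `txlist` action as we understand it from the public documentation; the endpoint version and parameters may have changed, so treat `build_txlist_url` as an assumption. The output feature names and the sample transactions are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

WEI_PER_ETHER = 10**18  # transaction values are reported in wei


def build_txlist_url(address, api_key, base_url="https://api.etherscan.io/api"):
    """Build (but do not send) a transaction-list request for one address."""
    params = {
        "module": "account",
        "action": "txlist",
        "address": address,
        "startblock": 0,
        "endblock": 99999999,
        "sort": "asc",
        "apikey": api_key,
    }
    return f"{base_url}?{urlencode(params)}"


def aggregate_account(address, transactions):
    """Collapse one account's transactions into aggregate statistics."""
    address = address.lower()
    received = [tx for tx in transactions if tx["to"].lower() == address]
    sent = [tx for tx in transactions if tx["from"].lower() == address]
    timestamps = sorted(int(tx["timeStamp"]) for tx in transactions)
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {
        "sent_count": len(sent),
        "received_count": len(received),
        "total_ether_received": sum(int(tx["value"]) for tx in received)
        / WEI_PER_ETHER,
        "avg_minutes_between_tx": (sum(gaps) / len(gaps) / 60) if gaps else 0.0,
    }


# Hypothetical transactions with string-valued fields, for illustration only.
sample = [
    {"timeStamp": "1600000000", "from": "0xabc", "to": "0xDEF",
     "value": "1000000000000000000"},
    {"timeStamp": "1600000600", "from": "0xdef", "to": "0xabc",
     "value": "500000000000000000"},
    {"timeStamp": "1600001800", "from": "0x123", "to": "0xdef",
     "value": "2000000000000000000"},
]

url = build_txlist_url("0xDEF", api_key="YOUR_API_KEY")
stats = aggregate_account("0xDEF", sample)
print(url)
print(stats)
```

Each row of the address-level CSV files corresponds to one such dictionary of aggregates, computed over all of that account's transactions.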


