6 ML Model Mistakes People Make With Crypto Data
Most crypto ML models fail before the market gets a chance to break them. The bugs aren't in the architecture search or the hyperparameter sweep. They're in how people feed data in and what they ask the model to predict.
Here are six mistakes that show up constantly, what's going wrong, and what to do instead.
1. Feeding Raw Values Instead of Aggregates
If your model's input is raw price ticks, individual transfer amounts, or unprocessed event counts, you've already lost.
Raw blockchain data is noisy, irregularly spaced, and dominated by outliers. A single $400M whale transfer makes every other transfer invisible on the same scale. A raw price series has no concept of how fast the price moved, how much volume supported it, or whether the move was a single-block anomaly.
What to do instead: Aggregate to meaningful windows sized to the chain you're working with. Ethereum finalizes a block every ~12 seconds, Solana every ~400ms, and BSC every ~3 seconds, so the right aggregation window follows block time rather than defaulting to minute candles built for TradFi. Instead of raw price, use OHLCV aggregated per block or per N blocks. Instead of raw transfer amounts, aggregate to total value transferred per block range, median transfer size, and count of transfers above a threshold. For DEX data specifically, pool-level volume aggregates, tick-weighted liquidity depth, and fee revenue per block are the features that carry signal.
The rule: if your input would look chaotic on a chart with no smoothing, it'll look chaotic to your model too.
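As a concrete sketch of the aggregation step, here's one way to collapse raw transfers into per-block-window features with pandas. The column names (`block`, `value_usd`), the 50-block window, and the $1M whale threshold are all illustrative assumptions, not a prescription:

```python
# Sketch: bucket raw transfers into N-block windows and emit summary
# features instead of feeding individual transfer amounts to the model.
# Column names (block, value_usd) and thresholds are assumptions.
import pandas as pd

def aggregate_transfers(transfers: pd.DataFrame,
                        blocks_per_window: int = 50,
                        whale_threshold: float = 1_000_000.0) -> pd.DataFrame:
    """Collapse raw transfers into per-window aggregate features."""
    window = transfers["block"] // blocks_per_window
    grouped = transfers.groupby(window)["value_usd"]
    return pd.DataFrame({
        "total_value": grouped.sum(),         # overall flow in the window
        "median_transfer": grouped.median(),  # robust to whale outliers
        "transfer_count": grouped.count(),
        "whale_count": transfers.groupby(window)["value_usd"]
                                .apply(lambda v: (v > whale_threshold).sum()),
    })
```

The median is the point: one $400M transfer barely moves it, while it dominates a sum or mean.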
2. Not Deciding What the Model Is Actually Learning
This is a fundamental architecture mistake that no amount of tuning can fix. There are two very different things a model can be doing with time-series data, and most people conflate them:
Finding relationships between features. "When volume is spiking on-chain while the funding rate is negative, what happens to price?" This is a cross-feature dependency problem. Models like iTransformer (applying attention across the feature dimension rather than the time dimension) are designed for exactly this. If you want to know how your features interact, this is the architecture.
Finding temporal patterns and regimes. "After three consecutive lower-highs in the 15-minute OHLCV series, what follows?" This is a time-pattern and sequence problem. LSTMs, TCNs (Temporal Convolutional Networks), and standard time-axis Transformers are built for this. If you want to know when something is likely to happen based on what came before, this is the architecture.
Using an LSTM to find cross-feature relationships gives you underperforming soup. Using iTransformer on a regime detection problem means missing the sequential structure entirely.
Decide what question you're actually asking. Then pick the architecture that answers it.
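The distinction shows up directly in how the input tensor is tokenized. A toy NumPy sketch, assuming the common `(batch, time, features)` layout:

```python
# Toy illustration of the two framings, assuming input shaped
# (batch, time, features). A temporal model treats each timestep as a
# token; an iTransformer-style model transposes the array so each
# feature's entire series becomes one token, and attention then runs
# across features instead of across time.
import numpy as np

x = np.random.randn(32, 96, 8)  # 32 samples, 96 timesteps, 8 features

temporal_tokens = x                      # tokens = timesteps: (32, 96, 8)
inverted_tokens = x.transpose(0, 2, 1)   # tokens = features:  (32, 8, 96)
```

Same data, different question: the first layout asks "what follows this sequence," the second asks "how do these series relate to each other."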
3. Skipping Normalization and Letting Values Run Wild
Crypto values span comically different magnitudes. ETH price might be $3,200. A pool's 24-hour volume might be $180,000,000. Gas in gwei might be 14. A liquidity position's tick range might be 887272.
Feeding those raw into a neural network means the optimizer spends most of its capacity just figuring out the scale difference between features. Gradients become unstable. Training slows down by orders of magnitude. Some features effectively go invisible.
What to do instead: Normalize everything before training.
- Min-max scaling works when you know the range of a feature won't change dramatically out-of-sample (rare in crypto).
- Z-score normalization (subtracting mean, dividing by standard deviation) is more robust for most features.
- Rolling z-score is better still for non-stationary crypto data, normalizing each feature relative to its own recent window (e.g., 30-day rolling mean and std), so the model sees how today's value compares to recent history rather than an all-time baseline.
Always fit your scaler on training data only. Fitting on the full dataset and then splitting leaks future information into your training set, a common mistake that produces suspiciously good backtest results that collapse in live trading.
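Both points fit in a few lines of pandas. This is a sketch, not a library: the 30-period window is the illustrative default from above, and the split index is whatever boundary your backtest uses:

```python
# Sketch: rolling z-score per feature, plus a global z-score whose
# statistics come from the training split only (no leakage).
import pandas as pd

def rolling_zscore(series: pd.Series, window: int = 30) -> pd.Series:
    """Normalize each value against its own trailing window."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    return (series - mean) / std

def fit_transform_no_leakage(df: pd.DataFrame, split: int):
    """Fit mean/std on rows before `split`, apply to both halves."""
    train, test = df.iloc[:split], df.iloc[split:]
    mu, sigma = train.mean(), train.std()  # statistics from train ONLY
    return (train - mu) / sigma, (test - mu) / sigma
```

Note that the test set is scaled with the training statistics; computing fresh statistics on the test split would be the same leakage in a different costume.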
4. Predicting Raw Price When You Should Be Predicting a Derived Metric
Asking a model to predict tomorrow's ETH price in USD is almost always the wrong target. Raw price is non-stationary: its statistical properties shift constantly as the market grows, contracts, and changes regime. A model trained on 2022 price levels has no meaningful framework for 2024 price levels.
What to predict instead:
Log returns. log(P_t / P_{t-1}) is stationary, scale-invariant, and corresponds directly to trading P&L. It's what most professional quant strategies predict.
Volatility-adjusted returns. Log return divided by recent realized volatility, accounting for the fact that a 2% move in a quiet market is very different from a 2% move during a liquidation cascade.
Directional probability. Instead of predicting the exact return, predict the probability that the next candle closes above the current close. This turns the problem into binary classification, which is often easier to model reliably and maps cleanly to actual trading decisions.
Regime labels. If your goal is knowing when to be in or out of the market, predict regime (trending/mean-reverting/volatile/flat) rather than price direction. This is a classification problem and far more tractable.
The target you choose should match the decision you're trying to make. If you're asking "should I enter a long position here," predict probability of up-move, not price level.
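The first three targets fall out of the close series in a few lines. A sketch, assuming a pandas Series of closes and a 20-period realized-volatility window (both illustrative):

```python
# Sketch: derive the targets above from a close-price series.
# The column names and the 20-period vol window are assumptions.
import numpy as np
import pandas as pd

def make_targets(close: pd.Series, vol_window: int = 20) -> pd.DataFrame:
    log_ret = np.log(close / close.shift(1))          # log return
    realized_vol = log_ret.rolling(vol_window).std()  # recent realized vol
    return pd.DataFrame({
        "log_return": log_ret,
        "vol_adj_return": log_ret / realized_vol,
        # 1 if the NEXT close is above the current one: the label a
        # directional classifier would train against.
        "direction_next": (close.shift(-1) > close).astype(int),
    })
```

Note the `shift(-1)` on the direction label: the label looks one step forward, while every feature must look only backward. Mixing those directions up is another way to leak the future into training.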
5. Assuming a Bigger Network Will Find Better Relationships
The instinct to add more layers when a model is underperforming is understandable but usually wrong for crypto data.
The problem isn't that the model is too small to find patterns. The problem is usually that the patterns in the training data are fragile, inconsistently present, or the dataset is too small to support a complex model without severe overfitting.
A 3-layer LSTM with 128 hidden units, properly regularized, trained on clean aggregated features will almost always outperform a 12-layer Transformer on the same crypto dataset. The Transformer has more capacity to memorize noise.
What actually helps:
- Dropout (0.2 to 0.5 between layers) during training forces robustness
- L2 regularization on weights to penalize complexity
- Early stopping on validation loss, stopping when validation stops improving, not when training loss does
- Fewer, higher-quality features: 8 well-constructed features almost always beat 80 raw ones
- Ensembling small models: three small models with different feature sets, averaged at prediction time, are more robust than one large model
More parameters aren't a free lunch. In crypto data specifically, you have limited samples relative to the complexity of the market. Constrain the model to make it generalize.
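Early stopping is the one item on that list people most often implement wrong, stopping on training loss instead of validation loss. The logic is framework-agnostic; a minimal sketch, where the patience value is an assumption to tune:

```python
# Minimal, framework-agnostic early-stopping logic: stop once validation
# loss has failed to improve for `patience` consecutive evaluations.
def early_stop_index(val_losses, patience: int = 3) -> int:
    """Return how many evaluations run before training stops."""
    best, best_i = float("inf"), -1
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i = loss, i          # new best: reset the clock
        elif i - best_i >= patience:
            return i + 1                    # patience exhausted: stop here
    return len(val_losses)                  # never triggered: ran to the end
```

In practice you'd also restore the weights from the best-validation checkpoint, not the final one, since the last `patience` evaluations were by definition worse.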
6. Building a Model Without Planning for Decay
This is the mistake that kills live deployments. The model backtests well. It forward-tests okay. You put it live. Three hours later, it's taking the wrong side of every trade.
Two compounding problems cause this:
Working with a small effective dataset. Crypto markets change regimes faster than equity markets. A dataset from Q1 2024 may be almost useless for training a model you're deploying in Q4 2024. Bull markets, bear markets, sideways chop, and liquidity crises all have different statistical properties. If you train on 6 months of data, you may have only 1 to 2 full regime cycles, which isn't enough to learn generalizable patterns. Minimum viable dataset: 2+ full market cycles, with regime-balanced sampling.
Ignoring concept drift and shelf life. Even a well-trained model on a large dataset goes stale because the market adapts: MEV strategies change, liquidity dynamics shift, new protocols create new patterns. Assume your model has a shelf life. Build monitoring into your system: track prediction accuracy on live data in rolling windows, set a threshold for retraining, and automate the retraining pipeline. A model that retrains weekly on fresh data with a fixed architecture will outperform a "perfect" static model over any horizon longer than a few weeks.
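The monitoring piece can start very small. A sketch of a rolling hit-rate monitor for a directional model; the 200-prediction window and the 52% floor are illustrative assumptions to tune per strategy:

```python
# Sketch: rolling-accuracy drift monitor. Record each directional
# prediction against the realized outcome; flag retraining when the
# hit rate over the last `window` predictions drops below a floor.
# Window size and threshold are assumptions, not recommendations.
from collections import deque

class DriftMonitor:
    def __init__(self, window: int = 200, min_accuracy: float = 0.52):
        self.hits = deque(maxlen=window)  # rolling record of correct/incorrect
        self.min_accuracy = min_accuracy

    def record(self, predicted_up: bool, actual_up: bool) -> None:
        self.hits.append(predicted_up == actual_up)

    def needs_retrain(self) -> bool:
        # Only judge once the window is full, to avoid noisy early readings.
        if len(self.hits) < self.hits.maxlen:
            return False
        return sum(self.hits) / len(self.hits) < self.min_accuracy
```

Wire `needs_retrain()` into whatever triggers your retraining pipeline, and log the rolling accuracy either way: the trend matters as much as the threshold crossing.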
The fix isn't better training. It's treating the model as infrastructure that requires maintenance, not as a one-time artifact.
The Common Thread
All six mistakes come from treating crypto ML like a standard supervised learning problem on clean tabular data. Crypto data is high-noise, non-stationary, regime-shifting, and sparse relative to the complexity of what you're trying to predict.
The models that work are modest in size, trained on carefully constructed aggregated features, predict derived targets, and carry explicit regime awareness and built-in retraining cycles.
Start simpler than you think you need to. Add complexity only when you can show on held-out data that it's helping.
Have a crypto ML setup you want to pressure-test? Reach out.