Full implementation details:
https://github.com/russellballestrini/logs-to-bitmap
We use reproducible computer science in this repo.
Overview
What if we could detect malicious HTTP requests by literally looking at them? In this experiment, we transform HTTP request logs into bitmap images and use computer vision techniques to identify anomalies. The results are striking: 100% detection accuracy on a synthetic dataset, using simple visual pattern recognition.
This project, developed in collaboration with Ajax Davis and Lisa Watts, explores an unconventional approach to cybersecurity: converting text-based security data into visual representations that machine learning models can analyze as images.
The computer vision implementation is based on Adrian Rosebrock's excellent PyImageSearch tutorial on anomaly detection with OpenCV and scikit-learn, which provides the foundational techniques for extracting visual features and training Isolation Forest models.
The Hypothesis
Traditional anomaly detection analyzes log files as text, parsing fields and applying statistical models. But what if we treated logs as visual data instead? Our hypothesis:
- Different HTTP request patterns create visually distinct bitmap signatures
- Computer vision models can identify these visual differences
- Small, just-in-time trained models can achieve high accuracy
- Visual analysis might reveal patterns invisible to traditional text analysis
The beauty of this approach is its simplicity: we're literally taking screenshots of logs and asking "does this look normal?"
Quick Start
Clone the repository and get started in minutes:
# Clone the repository
git clone https://github.com/russellballestrini/logs-to-bitmap
cd logs-to-bitmap
# Install dependencies
pip install -r requirements.txt
# Generate 1K sample dataset with anomalies (automated)
make crawl-1k
# Train the anomaly detection model
make train-model-1k
# View the results
cat anomaly_scores/all_1k_scores.txt
How It Works
The system operates in three key stages:
Log Generation
A Pyramid web server captures HTTP requests and logs each one to a separate ASCII file and bitmap file:
log_data = {
    'Timestamp': datetime.datetime.now().isoformat(),
    'Request ID': request_id,
    'Endpoint': endpoint,
    'Client Address': request.client_addr,
    'User-Agent': user_agent,
    'Method': request.method,
    'URL': request.url,
    'Headers': headers,
}
Bitmap Conversion
Each log entry is rendered as a grayscale bitmap using structured field layout:
# Create grayscale bitmap from structured log data
img = Image.new('L', (width, height), color=255)  # White background
draw = ImageDraw.Draw(img)
# Draw each field with proper text wrapping
for line in formatted_lines:
    draw.text((10, y_position), line, font=font, fill=0)  # Black text
    y_position += line_height
img.save(bitmap_path, 'BMP')
Feature Extraction
Computer vision extracts visual features from each bitmap:
- Color histograms (HSV color space)
- Text density (ratio of non-white pixels)
- Spatial distribution (horizontal/vertical projections)
- Image dimensions and aspect ratio
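A minimal sketch of these extractors, using PIL and NumPy (a grayscale intensity histogram stands in for the HSV color histogram here, since the log bitmaps are grayscale; function and variable names are illustrative, not the repo's actual API):

```python
import numpy as np
from PIL import Image, ImageDraw

def extract_features(img):
    """Extract a visual feature vector from a grayscale log bitmap."""
    arr = np.asarray(img.convert("L"), dtype=np.float64)
    h, w = arr.shape
    # Text density: fraction of non-white pixels (text is black on white)
    text_density = np.mean(arr < 250)
    # Spatial distribution: variance of per-row and per-column ink totals
    ink = 255.0 - arr
    horizontal_var = np.var(ink.sum(axis=1))
    vertical_var = np.var(ink.sum(axis=0))
    # Intensity histogram (grayscale stand-in for the HSV histogram)
    hist, _ = np.histogram(arr, bins=8, range=(0, 256), density=True)
    return np.concatenate(
        [hist, [text_density, horizontal_var, vertical_var, w / h, h, w]]
    )

# Render a tiny fake log line and extract its features
img = Image.new("L", (200, 40), color=255)
ImageDraw.Draw(img).text((10, 10), "GET / curl/7.68.0", fill=0)
vec = extract_features(img)
print(vec.shape)  # (14,)
```

A longer user agent fills more pixels and shifts both the density and the projection variances, which is exactly the signal the Isolation Forest later separates on.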
Anomaly Detection
An Isolation Forest algorithm identifies outliers based on visual features:
# Train isolation forest on visual features
detector = IsolationForest(contamination=0.01, n_estimators=200)
detector.fit(features_scaled)
Pyramid Web Server Configuration
The system uses a Pyramid web server with custom middleware to capture and process every HTTP request in real-time:
def request_counter_middleware(handler, registry):
"""Middleware to count requests and assign sequence numbers"""
def middleware(request):
global request_counter
with counter_lock:
request_counter += 1
request.request_number = request_counter
return handler(request)
return middleware
Each incoming request triggers the log_request() function which:
- Extracts request metadata: timestamp, client IP, user agent, headers, method, URL
- Generates unique identifiers: sequential request numbers and UUID-based request IDs
- Creates structured log files: saves detailed ASCII logs to logs/ directory
- Renders bitmap images: converts log data to grayscale BMP files in bitmaps/ directory
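The ASCII-logging step can be sketched as follows (a simplified stand-in for the repo's log_request(); the field names mirror the log_data dict above, and the fake request object substitutes for Pyramid's):

```python
import datetime
import os
import uuid
from types import SimpleNamespace

def log_request(request, request_number, log_dir="logs"):
    """Write one structured ASCII log file per request (illustrative sketch)."""
    os.makedirs(log_dir, exist_ok=True)
    fields = {
        "Timestamp": datetime.datetime.now().isoformat(),
        "Request ID": str(uuid.uuid4()),
        "Request Number": request_number,
        "Client Address": request.client_addr,
        "User-Agent": request.user_agent,
        "Method": request.method,
        "URL": request.url,
    }
    path = os.path.join(log_dir, f"request_{request_number:04d}.txt")
    with open(path, "w") as f:
        for key, value in fields.items():
            f.write(f"{key}: {value}\n")
    return path

# Stand-in for a Pyramid request object
fake = SimpleNamespace(client_addr="127.0.0.1", user_agent="curl/7.68.0",
                       method="GET", url="http://localhost/")
print(log_request(fake, 1))  # logs/request_0001.txt
```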
The system uses Liberation Mono or DejaVu Sans Mono fonts to ensure consistent character spacing across all bitmap renders, making visual patterns more reliable for machine learning analysis.
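One way such a font fallback might look (a sketch; the font filenames assume standard Linux font packages and may need full paths on other systems):

```python
from PIL import ImageFont

def load_mono_font(size=14):
    """Try Liberation Mono, then DejaVu Sans Mono, then PIL's built-in default."""
    for name in ("LiberationMono-Regular.ttf", "DejaVuSansMono.ttf"):
        try:
            return ImageFont.truetype(name, size)
        except OSError:
            continue  # font not installed; try the next candidate
    return ImageFont.load_default()

font = load_mono_font()
```

A monospace font matters here: with fixed-width glyphs, the same log field always starts at the same x position, so the projection features compare like with like across bitmaps.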
Synthetic Data Generation
To test the anomaly detection system, we use a controlled crawler that generates HTTP traffic with known anomalies:
# 99% normal traffic uses consistent user agent
normal_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
# 1% anomalous traffic uses diverse user agents
anomaly_user_agents = [
"curl/7.68.0", # Command-line tool
"Python-urllib/3.8", # Python script
"Googlebot/2.1", # Search crawler
"Mozilla/5.0...Firefox/78.0", # Different browser
"MSIE 6.0", # Legacy browser
]
The crawler (crawler_1k.py) operates as follows:
- Generates 1,000 total requests with controlled timing (10-30ms delays)
- Randomly selects 10 requests (exactly 1%) to receive anomalous user agents
- Distributes traffic across two endpoints (66% to /, 33% to /hello)
- Reports progress every 100 requests with timing statistics
- Creates known ground truth by logging which requests received anomaly user agents
This approach provides an ideal test dataset where we know exactly which requests should be flagged as anomalous, enabling precise measurement of detection accuracy.
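The ground-truth bookkeeping can be sketched like this (illustrative only; crawler_1k.py's actual control flow may differ, and the user agent strings are abbreviated):

```python
import random

# Abbreviated versions of the user agents listed above
normal_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/119.0.0.0"
anomaly_user_agents = ["curl/7.68.0", "Python-urllib/3.8", "Googlebot/2.1"]

def plan_requests(total=1000, anomaly_count=10, seed=42):
    """Pre-select which request numbers receive anomalous user agents."""
    rng = random.Random(seed)
    anomaly_ids = set(rng.sample(range(1, total + 1), anomaly_count))
    plan = []
    for i in range(1, total + 1):
        # ~66% of traffic to /, the rest to /hello
        endpoint = "/" if rng.random() < 0.66 else "/hello"
        agent = (rng.choice(anomaly_user_agents) if i in anomaly_ids
                 else normal_user_agent)
        plan.append((i, endpoint, agent))
    return plan, sorted(anomaly_ids)  # request plan + ground truth

plan, ground_truth = plan_requests()
print(len(plan), len(ground_truth))  # 1000 10
```

Because the anomaly IDs are chosen up front, the scoring report can later be checked request-by-request against this list.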
Model Training and Scoring
The visual anomaly detection training process operates in three phases:
Phase 1: Feature Extraction
Each bitmap undergoes computer vision analysis to extract a 136-dimensional feature vector:
# Extract visual features from bitmap
features = np.concatenate([
color_histogram, # 128 dimensions (8×4×4 HSV bins)
[text_density], # Text pixel ratio
[horizontal_var], # Line distribution variance
[vertical_var], # Character spacing variance
[aspect_ratio], # Image proportions
[height, width] # Absolute dimensions
])
Phase 2: Model Training
The Isolation Forest algorithm trains on the visual features without requiring labeled data:
# Configure for 1% contamination (10 anomalies in 1000 samples)
detector = IsolationForest(
contamination=0.01, # Expected anomaly rate
n_estimators=200, # Decision trees for ensemble
random_state=42 # Reproducible results
)
# Train on scaled features
features_scaled = StandardScaler().fit_transform(features)
detector.fit(features_scaled)
Phase 3: Scoring and Reporting
The trained model generates anomaly scores for all samples, where lower (more negative) scores indicate higher anomaly likelihood:
# Generate anomaly scores for all samples
predictions = detector.predict(features_scaled) # +1 = normal, -1 = anomaly
scores = detector.score_samples(features_scaled) # Continuous anomaly scores
# Sort by anomaly score (most anomalous first)
results = sorted(zip(image_paths, predictions, scores), key=lambda x: x[2])
The system outputs comprehensive scoring reports showing request IDs, anomaly status, and exact numerical scores for every sample in the dataset.
The Results
Testing with 1,000 HTTP requests, 10 of which were injected with anomalous user agents, produced remarkable results:
Dataset Statistics:
- 100-sample dataset: ~20MB bitmaps, 1.1MB compressed
- 1K-sample dataset: 200MB bitmaps (1,000 files × 207KB each), 12MB compressed
- Compression ratio: ~16:1 for bitmaps due to repetitive visual patterns
Seemingly Near-Perfect Classification:
- Precision: 100% (no false positives)
- Recall: 100% (no false negatives)
- Clear separation: a 0.011-point gap between the least anomalous injected anomaly (-0.6189) and the most anomalous normal request (-0.6078)
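Because the crawler records ground truth, verifying these numbers reduces to set arithmetic. The request numbers below are the four shown in the report excerpt; the full run has ten:

```python
# Requests injected with anomalous user agents (crawler ground truth) vs.
# requests the model flagged with prediction -1
injected = {61, 502, 975, 989}
flagged = {61, 502, 975, 989}

tp = len(injected & flagged)   # true positives
fp = len(flagged - injected)   # false positives
fn = len(injected - flagged)   # false negatives
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)  # 1.0 1.0
```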
Detected Anomalies:
The system successfully identified all injected anomalies:
ANOMALIES (10 out of 1000) - Score Range: -0.8402 to -0.6189
============================================================
Request #975: -0.8402 ⚠️ ANOMALY (Ubuntu Chrome 95)
Request #989: -0.8273 ⚠️ ANOMALY (Googlebot crawler)
Request #61: -0.8153 ⚠️ ANOMALY (curl/7.68.0)
Request #502: -0.8069 ⚠️ ANOMALY (Python-urllib/3.8)
...
NORMAL REQUESTS (990 out of 1000) - Score Range: -0.6078 to -0.3769
===================================================================
Request #631: -0.6078 ✅ NORMAL
Request #334: -0.5953 ✅ NORMAL
...
Why This Works
The success stems from a fundamental insight: different user agents create visually distinct patterns when rendered as monospace text:
- Length variations: Mobile user agents are shorter than desktop ones
- Character patterns: Bots like "curl/7.68.0" have unique visual signatures
- Bracket positioning: Browser version numbers create specific visual rhythms
- Whitespace distribution: Different agents produce unique spacing patterns
The Isolation Forest algorithm excels at finding these visual outliers without needing labeled training data.
Conclusion
By transforming HTTP logs into bitmap images, we've shown that computer vision can achieve seemingly near-perfect anomaly detection on structured text data. This unconventional approach suggests that visual representations of security data might reveal patterns invisible to traditional analysis methods.
The implications extend beyond HTTP logs. Any structured text data—from system logs to network packets—could benefit from visual analysis. Sometimes the best way to find anomalies in your data is simply to look at it.
Special thanks to Ajax Davis and Lisa Watts for exploring this unconventional approach to security analysis with me.