
Visual Anomaly Detection: Turning HTTP Requests into Bitmaps for Machine Learning

Full implementation details:

https://github.com/russellballestrini/logs-to-bitmap

We use reproducible computer science in this repo.

Overview

What if we could detect malicious HTTP requests by literally looking at them? In this experiment, we transform HTTP request logs into bitmap images and use computer vision techniques to identify anomalies. The results are striking: 100% detection accuracy on this controlled synthetic dataset, using simple visual pattern recognition.

This project, developed in collaboration with Ajax Davis and Lisa Watts, explores an unconventional approach to cybersecurity: converting text-based security data into visual representations that machine learning models can analyze as images.

The computer vision implementation is based on Adrian Rosebrock's excellent PyImageSearch tutorial on anomaly detection with OpenCV and scikit-learn, which provides the foundational techniques for extracting visual features and training Isolation Forest models.

The Hypothesis

Traditional anomaly detection analyzes log files as text, parsing fields and applying statistical models. But what if we treated logs as visual data instead? Our hypothesis:

  1. Different HTTP request patterns create visually distinct bitmap signatures
  2. Computer vision models can identify these visual differences
  3. Small, just-in-time trained models can achieve high accuracy
  4. Visual analysis might reveal patterns invisible to traditional text analysis

The beauty of this approach is its simplicity: we're literally taking screenshots of logs and asking "does this look normal?"

Quick Start

Clone the repository and get started in minutes:

# Clone the repository
git clone https://github.com/russellballestrini/logs-to-bitmap
cd logs-to-bitmap

# Install dependencies
pip install -r requirements.txt

# Generate 1K sample dataset with anomalies (automated)
make crawl-1k

# Train the anomaly detection model
make train-model-1k

# View the results
cat anomaly_scores/all_1k_scores.txt

How It Works

The system operates in three key stages:

  1. Log Generation

    A Pyramid web server captures HTTP requests and logs each one to its own ASCII file and bitmap file.

    log_data = {
        'Timestamp': datetime.datetime.now().isoformat(),
        'Request ID': request_id,
        'Endpoint': endpoint,
        'Client Address': request.client_addr,
        'User-Agent': user_agent,
        'Method': request.method,
        'URL': request.url,
        'Headers': headers
    }
    

  2. Bitmap Conversion

Each log entry is rendered as a grayscale bitmap using structured field layout:

from PIL import Image, ImageDraw

# Create grayscale bitmap from structured log data
img = Image.new('L', (width, height), color=255)  # White background
draw = ImageDraw.Draw(img)

# Draw each field with proper text wrapping
for line in formatted_lines:
    draw.text((10, y_position), line, font=font, fill=0)  # Black text
    y_position += line_height

img.save(bitmap_path, 'BMP')

  3. Feature Extraction

    Computer vision extracts visual features from each bitmap:

    • Color histograms (HSV color space)
    • Text density (ratio of non-white pixels)
    • Spatial distribution (horizontal/vertical projections)
    • Image dimensions and aspect ratio
  4. Anomaly Detection

    An Isolation Forest algorithm identifies outliers based on visual features:

    # Train isolation forest on visual features
    detector = IsolationForest(contamination=0.01, n_estimators=200)
    detector.fit(features_scaled)
    

Pyramid Web Server Configuration

The system uses a Pyramid web server with custom middleware to capture and process every HTTP request in real-time:

def request_counter_middleware(handler, registry):
    """Middleware to count requests and assign sequence numbers"""
    def middleware(request):
        global request_counter
        with counter_lock:
            request_counter += 1
            request.request_number = request_counter
        return handler(request)
    return middleware
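The middleware above assumes module-level state that the snippet does not show: a request_counter guarded by a counter_lock. A minimal standalone sketch of that pattern (the two names match the snippet; everything else here is illustrative):

```python
import threading

# Module-level state assumed by the middleware snippet above.
request_counter = 0
counter_lock = threading.Lock()

def next_request_number():
    """Atomically increment and return the global request counter."""
    global request_counter
    with counter_lock:
        request_counter += 1
        return request_counter

# Simulate many concurrent requests to show the counter stays consistent.
threads = [threading.Thread(target=next_request_number) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(request_counter)  # 100: every request got a unique sequence number
```

Without the lock, two overlapping requests could read the same counter value and receive duplicate sequence numbers.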

Each incoming request triggers the log_request() function which:

  1. Extracts request metadata: timestamp, client IP, user agent, headers, method, URL
  2. Generates unique identifiers: sequential request numbers and UUID-based request IDs
  3. Creates structured log files: saves detailed ASCII logs to logs/ directory
  4. Renders bitmap images: converts log data to grayscale BMP files in bitmaps/ directory
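Steps 1-3 can be sketched in a few lines of standard-library Python. The field names follow the log_data dict shown earlier; the file-naming scheme here is illustrative, not necessarily the repo's exact convention:

```python
import datetime
import os
import uuid

def log_request_sketch(method, url, client_addr, user_agent, log_dir="logs"):
    """Write one request's metadata to its own ASCII log file."""
    os.makedirs(log_dir, exist_ok=True)
    request_id = str(uuid.uuid4())  # UUID-based request ID
    log_data = {
        'Timestamp': datetime.datetime.now().isoformat(),
        'Request ID': request_id,
        'Client Address': client_addr,
        'User-Agent': user_agent,
        'Method': method,
        'URL': url,
    }
    path = os.path.join(log_dir, f"request_{request_id}.txt")
    with open(path, "w") as f:
        for field, value in log_data.items():
            f.write(f"{field}: {value}\n")
    return path

path = log_request_sketch("GET", "http://localhost/hello",
                          "127.0.0.1", "curl/7.68.0")
```

The same structured text is what later gets rendered into a bitmap, so consistent field ordering matters: it keeps each field at a stable vertical position in the image.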

The system uses Liberation Mono or DejaVu Sans Mono fonts to ensure consistent character spacing across all bitmap renders, making visual patterns more reliable for machine learning analysis.

Synthetic Data Generation

To test the anomaly detection system, we use a controlled crawler that generates HTTP traffic with known anomalies:

# 99% normal traffic uses consistent user agent
normal_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"

# 1% anomalous traffic uses diverse user agents
anomaly_user_agents = [
    "curl/7.68.0",                    # Command-line tool
    "Python-urllib/3.8",              # Python script
    "Googlebot/2.1",                  # Search crawler
    "Mozilla/5.0...Firefox/78.0",     # Different browser
    "MSIE 6.0",                       # Legacy browser
]

The crawler (crawler_1k.py) operates as follows:

  1. Generates 1,000 total requests with controlled timing (10-30ms delays)
  2. Randomly selects 10 requests (exactly 1%) to receive anomalous user agents
  3. Distributes traffic across two endpoints (66% to /, 33% to /hello)
  4. Reports progress every 100 requests with timing statistics
  5. Creates known ground truth by logging which requests received anomaly user agents

This approach provides an ideal test dataset where we know exactly which requests should be flagged as anomalous, enabling precise measurement of detection accuracy.
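The ground-truth construction reduces to sampling 1% of request indices. A sketch of that logic (the seed, variable names, and default user agent string are illustrative, not taken from crawler_1k.py):

```python
import random

TOTAL_REQUESTS = 1000
ANOMALY_COUNT = 10  # exactly 1%

anomaly_user_agents = [
    "curl/7.68.0",
    "Python-urllib/3.8",
    "Googlebot/2.1",
]

rng = random.Random(42)  # fixed seed for a reproducible ground truth
anomaly_indices = set(rng.sample(range(TOTAL_REQUESTS), ANOMALY_COUNT))

def user_agent_for(i, normal_ua="Mozilla/5.0 (Windows NT 10.0; ...)"):
    """Return the anomalous UA for sampled indices, the normal UA otherwise."""
    if i in anomaly_indices:
        return rng.choice(anomaly_user_agents)
    return normal_ua

# Ground-truth labels: 1 = anomaly, 0 = normal.
labels = [1 if i in anomaly_indices else 0 for i in range(TOTAL_REQUESTS)]
print(sum(labels))  # 10
```

Because the anomalous indices are recorded up front, precision and recall can later be computed exactly rather than estimated.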

Model Training and Scoring

The visual anomaly detection training process operates in three phases:

Phase 1: Feature Extraction

Each bitmap undergoes computer vision analysis to extract a 134-dimensional feature vector (128 histogram bins plus six scalar features):

# Extract visual features from bitmap
features = np.concatenate([
    color_histogram,      # 128 dimensions (8×4×4 HSV bins)
    [text_density],       # Text pixel ratio
    [horizontal_var],     # Line distribution variance
    [vertical_var],       # Character spacing variance
    [aspect_ratio],       # Image proportions
    [height, width]       # Absolute dimensions
])
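The fragment above references variables computed elsewhere in the pipeline. A simplified, self-contained sketch of the grayscale features using only NumPy (it omits the 128-bin HSV histogram, which would require OpenCV, and the threshold of 128 is an assumption):

```python
import numpy as np

def extract_features_sketch(img):
    """Simplified visual features from a grayscale bitmap (2-D uint8 array)."""
    height, width = img.shape
    text_mask = img < 128                         # dark pixels count as text
    text_density = text_mask.mean()               # ratio of non-white pixels
    horizontal_var = text_mask.sum(axis=1).var()  # line distribution variance
    vertical_var = text_mask.sum(axis=0).var()    # character spacing variance
    aspect_ratio = width / height
    return np.array([text_density, horizontal_var, vertical_var,
                     aspect_ratio, height, width], dtype=float)

# A fake 100x400 "bitmap": white background with two black text rows.
img = np.full((100, 400), 255, dtype=np.uint8)
img[10:14, 10:300] = 0
img[30:34, 10:200] = 0
features = extract_features_sketch(img)
```

Even this reduced feature set separates user agents of different lengths: a shorter user agent string leaves fewer dark pixels and a different row-projection profile.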

Phase 2: Model Training

The Isolation Forest algorithm trains on the visual features without requiring labeled data:

# Configure for 1% contamination (10 anomalies in 1000 samples)
detector = IsolationForest(
    contamination=0.01,    # Expected anomaly rate
    n_estimators=200,      # Decision trees for ensemble
    random_state=42        # Reproducible results
)

# Train on scaled features
features_scaled = StandardScaler().fit_transform(features)
detector.fit(features_scaled)

Phase 3: Scoring and Reporting

The trained model generates anomaly scores for all samples, where lower (more negative) scores indicate higher anomaly likelihood:

# Generate anomaly scores for all samples
predictions = detector.predict(features_scaled)  # +1 = normal, -1 = anomaly
scores = detector.score_samples(features_scaled)  # Continuous anomaly scores

# Sort by anomaly score (most anomalous first)
results = sorted(zip(image_paths, predictions, scores), key=lambda x: x[2])

The system outputs comprehensive scoring reports showing request IDs, anomaly status, and exact numerical scores for every sample in the dataset.

The Results

Testing with 1,000 HTTP requests, 10 of which carried anomalous user agents, achieved remarkable results:

Dataset Statistics:

  • 100-sample dataset: ~20MB bitmaps, 1.1MB compressed
  • 1K-sample dataset: 200MB bitmaps (1,000 files × 207KB each), 12MB compressed
  • Compression ratio: ~16:1 for bitmaps due to repetitive visual patterns

Seemingly Near-Perfect Classification:

  • Precision: 100% (no false positives)
  • Recall: 100% (no false negatives)
  • Clear separation: a 0.011-point gap between the least anomalous anomaly (-0.6189) and the most anomalous normal request (-0.6078)
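With known ground truth, these metrics reduce to simple set arithmetic. A pure-Python sketch (the request indices below are a toy subset, not the experiment's full output):

```python
def precision_recall(truth, predicted):
    """truth/predicted: sets of request indices flagged as anomalous."""
    tp = len(truth & predicted)   # true positives
    fp = len(predicted - truth)   # false positives
    fn = len(truth - predicted)   # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: the detector flags exactly the injected anomalies.
truth = {61, 502, 975, 989}
predicted = {61, 502, 975, 989}
p, r = precision_recall(truth, predicted)
print(p, r)  # 1.0 1.0
```

Both metrics hit 1.0 only when the flagged set matches the injected set exactly, which is what the full 1K run produced.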

Detected Anomalies:

The system successfully identified all injected anomalies:

ANOMALIES (10 out of 1000) - Score Range: -0.8402 to -0.6189
============================================================
Request #975: -0.8402 ⚠️ ANOMALY (Ubuntu Chrome 95)
Request #989: -0.8273 ⚠️ ANOMALY (Googlebot crawler)
Request #61:  -0.8153 ⚠️ ANOMALY (curl/7.68.0)
Request #502: -0.8069 ⚠️ ANOMALY (Python-urllib/3.8)
...

NORMAL REQUESTS (990 out of 1000) - Score Range: -0.6078 to -0.3769
===================================================================
Request #631: -0.6078 ✅ NORMAL
Request #334: -0.5953 ✅ NORMAL
...

Why This Works

The success stems from a fundamental insight: different user agents create visually distinct patterns when rendered as monospace text:

  1. Length variations: Mobile user agents are shorter than desktop ones
  2. Character patterns: Bots like "curl/7.68.0" have unique visual signatures
  3. Bracket positioning: Browser version numbers create specific visual rhythms
  4. Whitespace distribution: Different agents produce unique spacing patterns

The Isolation Forest algorithm excels at finding these visual outliers without needing labeled training data.
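The length and character-class differences are easy to verify directly. In a monospace font, every character occupies the same width, so string length maps one-to-one to rendered line width (the user agents below come from the examples earlier in this post):

```python
normal_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
             "AppleWebKit/537.36 (KHTML, like Gecko) "
             "Chrome/119.0.0.0 Safari/537.36")
bot_uas = ["curl/7.68.0", "Python-urllib/3.8", "Googlebot/2.1"]

# In monospace rendering, string length is a direct proxy for the
# rendered line's pixel width.
for ua in bot_uas:
    print(f"{ua!r}: {len(ua)} chars vs normal {len(normal_ua)} chars")

# Bots are far shorter and lack the parenthesized platform tokens that
# give browser user agents their characteristic visual rhythm.
assert all(len(ua) < len(normal_ua) / 3 for ua in bot_uas)
assert "(" in normal_ua and all("(" not in ua for ua in bot_uas)
```

A bot line leaves most of its bitmap row white, while the browser line fills it, and that difference shows up directly in the text density and projection-variance features.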

Conclusion

By transforming HTTP logs into bitmap images, we've shown that computer vision can achieve seemingly near-perfect anomaly detection on structured text data. This unconventional approach suggests that visual representations of security data might reveal patterns invisible to traditional analysis methods.

The implications extend beyond HTTP logs. Any structured text data—from system logs to network packets—could benefit from visual analysis. Sometimes the best way to find anomalies in your data is simply to look at it.

Special thanks to Ajax Davis and Lisa Watts for exploring this unconventional approach to security analysis with me.





© Russell Ballestrini.