At HUMAN, we process a lot of data that we collect from websites' users, to classify and analyze the behavior of the users and determine if they are legitimate or not. A key component of any such analysis involves processing the user's IP address. Knowing more information about the request source IP can have many benefits, from geographic location, service provider, verifying if the IP appeared in bad reputation lists we maintain, and more. As the data we needed to categorize grew and the number of requests went up, we started looking into a better solution than simply iterating over each IP and matching it against a list of networks. We were already familiar with MaxMind as a solution for GeoLocation, when we implemented our own auto-update mechanism and put the legwork into researching the most suitable library for faster lookup.
Some of the benefits of the MaxMind DB file format are:
- The file format holds the same structure used for binary lookup, usually it means that the reader library will map the file into memory and use it directly to look up the IP information. This makes loading and searching the database very fast.
- Multiple reader libraries that read from this file format (different programming languages and multiple implementations)
- Has spec MaxMind DB File Format Specification
This solution allowed us to process a large amount of traffic with a fast lookup and a small memory footprint. Internally we use multiple databases and a small cache layer to eliminate lookups. We use it to store, distribute and efficiently manage IP databases in our network, from IP information that we extract, collect or that our algorithms generate based on traffic we analyze.
We found this process to be extremely efficient, and as it is fairly simple to demonstrate the use of it, we wanted to share the steps we did, by provide a detailed example. In the following example we will take the Amazon AWS IP Address Ranges and:
- Build an application that will collect the data
- Build a database from the data
- Use the database for enrichment purposes
Grabbing the raw IP information
The following code (collect.go) will collect IP ranges information from Amazon and build a CSV with a service and region associated for each range:
package main
import (
"encoding/csv"
"encoding/json"
"io/ioutil"
"net/http"
"os"
)
func main() {
// request amazon ip ranges data
res, _ := http.Get("https://ip-ranges.amazonaws.com/ip-ranges.json")
defer res.Body.Close()
body, _ := ioutil.ReadAll(res.Body)
// parse json information
var aws struct {
SyncToken string `json:"syncToken"`
CreateDate string `json:"createDate"`
Prefixes []struct {
IPPrefix string `json:"ip_prefix"`
Region string `json:"region"`
Service string `json:"service"`
} `json:"prefixes"`
}
json.Unmarshal(body, &aws)
// ouput as csv
writer := csv.NewWriter(os.Stdout)
defer writer.Flush()
writer.Write([]string{"service", "region", "network"})
for _, p := range aws.Prefixes {
writer.Write([]string{p.Service, p.Region, p.IPPrefix})
}
}
We will also provide a short Dockerfile to use the code above:
FROM golang:1-alpine
RUN mkdir /app
WORKDIR /app
COPY collect.go ./
RUN go build
ENTRYPOINT ["/app/app"]
Creating IP DB for fast lookup
Using the collected information, the following code will read the CSV and build a MaxMindDB file populated with the IP classification information:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
use Net::Works::Network;
use MaxMind::DB::Writer::Tree;
my %types = (
service => 'utf8_string',
region => 'utf8_string',
);
my $tree = MaxMind::DB::Writer::Tree->new(
database_type => 'Feed-IP-Data',
description => { en => 'Amazon IP data' },
ip_version => 6,
map_key_type_callback => sub { $types{ $_[0] } },
record_size => 24,
);
my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";
print "==> ", $file, "\n";
open(my $fh, "<", $file) or die "$file: $!";
my $csv = Text::CSV_XS->new();
$csv->column_names($csv->getline($fh));
while (my $row = $csv->getline($fh)) {
my $network = Net::Works::Network->new_from_string( string => $row->[2] );
my $metadata = { service => $row->[0], region => $row->[1] };
$tree->insert_network($network, $metadata);
}
close $fh;
my $filename = $ARGV[1] or die "Need to get mmdb file on the command line\n";
open my $ofh, '>:raw', $filename;
$tree->write_tree( $ofh );
close $ofh;
print "$filename created\n";
And again, a Dockerfile to run the build database script
FROM perl:5
RUN cpanm --notest --skip-satisfied MaxMind::DB::Writer Text::CSV_XS
RUN mkdir /out /app
WORKDIR /app
COPY build.pl ./
VOLUME "/out"
ENTRYPOINT ["perl", "build.pl"]
IP classification as middleware
In the last step, here is an example of a small http server that will use the database to enrich the requests by placing the AWS service and region in dedicated HTTP headers
package main
import (
"fmt"
"log"
"net"
"net/http"
maxminddb "github.com/oschwald/maxminddb-golang"
)
// ClassificationRecord IP classification record
type ClassificationRecord struct {
Service string `maxminddb:"service"`
Region string `maxminddb:"region"`
}
/// extracting the real user's IP address, assuming it is on a dedicated HTTP header: X-REAL-IP
func getRealIP(r *http.Request) string {
rip := r.Header.Get("X-REAL-IP")
if rip == "" {
rip = r.RemoteAddr
}
return rip
}
// enrichClassification enrich request with classification information on headers
func enrichClassification(reader *maxminddb.Reader, next http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
// log request real ip
rip := getRealIP(r)
log.Printf("Request from %s", rip)
// parse ip and enrich classification information
addr := net.ParseIP(rip)
if addr != nil {
var record ClassificationRecord
_ = reader.Lookup(addr, &record)
if record.Service != "" {
r.Header.Set("X-IP-SERVICE", record.Service)
}
if record.Region != "" {
r.Header.Set("X-IP-REGION", record.Region)
}
}
next.ServeHTTP(w, r)
}
}
func index(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "What would life be if we had no courage to attempt anything?\n")
// use enriched information if found on request
if service := r.Header.Get("X-IP-SERVICE"); service != "" {
fmt.Fprintf(w, "Known service %s\n", service)
}
if region := r.Header.Get("X-IP-REGION"); region != "" {
fmt.Fprintf(w, "Known region %s\n", region)
}
}
func main() {
// load mmdb with classifications
mmdb, err := maxminddb.Open("feed.mmdb")
if err != nil {
log.Fatalln(err)
}
defer mmdb.Close()
// register routes and serve requests
http.Handle("/", enrichClassification(mmdb, index))
log.Fatalln(http.ListenAndServe(":8080", nil))
}
Dockerfile to run the code above. For simplicity we embed the 'feed.mmdb' database file - but we can use it from a shared volume or download/update it using the server itself.
FROM golang:1-alpine
RUN apk add --update git
RUN mkdir /app
WORKDIR /app
COPY use.go feed.mmdb ./
RUN go get -d . && go build
EXPOSE 8080
CMD ["/app/app"]
Now we have a fast and simple solution that you can extend upon and use in order to classify your traffic based on the client IP.