Efficiently querying networks in BigQuery
I recently worked on a project to import a set of network data into BigQuery. As part of this I needed to figure out how to make querying the data efficient. I thought it would be useful to write down how I made this work for future reference, as there were several pieces to get right.
First implementation
Let's say we have a data set containing country information about networks that we want to load into BigQuery:
network,country
1.1.0.0/16,CA
2.0.0.0/15,US
[many more rows]
Let's say we load it into BigQuery like this:
CREATE TABLE dataset.network_country (
  network string,
  country string
);
If we have an IP address and we want to query the table for its country, how do we do that? BigQuery has built-in network functions. We could use them to do something like this:
SELECT country
FROM dataset.network_country
WHERE NET.IP_TRUNC(
    NET.IP_FROM_STRING('1.1.1.1'),
    CAST(REGEXP_EXTRACT(network, r'/(\d+)$') AS INT64)
  ) = NET.IP_FROM_STRING(
    REGEXP_EXTRACT(network, r'^([^/]+)')
  );
However, this is inefficient. Beyond the overhead of the regexes, the main issue is that we need to check every row in the table.
Adding start and end IP columns
To improve on this, we could import the data with the start and end IP of
each network. BigQuery's NET.IPV4_TO_INT64() lets us work with IPv4 IPs
as integers. There is no equivalent function for IPv6, so we use bytes,
which NET.IP_FROM_STRING() gives us.
CREATE TABLE dataset.network_country_ipv4 (
  network string,
  country string,
  start_int integer,
  end_int integer
);
and
CREATE TABLE dataset.network_country_ipv6 (
  network string,
  country string,
  start_int bytes,
  end_int bytes
);
To add these columns, we can pre-process the CSV before loading into BigQuery.
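A minimal sketch of that pre-processing, assuming Python and its standard ipaddress module (start_end is an illustrative name, not from the project), might look like this:

import ipaddress

# A sketch: compute the start and end values for one network row.
# IPv4 networks get integers; IPv6 networks get the raw 16 address bytes
# (base64-encode these when writing the CSV, since that is how BigQuery
# expects BYTES values in CSV loads).
def start_end(cidr):
    net = ipaddress.ip_network(cidr)
    if net.version == 4:
        return int(net.network_address), int(net.broadcast_address)
    return net.network_address.packed, net.broadcast_address.packed

print(start_end('1.1.0.0/16'))  # (16842752, 16908287)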
Then we can query like this:
SELECT country
FROM dataset.network_country_ipv4
WHERE NET.IPV4_TO_INT64(NET.IP_FROM_STRING('1.1.1.1'))
  BETWEEN start_int AND end_int;
and
SELECT country
FROM dataset.network_country_ipv6
WHERE NET.IP_FROM_STRING('f00::beef')
  BETWEEN start_int AND end_int;
This looks good and you might expect this to be efficient. Unfortunately BigQuery still reads the entire table, so this is inefficient too.
Putting networks into buckets
We want to reduce the number of rows BigQuery needs to read. A good way to do this is by placing networks into buckets such that we only need to look into a single bucket.
We could choose to bucket networks into CIDR prefixes of size /16. This means that if we're looking for information about a particular IP then we would only need to look at rows with networks in that IP's /16 bucket. How would that look?
CREATE TABLE dataset.network_country_ipv4 (
  network string,
  country string,
  start_int integer,
  end_int integer,
  network_bucket integer
)
CLUSTER BY network_bucket;
and
CREATE TABLE dataset.network_country_ipv6 (
  network string,
  country string,
  start_int bytes,
  end_int bytes,
  network_bucket string
)
CLUSTER BY network_bucket;
For IPv4 the network bucket can be an integer, as before. However, for IPv6 we use a string and store the hex representation of the network's bucket. We do this because we need to cluster the table by network_bucket and BigQuery does not support clustering on bytes.
How do we calculate network_bucket? Again, we can do this in a pre-processing step before loading the data into BigQuery. However, there are a couple of things to remember when doing so.
First, if you have a network that is larger than the bucket size, you need
to include that network in all appropriate buckets. For example, if your
bucket size is /16 and you have the network 2.0.0.0/15, you need to split
it up so that it appears twice: once for 2.0.0.0/16 and once for 2.1.0.0/16. The CSV goes from:
network,country
1.1.0.0/16,CA
2.0.0.0/15,US
to:
network,country,start_int,end_int,network_bucket
1.1.0.0/16,CA,<1.1.0.0 as int>,<1.1.255.255 as int>,<1.1.0.0 as int>
2.0.0.0/15,US,<2.0.0.0 as int>,<2.1.255.255 as int>,<2.0.0.0 as int>
2.0.0.0/15,US,<2.0.0.0 as int>,<2.1.255.255 as int>,<2.1.0.0 as int>
Note I've chosen to duplicate all column values other than the bucket.
Second, if we have a network that is smaller than the bucket size, we need
to apply the prefix mask to it when setting the bucket column. For example,
if we have the network 3.4.5.0/24, we should not use 3.4.5.0/24 as the
bucket. Instead we should use 3.4.0.0/16.
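Both rules can be handled in the same pre-processing pass. Here is a rough sketch in Python, assuming /16 buckets and the ipaddress module again (buckets_for is an illustrative name, not from the project):

import ipaddress

BUCKET_PREFIX = 16

# A sketch: yield one bucket value per row the network should appear in.
# IPv4 buckets are integers; IPv6 buckets are hex strings, since we will
# cluster on this column and BigQuery does not cluster on bytes.
def buckets_for(cidr):
    net = ipaddress.ip_network(cidr)
    if net.prefixlen < BUCKET_PREFIX:
        # Larger than a bucket: emit one duplicate row per /16 it covers.
        buckets = net.subnets(new_prefix=BUCKET_PREFIX)
    else:
        # Bucket-sized or smaller: mask down to the containing /16.
        buckets = [net.supernet(new_prefix=BUCKET_PREFIX)]
    for bucket in buckets:
        addr = bucket.network_address
        yield int(addr) if net.version == 4 else addr.packed.hex()

print(list(buckets_for('2.0.0.0/15')))  # [33554432, 33619968] -> two rows
print(list(buckets_for('3.4.5.0/24')))  # [50593792] -> bucket 3.4.0.0/16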
As I mentioned, we'll also want to cluster the tables by network_bucket.
This allows BigQuery to only look at relevant parts of the table when
querying.
We can query these tables like so:
SELECT country
FROM dataset.network_country_ipv4
WHERE network_bucket = NET.IPV4_TO_INT64(
    NET.IP_TRUNC(NET.IP_FROM_STRING('1.1.1.1'), 16)
  )
  AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING('1.1.1.1'))
    BETWEEN start_int AND end_int;
and
SELECT country
FROM dataset.network_country_ipv6
WHERE network_bucket = TO_HEX(
    NET.IP_TRUNC(NET.IP_FROM_STRING('f00::beef'), 16)
  )
  AND NET.IP_FROM_STRING('f00::beef')
    BETWEEN start_int AND end_int;
As you can see, we need to check start_int and end_int as well as the bucket: limiting the scan to one bucket narrows things down, but the bucket can still contain networks that do not include our IP.
Joining a table of IPs can be written in a similar way:
SELECT country
FROM dataset.my_ips_ipv4 i
JOIN dataset.network_country_ipv4 c
  ON c.network_bucket = NET.IPV4_TO_INT64(
      NET.IP_TRUNC(NET.IP_FROM_STRING(i.ip), 16)
    )
  AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING(i.ip))
    BETWEEN c.start_int AND c.end_int;
A join will cause BigQuery to process all rows, but it will join more efficiently than if you didn't include the network bucket.
Making the IPv6 bucket more efficient
This step is an optional further optimization. The hex string implementation should be sufficient in many cases.
We can make IPv6 queries more efficient by storing the bucket as an integer rather than as a string. Since BigQuery charges based on the amount of data processed by queries, reducing the column size reduces the cost.
Since we're only using a prefix of the IPv6 address (in the examples above, the first 16 bits), we can store the relevant part as an integer that fits in BigQuery's integer type. This means a bucket value goes from 32 bytes to 8 bytes. (Note we could shrink the hex string to 4 bytes when using a 16-bit bucket, and I believe the integer approach could similarly get as small as 2 bytes. It depends how far we want to optimize.)
To do this, instead of using the full bucket address as a hex string, take the first 60 bits and convert that to an integer.
Why 60 bits? This comes from using the first 15 of the 32 characters of the address's hex string, which keeps the conversion to an integer in queries simple. We don't use the first 16 characters, which would fill a 64-bit integer, because that would complicate the conversion due to negative numbers, and 60 bits is already more than we need.
When creating the bucket column for IPv6, calculate the bucket as before, which gives you an IPv6 address, and then turn that into an integer that will fit in BigQuery's integer type:
- Calculate the bucket IPv6 address
- Convert the address to binary
- Take the first 64 bits and create a 64-bit integer
- Right shift the integer 4 bits to get 60 bits (see the sketch below)
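A rough sketch of that conversion in Python (the function name is illustrative; it assumes the /16 bucket size used above):

import ipaddress

# A sketch: turn an IPv6 bucket network (e.g. f00::/16) into the 60-bit
# integer stored in network_bucket.
def ipv6_bucket_int(bucket_cidr):
    addr = ipaddress.ip_network(bucket_cidr).network_address
    first_64_bits = int.from_bytes(addr.packed[:8], 'big')
    # Dropping the low 4 bits leaves 60 bits, which always fits in a
    # signed 64-bit integer.
    return first_64_bits >> 4

print(ipv6_bucket_int('f00::/16'))  # 67553994410557440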
The table will now look like this:
CREATE TABLE dataset.network_country_ipv6 (
  network string,
  country string,
  start_int bytes,
  end_int bytes,
  network_bucket integer
)
CLUSTER BY network_bucket;
We can query the table like this:
SELECT country
FROM dataset.network_country_ipv6
WHERE network_bucket = CAST(
    CONCAT(
      '0x',
      SUBSTR(
        TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('f00::beef'), 16)),
        1,
        15
      )
    ) AS INT64
  )
  AND NET.IP_FROM_STRING('f00::beef')
    BETWEEN start_int AND end_int;
Sample implementation
You can see an implementation of parts of this in this pull request to mmdbconvert.
Snowflake
We can use the same approach with Snowflake, which has similar performance issues if you do not bucket the networks. Snowflake's PARSE_IP() function can parse IPv4 addresses and networks into numbers and IPv6 addresses and networks into hex strings:
SELECT
  PARSE_IP('1.1.1.1', 'inet'):ipv4,
  PARSE_IP('1.1.0.0/16', 'inet'):ipv4_range_start,
  PARSE_IP('f00::beef', 'inet'):hex_ipv6,
  PARSE_IP('f00::/16', 'inet'):hex_ipv6_range_start;
You can store the bucket as a number for IPv4 and as a hex string for IPv6.
For the start and end IPs, you can use these types too. This is a small
difference from BigQuery where we store IPv6 IPs as bytes. This
difference is natural since BigQuery's network functions return bytes for
IPv6 whereas Snowflake's functions return hex strings.
You could also adapt the IPv6 integer bucket approach for Snowflake. I've not implemented that part there.
Postgres
Happily, Postgres does not have these kinds of issues: it has native network types and allows building indexes on them. See inet_ops.