Analysing Docker projects on Github with BigQuery

Maybe you know that the Github public archive can be analyzed with Google BigQuery. That’s 3Tb of data! This helped people run analysis on languages usage or framework popularity.

I wanted to produce similar results with projects using Docker. But what’s a project using Docker? For this article, I will consider a Docker project, a project that has a least one Dockerfile file.

So, why don’t we start by counting the number of Dockerfile files?

Count the Docker projects

The following query lists all files on a master branch that contain Dockerfile in their name. That’s not perfect but should be good enough for statistics.

SELECT count(*)
FROM [bigquery-public-data:github_repos.files]
WHERE path CONTAINS 'Dockerfile'
AND ref = 'refs/heads/master'

The number is: 281,212. That’s quite a lot! About 1% of all 2.9M Github projects.

Compare all the things!

We could compare it to the number of:

  • pom.xml: 840,328
  • travis.yml: 2,187,893
  • Makefile: 18,103,807
  • README.md: 10,774,468

A LOT of project seem to have a LOT of Makefiles! For example, shengshanbai/mini2440_buildroot contains 8628 Makefiles.

SELECT repo_name, count(*) count
FROM (
  SELECT repo_name
  FROM [bigquery-public-data:github_repos.files]
  WHERE path CONTAINS 'Makefile'
  AND ref = 'refs/heads/master'
)
GROUP BY repo_name
ORDER BY count DESC
LIMIT 1

Which base images are most used?

Now, it would be interesting to know which base images are used most often. There’s two ways of doing that. We can count the occurrences of each distinct FROM clause. Or we could group by image name, without the tag.

Let’s try both!

To make the next queries easier to write, I’ve created a view called dockerfiles with this query. This view contains the id of every Dockerfile we found:

SELECT id
FROM [bigquery-public-data:github_repos.files]
WHERE path CONTAINS 'Dockerfile'
AND ref = 'refs/heads/master'

And another view called dockerfiles_content that contains the actual content of all those Dockerfiles:

SELECT contents.content as dockerfile
FROM [code-story-blog:github.dockerfiles] as dockerfiles
JOIN [bigquery-public-data:github_repos.contents] as contents
ON dockerfiles.id = contents.id

Now let’s extract the images names and counts:

SELECT RTRIM(LTRIM(SUBSTR(line, 5))) as image, count(*) as count
FROM (
  SELECT SPLIT(dockerfile, '\n') as line
  FROM [code-story-blog:github.dockerfiles_content]
  HAVING LEFT(line, 5) = 'FROM '
)
GROUP BY image
ORDER BY count DESC
LIMIT 10
position image count
1 ubuntu:14.04 26746
2 debian:jessie 14031
3 scratch 11891
4 busybox 8064
5 ubuntu:trusty 7516
6 ubuntu 5597
7 golang:1.4 3395
8 ubuntu:latest 3383
9 debian:wheezy 2897
10 ubuntu:12.04 2545

Old Ubuntus are used a lot! The first alpine image is in 30th position. That needs to be improved! (See Alpine Based Docker Images Make a Difference in Real World Apps)

Now, what if we ignore the tags:

Let’s define an images view with all image names:

SELECT RTRIM(LTRIM(SUBSTR(line, 5))) as line_group.base_image
FROM (
  SELECT SPLIT(dockerfile, '\n') as line
  FROM [code-story-blog:github.dockerfiles_content]
  HAVING LEFT(line, 5) = 'FROM '
)

And count the occurrences of each image name:

SELECT image, count(*) as count
FROM (
  SELECT FIRST(SPLIT(line_group.base_image, ':')) as image
  FROM [code-story-blog:github.images]
)
GROUP BY image
ORDER BY count DESC
LIMIT 15
position image count
1 ubuntu 54453
2 debian 23079
3 golang 13127
4 scratch 11894
5 busybox 11059
6 centos 10087
7 fedora 7340
8 node 6764
9 python 6239
10 java 6102
11 alpine 4945
12 nginx 3074
13 redis 2970
14 php 2611
15 brimstone/ubuntu 2536

Ubuntu still leads and it’s good to see that people also use a lot the base images created especially for each language: Go, Node, Python and Java base images make it into the top 10.

OK, that’s it for counting. One can get nice information by counting but can we do better? For example, could we compare Dockerfiles with the best practices?

One of those best practices says we should use COPY over ADD.

ADD or COPY?

I tried to count the number of ADD and COPY commands. Clearly, the number of COPY should be higher since ADD should be used for only very specific cases. But since ADD also does what COPY does, people tend to overuse ADD.

That’s easy to check!

We already have a view called dockerfiles_content that contains all the lines of all the Dockerfiles. Let’s extract the lines that start with ADD or COPY and count the occurrences of both:

SELECT command, count(*) as count
FROM (
  SELECT first(split(line, ' ')) as command
  FROM (
    SELECT SPLIT(dockerfile, '\n') as line
    FROM [code-story-blog:github.dockerfiles_content]
    HAVING LEFT(line, 5) = 'COPY ' || LEFT(line, 4) = 'ADD '
  )
)
GROUP BY command

And here’s the result:

command count
ADD 101042
COPY 65066

Clearly, too many ADDs! We could analyze the file names passed to ADD and check that they are valid urls or tarball files. That would give us the percentage of valid usages. I let that exercise to my beloved readers!

COPY or ADD after USER?

Sebastian, a colleague of mine, asked me to search for a well known bug pattern. Because of a bug / bad decision during implementation of COPY and ADD, those directives don’t follow USER directives. So files are not added with ownership of the USER, resulting in people having to RUN chown afterwards, which balloons the image size.

Le’s see what we can do…

SELECT count(*)
FROM (
  SELECT INSTR(dockerfile, '\nCOPY ') as copy, INSTR(dockerfile, '\nADD ') as add, INSTR(dockerfile, '\nUSER ') as user
  FROM [code-story-blog:github.dockerfiles_content]
  HAVING (user != 0) AND (copy != 0 OR add != 0) AND (user < copy OR user < add)
)

That’s 4,254 out of 281,212 Dockerfiles (1.5%). Out of those 4,254, 854 seem to run a chmod afterwards.

That’s it for today! I hope this will motivate you to either run similar analysis or to improve your Dockerfiles!