Analysing Docker projects on Github with BigQuery
Maybe you know that the Github public archive can be analyzed with Google BigQuery. That’s 3Tb of data! This helped people run analysis on languages usage or framework popularity.
I wanted to produce similar results with projects using Docker. But what’s a project using Docker? For this article, I will consider a Docker project, a project that has a least one
So, why don’t we start by counting the number of
Count the Docker projects
The following query lists all files on a
master branch that contain
Dockerfile in their name. That’s not perfect but should be good enough for statistics.
SELECT count(*) FROM [bigquery-public-data:github_repos.files] WHERE path CONTAINS 'Dockerfile' AND ref = 'refs/heads/master'
The number is:
281,212. That’s quite a lot! About 1% of all 2.9M Github projects.
Compare all the things!
We could compare it to the number of:
A LOT of project seem to have a LOT of Makefiles! For example, shengshanbai/mini2440_buildroot contains
SELECT repo_name, count(*) count FROM ( SELECT repo_name FROM [bigquery-public-data:github_repos.files] WHERE path CONTAINS 'Makefile' AND ref = 'refs/heads/master' ) GROUP BY repo_name ORDER BY count DESC LIMIT 1
Which base images are most used?
Now, it would be interesting to know which base images are used most often. There’s two ways of doing that. We can count the occurrences of each distinct
FROM clause. Or we could group by image name, without the tag.
Let’s try both!
To make the next queries easier to write, I’ve created a view called
dockerfiles with this query. This view contains the
id of every Dockerfile we found:
SELECT id FROM [bigquery-public-data:github_repos.files] WHERE path CONTAINS 'Dockerfile' AND ref = 'refs/heads/master'
And another view called
dockerfiles_content that contains the actual content of all those Dockerfiles:
SELECT contents.content as dockerfile FROM [code-story-blog:github.dockerfiles] as dockerfiles JOIN [bigquery-public-data:github_repos.contents] as contents ON dockerfiles.id = contents.id
Now let’s extract the images names and counts:
SELECT RTRIM(LTRIM(SUBSTR(line, 5))) as image, count(*) as count FROM ( SELECT SPLIT(dockerfile, '\n') as line FROM [code-story-blog:github.dockerfiles_content] HAVING LEFT(line, 5) = 'FROM ' ) GROUP BY image ORDER BY count DESC LIMIT 10
Old Ubuntus are used a lot! The first
alpine image is in 30th position. That needs to be improved! (See Alpine Based Docker Images Make a Difference in Real World Apps)
Now, what if we ignore the tags:
Let’s define an
images view with all image names:
SELECT RTRIM(LTRIM(SUBSTR(line, 5))) as line_group.base_image FROM ( SELECT SPLIT(dockerfile, '\n') as line FROM [code-story-blog:github.dockerfiles_content] HAVING LEFT(line, 5) = 'FROM ' )
And count the occurrences of each image name:
SELECT image, count(*) as count FROM ( SELECT FIRST(SPLIT(line_group.base_image, ':')) as image FROM [code-story-blog:github.images] ) GROUP BY image ORDER BY count DESC LIMIT 15
Ubuntu still leads and it’s good to see that people also use a lot the base images created especially for each language: Go, Node, Python and Java base images make it into the top 10.
OK, that’s it for counting. One can get nice information by counting but can we do better? For example, could we compare Dockerfiles with the best practices?
One of those best practices says we should use
ADD or COPY?
I tried to count the number of
COPY commands. Clearly, the number of
COPY should be higher since
ADD should be used for only very specific cases. But since
ADD also does what
COPY does, people tend to overuse
That’s easy to check!
We already have a view called
dockerfiles_content that contains all the lines of all the Dockerfiles. Let’s extract the lines that start with
COPY and count the occurrences of both:
SELECT command, count(*) as count FROM ( SELECT first(split(line, ' ')) as command FROM ( SELECT SPLIT(dockerfile, '\n') as line FROM [code-story-blog:github.dockerfiles_content] HAVING LEFT(line, 5) = 'COPY ' || LEFT(line, 4) = 'ADD ' ) ) GROUP BY command
And here’s the result:
Clearly, too many
ADDs! We could analyze the file names passed to
ADD and check that they are valid urls or tarball files. That would give us the percentage of valid usages. I let that exercise to my beloved readers!
COPY or ADD after USER?
Sebastian, a colleague of mine, asked me to search for a well known bug pattern. Because of a bug / bad decision during implementation of
ADD, those directives don’t follow
USER directives. So files are not added with ownership of the
USER, resulting in people having to
RUN chown afterwards, which balloons the image size.
Le’s see what we can do…
SELECT count(*) FROM ( SELECT INSTR(dockerfile, '\nCOPY ') as copy, INSTR(dockerfile, '\nADD ') as add, INSTR(dockerfile, '\nUSER ') as user FROM [code-story-blog:github.dockerfiles_content] HAVING (user != 0) AND (copy != 0 OR add != 0) AND (user < copy OR user < add) )
4,254 out of
281,212 Dockerfiles (1.5%). Out of those
854 seem to run a
That’s it for today! I hope this will motivate you to either run similar analysis or to improve your Dockerfiles!