Norbert Preining氏によるコラム、【Analyzing Debian packages with Neo4j - Part 2】を公開

アクセリア株式会社の研究開発部社員であるNorbert Preining氏による、コラム連載を開始しました。

アクセリア株式会社 2018年02月06日

第10回:Analyzing Debian packages with Neo4j - Part 2 UDD and Graph DB Schema

「The Ultimate Debian Database UDD」
In the first part of this series of articles on analyzing Debian packages with Neo4j (リンク ») we gave a short introduction to Debian and the life time and structure of Debian packages.(first part (リンク ») )

The current second part first describes the Ultimate Debian Database UDD and how to map the information presented here from the UDD into a Graph Database by developing the database scheme, that is the set of nodes and relations, together with their attributes, from the inherent properties of Debian packages.

The next part will describe how to get the data from the UDD into Neo4j, give some sample queries, and discuss further work.

The Ultimate Debian Database UDD (リンク »)

 gathers a lot of data about various aspects of Debian in the same SQL database. It allows users to
 easily access and combine all these data. Data currently being imported include: Packages and
 Sources files, from Debian and Ubuntu, Bugs from the Debian BTS, Popularity contest, History of
 uploads, History of migrations to testing, Lintian, Orphaned packages, Carnivore, Debtags, Ubuntu
 bugs (from Launchpad), Packages in NEW queue, DDTP translations. Debian Wiki

Collection all these information and obviously having been grown over time, the database exhibits a highly de-normalized structure with ample duplication of the same information. As a consequence, reading the SQL code fetching data from the UDD and presenting them in a coherent interface tends to be highly convoluted.

This lets us to the project of putting (parts) of the UDD into a graph database, removing all the duplication on the way and representing the connections between the entities in a natural graph way.

「Developing the database schema」
Recall from the first column that there are source packages and binary packages in Debian, and that the same binary package can be built in different versions from different source packages. Thus we decided to have both source and binary packages as separate entities, that is nodes of the graph, and the two being connected via a binary relation builds.

Considering dependencies between Debian packages we recall the fact that there are versioned and unversioned dependencies. We thus decide to have again different entities for versioned source and binary packages, and unversioned source and binary packages.

The above considerations leads to the following sets of nodes and relations:

vsp -[:is_instance_of]-> sp
vbp -[:is_instance_of]-> bp
sp -[:builds]-> bp
vbp -[:next]-> vbp
vsp -[:next]-> vsp

where vsp stands for versioned source package, sp for (unversioned) source package, and analog for binary packages. The versioned variants carry besides the name attribute also a version attribute in the node.
The relations are is_instance_of between versioned and unversioned packages, builds between versioned source and versioned binary packages, and next that defines an order on the versions.

An example of a simple graph for the binary package luasseq which has was originally built from the source package luasseq but was then taken over into the TeX Live packages and built from a different source.

Next we want to register suites, that is associating which package has been included in which release of Debian. Thus we add a new node type suite and a new relation contains which connects suites and versioned binary packages vbp:
suite -[:contains]-> vbp

Nodes of type suite only contain one attribute name. We could add release dates etc, but refrained from it for now. Adding the suites to the above diagram we obtain the following:

Next we add maintainers. The new node type mnt has two attributes: name and email. Here it would be nice to add alternative email addresses as well as alternative spellings of the name, something that is quite common. We add a relation maintains to versioned source and binary packages only since, as we have seen, the maintainership can change over the history of a package:

mnt -[:maintains]-> vbp
mnt -[:maintains]-> vsp

This leads us to the following graph:

This concludes the first (easy) part with basic node types and relations. We now turn to the more complicated part to represent dependencies between packages in the graph.

「Representing dependencies」
For simple dependencies (versioned or unversioned, but no alternatives) we represent the dependency relation with two attributes reltype and relversion specifying the relation type (<<, <=, ==, >=, >>) and the version as string. For unversioned relations we use the reltype=none and relversion=1:

vbp -[:depends reltype: TYPE, relversion: VERS]-> bp

Adding all the dependencies to the above graph, we obtain the following graph:

Our last step is dealing with alternative dependencies. Recall from the first blog that a relation between two Debian packages can have alternative targets like in

Depends: musixtex (>= 1:0.98-1) | texlive-music

which means that either musixtex or texlive-music needs to be installed to satisfy this dependency.

We treat these kind of dependencies by introducing a new node type altdep and a new relation is_satisfied_by between altdep nodes and versioned or unversioned binary packages (vbp, bp).

The following slice of our graph shows binary packages pmx which has alternative dependencies as above:

(リンク »)

【アクセリア株式会社の研究開発部社員:Norbert Preining氏のコラム】
・第1回:今さら聞けない、機械学習/ディープラーニングとは!? (リンク »)
・第2回:最新の機械学習の代表、ニューラルネットワークとは (リンク »)
・第3回:手書き数字を認識する機械学習 (リンク »)
・第4回:畳み込みニューラルネットワーク (リンク »)
・第5回:機械学習における今後の展開 (リンク »)
・第9回:Analyzing Debian packages with Neo4j (リンク »)

・第6回:アクセリアが手がけるP2P (Peer-to-Peer) (リンク »)
・第7回:効率的な検索を可能にする、グラフデータベース (リンク »)
・第8回:アクセリアにおけるディープラーニング適用例 (リンク »)

関連情報へのリンク »