- •Contents
- •Introduction
- •1.1 Reverse Engineering
- •1.2 The eLib Program
- •1.3 Class Diagram
- •1.4 Object Diagram
- •1.5 Interaction Diagrams
- •1.6 State Diagrams
- •1.7 Organization of the Book
- •The Object Flow Graph
- •2.1 Abstract Language
- •2.1.1 Declarations
- •2.1.2 Statements
- •2.2 Object Flow Graph
- •2.3 Containers
- •2.4 Flow Propagation Algorithm
- •2.5 Object sensitivity
- •2.6 The eLib Program
- •2.7 Related Work
- •Class Diagram
- •3.1 Class Diagram Recovery
- •3.2 Declared vs. actual types
- •3.2.2 Visualization
- •3.3 Containers
- •3.4 The eLib Program
- •3.5 Related Work
- •3.5.1 Object identification in procedural code
- •Object Diagram
- •4.1 The Object Diagram
- •4.2 Object Diagram Recovery
- •4.3 Object Sensitivity
- •4.4 Dynamic Analysis
- •4.4.1 Discussion
- •4.5 The eLib Program
- •4.5.1 OFG Construction
- •4.5.2 Object Diagram Recovery
- •4.5.3 Discussion
- •4.5.4 Dynamic analysis
- •4.6 Related Work
- •Interaction Diagrams
- •5.1 Interaction Diagrams
- •5.2 Interaction Diagram Recovery
- •5.2.1 Incomplete Systems
- •5.2.2 Focusing
- •5.3 Dynamic Analysis
- •5.3.1 Discussion
- •5.4 The eLib Program
- •5.5 Related Work
- •State Diagrams
- •6.1 State Diagrams
- •6.2 Abstract Interpretation
- •6.3 State Diagram Recovery
- •6.4 The eLib Program
- •6.5 Related Work
- •Package Diagram
- •7.1 Package Diagram Recovery
- •7.2 Clustering
- •7.2.1 Feature Vectors
- •7.2.2 Modularity Optimization
- •7.3 Concept Analysis
- •7.4 The eLib Program
- •7.5 Related Work
- •Conclusions
- •8.1 Tool Architecture
- •8.1.1 Language Model
- •8.2 The eLib Program
- •8.2.1 Change Location
- •8.2.2 Impact of the Change
- •8.3 Perspectives
- •8.4 Related Work
- •8.4.1 Code Analysis at CERN
- •Index
140 7 Package Diagram
clusters have a feature vector that is coincident with that of the enclosed entity. Then, when a cluster contains feature vectors, its own feature vector is given by their sum: Thus, a cluster is associated to a feature vector with each coordinate given by the sum of the values of the same coordinate in all contained vectors.
Fig. 7.3. Clustering hierarchy (left), with two cut points selected, and associated package diagram (right).
When hierarchical clustering is applied for package diagram recovery, a partition of the classes can be obtained by cutting the hierarchy at an appropriate height (see Fig. 7.3). Successive cuts at different heights can be generated and assessed. Higher level cuts followed by lower level cuts indicate the cases where packages contain sub-packages. Lower level cuts eventually define packages that contain only classes.
With reference to Fig. 7.3, two cut points have been selected in the clustering hierarchy. The topmost cut defines a package containing two other packages, and a package containing 3 classes. The lower level cut in turn defines the content of the two packages that are merged at the higher level cut.
Problems that may occur when clustering is applied to software components, such as the classes, are the generation of a black hole, in which one cluster absorbs everything incrementally, while moving upward in the hierarchy, or, at the other extreme, the generation of a gas cloud, in which all singleton clusters tend to remain almost unchanged until the final grouping into a single final cluster [4]. Careful selection of the features to use, of the similarity measure between vectors and of the clustering algorithm to apply allow avoiding such problems.
7.2.2 Modularity Optimization
The approach to clustering based on modularity optimization [54] focuses on the relationships that hold among the entities to be clustered, rather than their features. In this setting, the goal of clustering is optimizing the level of
7.2 Clustering |
141 |
modularity, so that the resulting grouping of the entities concurrently minimizes coupling (i.e., the connections between components of distinct clusters) while maximizing cohesion (i.e., the connections between components in a same cluster).
When this approach is applied to package diagram recovery, the relationships that hold among the classes have to be taken into account. The alternative choices span across those represented in the class diagram:
Inheritance.
Association.
Aggregation.
Composition.
Dependency.
All or a subset of them can be used for clustering. As discussed below, it may be important to be able to give different relationships different weights.
Given a set of entities (classes, in case of package diagram recover) and of relationships (inter-class relationships), cohesion and coupling can be formally defined as follows:
Cohesion:
Coupling:
where |
is the number of relationships internal to cluster |
|
is the num- |
||||||
ber of relationships between clusters |
and |
and |
is the number of |
||||||
entities inside cluster |
If auto-loops cannot occur in the relationships being |
||||||||
considered, the denominator of |
becomes |
|
|
|
|
||||
|
and |
range between 0 and 1. |
is 1 when the entities in cluster |
|
|||||
are fully connected with each other |
with auto-loops, |
|
|
||||||
without auto-loops), while it is 0 when they are completely disconnected. |
|
||||||||
is equal to 1 when each entity of cluster |
is |
connected |
to each entity |
of |
|||||
cluster |
and |
vice-versa. |
is 0 when the entities in |
and |
have |
no |
connection with each other.
A joint measure of the modularization quality, MQ, can be obtained as the difference between the normalized total cohesion and the normalized total coupling:
where is the number of clusters. Since |
is between 0 and 1, the sum over |
||
all clusters will be between 0 and |
hence the normalizing denominator of |
||
the first term in MQ. As regards |
the sum |
of |
over all pairs of different |
clusters, the maximum will be |
|
i.e., equal to the number of such |
pairs. This number is used to normalize the second term in MQ, so as to make it range between 0 and 1.
142 7 Package Diagram
As a consequence of the normalization of the sums, MQ is bounded between -1 (no cohesion, maximum coupling) and 1 (no coupling, maximum cohesion). The latter situation is of course the most desirable one. Thus, the clustering algorithm based on the modularity metric MQ aims at determining the partition of the entities into clusters that maximizes MQ.
The problem of clustering has been turned into a combinatorial optimization problem. Consequently, the heuristics available from the field of combinatorial optimization can be used to approximate the optimal solution. The exact optimal solution is in general non computable, since the number of possible partitions for which MQ should be determined grows exponentially with the number of entities to be clustered.
Fig. 7.4. Hill-climbing clustering algorithm.
In the literature, several algorithms have been investigated to determine the clusters that maximize MQ [32, 54]. Fig. 7.4 shows a simple algorithm, based on the hill-climbing technique. It exploits the notion of neighbor partition. A partition NP is a neighbor of a partition P if it is the same as P except for a single element that belongs to different clusters in the two partitions. Initially, a random partition P is produced out of the set S of the entities to be clustered (line 2, Fig. 7.4). Then, an optimization loop is entered, which ends when the chosen strategy is unable to further improve the current partition of the entities. At line 4, a subset of all neighboring partitions, consisting of those with a higher MQ than P, is determined and assigned to BNP. If at least one better neighbor partition actually exists, P is reassigned (line 6). When more than one improvement directions are possible, one is chosen randomly. In the end, a (sub-)optimal partitioning of the entities is produced which can be interpreted as the package diagram being recovered from the inter-class relationships.
The main limitation of the algorithm in Fig. 7.4 is that its result is quite sensitive to the initial, random partition, from which optimization is started. This can be (partially) mitigated by executing it several times, starting from