- over 10 nodes in a network throws Exception when EXACT solve method / with APPROX it works but I don't whats the quality of the results.
- GetThrouput() / GetResidenceTime() / GetUtilization() methods have a problem when they are about o return results for a ISRV node (delay cetre)... totally unexpected results are shown instead of 100 and 0 ... I really don't know why.
Lack of support for MSQ nodes:
I have implemented multiserver queues in the PDQ C-language only library, and it is now in beta test.
The goal is to allow you to create OPEN-circuit models (initially) by defining any sequence of M/M/m queues. Instead of introducing a completely new procedure, like PDQ_CreateMultiNode (or PDQ::CreateMultiNode described in Chapter 6 of the Perl::PDQ book), I have come up with a simpler idea. You will call PDQ_CreateNode() in the usual way but, by asserting a new MSQ flag to tell PDQ the node is multi-server queue, you then specify the integer number of servers (m) for that node. For example, an m = 10 multiserver PDQ node named "mserver" would be instantiated using the standard CreateNode procedure with the following arguments:
PDQ_CreateNode("mserver", 10, MSQ).
The PDQ generic report has also been altered to include the new MSQ node.
More details will be forthcoming as the implementation unfolds. Stay tuned.
Monday, 27 August 2007
Thursday, 23 August 2007
PageRank Algorithm
PageRank(DirectedGraph graph, double bias, String edgeWeightKeyName)
Specialized constructor that allows the user to specify an edge key if edges already have user-defined weights assigned to them.
This algorithm measures the importance of a node in terms of the fraction of time spent at that node relative to all other nodes. This fraction is measured by first transforming the graph into a first-order Markov chain where the transition probability of going from node u to node v is equal to (1-alpha)*[1/outdegree(u)] + alpha*(1/|V|) where |V| is the # of vertices in the graph and alpha is a parameter typically set to be between 0.1 and 0.2 (according to the authors). If u has no out-edges in the original graph then 0 is used instead of 1/outdegree(v). Once the markov chain is created, the stationary probability of being at each node (state) is computed using an iterative update method that is guaranteed to converge if the markov chain is ergodic.
PageRank:
PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E).
PageRank was developed at Stanford University by Larry Page (hence the name Page-Rank[1]) and later Sergey Brin as part of a research project about a new kind of search engine. The project started in 1995 and led to a functional prototype, named Google, in 1998. Shortly after, Page and Brin founded Google Inc., the company behind the Google search engine. While just one of many factors which determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools.[2]
Simplified PageRank algorithm
Assume a small universe of four web pages: A, B, C and D. The initial approximation of PageRank would be evenly divided between these four documents. Hence, each document would begin with an estimated PageRank of 0.25.
If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A. All PageRank PR( ) in this simplistic system would thus gather to A because all links would be pointing to A.
PR(A)= PR(B) + PR(C) + PR(D).\,
But then suppose page B also has a link to page C, and page D has links to all three pages. The value of the link-votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D's PageRank is counted for A's PageRank (approximately 0.083).
PR(A)= \frac{PR(B)}{2}+ \frac{PR(C)}{1}+ \frac{PR(D)}{3}.\,
In other words, the PageRank conferred by an outbound link L( ) is equal to the document's own PageRank score divided by the normalized number of outbound links (it is assumed that links to specific URLs only count once per document).
PR(A)= \frac{PR(B)}{L(B)}+ \frac{PR(C)}{L(C)}+ \frac{PR(D)}{L(D)}. \,
In the general case, the PageRank value for any page u can be expressed as:
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)},
i.e. the PageRank value for a page u is dependent on the PageRank values for each page v out of the set Bu (this set contains all pages linking to page u), divided by the number L(v) of links from page v.
Reference
[] "The Anatomy of a Large-Scale Hypertextual Web Search Engine by L. Page and S. Brin, 1999"
Specialized constructor that allows the user to specify an edge key if edges already have user-defined weights assigned to them.
This algorithm measures the importance of a node in terms of the fraction of time spent at that node relative to all other nodes. This fraction is measured by first transforming the graph into a first-order Markov chain where the transition probability of going from node u to node v is equal to (1-alpha)*[1/outdegree(u)] + alpha*(1/|V|) where |V| is the # of vertices in the graph and alpha is a parameter typically set to be between 0.1 and 0.2 (according to the authors). If u has no out-edges in the original graph then 0 is used instead of 1/outdegree(v). Once the markov chain is created, the stationary probability of being at each node (state) is computed using an iterative update method that is guaranteed to converge if the markov chain is ergodic.
PageRank:
PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E).
PageRank was developed at Stanford University by Larry Page (hence the name Page-Rank[1]) and later Sergey Brin as part of a research project about a new kind of search engine. The project started in 1995 and led to a functional prototype, named Google, in 1998. Shortly after, Page and Brin founded Google Inc., the company behind the Google search engine. While just one of many factors which determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools.[2]
Simplified PageRank algorithm
Assume a small universe of four web pages: A, B, C and D. The initial approximation of PageRank would be evenly divided between these four documents. Hence, each document would begin with an estimated PageRank of 0.25.
If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A. All PageRank PR( ) in this simplistic system would thus gather to A because all links would be pointing to A.
PR(A)= PR(B) + PR(C) + PR(D).\,
But then suppose page B also has a link to page C, and page D has links to all three pages. The value of the link-votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D's PageRank is counted for A's PageRank (approximately 0.083).
PR(A)= \frac{PR(B)}{2}+ \frac{PR(C)}{1}+ \frac{PR(D)}{3}.\,
In other words, the PageRank conferred by an outbound link L( ) is equal to the document's own PageRank score divided by the normalized number of outbound links (it is assumed that links to specific URLs only count once per document).
PR(A)= \frac{PR(B)}{L(B)}+ \frac{PR(C)}{L(C)}+ \frac{PR(D)}{L(D)}. \,
In the general case, the PageRank value for any page u can be expressed as:
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)},
i.e. the PageRank value for a page u is dependent on the PageRank values for each page v out of the set Bu (this set contains all pages linking to page u), divided by the number L(v) of links from page v.
Reference
[] "The Anatomy of a Large-Scale Hypertextual Web Search Engine by L. Page and S. Brin, 1999"
Wednesday, 22 August 2007
The Evolution of Queueing Networks for Computer System Analysis
The interaction between the system workload and the system resources determine the behaviour of a system in general. During the early period of queueing analysis the system was treated as a single queue which was serving an infinite population of tasks or jobs. This kind of model, which is based on the classical queueing theory, gives no clues to the relation between the external behaviour and the internal structure of the computing system. Additionally, this model has the limitation of the infinite population. Thus it can not represent the common situation of existing only a finite population of users.
Fig 1.0
In order to overcome this limitation, another model was created, which introduces a number of N terminals to the system, one for each user. In this model the number of jobs is N, and each of these jobs can be in one of the following states:
- “thinking” at the terminal,
- waiting in the queue of the central system,
- or being served by the central system.
Fig 1.1
However, this model, which is known as the “machine repairman” model, fails to represent the complexity of a typical computing system.
A significant step for the evolution of queueing networks occurred with Jackson's model. Jackson has developed a technique for solving a special category of queueing networks. These networks of queues were only involving a single class of jobs. There was also a distinction between open and closed queueing networks. In an open queueing network the jobs were only permitted to enter or leave the network, while in a closed queueing network the jobs were remaining inside. In Jackson's approach each node/queue contains one or more parallel servers, while the service time distributions, for each of these nodes, are assumed to be exponential.
A state of a network which contains M nodes/stations can be expressed a vector:
and expresses the number of jobs at node/station i.
The objective of analysing a Jackson's queueing network model is to find a closed form expression for the equilibrium probability of finding the network in any given state , which is:
Knowing the probabilities of this equilibrium state, several performance variables for the evaluation of the system can be derived, such as the utilization and the average queue length of each node and the throughput of the workload of the network.
If is the marginal probability of finding n jobs at the node then:
The probability of a node to be busy is called utilization it is computed as:
Jackson also proved something that it is termed as a “product form” which defines that, in an open network, the equilibrium probability of state , factors into the product of each of the marginal distributions:
Later, and without knowing about the earlier work of Jackson, two scientists, Gordon and Newell, obtained the equilibrium state probabilities for closed queueing networks. Again the service time distributions for each node was assumed to be exponential. The closed network they studied had a finite number of jobs, N and a finite number of nodes, M. Additionally, each node, i, may have a finite numberof parallel identical servers which have exponentially distributed service times as well. Independently of the state of the network, one job that has just serviced from the node i will be transmitted to the node j with probability .
A state of a network which contains M nodes and N jobs can be expressed as a vector again:
Now, because the network is closed, the sum of all the , must be N.
Gordon and Newell proved what exactly Jackson has proved for the open queueing networks some time ago, but applied to closed queueing networks this time. The equilibrium probability of state in a closed queueing network also satisfies a “product form”:
where G(N) is the normalisation constant which has been chosen to make all the feasible state probabilities sum to one:
where depends on the characteristics of the node:
, if node i is load dependent
OR , if node i is load independent.
In order to evaluate the normalisation constant G(N), a large number of operations is required. Thus, though the above product form solutions are easily expressed, the evaluation of the state probabilities can be computationally expensive. Additionally, the accuracy of the results is in doubt because there is a large number of finite precision arithmetic operations involved.
An efficient computational algorithm for evaluating the normalisation constant was developed by Buzen in the early seventies. Moreover, he studied ways of obtaining the performance measures of a queueing network, like utilisation, throughput and queue length, creating algorithms with simple functions that involve the normalisation constant and the model parameters. In that way more complex systems could be effectively studied.
Continuing his research, Buzen, tried to construct models of queueing networks, that would represent accurately the flow of the several jobs through a computer system. The result of his research was the construction of the “Central Server Model” of a multiprogramming system. In this model the nodes represent the processors of the computer system, while the several jobs/programs move in the direction of the arrows. A Job in this model can visit one node/processor at a time and when one job arrives to a node/processor which is busy, then a queue is formed. Using this type of queueing network models, bottlenecks of a system could be traced easier.
Fig 1.2.
There are two principal limitations to the models that Jackson and Gordon and Newell have studied. In these queueing networks, there is only one class of jobs, and all service time distributions are exponential. However, in a real system these principals are not generally valid, because time distributions are seldom exponential, while each job has its own service parameters for each node.
There are several queueing models that have been proposed in order to represent more accurately real computer systems. The truth is that all of these models expand and actually constitute special cases of the model developed by Baskett, Chandy, Muntz and Palacios (BCMP) in 1975.
The model described by Baskett, Chandy, Muntz and Palacios, in its general form, does not contain the two principal limitations of the previous models. Jobs are divided into R distinct classes, while each class may have different parameters such as a unique set of transition frequencies which has the form:
This expression describes the fraction of departures from node i in class r that go next to node j in class s. The above transition frequency expresses also an additional property of the jobs, which is that they can change class while they are moving through the system. A state of this kind of queueing network represents a distribution of jobs over classes and nodes.
A state of a network which contains M nodes can be expressed as a vector:
, where each component of is also a vector:
.
These vectors express the number of class r jobs at node .
These kind of queueing networks, that have been described by Baskett, Chandy, Muntz and Palacios in 1975, are also called BCMP networks (or models), and they constitute, until now, the most commonly used types of models for system performance evaluation. They can describe accurately a large variety of computer systems, much wider than the restrictive and simple models, introduced by Jackson and Gordon and Newell, are capable to. In general, a BCMP model is capable to describe an open, closed or mixed queueing network. A mixed system is closed for a specific set of classes of jobs and open for the other classes of jobs. However, the complexity of these systems, makes the calculation of their performance measures much more expensive in computing resources.
In a BCMP model each server can have one of the following queue disciplines (service policies):
- FCFS (First Come First Served): The job that arrives earlier in the queue is always served earlier than the one that arrives later, independently to the class that this job belongs to.
- PSHR (Processor Sharing): A single server chooses, using the round robin method, and serves one of the jobs that are waiting in its queue, for a quantum of time.
- ISRV (Infinite Server): An Infinite Server or Delay Centre is considered to be a node in which the number of servers it contains is always at least the maximum number of jobs coming at the node. In this way there is never a job waiting to be served in the queue.
- LCFS (Last Come First Served): The most recent job that arrives to the node, is the one that is served first.
In order to solve queueing network models it is essential to make some assumptions about the properties that influence the transition of the model, from one state to another. Buzen and Denning have studied these necessary assumptions that need to be made in order to simplify the procedure of balancing the flow into and out of system states. In order to simplify this procedure, it is assumed that there are no multiple job arrivals or departures in the system. This assumption, known as “one step transitions”, implies that the rate of job flow between the nodes is the only parameter that determines the rate a system enters or leaves a state. A second assumption declares that the queue length of the specific node is only a parameter that influences the flows out of a node, and not the queue lengths of the other nodes of the system. This assumption is known as “homogeneity”. []
References
[] Steven C. Bruell and Gianfranco Balbo, “Computational Algorithms for Closed Queueing Networks”, Elsevier North Holland, Inc. 1980.
Fig 1.0
In order to overcome this limitation, another model was created, which introduces a number of N terminals to the system, one for each user. In this model the number of jobs is N, and each of these jobs can be in one of the following states:
- “thinking” at the terminal,
- waiting in the queue of the central system,
- or being served by the central system.
Fig 1.1
However, this model, which is known as the “machine repairman” model, fails to represent the complexity of a typical computing system.
A significant step for the evolution of queueing networks occurred with Jackson's model. Jackson has developed a technique for solving a special category of queueing networks. These networks of queues were only involving a single class of jobs. There was also a distinction between open and closed queueing networks. In an open queueing network the jobs were only permitted to enter or leave the network, while in a closed queueing network the jobs were remaining inside. In Jackson's approach each node/queue contains one or more parallel servers, while the service time distributions, for each of these nodes, are assumed to be exponential.
A state of a network which contains M nodes/stations can be expressed a vector:
and expresses the number of jobs at node/station i.
The objective of analysing a Jackson's queueing network model is to find a closed form expression for the equilibrium probability of finding the network in any given state , which is:
Knowing the probabilities of this equilibrium state, several performance variables for the evaluation of the system can be derived, such as the utilization and the average queue length of each node and the throughput of the workload of the network.
If is the marginal probability of finding n jobs at the node then:
The probability of a node to be busy is called utilization it is computed as:
Jackson also proved something that it is termed as a “product form” which defines that, in an open network, the equilibrium probability of state , factors into the product of each of the marginal distributions:
Later, and without knowing about the earlier work of Jackson, two scientists, Gordon and Newell, obtained the equilibrium state probabilities for closed queueing networks. Again the service time distributions for each node was assumed to be exponential. The closed network they studied had a finite number of jobs, N and a finite number of nodes, M. Additionally, each node, i, may have a finite numberof parallel identical servers which have exponentially distributed service times as well. Independently of the state of the network, one job that has just serviced from the node i will be transmitted to the node j with probability .
A state of a network which contains M nodes and N jobs can be expressed as a vector again:
Now, because the network is closed, the sum of all the , must be N.
Gordon and Newell proved what exactly Jackson has proved for the open queueing networks some time ago, but applied to closed queueing networks this time. The equilibrium probability of state in a closed queueing network also satisfies a “product form”:
where G(N) is the normalisation constant which has been chosen to make all the feasible state probabilities sum to one:
where depends on the characteristics of the node:
, if node i is load dependent
OR , if node i is load independent.
In order to evaluate the normalisation constant G(N), a large number of operations is required. Thus, though the above product form solutions are easily expressed, the evaluation of the state probabilities can be computationally expensive. Additionally, the accuracy of the results is in doubt because there is a large number of finite precision arithmetic operations involved.
An efficient computational algorithm for evaluating the normalisation constant was developed by Buzen in the early seventies. Moreover, he studied ways of obtaining the performance measures of a queueing network, like utilisation, throughput and queue length, creating algorithms with simple functions that involve the normalisation constant and the model parameters. In that way more complex systems could be effectively studied.
Continuing his research, Buzen, tried to construct models of queueing networks, that would represent accurately the flow of the several jobs through a computer system. The result of his research was the construction of the “Central Server Model” of a multiprogramming system. In this model the nodes represent the processors of the computer system, while the several jobs/programs move in the direction of the arrows. A Job in this model can visit one node/processor at a time and when one job arrives to a node/processor which is busy, then a queue is formed. Using this type of queueing network models, bottlenecks of a system could be traced easier.
Fig 1.2.
There are two principal limitations to the models that Jackson and Gordon and Newell have studied. In these queueing networks, there is only one class of jobs, and all service time distributions are exponential. However, in a real system these principals are not generally valid, because time distributions are seldom exponential, while each job has its own service parameters for each node.
There are several queueing models that have been proposed in order to represent more accurately real computer systems. The truth is that all of these models expand and actually constitute special cases of the model developed by Baskett, Chandy, Muntz and Palacios (BCMP) in 1975.
The model described by Baskett, Chandy, Muntz and Palacios, in its general form, does not contain the two principal limitations of the previous models. Jobs are divided into R distinct classes, while each class may have different parameters such as a unique set of transition frequencies which has the form:
This expression describes the fraction of departures from node i in class r that go next to node j in class s. The above transition frequency expresses also an additional property of the jobs, which is that they can change class while they are moving through the system. A state of this kind of queueing network represents a distribution of jobs over classes and nodes.
A state of a network which contains M nodes can be expressed as a vector:
, where each component of is also a vector:
.
These vectors express the number of class r jobs at node .
These kind of queueing networks, that have been described by Baskett, Chandy, Muntz and Palacios in 1975, are also called BCMP networks (or models), and they constitute, until now, the most commonly used types of models for system performance evaluation. They can describe accurately a large variety of computer systems, much wider than the restrictive and simple models, introduced by Jackson and Gordon and Newell, are capable to. In general, a BCMP model is capable to describe an open, closed or mixed queueing network. A mixed system is closed for a specific set of classes of jobs and open for the other classes of jobs. However, the complexity of these systems, makes the calculation of their performance measures much more expensive in computing resources.
In a BCMP model each server can have one of the following queue disciplines (service policies):
- FCFS (First Come First Served): The job that arrives earlier in the queue is always served earlier than the one that arrives later, independently to the class that this job belongs to.
- PSHR (Processor Sharing): A single server chooses, using the round robin method, and serves one of the jobs that are waiting in its queue, for a quantum of time.
- ISRV (Infinite Server): An Infinite Server or Delay Centre is considered to be a node in which the number of servers it contains is always at least the maximum number of jobs coming at the node. In this way there is never a job waiting to be served in the queue.
- LCFS (Last Come First Served): The most recent job that arrives to the node, is the one that is served first.
In order to solve queueing network models it is essential to make some assumptions about the properties that influence the transition of the model, from one state to another. Buzen and Denning have studied these necessary assumptions that need to be made in order to simplify the procedure of balancing the flow into and out of system states. In order to simplify this procedure, it is assumed that there are no multiple job arrivals or departures in the system. This assumption, known as “one step transitions”, implies that the rate of job flow between the nodes is the only parameter that determines the rate a system enters or leaves a state. A second assumption declares that the queue length of the specific node is only a parameter that influences the flows out of a node, and not the queue lengths of the other nodes of the system. This assumption is known as “homogeneity”. []
References
[] Steven C. Bruell and Gianfranco Balbo, “Computational Algorithms for Closed Queueing Networks”, Elsevier North Holland, Inc. 1980.
Saturday, 18 August 2007
Jung and Visualization
The Java Universal Network/Graph Framework (JUNG) is a freely provided, under the BSD open-source licence, software library that offers a common and extendible language for the manipulation, analysis, and visualization of graphs and networks. It has been developed in Java programming language so it is portable and extendible. JUNG is not itself a stand alone tool, but it constitutes an object oriented framework which can be used to support the construction of applications and tools for manipulating data that can be represented as a graph or network. []
The architecture of this library supports a variety of representations of entities and their relations. It has been designed to support directed and undirected graphs, graphs with parallel edges, multi-modal graphs, and hypergraphs. Moreover, it provides a built in mechanism for annotating graphs, entities (Vertices) and relations (Edges) with data structures (metadata). It supports complex filtering mechanisms, which extract subsets of a network, and dynamic graphs. Additionally, the JUNG software includes implementations of a number of algorithms. These algorithms are used in exploratory data analysis, social network analysis, graph theory and machine learning. They can be used for clustering, optimization, statistical analysis, decomposition, random graph generation and calculation of network distances, flows, and ranking measures.
The developers of JUNG, in order to reduce the duplication of effort required for implementing certain functions that their software needed to perform, have used other, also platform-independent, Java libraries. Jung uses the following Java libraries:
Commons Collections (Apache Jakarta Project (2004)) library which enhances the basic Java API for collections of objects.
Colt (CERN (2004)) is a set of libraries for matrix algebra functions and generally for high-performance scientific and technical computing.
Xerces (Apache XML Project (2004)) is the common library, used for implementing the GraphML I/O capabilities, for XML parsing.
Building JUNG-based applications will require all of the above, plus the Java SDK (version 1.3 or later).
Except from the powerful JUNG API, there exist a number of other packages and tools for manipulating and visualizing network data such as the: Pajek (Batagelj and Mrva (2004)), the UCINET (Borgatti, Everett, and Freeman (2004)), the R (R Development Core Team (2004)) with sna (Butts (2004)), the NetMiner (Cyram Company, Ltd. (2004)), the MultiNet (Richards and Seary (2004)), the GFC (IBM Corporation (1999), the StOCNET (Boer, Huisman, Snijders, and Zeggelink (2003)), the Info Vis Cyberinfrastructure (Penumarthy, Mane, and Borner (2004)), the Info Vis (Fekete (2004)), the Visone (Brandes and Wagner (2003)), the Jgraph (Alder (2005)), The yFiles (yWorks (2004)) and the Boost (Siek, Lee, and Lumsdaine (2001)). []
Additionally, some extra open source projects for graph or network visualization written in Java are the: GiNY, HyperGraph, JDigraph, WilmaScope, JGraphT, TouchGraph, GVF, JgraphEd, VGJ, GUESS, Zoomgraph, Walrus, Prefuse, ZVTM, Large Graph Layout, NV2D, Otter, Gravisto, Linguine Maps, JIGGLE, Zest, Relo and SoNIA.
It should be reported that initially, the JGraphT software was selected for creating the GUI for the present project. JGraphT seems to be a simple and helpful software for creating graphical representations of network data, only if these data are already known. It does not seem to be appropriate for creating dynamic graphs. That was the reason why the powerful and flexible JUNG API was used in the end.
There is a long list of software projects that use JUNG for network data visualization. Some of them are listed here: Djinn (Fabien Benolt), RDF Gravity (Sunil Goyal, Rupert Westenthaler), GUESS (Eyatan Adar, David Feinberg), ADAPTNet (Michal Okoniewski, Tim Yates), Augur (Jon Froehlich), Ariadne (Cleidson de Souza et al.), Netsight (Yan-Biao Boey, Joshua O' Madadhain, Scott White, Padhraic Smyth), InfoVis CyberInfrastructure (Penumarthy, Mane, and Borner (2004)), PWComp (Joshua Adelman, Josh England, Alex Chen), Google Cartography (Richard Jones), GINY (Rowan Christmas), GraphExplore (Quanli Wang), TOTEM (Toolbox For Traffic Engineering Methods (S. Balon, O. Delcourt, J. Lepropre and F. Skivee), D2K (“Data to Knowledge”), graphBuilder (Ben Raymond), Semiophore, Xholon (Ken Webb), and Flink (Peter Mika). []
Additionally, there is another long list with papers and research projects that involve the JUNG software, which has been decided not to be mentioned. JUNG is a dynamically involving software and its second edition, called Jung 2.0, is currently in its alpha version. It is available to be downloaded from the Open Source Technology Group (SourceForge.net). In this point it is worthwhile to mention that during the first period of developing the GUI for the PDQ library, an effort has been made in order to implement the GUI using the newer second version of JUNG, instead of the 1.7.6. version. Using the second version of the software initially seemed to be an interesting challenge, since it is difficult to combine one quite under engineered software like PDQ with one like Jung2.0, which in its first sight looks quite over engineered. Finally, it was decided that the, very well done, 1.7.6 version could provide enough help to create a GUI for the Pretty Damn Quick library, since Jung2.0 is still in its alpha version and probably some changes will take place in the near future.
References
J O 'Madadhain, D Fisher, S White, Y Boey, “The JUNG (Java Universal Network/Graph) Framework”, (preprint by http://jung.sourceforge.net), 2007
The architecture of this library supports a variety of representations of entities and their relations. It has been designed to support directed and undirected graphs, graphs with parallel edges, multi-modal graphs, and hypergraphs. Moreover, it provides a built in mechanism for annotating graphs, entities (Vertices) and relations (Edges) with data structures (metadata). It supports complex filtering mechanisms, which extract subsets of a network, and dynamic graphs. Additionally, the JUNG software includes implementations of a number of algorithms. These algorithms are used in exploratory data analysis, social network analysis, graph theory and machine learning. They can be used for clustering, optimization, statistical analysis, decomposition, random graph generation and calculation of network distances, flows, and ranking measures.
The developers of JUNG, in order to reduce the duplication of effort required for implementing certain functions that their software needed to perform, have used other, also platform-independent, Java libraries. Jung uses the following Java libraries:
Commons Collections (Apache Jakarta Project (2004)) library which enhances the basic Java API for collections of objects.
Colt (CERN (2004)) is a set of libraries for matrix algebra functions and generally for high-performance scientific and technical computing.
Xerces (Apache XML Project (2004)) is the common library, used for implementing the GraphML I/O capabilities, for XML parsing.
Building JUNG-based applications will require all of the above, plus the Java SDK (version 1.3 or later).
Except from the powerful JUNG API, there exist a number of other packages and tools for manipulating and visualizing network data such as the: Pajek (Batagelj and Mrva (2004)), the UCINET (Borgatti, Everett, and Freeman (2004)), the R (R Development Core Team (2004)) with sna (Butts (2004)), the NetMiner (Cyram Company, Ltd. (2004)), the MultiNet (Richards and Seary (2004)), the GFC (IBM Corporation (1999), the StOCNET (Boer, Huisman, Snijders, and Zeggelink (2003)), the Info Vis Cyberinfrastructure (Penumarthy, Mane, and Borner (2004)), the Info Vis (Fekete (2004)), the Visone (Brandes and Wagner (2003)), the Jgraph (Alder (2005)), The yFiles (yWorks (2004)) and the Boost (Siek, Lee, and Lumsdaine (2001)). []
Additionally, some extra open source projects for graph or network visualization written in Java are the: GiNY, HyperGraph, JDigraph, WilmaScope, JGraphT, TouchGraph, GVF, JgraphEd, VGJ, GUESS, Zoomgraph, Walrus, Prefuse, ZVTM, Large Graph Layout, NV2D, Otter, Gravisto, Linguine Maps, JIGGLE, Zest, Relo and SoNIA.
It should be reported that initially, the JGraphT software was selected for creating the GUI for the present project. JGraphT seems to be a simple and helpful software for creating graphical representations of network data, only if these data are already known. It does not seem to be appropriate for creating dynamic graphs. That was the reason why the powerful and flexible JUNG API was used in the end.
There is a long list of software projects that use JUNG for network data visualization. Some of them are listed here: Djinn (Fabien Benolt), RDF Gravity (Sunil Goyal, Rupert Westenthaler), GUESS (Eyatan Adar, David Feinberg), ADAPTNet (Michal Okoniewski, Tim Yates), Augur (Jon Froehlich), Ariadne (Cleidson de Souza et al.), Netsight (Yan-Biao Boey, Joshua O' Madadhain, Scott White, Padhraic Smyth), InfoVis CyberInfrastructure (Penumarthy, Mane, and Borner (2004)), PWComp (Joshua Adelman, Josh England, Alex Chen), Google Cartography (Richard Jones), GINY (Rowan Christmas), GraphExplore (Quanli Wang), TOTEM (Toolbox For Traffic Engineering Methods (S. Balon, O. Delcourt, J. Lepropre and F. Skivee), D2K (“Data to Knowledge”), graphBuilder (Ben Raymond), Semiophore, Xholon (Ken Webb), and Flink (Peter Mika). []
Additionally, there is another long list with papers and research projects that involve the JUNG software, which has been decided not to be mentioned. JUNG is a dynamically involving software and its second edition, called Jung 2.0, is currently in its alpha version. It is available to be downloaded from the Open Source Technology Group (SourceForge.net). In this point it is worthwhile to mention that during the first period of developing the GUI for the PDQ library, an effort has been made in order to implement the GUI using the newer second version of JUNG, instead of the 1.7.6. version. Using the second version of the software initially seemed to be an interesting challenge, since it is difficult to combine one quite under engineered software like PDQ with one like Jung2.0, which in its first sight looks quite over engineered. Finally, it was decided that the, very well done, 1.7.6 version could provide enough help to create a GUI for the Pretty Damn Quick library, since Jung2.0 is still in its alpha version and probably some changes will take place in the near future.
References
J O 'Madadhain, D Fisher, S White, Y Boey, “The JUNG (Java Universal Network/Graph) Framework”, (preprint by http://jung.sourceforge.net), 2007
Friday, 17 August 2007
Millennium Performance Problems and the contribution of the Jung API
The first of the Millennium Performance Problems, as expressed by Neil J. Gunther, is the performance Visualization. The idea is to find alternative ways of representing performance data. The traditional reports which contain a set of tables with numbers and a variety of two dimensional diagrams, seem not to provide the expected help, when a special set of problems is involved. These are usually complex problems which need a special way of representation in order to become comprehensive from the human brain, and problems that demand a special analysis and solution methodology. Alternative ways of data representation should be considered which would take into account the new discoveries that have been made recently around the humans vision area of research, and generally around the way the human brain can perceive some certain types of information and knowledge. For this reason physicists and biologists use, and also are involved into, projects that have to do with animation and special GUIs, that are used to represent complex data in ways that could help their users to derive and solve problems easier. [Neil J. Gunther]
Thus, in order to achieve more effective representations of complex network data, several graphical analysis tools and frameworks have been developed to provide help to programmers and consultants. In this section, a quick reference to the several open source and freely available software projects, that have been developed for this purpose, will be made, and a short presentation of the Jung API capabilities, will be provided. But in order to complete the reference to the Millennium Performance Problems, it would be better to provide a short description of the other four remaining problems first, as expresses by Neil J. Gunther.
The second millennium performance problem has to do with the so called “Self-instrumented Applications”. It is well-known that object orientation logic has improved the way software is managed and organised. Primarily, this is achieved because in object orientated programming, code can be reused and maintained easier. The idea behind the self-instrumented applications is based on the same rationale, applied to the programmers or generally to the productive resources, themselves. These kind of objects should be then added to the object library as well. In this way, the performance analyst could have the ability to turn objects on and off selectively, achieving a more flexible and probably more efficient management, since paths to find bottlenecks through applications code, could be traced easier. Obviously, applying this logic could contribute to the analysis of production systems
The Von Neumann Bottleneck is the third millennium performance problem. Parallelism, is the technical term for doing more than one thing at once in order to achieve computational efficiency. In parallel computing centres, where supercomputers made either with clusters or with Grid computing techniques, the maximum efficiency for the system can be achieved if only a single thread is running in each of the processing units and this thread is not terminated until the end of the computation. Unfortunately to achieve practically something like that or to achieve general-purpose parallelism, seems to be an utopia for performance analysts, despite the intense effort that has been done over the last two decades. The paradigm of the human brain architecture has lead to the idea of using neural networks to achieve a higher degree of parallelism. Another idea is to construct quantum computers.
The fourth millennium performance problem as expressed by Neil J. Gunther has to do with the performance analysis of the Internet. Internet communication is based mostly on the concept of receiving and transmitting information coded as a packet of data. Internet packet analysis, which has taken place over the last fifteen years, showed that “long-term correlations can persist over several orders of magnitude in time” [Neil J. Gunther]. It is true that in the packet level, all the conventional queueing theory techniques are no longer valid. Packet arrivals don't follow the Poison distribution, while there are no constant service times. Thus the question is how the Internet could be modelled. A very challenging task could be the development of an Internet Simulator or a new theory for expressing complex systems of the same type.
The final millennium challenge for the performance analysts has to do with the performance analysis of Quantum Computers. The development of Quantum Computers may still sound like a science fiction story, but it is true that there have already been developed and used quantum communication devices for encryption. Maybe they are still very specialised and expensive but it is expected that during the following years, devices like those will be easier and cheaper to a be produced in a large scale. The question is how the wide use of this technology will affect performance. [Neil. J Gunther]
These five millennium performance problems, as they have been expressed by Neil J. Gunther, seem to reveal some interesting points about the upcoming challenges in the area of systems performance evaluation. The present project aims to perform a research and hopefully contribute to the first of these challenges. The visualization of complex data that have to do with the relations between distributed entities expressed as networks of data, and the analysis of the several ways that this can or already has been done, could be the subject of this project.
Thus, in order to achieve more effective representations of complex network data, several graphical analysis tools and frameworks have been developed to provide help to programmers and consultants. In this section, a quick reference to the several open source and freely available software projects, that have been developed for this purpose, will be made, and a short presentation of the Jung API capabilities, will be provided. But in order to complete the reference to the Millennium Performance Problems, it would be better to provide a short description of the other four remaining problems first, as expresses by Neil J. Gunther.
The second millennium performance problem has to do with the so called “Self-instrumented Applications”. It is well-known that object orientation logic has improved the way software is managed and organised. Primarily, this is achieved because in object orientated programming, code can be reused and maintained easier. The idea behind the self-instrumented applications is based on the same rationale, applied to the programmers or generally to the productive resources, themselves. These kind of objects should be then added to the object library as well. In this way, the performance analyst could have the ability to turn objects on and off selectively, achieving a more flexible and probably more efficient management, since paths to find bottlenecks through applications code, could be traced easier. Obviously, applying this logic could contribute to the analysis of production systems
The Von Neumann Bottleneck is the third millennium performance problem. Parallelism, is the technical term for doing more than one thing at once in order to achieve computational efficiency. In parallel computing centres, where supercomputers made either with clusters or with Grid computing techniques, the maximum efficiency for the system can be achieved if only a single thread is running in each of the processing units and this thread is not terminated until the end of the computation. Unfortunately to achieve practically something like that or to achieve general-purpose parallelism, seems to be an utopia for performance analysts, despite the intense effort that has been done over the last two decades. The paradigm of the human brain architecture has lead to the idea of using neural networks to achieve a higher degree of parallelism. Another idea is to construct quantum computers.
The fourth millennium performance problem as expressed by Neil J. Gunther has to do with the performance analysis of the Internet. Internet communication is based mostly on the concept of receiving and transmitting information coded as a packet of data. Internet packet analysis, which has taken place over the last fifteen years, showed that “long-term correlations can persist over several orders of magnitude in time” [Neil J. Gunther]. It is true that in the packet level, all the conventional queueing theory techniques are no longer valid. Packet arrivals don't follow the Poison distribution, while there are no constant service times. Thus the question is how the Internet could be modelled. A very challenging task could be the development of an Internet Simulator or a new theory for expressing complex systems of the same type.
The final millennium challenge for the performance analysts has to do with the performance analysis of Quantum Computers. The development of Quantum Computers may still sound like a science fiction story, but it is true that there have already been developed and used quantum communication devices for encryption. Maybe they are still very specialised and expensive but it is expected that during the following years, devices like those will be easier and cheaper to a be produced in a large scale. The question is how the wide use of this technology will affect performance. [Neil. J Gunther]
These five millennium performance problems, as they have been expressed by Neil J. Gunther, seem to reveal some interesting points about the upcoming challenges in the area of systems performance evaluation. The present project aims to perform a research and hopefully contribute to the first of these challenges. The visualization of complex data that have to do with the relations between distributed entities expressed as networks of data, and the analysis of the several ways that this can or already has been done, could be the subject of this project.
User Data in Jung
There are two ways to associate data with graphs, edges or vertices.
1st. Class Extension, creating a class that extends one of the Graph, Vertice or Edge implementations.
2nd. Jung Annotation, using the existing built-in mechanism (the UserData class) for annotating graph elements with data. Several methods are used to manipulate these data.
I think that the first way is more appropriate for the needs of my project.
1st. Class Extension, creating a class that extends one of the Graph, Vertice or Edge implementations.
2nd. Jung Annotation, using the existing built-in mechanism (the UserData class) for annotating graph elements with data. Several methods are used to manipulate these data.
I think that the first way is more appropriate for the needs of my project.
Wednesday, 15 August 2007
OverEngineering vs UnderEngineering
It has fun when you are trying to combine one quite under engineered software like PDQ with one like Jung2, which produces very impressive graphic representations but, in its first sight looks quite over engineered. I decided that, the very well done, 1.7.6 could provide enough help to my project since Jung2.0 is still in its alpha version and probably some changes will take place in the near future.
leffe
leffe
Monday, 13 August 2007
Compiling a JUNG2 based code
- Exception in thread "main" java.lang.NoClassDefFoundError: com/perfdynamics/pdq/examples/ClosedPassport
*/ Probably something is wrong with the classpath
- The type javax.vecmath.Point3f cannot be resolved. It is indirectly
referenced from required .class files
- Could not instantiate class, error in the constructor
*/ Probably something is wrong with the classpath
- The type javax.vecmath.Point3f cannot be resolved. It is indirectly
referenced from required .class files
- Could not instantiate class, error in the constructor
Saturday, 11 August 2007
The Millennium Performance Problems
The Millennium Performance Problems are:
1. Performance Visualization:
The idea here is to find ways of representing performance data that are a better impedance match for our cognitive computer (our brain). One role model is the techniques used in so-called Scientific Visualization where physicists and biologists have learned to use things like special GUIs and animation to represent complex data in ways that help them solve problems. Why should they have all the fun?
2. Self-instrumented Applications:
Object-oriented programming has been promoted as a good thing primarily for reasons of reusability. If people are going to re-use objects, how about they come with their own instrumentation? This really should be part of the object library so that a programmer need nevre be concerned with adding such code. Then I, as the performance analyst would have the ability to turn objects on selectively and thereby trace paths through application code to find bottlenecks, even on production systems.
3. The Von Neummann Bottleneck:
An efficient way to compute something on a machine is to do more than one thing at once. The technical term is parallelism. Unfortunately, despite a lot of intense effort over the last two decades, general-purpose parallelism remains a holy grail of performance analysis. One reason for this barrier seems to stem from the influential success of early electronic computer designers such as Alan Turing and John von Neumann. The fundamental paradigm of sequential programming and operation seem to be almost impossible to break away from in the modern digital computer. But there is an obvious role model for a non-von computer architecture: our brain. This has led to the idea neural networks as way of achieving a higher degree of parallelism. Quantum computers are another.
4. Performance Analysis of the Internet:
The results of analyzing Internet packet traces 15 years ago at Bellcore (now Lucent) showed that long-term correlations can persist over several orders of magnitude in time. Packet arrivals are not Poisson, and service times are not constant. In other words, all the conventional queueing theory techniques near and dear to our hearts as performance analysts, are no longer valid at the packet level. How are we to model the Internet? Perhaps we need to think big. The climatologists use things like the Earth Simulator to address complex questions about global warming. Maybe we need an Internet Simulator of similar scale?
5. Performance Analysis of Quantum Computers:
Quantum Computers are probably a long way off, but quantum communication devices are already here. The only reason you are not aware of them is because they are expensive and therefore a specialty item for institutions like banks. In the next 3 to 5 years, I believe these things will reach commodity prices and will therefore become more ubiquitous. I am also working on these technologies now. The question for us is, How will they affect performance?
CMG Materials
1. Performance Visualization:
The idea here is to find ways of representing performance data that are a better impedance match for our cognitive computer (our brain). One role model is the techniques used in so-called Scientific Visualization where physicists and biologists have learned to use things like special GUIs and animation to represent complex data in ways that help them solve problems. Why should they have all the fun?
2. Self-instrumented Applications:
Object-oriented programming has been promoted as a good thing primarily for reasons of reusability. If people are going to re-use objects, how about they come with their own instrumentation? This really should be part of the object library so that a programmer need nevre be concerned with adding such code. Then I, as the performance analyst would have the ability to turn objects on selectively and thereby trace paths through application code to find bottlenecks, even on production systems.
3. The Von Neummann Bottleneck:
An efficient way to compute something on a machine is to do more than one thing at once. The technical term is parallelism. Unfortunately, despite a lot of intense effort over the last two decades, general-purpose parallelism remains a holy grail of performance analysis. One reason for this barrier seems to stem from the influential success of early electronic computer designers such as Alan Turing and John von Neumann. The fundamental paradigm of sequential programming and operation seem to be almost impossible to break away from in the modern digital computer. But there is an obvious role model for a non-von computer architecture: our brain. This has led to the idea neural networks as way of achieving a higher degree of parallelism. Quantum computers are another.
4. Performance Analysis of the Internet:
The results of analyzing Internet packet traces 15 years ago at Bellcore (now Lucent) showed that long-term correlations can persist over several orders of magnitude in time. Packet arrivals are not Poisson, and service times are not constant. In other words, all the conventional queueing theory techniques near and dear to our hearts as performance analysts, are no longer valid at the packet level. How are we to model the Internet? Perhaps we need to think big. The climatologists use things like the Earth Simulator to address complex questions about global warming. Maybe we need an Internet Simulator of similar scale?
5. Performance Analysis of Quantum Computers:
Quantum Computers are probably a long way off, but quantum communication devices are already here. The only reason you are not aware of them is because they are expensive and therefore a specialty item for institutions like banks. In the next 3 to 5 years, I believe these things will reach commodity prices and will therefore become more ubiquitous. I am also working on these technologies now. The question for us is, How will they affect performance?
CMG Materials
Friday, 10 August 2007
JGraphT
- JGraphT supports various types of graphs including: directed and undirected graphs, graphs with weighted / unweighted / labeled or any user-defined edges, various edge multiplicity options, including: simple-graphs, multigraphs, pseudographs, unmodifiable graphs - allow modules to provide "read-only" access to internal graphs, listenable graphs - allow external listeners to track modification events, subgraphs graphs that are auto-updating subgraph views on other graphs and all compositions of mentioned graphs. Powerful, but simple.
JUNG
- JUNG — the Java Universal Network/Graph Framework--is a software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network. The JUNG architecture is designed to support a variety of representations of entities and their relations, such as directed and undirected graphs, multi-modal graphs, graphs with parallel edges, and hypergraphs. It provides a mechanism for annotating graphs, entities, and relations with metadata. The current distribution of JUNG includes implementations of a number of algorithms from graph theory, data mining, and social network analysis, such as routines for clustering, decomposition, optimization, random graph generation, statistical analysis, and calculation of network distances, flows, and importance measures (centrality, PageRank, HITS, etc.). Comprehensive, everything but the kitchen sink!
In order to run a JUNG-based application, you will need all the packages listed on the JUNG download page. (As of this writing, this includes .jar files for JUNG, version 1.3 (or later) of a Java runtime environment (JRE), Apache Jakarta commons-collections, the CERN Colt scientific library, and Xerces.
Building JUNG-based applications will require all of the above, plus the Java SDK (version 1.3 or later).
jung
Check here for more open source graph network visual tools in Java ;-)
http://www.manageability.org/blog/stuff/open-source-graph-network-visualization-in-java/view
In order to run a JUNG-based application, you will need all the packages listed on the JUNG download page. (As of this writing, this includes .jar files for JUNG, version 1.3 (or later) of a Java runtime environment (JRE), Apache Jakarta commons-collections, the CERN Colt scientific library, and Xerces.
Building JUNG-based applications will require all of the above, plus the Java SDK (version 1.3 or later).
jung
Check here for more open source graph network visual tools in Java ;-)
http://www.manageability.org/blog/stuff/open-source-graph-network-visualization-in-java/view
Algorithm for calculating the Time Share Ratio between the Queues
v(j) = Σv(i)p(i→j)
p(i→j) = probability for the Job (Workload) to leave Queue (Node) i and go to j
in order to make the calculation, firstly one of the v(i) should be considered =1, (v(i) = 1)
p(i→j) = probability for the Job (Workload) to leave Queue (Node) i and go to j
in order to make the calculation, firstly one of the v(i) should be considered =1, (v(i) = 1)
Wednesday, 8 August 2007
Potential GUI Approach for the existing PDQ Java version
As I've mentioned before, I believe that in order to develop a correct, in matters of software engineering, software tool with a fully functional GUI, and potentials for further development, it is essential to perform some refactoring to the already existing Java code.
On the other hand, if the only purpose is to develop a GUI based just on the already existing code with just the current functionality, an approach could be:
- Initially represent graphically a queueing model with Nodes(Queues) and Arcs with parameters,
- keep the information and parameters into some additional memory structures (ArryList/Hashmap),
- perform a validation, and after the visually defined model is valid,
- export the buffered parameters of the visual model into the actual PDQ model and then
- solve it using the PDQ_solve(). After that
- take the results using the appropriate "PDQ_get" functions and
- represent them graphically, using some external of the PDQ library functions again.
Having this approach, in the case that a queueing model has defined, then validated and finaly solved, if the user would like to change just one parameter of the model then after the first execution of the MVA algorithm the pdq model should be deleted (actually calling the PDQ_Init()) and then recreated from the beginning since no changes can be done to an already defined PDQ model.
This approach doesn't extend the functionality of the existing PDQ library. It just creates a visual shell, which gives the impression that the PDQ tool is more high level.
On the other hand, if the only purpose is to develop a GUI based just on the already existing code with just the current functionality, an approach could be:
- Initially represent graphically a queueing model with Nodes(Queues) and Arcs with parameters,
- keep the information and parameters into some additional memory structures (ArryList/Hashmap),
- perform a validation, and after the visually defined model is valid,
- export the buffered parameters of the visual model into the actual PDQ model and then
- solve it using the PDQ_solve(). After that
- take the results using the appropriate "PDQ_get" functions and
- represent them graphically, using some external of the PDQ library functions again.
Having this approach, in the case that a queueing model has defined, then validated and finaly solved, if the user would like to change just one parameter of the model then after the first execution of the MVA algorithm the pdq model should be deleted (actually calling the PDQ_Init()) and then recreated from the beginning since no changes can be done to an already defined PDQ model.
This approach doesn't extend the functionality of the existing PDQ library. It just creates a visual shell, which gives the impression that the PDQ tool is more high level.
braju.jar
Why braju.jar?
An additional quite old library called "braju" is used from the PDQ library. In contrast with the PDQ library, the braju library is not open source (it is freely available though). This library is commonly used in order to represent the C - like functions: fprintf(), printf() and sprintf() in Java. I don't really understand what is the purpose of using this package in the Java PDQ implementation. I suppose that the only purpose of doing this is to create an as much as relevant and close to the C version, PDQ Java code version.
The API of the Package com.braju.format contains:
Class Summary
Format: This class provides static C-style methods like sprintf(), printf() and fprintf() with variable number of parameters.
Parameters: This class makes it possible to simulate parameter lists to method with arbitrary length.
ParametersAutoClear: This class is an empty class with no extra methods.
Exception Summary
ParseErrorException: Exception thrown when there is an unexpected character or a syntax error in the format string passed to a *printf method.
It is possible that refactoring just the pdq.jar open source code, is going to be enough for creating the demanded functionality of the library, but I can not be sure about it. The truth is that even if it is feasible, an additional effort will be required in order to learn how the braju.jar library works if I still want to use it, but probably, in the case of refactoring, its use is going to be redundant.
An additional quite old library called "braju" is used from the PDQ library. In contrast with the PDQ library, the braju library is not open source (it is freely available though). This library is commonly used in order to represent the C - like functions: fprintf(), printf() and sprintf() in Java. I don't really understand what is the purpose of using this package in the Java PDQ implementation. I suppose that the only purpose of doing this is to create an as much as relevant and close to the C version, PDQ Java code version.
The API of the Package com.braju.format contains:
Class Summary
Format: This class provides static C-style methods like sprintf(), printf() and fprintf() with variable number of parameters.
Parameters: This class makes it possible to simulate parameter lists to method with arbitrary length.
ParametersAutoClear: This class is an empty class with no extra methods.
Exception Summary
ParseErrorException: Exception thrown when there is an unexpected character or a syntax error in the format string passed to a *printf method.
It is possible that refactoring just the pdq.jar open source code, is going to be enough for creating the demanded functionality of the library, but I can not be sure about it. The truth is that even if it is feasible, an additional effort will be required in order to learn how the braju.jar library works if I still want to use it, but probably, in the case of refactoring, its use is going to be redundant.
Friday, 3 August 2007
Sofware Engineering Issues
In my mind I had two potential approaches for developing the Graphical Queueing Networks Analysis Tool:
1. Just use the library as it is (don't change the functionality).
Ignore the details of the implementation of the PDQ software, and focus on the functionality. In other words, just use the functionality PDQ API provides, mostly just using the PDQ class which can be pretty much enough, in order to define models. Decipher how this can be done through the several examples and the PDQ users manual. Try to automate the modeling procedure and then represent the required code visually creating classes - elements of a GUI.
2. Develop the API (add more functionality) after studying its code.
Understand thoroughly not only the way PDQ API can be used in order to define Queueing networks models (probably using examples and the manual), but also the several details of the software implementation. Which are the attributes and functions each of the classes of the API provide? Which of these classes should be expanded and how? What kind of functions should be added in order create a higher level of the library and after that a GUI? This is essential because of Software Engineering reasons. Do I just need to create a GUI for an already complete API or I need to expand the functionality of the PDQ API as well, adding more attributes and methods?
I think that the second approach is the only rational approach cause the PDQ API is really very low level. Thus re-engineering is essential, otherwise the result will be a clumsy, non-functional GUI, based only on the already existing functionality of the library.
Some of the PDQ library functions are grouped here in order of most frequent invocation:
PDQ_Init() Initializes all internal PDQ variables. Must be called prior to any other PDQ function.
PDQ_CreateOpen() Defines the characteristics of a workload in an open-circuit queueing model.
PDQ_CreateClosed() Defines the characteristics of a workload in a closed-circuit queueing model.
PDQ_CreateNode() Defines a single queueing-center in either a closed or open circuit queueing model.
PDQ_SetDemand() Defines the service demand of a specific workload at a specified node.
PDQ_SetVisits() Define the service demand in terms of the service time and visit count.
PDQ_SetDebug() enables diagnostic printout of PDQ internal variables.
PDQ_Solve() The solution method must be passed as an argument.
PDQ_GetResponse() Returns the system response time for the specified workload.
PDQ_GetThruput() Returns the system throughput for the specified workload.
PDQ_GetUtilization() Returns the utilization of the designated queueing node by the specified workload.
PDQ_Report() Generates a formatted report containing system, and node level performance measures.
- It seems that there are no implemented methods for deleting Nodes and Workloads in the PDQ class. PDQ_RemoveNode(), PDQ_RemoveClosed(), etc.
- Nodes and Workloads can only be created. Then they are stored in a simple static array structure defined in the PDQ class.
- Should I create a subclass of the PDQ class called eg. class ThePDQ expands PDQ {}, which implements these required methods?
- Should I create a subclass of the Node class called eg class TheNode expands PDQ {}, which gives more attributes to the Node Object, such as displayed Image as well as complementary methods for manipulating the behavior of a Node such as a method that calculates automatically the Demand of this Node?
- Should I do something similar and create an additional Workload() class?
- What about the Arc() class? Which are the attributes of this about-to-be implemented class? How the Arcs that join the several Nodes should behave? Should this potential Arc() class expand some other relative class of the PDQ API?
- Which other classes should be expanded in order to have a coherent set of classes and methods?
- Apparently, the existing Java version of the PDQ library can be considered under-engineered. How can I avoid this and how can I avoid potential over-engineering as well? Is it preferable instead of expanding one class, just to develop a separate class in which an instance of the about to be expanded class is created?
1. Just use the library as it is (don't change the functionality).
Ignore the details of the implementation of the PDQ software, and focus on the functionality. In other words, just use the functionality PDQ API provides, mostly just using the PDQ class which can be pretty much enough, in order to define models. Decipher how this can be done through the several examples and the PDQ users manual. Try to automate the modeling procedure and then represent the required code visually creating classes - elements of a GUI.
2. Develop the API (add more functionality) after studying its code.
Understand thoroughly not only the way PDQ API can be used in order to define Queueing networks models (probably using examples and the manual), but also the several details of the software implementation. Which are the attributes and functions each of the classes of the API provide? Which of these classes should be expanded and how? What kind of functions should be added in order create a higher level of the library and after that a GUI? This is essential because of Software Engineering reasons. Do I just need to create a GUI for an already complete API or I need to expand the functionality of the PDQ API as well, adding more attributes and methods?
I think that the second approach is the only rational approach cause the PDQ API is really very low level. Thus re-engineering is essential, otherwise the result will be a clumsy, non-functional GUI, based only on the already existing functionality of the library.
Some of the PDQ library functions are grouped here in order of most frequent invocation:
PDQ_Init() Initializes all internal PDQ variables. Must be called prior to any other PDQ function.
PDQ_CreateOpen() Defines the characteristics of a workload in an open-circuit queueing model.
PDQ_CreateClosed() Defines the characteristics of a workload in a closed-circuit queueing model.
PDQ_CreateNode() Defines a single queueing-center in either a closed or open circuit queueing model.
PDQ_SetDemand() Defines the service demand of a specific workload at a specified node.
PDQ_SetVisits() Define the service demand in terms of the service time and visit count.
PDQ_SetDebug() enables diagnostic printout of PDQ internal variables.
PDQ_Solve() The solution method must be passed as an argument.
PDQ_GetResponse() Returns the system response time for the specified workload.
PDQ_GetThruput() Returns the system throughput for the specified workload.
PDQ_GetUtilization() Returns the utilization of the designated queueing node by the specified workload.
PDQ_Report() Generates a formatted report containing system, and node level performance measures.
- It seems that there are no implemented methods for deleting Nodes and Workloads in the PDQ class. PDQ_RemoveNode(), PDQ_RemoveClosed(), etc.
- Nodes and Workloads can only be created. Then they are stored in a simple static array structure defined in the PDQ class.
- Should I create a subclass of the PDQ class called eg. class ThePDQ expands PDQ {}, which implements these required methods?
- Should I create a subclass of the Node class called eg class TheNode expands PDQ {}, which gives more attributes to the Node Object, such as displayed Image as well as complementary methods for manipulating the behavior of a Node such as a method that calculates automatically the Demand of this Node?
- Should I do something similar and create an additional Workload() class?
- What about the Arc() class? Which are the attributes of this about-to-be implemented class? How the Arcs that join the several Nodes should behave? Should this potential Arc() class expand some other relative class of the PDQ API?
- Which other classes should be expanded in order to have a coherent set of classes and methods?
- Apparently, the existing Java version of the PDQ library can be considered under-engineered. How can I avoid this and how can I avoid potential over-engineering as well? Is it preferable instead of expanding one class, just to develop a separate class in which an instance of the about to be expanded class is created?
Thursday, 2 August 2007
New Challenges For The Professional Modellers and the Systems Modelling Methodology: Grid Computing
Professional modellers deal with a variety of real problems and their main objective is to “translate” each of these problems into a strict mathematical form. In order to achieve something like that, several stages of the modelling process should be preceded. Initially, all the aspects of the problem should be clarified, and then the several variables of the problem should be identified. Finally the modeller should be able to estimate and approximate a set of courses of action that may cost money and time. It is a very important skill for the modeller to be able to identify the critical parts of a problem in order to create an optimised, abstract version, which constitutes the mathematical model of the problem. Generally, any model can be defined as a simplified representation of certain aspects, preferably the most important, of a real system. A mathematical model can be defined as a model which is created using several mathematical concepts such as functions and equations.
It should never be forgotten the fact that modelling, compared to simulation, is usually a less accurate but faster methodology for estimating the behaviour of a system, and that the main purpose of using this methodology is its high performance. Additionally, it is essential to underline, for one more time, the role of systems modelling in industry and business. Optimising the procedures of decision making and systems performance evaluation, a significant competitive advantage in matters of time and money may be gained for the company. A not so naïve observer could possibly identify a gap between the reasoning required for creating a mathematical principle, and this required for creating a mathematical model. It is true that mathematical principles are based on sound reasoning and development but, when we come to model a given real problem, we must feel free to construct the model using whatever mathematical relationships and techniques seem appropriate. In the case of creating a mathematical model, nothing can be considered as totally true or totally false and there are always several ways in which a model of a real problem can be created. [1]
Moreover, new achievements in the area of distributed and parallel computing systems have raised the need for developing models and modelling mechanisms, that can evaluate effectively the behaviour of highly complex systems, such as supercomputers, which have been created using clusters or idle desktop computers connected to the Internet. The second method of creating a supercomputer is based on the principles and the technology that define the term Grid Computing. Using the computational power of idle non-dedicated desktop machines that are connected in the Internet reveals another, more modern, approach to the way supercomputers can be created. This methodology called cycle-scavenging, which is based on volunteer computing, is currently used effectively to projects like BOINC, which was initially developed to support the SETI@home computationally intensive project. SETI was aiming to detect intelligent life outside Earth, after analysing a large amount of data coming from a telescope. There are several freely available technologies, which have been developed around the idea of Grid Computing. Distributed Resource Management Application API (DRMAA) is a high level specification API, provided by the Global Grid Forum (GGF). which is a community which aims to standardise Grid Computing globally. DRMAA tries to specify the way in which the several jobs, within one or more Distributed Resource Management (DRM) systems, are submitted and controlled. “Condor” is a popular open source software framework for course-grained distributed parallelization of computationally intensive tasks. It is commonly used to manage workload on a dedicated cluster of machines or on a Grid of idle non-dedicated desktop computers connected to the Internet. Alternative and complementary frameworks are the Cactus Grid Project and the Sun Grid Engine. Some other popular open source software middlewares for building computing grids are the “Globus Toolkit”, and the “Parallel Virtual Machine” (PVM). OGSA (Open Grid Service Architecture) describes an architecture for grid computing environment, based around services, developed within the GGF (Global Grid Forum) for scientific and business use. It actually constitutes a distributed interaction and computing architecture which assures interoperability on heterogeneous systems, so that different types of resources can communicate and share information in order to provide several services. There are several other applications and middlewares created to support the Grid Computing technology, while additionally several low or higher level APIs have been developed such as the “Message Passing Interface” (MPI) which is used for non-shared memory programming (computer clusters) , the OpenMP API used for shared memory multiprocessing programming, and the most recent “Simple API for Grid Applications” (SAGA), which is an open standard that describes an interface for high level Grid applications programming. Additionally, combining the MPI and OpenMP technologies, several other hybrids have been developed, by different distributors, in a variety of supported programming languages. The main objectives of these technologies are: high performance, scalability and portability. Parallel, multiprocessing programming in a lower level, is based on the use of threads (eg. POSIX Threads), and a key task for achieving maximum efficiency in shared memory parallel architectures is to both use exactly one thread per processor and keep these threads running during the whole lifetime of the parallel program. There are several techniques that can optimise this procedure. Additionally, the most popular and supportive languages for parallel programming are C/C++ and Fortran. Java programming language assures portability, ease of software engineering, automatic garbage collection (using the finalize() method), thorough type checking and absence of pointers. These attributes make Java development less error prone. On the other hand there are some characteristics that make the Java programming language inappropriate for parallel systems programming such as lower performance because of slower compilers, lack of support for complex numbers and multidimensional arrays, lack of suitable standards for parallel programming (no standardised interface for MPI / OpenMP), and problems using native Java Threads instead of a directive system (OpenMP). However there are some efforts to create specifications using OpenMP-like directives, in order to standardise somehow the parallel programming with Java. Similar efforts have been done for other languages as well, such as Perl and Python.
To sum up, having as a task to evaluate the behaviour of a highly complex system, such as a supercomputer, can be very demanding. This is another challenge for professional system modellers. Taking into account the recent evolution in the area of supercomputers, several questions can be derived. Which is the way, and what kind of tools could be developed and used in order to help modellers, to evaluate and optimise modern, large-scale systems, such as projects based on Grid Computing technology? How could the Queueing Networks be used in order to model and evaluate the quite complex procedures taking place within a multi-platform, shared memory (or not), multiprocessing environment. Although the low level technology and middlewares related to the Grid computing have been already matured, there are not many high level Grid applications created yet. This lack of experience in the particular field, makes the Grid computing systems modelling and evaluation even more demanding and challenging.
Reference
[1] Edwards & Hamson, “Guide Mathematical Modelling”, Palgrave 2001.
It should never be forgotten the fact that modelling, compared to simulation, is usually a less accurate but faster methodology for estimating the behaviour of a system, and that the main purpose of using this methodology is its high performance. Additionally, it is essential to underline, for one more time, the role of systems modelling in industry and business. Optimising the procedures of decision making and systems performance evaluation, a significant competitive advantage in matters of time and money may be gained for the company. A not so naïve observer could possibly identify a gap between the reasoning required for creating a mathematical principle, and this required for creating a mathematical model. It is true that mathematical principles are based on sound reasoning and development but, when we come to model a given real problem, we must feel free to construct the model using whatever mathematical relationships and techniques seem appropriate. In the case of creating a mathematical model, nothing can be considered as totally true or totally false and there are always several ways in which a model of a real problem can be created. [1]
Moreover, new achievements in the area of distributed and parallel computing systems have raised the need for developing models and modelling mechanisms, that can evaluate effectively the behaviour of highly complex systems, such as supercomputers, which have been created using clusters or idle desktop computers connected to the Internet. The second method of creating a supercomputer is based on the principles and the technology that define the term Grid Computing. Using the computational power of idle non-dedicated desktop machines that are connected in the Internet reveals another, more modern, approach to the way supercomputers can be created. This methodology called cycle-scavenging, which is based on volunteer computing, is currently used effectively to projects like BOINC, which was initially developed to support the SETI@home computationally intensive project. SETI was aiming to detect intelligent life outside Earth, after analysing a large amount of data coming from a telescope. There are several freely available technologies, which have been developed around the idea of Grid Computing. Distributed Resource Management Application API (DRMAA) is a high level specification API, provided by the Global Grid Forum (GGF). which is a community which aims to standardise Grid Computing globally. DRMAA tries to specify the way in which the several jobs, within one or more Distributed Resource Management (DRM) systems, are submitted and controlled. “Condor” is a popular open source software framework for course-grained distributed parallelization of computationally intensive tasks. It is commonly used to manage workload on a dedicated cluster of machines or on a Grid of idle non-dedicated desktop computers connected to the Internet. Alternative and complementary frameworks are the Cactus Grid Project and the Sun Grid Engine. Some other popular open source software middlewares for building computing grids are the “Globus Toolkit”, and the “Parallel Virtual Machine” (PVM). OGSA (Open Grid Service Architecture) describes an architecture for grid computing environment, based around services, developed within the GGF (Global Grid Forum) for scientific and business use. It actually constitutes a distributed interaction and computing architecture which assures interoperability on heterogeneous systems, so that different types of resources can communicate and share information in order to provide several services. There are several other applications and middlewares created to support the Grid Computing technology, while additionally several low or higher level APIs have been developed such as the “Message Passing Interface” (MPI) which is used for non-shared memory programming (computer clusters) , the OpenMP API used for shared memory multiprocessing programming, and the most recent “Simple API for Grid Applications” (SAGA), which is an open standard that describes an interface for high level Grid applications programming. Additionally, combining the MPI and OpenMP technologies, several other hybrids have been developed, by different distributors, in a variety of supported programming languages. The main objectives of these technologies are: high performance, scalability and portability. Parallel, multiprocessing programming in a lower level, is based on the use of threads (eg. POSIX Threads), and a key task for achieving maximum efficiency in shared memory parallel architectures is to both use exactly one thread per processor and keep these threads running during the whole lifetime of the parallel program. There are several techniques that can optimise this procedure. Additionally, the most popular and supportive languages for parallel programming are C/C++ and Fortran. Java programming language assures portability, ease of software engineering, automatic garbage collection (using the finalize() method), thorough type checking and absence of pointers. These attributes make Java development less error prone. On the other hand there are some characteristics that make the Java programming language inappropriate for parallel systems programming such as lower performance because of slower compilers, lack of support for complex numbers and multidimensional arrays, lack of suitable standards for parallel programming (no standardised interface for MPI / OpenMP), and problems using native Java Threads instead of a directive system (OpenMP). However there are some efforts to create specifications using OpenMP-like directives, in order to standardise somehow the parallel programming with Java. Similar efforts have been done for other languages as well, such as Perl and Python.
To sum up, having as a task to evaluate the behaviour of a highly complex system, such as a supercomputer, can be very demanding. This is another challenge for professional system modellers. Taking into account the recent evolution in the area of supercomputers, several questions can be derived. Which is the way, and what kind of tools could be developed and used in order to help modellers, to evaluate and optimise modern, large-scale systems, such as projects based on Grid Computing technology? How could the Queueing Networks be used in order to model and evaluate the quite complex procedures taking place within a multi-platform, shared memory (or not), multiprocessing environment. Although the low level technology and middlewares related to the Grid computing have been already matured, there are not many high level Grid applications created yet. This lack of experience in the particular field, makes the Grid computing systems modelling and evaluation even more demanding and challenging.
Reference
[1] Edwards & Hamson, “Guide Mathematical Modelling”, Palgrave 2001.
Subscribe to:
Posts (Atom)