Chapter 1. Overview

1.1. Common concept

Aim of the project: Hedeby is a Service Domain Management system that makes it possible to manage scalable services.

This project is developed by the Sun Grid Engine Management team. Like the Sun Grid Engine project, the Hedeby project has been open sourced under the SISSL license.

The Service Domain Manager is designed to handle very different kinds of services. Its main purpose is to resolve resource shortages of such services. Hedeby is of interest to all administrators who manage large services that have an administration interface. The Service Domain Manager will be able to detect scalability problems and resolve them.

For the first release we (the Hedeby team) will concentrate on using Hedeby to manage the Sun Grid Engine service. In the future it should be able to support other services.

1.1.1. Resource

What is a resource? In Hedeby a resource can be nearly anything. It can be hardware (e.g. a host or a printer) or software (a specific application or licenses). In general, a resource is something a service uses to provide its service. If you give a service more resources, it can do more work in the same time.

To be fully usable by Hedeby, each resource should have a system-wide unique id, the resource id. This id must identify the resource. For a host resource this can be the fully qualified hostname.

A resource in Hedeby is seen as a single entity. Sharing resources between services is not desired. For example, if Hedeby should manage a license that allows ten users to use a piece of software concurrently, the administrator has to add ten resources to the Hedeby system. Each license resource must have a unique id, as sharing a license may lead to a violation of the license agreement and to service malfunction. Sharing a resource between services can also degrade service performance, so ideally each resource should be assigned to a single service exclusively. See Section 1.1.1.2, “Ambiguous Resources” for details about non-unique resource ids.

For each registered resource a Hedeby system stores a set of properties. These resource properties describe the resource. Examples are the number of CPUs, memory and architecture of a host, the version number of a software package, or the number of licenses. Each resource property has a name and a value.
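As an illustration only, a resource with its id and properties could be modelled as below. This is a minimal sketch, not the Hedeby data model: the class `Resource`, the helper `expand_license`, the host name, and the property names `cpuCount`, `memoryGB` and `architecture` are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical model: a unique resource id plus name/value properties."""
    resource_id: str
    resource_type: str
    properties: dict = field(default_factory=dict)


def expand_license(product: str, seats: int) -> list:
    """Model an N-user license as N separate resources, each with its own id."""
    return [
        Resource(f"{product}-license-{n}", "license", {"product": product})
        for n in range(1, seats + 1)
    ]


# A host resource identified by its fully qualified hostname (made-up name).
host = Resource("node01.example.com", "host",
                {"cpuCount": 4, "memoryGB": 8, "architecture": "x86"})

# A ten-user license becomes ten resources with distinct ids.
licenses = expand_license("cad-tool", 10)
```

Splitting the license into ten entries mirrors the rule above that a resource is never shared: each entry can be assigned to exactly one service.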

Resources in a Hedeby system

1.1.1.1. Static Resources

The Hedeby system distinguishes between static and dynamic resources. A static resource cannot be removed from a service. The Hedeby system will never touch a static resource. When a service is removed, all static resources assigned to it disappear from the Hedeby system.

Dynamic resources can be removed (unassigned) from one service and added (assigned) to another.

1.1.1.2. Ambiguous Resources

A service can use a number of resources even before it is managed by Hedeby, e.g. a web server cluster (service) running on four servers (resources). When a service becomes managed, it reports its resources to Hedeby (auto-discovery of resources). In this case it may happen that the service reports a resource with a resource id that is already used in the Hedeby system. This may signal either that a resource is shared between two services or that there is just a name collision. Neither case is wanted, and we call such a resource AMBIGUOUS because it is not clear which service should use the resource exclusively. The case has to be solved manually by removing one instance of the ambiguous resource from the system.

The presence of an ambiguous resource does not mean that the system is not functional at all. An ambiguous resource may be fully or partially functional (depending on the service), but to avoid possible problems the system puts several constraints on an ambiguous resource:

  • Ambiguous resource can not be modified

  • Ambiguous resource can not be moved (from service A to service B)

  • Ambiguous resource will not be considered as candidate for filling a resource request

The only operations allowed on an ambiguous resource are:

  • Ambiguous resource may be reset

  • Ambiguous resource may be removed from system
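The rules above can be sketched as a small registry that marks a resource id as ambiguous once two services report it. This is a hypothetical illustration: the class `ResourceRegistry`, its methods and the operation names are invented here and are not part of Hedeby.

```python
class ResourceRegistry:
    """Hypothetical registry that flags duplicate resource ids as ambiguous."""

    # The only operations the text allows on an ambiguous resource.
    ALLOWED_WHEN_AMBIGUOUS = {"reset", "remove"}

    def __init__(self):
        self._owners = {}  # resource id -> list of services that reported it

    def report(self, service, resource_id):
        """Record that a service reported (auto-discovered) a resource id."""
        owners = self._owners.setdefault(resource_id, [])
        if service not in owners:
            owners.append(service)

    def is_ambiguous(self, resource_id):
        """An id reported by more than one service is ambiguous."""
        return len(self._owners.get(resource_id, [])) > 1

    def is_allowed(self, resource_id, operation):
        """Ambiguous resources accept only reset/remove.

        Simplification: every operation is accepted on other resources.
        """
        if self.is_ambiguous(resource_id):
            return operation in self.ALLOWED_WHEN_AMBIGUOUS
        return True


registry = ResourceRegistry()
registry.report("web", "host1")
registry.report("db", "host1")   # same id from a second service
registry.report("web", "host2")
```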

1.1.1.3. Resource States

The Hedeby system knows the following resource states:

UNASSIGNED

The resource is not assigned to a service. It is currently being processed by the resource provider (e.g. for filling a resource request).

ASSIGNING

The resource is in the process of being assigned to a service (e.g. the service adapter is installing service components on the resource, but has not finished yet).

ASSIGNED

The resource has been successfully assigned to a service. The service has full control of the resource.

UNASSIGNING

The resource is in the process of being released from a service (e.g. the service adapter is uninstalling service components from the resource, but has not finished yet).

INPROCESS

The resource has just been added to the RP (temporarily, until it is assigned to another service) and there is no information yet about the resource state (whether it has been released from the service or not).

ERROR

An action on the resource produced an unrecoverable error. The resource is currently not usable and has to be reset.

1.1.2. Service

A service in Hedeby terms is a piece of software. It can be a database, an application server or any other software. The only constraint is that the software has to provide a service management interface.

To make a service manageable, Hedeby needs a driver for the service. Such a driver is called a service adapter. The service adapter is packaged in a jar file. It has its own configuration, and in the current version it runs inside a service container.

Services in a Hedeby system

1.1.2.1. Service States

From the view of a Hedeby system a service has the following states:

UNKNOWN

There exists no connection to the service. Hedeby has no idea in what state the real service is. A service goes into the UNKNOWN state if the communication between the Hedeby system and the real service is interrupted.

STARTING

Hedeby is currently starting the service.

RUNNING

Hedeby has a connection to the service. The service is observed.

SHUTDOWN

The service is going down.

ERROR

Hedeby can contact the service, but it is not working in the expected way. The administrator of the service has to solve the problem.

STOPPED

The service has been stopped.

State changes of a service can be triggered by the Hedeby system or by the service itself. The service adapter has to be smart enough to detect external service state changes (e.g. a service adapter for Grid Engine has to catch the "qmaster goes down" event).

What really happens when a Hedeby system starts or stops a service is an implementation detail of the service adapter (e.g. does the service adapter for Grid Engine shut down qmaster?). The Hedeby system knows only the states reported by the service adapter, which interprets the states of the real service.

1.1.2.2. Registered Services

The Hedeby system distinguishes between registered and unregistered services. Unregistered services are not included in the decision making process. No resources are assigned to or unassigned from them. An unregistered service also reports no needs.

1.1.2.3. Key Performance Indicators (KPI)

To make performance measurable, each service defines a set of key performance indicators. The service adapter collects these numerical values by using the service management interface.

Example 1.1. Sample KPIs

  • Number of transactions per second
  • Disk usage
  • Memory usage

Each KPI can have additional properties which identify the performance of a logical unit of the service.

Example 1.2. KPIs with properties

  • Number of used rows of each database table (the name of the database table is a property of the KPI)
  • Number of pending jobs per host (the hostname is a property)
  • Number of pending jobs waiting for a specific license (the name of the license is a property)

The service adapter is responsible for mapping the properties of the KPIs to resource properties. It has to provide mapping tables for this purpose.

1.1.2.4. Service Level Objectives (SLO)

Hedeby allows the definition of rules which describe the current state of a service. A single rule is called a service level objective. At any given time an SLO is either fulfilled or not. If an SLO is fulfilled we say the service is in compliance with this SLO.

With a set of SLOs the administrator of the Hedeby system implicitly defines a service level agreement (SLA) for a service. If all SLOs are fulfilled, the service works as intended for a defined scenario. The SLA itself cannot be defined in the Hedeby system, only the set of SLOs.

Example 1.3. Typical SLA for a service

The following graphic shows a typical SLA definition for a service. The SLA is formulated with two SLOs. The number of pending requests to the service should always be less than 10, and the throughput time of a request should be less than 3s.

SLOs of a service
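The two SLOs of Example 1.3 can be written down as predicates over KPI values. The sketch below is illustrative only: the `SLO` class, the `violated` helper and the KPI names `pending_requests` and `throughput_seconds` are assumptions, not Hedeby configuration syntax.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SLO:
    """Hypothetical SLO: a name plus a rule evaluated against current KPIs."""
    name: str
    is_fulfilled: Callable[[Dict[str, float]], bool]


# The two objectives from Example 1.3.
slos = [
    SLO("pending_requests", lambda kpi: kpi["pending_requests"] < 10),
    SLO("throughput_time", lambda kpi: kpi["throughput_seconds"] < 3.0),
]


def violated(slos: List[SLO], kpis: Dict[str, float]) -> List[str]:
    """Return the names of all SLOs the service is not in compliance with."""
    return [s.name for s in slos if not s.is_fulfilled(kpis)]
```

A service is in compliance with its implicit SLA when `violated` returns an empty list for the current KPI values.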


1.1.2.5. Need

If some SLOs are not fulfilled, the service has a need for additional resources. The service has to describe what kind of resources are needed by specifying resource properties. The Hedeby system (Section 1.1.3, “Resource Provider”) will try to solve this lack of resources by assigning new resources to the service.

A need contains information about the needed resource (type of resource, resource properties) and an urgency. The urgency is a non-negative number (0 and above), where a higher number indicates a more urgent need.

The administrator of the Hedeby system has to define what need will be generated if SLOs are not fulfilled.

Example 1.4. Service reports a need

The following graphic shows the SLA definition described in Example 1.3, “Typical SLA for a service”. If one of the SLOs is not fulfilled, the service reports a need for a new resource of type host. The urgency of the need is 75 (relatively high).

A service reports a need


The calculated urgency is absolute only within this service. Settings in the Policy Engine put the urgency in relation to the urgencies of other services (see Section 1.1.3.1, “Policy Engine”).
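Example 1.4 can be sketched as a function that turns KPI values into a list of needs. This is an assumption-laden illustration: the `Need` class, its field names and the KPI names are invented for the example. Only the resource type host, the urgency 75 and the two thresholds come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Need:
    """Hypothetical need: what is required and how urgently."""
    resource_type: str
    urgency: int
    quantity: int = 1
    properties: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # The text defines urgency as a non-negative number.
        if self.urgency < 0:
            raise ValueError("urgency must be non-negative")


def needs_for(kpis: Dict[str, float]) -> List[Need]:
    """Report a host need with urgency 75 if either SLO is not fulfilled."""
    in_compliance = (kpis["pending_requests"] < 10
                     and kpis["throughput_seconds"] < 3.0)
    return [] if in_compliance else [Need("host", urgency=75)]
```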

1.1.2.6. Spare Pool

Hedeby provides a special service and component in each Hedeby system, named the Spare Pool. This spare_pool service collects all resources that are not heavily used by the services to which they are currently assigned. It does so by sending a constant request. More than one spare_pool component can be installed.

The Spare Pool supports only one SLO. No matter how many resources are assigned to the Spare Pool, the SLO is never fulfilled. The urgency of the need generated by the Spare Pool is configurable by the administrator. It should be small enough that no resource is assigned to the Spare Pool while other services need it.

Example 1.5. Service gets resource from Spare Pool

In this example we have a Hedeby system with three services (including the Spare Pool). The Spare Pool currently contains six resources. Service #1 is in compliance with its SLOs. Service #2 has a need for an additional resource. The urgency of the Spare Pool is lower than the urgency of service #2. The service domain manager takes one resource out of the Spare Pool and assigns it to service #2.

Role of the Spare Pool in a Hedeby system


1.1.2.7. Resource Usage

The resource usage tells the Hedeby system how important the resource is for this service. The usage is a non-negative number (0 is also allowed). It is the responsibility of the service to keep the usage of the resource up to date (e.g. if a KPI of the service has changed).

In general, the usage of a resource is the maximum urgency of the SLOs that need the resource in order to be fulfilled.

Example 1.6. Resource Usage

A service has six resources assigned (R1-R6). There are two SLOs for this service. SLO1 has urgency 50 and SLO2 has urgency 30.

Resources R2, R3 and R4 have a usage of 30, because they are needed to fulfill SLO2. Resources R5 and R6 have a usage of 50 (urgency of SLO1).

Resource R1 is needed by SLO1 and SLO2. In such cases the resource has the maximum urgency of all associated SLOs, which means R1 has usage 50 (= max(urgency of SLO1, urgency of SLO2)).

Usage of assigned resource
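The calculation in Example 1.6 can be reproduced with a few lines of code. The function and variable names below are hypothetical. The urgencies and the SLO-to-resource associations are taken from the example.

```python
# Urgency of each SLO, as given in Example 1.6.
slo_urgency = {"SLO1": 50, "SLO2": 30}

# Resources each SLO needs in order to be fulfilled.
slo_resources = {
    "SLO1": ["R1", "R5", "R6"],
    "SLO2": ["R1", "R2", "R3", "R4"],
}


def resource_usage(slo_urgency, slo_resources):
    """Usage of a resource = maximum urgency of all SLOs that need it."""
    usage = {}
    for slo, resources in slo_resources.items():
        for resource in resources:
            usage[resource] = max(usage.get(resource, 0), slo_urgency[slo])
    return usage


usage = resource_usage(slo_urgency, slo_resources)
# R1 is needed by both SLOs and therefore gets max(50, 30) = 50.
```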


1.1.3. Resource Provider

The Resource Provider is the central component in a Hedeby system. It has the control over all services and resources. Each service adapter must inform the Resource Provider if the state of a service or a resource has changed. The Resource Provider makes the decisions whether a service gets a resource or not. The following image illustrates the decision making process:

Decision making process

  1. The service reports its key performance indicators (KPIs) to the service adapter.

  2. The administrator defines SLOs which, based on the KPIs of the service, calculate the resource need for each SLO.

  3. The list of all needs for all SLOs is sent to the Resource Provider.

  4. The Resource Provider uses the Policy Engine to normalize the needs of the services (normalization of the needs).

  5. Based on the normalized needs the Resource Provider makes its resource assignment decisions.

  6. The Resource Provider sends the corresponding resource assignments/unassignments to the service.

At startup the Resource Provider discovers the Hedeby system. It asks all services what resources they possess and stores that information in its local storage.

1.1.3.1. Policy Engine

With the Policy Engine it is possible to define policies which influence the decisions of the Resource Provider. From the need of a service the Policy Engine calculates a new urgency.

The Policy Engine rules the decision making process of the resource Provider

The Policy Engine has access to statistical values of the resource usage. The following information can be provided:

  • Number of resources assigned to a service which match a given resource properties pattern (e.g. number of hosts with the Solaris operating system).

  • Number of resources in a given state which match a given resource properties pattern (e.g. number of assigned host resources with more than 2GB of memory).

Note

To make time-based decisions the Policy Engine will need information about how long a resource has been assigned to a service.

The Policy Engine provides a generic interface which makes it possible to plug other implementations into the Hedeby system.

Example 1.7. Example for a simple Policy Engine

A simple implementation of a Policy Engine can weight the importance of services by giving them different priorities. The policy engine multiplies the urgency reported in a service's need by the priority of that service and so obtains the weighted needs.

Service      Priority   Number of Resources
Spare Pool   1          1
A            2          3
B            3          2

For the services the following SLOs are defined:

Service      SLO                          Urgency
Spare Pool   always needs resources       1
A            needs more than 3 resources  50
B            needs more than 3 resources  40

The Policy Engine weights the urgencies of the reported needs by multiplying them by the priority of the service:

Service      Urgency * Priority   Weighted Urgency
Spare Pool   1 * 1                1
A            50 * 2               100
B            40 * 3               120

The policy engine reports the needs with the newly calculated urgencies to the Resource Provider. The Resource Provider signals service B that the free resource from the Spare Pool can be assigned. After the assignment is finished, service B sends an event to the Resource Provider. The next scheduling run starts.
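The weighting in Example 1.7 amounts to one multiplication per service. The following sketch uses the numbers from the tables above. The function name `weighted_urgencies` and the dictionary layout are assumptions made for illustration.

```python
# Priorities and reported urgencies from Example 1.7.
priorities = {"Spare Pool": 1, "A": 2, "B": 3}
urgencies = {"Spare Pool": 1, "A": 50, "B": 40}


def weighted_urgencies(urgencies, priorities):
    """Weight each reported urgency by the priority of its service."""
    return {service: urgencies[service] * priorities[service]
            for service in urgencies}


weighted = weighted_urgencies(urgencies, priorities)

# The service with the highest weighted urgency is served first.
winner = max(weighted, key=weighted.get)
```

Although service A reports the higher raw urgency (50 versus 40), the higher priority of service B gives it the larger weighted urgency, which is why B receives the free resource.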

Warning

Misconfiguration of the SLOs and the policies will lead to an oscillating system in which resources are moved back and forth between services. We have to implement mechanisms to prevent such situations.


There is no strict definition of a policy setting. The Policy Engine is open to third-party enhancements and therefore does not rely on any particular definition or implementation of a policy setting. An example of a policy setting is the following rule:

Example 1.8. 

<SERVICE_CONTAINER> should receive [N]% of <RESOURCE> resources


Hedeby currently includes only a simple Policy Engine implementation that takes only the priority setting into account. The priority is a value assigned to a service adapter (generally to a managed service) and expresses the subjective importance of the service, as defined by a Hedeby administrator.

1.1.3.2. Decision Process in Detail

The decision process is based on an algorithm that takes into account the requirements of the service, which are specified by a Need, and data provided by a policy manager.

Note

By Resource Provider (RP) we mean an interface that encloses a set of managers that are responsible for the whole decision making process (service manager, resource manager, request processor, order processor).

Note

A Need is a quantified request for a resource with certain properties. One possible example of a Need: "4 resources of host type with 4GB of memory", which means that a service asks for 4 hosts with 4GB of memory. Another possible Need: "1 resource of SW license type", which means that a service asks for a license (for a particular piece of software).

The complete algorithm can basically be divided into solving two cases:

  • The service asks for a new resource.
  • The service is giving up one of its resources.

The first case is described in detail in the following steps:

  1. When a service's SLO is not met, the service (let's name it SOURCE) sends a notification to the RP that it needs a resource (ResourceRequestEvent). It is up to the service to send a ResourceRequestEvent every time it finds out that the SLO is not met.

  2. RP receives a ResourceRequestEvent that contains the name of the SOURCE and the list of Needs (which contains one or more Need).

  3. RP enqueues the ResourceRequestEvent in the internal request queue (a request object was created from the ResourceRequestEvent).

  4. RP takes the request from the queue and starts to process it.

  5. For each need in the request's list of Needs, do the following steps. When the end of the list is reached, go to step 6.

    a. Obtain from each service (let's name such a service TARGET) a list of resources that match the required resource described in the Need. Let's call each such resource a CANDIDATE.

      TARGET considers a resource a CANDIDATE if the resource usage level is lower than the normalized urgency of the need expressed by SOURCE. The normalized urgency of the SOURCE (and of the TARGET) is calculated using the policy manager.

    b. If the list of CANDIDATE resources is not empty, continue with the next step, otherwise go to step e.

    c. Iterate over the list of CANDIDATES. If the required number of CANDIDATES has been asked to be released, go to step d. If the end of the CANDIDATE list is reached first, go to step e. Otherwise, for each CANDIDATE do the following steps:

      i. First, register an action that has to be taken once the CANDIDATE is released from TARGET. The registration is done by creating an ORDER and storing it in the ORDER store. The action is the assignment of the CANDIDATE to the SOURCE (reminder: SOURCE is the service which expressed the Need).
      ii. Ask the TARGET to release the CANDIDATE (an asynchronous call to removeResource on the TARGET interface is made, and the result of the operation is NOT guaranteed). If there is a problem requesting the TARGET to release the CANDIDATE, the previously created assignment ORDER is cancelled (removed).
      iii. If the required number of CANDIDATES has been asked to be released, go to step d. (The required number is specified by the quantity attribute in the Need.)

    d. RP filled the need (it was able to ask for at least as many CANDIDATES to be released as specified by the need's quantity). Remove the need from the list of needs in the request. Go to step 5.

    e. RP did NOT fill the need (it was not able to ask for as many CANDIDATES to be released as specified by the need's quantity). The quantity attribute of the need is reduced by the number of CANDIDATES that RP was able to ask the related TARGET to release. Leave the need in the list of needs. Go to step 5.

  6. If the list of needs in the request is not empty, re-submit the request to the request queue for re-processing at a later time; otherwise remove the request (it is processed).
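The loop over needs and candidates in steps 5 and 6 can be sketched as follows. This is a simplified, synchronous illustration and not the Hedeby implementation: the classes `Need` and `Resource`, the function `process_request` and the representation of an ORDER as a tuple are assumptions. The sketch also assumes that the urgency is already normalized, that the SOURCE itself is not asked for candidates, and that asking a TARGET to release a resource never fails.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Need:
    resource_type: str
    quantity: int
    urgency: float  # assumed to be normalized by the policy manager already


@dataclass
class Resource:
    resource_id: str
    resource_type: str
    usage: float


# Hypothetical ORDER: (resource id, current owner TARGET, future owner SOURCE).
Order = Tuple[str, str, str]


def process_request(source: str, needs: List[Need],
                    services: Dict[str, List[Resource]],
                    orders: List[Order]) -> List[Need]:
    """Create ORDERs for candidates; return the needs that stay unfilled."""
    remaining = []
    for need in needs:
        asked = 0
        for target, resources in services.items():
            if target == source:
                continue  # assumption: SOURCE does not give up resources to itself
            for res in resources:
                if asked == need.quantity:
                    break  # required quantity reached
                # A resource is a CANDIDATE if it matches the need and its
                # usage is lower than the urgency of the need.
                if (res.resource_type == need.resource_type
                        and res.usage < need.urgency):
                    orders.append((res.resource_id, target, source))
                    asked += 1  # the real release request would be asynchronous
        if asked < need.quantity:
            # Need not filled: keep it with a reduced quantity for a later run.
            remaining.append(
                Need(need.resource_type, need.quantity - asked, need.urgency))
    return remaining
```

A request whose returned list is not empty would be re-submitted to the request queue, as described in step 6.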

The second case is described in detail in the following steps:

  1. RP receives a ResourceRemovedEvent that contains the name of the service that released the resource and the snapshot of the Resource that was released. Let's name the service as a SOURCE and the resource as a RESOURCE.

  2. RP checks the REQUEST queue for an ORDER for the RESOURCE (the resource ID and the SOURCE service (owner) are compared).

  3. If there is an ORDER for the RESOURCE, process the ORDER. The ORDER contains the identifier of the service to which the RESOURCE has to be assigned. Let's name such a service TARGET.

    RP asks the TARGET to add the RESOURCE (an asynchronous call to addResource on the TARGET interface is made, and the result of the operation is NOT guaranteed).

    If there is a problem with executing the ORDER, the ORDER is not executed and is cancelled (removed), and the RESOURCE is assigned to the first service that is willing to accept it. If no such service exists, the RESOURCE remains temporarily stored in the RP cache and the administrator has to resolve the situation manually. If the ORDER is executed without any problem, all ORDERS for the same RESOURCE are removed from the resource queue (they were added based on a decision that was made before the RESOURCE was added to the SOURCE).

  4. If there is no ORDER for a RESOURCE, add the resource to the first available service.
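The handling of a released resource can be sketched in the same simplified style. The function `on_resource_removed` and the tuple layout of an ORDER (resource id, releasing service, receiving service) are hypothetical. The failure path of step 3 (an ORDER that cannot be executed) is left out of this sketch.

```python
from typing import List, Optional, Tuple

# Hypothetical ORDER: (resource id, releasing service, receiving service).
Order = Tuple[str, str, str]


def on_resource_removed(resource_id: str, source: str,
                        orders: List[Order],
                        willing_services: List[str]) -> Optional[str]:
    """Return the service that receives the released resource, or None."""
    # Step 2: look for an ORDER matching the resource id and its former owner.
    matching = [o for o in orders if o[0] == resource_id and o[1] == source]
    if matching:
        # Step 3: execute the first ORDER, then drop every ORDER for
        # this resource, since they are based on an outdated decision.
        target = matching[0][2]
        orders[:] = [o for o in orders if o[0] != resource_id]
        return target
    # Step 4: no ORDER, so hand the resource to the first available service.
    return willing_services[0] if willing_services else None
```

A return value of `None` corresponds to the case where the resource stays in the RP cache until the administrator intervenes.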

1.1.4. Reporter

The Reporter component is a logging and monitoring tool for Hedeby.

The role of the Reporter component is to intercept and gather information about what is going on in the system. The administrator can specify what kind of data is of interest. The Reporter is able to store information and notifications that come from the Configuration Service, the Resource Provider and all services that are installed in the system.

The Reporter component is prepared to store data in the ARCo database (Grid Engine's Accounting and Reporting Console). By prepared we mean that a special file in ARCo format is created, which stores data suitable for ARCo.

The data in the ARCo file is not very readable for a normal user. For this reason the administrator can retrieve the data and print it on the screen using CLI commands. The data can be filtered using the available filters. For more about the Reporter component, see Section 2.2.5.4, “Reporter Component”.

1.1.5. Executor

The Executor is used whenever there is a need to set up (or destroy) a service component on a resource that has to be part of the service, especially in situations where there is no other way to communicate with the resource. Once the resource has been configured by the Executor, the service adapter can use a different way of communicating with the resource (usually a communication channel provided by the managed service).

The Executor gives Hedeby the possibility to execute actions or commands on a resource. For this purpose the administrator can install an executor component on each resource in a Hedeby system. The features of the executor component depend highly on the type of resource. In general the executor executes a command on a resource. For host resources user switching will be possible.

The service adapters will mainly use the executors for installing and uninstalling software on a resource. However, the usage of executors is not restricted to the service adapters.

1.1.6. Principles of Operation

This section describes the basic actions or use cases which can be executed on a Hedeby system.

1.1.6.1. Managing Services

Service state transitions

1.1.6.1.1. Add a Service

Adding a service is triggered from the UI. The administrator has to provide the following information:

  • Name of the service, must be unique in the Hedeby system

  • Type of the system (e.g. Grid Engine, RDBMS, ...)

  • Connect parameter (e.g. SGE_ROOT and SGE_CELL for Grid Engine)

  • Mapping of resource properties from the service conventions to the Hedeby conventions

    arch=lx26-x86 -> hardwareCpuArchitecture=x86, operatingSystemName=Linux

  • Service level agreement (set of SLOs, depends highly on the type of service; different services support different types of SLOs)

  • Service specific configuration (parameters for a specific service adapter)

When adding a service the Hedeby system validates the configuration parameters. On any error the action is rejected. With a valid configuration the service adapter is instantiated. State of the service is UNKNOWN. The service is registered in the RP.
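The property mapping mentioned in the list above (arch=lx26-x86 to hardwareCpuArchitecture=x86, operatingSystemName=Linux) can be pictured as a lookup table. The sketch below is hypothetical: the table `ARCH_MAPPING`, the function `to_hedeby_properties` and the pass-through behaviour for unmapped properties are assumptions, not the actual configuration format.

```python
# Mapping from the service convention to the Hedeby convention.
# Only the lx26-x86 entry is taken from the text above.
ARCH_MAPPING = {
    "lx26-x86": {
        "hardwareCpuArchitecture": "x86",
        "operatingSystemName": "Linux",
    },
}


def to_hedeby_properties(service_properties):
    """Translate service-specific properties into Hedeby resource properties."""
    result = {}
    for name, value in service_properties.items():
        if name == "arch" and value in ARCH_MAPPING:
            result.update(ARCH_MAPPING[value])
        else:
            # Assumption: properties without a mapping are passed through.
            result[name] = value
    return result


mapped = to_hedeby_properties({"arch": "lx26-x86"})
```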

1.1.6.1.2. Remove a Service

Removing a service is only possible if the service is in state SHUTDOWN or ERROR. This action is triggered from the UI. The following steps are executed:

  • The instance of the service adapter is removed

  • The configuration of the service is removed

  • The service is unregistered from the RP

1.1.6.1.3. Start a Service

Starting a service can be triggered from the UI. The administrator has to provide the name of the service.

  • If the service is not in state UNKNOWN or SHUTDOWN the action is rejected

  • The UI sends the service adapter the start_service event

  • The service adapter connects to the real service

  • After a successful connect the service adapter discovers the resources that are owned by the service. All resources are reported to the RP. Resources that are unknown in the Hedeby system are created automatically by the RP. If the service adapter cannot manage a discovered resource without an Executor component (this is an implementation detail of the service adapter and may vary across service adapters) and the discovered resource is not running an Executor component, the resource is marked as static.

  • Finally the service adapter sets the state of the service to RUNNING and informs the RP about the state change.

Note

The service adapter does not observe the service if the service is in state UNKNOWN or SHUTDOWN. If a service is started without Hedeby, this has no effect on the Hedeby system.

1.1.6.1.4. Stop a Service

There are two possibilities to stop a service:

  1. The administrator stops the service manually (via the UI). The administrator has to specify the name of the service and the “free_resources” flag. The following steps are executed:

    • The service goes into state SHUTDOWN.

    • If the “free_resources” flag is set, the service adapter removes all assigned non-static resources.

    • The connection to the external service is closed and the state of the service is set to STOPPED (event to RP).

  2. The service is stopped with external tools (e.g. qconf -km). The service state is immediately set to STOPPED and the RP is informed about the state change.

1.1.6.1.5. Configure a Service

Configuring a service is done via the UI. Some configuration parameters can only be changed if the service is not running (e.g. connection parameters). Other parameters can be changed dynamically (e.g. changing SLOs).

1.1.6.2. Managing Resources

The following figure shows all possible state transitions of a resource in a Hedeby system. Each state transition requires a number of component interactions.

Resource state transitions

1.1.6.2.1. Add a Resource

To add a new resource to a Hedeby system the administrator uses the user interface and has to specify the following information:

  • Name of the resource, must be unique within a Hedeby system.

  • Type of the resource (e.g. host, printer, license)

  • Additional properties for the resource (installed operating system, architecture, hardware specific properties)

When adding a resource the UI sends a corresponding request directly to the Resource Provider. The Resource Provider validates the input parameters, stores the resource in its local storage and assigns the resource to the first service willing to accept it.

A resource is added automatically to the Hedeby system each time a service adapter discovers that a service uses a resource that is unknown in the Hedeby system. Such a resource may be marked as static if the service adapter is not able to remove the resource from the service. The state of a discovered resource reflects the actual resource state as determined by the service adapter (for the GE adapter, it may be ASSIGNED if the discovered resource has execd running, or ERROR if it does not).

1.1.6.2.2. Assign a Resource to a Service

The assignment of a Resource can be triggered in two ways:

  1. The administrator uses the UI to assign a resource manually.

  2. The RP finds out that a service requires additional resources. The RP will trigger the resource assignment automatically.

For a Resource assignment the following actions are executed:

  1. RP sets the state of a resource to ASSIGNING.

  2. RP sends an add_resource request to the service (it contains the resource properties).

  3. The service checks whether the resource fulfills the requirements of this service.

  4. If the resource is not usable, the service sends the resource provider a resource_rejected message. The RP sets the state of the resource to UNASSIGNED. The RP marks the resource ID of the resource as not usable by the service (in the RP's internal storage called the service blacklist).

  5. If the resource is usable, the service starts the necessary installation process (installation routines, depending on the service adapter). When the RP receives the success response message from the service (resource_added), it sets the state of the resource to ASSIGNED. On any unforeseen error during the installation phase the RP sets the resource to the ERROR state, because the service adapter has modified the resource and the modification cannot be undone automatically.
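The outcome of the assignment steps can be summarized as a small state function. This is a hypothetical sketch: the enum `ResourceState`, the function `assign_resource` and the use of a plain string for the service reply are assumptions. The message names resource_rejected and resource_added and the state names come from the text.

```python
from enum import Enum


class ResourceState(Enum):
    UNASSIGNED = "UNASSIGNED"
    ASSIGNING = "ASSIGNING"
    ASSIGNED = "ASSIGNED"
    UNASSIGNING = "UNASSIGNING"
    INPROCESS = "INPROCESS"
    ERROR = "ERROR"


def assign_resource(resource_id, service, reply, blacklist):
    """Return the final state of a resource after an add_resource request.

    While the request is outstanding the resource is in state ASSIGNING;
    this function only models the state reached once the reply arrives.
    """
    if reply == "resource_rejected":
        # Remember that this service cannot use the resource.
        blacklist.setdefault(service, set()).add(resource_id)
        return ResourceState.UNASSIGNED
    if reply == "resource_added":
        return ResourceState.ASSIGNED
    # Anything else is treated as an unforeseen installation error.
    return ResourceState.ERROR
```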

1.1.6.2.3. Unassign a Resource

As with the assignment process, the unassignment can be triggered manually (via the UI) or automatically (by the RP). The following actions are executed during the unassignment:

  1. The RP sets the resource state to UNASSIGNING and sends the service the remove_resource request.

  2. The service checks if it is possible to remove the resource.

  3. If removing is not possible and the remove resource request is not forced, then the service sends the response message to RP, the state will be set to ASSIGNED.

  4. If removing the resource is possible or if the remove resource request is forced the service processes the uninstall procedure. On success the RP sets the state of the resource to UNASSIGNED. On any unforeseen error the resource is treated as unusable for the Hedeby system and the state of the resource is set to ERROR.
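The decision logic of the unassignment steps fits into one function. The function name `unassign_resource` and its boolean parameters are hypothetical and only illustrate how the force flag and the result of the uninstall procedure determine the final state.

```python
def unassign_resource(removable: bool, forced: bool, uninstall_ok: bool) -> str:
    """Return the resource state after a remove_resource request.

    removable    -- the service considers removing the resource possible
    forced       -- the remove request carries the force flag
    uninstall_ok -- the uninstall procedure finished without error
    """
    if not removable and not forced:
        # Step 3: the service refuses, the resource stays where it is.
        return "ASSIGNED"
    # Step 4: the uninstall procedure runs; its result decides the state.
    return "UNASSIGNED" if uninstall_ok else "ERROR"
```

Note that `uninstall_ok` is ignored when the service refuses the request, because in that case no uninstall procedure is started.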

1.1.6.2.4. Remove a Resource

Removing a resource is possible if the resource is owned by a service (it is up to the service adapter to check the resource state and allow or disallow the removal; ideally only resources in the ASSIGNED or ERROR state can be removed). The administrator uses the UI to trigger this action. Only the name of the resource must be specified.

The administrator can remove a resource even if it is owned by the resource provider (the resource state is not checked, as this operation should be performed only if the system is in an inconsistent state). The administrator uses the UI to trigger this action. Only the name of the resource must be specified.

1.1.6.2.5. Reset a Resource

Any unforeseen error during assignment or unassignment puts a resource into the ERROR state. If a resource is in the ERROR state, the Hedeby system treats it as unusable. If the service adapter of the service that owns the ERROR resource does not support an active reset of the resource (automatic cleanup), the administrator must clean up the resource manually. After the cleanup the resource state can be reset manually (UI). Only the name of the resource must be specified.