REST API data wishlist #8009

@calestyo

Hey.

At LMU we run a home-brew monitoring system for dCache (only), which gives us about the following:

  • total space, used thereof
  • the same per space token
  • total number of active and queued movers, and of movers in an odd state (such as WaitingForGetPool)
  • total transfer speed, per protocol and flavour
  • transfer speed histogram (i.e. if we have 100 transfers, how many of them run at 100 kB/s, how many at 1000 kB/s, etc.)
  • transfer duration histogram (i.e. of all current transfers, how many have been running for 1 s, how many for 10 s, etc.)

Per pool:

  • total/used space
  • total and per queue: number of max/active/queued movers
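
To make the histogram idea concrete, here is a minimal sketch of how the speed buckets could be counted. The bucket bounds and the function name are just assumptions for illustration, not what our system or dCache actually uses:

```python
from bisect import bisect_left

# Hypothetical upper bucket bounds in kB/s (decades, as in the example above).
BUCKET_BOUNDS_KBPS = [100, 1_000, 10_000, 100_000]


def speed_histogram(speeds_kbps, bounds=BUCKET_BOUNDS_KBPS):
    """Count transfers per bucket.

    Bucket i holds speeds <= bounds[i] (and above bounds[i-1]);
    the extra last bucket catches everything above the largest bound.
    """
    counts = [0] * (len(bounds) + 1)
    for speed in speeds_kbps:
        # bisect_left returns the index of the first bound >= speed.
        counts[bisect_left(bounds, speed)] += 1
    return counts


if __name__ == "__main__":
    print(speed_histogram([50, 100, 101, 5000, 2_000_000]))
```

The duration histogram would work the same way, just with bounds in seconds instead of kB/s.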

We also have an Icinga/Nagios check that uses essentially all the information from PoolManager’s `psu ls pool -l` command to decide whether pools are down (i.e. enabled=, active=, rdOnly= and mode=).
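
Roughly, such a check boils down to something like the sketch below. The sample line layout, the field values and the "usable" rule are assumptions on my side for illustration; the real output and the right criteria may well differ:

```python
import re

# Matches key=value pairs; values end at ';', whitespace or ')'.
FIELD_RE = re.compile(r"(\w+)=([^;\s)]+)")


def parse_fields(line):
    """Extract key=value pairs from one line of pool listing output."""
    return dict(FIELD_RE.findall(line))


def pool_is_usable(fields):
    """Hypothetical rule: enabled, mode 'enabled' and not read-only."""
    return (
        fields.get("enabled") == "true"
        and fields.get("mode") == "enabled"
        and fields.get("rdOnly") == "false"
    )


if __name__ == "__main__":
    # Assumed line format, not copied from a real system.
    sample = "pool_a  (enabled=true;active=12;rdOnly=false;mode=enabled)"
    fields = parse_fields(sample)
    print(fields, pool_is_usable(fields))
```

With a REST API the parsing part would of course go away, which is the point of this wishlist.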

I’d like to convert both (the monitoring and the check) into a Prometheus exporter and do monitoring/alerting with that.
If you’d consider including these metrics in dCache’s Prometheus exporter, then I guess we should talk first. I’m not sure how much experience you have with Prometheus... I do have a bit. Metrics can be collected via various paradigms, and in particular there are many ways to map the data to metrics and labels, especially for a clustered system like dCache.
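
As one possible mapping (a sketch only; the metric and label names are made up, not anything dCache exports today), each pool/queue/state combination could become one sample of a single metric in the Prometheus text exposition format:

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line, e.g. name{a="x",b="y"} 42."""
    label_str = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    if label_str:
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"


if __name__ == "__main__":
    # Hypothetical metric and label names.
    print(render_sample(
        "dcache_pool_movers",
        {"pool": "pool_a", "queue": "regular", "state": "active"},
        7,
    ))
```

The alternative would be one metric per state (active, queued, max) with fewer labels; which is better depends on how one wants to aggregate later, so that is exactly the kind of thing worth discussing.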

Ideally all this information would be retrievable via some API, which is why I thought of the REST API. Only some of the above information is available there right now, however.

So the following would be a wishlist for more data ;-)


  • ideally all the information from `psu ls pool -l` (I’m not sure whether this can differ between `PoolManager` instances, so in principle it would be per `PoolManager`)

Per pool (i.e. pool service, not pool host):

  • total space, used space, gap and perhaps also how much space is used that is cached/precious/etc.
  • per queue: max, active and queued movers
  • the cumulative number of sent/received bytes (i.e. a counter that only ever grows), both in total and per queue, and for both perhaps even per protocol/flavour
  • maybe the same for read/written bytes (i.e. storage IO), if that could differ from the above network IO values

Per door:
The per-protocol/flavour IO numbers I listed on the pool level might make sense on the door level, too, split into two groups:

  • First, the collected numbers for transfers that were initiated via a given door, but whose data wasn’t relayed via that door.
  • Second, the collected numbers for transfers whose data was relayed via the door.

Things like overall dCache IO numbers could be calculated in Prometheus by summing things up... or you might also provide them ;-)
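
The summing is essentially what a `sum by (...)` aggregation does in Prometheus. A small sketch of the same idea, with made-up label names and sample values:

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Sum sample values, grouping by the label names listed in `keep`.

    `samples` is an iterable of (labels_dict, value) pairs.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical per-pool byte counters.
    samples = [
        ({"pool": "pool_a", "protocol": "xrootd"}, 10),
        ({"pool": "pool_b", "protocol": "xrootd"}, 5),
        ({"pool": "pool_a", "protocol": "webdav"}, 3),
    ]
    print(sum_by(samples, ["protocol"]))  # per protocol, across pools
    print(sum_by(samples, []))            # overall total
```

So as long as the per-pool values carry consistent labels, the overall numbers come for free.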

Any ideas?

Cheers,
Chris.
