Advanced use

See also the example scripts to test the different metrics provided by the library.

Recorded fields

Recordings are logged in a json file and include the power draw and the CPU and GPU usage of the pids related to your experiment. Some of the recordings are stored per pid, as a mapping per_process_metric_name : {… pid_i: v_i, ….}. However, monitoring multiple programs on the same device should be done with care (see Measuring multiple programs). In the following, we detail the different metrics recorded. Unless specified otherwise, power is logged in Watts.
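
For example, a per-process field maps each pid to its value. The snippet below is purely illustrative (the pids and values are made up, and the actual json file contains more structure); per_process_cpu_uses is one of the fields listed further down.

# illustrative shape of a per-process field
per_process_cpu_uses = {
    "12345": 42.0,  # percentage of the overall CPU activity used by pid 12345
    "12346": 13.0,
}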

First, you can load the data of an experiment contained in "your_output_folder":

from deep_learning_power_measure.power_measure import experiment, parsers

driver = parsers.JsonParser('your_output_folder')
exp_result = experiment.ExpResults(driver)

From there, you can compute some statistics:

# power consumed by the CPU during your experiment, as measured by RAPL
print(exp_result.total_("rel_intel_power"))
# duration of your experiment
d, start, end = exp_result.get_duration()
print(d, start, end)

To check the list of available metrics (might depend on your setup):

print(exp_result)
Available metrics :
CPU
  per_process_mem_use_abs,per_process_cpu_uses,per_process_mem_use_percent,intel_power,psys_power,uncore_power,per_process_cpu_power,total_cpu_power,per_process_mem_use_uss
GPU
  nvidia_draw_absolute,nvidia_attributable_power,nvidia_mem_use,nvidia_sm_use,per_gpu_power_draw,per_gpu_attributable_power,per_gpu_estimated_attributable_utilization
Experiments
  nvidia_draw_absolute,nvidia_attributable_power,nvidia_mem_use,nvidia_sm_use,per_gpu_power_draw,per_gpu_attributable_power,per_gpu_estimated_attributable_utilization

Below are the definitions of these metrics:

CPU use

  • per_process_mem_use_abs : absolute PSS RAM usage of each recorded process, in bytes.

  • per_process_mem_use_percent : PSS RAM usage of each recorded process, as a percentage of the overall memory usage.

  • per_process_mem_use_uss : USS RAM usage of each recorded process.

  • per_process_cpu_uses : percentage of CPU usage for each process, relative to the overall CPU usage.

  • cpu_uses : percentage of the CPU clock used by this pid during the recording.

  • mem_use_abs : number of bytes used in the CPU RAM. The recording uses psutil in the background; check deep_learning_power_measure.power_measure.rapl_power.get_mem_uses() for more details.

  • mem_use_percent : PSS memory usage as a fraction of the total CPU RAM.

Non-GPU energy consumption

  • intel_power : total consumption measured by RAPL.

  • total_cpu_power : total consumption of the CPU cores measured by RAPL (the core power).

  • psys_power : System on Chip consumption.

  • uncore_power : consumption of other hardware present on the CPU package, for instance an integrated graphics card. This is NOT the nvidia GPU, which sits on a separate board.

  • per_process_cpu_power : essentially per_process_cpu_uses * intel_power. Should be used with caution (see Measuring multiple programs).

  • per_process_mem_use_uss : USS memory used by each process in the CPU RAM.

In other words, you have the following relation:

intel_power = psys_power + uncore_power + total_cpu_power
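
As a quick sanity check, you can compare these fields with the accessor shown earlier. This is a hedged sketch: it assumes total_() accepts all the field names listed above, and the available domains depend on your processor.

# compare the summed RAPL domains with the total intel_power
intel = exp_result.total_("intel_power")
psys = exp_result.total_("psys_power")
uncore = exp_result.total_("uncore_power")
cpu = exp_result.total_("total_cpu_power")
print(intel, psys + uncore + cpu)  # the two values should roughly match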

For the RAM and the core power, we multiply by the memory and CPU use of each pid to get the per-process values stored in the fields per_process_dram_power and per_process_cpu_power, as sketched below.
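
Schematically, and only as an illustration (the variable names below are not the library's internals; see rapl_power.get_power() for the actual implementation):

# attribute the core and dram power to each pid according to its usage
# cpu_use and mem_use are assumed here to be fractions in [0, 1]
per_process_cpu_power = {pid: core_power * cpu_use
                         for pid, cpu_use in per_process_cpu_uses.items()}
per_process_dram_power = {pid: dram_power * mem_use
                          for pid, mem_use in per_process_mem_use_percent.items()}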

Check the CPU and RAPL section for more details on RAPL domains, and deep_learning_power_measure.power_measure.rapl_power.get_power() for implementation details. Support for the different power domains varies with the processor model; the library ignores the domains that are not available.

GPU use

  • per_gpu_attributable_mem_use : memory usage for each gpu

  • per_gpu_per_pid_utilization_absolute : absolute % of Streaming Multiprocessor (SM) used per gpu per pid

  • per_gpu_absolute_percent_usage : absolute % of SM used per gpu for the given pid list

  • per_gpu_estimated_attributable_utilization : relative share of the SM used on each gpu by the experiment
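
As with the CPU metrics, the per-gpu, per-pid fields are nested dictionaries. A purely illustrative example of the layout (made-up gpu indices, pids and values):

# gpu index -> pid -> absolute % of SM used
per_gpu_per_pid_utilization_absolute = {
    0: {"12345": 35.0},
    1: {"12345": 0.0},
}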

GPU power

GPU consumption is measured with the nvidia-smi tool, which relies on NVIDIA's NVML library.

  • nvidia_draw_absolute : the total amount of power used by all the nvidia GPUs on the machine.

  • per_gpu_power_draw : the amount of power used by each GPU.

  • nvidia_attributable_power : total nvidia power consumption attributable to the processes you recorded. It corresponds to the sum over the GPUs of per_gpu_attributable_power.

  • per_gpu_attributable_power : same as nvidia_attributable_power but for each gpu
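
These fields can be queried from a recorded experiment with the accessor shown earlier (a hedged sketch, assuming total_() accepts these field names):

# total power drawn by the nvidia GPUs, and the share attributable to your processes
print(exp_result.total_("nvidia_draw_absolute"))
print(exp_result.total_("nvidia_attributable_power"))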

Monitoring the whole machine with Prometheus

The following code launches the monitoring and a flask app on port 5001:

from deep_learning_power_measure.power_measure import experiment, prometheus_client

driver = prometheus_client.PrometheusClient()
exp = experiment.Experiment(driver)
exp.monitor_machine(period=5)
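
Before starting Prometheus, you can check that the exporter answers. This is a hedged sketch: it assumes the flask app exposes its metrics on the standard /metrics path that Prometheus scrapes.

import requests

# the flask app launched above listens on port 5001
print(requests.get("http://localhost:5001/metrics").text[:300])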

Then, you can launch a prometheus instance

./prometheus --config.file=prometheus.yml

with a config file which looks like the following:

global:
  scrape_interval: 3s

  external_labels:
    monitor: "example-app"

rule_files:

scrape_configs:
  - job_name: "flask_test"
    static_configs:
      - targets: ["localhost:5001"]

Then visit the following URL: http://localhost:9090/graph

Currently, the following metrics are supported

['power_draw_cpu', 'intel_power',
'mem_used_cpu', 'mem_used_gpu',
'power_draw_gpu']

Model complexity

We use a wrapper around torchinfo to extract statistics about your model, essentially the number of parameters and the multiply-accumulate (mac) operation counts. To obtain them, pass additional parameters when creating the experiment:

net = ...         # the model you are using for your experiment
input_size = ...  # (batch_size, *data_point_shape)
exp = experiment.Experiment(driver, model=net, input_size=input_size)

You can log the number of parameters and the number of multiply and add (mac) operations of your model. Currently, only pytorch is supported.
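
A concrete, hedged sketch (the model and input_size are made up, and we assume the JsonParser driver from the first section can also be passed to Experiment):

import torch.nn as nn
from deep_learning_power_measure.power_measure import experiment, parsers

driver = parsers.JsonParser("your_output_folder")
# a small pytorch model fed with batches of 32 vectors of dimension 128
net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
input_size = (32, 128)
exp = experiment.Experiment(driver, model=net, input_size=input_size)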

Docker integration

To run AIPowerMeter in a docker container, we need to use a special branch of the code because of the behaviour of the command:

$ nvidia-smi pmon

A hot fix has been implemented: it forces the tracking of all the GPU processes. It is therefore impossible to isolate a process running at the same time as others.

See the github repo docker_AIPM for more details. You will also find slides explaining the motivations for using Docker images and containers.