dev.nlited.com

>>

Cycle Counting

<<<< prev
next >>>>

2016-03-04 21:53:31 chip Page 1583 📢 PUBLIC

Mar 5 2016

Today's task: Improve the resolution of DbgOut on Overo. DbgOut11 on Overo

Resolution and Accuracy

Resolution and accuracy are not the same thing. For example, assume the temperature is 72F. I can stick my thumb in the air and determine the temperature to be 68.19245F. This would be extremely fine resolution while also wildly inaccurate. I could determine the temperature to be "about 70F", which would be very accurate but miserably low resolution.

DbgOut uses "ticks" to assign a timestamp to events. A tick is an arbitrary period of time that needs to be (in order of importance):

  1. Always increasing (never go backward)
  2. Very small
  3. Constant and uniform

Ticks that are too large (low resolution) will result in multiple events with the same timestamp, making it impossible to know which event happened first and hiding multiple events in the same transition on the display. On the other hand, fluctuating tick periods (low accuracy) can still be useful and mitigated by including a secondary timestamp stream that uses a truly constant period at a lower resolution.

With DbgOut, resolution is more important than accuracy. I prefer the CPU cycle counter over the system clock whenever possible.

DbgOut11.ko needs to know the clock frequency only to determine the maximum period for a rollover check. The clock frequency is not used to scale the ticks. The tick datastream is always raw ticks (cycle counts). This preserves the full accuracy (and sequentialality (sic)) of the timestamps. The clock frequency is only used by the DbgScope viewer to display the ticks as human-readable time spans.

Overo Ticks

gettimeofday resolution DbgOut Overo

DbgOut system clock resolution overo gettimeofday

The current version of DbgOut uses the Linux system clock as its reference clock, reported in nanoseconds. The DbgView quickly tells me that the actual resolution of the system clock is 30517ns, or an update frequency of 32768Hz. From this, I can infer that the system clock is being updated on a hardware interrupt every time the onboard 32KHz counter rolls over.

30.5us might be an unimaginably short time for humans, but in computer terms this is a huge chasm that can swallow up a lot of information. A high-resolution timing profile loses its value if there are many events with exactly the same timestamp, there is no way to tell what happened first. Notice how the scope reports 35 events in the view while there are only a few visible.

Cycle Counting on ARM

I can do better. I can read the cycle counter directly with a resolution of 1/720MHz. This is a resolution of 1.4ns -- over 20,000 times finer! More importantly, the cycle counter acts as a natural serializer as it is impossible to generate two timestamps with the same value, there will always be a delta.

There are, of course, trade-offs. The system clock is standardized so the ticks are always nanoseconds. I don't need to know anything about the hardware, it is always available, and I don't need to worry about cpu frequency scaling or sleep modes. If I read the cycle counter directly I will need to deal with all these things.

Fortunately, there are some big assumptions I can make. The Overo is a single-core system, there is only a single cycle counter. I will assume the cpu is running while I am measuring and will blissfully assume there will be no clock scaling or sleep events. I will assume the clock runs at a constant speed, as reported by /proc/cpuinfo "BogoMIPS".


14:36> Fortunately, the code to read the hardware cycle counter was still lurking behind an #if. I was able to enable it quickly and it just worked.

What a difference! No more overlapping events, and the traces look much better. The scope now reports only 4 items in the view, and each event is clearly visible as a state transition.

DbgOut Overo hardware cycle counter DbgOut Overo hardware cycle counter DbgOut Overo hardware cycle counter

ToDo

Enabling the hardware cycle counter requires me to do a lot more work to avoid reporting bad data. The cycle counter is stored in a 32bit register that will overflow every 4 billion cycles. This is about every 5.37 seconds when the clock is ticking at 800MHz. I need to always read the cycle counter before it wraps around past the value of the previous read to accurately detect rollovers. DbgOut's internal rollover counter effectively extends the cycle counter to 64bits, which will overflow once every 584 years.

I need to add a kernel timer that will be called periodically to simply read the cycle counter and determine whether it has rolled over, incrementing the rollover count if so. The period for this timer should be slightly less than half the rollover period. The maximum counter value of a 32bit register is a known constant of 2^32. I divide this by 2 and shave it a bit to provide a buffer for late timers, arriving at a counter limit of 0x7800,0000. I divide this by the clock frequency (BogoMIPS scaled to BogoKIPS for accuracy) to determine the timer period in milliseconds. For a 800MHz clock, the timer period is 2516ms.

The rollover test is simple. A rollover occurs when the most significant bit transitions from 1 to 0. I need to perform this test at least twice every rollover period to be sure to observe both '0' and '1' states to know when a transition has occurred.

UINT32 BogoMips= 720000; //720MHz clock UINT32 TimerMs= 0x78000000/BogoMips;

This leaves the problem of determining the value of BogoMips. The simplest method is to have DbgOutRelay read the contents of /proc/cpuinfo, parse it for the value of BogoMIPS, and relay the value to DbgOut11.ko in a command ioctl.

root@overo:~# cat /proc/cpuinfo processor : 0 model name : ARMv7 Processor rev 2 (v7l) BogoMIPS : 792.98 Features : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x3 CPU part : 0xc08 CPU revision : 2 Hardware : Generic OMAP36xx (Flattened Device Tree) Revision : 0000 Serial : 0000000000000000

Cycle Counting on x86

Intel invariant TSC Intel: Counting Clocks Intel: Counting Clocks Intel: Counting Clocks

I first "discovered" the cycle counter back in 1998 with the first generation Pentiums. It worked great... until the Pentium M started playing games with the clock frequency to save power. The cycle counter became almost useless with the arrival of Core processors, where each core had its own independent cycle counter -- all running at different speeds. I reverted back to using Microsoft's QueryPerformanceCounter(), which used the PCI bus clock running (glacially) at 14MHz. DbgOut(PC) was left in that state for a long time.

Now may be the time to revist the x86 cycle counter. The last time I read the technical reference manual I saw a reference to a "reference cycle counter" that was available across all cores and was not influenced by the power management system. This will require further investigation. See DbgOutMP.

The Intel Architectural Reference Manual has the following section on the TSC (Time-Stamp Counter). All recent versions implement the TSC as an "invariant counter" allowing it to be used as a wall clock timer without worrying about the effects of power management scaling the internal clock rates. It appears all I need to do is resume using RDTSC -- Intel has fixed the problem.



WebV7 (C)2018 nlited | Rendered by tikope in 53.627ms | 44.220.184.63