Context Switching
7 posts
Musings on Linux
contextswitching · 8 years ago
How yum handles -release dependencies
Today I came across an interesting corner case in the way Red Hat and rpm/yum handle package dependencies.
For background, we have one server running RHEL EUS 6.7.  We wanted to switch this server to the base RHEL channel and away from EUS.  Because EUS channels have their own versions of packages (which contain backported security fixes and other differences to the 'main' RHEL channels), this would involve an update of the system to replace the 6.7 EUS packages with base packages.
On this server, Apache 2.4 had previously been installed from the EUS channels, and with it came a dependency, libnghttp2:
# rpm -qR httpd24u | grep libng
libnghttp2.so.14()(64bit)
# rpm -q libnghttp2
libnghttp2-1.6.0-1.el6.7.z.1.x86_64
When we tried to update the system, we received a dependency resolution error:
# yum update httpd24u --disableexcludes=all
Setting up Update Process
Resolving Dependencies
--> Running transaction check
---> Package httpd24u.x86_64 0:2.4.20-3.ius.el6.7.z will be updated
--> Processing Dependency: httpd24u = 2.4.20-3.ius.el6.7.z for package: 1:httpd24u-mod_ssl-2.4.20-3.ius.el6.7.z.x86_64
---> Package httpd24u.x86_64 0:2.4.25-3.ius.el6 will be an update
--> Processing Dependency: httpd24u-tools = 2.4.25-3.ius.el6 for package: httpd24u-2.4.25-3.ius.el6.x86_64
--> Processing Dependency: httpd24u-filesystem = 2.4.25-3.ius.el6 for package: httpd24u-2.4.25-3.ius.el6.x86_64
--> Processing Dependency: httpd24u-filesystem = 2.4.25-3.ius.el6 for package: httpd24u-2.4.25-3.ius.el6.x86_64
--> Processing Dependency: nghttp2 >= 1.5.0 for package: httpd24u-2.4.25-3.ius.el6.x86_64
--> Running transaction check
---> Package httpd24u-filesystem.noarch 0:2.4.20-3.ius.el6.7.z will be updated
---> Package httpd24u-filesystem.noarch 0:2.4.25-3.ius.el6 will be an update
---> Package httpd24u-mod_ssl.x86_64 1:2.4.20-3.ius.el6.7.z will be updated
---> Package httpd24u-mod_ssl.x86_64 1:2.4.25-3.ius.el6 will be an update
---> Package httpd24u-tools.x86_64 0:2.4.20-3.ius.el6.7.z will be updated
---> Package httpd24u-tools.x86_64 0:2.4.25-3.ius.el6 will be an update
---> Package nghttp2.x86_64 0:1.6.0-1.el6.1 will be installed
--> Processing Dependency: libnghttp2(x86-64) = 1.6.0-1.el6.1 for package: nghttp2-1.6.0-1.el6.1.x86_64
--> Finished Dependency Resolution
Error: Package: nghttp2-1.6.0-1.el6.1.x86_64 (rhel-x86_64-server-6-common)
           Requires: libnghttp2(x86-64) = 1.6.0-1.el6.1
           Installed: libnghttp2-1.6.0-1.el6.7.z.1.x86_64 (@rhel-x86_64-server-6.7.z-common)
               libnghttp2(x86-64) = 1.6.0-1.el6.7.z.1
           Available: libnghttp2-1.6.0-1.el6.1.x86_64 (rhel-x86_64-server-6-common)
               libnghttp2(x86-64) = 1.6.0-1.el6.1
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest
The problem here isn't immediately obvious, particularly for anyone used to looking solely at version and epoch numbers to resolve dependencies. libnghttp2 1.6.0-1 is required, and installed we have... libnghttp2 1.6.0-1?
We can see quickly that httpd24u now requires a metapackage called nghttp2:
# repoquery -qR httpd24u | grep ^nghttp2
nghttp2 >= 1.5.0
Which in turn requires the libnghttp2 package:
# repoquery -qR nghttp2
libnghttp2(x86-64) = 1.6.0-1.el6.1
The problem in this specific case is that not only does the new nghttp2 package require a specific version of libnghttp2 (1.6.0), it also requires a specific *release* (1.el6.1), as per this entry in the SPEC file and the output above:
# grep "Requires: libnghttp2" nghttp2.spec Requires: libnghttp2%{?_isa} = %{version}-%{release} Requires: libnghttp2%{?_isa} = %{version}-%{release}
The release of a package is an identifier to distinguish between different builds of the same code.  To quote from the fedora documentation[1]:
In cases where version numbers are hard to compare, an epoch can be used to more easily distinguish between software version. The release number is almost never used.

This makes sense, in that requiring a release ties a dependency to a particular build of the RPM package, rather than to a version of the software itself.
The fact that we have a different release of the libnghttp2 package installed than the one required by the nghttp2 requirement is why we're getting a dependency error when trying to update the software. 
The second part of the puzzle is why yum can't resolve the dependency itself. Generally, if a dependency is missing but is available in one of the configured yum repositories, yum will simply download or update it to satisfy the dependency.
In the case of httpd24u, we can see it requires libnghttp2-1.6.0-1.el6.1, which is available from the configured repositories:
# yum list libnghttp2 --disableexcludes=all --showduplicates
Installed Packages
libnghttp2.x86_64                1.6.0-1.el6.7.z.1
Available Packages
libnghttp2.i686                  1.6.0-1.el6.1
libnghttp2.x86_64                1.6.0-1.el6.1
libnghttp2.x86_64                1.6.0-1.el6.1
The problem here is that yum will not compare release numbers for the purposes of distinguishing whether packages are the same.  This makes logical sense if you think about it: given that a release can be something like "1.el6.1" or "-ius.centos6", how can you compare those two strings in a sane way?  As far as yum is concerned, if the version and the epoch match, the packages are the same.  You can see this more clearly through yum shell, where we can't get yum to install a different release of a package even if we explicitly specify it:
# rpm -q libnghttp2
libnghttp2-1.6.0-1.el6.7.z.1.x86_64
# yum shell
Setting up Yum Shell
> install libnghttp2-1.6.0-1.el6.1
Setting up Install Process
Package matching libnghttp2-1.6.0-1.el6.1.x86_64 already installed. Checking for update.
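To illustrate the behaviour, here's a minimal Python sketch of the "same package?" test as described above. This is a simplification for illustration only, not yum's actual code:

def same_package(installed, candidate):
    # Epoch and version decide identity; the release is ignored
    # for this purpose, as described above.
    return (installed["name"] == candidate["name"]
            and installed["epoch"] == candidate["epoch"]
            and installed["version"] == candidate["version"])

installed = {"name": "libnghttp2", "epoch": 0,
             "version": "1.6.0", "release": "1.el6.7.z.1"}
candidate = {"name": "libnghttp2", "epoch": 0,
             "version": "1.6.0", "release": "1.el6.1"}

# Different releases, but as far as the depsolver is concerned these
# are the same package -- so there is nothing to "update" to.
print(same_package(installed, candidate))  # True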
There is no easy way to fix this automatically. The solution in this case is human intervention: remove libnghttp2 and install the correct release in its place.
contextswitching · 8 years ago
How is the await field in iostat calculated?
PART 1: 15 seconds of await
One of our customers was running some third party monitoring software, which was reporting very occasional spikes of many seconds (6-15 seconds) worth of await on their local disk. Looking at the Datadog code, we can see that the Python check is really just running iostat and capturing the output:
if Platform.is_linux():
    stdout, _, _ = get_subprocess_output(['iostat', '-d', '1', '2', '-x', '-k'], self.logger)
From the man page, the options iostat is running with are:
-x Display extended statistics. This option works with post 2.5 kernels since it needs /proc/diskstats file or a mounted sysfs to get the statistics. This option may also work with older kernels (e.g. 2.4) only if extended statistics are available in /proc/partitions (the kernel needs to be patched for that).
-k Display statistics in kilobytes per second instead of blocks per second. Data displayed are valid only with kernels 2.4 and later.
-d Display the device utilization report.
The '1' and '2' mean that iostat samples every one second, and returns two reports before exiting. iostat with the -x flag will read from /proc/diskstats by default (it says it can also use sysfs, but a look at the code shows that /proc/diskstats is preferred when available). /proc/diskstats is a file containing a set of incrementing counters with various disk statistics. The first report from iostat will be statistics since the system was booted (as per the man page), and the second will be statistics collected during the interval since the previous report. This gives us output like this:
# iostat -d -k -x 1 2
Linux 2.6.32-642.6.2.el6.x86_64 (linux-test2)   01/26/2017      _x86_64_        (1 CPU)

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb       0.01    0.02   0.00   0.00    0.05    0.09    69.08     0.00   31.58    3.05   53.63   0.74   0.00
xvda       0.00    1.40   0.05   1.22    1.10   10.49    18.18     0.00    0.66    0.62    0.67   0.13   0.02

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb       0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvda       0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
The first step in investigating the issue was to replicate the results by running iostat ourselves. We ran the following in a while loop:
# while true; do date >> /var/log/iostat.log ; iostat -d 1 2 -x -k >> /var/log/iostat.log; sleep 1; done
After running that for a while, we calculated the min, max and average values that iostat logged for await:
# cat /var/log/iostat.log | grep "dm-2" | awk '{print $10}' | sort -V | awk 'NR == 1 { max=$1; min=$1; sum=0 } { if ($1>max) max=$1; if ($1<min) min=$1; sum+=$1;} END {printf "Min: %d\tMax: %d\tAverage: %f\n", min, max, sum/NR}'
The results looked something like this:
Min: 0 Max: 15506 Average: 1.862469
Suspiciously, the max was indeed showing 15506ms. Given that iostat is sampling every second, 15000ms (15 seconds!) of await is logically impossible. But in case the customer pressed further, the next question was: what actually is await? A quick google around the internet doesn't show us exactly how it's calculated, so that means we need to go to the source.
PART 2: Obtaining the source code
The iostat program is part of the 'sysstat' package. The customer was running sysstat-9.0.4-27.el6.x86_64. The first thing to do was to download the source RPM for this version; it's not sufficient to browse the latest code repository online, since software can change significantly from version to version. We found the SRPM on the Red Hat website, downloaded it to a test machine, and installed it:
# wget http://ftp.redhat.com/pub/redhat/linux/enterprise/6Server/en/os/SRPMS/sysstat-9.0.4-27.el6.src.rpm
# rpm -Uvh sysstat-9.0.4-27.el6.src.rpm
# cd rpmbuild
Looking in the SOURCES directory, there is some code and a bunch of .patch files. If we wanted to look at the source for the program as it was installed on disk, we would need to apply the patch files to the original source. We can do this with rpmbuild. From the relevant section of the rpmbuild man page:
-bp Executes the "%prep" stage from the spec file. Normally this involves unpacking the sources and applying any patches.
Before we can apply the patch files, we need to install some dependencies. Trying to build the package initially gave an error which told us which dependencies were missing - gettext and if.h (which yum whatprovides shows are provided by the gettext and kernel-devel packages respectively):
# yum install kernel-devel gettext
Finally we can run rpmbuild.
# rpmbuild -bp SPECS/sysstat.spec
Once this finished, we can find the patched source for the version of sysstat we care about in the ~/rpmbuild/BUILD/sysstat-9.0.4 directory. From the naming of the files, the one we want is iostat.c.
PART 3: Reading the source
Since this is quite a short piece of code (~2000 lines), the first step is to go through and read the comments, function names and variable naming to get a broad picture of what we're looking at. The first thing that sticks out is the write_ext_stat function - the comment above it reads:
/*
 ***************************************************************************
 * Display extended stats, read from /proc/{diskstats,partitions} or /sys.
 *
 * IN:
 * @curr  Index in array for current sample statistics.
 * @itv   Interval of time.
 * @fctr  Conversion factor.
 * @shi   Structures describing the devices and partitions.
 * @ioi   Current sample statistics.
 * @ioj   Previous sample statistics.
 ***************************************************************************
 */
void write_ext_stat(int curr, unsigned long long itv, int fctr,
                    struct io_hdr_stats *shi, struct io_stats *ioi,
                    struct io_stats *ioj)
That looks like what we want. Looking at the code, we can see pretty clearly where the iostats output is printed when displaying extended stats:
/*      DEV   rrq/s  wrq/s   r/s   w/s  rsec  wsec  rqsz  qusz await svctm %util */
printf("%-13s %8.2f %8.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %6.2f %6.2f\n",
       devname,
       S_VALUE(ioj->rd_merges, ioi->rd_merges, itv),
       S_VALUE(ioj->wr_merges, ioi->wr_merges, itv),
       S_VALUE(ioj->rd_ios, ioi->rd_ios, itv),
       S_VALUE(ioj->wr_ios, ioi->wr_ios, itv),
       ll_s_value(ioj->rd_sectors, ioi->rd_sectors, itv) / fctr,
       ll_s_value(ioj->wr_sectors, ioi->wr_sectors, itv) / fctr,
       xds.arqsz,
       S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0,
       xds.await,
       /* The ticks output is biased to output 1000 ticks per second */
       xds.svctm,
       /* Again: Ticks in milliseconds */
       xds.util / 10.0);
The column we care about is await, third from the right. As above, we can see that what's printed for the await column is "xds.await". Looking a bit further up the function, we can see that xds is likely set here:
compute_ext_disk_stats(&sdc, &sdp, itv, &xds);
The iostat.c file doesn't contain this function, but a quick grep over the source directory shows that the definition is present in common.c:
/*
 ***************************************************************************
 * Compute "extended" device statistics (service time, etc.).
 *
 * IN:
 * @sdc  Structure with current device statistics.
 * @sdp  Structure with previous device statistics.
 * @itv  Interval of time in jiffies.
 *
 * OUT:
 * @xds  Structure with extended statistics.
 ***************************************************************************
 */
void compute_ext_disk_stats(struct stats_disk *sdc, struct stats_disk *sdp,
                            unsigned long long itv, struct ext_disk_stats *xds)
Within this function, we can see that await is set based on some arithmetic using various variables, like this:
/*
 * Kernel gives ticks already in milliseconds for all platforms
 * => no need for further scaling.
 */
xds->await = (sdc->nr_ios - sdp->nr_ios) ?
        ((sdc->rd_ticks - sdp->rd_ticks) + (sdc->wr_ticks - sdp->wr_ticks)) /
        ((double) (sdc->nr_ios - sdp->nr_ios)) : 0.0;
From the above, it's important for us to know a few pieces of information:
What is nr_ios?
What is rd_ticks?
What is wr_ticks?
What is sdc?
What is sdp?
What is the ? operator?
The sdc and sdp questions can be answered by the comment above the compute_ext_disk_stats function:
* @sdc  Structure with current device statistics.
* @sdp  Structure with previous device statistics.
The next question - what nr_ios actually is - is answered back in the write_ext_stat function, where we have these lines of code:
sdc.nr_ios = ioi->rd_ios + ioi->wr_ios;
sdp.nr_ios = ioj->rd_ios + ioj->wr_ios;
It appears then that nr_ios is the sum of rd_ios and wr_ios.
The remainder of the information can be obtained back in iostat.c. We know from the man page that extended disk stats rely on the /proc/diskstats file. A quick search for this shows a function called read_diskstats_stat:
/*
 ***************************************************************************
 * Read stats from /proc/diskstats.
 *
 * IN:
 * @curr  Index in array for current sample statistics.
 ***************************************************************************
 */
void read_diskstats_stat(int curr)
This function will parse the contents of /proc/diskstats. The important part of this function:
/* major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq */
i = sscanf(line, "%u %u %s %lu %lu %llu %u %lu %lu %llu %u %u %u %u",
           &major, &minor, dev_name,
           &rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec,
           &wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);
...
sdev.rd_ticks = rd_ticks_or_wr_sec;
Documentation on sscanf and % format specifiers is here:
https://linux.die.net/man/3/sscanf
https://www.le.ac.uk/users/rjm1/cotter/page_30.htm
From the sscanf man page:
The scanf() family of functions scans input according to format as described below. This format may contain conversion specifications; the results from such conversions, if any, are stored in the locations pointed to by the pointer arguments that follow format.
So the first field of each line is an int and will be read into &major, the second field of each line is an int and will be read into &minor, the third field of each line is a char array and will be read into dev_name, and so on.
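As a rough illustration, here's a Python equivalent of that parse, keeping the variable names used in iostat.c (the field indices follow the sscanf call above):

# Parse /proc/diskstats the way the sscanf call above does, keeping
# only the fields that matter for the await calculation.
with open("/proc/diskstats") as f:
    for line in f:
        fields = line.split()
        major, minor, dev_name = int(fields[0]), int(fields[1]), fields[2]
        rd_ios   = int(fields[3])    # field 4:  reads completed successfully
        rd_ticks = int(fields[6])    # field 7:  time spent reading (ms)
        wr_ios   = int(fields[7])    # field 8:  writes completed
        wr_ticks = int(fields[10])   # field 11: time spent writing (ms)
        print(dev_name, rd_ios, rd_ticks, wr_ios, wr_ticks)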
From the above, we now know that rd_ios, wr_ios, rd_ticks and wr_ticks correlate directly to fields in /proc/diskstats. We can get a plain English description of what each field in the /proc/diskstats file means by reading the kernel documentation, here: https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats
From cross referencing the variable names above against the fields in the kernel documentation, we can get a plaintext description of what these variables actually are.
rd_ios = field 4. In English, "reads completed successfully"
rd_ticks = field 7. In English, "time spent reading (ms)"
wr_ios = field 8. In English, "writes completed"
wr_ticks = field 11. In English, "time spent writing (ms)"
Now we know all of the information we need to work out what await actually is. A summary:
sdc - Structure with current device statistics
sdp - Structure with previous device statistics
nr_ios - The sum of reads and writes completed successfully (fields 4 and 8 in /proc/diskstats)
rd_ticks - Time spent reading in ms (field 7 in /proc/diskstats)
wr_ticks - Time spent writing in ms (field 11 in /proc/diskstats)
Back to the await calculation, we can see that it's set to this:
xds->await = (sdc->nr_ios - sdp->nr_ios) ?
        ((sdc->rd_ticks - sdp->rd_ticks) + (sdc->wr_ticks - sdp->wr_ticks)) /
        ((double) (sdc->nr_ios - sdp->nr_ios)) : 0.0;
One last bit of key information: in C, the "?" you're looking at here is a conditional operator (sometimes called a ternary operator). It evaluates to its second argument if the first argument evaluates to true, and to its third argument (the bit after the :) if the first argument evaluates to false. In English: if there is no difference between the current and previous counts of completed I/Os, there was no disk activity, so set await to 0.0 (and avoid dividing by zero). Otherwise, set await to:
((sdc->rd_ticks - sdp->rd_ticks) + (sdc->wr_ticks - sdp->wr_ticks)) / ((double) (sdc->nr_ios - sdp->nr_ios))
Rewriting that in English:
await = (the increase in time spent reading in ms, plus the increase in time spent writing in ms), divided by (the increase in the number of reads and writes completed)
Therefore:
await is the total time spent performing read/write operations divided by the number of read/write operations completed - i.e. the average time per I/O over the sampling interval.
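To make the formula concrete, here's a minimal Python sketch that reproduces the calculation from two /proc/diskstats samples taken one second apart. The device name is an assumption - substitute one from your own system:

import time

def sample(device):
    """Return (nr_ios, rw_ticks) for one device from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                rd_ios, rd_ticks = int(fields[3]), int(fields[6])
                wr_ios, wr_ticks = int(fields[7]), int(fields[10])
                return rd_ios + wr_ios, rd_ticks + wr_ticks
    raise ValueError("device %r not found" % device)

dev = "sda"  # assumption: pick a device present on your system
ios_prev, ticks_prev = sample(dev)
time.sleep(1)
ios_cur, ticks_cur = sample(dev)

# Same arithmetic as compute_ext_disk_stats, including the guard
# against dividing by zero when there was no disk activity.
delta_ios = ios_cur - ios_prev
await_ms = (ticks_cur - ticks_prev) / float(delta_ios) if delta_ios else 0.0
print("%s: await = %.2f ms" % (dev, await_ms))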
PART 4: Conclusion
We know that in the 1000ms between iostat runs, there cannot possibly be more than 1000ms worth of reading/writing done. If iostat is sampling 1 second apart, it's impossible for there to be a 15,000ms await. iostat is returning incorrect results. The customer should upgrade to a later version of sysstat; if the bug persists, we should try to reproduce it and then open a support case with Red Hat to get it fixed.
contextswitching · 8 years ago
Discovering the cache
It's well known that when you look at free memory on Linux, you need to bear in mind that the page cache is included in total memory usage. In this example there's 987MB of RAM usable by the O/S, 92MB is used by applications, 828MB is used for buffers/cache, leaving 66MB free, but 649MB is available as cache contents can be evicted if needed:
# free -m
              total        used        free      shared  buff/cache   available
Mem:            987          92          66          49         828         649
The page cache takes up the majority of buffers/cache. There are occasionally circumstances where you want to tune how cache evictions happen, but usually the kernel deals with it just fine.
The page cache contains the contents of files which are mapped into RAM when they are read or written[1], to speed up access the next time they're needed. If there's no more space in the cache, if more memory is needed by applications, or if the system looks to be under pressure, the least recently used (LRU) pages are evicted. The page cache doesn't store filenames, but we can do a reverse lookup to see whether the pages from a given file are in cache using mincore(2). There are a number of tools that wrap this system call, including pcstat, linux-ftools and vmtouch. They all give pretty much the same information. Here's an example with vmtouch:
Clear the page cache (only do this for testing!):

# echo 1 > /proc/sys/vm/drop_caches

We can see now that none of these MyISAM data files are cached:

# vmtouch /var/lib/mysql/mysql/*.MYD
           Files: 22
     Directories: 0
  Resident Pages: 0/151  0/604K  0%
         Elapsed: 0.00029 seconds

We can also show details on each file:

# vmtouch /var/lib/mysql/mysql/help_topic.MYD -v
/var/lib/mysql/mysql/help_topic.MYD
[ ] 0/118
           Files: 1
     Directories: 0
  Resident Pages: 0/118  0/472K  0%
         Elapsed: 0.000112 seconds

Now read some data and check again:

# mysql -e 'SELECT name FROM mysql.help_topic WHERE help_topic_id = 300;'
# vmtouch /var/lib/mysql/mysql/help_topic.MYD -v
/var/lib/mysql/mysql/help_topic.MYD
[ o ] 1/118
           Files: 1
     Directories: 0
  Resident Pages: 1/118  4K/472K  0.847%
         Elapsed: 0.00011 seconds

The "o" shows pages in the page cache, so in this case the SELECT statement read a single 4K page into RAM. You can also choose to add (-t) or evict (-e) all pages from a file:

# vmtouch /var/lib/mysql/mysql/help_topic.MYD -tv
/var/lib/mysql/mysql/help_topic.MYD
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 118/118
           Files: 1
     Directories: 0
   Touched Pages: 118 (472K)
         Elapsed: 0.003222 seconds
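For a sense of what these tools do under the hood, here's a rough Python sketch of the same mincore(2) lookup via ctypes. It assumes glibc on Linux, and the real tools are more careful than this:

import ctypes
import mmap
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.mincore.argtypes = (ctypes.c_void_p, ctypes.c_size_t,
                         ctypes.POINTER(ctypes.c_ubyte))
libc.mincore.restype = ctypes.c_int

def resident_pages(path):
    """Count how many of a file's pages are resident in the page cache."""
    size = os.path.getsize(path)
    if size == 0:
        return 0, 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # MAP_PRIVATE with PROT_WRITE only so ctypes can borrow a
        # writable buffer address; we never actually write to it.
        m = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)
    pagesize = os.sysconf("SC_PAGESIZE")
    npages = (size + pagesize - 1) // pagesize
    vec = (ctypes.c_ubyte * npages)()       # one byte per page
    buf = ctypes.c_char.from_buffer(m)      # address of the mapping
    if libc.mincore(ctypes.addressof(buf), size, vec) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    del buf
    m.close()
    return sum(v & 1 for v in vec), npages  # low bit set = page resident

resident, total = resident_pages("/var/lib/mysql/mysql/help_topic.MYD")
print("%d/%d pages resident" % (resident, total))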
This information can help you identify what's in the page cache. It might help you spot inefficient reads of data, or explain strange cache usage numbers seen in "free".
Finally, what about the performance of the page cache? cachestat, part of perf-tools, is a great way to see how often reads have to go to disk because the required page isn't in the cache:
# ./cachestat -t
Counting cache functions... Output every 1 seconds.
TIME         HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
15:56:10      323      125        0    72.1%            0         79
15:56:11      341        0        0   100.0%            0         79
15:56:12     3162     1729        0    64.6%            3         83
15:56:13      579      312        1    65.0%            4         83
15:56:14    26458     3783        6    87.5%           17         85
15:56:15    24124     2932        3    89.2%           28         86
15:56:16    38967     1975        0    95.2%           36         86
15:56:17    36877     2337        6    94.0%           45         86
15:56:18    10835     1891        0    85.1%           52         86
15:56:19      629        1        6    99.8%           52         86
15:56:20      341        0        0   100.0%           52         86
15:56:21      351        0        3   100.0%           52         86
Footnote: [1] In fact, the application can tell the kernel not to add this file to the cache using fadvise FADV_DONTNEED. For example, there's no benefit to filling the cache with streaming data which will never be reread.
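As an aside, Python exposes this hint directly via os.posix_fadvise, so a streaming reader can drop its own pages when it's done. A minimal sketch (the file name is hypothetical):

import os

# Stream through a file, then tell the kernel we won't need its pages
# again so they can be dropped from the page cache.
fd = os.open("stream.dat", os.O_RDONLY)
try:
    while os.read(fd, 1 << 20):
        pass  # ...process the chunk...
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # length 0 = to EOF
finally:
    os.close(fd)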
contextswitching · 9 years ago
Where is the PATH?
There are many references for how to change the PATH variable, either by overwriting it in system-wide files like /etc/profile, or per user, for example in ~/.bashrc. But what if you don't overwrite or modify it in any of these places - where does the default value come from?
As is often the case with Linux, "it depends". When you log in locally, /bin/login sets an initial PATH value from its compiled-in default for normal users or superusers. This can be customised on Debian-type distros with ENV_PATH and ENV_SUPATH in /etc/login.defs, but not on Red Hat derivatives. Bash then does its normal processing of login files (/etc/profile, /etc/profile.d, ~/.bashrc, etc.):
# strings /bin/login | grep "/usr/bin"
/usr/local/bin:/bin:/usr/bin
/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
When you log in via SSH, by default, /bin/login isn't involved (it can be if UseLogin yes is set). In this case, sshd sets the path that it was compiled with using the "--with-default-path" option for normal users, and "--with-superuser-path" for superusers. You can see the default-path in sshd_config, but not superuser-path:
# grep PATH /etc/ssh/sshd_config
# This sshd was compiled with PATH=/usr/local/bin:/bin:/usr/bin

# strings /usr/sbin/sshd | grep ":/usr/bin"
/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
/usr/local/bin:/bin:/usr/bin
If, when bash executes, it finds no PATH has been set (in variables.c):
temp_var = set_if_not ("PATH", DEFAULT_PATH_VALUE);
then it'll set a value from its compiled-in default (in config-top.h):
#ifndef DEFAULT_PATH_VALUE
#define DEFAULT_PATH_VALUE \
"/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:."
#endif
That value might vary by distribution, for example Red Hat patch it to look like this (this is against an older version of bash which had a different default value):
- "/usr/gnu/bin:/usr/local/bin:/bin:/usr/bin:." + "/usr/local/bin:/bin:/usr/bin"
So this shows that even if you run a bash script outside of a login shell, you'll still always have a PATH set, which explains why you don't have to fully qualify executable paths in those circumstances.
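One way to see the compiled-in default in action, assuming env and bash are available - start bash with an empty environment and ask it for PATH:

import subprocess

# With `env -i` there is no inherited PATH, and a non-interactive,
# non-login bash reads no profile files -- so the PATH it reports can
# only be its compiled-in DEFAULT_PATH_VALUE.
out = subprocess.run(
    ["env", "-i", "/bin/bash", "-c", "echo $PATH"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())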
Bash git source, config-top.h
login.defs(5)
SSH, the Secure Shell: The Definitive Guide
contextswitching · 9 years ago
Leap seconds & timer subsystems
With the next leap second insertion swiftly approaching on the 31st of December, we're starting to think about possible effects on systems. Previously (2012) a few issues cropped up with the leap second, one of which was high CPU usage. This should have been fixed in kernel releases from RHEL 6.4 onwards. That problem was actually pretty interesting, and a good excuse to think about how Linux and other systems handle the problem of time.
If you'd like to really unbalance your perception of the world and potentially have an existential crisis, I highly recommend reading about time. For the purposes of what we're talking about here, we'll assume that atomic clocks provide us with an accurate measure of time, and further we'll assume that leap seconds are a reasonable way to handle the problem of keeping these in line with the Earth's axial rotation.[1]

As far as systems go, there are two aspects of time we care about: what time is it relative to other systems, and what time is it relative to system boot. The former is used to co-ordinate communication and exchange between systems, and the latter to co-ordinate system tasks and processes.
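Both kinds of time are visible from userspace via clock_gettime(2); a quick Python illustration (Python 3.3+ on Linux):

import time

# Wall-clock time: shared with other systems, adjusted by ntpd and
# leap seconds, and can step backwards.
wall = time.clock_gettime(time.CLOCK_REALTIME)

# Monotonic time: counted from a fixed point around boot, only ever
# moves forward.
mono = time.clock_gettime(time.CLOCK_MONOTONIC)

print("CLOCK_REALTIME:  %.6f" % wall)
print("CLOCK_MONOTONIC: %.6f" % mono)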
The problem we saw with leap seconds in 2015 was based on the interaction between these two 'types' of time and how they manage themselves in relation to one another.
System Time

System time is going to be measured in a few ways, but commonly you'll have this synchronised by a service like ntpd. ntpd on your server uses a reference clock to provide the time. It does some neat little tricks, considering things like offset and jitter, in order to keep the time as close to accurate as possible. We'll assume that it is correctly doing so and staying synchronised with the reference clock. When it notices offsets or drift, it will correct them. With NTP it's generally safe to assume that your clock is accurate-ish.
Timers and Timer subsystems

When you run multiple processes on a CPU you need a way to ensure each process gets time on the CPU. There are a couple of different schedulers you'll commonly see, like CFS, but one thing they all need is the ability to measure time. In this case we don't care about our time relative to other systems, merely time relative to our own system. To measure this the kernel uses something called 'jiffies'. Jiffies are the number of 'ticks' since system boot. The tick rate is based on the HZ value defined by the architecture - you can find this in <asm/param.h>, but you can assume it to be 1000 on x86-64 systems. The standard kernel timer API uses jiffies as its units.
The smaller the unit of time, the more accurately and efficiently you can schedule tasks. Ideally you would use the smallest unit that it is possible to measure on the realtime clock of the processor. This may or may not match the unit being used for the jiffy; usually it will be more accurate (smaller) than what the jiffy can provide. On systems where you need higher resolution and more accuracy you have two options: increase the HZ value appropriately, thus reducing the length in nanoseconds of each jiffy, or utilise one of the high resolution timers. While it might seem obvious that you should just set a higher value for HZ, this can introduce additional overhead across the whole system. High resolution timers allow you to be constrained not by jiffies but by the maximum accuracy the hardware provides. hrtimer() is a common implementation; have a look at it here.[2] At a high level it does this by matching offsets between system clocks and the realtime clock of the processor.[3]
So what does this have to do with leap seconds?
Leap seconds affect system time, and we'll assume here that we're using ntpd to synchronise it. There are a few ways that leap seconds can be applied. You can 'step' the clock: either step back a second, repeating your last timestamp, or insert an impossible timestamp. Alternatively you might choose to smear or slew the time. With a smear, the NTP server smudges the extra second across a period of time, distributing the additional second in millisecond increments throughout that period; a slew is the same thing but done locally on the server. Smears and slews solve some of the issues you see with stepping time, but they do mean that your time will remain 'incorrect' for a longer period, which may not be ideal. In this case it's the step that caused the problem. With a step, your system time will increment similarly to below, either repeating a second or adding an additional second, e.g.:
23:59:59
23:59:59  <-- leap second
00:00:00

or:

23:59:59
23:59:60  <-- leap second
00:00:00
That's fine for the kernel and ntp, they'll both adjust and go 'sweet, inserted a leap second' before trundling along with their regular life. Where we get a problem is the way in which system time interacts with the timer subsystems we looked at earlier. If we had been using the kernel's timer API we would be measuring in jiffies from boot and not care what the system clock was doing because we already have a built in consistently incrementing value independent of system time.  
This only becomes a problem when we are using high resolution timers. This is because hrtimer and other high resolution timers are essentially bridging system time with the processor's realtime clock. The issue seen in 2015 was with hrtimer and how it implements this integration with the system clock. An overly simple explanation: via a number of internal time bases that it offsets from the system clock, hrtimer takes system time and translates it into the appropriate value for the processor's realtime clock, basically syncing up the smaller units of the processor's realtime clock (milliseconds) with the larger units of the system clock (seconds). It's not using jiffies, so it's anchoring its internal time bases off the system clock as opposed to an independent counter.
That works just grand until you change the system time, because hrtimer is going to need to adjust itself to take that into account. Since ntpd already has to deal with things like drift, there's already a way to do that: the kernel changes the time and calls clock_was_set(), which notifies hrtimer to go and check and adjust itself. The problem in 2015 was that earlier kernels failed to make this call for the leap second, so hrtimer was never notified that the system time had changed. This meant that the subsystem was immediately a second in the future, causing all the timers to go off a second sooner than they should. Not necessarily a huge problem for a single application; if, however, you have a bunch of timers set for <1 second in the future - which you probably do if you cared enough to utilise a high resolution timer in the first place - those are all going to wake up at exactly the same time, because they'll immediately expire. They'll wake up and request CPU time. Most of them will reset immediately, for another <1 second value, which of course will immediately expire again since the subsystem is still ahead of the system clock. Repeat ad infinitum.
The immediate 'fix' that was used and publicised for this was to restart ntpd, which is kind of misleading because it implies the issue was with ntpd. The reason this worked is that restarting ntpd sets the system clock as part of its initialisation, which causes the kernel to call clock_was_set(); hrtimer then works out that its offset is wrong and corrects itself. If we didn't know any better we'd think it was something to do with ntpd, but actually it turned out to be a much more interesting case, and a nice excuse for digging through how different elements of the system track time in different ways and how those interact.
[1] Turns out this is an enormous assumption and actually is a huge simplification of how time and space operate, and basically everything is inaccurate, time is meaningless and nothing matters.

[2] hrtimer source

[3] It's actually more complicated than that and can be set to be bound to either a monotonic clock (similar to jiffies in that it's an incrementing counter) or to CLOCK_REALTIME, which is self explanatory. We're presuming you're binding to CLOCK_REALTIME since that's where this bug crops up. More details here.
Other reading:
Original bug report
Interesting breakdown
General leap second info from RH
Google's Policy on NTP smears
contextswitching · 9 years ago
Sharing Data Blocks with XFS
Traditionally, filesystems map a file to a unique set of blocks on disk. Each file has its own set of blocks. This is efficient when the contents of the files are completely unique. But there are cases where you'll have multiple files that are identical, or mostly the same. For example, imagine a VM image. You might have 5 VMs created from the same image. At the start they'll be identical. The VM clone process will make a few things unique - perhaps the network address and hostname - but the file is still mostly the same. Over time, logs and data will cause there to be more differences, but the OS may stay essentially the same.
If the files could share the data blocks that are the same, they would take up less space, and it would take less time to create the VM clone. This has been supported by btrfs and OCFS2 since around 2009, using a feature called "reflink", but they're both more specialist filesystems. From Linux kernel 4.9 it's also supported by XFS, making it available to many more users.
To create a reflinked copy of a file:
cp --reflink source.file destination.file
No matter how big source.file is, the cp operation will be instant, and df will show that no extra space has been used. A new inode has been created, independent of the source, but no data blocks have been copied - they're only marked as shared. You can then perform any operation you like on destination.file as you would a normal file, e.g. change ownership or permissions, read and write. When you write to either source.file or destination.file, it's a copy-on-write operation - the write goes to a new location, so that block is no longer shared. The more you write to either file, the fewer blocks the two files will share.
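Under the hood, cp --reflink asks the filesystem to share blocks via the FICLONE ioctl. Here's a rough Python sketch of the same call; the FICLONE value is taken from <linux/fs.h> (0x40049409 on Linux) and is an assumption worth verifying on your platform:

import fcntl
import os

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>; assumption

src = os.open("source.file", os.O_RDONLY)
dst = os.open("destination.file", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    # Ask the filesystem to share source.file's data blocks with
    # destination.file instead of copying them.
    fcntl.ioctl(dst, FICLONE, src)
finally:
    os.close(src)
    os.close(dst)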
If you try to reflink a file on a filesystem that doesn't support it, you'll get an error like:
cp --reflink source.file destination.file
cp: failed to clone 'destination.file' from 'source.file': Inappropriate ioctl for device
For details on how this is implemented for XFS:
https://github.com/djwong/xfs-documentation/blob/master/design/XFS_Filesystem_Structure/reflink.asciidoc
https://github.com/djwong/xfs-documentation/blob/master/design/XFS_Filesystem_Structure/refcountbt.asciidoc
contextswitching · 9 years ago
Maximum Partitions on Linux
GPT format partition tables remove some of the limitations of MS-DOS style primary/logical tables. You'll see in discussions of the GPT layout on disk that it has space for many "primary" partitions. The minimum number defined by the standard is 128, although you can make space for more when you label the disk. There's another limit that comes into play in the OS, which is how the disk driver deals with partitions. Here's a run-through of working out what's involved.
First, some background on how devices appear. Since Linux 2.6, these are managed by udev. Udev provides dynamic (i.e. device nodes are only created when the device that needs them is attached) and persistent (each device will always get the same device node name) allocation of device nodes. If you list a device you see some extra things compared to a normal file:
$ ls -l /dev/sda
brw-rw---- 1 root disk 8, 0 Dec 5 12:17 /dev/sda
There are two types of device: "block" and "character". The "b" means this is a block type device. This sort of device behaves a lot like a file: a value read from a certain location will be whatever was last written there. The "8, 0" is the major and minor device number. The major number selects the driver to be used to access this device, and the minor number is passed as a parameter to the driver.
Other devices are "character" type devices. These cause some immediate action when you write to or read from them - for example writing a byte might display it on the screen.
crw--w---- 1 root tty 4, 0 Dec 2 18:01 /dev/tty0
So above we found /dev/sda had major number "8". We can check https://www.kernel.org/doc/Documentation/devices.txt to see what handles that:
   8 block    SCSI disk devices (0-15)
                0 = /dev/sda    First SCSI disk whole disk
               16 = /dev/sdb    Second SCSI disk whole disk
               32 = /dev/sdc    Third SCSI disk whole disk
                  ...
              240 = /dev/sdp    Sixteenth SCSI disk whole disk

              Partitions are handled in the same way as for IDE
              disks (see major number 3) except that the limit on
              partitions is 15.
This references IDE disks, so here's the information for those as well:
   3 block    First MFM, RLL and IDE hard disk/CD-ROM interface
                0 = /dev/hda    Master: whole disk (or CD-ROM)
               64 = /dev/hdb    Slave: whole disk (or CD-ROM)

              For partitions, add to the whole disk device number:
                0 = /dev/hd?    Whole disk
                1 = /dev/hd?1   First partition
                2 = /dev/hd?2   Second partition
                  ...
               63 = /dev/hd?63  63rd partition

              For Linux/i386, partitions 1-4 are the primary
              partitions, and 5 and above are logical partitions.
              Other versions of Linux use partitioning schemes
              appropriate to their respective architectures.
OK, so this is a disk handled by the SCSI driver (in fact these days, SATA and SCSI disks are both handled by the SCSI driver, so the name is no longer very accurate), and this documentation tells us what we're looking for about partitions. We can see that for this driver, the minor number refers to the partition, starting at 0. Importantly, this is shared amongst disks /dev/sda to /dev/sdp. The information from the IDE section shows how partitions are handled. It says we can only have 15 partitions on each block device (the first minor number is the whole block device) - sda can have minor numbers 0-15, sdb can have 16-31, and so on up to sdp with minor numbers 240-255.
Here's how that looks:
# ls -l /dev/sdc* | sort -k10 -n
brw-rw---- 1 root disk 8, 32 Aug 22 21:25 /dev/sdc
brw-rw---- 1 root disk 8, 33 Nov 27 16:43 /dev/sdc1
brw-rw---- 1 root disk 8, 34 Nov 27 16:43 /dev/sdc2
brw-rw---- 1 root disk 8, 35 Nov 27 16:43 /dev/sdc3

# ls -l /dev/sdd* | sort -k10 -n
brw-rw---- 1 root disk 8, 48 Aug 22 21:25 /dev/sdd
brw-rw---- 1 root disk 8, 49 Nov 27 16:43 /dev/sdd1
brw-rw---- 1 root disk 8, 50 Nov 27 16:43 /dev/sdd2
brw-rw---- 1 root disk 8, 51 Nov 27 16:43 /dev/sdd3
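Here's a small Python sketch that decodes this scheme for a /dev/sd* node. It assumes the classic 16-minors-per-disk layout described above (major 8, no extended devt), and that /dev/sda exists:

import os

def sd_partition_info(path):
    """Decode the classic sd major/minor scheme (16 minors per disk)."""
    st = os.stat(path)
    major = os.major(st.st_rdev)
    minor = os.minor(st.st_rdev)
    disk = minor // 16        # which disk on this major (0 = sda)
    partition = minor % 16    # 0 = whole disk, 1-15 = partitions
    return major, minor, disk, partition

for dev in ("/dev/sda", "/dev/sda1"):
    print(dev, sd_partition_info(dev))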
As an aside, especially with multipathing, you will likely have seen block devices beyond /dev/sdp. There's extra major device numbers allocated to cope with this:
# ls -l /dev/sdaf
brw-rw---- 1 root disk 65, 240 Oct 9 13:45 /dev/sdaf
So we can easily see what's the maximum number of disks the SCSI driver can cope with too:
$ grep "SCSI disk devices" devices.txt 8 block SCSI disk devices (0-15) 65 block SCSI disk devices (16-31) 66 block SCSI disk devices (32-47) ... 133 block SCSI disk devices (208-223) 134 block SCSI disk devices (224-239) 135 block SCSI disk devices (240-255)
In that last 135 block entry we see the last name sdiv, and this is the 256th disk:
240 = /dev/sdiv 256th SCSI disk whole disk
So we've worked out that you can have 15 partitions on each disk, and 256 disks. Well. Not quite.
To get around the limits, a feature called "extended devt" was added in kernel 2.6.26:
https://lwn.net/Articles/288953/
This adds a new major device for overflow above partition 15 (or above /dev/sdiv). So this might look like:
# ls -l /dev/sdb16
brw-rw---- 1 root disk 259, 0 Dec 7 09:17 /dev/sdb16
The maximum value for the minor number is 20 bits, meaning somewhere north of 1 million partitions can be addressed across all block devices. There's a hard limit of 256 partitions per disk in the kernel:
http://lxr.linux.no/#linux+v3.8.8/include/linux/genhd.h#L58
It's worth noting that not all block type drivers support extended devt; for example the Xen virtual block (xvd) driver doesn't. Here's what the kernel will tell you:
Error: Error informing the kernel about modifications to partition /dev/xvdb16 -- Invalid argument.
Error: Failed to add partition 16 (Invalid argument)
So to summarise, when you're using an sd device:
If you have a kernel older than 2.6.26, you're limited to 15 partitions
If you have a kernel newer than 2.6.26, and have created a default GPT table, you're limited to 128 partitions
If you have a kernel newer than 2.6.26, and have changed the defaults when creating the GPT table, you can make up to 256 partitions
The actual maximum number of disks is another rabbit hole for another day