# Check for zero byte file in unix
messagemmorg · 3 years ago
Check for zero byte file in unix
If you want to check the core file size limit, use the ulimit -c command as shown below.
If you want to limit the system resources for a user on your Linux/Unix server, you need the ulimit command. In this article, I will take you through 33 practical examples of the ulimit command in Linux/Unix.

There are two types of limits one can put on a system: a hard limit and a soft limit. Hard and soft limits are a mechanism through which you can limit the system resources for a user on the server. The soft limit is the threshold past which you start getting warnings about your over-usage of a resource, until you reach the hard limit; the hard limit is the point past which you will not be allowed to use the resource any further. You can check the configuration in /etc/security/nf, where usually the hard limit is set by the root user and the soft limit can be overridden by another user.

Below are the parameters which can be set through the ulimit command:
Memlock - max locked-in-memory address space (KB)
Nofile - max number of open file descriptors
Maxlogins - max number of logins for this user
Maxsyslogins - max number of logins on the system
Priority - the priority to run user processes with
Locks - max number of file locks the user can hold
Sigpending - max number of pending signals
Msgqueue - max memory used by POSIX message queues (bytes)
Nice - max nice priority allowed to raise to

Also Read: How to Drop/Flush/Clear Cache Memory or RAM in Linux (RedHat/CentOS 7/8) in 6 Best Steps

Example 1: How to check the core file size limit in Linux/Unix
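The same soft/hard pair described above can also be inspected programmatically; a minimal sketch using Python's standard resource module (Unix only — this is my own illustration, not part of the original article):

```python
import resource

# Query the current core-file size limit (what `ulimit -c` reports).
# getrlimit returns a (soft, hard) tuple; RLIM_INFINITY means "unlimited".
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

def show(limit):
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)

print("core file size  soft:", show(soft), " hard:", show(hard))

# A process may lower its own limits, or raise the soft limit up to the
# hard limit, but only a privileged process may raise the hard limit.
resource.setrlimit(resource.RLIMIT_CORE, (0, hard))  # soft limit -> 0
```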
Examples of ulimit command in Linux/Unix:
Example 1: How to check the core file size limit in Linux/Unix.
Example 2: How to change the core file size limit in Linux/Unix.
Example 3: How to check the data segment size limit in Linux/Unix.
Example 4: How to change the data segment size limit in Linux/Unix.
Example 5: How to check the Scheduling Priority limit in Linux/Unix.
Example 6: How to change the Scheduling Priority limit in Linux/Unix.
Example 7: How to check the File Size limit in Linux/Unix.
Example 8: How to change the File Size Limit in Linux/Unix.
Example 9: How to check the No of Pending Signals limit in Linux.
Example 10: How to change the no of Pending Signals limit in Linux.
Example 11: How to check Open files limit in Linux/Unix.
Example 12: How to change Open Files limit in Linux/Unix.
Example 13: How to check the pipe size limit using ulimit command in Linux/Unix.
Example 14: How to check max memory size limit using ulimit command in Linux.
Example 15: How to check POSIX messages queue limit in Linux.
Example 16: How to change POSIX messages queue limit in Linux.
Example 17: How to check real-time priority limit in Linux.
Example 18: How to change real-time priority limit in Linux.
Example 19: How to check stack size limit in Linux/Unix.
Example 20: How to change stack size limit in Linux/Unix.
Example 21: How to check CPU time limit in Linux/Unix.
Example 22: How to change CPU time limit in Linux/Unix.
Example 23: How to check max user processes limit in Linux/Unix.
Example 24: How to change max user processes limit in Linux/Unix.
Example 25: How to check virtual memory limit in Linux/Unix.
Example 26: How to change virtual memory limit in Linux/Unix.
Example 27: How to check File Lock Limits in Linux.
Example 28: How to change File Lock Limits in Linux.
Example 29: How to check max locked memory limits in Linux.
Example 30: How to change max locked memory limits in Linux.
Example 31: How to change max memory size limits in Linux.
Example 32: How to check ulimit configuration in Linux.
Example 33: How to change ulimit values permanently.
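The post title asks how to check for a zero-byte file in Unix; in the shell that is typically `find . -type f -size 0` or the `[ -s file ]` test. A Python equivalent is sketched below (the `is_zero_byte` helper is my own, not from the original article):

```python
import os
import tempfile

def is_zero_byte(path):
    """Return True if path is an existing, empty regular file
    (the Python analogue of the shell test `[ ! -s path ]`)."""
    return os.path.isfile(path) and os.path.getsize(path) == 0

# Demo with a temporary file: empty at creation, non-empty after a write.
with tempfile.NamedTemporaryFile(delete=False) as f:
    name = f.name
print(is_zero_byte(name))   # -> True, the file is empty
with open(name, "w") as f:
    f.write("data")
print(is_zero_byte(name))   # -> False, no longer zero bytes
os.unlink(name)
```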
professionalnahas · 3 years ago
What is a serial terminal program
Previously we wrote about best practices for Mainframe Systems and Terminal Emulation; in this article we will answer the simple question: what is a terminal? It is advisable to read about terminal emulation too, to understand the difference. Another term you might not know is mainframe computer: terminals were originally developed for mainframe computers, and before them the teletype did the same job. Today the abbreviation tty still stands for the word teletype in Unix operating systems and associated software, where it is ubiquitous terminology. In the early days of the teletype there were similar devices, referred to as hardcopy terminals, consisting of a printer and a keyboard; the advantage of later terminals was that a screen was used for the reproduction of the data. Through such a terminal one not only gets sequentially arranged output lines, it also allows moving the cursor and changing text. Their spread gradually began around 1977. The personal computer has now replaced the terminal in most offices, and many terminals have been replaced by terminal emulators running on a PC.
What is Terminal? A terminal is a console for the input and display of data.

In macros you can use all characters from the keyboard, and any ASCII char if you use $xx or #xxx, where $xx is the hex and #xxx the decimal format of the ASCII code. If you want to use the # or $ character in a macro you should type it twice ($$=$ and ##=#). To calculate an XOR checksum byte use the %XORxx command; to calculate a SUM (1-byte sum) checksum byte use the %SUMxx command, where xx is the offset of the first byte for the calculation. To insert a delay in a macro string use %DLYxxxx, where xxxx is a value 0000-9999 in ms. A macro string can be up to 256 characters long, and it is "saved" even if you don't save it and will be available next time.

Other macro commands:
%RUN"cmd.exe" - this will run the command line prompt
%M03 - this will send/run macro #3 (there are some limits when using this)
%SCRS"script.tsc" - load and start script

Simple Pascal-syntax scripting is possible: you can write short scripts to do some simple tasks. Below in the attachment section you can find a few sample scripts to check and test functionality; more about the supported commands can be found by clicking the "Help" button.

Terminal can also act as a telnet server and listen on a selected TCP port. You can connect to it with any telnet client program from another computer in the network (or over the internet from a different location), see what's going on in the terminal, send commands, etc.

Download the new version of the Terminal software - version 1.
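A sketch of what the %SUM and %XOR checksum macros appear to compute, assuming %SUM is a modulo-256 additive checksum and %XOR a byte-wise XOR, both starting at the given offset (the function names are my own):

```python
def sum8(data: bytes, offset: int = 0) -> int:
    """1-byte additive checksum over data[offset:], like the %SUMxx macro."""
    return sum(data[offset:]) & 0xFF

def xor8(data: bytes, offset: int = 0) -> int:
    """1-byte XOR checksum over data[offset:], like the %XORxx macro."""
    result = 0
    for b in data[offset:]:
        result ^= b
    return result

frame = b"\x01\x02\x03\x04"
print(hex(sum8(frame)))  # 1+2+3+4 = 10 -> 0xa
print(hex(xor8(frame)))  # 1^2^3^4 = 4 -> 0x4
```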
!!! If you have problems with the new version you can still download older releases at the bottom of the page !!!
FIX: ComSetDTR, ComSetRTS LED turn on/off bug.
FIX: lost chars problem (better, but not fixed).
flush buffers on disconnect to avoid "not responding".
hot key changes: DEL=RXClear, ESC=send code 27.
Tx char delay strategy changed (real zero delay if value=0).
parameter/argument to open script file at start up.
time stamp, scale factors and offsets for CSV graph.
offset for %SUM and %XOR macro commands.
%SCRS"script.tsc" and %SCRE commands for macros - check help.
auto scroll checkbox - to prevent auto scrolling.
4th/black graph and scale factors for CSV values.
CSV Graph - As a little 2012 New Year's Gift ).
scripting (with graph/visualization support).
24 custom transmit macros with auto repeat function.
baudrate up to 256kbps & custom baudrate.
without installation, only single and small.
paradisetechsoftsolutions · 5 years ago
What is RegEx? Regular Expression in Python & Meta Characters
What is RegEx?
A regular expression (regex, regexp, or re) is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expression patterns are assembled into a set of byte codes which are then executed by a matching engine written in C. Regular expressions are widely used in the world of UNIX.
Now let’s understand simple basic regular expression through the following image.
The caret sign (^) serves two purposes. Inside square brackets, as in this figure, it negates the set: the pattern matches characters that are not upper case, lower case, digits, underscore, or space — in short, it matches the special characters in the given string. Used outside the square brackets, the caret simply anchors the match to the start of the string.
An example like the one in the exercise is shown below — note that it only checks that the input contains a digit, a lowercase letter, and a non-word character, so it is a plausibility check rather than a proper email-matching regex:

import re

input_user = input("enter your email address: ")
m = re.match(r'(?=.*\d)(?=.*[a-z])(?=.*\W)', input_user)
if m:
    print("Email is valid:")
else:
    print("email is not valid:")
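For contrast, a more conventional (still simplified) email pattern — this is my own sketch, not a full RFC 5322 validator:

```python
import re

# A simplified email pattern: local part, '@', then one or more dot-separated
# domain labels. Real-world validation is far more involved than this.
EMAIL_RE = re.compile(r'^[\w.+-]+@[\w-]+(\.[\w-]+)+$')

for addr in ["user@example.com", "not-an-email"]:
    print(addr, "valid" if EMAIL_RE.match(addr) else "invalid")
```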
The most common usages of regular expressions are:
Search a string (search and match)
Finding a string (findall)
Break string into substrings (split)
Replace part of a string (sub)
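A quick tour of the four usages listed above, on a made-up string:

```python
import re

text = "cats scatter, dogs dig"

print(re.search(r'dog', text).group())   # first occurrence -> 'dog'
print(re.findall(r'\w+t\w*', text))      # words containing 't' -> ['cats', 'scatter']
print(re.split(r',\s*', text))           # ['cats scatter', 'dogs dig']
print(re.sub(r'dogs', 'birds', text))    # 'cats scatter, birds dig'
```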
'Re' Module
The module 're' gives full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.

The re module provides an interface to the regular expression engine that permits you to compile REs into objects and then perform matches with them. A regular expression is simply a sequence of characters that defines a search pattern. Python's built-in 're' module provides excellent support for regular expressions, with a modern and complete regex flavor.
Now, let’s understand everything about regular expressions and how they can be implemented in Python. The very first step is to import the 're' module, which provides all the necessary functionality to play with. It can be done with the following statement in any of the IDEs.
import re
Meta Characters
Metacharacters are characters — or sequences of characters — that hold a unique meaning in a computing application. These characters have special meaning, just like '*' in wildcards. Some sets of characters may be used to represent other characters, such as an unprintable character or a logical operation. They are also known as “operators” and are mostly used to build an expression that can represent a required character in a string or a file.
Below is the list of the metacharacters, and how to use such characters in the regular expression or regex like;
. ^ $ * + ? { } [ ] \ | ( )
Initially, the metacharacters we are going to explain are [ and ]. It’s used for specifying the class of the character which is a set of characters that you wish to match.
Characters can be listed individually here, or the range of characters can be indicated by giving two characters and separating them by a '-'. For instance, [abc] will match any of the characters a, b, or c; we can say in another way to express the same set of characters i.e. [a-c]. If you wanted to match only lowercase letters, your RE would be [a-z].
Let’s understand what these characters illuminate:
Here, [abc] will match if the string you are trying to match contains any of the a, b or c.
You can also specify a range of characters using - inside square brackets.
[a-e] is the same as [abcde].
[1-4] is the same as [1234].
[0-9] is the same as [0123456789]
You can complement (invert) the character set by using the caret ^ symbol at the start of a square-bracket.
[^abc] means any character except a or b or c.
[^0-9] means any non-digit character.
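The character-class rules above can be checked directly (the sample strings are my own):

```python
import re

print(re.findall('[abc]', "cabbage"))   # any of a, b, c -> ['c', 'a', 'b', 'b', 'a']
print(re.findall('[a-e]', "hedge"))     # range a-e      -> ['e', 'd', 'e']
print(re.findall('[^0-9]', "a1b2"))     # complement     -> ['a', 'b']
print(re.findall('[0-9]', "a1b2"))      # digits only    -> ['1', '2']
```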
The basic usages of commonly used metacharacters are shown in the following table:  
For example, \$a matches if a string contains $ followed by a. Here, $ is not interpreted by the RegEx engine in a special way.
s = re.search('\w+$','789Welcome67 to python') Output: 'python'
\ is used to match literally a character that has special meaning: for example, '\.' matches '.', '\+' matches '+', etc.

Regex also recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage return, \nnn for an up-to-3-digit octal number, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode code point, and \Uhhhhhhhh for an 8-digit one.
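A few examples of escaping metacharacters so they match literally (the sample strings are my own):

```python
import re

# '$', '.', and '+' are metacharacters; a backslash makes them literal.
print(re.search(r'\$\d+', 'total: $42').group())    # '$42'
print(re.findall(r'\d+\.\d+', 'pi=3.14, e=2.71'))   # ['3.14', '2.71']
print(re.search(r'3\+4', '3+4=7').group())          # '3+4'
```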
The following code example will show you the regex '.' function:
s = re.match('........[a-zA-Z0-9]','Welcome to python') Output: 'Welcome t'
Other Special Sequences
There are some Special sequences that make commonly used patterns easier to write. Below is a list of such special sequences:
Understanding special sequences with examples
\A - Matches if the specified characters are at the start of a string.
s = re.search('\A\d','789Welcome67 to python') Output: '7'
\b - Matches if the specified characters are at the beginning or end of a word.
a = re.findall(r'\baa\b', "bbb aa \\bash\baaa") Output: ['aa']
\B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.
a = re.findall(r'\Bain', "The rain in Spain") Output: ['ain', 'ain']
\d - Matches any decimal digit. Equivalent to [0-9]
a = re.match('\d','1Welco+me to python11') Output: '1'
\D - Matches any non-decimal digit. Equivalent to [^0-9]
a = re.match('\D','Wel12co+me to python11') Output: 'W'
\s - Matches where a string contains any white space character. Equivalent to [ \t\n\r\f\v].
a = re.match('\s',' Wel12co+me to python11') Output: ' '
\S - Matches where a string contains any non-white space character. Equivalent to [^ \t\n\r\f\v].
a = re.match('\S','W el12co+me to python11') Output: 'W'
\w - Matches any alphanumeric character (digits and alphabets). Equivalent to [a-zA-Z0-9_]. By the way, underscore _ is also considered an alphanumeric character.
a = re.match('[\w]','1Welco+me to python11') Output: '1'
\W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]
s = re.match('[\W]','@@Welcome to python') Output: '@'
\Z - Matches if the specified characters are at the end of a string.
s = re.search('\w\Z','789Welcome67 to python') Output: 'n'
Module-Level Functions

The 're' module provides many top-level functions; the most commonly used are match(), search(), findall(), sub(), split(), and compile().

These functions take a regular expression pattern as the first argument and the string the regex is applied to as the second. They return either None or a match object instance, and they store the compiled object in a cache so that future calls using the same regular expression avoid parsing the pattern again and again.
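The caching behaviour described above can be made explicit by compiling the pattern once yourself; a small sketch:

```python
import re

# Compiling once and reusing the pattern object makes the reuse explicit;
# the module-level functions achieve the same effect via an internal cache.
word = re.compile(r'\w+')

for line in ["first line", "second"]:
    print(word.findall(line))
```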
We will explain some of these functions in the below section.
1. re.match() - The match() function is used to match the beginning of the string. In the following example, the match() function will match the first letter of the given string whether it is a digit, lowercase or uppercase letter (underscores included).
a = re.match('[0-9_a-zA-Z-]','Welcome to programming') Output: 'W'
If we add ‘+' outside the character set, it will check for the repeatability of the given characters in 'RE'. In the following example, '+' checks about one or more repetitions of uppercase, lowercase, and digits (underscore included, white spaces excluded).
a = re.match('[_0-9A-Za-z-]+','Welcome to programming') Output: 'Welcome'
'*' is a quantifier that is responsible for matching the regex preceding it 0 or more times. In short, we can say it matches any character zero or more times. Let's understand via the below given example. In the given string ('Welcome to programming'), '*' will match for characters given in the regex as long as possible.
a = re.match('[_A-Z0-9a-z-]*','Welcome to programming') Output: 'Welcome'
If we add '*' inside the character set, it loses its quantifier meaning and is treated as a literal '*' character — just another member of the class. The class still matches exactly one character, so in the following example the result is 'W'.
a = re.match('[_A-Z0-9a-z-*]','Welcome to programming') Output: 'W'
Using quantifier '?' matches zero or one of whatever precedes it. In the following example '?' matches uppercase or lowercase characters including underscore as well at the beginning of the string.
a = re.match('[_A-Za-z-]?','Welcome to programming') Output: 'W'
The 're' module offers a set of functions that allow you to search a string for a match. Let’s understand what these functions do.

2. re.search() - It is mainly used to search for a pattern in a text. The function re.search() takes a regex pattern and a string and searches for that particular pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. The syntax of re.search() is as follows:
a = re.search(pattern, string)
You can better understand the following example.
a = re.search('come', 'welcome to programming') Output: <_sre.SRE_Match object; span=(3, 7), match='come'>
3. re.findall() - Returns a list containing all matches. The function re.findall() is used when you want to iterate over the lines of a file or a string; it returns a list of all the matches in a single step. The string is scanned left to right, and matches are returned in the order found. The syntax of re.findall() is as follows:
a = re.findall(pattern, string)
Below is an example of the re.findall() function.
a = re.findall('prog','welcome to programming') Output: ['prog']
4. re.split() - Returns a list where the string has been split at each match; i.e., it splits the string by the occurrences of the pattern. The syntax of re.split() is given below:
a = re.split(pattern, string)
Look at the following examples of the re.split() function:
a = re.split('[\W]+','welcome to programming') Output: ['welcome', 'to', 'programming']
a = re.split('\s','Hello how are you') Output: ['Hello', 'how', 'are', 'you']
b = re.split('\d','hello1i am fine') Output: ['hello', 'i am fine']
5. re.sub() - It replaces one or more matches with a string. It is used to replace substrings: the matches in the string are replaced with the replacement value. The syntax of re.sub() is as follows:
a = re.sub(pattern, replacing value, string)
The following example replaces all the digits in the given string with an empty string.
m = re.sub('[0-9]','','Welcome to python1234. Coding3456.') Output: 'Welcome to python. Coding.'
6. re.compile() - We can compile a pattern into a pattern object with the help of the function re.compile(); the object provides various methods for operations such as searching for pattern matches or performing string substitutions.
In the following example, the compile function compiles the regex function mentioned and then the code asks the user to enter a name. If the user types/inputs any digit or other special characters, the compile results won't match and it will again ask the user for input. It will continue doing this unless and until the user inputs a name containing characters only.
name_check = re.compile(r"[^A-Za-z\s.]")
name = input("Please, enter your name: ")
while name_check.search(name):
    print("please enter your name correctly!")
    name = input("Please, enter your name: ")
The output of the above code is as follows:
Please, enter your name: 1234
please enter your name correctly!
Please, enter your name: 5678
please enter your name correctly!
Please, enter your name: john
Wrapping Up
Now that we have a rough understanding of what RegEx is and how it works in Python, we can move on to something more technical. It's time to get a small project up and running.
notsadrobotxyz · 6 years ago
DBA Daily/Weekly/Monthly or Quarterly Checklist
In response to requests from some fresher DBAs, here is a quick checklist for a production DBA. I am including references to some of the scripts I have already posted; as you know, each DBA has their own scripts depending on the database environment. Please have a look at the daily, weekly, and quarterly checklists.

Note: I am not responsible if any of these scripts harms your database, so before using one directly on a production DB, check it in a test environment first. Please send your corrections, suggestions, and feedback to me. I may credit your contribution. Thank you.
------------------------------------------------------------------------------------------------------------------------
Daily Checks:
Verify all databases, instances, and listeners are up, every 30 min.
Verify the status of daily scheduled jobs/daily backups in the very first hour of the morning.
Verify the success of archive log backups, based on the backup interval.
Check the space usage of the archive log file system for both primary and standby DB.
Check the space usage and verify all tablespace usage is below the critical level, once a day.
Verify rollback segments.
Check database performance periodically, usually in the first hour of the morning after the scheduled night backup has completed.
Check the sync between the primary database and the standby database, every 20 min.
Make a habit of checking the new alert.log entries hourly, especially if you are getting errors.
Check the system performance periodically.
Check for invalid objects.
Check the audit files for any suspicious activities.
Identify bad growth projections.
Clear the trace files in the udump and bdump directories as per policy.
Verify all monitoring agents, including the OEM agent and third-party monitoring agents.
Make a habit of reading the DBA manual.

Weekly Checks:
Perform a level 0 or cold backup as per the backup policy. Note the backup policy can be changed as per requirements.
Don’t forget to check the space on disk or tape before performing a level 0 or cold backup.
Perform export backups of important tables.
Check the database statistics collection. On some databases this needs to be done every day, depending upon the requirement.
Approve or plan any scheduled changes for the week.
Verify the scheduled jobs and clear the output directory. You can also automate it.
Look for objects that break rules.
Look for security policy violations.
Archive the alert logs (if possible) to reference similar kinds of errors in the future.
Visit the home pages of key vendors.

Monthly or Quarterly Checks:
Verify the accuracy of backups by creating test databases.
Check for critical patch updates from Oracle and make sure that your systems are in compliance with CPU patches.
Check out the harmful growth rate.
Review fragmentation.
Look for I/O contention.
Perform tuning and database maintenance.
Verify the accuracy of the DR mechanism by performing a database switch-over test. This can be done once in six months, based on the business requirements.
-------------------------------------------------------------------------------------------------------------------------------------------------------
Below is a brief description of some of the important concepts, including important SQL scripts. You can find more scripts in my other posts by using the blog search option.

Verify all instances are up: Make sure the database is available. Log into each instance and run daily reports or test scripts. You can also automate this procedure, but it is better to do it manually. Optional implementation: use Oracle Enterprise Manager's 'probe' event.

Verify DBSNMP is running: Log on to each managed machine to check for the 'dbsnmp' process. For Unix: at the command line, type ps -ef | grep dbsnmp. There should be two dbsnmp processes running.
If not, restart DBSNMP.

Verify success of daily scheduled jobs: Each morning one of your prime tasks is to check the backup log and the backup drive where your actual backup is stored, to verify the night backup.

Verify success of database archiving to tape or disk: Next, check the location where the daily archives are stored, and verify the archive backup on disk or tape.

Verify enough resources for acceptable performance: For each instance, verify that enough free space exists in each tablespace to handle the day’s expected growth. As of , the minimum free space for : . When incoming data is stable, and average daily growth can be calculated, then the minimum free space should be at least days’ data growth. Go to each instance and run a query to check free MB in tablespaces/datafiles. Compare to the minimum free MB for that tablespace; note any low-space conditions and correct them.

Verify rollback segments: Status should be ONLINE, not OFFLINE or FULL, except in some cases where you may have a special rollback segment for large batch jobs whose normal status is OFFLINE. Optional: each database may have a list of rollback segment names and their expected statuses. For the current status of each ONLINE or FULL rollback segment (by ID, not by name), query V$ROLLSTAT. For storage parameters and names of ALL rollback segments, query DBA_ROLLBACK_SEGS. That view’s STATUS field is less accurate than V$ROLLSTAT, however, as it lacks the PENDING OFFLINE and FULL statuses, showing these as OFFLINE and ONLINE respectively.

Look for any new alert log entries: Connect to each managed system, using 'telnet' or a comparable program. For each managed instance, go to the background dump destination, usually $ORACLE_BASE//bdump. Make sure to look under each managed database's SID. At the prompt, use the Unix ‘tail’ command to see the alert_.log, or otherwise examine the most recent entries in the file.
If any ORA- errors have appeared since the previous time you looked, note them in the Database Recovery Log and investigate each one. The recovery log is in .

Identify bad growth projections: Look for segments in the database that are running out of resources (e.g. extents) or growing at an excessive rate. The storage parameters of these segments may need to be adjusted. For example, if any object reaches 200 as its number of current extents, upgrade max_extents to unlimited. For that, run a query to gather daily sizing information: check current extents, current table sizing information, and current index sizing information, and find growth trends.

Identify space-bound objects: Space-bound objects’ next_extents are bigger than the largest extent that the tablespace can offer. Space-bound objects can harm database operation. If we find such an object, first investigate the situation; then we can use ALTER TABLESPACE COALESCE, or add another datafile. Run spacebound.sql. If all is well, zero rows will be returned.

Processes to review contention for CPU, memory, network or disk resources: To check CPU utilization, go to =>system metrics=>CPU utilization page. 400 is the maximum CPU utilization because there are 4 CPUs on the phxdev and phxprd machines. We need to investigate if CPU utilization stays above 350 for a while.

Make a habit of reading the DBA manual: Nothing is more valuable in the long run than a DBA who is as widely experienced, and as widely read, as possible. Readings should include DBA manuals, trade journals, and possibly newsgroups or mailing lists.

Look for objects that break rules: For each object-creation policy (naming convention, storage parameters, etc.) have an automated check to verify that the policy is being followed. Every object in a given tablespace should have the exact same size for NEXT_EXTENT, which should match the tablespace default for NEXT_EXTENT.
As of 10/03/2012, the default NEXT_EXTENT for DATAHI is 1 gig (1048576 bytes), DATALO is 500 mb (524288 bytes), and INDEXES is 256 mb (262144 bytes). To check settings for NEXT_EXTENT, run nextext.sql. To check existing extents, run existext.sql.

All tables should have unique primary keys: To check for missing PKs, run no_pk.sql. To check for disabled PKs, run disPK.sql. All primary key indexes should be unique; run nonuPK.sql to check. All indexes should use the INDEXES tablespace; run mkrebuild_idx.sql. Schemas should look identical between environments, especially test and production. To check data type consistency, run datatype.sql. To check other object consistency, run obj_coord.sql.

Look for security policy violations: Look in the SQL*Net logs for errors and issues (client-side logs, server-side logs), and archive all alert logs to history.

Visit home pages of key vendors: For new update information, make a habit of visiting the home pages of key vendors, such as: Oracle Corporation: http://www.oracle.com, http://technet.oracle.com, http://www.oracle.com/support, http://www.oramag.com; Quest Software: http://www.quests.com; Sun Microsystems: http://www.sun.com

Look for harmful growth rates: Review changes in segment growth compared to previous reports to identify segments with a harmful growth rate.

Review tuning opportunities and perform tuning maintenance: Review common Oracle tuning points such as cache hit ratio, latch contention, and other points dealing with memory management. Compare with past reports to identify harmful trends or determine the impact of recent tuning adjustments. Make the adjustments necessary to avoid contention for system resources. This may include scheduled downtime or a request for additional resources.

Look for I/O contention: Review database file activity. Compare to past output to identify trends that could lead to possible contention.

Review fragmentation: Investigate fragmentation (e.g.
row chaining, etc.).

Project performance into the future: Compare reports on CPU, memory, network, and disk utilization from both Oracle and the operating system to identify trends that could lead to contention for any one of these resources in the near future. Compare performance trends to the Service Level Agreement to see when the system will go out of bounds.
--------------------------------------------------------------------------------------------
Useful Scripts:
--------------------------------------------------------------------------------------------
Script: To check free, pct_free, and allocated space within a tablespace

SELECT tablespace_name, largest_free_chunk, nr_free_chunks, sum_alloc_blocks, sum_free_blocks,
       to_char(100*sum_free_blocks/sum_alloc_blocks, '09.99') || '%' AS pct_free
FROM ( SELECT tablespace_name, sum(blocks) AS sum_alloc_blocks
       FROM dba_data_files GROUP BY tablespace_name),
     ( SELECT tablespace_name AS fs_ts_name, max(blocks) AS largest_free_chunk,
              count(blocks) AS nr_free_chunks, sum(blocks) AS sum_free_blocks
       FROM dba_free_space GROUP BY tablespace_name )
WHERE tablespace_name = fs_ts_name;

Script: To analyze tables and indexes

BEGIN
  dbms_utility.analyze_schema ( '&OWNER', 'ESTIMATE', NULL, 5 );
END;

Script: To find out any object reaching its extents limit

SELECT e.owner, e.segment_type, e.segment_name, count(*) as nr_extents, s.max_extents,
       to_char ( sum ( e.bytes ) / ( 1024 * 1024 ), '999,999.90') as MB
FROM dba_extents e, dba_segments s
WHERE e.segment_name = s.segment_name
GROUP BY e.owner, e.segment_type, e.segment_name, s.max_extents
HAVING count(*) > &THRESHOLD
    OR ( ( s.max_extents - count(*) ) < &&THRESHOLD )
ORDER BY count(*) desc;

The above query will find any object reaching its extent limit; you then have to manually upgrade it to allow unlimited max_extents (thus only objects we expect to be big are allowed to become big).

Script: To identify space-bound objects. If all is well, no rows are returned.
SELECT a.table_name, a.next_extent, a.tablespace_name
FROM all_tables a,
     ( SELECT tablespace_name, max(bytes) as big_chunk
       FROM dba_free_space GROUP BY tablespace_name ) f
WHERE f.tablespace_name = a.tablespace_name
  AND a.next_extent > f.big_chunk;

Run the above query to find space-bound objects. If all is well, no rows are returned; if something is found, look at the value of next_extent and check to find out what happened, then use coalesce (alter tablespace coalesce;) and, finally, add another datafile to the tablespace if needed.

Script: To find tables that don't match the tablespace default for NEXT extent

SELECT segment_name, segment_type, ds.next_extent as Actual_Next,
       dt.tablespace_name, dt.next_extent as Default_Next
FROM dba_tablespaces dt, dba_segments ds
WHERE dt.tablespace_name = ds.tablespace_name
  AND dt.next_extent != ds.next_extent
  AND ds.owner = UPPER ( '&OWNER' )
ORDER BY tablespace_name, segment_type, segment_name;

Script: To check existing extents

SELECT segment_name, segment_type, count(*) as nr_exts,
       sum ( DECODE ( dx.bytes, dt.next_extent, 0, 1 ) ) as nr_illsized_exts,
       dt.tablespace_name, dt.next_extent as dflt_ext_size
FROM dba_tablespaces dt, dba_extents dx
WHERE dt.tablespace_name = dx.tablespace_name
  AND dx.owner = '&OWNER'
GROUP BY segment_name, segment_type, dt.tablespace_name, dt.next_extent;

The above query shows how many of each object's extents differ in size from the tablespace's default size. If it shows a lot of differently sized extents, your free space is likely to become fragmented; if so, you need to reorganize this tablespace.
Script: To find tables without a PK constraint

SELECT table_name
FROM   all_tables
WHERE  owner = '&OWNER'
MINUS
SELECT table_name
FROM   all_constraints
WHERE  owner = '&&OWNER'
AND    constraint_type = 'P';

Script: To find out which primary keys are disabled

SELECT owner, constraint_name, table_name, status
FROM   all_constraints
WHERE  owner = '&OWNER'
AND    status = 'DISABLED'
AND    constraint_type = 'P';

Script: To find tables with non-unique PK indexes

SELECT index_name, table_name, uniqueness
FROM   all_indexes
WHERE  index_name LIKE '&PKNAME%'
AND    owner = '&OWNER'
AND    uniqueness = 'NONUNIQUE';

SELECT c.constraint_name, i.tablespace_name, i.uniqueness
FROM   all_constraints c, all_indexes i
WHERE  c.owner = UPPER ( '&OWNER' )
AND    i.uniqueness = 'NONUNIQUE'
AND    c.constraint_type = 'P'
AND    i.index_name = c.constraint_name;

Script: To check datatype consistency between two environments

SELECT table_name, column_name, data_type, data_length, data_precision, data_scale, nullable
FROM   all_tab_columns  -- first environment
WHERE  owner = '&OWNER'
MINUS
SELECT table_name, column_name, data_type, data_length, data_precision, data_scale, nullable
FROM   all_tab_columns@&my_db_link  -- second environment
WHERE  owner = '&OWNER2'
ORDER  BY table_name, column_name;

Script: To find any difference in objects between two instances

SELECT object_name, object_type
FROM   user_objects
MINUS
SELECT object_name, object_type
FROM   user_objects@&my_db_link;

For more scripts and daily DBA tasks or monitoring, use the search feature to check my other posts. Follow the link below for an important monitoring script: http://shahiddba.blogspot.com/2012/04/oracle-dba-daily-checklist.html
itbeatsbookmarks · 6 years ago
(Via: Hacker News)
The vchan protocol is used to stream data between virtual machines on a Xen host without needing any locks. It is largely undocumented. The TLA Toolbox is a set of tools for writing and checking specifications. In this post, I’ll describe my experiences using these tools to understand how the vchan protocol works.
(This post also appeared on Reddit.)
Background
Qubes and the vchan protocol
I run QubesOS on my laptop. A QubesOS desktop environment is made up of multiple virtual machines. A privileged VM, called dom0, provides the desktop environment and coordinates the other VMs. dom0 doesn’t have network access, so you have to use other VMs for doing actual work. For example, I use one VM for email and another for development work (these are called “application VMs”). There is another VM (called sys-net) that connects to the physical network, and yet another VM (sys-firewall) that connects the application VMs to sys-net.
My QubesOS desktop. The windows with blue borders are from my Debian development VM, while the green one is from a Fedora VM, etc.
The default sys-firewall is based on Fedora Linux. A few years ago, I replaced sys-firewall with a MirageOS unikernel. MirageOS is written in OCaml, and has very little C code (unlike Linux). It boots much faster and uses much less RAM than the Fedora-based VM. But recently, a user reported that restarting mirage-firewall was taking a very long time. The problem seemed to be that it was taking several minutes to transfer the information about the network configuration to the firewall. This is sent over vchan. The user reported that stracing the QubesDB process in dom0 revealed that it was sleeping for 10 seconds between sending the records, suggesting that a wakeup event was missing.
The lead developer of QubesOS said:
I’d guess missing evtchn trigger after reading/writing data in vchan.
Perhaps ocaml-vchan, the OCaml implementation of vchan, wasn’t implementing the vchan specification correctly? I wanted to check, but there was a problem: there was no vchan specification.
The Xen wiki lists vchan under Xen Document Days/TODO. The initial Git commit on 2011-10-06 said:
libvchan: interdomain communications library
This library implements a bidirectional communication interface between applications in different domains, similar to unix sockets. Data can be sent using the byte-oriented libvchan_read/libvchan_write or the packet-oriented libvchan_recv/libvchan_send.
Channel setup is done using a client-server model; domain IDs and a port number must be negotiated prior to initialization. The server allocates memory for the shared pages and determines the sizes of the communication rings (which may span multiple pages, although the default places rings and control within a single page).
With properly sized rings, testing has shown that this interface provides speed comparable to pipes within a single Linux domain; it is significantly faster than network-based communication.
I looked in the xen-devel mailing list around this period in case the reviewers had asked about how it worked.
One reviewer suggested:
Please could you say a few words about the functionality this new library enables and perhaps the design etc? In particular a protocol spec would be useful for anyone who wanted to reimplement for another guest OS etc. […] I think it would be appropriate to add protocol.txt at the same time as checking in the library.
However, the submitter pointed out that this was unnecessary, saying:
The comments in the shared header file explain the layout of the shared memory regions; any other parts of the protocol are application-defined.
Now, ordinarily, I wouldn’t be much interested in spending my free time tracking down race conditions in 3rd-party libraries for the benefit of strangers on the Internet. However, I did want to have another play with TLA…
TLA+
TLA+ is a language for specifying algorithms. It can be used for many things, but it is particularly designed for stateful parallel algorithms.
I learned about TLA while working at Docker. Docker EE provides software for managing large clusters of machines. It includes various orchestrators (SwarmKit, Kubernetes and Swarm Classic) and a web UI. Ensuring that everything works properly is very important, and to this end a large collection of tests had been produced. Part of my job was to run these tests. You take a test from a list in a web UI and click whatever buttons it tells you to click, wait for some period of time, and then check that what you see matches what the test says you should see. There were a lot of these tests, and they all had to be repeated on every supported platform, and for every release, release candidate or preview release. There was a lot of waiting involved and not much thinking required, so to keep my mind occupied, I started reading the TLA documentation.
I read The TLA+ Hyperbook and Specifying Systems. Both are by Leslie Lamport (the creator of TLA), and are freely available online. They’re both very easy to read. The hyperbook introduces the tools right away so you can start playing, while Specifying Systems starts with more theory and discusses the tools later. I think it’s worth reading both.
Once Docker EE 2.0 was released, we engineers were allowed to spend a week on whatever fun (Docker-related) project we wanted. I used the time to read the SwarmKit design documents and make a TLA model of that. I felt that using TLA prompted useful discussions with the SwarmKit developers (which can be seen in the pull request comments).
A specification document can answer questions such as:
What does it do? (requirements / properties)
How does it do it? (the algorithm)
Does it work? (model checking)
Why does it work? (inductive invariant)
Does it really work? (proofs)
You don’t have to answer all of them to have a useful document, but I will try to answer each of them for vchan.
Is TLA useful?
In my (limited) experience with TLA, whenever I have reached the end of a specification (whether reading it or writing it), I always find myself thinking “Well, that was obvious. It hardly seems worth writing a spec for that!”. You might feel the same after reading this blog post.
To judge whether TLA is useful, I suggest you take a few minutes to look at the code. If you are good at reading C code then you might find, like the Xen reviewers, that it is quite obvious what it does, how it works, and why it is correct. Or, like me, you might find you’d prefer a little help. You might want to jot down some notes about it now, to see whether you learn anything new.
To give the big picture:
Two VMs decide to communicate over vchan. One will be the server and the other the client.
The server allocates three chunks of memory: one to hold data in transit from the client to the server, one for data going from server to client, and the third to track information about the state of the system. This includes counters saying how much data has been written and how much read, in each direction.
The server tells Xen to grant the client access to this memory.
The client asks Xen to map the memory into its address space. Now client and server can both access it at once. There are no locks in the protocol, so be careful!
Either end sends data by writing it into the appropriate buffer and updating the appropriate counter in the shared block. The buffers are ring buffers, so after getting to the end, you start again from the beginning.
The data-written (producer) counter and the data-read (consumer) counter together tell you how much data is in the buffer, and where it is. When the difference is zero, the reader must stop reading and wait for more data. When the difference is the size of the buffer, the writer must stop writing and wait for more space.
When one end is waiting, the other can signal it using a Xen event channel. This essentially sets a pending flag to true at the other end, and wakes the VM if it is sleeping. If a VM tries to sleep while it has an event pending, it will immediately wake up again. Sending an event when one is already pending has no effect.
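The counter arithmetic in the last two steps can be sketched in Python. This is an illustrative toy, not code from libxenvchan: the power-of-two ring size, the masking, and the free-running 32-bit counters that are never reset reflect my reading of the description above.

```python
# Sketch of the producer/consumer counter arithmetic described above.
# Assumptions (mine, for illustration): a power-of-two ring size and
# free-running 32-bit counters that only ever increase (mod 2^32).
RING_SIZE = 8
MASK = RING_SIZE - 1

def data_available(prod, cons):
    # Difference of the counters = bytes in transit, even after the
    # 32-bit counters wrap around.
    return (prod - cons) & 0xFFFFFFFF

def space_available(prod, cons):
    return RING_SIZE - data_available(prod, cons)

def ring_write(ring, prod, cons, data):
    n = min(len(data), space_available(prod, cons))
    for i in range(n):
        ring[(prod + i) & MASK] = data[i]
    return prod + n        # the writer advances only the producer counter

def ring_read(ring, prod, cons, want):
    n = min(want, data_available(prod, cons))
    out = [ring[(cons + i) & MASK] for i in range(n)]
    return out, cons + n   # the reader advances only the consumer counter
```

When data_available is zero the reader must wait, and when it equals the ring size the writer must wait, exactly as in the description above.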
The public/io/libxenvchan.h header file provides some information, including the shared structures and comments about them:
xen/include/public/io/libxenvchan.h
struct ring_shared {
    uint32_t cons, prod;
};

#define VCHAN_NOTIFY_WRITE 0x1
#define VCHAN_NOTIFY_READ  0x2

/**
 * vchan_interface: primary shared data structure
 */
struct vchan_interface {
    /**
     * Standard consumer/producer interface, one pair per buffer
     * left is client write, server read
     * right is client read, server write
     */
    struct ring_shared left, right;
    /**
     * size of the rings, which determines their location
     * 10   - at offset 1024 in ring's page
     * 11   - at offset 2048 in ring's page
     * 12+  - uses 2^(N-12) grants to describe the multi-page ring
     * These should remain constant once the page is shared.
     * Only one of the two orders can be 10 (or 11).
     */
    uint16_t left_order, right_order;
    /**
     * Shutdown detection:
     *  0: client (or server) has exited
     *  1: client (or server) is connected
     *  2: client has not yet connected
     */
    uint8_t cli_live, srv_live;
    /**
     * Notification bits:
     *  VCHAN_NOTIFY_WRITE: send notify when data is written
     *  VCHAN_NOTIFY_READ: send notify when data is read (consumed)
     *  cli_notify is used for the client to inform the server of its action
     */
    uint8_t cli_notify, srv_notify;
    /**
     * Grant list: ordering is left, right. Must not extend into actual ring
     * or grow beyond the end of the initial shared page.
     * These should remain constant once the page is shared, to allow
     * for possible remapping by a client that restarts.
     */
    uint32_t grants[0];
};
You might also like to look at the vchan source code. Note that the libxenvchan.h file in this directory includes and extends the above header file (with the same name).
For this blog post, we will ignore the Xen-specific business of sharing the memory and telling the client where it is, and assume that the client has mapped the memory and is ready to go.
Basic TLA concepts
We’ll take a first look at TLA concepts and notation using a simplified version of vchan. TLA comes with excellent documentation, so I won’t try to make this a full tutorial, but hopefully you will be able to follow the rest of this blog post after reading it. We will just consider a single direction of the channel (e.g. client-to-server) here.
Variables, states and behaviour
A variable in TLA is just what a programmer expects: something that changes over time. For example, I’ll use Buffer to represent the data currently being transmitted.
We can also add variables that are just useful for the specification. I use Sent to represent everything the sender-side application asked the vchan library to transmit, and Got for everything the receiving application has received:
VARIABLES Got, Buffer, Sent
A state in TLA represents a snapshot of the world at some point. It gives a value for each variable. For example, { Got: "H", Buffer: "i", Sent: "Hi", ... } is a state. The ... is just a reminder that a state also includes everything else in the world, not just the variables we care about.
Here are some more states:
State   Got    Buffer   Sent
s0      ""     ""       ""
s1      ""     "H"      "H"
s2      "H"    ""       "H"
s3      "H"    "i"      "Hi"
s4      "Hi"   ""       "Hi"
s5      "iH"   ""       "Hi"
A behaviour is a sequence of states, representing some possible history of the world. For example, << s0, s1, s2, s3, s4 >> is a behaviour. So is << s0, s1, s5 >>, but not one we want. The basic idea in TLA is to specify precisely which behaviours we want and which we don’t want.
A state expression is an expression that can be evaluated in the context of some state. For example, this defines Integrity to be a state expression that is true whenever what we have got so far matches what we wanted to send:
(* Take(m, i) is just the first i elements of message m. *)
Take(m, i) == SubSeq(m, 1, i)

(* Everything except the first i elements of message m. *)
Drop(m, i) == SubSeq(m, i + 1, Len(m))

Integrity ==
  Take(Sent, Len(Got)) = Got
Integrity is true for all the states above except for s5. I added some helper operators Take and Drop here. Sequences in TLA+ can be confusing because they are indexed from 1 rather than from 0, so it is easy to make off-by-one errors. These operators just use lengths, which we can all agree on. In Python syntax, it would be written something like:
def Integrity(s):
    return s.Sent.startswith(s.Got)
A temporal formula is an expression that is evaluated in the context of a complete behaviour. It can use the temporal operators, which include:
[] (that’s supposed to look like a square) : “always”
<> (that’s supposed to look like a diamond) : “eventually”
[] F is true if the expression F is true at every point in the behaviour. <> F is true if the expression F is true at any point in the behaviour.
Messages we send should eventually arrive. Here’s one way to express that:
Availability ==
  \A x \in Nat :
    [] (Len(Sent) = x => <> (Len(Got) >= x) )
TLA syntax is a bit odd. It’s rather like LaTeX (which is not surprising: Lamport is also the “La” in LaTeX). \A means “for all” (rendered as an upside-down A). So this says that for every number x, it is always true that if we have sent x bytes then eventually we will have received at least x bytes.
This pattern of [] (F => <>G) is common enough that it has a shorter notation of F ~> G, which is read as “F (always) leads to G”. So, Availability can also be written as:
Availability ==
  \A x \in Nat :
    Len(Sent) = x ~> Len(Got) >= x
We’re only checking the lengths in Availability, but combined with Integrity that’s enough to ensure that we eventually receive what we want. So ideally, we’d like to ensure that every possible behaviour of the vchan library will satisfy the temporal formula Properties:
Properties ==
  Availability /\ []Integrity
That /\ is “and” by the way, and \/ is “or”. I did eventually start to be able to tell one from the other, though I still think && and || would be easier. In case I forget to explain some syntax, A Summary of TLA lists most of it.
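For a finite behaviour represented as a Python list of states, the operators above can be sketched like this (hypothetical helper names; TLA defines these over infinite behaviours, so this is only an approximation for illustration):

```python
# Finite-behaviour sketches of the temporal operators [] , <> and ~>.
def always(behaviour, f):
    return all(f(s) for s in behaviour)

def eventually(behaviour, f):
    return any(f(s) for s in behaviour)

def leads_to(behaviour, f, g):
    # F ~> G: whenever f holds at step i, g holds at some step j >= i.
    return all(eventually(behaviour[i:], g)
               for i, s in enumerate(behaviour) if f(s))

# States from the table earlier, as (Got, Buffer, Sent) tuples:
s0, s1, s2 = ("", "", ""), ("", "H", "H"), ("H", "", "H")
s3, s4 = ("H", "i", "Hi"), ("Hi", "", "Hi")
b = [s0, s1, s2, s3, s4]

integrity = lambda s: s[2].startswith(s[0])
assert always(b, integrity)
assert leads_to(b, lambda s: len(s[2]) == 2,   # "sent 2 bytes ..."
                   lambda s: len(s[0]) >= 2)   # "... leads to got 2 bytes"
```

Truncating the behaviour before s4 makes the leads_to check fail, which matches the intuition that Availability is about what eventually happens, not any single state.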
Actions
It is hopefully easy to see that Properties defines properties we want. A user of vchan would be happy to see that these are things they can rely on. But they don’t provide much help to someone trying to implement vchan. For that, TLA provides another way to specify behaviours.
An action in TLA is an expression that is evaluated in the context of a pair of states, representing a single atomic step of the system. For example:
Read ==
  /\ Len(Buffer) > 0
  /\ Got' = Got \o Buffer
  /\ Buffer' = << >>
  /\ UNCHANGED Sent
The Read action is true of a step if that step transfers all the data from Buffer to Got. Unprimed variables (e.g. Buffer) refer to the current state and primed ones (e.g. Buffer') refer to the next state. There’s some more strange notation here too:
We’re using /\ to form a bulleted list here rather than as an infix operator. This is indentation-sensitive. TLA also supports \/ lists in the same way.
\o is sequence concatenation (+ in Python).
<< >> is the empty sequence ([ ] in Python).
UNCHANGED Sent means Sent' = Sent.
In Python, it might look like this:
def Read(current, next):
    return Len(current.Buffer) > 0 \
       and next.Got == current.Got + current.Buffer \
       and next.Buffer == [] \
       and next.Sent == current.Sent
Actions correspond more closely to code than temporal formulas, because they only talk about how the next state is related to the current one.
This action only allows one thing: reading the whole buffer at once. In the C implementation of vchan the receiving application can provide a buffer of any size and the library will read at most enough bytes to fill the buffer. To model that, we will need a slightly more flexible version:
Read ==
  \E n \in 1..Len(Buffer) :
    /\ Got' = Got \o Take(Buffer, n)
    /\ Buffer' = Drop(Buffer, n)
    /\ UNCHANGED Sent
This says that a step is a Read step if there is any n (in the range 1 to the length of the buffer) such that we transferred n bytes from the buffer. \E means “there exists …”.
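In the Python style used earlier, this flexible Read corresponds to a set of possible successor states rather than a single one (an illustrative sketch only):

```python
# The flexible Read above, for a state (Got, Buffer, Sent):
# there is one possible successor state per choice of n.
def read_steps(state):
    got, buf, sent = state
    return [(got + buf[:n], buf[n:], sent)
            for n in range(1, len(buf) + 1)]

print(read_steps(("", "Hi", "Hi")))
# [('H', 'i', 'Hi'), ('Hi', '', 'Hi')]
```

An empty buffer yields no successors at all, matching the condition Len(Buffer) > 0.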
A Write action can be defined in a similar way:
CONSTANT BufferSize

Byte == 0..255

Write ==
  \E m \in Seq(Byte) \ {<< >>} :
    /\ Buffer' = Buffer \o m
    /\ Len(Buffer') <= BufferSize
    /\ Sent' = Sent \o m
    /\ UNCHANGED Got
A CONSTANT defines a parameter (input) of the specification (it’s constant in the sense that it doesn’t change between states). A Write operation adds some message m to the buffer, and also adds a copy of it to Sent so we can talk about what the system is doing. Seq(Byte) is the set of all possible sequences of bytes, and \ {<< >>} just excludes the empty sequence.
A step of the combined system is either a Read step or a Write step:
Next == Read \/ Write
We also need to define what a valid starting state for the algorithm looks like:
Init ==
  /\ Sent = << >>
  /\ Buffer = << >>
  /\ Got = << >>
Finally, we can put all this together to get a temporal formula for the algorithm:
vars == << Got, Buffer, Sent >>

Spec == Init /\ [][Next]_vars
Some more notation here:
[Next]_vars (that’s Next in brackets with a subscript vars) means Next \/ UNCHANGED vars.
Using Init (a state expression) in a temporal formula means it must be true for the first state of the behaviour.
[][Action]_vars means that [Action]_vars must be true for each step.
TLA syntax requires the _vars subscript here. This is because other things can be going on in the world beside our algorithm, so it must always be possible to take a step without our algorithm doing anything.
Spec defines behaviours just like Properties does, but in a way that makes it more obvious how to implement the protocol.
Correctness of Spec
Now we have definitions of Spec and Properties, it makes sense to check that every behaviour of Spec satisfies Properties. In Python terms, we want to check that all behaviours b satisfy this:
def SpecOK(b):
    return not Spec(b) or Properties(b)
i.e. either b isn’t a behaviour that could result from the actions of our algorithm or, if it is, it satisfies Properties. In TLA notation, we write this as:
SpecOK == Spec => Properties
It’s OK if a behaviour is allowed by Properties but not by Spec. For example, the behaviour which goes straight from Got="", Sent="" to Got="Hi", Sent="Hi" in one step meets our requirements, but it’s not a behaviour of Spec.
The real implementation may itself further restrict Spec. For example, consider the behaviour << s0, s1, s2 >>:
State   Got    Buffer   Sent
s0      ""     "Hi"     "Hi"
s1      "H"    "i"      "Hi"
s2      "Hi"   ""       "Hi"
The sender sends two bytes at once, but the reader reads them one at a time. This is a behaviour of the C implementation, because the reading application can ask the library to read into a 1-byte buffer. However, it is not a behaviour of the OCaml implementation, which gets to choose how much data to return to the application and will return both bytes together.
That’s fine. We just need to show that OCamlImpl => Spec and Spec => Properties and we can deduce that OCamlImpl => Properties. This is, of course, the key purpose of a specification: we only need to check that each implementation implements the specification, not that each implementation directly provides the desired properties.
It might seem strange that an implementation doesn’t have to allow all the specified behaviours. In fact, even the trivial specification Spec == FALSE is considered to be a correct implementation of Properties, because it has no bad behaviours (no behaviours at all). But that’s OK. Once the algorithm is running, it must have some behaviour, even if that behaviour is to do nothing. As the user of the library, you are responsible for checking that you can use it (e.g. by ensuring that the Init conditions are met). An algorithm without any behaviours corresponds to a library you could never use, not to one that goes wrong once it is running.
The model checker
Now comes the fun part: we can ask TLC (the TLA model checker) to check that Spec => Properties. You do this by asking the toolbox to create a new model (I called mine SpecOK) and setting Spec as the “behaviour spec”. It will prompt for a value for BufferSize. I used 2. There will be various things to fix up:
To check Write, TLC first tries to get every possible Seq(Byte), which is an infinite set. I defined MSG == Seq(Byte) and changed Write to use MSG. I then added an alternative definition for MSG in the model so that we only send messages of limited length. In fact, my replacement MSG ensures that Sent will always just be an incrementing sequence (<< 1, 2, 3, ... >>). That’s enough to check Properties, and much quicker than checking every possible message.
The system can keep sending forever. I added a state constraint to the model: Len(Sent) < 4. This tells TLC to stop considering any execution once this becomes false.
With that, the model runs successfully. This is a nice feature of TLA: instead of changing our specification to make it testable, we keep the specification correct and just override some aspects of it in the model. So, the specification says we can send any message, but the model only checks a few of them.
Now we can add Integrity as an invariant to check. That passes, but it’s good to double-check by changing the algorithm. I changed Read so that it doesn’t clear the buffer, using Buffer' = Drop(Buffer, 0) (with 0 instead of n). Then TLC reports a counter-example (“Invariant Integrity is violated”):
The sender writes << 1, 2 >> to Buffer.
The reader reads one byte, to give Got=1, Buffer=12, Sent=12.
The reader reads another byte, to give Got=11, Buffer=12, Sent=12.
Looks like it really was checking what we wanted. It’s good to be careful. If we’d accidentally added Integrity as a “property” to check rather than as an “invariant” then it would have interpreted it as a temporal formula and reported success just because it is true in the initial state.
One really nice feature of TLC is that (unlike a fuzz tester) it does a breadth-first search and therefore finds minimal counter-examples for invariants. The example above is therefore the quickest way to violate Integrity.
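TLC's breadth-first exploration can be sketched for the simple Spec above. This is a toy checker, not TLC: BUFFER_SIZE and MAX_SENT mirror the model's BufferSize and the Len(Sent) < 4 constraint, the incrementing-message trick from my MSG override is used, and writes are a single byte at a time as a further simplification.

```python
# Toy breadth-first model check of the simple Spec above.
BUFFER_SIZE = 2   # mirrors the model's BufferSize
MAX_SENT = 3      # mirrors the state constraint Len(Sent) < 4

def next_states(state):
    got, buf, sent = state
    # Read: some n bytes (1..len(buf)) move from Buffer to Got.
    for n in range(1, len(buf) + 1):
        yield (got + buf[:n], buf[n:], sent)
    # Write: append the next byte of an incrementing message,
    # respecting the buffer-size limit (a one-byte-at-a-time
    # simplification of the Write action).
    if len(sent) < MAX_SENT and len(buf) < BUFFER_SIZE:
        m = (len(sent) + 1,)
        yield (got, buf + m, sent + m)

def integrity(state):
    got, buf, sent = state
    return sent[:len(got)] == got

seen = set()
frontier = [((), (), ())]          # Init: all three sequences empty
while frontier:
    nxt = []
    for s in frontier:
        if s in seen:
            continue
        seen.add(s)
        assert integrity(s), s     # invariant check at every state, as in TLC
        nxt.extend(next_states(s))
    frontier = nxt
print(len(seen), "distinct states; Integrity held in all of them")
```

Because the frontier advances one step at a time, the first failing state found would be at minimal depth, which is why TLC's counter-examples are minimal.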
Checking Availability complains because of the use of Nat (we’re asking it to check for every possible length). I replaced the Nat with AvailabilityNat and overrode that to be 0..4 in the model. It then complains “Temporal properties were violated” and shows an example where the sender wrote some data and the reader never read it.
The problem is, [Next]_vars always allows us to do nothing. To fix this, we can specify a “weak fairness” constraint. WF_vars(action), says that we can’t just stop forever with action being always possible but never happening. I updated Spec to require the Read action to be fair:
Spec == Init /\ [][Next]_vars /\ WF_vars(Read)
Again, care is needed here. If we had specified WF_vars(Next) then we would be forcing the sender to keep sending forever, which users of vchan are not required to do. Worse, this would mean that every possible behaviour of the system would result in Sent growing forever. Every behaviour would therefore hit our Len(Sent) < 4 constraint and TLC wouldn’t consider it further. That means that TLC would never check any actual behaviour against Availability, and its reports of success would be meaningless! Changing Read to require n \in 2..Len(Buffer) is a quick way to see that TLC is actually checking Availability.
Here’s the complete spec so far: vchan1.pdf (source)
The real vchan
The simple Spec algorithm above has some limitations. One obvious simplification is that Buffer is just the sequence of bytes in transit, whereas in the real system it is a ring buffer, made up of an array of bytes along with the producer and consumer counters. We could replace it with three separate variables to make that explicit. However, ring buffers in Xen are well understood and I don’t feel that it would make the specification any clearer to include that.
A more serious problem is that Spec assumes that there is a way to perform the Read and Write operations atomically. Otherwise the real system would have behaviours not covered by the spec. To implement the above Spec correctly, you’d need some kind of lock. The real vchan protocol is more complicated than Spec, but avoids the need for a lock.
The real system has more shared state than just Buffer. I added extra variables to the spec for each item of shared state in the C code, along with its initial value:
SenderLive = TRUE (sender sets to FALSE to close connection)
ReceiverLive = TRUE (receiver sets to FALSE to close connection)
NotifyWrite = TRUE (receiver wants to be notified of next write)
DataReadyInt = FALSE (sender has signalled receiver over event channel)
NotifyRead = FALSE (sender wants to be notified of next read)
SpaceAvailableInt = FALSE (receiver has notified sender over event channel)
DataReadyInt represents the state of the receiver’s event port. The sender can make a Xen hypercall to set this and wake (or interrupt) the receiver. I guess sending these events is somewhat slow, because the NotifyWrite system is used to avoid sending events unnecessarily. Likewise, SpaceAvailableInt is the sender’s event port.
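The event-channel behaviour described above (a pending flag plus a wake-up, where sleeping with an event already pending returns immediately) can be sketched in Python. This is a toy model of my reading of the description, not the Xen interface:

```python
import threading

class EventChannel:
    """Toy model of an event channel as described above: notify() sets
    a pending flag and wakes any sleeper; sleeping with an event already
    pending returns immediately; a second notify before the sleep is
    indistinguishable from a single one."""

    def __init__(self):
        self.pending = False
        self.cond = threading.Condition()

    def notify(self):
        with self.cond:
            self.pending = True
            self.cond.notify()

    def wait(self):
        with self.cond:
            while not self.pending:
                self.cond.wait()
            self.pending = False   # consume the event before returning
```

For example, calling notify() twice and then wait() returns immediately and consumes the single pending event, so a later wait() would block until the next notify().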
The algorithm
Here is my understanding of the protocol. On the sending side:
The sending application asks to send some bytes. We check whether the receiver has closed the channel and abort if so.
We check the amount of buffer space available.
If there isn’t enough, we set NotifyRead so the receiver will notify us when there is more. We also check the space again after this, in case it changed while setting the flag.
If there is any space:
We write as much data as we can to the buffer.
If the NotifyWrite flag is set, we clear it and notify the receiver of the write.
If we wrote everything, we return success.
Otherwise, we wait to be notified of more space.
We check whether the receiver has closed the channel. If so we abort. Otherwise, we go back to step 2.
On the receiving side:
The receiving application asks us to read up to some amount of data.
We check the amount of data available in the buffer.
If there isn’t as much as requested, we set NotifyWrite so the sender will notify us when there is. We also check the space again after this, in case it changed while setting the flag.
If there is any data, we read up to the amount requested. If the NotifyRead flag is set, we clear it and notify the sender of the new space. We return success to the application (even if we didn’t get as much as requested).
Otherwise (if there was no data), we check whether the sender has closed the connection.
If not (if the connection is still open), we wait to be notified of more data, and then go back to step 2.
Either side can close the connection by clearing their “live” flag and signalling the other side. I assumed there is also some process-local way that the close operation can notify its own side if it’s currently blocked.
To make expressing this kind of step-by-step algorithm easier, TLA+ provides a programming-language-like syntax called PlusCal. It then translates PlusCal into TLA actions.
Confusingly, there are two different syntaxes for PlusCal: Pascal style and C style. This means that, when you search for examples on the web, there is a 50% chance they won’t work because they’re using the other flavour. I started with the Pascal one because that was the first example I found, but switched to C-style later because it was more compact.
Here is my attempt at describing the sender algorithm above in PlusCal:
fair process (SenderWrite = SenderWriteID)
variables free = 0,       \* Our idea of how much free space is available.
          msg = << >>,    \* The data we haven't sent yet.
          Sent = << >>;   \* Everything we were asked to send.
{
sender_ready:-
  while (TRUE) {
    if (~SenderLive \/ ~ReceiverLive) goto Done
    else {
      with (m \in MSG) { msg := m };
      Sent := Sent \o msg;            \* Remember we wanted to send this
    };
sender_write:
    while (TRUE) {
      free := BufferSize - Len(Buffer);
sender_request_notify:
      if (free >= Len(msg)) goto sender_write_data
      else NotifyRead := TRUE;
sender_recheck_len:
      free := BufferSize - Len(Buffer);
sender_write_data:
      if (free > 0) {
        Buffer := Buffer \o Take(msg, Min(Len(msg), free));
        msg := Drop(msg, Min(Len(msg), free));
        free := 0;
sender_check_notify_data:
        if (NotifyWrite) {
          NotifyWrite := FALSE;       \* Atomic test-and-clear
sender_notify_data:
          DataReadyInt := TRUE;       \* Signal receiver
          if (msg = << >>) goto sender_ready
        } else if (msg = << >>) goto sender_ready
      };
sender_blocked:
      await SpaceAvailableInt \/ ~SenderLive;
      if (~SenderLive) goto Done;
      else SpaceAvailableInt := FALSE;
sender_check_recv_live:
      if (~ReceiverLive) goto Done;
    }
  }
}
The labels (e.g. sender_request_notify:) represent points in the program where other actions can happen. Everything between two labels is considered to be atomic. I checked that every block of code between labels accesses only one shared variable. This means that the real system can’t see any states that we don’t consider. The toolbox doesn’t provide any help with this; you just have to check manually.
The sender_ready label represents a state where the client application hasn’t yet decided to send any data. Its label is tagged with - to indicate that fairness doesn’t apply here, because the protocol doesn’t require applications to keep sending more data forever. The other steps are fair, because once we’ve decided to send something we should keep going.
Taking a step from sender_ready to sender_write corresponds to the vchan library’s write function being called with some argument m. The with (m \in MSG) says that m could be any message from the set MSG. TLA also contains a CHOOSE operator that looks like it might do the same thing, but it doesn’t. When you use with, you are saying that TLC should check all possible messages. When you use CHOOSE, you are saying that it doesn’t matter which message TLC tries (and it will always try the same one). Or, in terms of the specification, a CHOOSE would say that applications can only ever send one particular message, without telling you what that message is.
In sender_write_data, we set free := 0 for no obvious reason. This is just to reduce the number of states that the model checker needs to explore, since we don’t care about its value after this point.
Some of the code is a little awkward because I had to put things in else branches that would more naturally go after the whole if block, but the translator wouldn’t let me do that. The use of semi-colons is also a bit confusing: the PlusCal-to-TLA translator requires them after a closing brace in some places, but the PDF generator messes up the indentation if you include them.
Here’s how the code block starting at sender_request_notify gets translated into a TLA action:
sender_request_notify ==
  /\ pc[SenderWriteID] = "sender_request_notify"
  /\ IF free >= Len(msg)
     THEN /\ pc' = [pc EXCEPT ![SenderWriteID] = "sender_write_data"]
          /\ UNCHANGED NotifyRead
     ELSE /\ NotifyRead' = TRUE
          /\ pc' = [pc EXCEPT ![SenderWriteID] = "sender_recheck_len"]
  /\ UNCHANGED << SenderLive, ReceiverLive, Buffer, NotifyWrite,
                  DataReadyInt, SpaceAvailableInt, free, msg, Sent,
                  have, want, Got >>
pc is a mapping from process ID to the label where that process is currently executing. So sender_request_notify can only be performed when the SenderWriteID process is at the sender_request_notify label. Afterwards pc[SenderWriteID] will either be at sender_write_data or sender_recheck_len (if there wasn’t enough space for the whole message).
Here’s the code for the receiver:
fair process (ReceiverRead = ReceiverReadID)
variables have = 0,     \* The amount of data we think the buffer contains.
          want = 0,     \* The amount of data the user wants us to read.
          Got = << >>;  \* Pseudo-variable recording all data ever received by receiver.
{
recv_ready:
  while (ReceiverLive) {
    with (n \in 1..MaxReadLen) want := n;
recv_reading:
    while (TRUE) {
      have := Len(Buffer);
recv_got_len:
      if (have >= want) goto recv_read_data
      else NotifyWrite := TRUE;
recv_recheck_len:
      have := Len(Buffer);
recv_read_data:
      if (have > 0) {
        Got := Got \o Take(Buffer, Min(want, have));
        Buffer := Drop(Buffer, Min(want, have));
        want := 0;
        have := 0;
recv_check_notify_read:
        if (NotifyRead) {
          NotifyRead := FALSE;      \* (atomic test-and-clear)
recv_notify_read:
          SpaceAvailableInt := TRUE;
          goto recv_ready;          \* Return success
        } else goto recv_ready;     \* Return success
      } else if (~SenderLive \/ ~ReceiverLive) {
        goto Done;
      };
recv_await_data:
      await DataReadyInt \/ ~ReceiverLive;
      if (~ReceiverLive) { want := 0; goto Done }
      else DataReadyInt := FALSE;
    }
  }
}
It’s quite similar to before. recv_ready corresponds to a state where the application hasn’t yet called read. When it does, we take n (the maximum number of bytes to read) as an argument and store it in the local variable want.
Note: you can use the C library in blocking or non-blocking mode. In blocking mode, a write (or read) waits until data is sent (or received). In non-blocking mode, it returns a special code to the application indicating that it needs to wait. The application then does the waiting itself and then calls the library again. I think the specification above covers both cases, depending on whether you think of sender_blocked and recv_await_data as representing code inside or outside of the library.
We also need a way to close the channel. It wasn’t clear to me, from looking at the C headers, when exactly you’re allowed to do that. I think that if you had a multi-threaded program and you called the close function while the write function was blocked, it would unblock and return. But if you happened to call it at the wrong time, it would try to use a closed file descriptor and fail (or read from the wrong one). So I guess it’s single threaded, and you should use the non-blocking mode if you want to cancel things.
That means that the sender can close only when it is at sender_ready or sender_blocked, and similarly for the receiver. The situation with the OCaml code is the same, because it is cooperatively threaded and so the close operation can only be called while blocked or idle. However, I decided to make the specification more general and allow for closing at any point by modelling closing as separate processes:
fair process (SenderClose = SenderCloseID) {
sender_open:-
  SenderLive := FALSE;        \* Clear liveness flag
sender_notify_closed:
  DataReadyInt := TRUE;       \* Signal receiver
}

fair process (ReceiverClose = ReceiverCloseID) {
recv_open:-
  ReceiverLive := FALSE;      \* Clear liveness flag
recv_notify_closed:
  SpaceAvailableInt := TRUE;  \* Signal sender
}
Again, the processes are “fair” because once we start closing we should finish, but the initial labels are tagged with “-” to disable fairness there: it’s OK if you keep a vchan open forever.
There’s a slight naming problem here. The PlusCal translator names the actions it generates after the starting state of the action. So sender_open is the action that moves from the sender_open label. That is, the sender_open action actually closes the connection!
Finally, we share the event channel with the buffer going in the other direction, so we might get notifications that are nothing to do with us. To ensure we handle that, I added another process that can send events at any time:
process (SpuriousInterrupts = SpuriousID) {
spurious:
  while (TRUE) {
    either SpaceAvailableInt := TRUE
    or     DataReadyInt := TRUE
  }
}
either/or says that we need to consider both possibilities. This process isn’t marked fair, because we can’t rely on these interrupts coming. But we do have to handle them when they happen.
Testing the full spec
PlusCal code is written in a specially-formatted comment block, and you have to press Ctrl-T to generate (or update) the TLA translation before running the model checker.
Be aware that the TLA Toolbox is a bit unreliable about keyboard short-cuts. While typing into the editor always works, short-cuts such as Ctrl-S (save) sometimes get disconnected. So you think you’re doing “edit/save/translate/save/check” cycles, but really you’re just checking some old version over and over again. You can avoid this by always running the model checker with the keyboard shortcut too, since that always seems to fail at the same time as the others. Focussing a different part of the GUI and then clicking back in the editor again fixes everything for a while.
Anyway, running our model on the new spec shows that Integrity is still OK. However, the Availability check fails with the following counter-example:
The sender writes << 1 >> to Buffer.
The sender closes the connection.
The receiver closes the connection.
All processes come to a stop, but the data never arrived.
We need to update Availability to consider the effects of closing connections. And at this point, I’m very unsure what vchan is intended to do. We could say:
Availability ==
  \A x \in AvailabilityNat : Len(Sent) = x ~>
    \/ Len(Got) >= x
    \/ ~ReceiverLive
    \/ ~SenderLive
That passes. But vchan describes itself as being like a Unix socket. If you write to a Unix socket and then close it, you still expect the data to be delivered. So actually I tried this:
Availability ==
  \A x \in AvailabilityNat :
    x = Len(Sent) /\ SenderLive /\ pc[SenderWriteID] = "sender_ready" ~>
      \/ Len(Got) >= x
      \/ ~ReceiverLive
This says that if a sender write operation completes successfully (we’re back at sender_ready) and at that point the sender hasn’t closed the connection, then the receiver will eventually receive the data (or close its end).
That is how I would expect it to behave. But TLC reports that the new spec does not satisfy this, giving this example (simplified - there are 16 steps in total):
The receiver starts reading. It finds that the buffer is empty.
The sender writes some data to Buffer and returns to sender_ready.
The sender closes the channel.
The receiver sees that the connection is closed and stops.
Is this a bug? Without a specification, it’s impossible to say. Maybe vchan was never intended to ensure delivery once the sender has closed its end. But this case only happens if you’re very unlucky about the scheduling. If the receiving application calls read when the sender has closed the connection but there is data available then the C code does return the data in that case. It’s only if the sender happens to close the connection just after the receiver has checked the buffer and just before it checks the close flag that this happens.
It’s also easy to fix. I changed the code in the receiver to do a final check on the buffer before giving up:
} else if (~SenderLive \/ ~ReceiverLive) {
recv_final_check:
  if (Len(Buffer) = 0) { want := 0; goto Done }
  else goto recv_reading;
}
With that change, we can be sure that data sent while the connection is open will always be delivered (provided only that the receiver doesn’t close the connection itself). If you spotted this issue yourself while you were reviewing the code earlier, then well done!
Note that when TLC finds a problem with a temporal property (such as Availability), it does not necessarily find the shortest example first. I changed the limit on Sent to Len(Sent) < 2 and added an action constraint of ~SpuriousInterrupts to get a simpler example, with only 1 byte being sent and no spurious interrupts.
Some odd things
I noticed a couple of other odd things, which I thought I’d mention.
First, NotifyWrite is initialised to TRUE, which seemed unnecessary. We can initialise it to FALSE instead and everything still works. We can even initialise it with NotifyWrite \in {TRUE, FALSE} to allow either behaviour, and thus test that old programs that followed the original version of the spec still work with either behaviour.
That’s a nice advantage of using a specification language. Saying “the code is the spec” becomes less useful as you build up more and more versions of the code!
However, because there was no spec before, we can’t be sure that existing programs do follow it. And, in fact, I found that QubesDB uses the vchan library in a different and unexpected way. Instead of calling read, and then waiting if libvchan says to, QubesDB blocks first in all cases, and then calls the read function once it gets an event.
We can document that by adding an extra step at the start of ReceiverRead:
recv_init:
  either goto recv_ready        \* (recommended)
  or {                          \* (QubesDB does this)
    with (n \in 1..MaxReadLen) want := n;
    goto recv_await_data;
  };
Then TLC shows that NotifyWrite cannot start as FALSE.
The second odd thing is that the receiver sets NotifyWrite whenever there isn’t enough data available to fill the application’s buffer completely. But usually when you do a read operation you just provide a buffer large enough for the largest likely message. It would probably make more sense to set NotifyWrite only when the buffer is completely empty. After checking the current version of the algorithm, I changed the specification to allow either behaviour.
Why does vchan work?
At this point, we have specified what vchan should do and how it does it. We have also checked that it does do this, at least for messages up to 3 bytes long with a buffer size of 2. That doesn’t sound like much, but we still checked 79,288 distinct states, with behaviours up to 38 steps long. This would be a perfectly reasonable place to declare the specification (and blog post) finished.
However, TLA has some other interesting abilities. In particular, it provides a very interesting technique to help discover why the algorithm works.
We’ll start with Integrity. We would like to argue as follows:
Integrity is true in any initial state (i.e. Init => Integrity).
Any Next step preserves Integrity (i.e. Integrity /\ Next => Integrity').
Then it would just be a matter of looking at each possible action that makes up Next and checking that each one individually preserves Integrity. However, we can’t do this with Integrity because (2) isn’t true. For example, the state { Got: "", Buffer: "21", Sent: "12" } satisfies Integrity, but if we take a read step then the new state won’t. Instead, we have to argue “If we take a Next step in any reachable state then Integrity'”, but that’s very difficult because how do we know whether a state is reachable without searching them all?
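The counter-example above is easy to check mechanically. Here is a quick Python sketch (just an illustration of the argument, not part of the spec): the state Got="", Buffer="21", Sent="12" satisfies Integrity, but one read step produces a state that doesn’t, so Integrity alone is not inductive.

```python
# Check the counter-example from the text: Integrity says that Got is a
# prefix of Sent. The state below satisfies it, yet a single read step
# (moving one byte from the front of Buffer to the end of Got) breaks it.

def integrity(got, sent):
    """Integrity: everything received so far is a prefix of everything sent."""
    return sent.startswith(got)

got, buffer, sent = "", "21", "12"
assert integrity(got, sent)        # holds: "" is a prefix of "12"

# One read step: transfer one byte from Buffer to Got.
got, buffer = got + buffer[0], buffer[1:]
assert not integrity(got, sent)    # broken: "2" is not a prefix of "12"
```

Of course, this state is unreachable in the real system (the buffer never holds data that was never sent), which is exactly why we need a stronger, inductive invariant that rules it out.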
So the idea is to make a stronger version of Integrity, called IntegrityI, which does what we want. IntegrityI is called an inductive invariant. The first step is fairly obvious - I began with:
IntegrityI ==
  Sent = Got \o Buffer \o msg
Integrity just said that Got is a prefix of Sent. This says specifically that the rest is Buffer \o msg - the data currently being transmitted and the data yet to be transmitted.
We can ask TLC to check Init /\ [][Next]_vars => []IntegrityI to check that it is an invariant, as before. It does that by finding all the Init states and then taking Next steps to find all reachable states. But we can also ask it to check IntegrityI /\ [][Next]_vars => []IntegrityI. That is, the same thing but starting from any state matching IntegrityI instead of Init.
I created a new model (IntegrityI) to do that. It reports a few technical problems at the start because it doesn’t know the types of anything. For example, it can’t choose initial values for SenderLive without knowing that SenderLive is a boolean. I added a TypeOK state expression that gives the expected type of every variable:
MESSAGE == Seq(Byte)
FINITE_MESSAGE(L) == UNION ( { [ 1..N -> Byte ] : N \in 0..L } )

TypeOK ==
  /\ Sent \in MESSAGE
  /\ Got \in MESSAGE
  /\ Buffer \in FINITE_MESSAGE(BufferSize)
  /\ SenderLive \in BOOLEAN
  /\ ReceiverLive \in BOOLEAN
  /\ NotifyWrite \in BOOLEAN
  /\ DataReadyInt \in BOOLEAN
  /\ NotifyRead \in BOOLEAN
  /\ SpaceAvailableInt \in BOOLEAN
  /\ free \in 0..BufferSize
  /\ msg \in FINITE_MESSAGE(MaxWriteLen)
  /\ want \in 0..MaxReadLen
  /\ have \in 0..BufferSize
We also need to tell it all the possible states of pc (which says which label each process is at):
PCOK == pc \in [
  SW: {"sender_ready", "sender_write", "sender_request_notify", "sender_recheck_len",
       "sender_write_data", "sender_blocked", "sender_check_notify_data",
       "sender_notify_data", "sender_check_recv_live", "Done"},
  SC: {"sender_open", "sender_notify_closed", "Done"},
  RR: {"recv_init", "recv_ready", "recv_reading", "recv_got_len", "recv_recheck_len",
       "recv_read_data", "recv_final_check", "recv_await_data",
       "recv_check_notify_read", "recv_notify_read", "Done"},
  RC: {"recv_open", "recv_notify_closed", "Done"},
  SP: {"spurious"}
]
You might imagine that the PlusCal translator would generate that for you, but it doesn’t. We also need to override MESSAGE with FINITE_MESSAGE(n) for some n (I used 2). Otherwise, it can’t enumerate all possible messages. Now we have:
IntegrityI ==
  /\ TypeOK
  /\ PCOK
  /\ Sent = Got \o Buffer \o msg
With that out of the way, TLC starts finding real problems (that is, examples showing that IntegrityI /\ Next => IntegrityI' isn’t true). First, recv_read_data would do an out-of-bounds read if have = 1 and Buffer = << >>. Our job is to explain why that isn’t a valid state. We can fix it with an extra constraint:
IntegrityI ==
  /\ TypeOK
  /\ PCOK
  /\ Sent = Got \o Buffer \o msg
  /\ pc[ReceiverReadID] = "recv_read_data" => have <= Len(Buffer)
(note: that => is “implies”, while the <= is “less-than-or-equal-to”)
Now it complains that if we do recv_got_len with Buffer = << >>, have = 1, want = 0 then we end up in recv_read_data with Buffer = << >>, have = 1, and we have to explain why that can’t happen and so on.
Because TLC searches breadth-first, the examples it finds never have more than 2 states. You just have to explain why the first state can’t happen in the real system. Eventually, you get a big ugly pile of constraints, which you then think about for a bit and simplify. I ended up with:
IntegrityI ==
  /\ TypeOK
  /\ PCOK
  /\ Sent = Got \o Buffer \o msg
  /\ have <= Len(Buffer)
  /\ free <= BufferSize - Len(Buffer)
  /\ pc[SenderWriteID] \in {"sender_write", "sender_request_notify", "sender_recheck_len",
                            "sender_write_data", "sender_blocked", "sender_check_recv_live"}
       => msg /= << >>
  /\ pc[SenderWriteID] \in {"sender_ready"} => msg = << >>
It’s a good idea to check the final IntegrityI with the original SpecOK model, just to check it really is an invariant.
So, in summary, Integrity is always true because:
Sent is always the concatenation of Got, Buffer and msg. That’s fairly obvious, because sender_ready sets msg and appends the same thing to Sent, and the other steps (sender_write_data and recv_read_data) just transfer some bytes from the start of one variable to the end of another.
Although, like all local information, the receiver’s have variable might be out-of-date, there must be at least that much data in the buffer, because the sender process will only have added more, not removed any. This is sufficient to ensure that we never do an out-of-range read.
Likewise, the sender’s free variable is a lower bound on the true amount of free space, because the receiver only ever creates more space. We will therefore never write beyond the free space.
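The lower-bound argument for have can be sanity-checked in a few lines of Python (an illustration of the reasoning, not the spec itself): after the receiver samples the buffer length, only sender writes can interleave before the read, and writes only grow the buffer, so the stale snapshot never exceeds the true length.

```python
# Sanity check of the lower-bound argument: between the receiver sampling
# `have` and actually reading, only the sender can act, and the sender only
# appends. So the stale `have` never exceeds the real buffer length, and a
# read of min(want, have) bytes is always in range.

buffer = [10, 20]       # contents when the receiver takes its snapshot
have = len(buffer)      # receiver's (possibly stale) view

# Sender appends more data before the receiver gets around to reading:
buffer.extend([30, 40])

assert have <= len(buffer)           # the snapshot is a valid lower bound
want = 3
taken = buffer[:min(want, have)]     # an in-range read using the stale info
print(taken)
```

The symmetric argument applies to free: the receiver only removes data, so the sender’s stale view of the free space is also a safe lower bound.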
I think this ability to explain why an algorithm works, by being shown examples where the inductive property doesn’t hold, is a really nice feature of TLA. Inductive invariants are useful as a first step towards writing a proof, but I think they’re valuable even on their own. If you’re documenting your own algorithm, this process will get you to explain your own reasons for believing it works (I tried it on a simple algorithm in my own code and it seemed helpful).
Some notes:
Originally, I had the free and have constraints depending on pc. However, the algorithm sets them to zero when not in use so it turns out they’re always true.
IntegrityI matches 532,224 states, even with a maximum Sent length of 1, but it passes! There are some games you can play to speed things up; see Using TLC to Check Inductive Invariance for some suggestions (I only discovered that while writing this up).
Proving Integrity
TLA provides a syntax for writing proofs, and integrates with TLAPS (the TLA+ Proof System) to allow them to be checked automatically.
Proving IntegrityI is just a matter of showing that Init => IntegrityI and that it is preserved by any possible [Next]_vars step. To do that, we consider each action of Next individually, which is long but simple enough.
I was able to prove it, but the recv_read_data action was a little difficult because we don’t know that want > 0 at that point, so we have to do some extra work to prove that transferring 0 bytes works, even though the real system never does that.
I therefore added an extra condition to IntegrityI that want is non-zero whenever it’s in use, and also conditions about have and free being 0 when not in use, for completeness:
IntegrityI ==
  [...]
  /\ want = 0 <=> pc[ReceiverReadID] \in {"recv_check_notify_read", "recv_notify_read",
                                          "recv_init", "recv_ready", "recv_notify_read", "Done"}
  /\ \/ pc[ReceiverReadID] \in {"recv_got_len", "recv_recheck_len", "recv_read_data"}
     \/ have = 0
  /\ \/ pc[SenderWriteID] \in {"sender_write", "sender_request_notify",
                               "sender_recheck_len", "sender_write_data"}
     \/ free = 0
Availability
Integrity was quite easy to prove, but I had more trouble trying to explain Availability. One way to start would be to add Availability as a property to check to the IntegrityI model. However, it takes a while to check properties as it does them at the end, and the examples it finds may have several steps (it took 1m15s to find a counter-example for me).
Here’s a faster way (37s). The algorithm will deadlock if both sender and receiver are in their blocked states and neither interrupt is pending, so I made a new invariant, I, which says that deadlock can’t happen:
I ==
  /\ IntegrityI
  /\ ~ /\ pc[SenderWriteID] = "sender_blocked"
       /\ ~SpaceAvailableInt
       /\ pc[ReceiverReadID] = "recv_await_data"
       /\ ~DataReadyInt
I discovered some obvious facts about closing the connection. For example, the SenderLive flag is set if and only if the sender’s close thread hasn’t done anything. I’ve put them all together in CloseOK:
(* Some obvious facts about shutting down connections. *)
CloseOK ==
  \* An endpoint is live iff its close thread hasn't done anything:
  /\ pc[SenderCloseID] = "sender_open" <=> SenderLive
  /\ pc[ReceiverCloseID] = "recv_open" <=> ReceiverLive
  \* The send and receive loops don't terminate unless someone has closed the connection:
  /\ pc[ReceiverReadID] \in {"recv_final_check", "Done"} => ~ReceiverLive \/ ~SenderLive
  /\ pc[SenderWriteID] \in {"Done"} => ~ReceiverLive \/ ~SenderLive
  \* If the receiver closed the connection then we will get (or have got) the signal:
  /\ pc[ReceiverCloseID] = "Done" =>
       \/ SpaceAvailableInt
       \/ pc[SenderWriteID] \in {"sender_check_recv_live", "Done"}
But I had problems with other examples TLC showed me, and I realised that I didn’t actually know why this algorithm doesn’t deadlock.
Intuitively it seems clear enough: the sender puts data in the buffer when there’s space and notifies the receiver, and the receiver reads it and notifies the writer. What could go wrong? But both processes are working with information that can be out-of-date. By the time the sender decides to block because the buffer looked full, the buffer might be empty. And by the time the receiver decides to block because it looked empty, it might be full.
Maybe you already saw why it works from the C code, or the algorithm above, but it took me a while to figure it out! I eventually ended up with an invariant of the form:
I == ..
  /\ SendMayBlock    => SpaceWakeupComing
  /\ ReceiveMayBlock => DataWakeupComing
SendMayBlock is TRUE if we’re in a state that may lead to being blocked without checking the buffer’s free space again. Likewise, ReceiveMayBlock indicates that the receiver might block. SpaceWakeupComing and DataWakeupComing predict whether we’re going to get an interrupt. The idea is that if we’re going to block, we need to be sure we’ll be woken up. It’s a bit ugly, though, e.g.
DataWakeupComing ==
  \/ DataReadyInt                                 \* Event sent
  \/ pc[SenderWriteID] = "sender_notify_data"     \* Event being sent
  \/ pc[SenderCloseID] = "sender_notify_closed"
  \/ pc[ReceiverCloseID] = "recv_notify_closed"
  \/ /\ NotifyWrite                               \* Event requested and ...
     /\ ReceiverLive                              \* Sender can see receiver is still alive and ...
     /\ \/ pc[SenderWriteID] = "sender_write_data" /\ free > 0
        \/ pc[SenderWriteID] = "sender_check_notify_data"
        \/ pc[SenderWriteID] = "sender_recheck_len" /\ Len(Buffer) < BufferSize
        \/ pc[SenderWriteID] = "sender_ready" /\ SenderLive /\ Len(Buffer) < BufferSize
        \/ pc[SenderWriteID] = "sender_write" /\ Len(Buffer) < BufferSize
        \/ pc[SenderWriteID] = "sender_request_notify" /\ Len(Buffer) < BufferSize
        \/ SpaceWakeupComing /\ Len(Buffer) < BufferSize /\ SenderLive
It did pass my model that tested sending one byte, and I decided to try a proof. Well, it didn’t work. The problem seems to be that DataWakeupComing and SpaceWakeupComing are really mutually recursive. The reader will wake up if the sender wakes it, but the sender might be blocked, or about to block. That’s OK though, as long as the receiver will wake it, which it will do, once the sender wakes it…
You’ve probably already figured it out, but I thought I’d document my confusion. It occurred to me that although each process might have out-of-date information, that could be fine as long as at any one moment one of them was right. The last process to update the buffer must know how full it is, so one of them must have correct information at any given time, and that should be enough to avoid deadlock.
That didn’t work either. When you’re at a proof step and can’t see why it’s correct, you can ask TLC to show you an example. e.g. if you’re stuck trying to prove that sender_request_notify preserves I when the receiver is at recv_ready, the buffer is full, and ReceiverLive = FALSE, you can ask for an example of that:
Example ==
  /\ PCOK
  /\ pc[SenderWriteID] = "sender_request_notify"
  /\ pc[ReceiverReadID] = "recv_ready"
  /\ ReceiverLive = FALSE
  /\ I
  /\ Len(Buffer) = BufferSize
You then create a new model that searches Example /\ [][Next]_vars and tests I. As long as Example has several constraints, you can use a much larger model for this. I also ask it to check the property [][FALSE]_vars, which means it will show any step starting from Example.
It quickly became clear what was wrong: it is quite possible that neither process is up-to-date. If both processes see the buffer contains X bytes of data, and the sender sends Y bytes and the receiver reads Z bytes, then the sender will think there are X + Y bytes in the buffer and the receiver will think there are X - Z bytes, and neither is correct. My original 1-byte buffer was just too small to find a counter-example.
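The arithmetic of that scenario is worth spelling out. A tiny Python sketch (just the numbers from the text, not the spec): both processes last observed X bytes in the buffer, then the sender writes Y bytes and the receiver reads Z, leaving both views stale at once.

```python
# The scenario from the text, in numbers: both processes last saw the
# buffer holding X bytes; then the sender writes Y bytes and the receiver
# reads Z bytes. Neither side's idea of the buffer length is now correct.

X, Y, Z = 1, 1, 1            # needs a buffer of at least 2 bytes to occur

actual = X + Y - Z           # true buffer contents after both steps
sender_view = X + Y          # sender saw X, then added Y
receiver_view = X - Z        # receiver saw X, then removed Z

print(actual, sender_view, receiver_view)
assert sender_view != actual and receiver_view != actual  # both stale
```

With a 1-byte buffer this interleaving is impossible (X, Y and Z can’t all be positive at once), which is why my original model never found it.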
The real reason why vchan works is actually rather obvious. I don’t know why I didn’t see it earlier. But eventually it occurred to me that I could make use of Got and Sent. I defined WriteLimit to be the total number of bytes that the sender would write before blocking, if the receiver never did anything further. And I defined ReadLimit to be the total number of bytes that the receiver would read if the sender never did anything else.
Did I define these limits correctly? It’s easy to ask TLC to check some extra properties while it’s running. For example, I used this to check that ReadLimit behaves sensibly:
ReadLimitCorrect ==
  \* We will eventually receive what ReadLimit promises:
  /\ WF_vars(ReceiverRead) =>
       \A i \in AvailabilityNat :
         ReadLimit = i ~> Len(Got) >= i \/ ~ReceiverLive
  \* ReadLimit can only decrease if we decide to shut down:
  /\ [][ReadLimit' >= ReadLimit \/ ~ReceiverLive]_vars
  \* ReceiverRead steps don't change the read limit:
  /\ [][ReceiverRead => UNCHANGED ReadLimit \/ ~ReceiverLive]_vars
Because ReadLimit is defined in terms of what it does when no other processes run, this property should ideally be tested in a model without the fairness conditions (i.e. just Init /\ [][Next]_vars). Otherwise, fairness may force the sender to perform a step. We still want to allow other steps, though, to show that ReadLimit is a lower bound.
With this, we can argue that e.g. a 2-byte buffer will eventually transfer 3 bytes:
The receiver will eventually read 3 bytes as long as the sender eventually sends 3 bytes.
The sender will eventually send 3, if the receiver reads at least 1.
The receiver will read 1 if the sender sends at least 1.
The sender will send 1 if the reader has read at least 0 bytes, which is always true.
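The four steps above can be sketched as a simulation. This Python fragment (an illustration of the induction, not the PlusCal spec) alternates write and read steps and shows a 3-byte message passing through a 2-byte buffer, each read freeing the space the next write needs.

```python
# Sketch of the induction above: with a 2-byte buffer, a 3-byte message
# still gets through, because each read step frees space that lets the
# next write step proceed.

BUFFER_SIZE = 2
sent = [1, 2, 3]           # the 3 bytes we want to transfer
msg = list(sent)           # not yet written
buffer, got = [], []

while len(got) < len(sent):
    # Sender: write as much as fits (the byte-stream mode of the spec).
    free = BUFFER_SIZE - len(buffer)
    n = min(free, len(msg))
    buffer.extend(msg[:n])
    msg = msg[n:]
    # Receiver: read one byte, creating space for the sender.
    if buffer:
        got.append(buffer.pop(0))

print(got)  # all 3 bytes delivered through a 2-byte buffer
```

This fixed alternation is of course just one schedule; the point of the ReadLimit/WriteLimit argument is that *every* fair schedule makes the same progress.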
By this point, I was learning to be more cautious before trying a proof, so I added some new models to check this idea further. One prevents the sender from ever closing the connection and the other prevents the receiver from ever closing. That reduces the number of states to consider and I was able to check a slightly larger model.
I ==
  /\ IntegrityI
  /\ CloseOK
  \* If the reader is stuck, but data is available, the sender will unblock it:
  /\ ReaderShouldBeUnblocked =>
       \* The sender is going to write more:
       \/ WriteLimit > Len(Got) + Len(Buffer) /\ Len(msg) > 0 /\ SenderLive
       \* The sender is about to increase ReadLimit:
       \/ (\/ pc[SenderWriteID] = "sender_check_notify_data" /\ NotifyWrite
           \/ pc[SenderWriteID] = "sender_notify_data")
          /\ ReadLimit < Len(Got) + Len(Buffer)
       \* The sender is about to notify us of shutdown:
       \/ pc[SenderCloseID] \in {"sender_notify_closed"}
  \* If the writer is stuck, but there is now space available, the receiver will unblock it:
  /\ WriterShouldBeUnblocked =>
       \* The reader is going to read more:
       \/ ReadLimit > Len(Got) /\ ReceiverLive
       \* The reader is about to increase WriteLimit:
       \/ (\/ pc[ReceiverReadID] = "recv_check_notify_read" /\ NotifyRead
           \/ pc[ReceiverReadID] = "recv_notify_read")
          /\ WriteLimit < Len(Got) + BufferSize
       \* The receiver is about to notify us of shutdown:
       \/ pc[ReceiverCloseID] \in {"recv_notify_closed"}
  /\ NotifyFlagsCorrect
If a process is on a path to being blocked then it must have set its notify flag. NotifyFlagsCorrect says that in that case, the flag is still set, or the interrupt has been sent, or the other process is just about to trigger the interrupt.
I managed to use that to prove that the sender’s steps preserved I, but I needed a little extra to finish the receiver proof. At this point, I finally spotted the obvious invariant (which you, no doubt, saw all along): whenever NotifyRead is still set, the sender has accurate information about the buffer.
/\ NotifyRead =>
     \* The sender has accurate information about the buffer:
     \/ WriteLimit = Len(Got) + BufferSize
     \* Or the flag is being cleared right now:
     \/ pc[ReceiverReadID] = "recv_check_notify_read"
That’s pretty obvious, isn’t it? The sender checks the buffer after setting the flag, so it must have accurate information at that point. The receiver clears the flag after reading from the buffer (which invalidates the sender’s information).
Now I had a dilemma. There was obviously going to be a matching property about NotifyWrite. Should I add that, or continue with just this? I was nearly done, so I continued and finished off the proofs.
With I proved, I was able to prove some other nice things quite easily:
THEOREM
  /\ I
  /\ SenderLive /\ ReceiverLive
  /\ \/ pc[SenderWriteID] = "sender_ready"
     \/ pc[SenderWriteID] = "sender_blocked" /\ ~SpaceAvailableInt
  => ReadLimit = Len(Got) + Len(Buffer)
That says that, whenever the sender is idle or blocked, the receiver will read everything sent so far, without any further help from the sender. And:
THEOREM
  /\ I
  /\ SenderLive /\ ReceiverLive
  /\ pc[ReceiverReadID] \in {"recv_await_data"}
  /\ ~DataReadyInt
  => WriteLimit = Len(Got) + BufferSize
That says that whenever the receiver is blocked, the sender can fill the buffer. That’s pretty nice. It would be possible to make a vchan system that e.g. could only send 1 byte at a time and still prove it couldn’t deadlock and would always deliver data, but here we have shown that the algorithm can use the whole buffer. At least, that’s what these theorems say as long as you believe that ReadLimit and WriteLimit are defined correctly.
With the proof complete, I then went back and deleted all the stuff about ReadLimit and WriteLimit from I and started again with just the new rules about NotifyRead and NotifyWrite. Instead of using WriteLimit = Len(Got) + BufferSize to indicate that the sender has accurate information, I made a new SenderInfoAccurate that just returns TRUE whenever the sender will fill the buffer without further help. That avoids some unnecessary arithmetic, which TLAPS needs a lot of help with.
(* The sender's information is accurate if whenever it is going to block,
   the buffer really is full. *)
SenderInfoAccurate ==
  \* We have accurate information:
  \/ Len(Buffer) + free = BufferSize
  \* In these states, we're going to check the buffer before blocking:
  \/ pc[SenderWriteID] \in {"sender_ready", "sender_request_notify", "sender_write",
                            "sender_recheck_len", "sender_check_recv_live", "Done"}
  \/ pc[SenderWriteID] \in {"sender_request_notify"} /\ free < Len(msg)
  \* If we've been signalled, we'll immediately wake next time we try to block:
  \/ SpaceAvailableInt
  \* We're about to write some data:
  \/ /\ pc[SenderWriteID] \in {"sender_write_data"}
     /\ free >= Len(msg)           \* But we won't need to block
  \* If we wrote all the data we intended to, we'll return without blocking:
  \/ /\ pc[SenderWriteID] \in {"sender_check_notify_data", "sender_notify_data"}
     /\ Len(msg) = 0
By talking about accuracy instead of the write limit, I was also able to include “Done” in with the other happy cases. Before, that had to be treated as a possible problem because the sender can’t use the full buffer when it’s Done.
With this change, the proof of Spec => []I became much simpler (384 lines shorter). And most of the remaining steps were trivial.
The ReadLimit and WriteLimit idea still seemed useful, though, but I found I was able to prove the same things from I. e.g. we can still conclude this, even if I doesn’t mention WriteLimit:
THEOREM /\ I
        /\ SenderLive /\ ReceiverLive
        /\ pc[ReceiverReadID] \in {"recv_await_data"} /\ ~DataReadyInt
        => WriteLimit = Len(Got) + BufferSize
That’s nice, because it keeps the invariant and its proofs simple, but we still get the same result in the end.
I initially defined WriteLimit to be the number of bytes the sender could write if the sending application wanted to send enough data, but I later changed it to be the actual number of bytes it would write if the application didn’t try to send any more. This is because otherwise, with packet-based sends (where we only write when the buffer has enough space for the whole message at once) WriteLimit could go down. e.g. we think we can write another 3 bytes, but then the application decides to write 10 bytes and now we can’t write anything more.
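The difference between the two definitions can be seen in a small Python sketch (my own toy encoding of the two candidate definitions, not code from the spec; a packet-based send writes a message only if the whole message fits):

```python
def write_limit_optimistic(written, free):
    # First definition: bytes the sender could write if the application
    # offered a message that exactly filled the free space.
    return written + free

def write_limit_actual(written, free, next_msg_len):
    # Second definition: bytes the sender will actually write given the
    # message the application offers next; a packet that doesn't fit
    # writes nothing at all.
    return written + (next_msg_len if next_msg_len <= free else 0)

written, free = 0, 3
# We think we can write another 3 bytes...
assert write_limit_optimistic(written, free) == 3
# ...but the application decides to write 10 bytes, which doesn't fit,
# so now we can't write anything more: the first "limit" went down.
assert write_limit_actual(written, free, 10) == 0
```

Under the first definition the predicted limit silently drops from 3 to 0 the moment the application picks a 10-byte message, which is what makes it unsuitable as a stable prediction.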
The limit theorems above are useful properties, but it would be good to have more confidence that ReadLimit and WriteLimit are correct. I was able to prove some useful lemmas here.
First, ReceiverRead steps don’t change ReadLimit (as long as the receiver hasn’t closed the connection):
THEOREM ReceiverReadPreservesReadLimit ==
  ASSUME I, ReceiverLive, ReceiverRead
  PROVE  UNCHANGED ReadLimit
This gives us a good reason to think that ReadLimit is correct:
When the receiver is blocked it cannot read any more than it has without help.
ReadLimit is defined to be Len(Got) then, so ReadLimit is obviously correct for this case.
Since read steps preserve ReadLimit, this shows that ReadLimit is correct in all cases.
e.g. if ReadLimit = 5 and no other processes do anything, then we will end up in a state with the receiver blocked, and ReadLimit = Len(Got) = 5 and so we really did read a total of 5 bytes.
I was also able to prove that it never decreases (unless the receiver closes the connection):
THEOREM ReadLimitMonotonic ==
  ASSUME I, Next, ReceiverLive
  PROVE  ReadLimit' >= ReadLimit
So, if ReadLimit = n then it will always be at least n, and if the receiver ever blocks then it will have read at least n bytes.
I was able to prove similar properties about WriteLimit. So, I feel reasonably confident that these limit predictions are correct.
Disappointingly, we can’t actually prove Availability using TLAPS, because currently it understands very little temporal logic (see TLAPS limitations). However, I could show that the system can’t deadlock while there’s data to be transmitted:
(* We can't get into a state where the sender and receiver are both blocked
   and there is no wakeup pending: *)
THEOREM DeadlockFree1 ==
  ASSUME I
  PROVE  ~ /\ pc[SenderWriteID] = "sender_blocked"
           /\ ~SpaceAvailableInt
           /\ SenderLive
           /\ pc[ReceiverReadID] = "recv_await_data"
           /\ ~DataReadyInt
           /\ ReceiverLive
<1> SUFFICES ASSUME /\ pc[SenderWriteID] = "sender_blocked"
                    /\ ~SpaceAvailableInt
                    /\ SenderLive
                    /\ pc[ReceiverReadID] = "recv_await_data"
                    /\ ~DataReadyInt
                    /\ ReceiverLive
             PROVE  FALSE
    OBVIOUS
<1> NotifyFlagsCorrect BY DEF I
<1> NotifyRead BY DEF NotifyFlagsCorrect
<1> NotifyWrite
  <2> have = 0 BY DEF IntegrityI, I
  <2> QED BY DEF NotifyFlagsCorrect
<1> SenderInfoAccurate /\ ReaderInfoAccurate BY DEF I
<1> free = 0 BY DEF IntegrityI, I
<1> Len(Buffer) = BufferSize BY DEF SenderInfoAccurate
<1> Len(Buffer) = 0 BY DEF ReaderInfoAccurate
<1> QED BY BufferSizeType

(* We can't get into a state where the sender is idle and the receiver is
   blocked unless the buffer is empty (all data sent has been consumed): *)
THEOREM DeadlockFree2 ==
  ASSUME I,
         pc[SenderWriteID] = "sender_ready", SenderLive,
         pc[ReceiverReadID] = "recv_await_data", ReceiverLive,
         ~DataReadyInt
  PROVE  Len(Buffer) = 0
I’ve included the proof of DeadlockFree1 above:
To show deadlock can’t happen, it suffices to assume it has happened and show a contradiction.
If both processes are blocked then NotifyRead and NotifyWrite must both be set (because processes don’t block without setting them, and if they’d been unset then an interrupt would now be pending and we wouldn’t be blocked).
Since NotifyRead is still set, the sender is correct in thinking that the buffer is still full.
Since NotifyWrite is still set, the receiver is correct in thinking that the buffer is still empty.
That would be a contradiction, since BufferSize isn’t zero.
If it doesn’t deadlock, then some process must keep getting woken up by interrupts, which means that interrupts keep being sent. We only send interrupts after making progress (writing to the buffer or reading from it), so we must keep making progress. We’ll have to content ourselves with that argument.
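The argument above can also be sanity-checked outside TLA. Below is a brute-force search over every interleaving of a drastically simplified model of the handshake (one byte per transfer, each labelled statement collapsed into a single atomic step, all names mine - an illustration of the argument, not the real specification):

```python
from collections import deque

SIZE = 2
# State: (n, notify_read, notify_write, space_int, data_int, sender, receiver)
INIT = (0, False, False, False, False, "ready", "ready")

def steps(st):
    n, nr, nw, si, di, s, r = st
    nxt = []
    # --- sender: writes one byte at a time ---
    if s == "ready":
        if n < SIZE:
            # write a byte; if the receiver asked for a signal, raise one
            nxt.append((n + 1, nr, False, si, di or nw, "ready", r))
        else:
            # buffer full: set NotifyRead *before* re-checking
            nxt.append((n, True, nw, si, di, "checking", r))
    elif s == "checking":
        if n < SIZE or si:
            nxt.append((n, nr, nw, False, di, "ready", r))   # no need to block
        else:
            nxt.append((n, nr, nw, si, di, "blocked", r))
    elif si:  # blocked; wake on the space-available interrupt
        nxt.append((n, nr, nw, False, di, "ready", r))
    # --- receiver: symmetric, reads one byte at a time ---
    if r == "ready":
        if n > 0:
            nxt.append((n - 1, False, nw, si or nr, di, s, "ready"))
        else:
            nxt.append((n, nr, True, si, di, s, "checking"))
    elif r == "checking":
        if n > 0 or di:
            nxt.append((n, nr, nw, si, False, s, "ready"))
        else:
            nxt.append((n, nr, nw, si, di, s, "blocked"))
    elif di:
        nxt.append((n, nr, nw, si, False, s, "ready"))
    return nxt

seen, todo = {INIT}, deque([INIT])
while todo:
    st = todo.popleft()
    n, nr, nw, si, di, s, r = st
    # the DeadlockFree1 property: never both blocked with no wakeup pending
    assert not (s == "blocked" and not si and r == "blocked" and not di), st
    for t in steps(st):
        if t not in seen:
            seen.add(t)
            todo.append(t)
```

The key detail is the same one the proof turns on: each side sets its notify flag before re-checking the buffer, so whenever one side really blocks, the other side's next read or write is guaranteed to raise an interrupt.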
Experiences with TLAPS
The toolbox doesn’t come with the proof system, so you need to install it separately. The instructions are out-of-date and have a lot of broken links. In May, I turned the steps into a Dockerfile, which got it partly installed, and asked on the TLA group for help, but no-one else seemed to know how to install it either. By looking at the error messages and searching the web for programs with the same names, I finally managed to get it working in December. If you have trouble installing it too, try using my Docker image.
Once installed, you can write a proof in the toolbox and then press Ctrl-G, Ctrl-G to check it. On success, the proof turns green. On failure, the failing step turns red. You can also do the Ctrl-G, Ctrl-G combination on a single step to check just that step. That’s useful, because it’s pretty slow. It takes more than 10 minutes to check the complete specification.
TLA proofs are done in the mathematical style, which is to write a set of propositions and vaguely suggest that thinking about these will lead you to the proof. This is good for building intuition, but bad for reproducibility. A mathematical proof is considered correct if the reader is convinced by it, which depends on the reader. In this case, the “reader” is a collection of automated theorem-provers with various timeouts. This means that whether a proof is correct or not depends on how fast your computer is, how many programs are currently running, etc. A proof might pass one day and fail the next. Some proof steps consistently pass when you try them individually, but consistently fail when checked as part of the whole proof. If a step fails, you need to break it down into smaller steps.
Sometimes the proof system is very clever, and immediately solves complex steps. For example, here is the proof that the SenderClose process (which represents the sender closing the channel), preserves the invariant I:
LEMMA SenderClosePreservesI ==
  I /\ SenderClose => I'
<1> SUFFICES ASSUME I, SenderClose
             PROVE  I'
    OBVIOUS
<1> IntegrityI BY DEF I
<1> TypeOK BY DEF IntegrityI
<1> PCOK BY DEF IntegrityI
<1>1. CASE sender_open
  <2> USE <1>1 DEF sender_open
  <2> UNCHANGED << pc[SenderWriteID], pc[ReceiverReadID], pc[ReceiverCloseID] >> BY DEF PCOK
  <2> pc'[SenderCloseID] = "sender_notify_closed" BY DEF PCOK
  <2> TypeOK' BY DEF TypeOK
  <2> PCOK' BY DEF PCOK
  <2> IntegrityI' BY DEF IntegrityI
  <2> NotifyFlagsCorrect' BY DEF NotifyFlagsCorrect, I
  <2> QED BY DEF I, SenderInfoAccurate, ReaderInfoAccurate, CloseOK
<1>2. CASE sender_notify_closed
  <2> USE <1>2 DEF sender_notify_closed
  <2> UNCHANGED << pc[SenderWriteID], pc[ReceiverReadID], pc[ReceiverCloseID] >> BY DEF PCOK
  <2> pc'[SenderCloseID] = "Done" BY DEF PCOK
  <2> TypeOK' BY DEF TypeOK
  <2> PCOK' BY DEF PCOK
  <2> IntegrityI' BY DEF IntegrityI
  <2> NotifyFlagsCorrect' BY DEF NotifyFlagsCorrect, I
  <2> QED BY DEF I, SenderInfoAccurate, ReaderInfoAccurate, CloseOK
<1>3. QED BY <1>1, <1>2 DEF SenderClose
A step such as IntegrityI' BY DEF IntegrityI says “You can see that IntegrityI will be true in the next step just by looking at its definition”. So this whole lemma is really just saying “it’s obvious”. And TLAPS agrees.
At other times, TLAPS can be maddeningly stupid. And it can’t tell you what the problem is - it can only make things go red.
For example, this fails:
THEOREM ASSUME pc' = [pc EXCEPT ![1] = "l2"],
               pc[2] = "l1"
        PROVE  pc'[2] = "l1"
OBVIOUS
We’re trying to say that pc[2] is unchanged, given that pc' is the same as pc except that we changed pc[1]. The problem is that TLA is an untyped language. Even though we know we did a mapping update to pc, that isn’t enough (apparently) to conclude that pc is in fact a mapping. To fix it, you need:
THEOREM ASSUME pc \in [Nat -> STRING],
               pc' = [pc EXCEPT ![1] = "l2"],
               pc[2] = "l1"
        PROVE  pc'[2] = "l1"
OBVIOUS
The extra pc \in [Nat -> STRING] tells TLA the type of the pc variable. I found missing type information to be the biggest problem when doing proofs, because you just automatically assume that the computer will know the types of things. Another example:
THEOREM ASSUME NEW x \in Nat, NEW y \in Nat,
               x + Min(y, 10) = x + y
        PROVE  Min(y, 10) = y
OBVIOUS
We’re just trying to remove the x + ... from both sides of the equation. The problem is, TLA doesn’t know that Min(y, 10) is a number, so it doesn’t know whether the normal laws of addition apply in this case. It can’t tell you that, though - it can only go red. Here’s the solution:
THEOREM ASSUME NEW x \in Nat, NEW y \in Nat,
               x + Min(y, 10) = x + y
        PROVE  Min(y, 10) = y
BY DEF Min
The BY DEF Min tells TLAPS to share the definition of Min with the solvers. Then they can see that Min(y, 10) must be a natural number too and everything works.
Another annoyance is that sometimes it can’t find the right lemma to use, even when you tell it exactly what it needs. Here’s an extreme case:
LEMMA TransferFacts ==
  ASSUME NEW src, NEW src2,    \* (TLAPS doesn't cope with "NEW VARIABLE src")
         NEW dst, NEW dst2,
         NEW i \in 1..Len(src),
         src \in MESSAGE,
         dst \in MESSAGE,
         dst2 = dst \o Take(src, i),
         src2 = Drop(src, i)
  PROVE  /\ src2 \in MESSAGE
         /\ Len(src2) = Len(src) - i
         /\ dst2 \in MESSAGE
         /\ Len(dst2) = Len(dst) + i
         /\ UNCHANGED (dst \o src)
PROOF OMITTED

LEMMA SameAgain ==
  ASSUME NEW src, NEW src2,    \* (TLAPS doesn't cope with "NEW VARIABLE src")
         NEW dst, NEW dst2,
         NEW i \in 1..Len(src),
         src \in MESSAGE,
         dst \in MESSAGE,
         dst2 = dst \o Take(src, i),
         src2 = Drop(src, i)
  PROVE  /\ src2 \in MESSAGE
         /\ Len(src2) = Len(src) - i
         /\ dst2 \in MESSAGE
         /\ Len(dst2) = Len(dst) + i
         /\ UNCHANGED (dst \o src)
BY TransferFacts
TransferFacts states some useful facts about transferring data between two variables. You can prove that quite easily. SameAgain is identical in every way, and just refers to TransferFacts for the proof. But even with only one lemma to consider - one that matches all the assumptions and conclusions perfectly - none of the solvers could figure this one out!
My eventual solution was to name the bundle of results. This works:
TransferResults(src, src2, dst, dst2, i) ==
  /\ src2 \in MESSAGE
  /\ Len(src2) = Len(src) - i
  /\ dst2 \in MESSAGE
  /\ Len(dst2) = Len(dst) + i
  /\ UNCHANGED (dst \o src)

LEMMA TransferFacts ==
  ASSUME NEW src, NEW src2,
         NEW dst, NEW dst2,
         NEW i \in 1..Len(src),
         src \in MESSAGE,
         dst \in MESSAGE,
         dst2 = dst \o Take(src, i),
         src2 = Drop(src, i)
  PROVE  TransferResults(src, src2, dst, dst2, i)
PROOF OMITTED

LEMMA SameAgain ==
  ASSUME NEW src, NEW src2,
         NEW dst, NEW dst2,
         NEW i \in 1..Len(src),
         src \in MESSAGE,
         dst \in MESSAGE,
         dst2 = dst \o Take(src, i),
         src2 = Drop(src, i)
  PROVE  TransferResults(src, src2, dst, dst2, i)
BY TransferFacts
Most of the art of using TLAPS is in controlling how much information to share with the provers. Too little (such as failing to provide the definition of Min) and they don’t have enough information to find the proof. Too much (such as providing the definition of TransferResults) and they get overwhelmed and fail to find the proof.
It’s all a bit frustrating, but it does work, and being machine checked does give you some confidence that your proofs are actually correct.
Another, perhaps more important, benefit of machine checked proofs is that when you decide to change something in the specification you can just ask it to re-check everything. Go and have a cup of tea, and when you come back it will have highlighted in red any steps that need to be updated. I made a lot of changes, and this worked very well.
The TLAPS philosophy is that
If you are concerned with an algorithm or system, you should not be spending your time proving basic mathematical facts. Instead, you should assert the mathematical theorems you need as assumptions or theorems.
So even if you can't find a formal proof of every step, you can still use TLAPS to break it down into steps that you either can prove, or that you think are obvious enough that they don't require a proof. However, I was able to prove everything I needed for the vchan specification within TLAPS.
The final specification
I did a little bit of tidying up at the end. In particular, I removed the want variable from the specification. I didn’t like it because it doesn’t correspond to anything in the OCaml implementation, and the only place the algorithm uses it is to decide whether to set NotifyWrite, which I thought might be wrong anyway.
I changed this:
recv_got_len:
  if (have >= want) goto recv_read_data
  else NotifyWrite := TRUE;
to:
recv_got_len:
  either {
    if (have > 0) goto recv_read_data
    else NotifyWrite := TRUE;
  } or {
    NotifyWrite := TRUE;
  };
That always allows an implementation to set NotifyWrite if it wants to, or to skip that step just as long as have > 0. That covers the current C behaviour, my proposed C behaviour, and the OCaml implementation. It also simplifies the invariant, and even made the proofs shorter!
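As a quick sanity check, here is a small Python encoding (mine, not from the spec) of the relaxed rule, confirming that all three behaviours it needs to cover are instances of it:

```python
def allowed(have, goes_to_read, sets_notify):
    # The relaxed "either/or" step permits: always setting NotifyWrite,
    # or skipping it provided we go on to read and have > 0.
    return sets_notify or (goes_to_read and have > 0)

def current_c(have, want):    # if have >= want then read, else notify
    return (have >= want, have < want)

def proposed_c(have, want):   # if have > 0 then read, else notify
    return (have > 0, have == 0)

def ocaml(have, want):        # always sets NotifyWrite; reads if any data
    return (have > 0, True)

# Every combination of buffered data and requested length is covered.
for have in range(4):
    for want in range(1, 4):
        for impl in (current_c, proposed_c, ocaml):
            goes, sets = impl(have, want)
            assert allowed(have, goes, sets), (impl.__name__, have, want)
```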
I put the final specification online at spec-vchan. I also configured Travis CI to check all the models and verify all the proofs. That’s useful because sometimes I’m too impatient to recheck everything on my laptop before pushing updates.
You can generate a PDF version of the specification with make pdfs. Expressions there can be a little easier to read because they use proper symbols, but it also breaks things up into pages, which is highly annoying. It would be nice if it could omit the proofs too, as they’re really only useful if you’re trying to edit them. I’d rather just see the statement of each theorem.
The original bug
With my new understanding of vchan, I couldn’t see anything obvious wrong with the C code (at least, as long as you keep the connection open, which the firewall does).
I then took a look at ocaml-vchan. The first thing I noticed was that someone had commented out all the memory barriers, noting in the Git log that they weren’t needed on x86. I am using x86, so that’s not it, but I filed a bug about it anyway: Missing memory barriers.
The other strange thing I saw was the behaviour of the read function. It claims to implement the Mirage FLOW interface, which says that read “blocks until some data is available and returns a fresh buffer containing it”. However, looking at the code, what it actually does is to return a pointer directly into the shared buffer. It then delays updating the consumer counter until the next call to read. That’s rather dangerous, and I filed another bug about that: Read has very surprising behaviour. However, when I checked the mirage-qubes code, it just takes this buffer and makes a copy of it immediately. So that’s not the bug either.
Also, the original bug report mentioned a 10 second timeout, and neither the C implementation nor the OCaml one had any timeouts. Time to look at QubesDB itself.
QubesDB accepts messages from either the guest VM (the firewall) or from local clients connected over Unix domain sockets. The basic structure is:
while True:
  await vchan event, local client data, or 10 second timeout
  while vchan.receive_buffer non-empty:
    handle_vchan_data()
  for each ready client:
    handle_client_data()
The suspicion was that we were missing a vchan event, but then it was discovering that there was data in the buffer anyway due to the timeout. Looking at the code, it does seem to me that there is a possible race condition here:
A local client asks to send some data.
handle_client_data sends the data to the firewall using a blocking write.
The firewall sends a message to QubesDB at the same time and signals an event because the firewall-to-db buffer has data.
QubesDB gets the event but ignores it because it’s doing a blocking write and there’s still no space in the db-to-firewall direction.
The firewall updates its consumer counter and signals another event, because the buffer now has space.
The blocking write completes and QubesDB returns to the main loop.
QubesDB goes to sleep for 10 seconds, without checking the buffer.
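The interleaving above can be replayed as a short linear sketch (a toy encoding of the loop state in Python; variable names are mine, not the real QubesDB code):

```python
pending_event = False
buffer_has_data = False

# 1-2. A local client asks to send; handle_client_data starts a blocking
#      write (no space yet in the db-to-firewall direction).
# 3. The firewall sends a message at the same time and signals an event.
pending_event = True
buffer_has_data = True
# 4. The blocking-write loop gets the event but ignores it, because it is
#    only waiting for space in the other direction.
pending_event = False
# 5-6. The firewall frees space and signals again; the write completes.
pending_event = True
pending_event = False  # consumed while completing the blocking write
# 7. Back in the main loop: no event pending, so sleep for 10 seconds,
#    even though the receive buffer still holds unread data.
assert buffer_has_data and not pending_event
```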
I don’t think this is the cause of the bug though, because the only messages the firewall might be sending here are QDB_RESP_OK messages, and QubesDB just discards such messages.
I managed to reproduce the problem myself, and saw that in fact QubesDB doesn’t make any progress due to the 10 second timeout. It just tries to go back to sleep for another 10 seconds and then immediately gets woken up by a message from a local client. So, it looks like QubesDB is only sending updates every 10 seconds because its client, qubesd, is only asking it to send updates every 10 seconds! And looking at the qubesd logs, I saw stacktraces about libvirt failing to attach network devices, so I read the Xen network device attachment specification to check that the firewall implemented that correctly.
I’m kidding, of course. There isn’t any such specification. But maybe this blog post will inspire someone to write one…
Conclusions
As users of open source software, we’re encouraged to look at the source code and check that it’s correct ourselves. But that’s pretty difficult without a specification saying what things are supposed to do. Often I deal with this by learning just enough to fix whatever bug I’m working on, but this time I decided to try making a proper specification instead. Making the TLA specification took rather a long time, but it was quite pleasant. Hopefully the next person who needs to know about vchan will appreciate it.
A TLA specification generally defines two sets of behaviours. The first is the set of desirable behaviours (e.g. those where the data is delivered correctly). This definition should clearly explain what users can expect from the system. The second defines the behaviours of a particular algorithm. This definition should make it easy to see how to implement the algorithm. The TLC model checker can check that the algorithm’s behaviours are all acceptable, at least within some defined limits.
Writing a specification using the TLA notation forces us to be precise about what we mean. For example, in a prose specification we might say “data sent will eventually arrive”, but in an executable TLA specification we’re forced to clarify what happens if the connection is closed. I would have expected that if a sender writes some data and then closes the connection then the data would still arrive, but the C implementation of vchan does not always ensure that. The TLC model checker can find a counter-example showing how this can fail in under a minute.
To explain why the algorithm always works, we need to find an inductive invariant. The TLC model checker can help with this, by presenting examples of unreachable states that satisfy the invariant but don’t preserve it after taking a step. We must add constraints to explain why these states are invalid. This was easy for the Integrity invariant, which explains why we never receive incorrect data, but I found it much harder to prove that the system cannot deadlock. I suspect that the original designer of a system would find this step easy, as presumably they already know why it works.
Once we have found an inductive invariant, we can write a formal machine-checked proof that the invariant is always true. Although TLAPS doesn’t allow us to prove liveness properties directly, I was able to prove various interesting things about the algorithm: it doesn’t deadlock; when the sender is blocked, the receiver can read everything that has been sent; and when the receiver is blocked, the sender can fill the entire buffer.
Writing formal proofs is a little tedious, largely because TLA is an untyped language. However, there is nothing particularly difficult about it, once you know how to work around various limitations of the proof checkers.
You might imagine that TLA would only work on very small programs like libvchan, but this is not the case. It’s just a matter of deciding what to specify in detail. For example, in this specification I didn’t give any details about how ring buffers work, but instead used a single Buffer variable to represent them. For a specification of a larger system using vchan, I would model each channel using just Sent and Got and an action that transferred some of the difference on each step.
The TLA Toolbox has some rough edges. The ones I found most troublesome were: the keyboard shortcuts frequently stop working; when a temporal property is violated, it doesn’t tell you which one it was; and the model explorer tooltips appear right under the mouse pointer, preventing you from scrolling with the mouse wheel. It also likes to check its “news feed” on a regular basis. It can’t seem to do this at the same time as other operations, and if you’re in the middle of a particularly complex proof checking operation, it will sometimes suddenly pop up a box suggesting that you cancel your job, so that it can get back to reading the news.
However, it is improving. In the latest versions, when you get a syntax error, it now tells you where in the file the error is. And pressing Delete or Backspace while editing no longer causes it to crash and lose all unsaved data. In general I feel that the TLA Toolbox is quite usable now. If I were designing a new protocol, I would certainly use TLA to help with the design.
TLA does not integrate with any language type systems, so even after you have a specification you still need to check manually that your code matches the spec. It would be nice if you could check this automatically, somehow.
One final problem is that whenever I write a TLA specification, I feel the need to explain first what TLA is. Hopefully it will become more popular and that problem will go away.
javatutorialcorner · 8 years ago
tar Linux Command
tar
tar [options] [tarfile] [other-files]
Description
Copy files to or restore files from an archive medium. If any files are directories, tar acts on the entire subtree. Options need not be preceded by - (though they may be). The exception to this rule is when you are using a long-style option (such as --modification-time). In that case, the exact syntax is:
tar --long-option -function-options files
For example:
tar --modification-time -xvf tarfile.tar
Function options
You must use exactly one of these, and it must come before any other options:
-c, --create
Create a new archive.
-d, --diff, --compare
Compare the files stored in tarfile with other-files. Report any differences: missing files, different sizes, different file attributes (such as permissions or modification time).
--delete
Delete from the archive. This option cannot be used with magnetic tape.
-r, --append
Append other-files to the end of an existing archive.
-t, --list
Print the names of other-files if they are stored on the archive (if other-files are not specified, print names of all files).
-u, --update
Add files if not in the archive or if modified.
-x, --extract, --get
Extract other-files from an archive (if other-files are not specified, extract all files).
-A, --catenate, --concatenate
Concatenate a second tar file to the end of the first.
Options
[drive] [density]
Set drive (0-7) and storage density (l, m, or h, corresponding to low, medium, or high). Not available in all versions of tar.
--anchored
Exclude patterns must match the start of the filename (the default).
--atime-preserve
Preserve original access time on extracted files.
-b n, --blocking-factor=n
Set block size to n × 512 bytes.
--backup[=type]
Back up files rather than deleting them. If no backup type is specified, a simple backup is made with ~ as the suffix. (See also --suffix.) The possible values of type are:
  t, numbered
  Make numbered backups.
  nil, existing
  Make numbered backups if there are already numbered backups; otherwise make simple backups.
  never, simple
  Always make simple backups.
--checkpoint
List directory names encountered.
--exclude=pattern
Remove files matching pattern from any list of files.
-f file, --file=file
Store files in or extract files from archive file. Note that file may take the form hostname:filename.
--force-local
Interpret filenames in the form hostname:filename as local files.
-g file, --listed-incremental=file
Create new-style incremental backup.
--group=group
Use group as the group for files added to the archive.
-h, --dereference
Dereference symbolic links, and archive the files they point to rather than the symbolic link.
--help
Print help message and exit.
-i, --ignore-zeros
Ignore zero-sized blocks (i.e., EOFs).
--ignore-case
Ignore case when excluding files.
--ignore-failed-read
Ignore unreadable files to be archived. Default behavior is to exit when encountering these.
-j, --I, --bzip
Compress files with bzip2 before archiving them, or uncompress them with bunzip2 before extracting them.
-l, --one-file-system
Do not archive files from other filesystems.
-k, --keep-old-files
When extracting files, do not overwrite files with similar names. Instead, print an error message.
-m, --modification-time, --touch
Do not restore file modification times; update them to the time of extraction.
--mode=permissions
Use permissions when adding files to an archive. The permissions are specified the same way as for the chmod command.
--newer-mtime=date
Add only files whose contents have changed since date to the archive.
--no-anchor
Exclude patterns may match anything following a slash.
--no-ignore-case
Do not ignore case when excluding files.
--no-same-permissions
Do not extract permissions information when extracting files from the archive. This is the default for users, and therefore affects only the superuser.
--no-recursion
Do not move recursively through directories.
--no-same-owner
When extracting, create files with yourself as owner.
--no-wildcards
Don't use wildcards when excluding files; treat patterns as strings.
--no-wildcards-match-slash
Wildcards do not match / when excluding files.
--null
Allow filenames to be null-terminated with -T. Override -C.
--numeric-owner
Use the numeric owner and group IDs rather than the names.
-o, --old-archive, --portability
Create old-style archive in Unix V7 rather than ANSI format.
--overwrite
Overwrite existing files and directory metadata when extracting from archive.
--overwrite-dir
Overwrite existing directory metadata when extracting from archive.
--owner=owner
Set owner as the owner of extracted files instead of the original owner. owner is first assumed to be a username, then, if there is no match, a numeric user ID.
-p, --same-permissions, --preserve-permissions
Keep permissions of extracted files the same as the originals.
--posix
Create a POSIX-compliant archive.
--preserve
Equivalent to invoking both the -p and -s options.
--record-size=size
Treat each record as having size bytes, where size is a multiple of 512.
--recursion
Move recursively through directories.
--recursive-unlink
Remove existing directory hierarchies before extracting directories with the same name.
--remove-files
Remove originals after inclusion in archive.
--rsh-command=command
Do not connect to remote host with rsh; instead, use command.
-s, --same-order, --preserve-order
When extracting, sort filenames to correspond to the order in the archive.
--same-owner
When extracting, create files with the same ownership as the originals.
--show-omitted-dirs
List directories being omitted when operating on an archive.
--suffix=suffix
Use suffix instead of the default ~ when creating a backup file.
--totals
Print byte totals.
--use-compress-program=program
Compress archived files with program, or uncompress extracted files with program.
-v, --verbose
Verbose. Print filenames as they are added or extracted.
--version
Print version information and exit.
--volno-file=file
Use/update the volume number in file.
-w, --interactive, --confirmation
Wait for user confirmation (y) before taking any actions.
--wildcards
Use wildcards when excluding files.
--wildcards-match-slash
Wildcards match / when excluding files.
-z, --gzip, --gunzip, --ungzip
Compress files with gzip before archiving them, or uncompress them with gunzip before extracting them.
-B, --read-full-records
Reblock while reading; used for reading from 4.2BSD pipes.
-C directory, --directory=directory
cd to directory before beginning tar operation.
-F script, --info-script=script, --new-volume-script=script
Implies -M (multiple archive files). Run script at the end of each file.
-G, --incremental
Create old-style incremental backup.
-K file, --starting-file=file
Begin tar operation at file in archive.
-L length, --tape-length=length
Write a maximum of length × 1024 bytes to each tape.
-M, --multivolume
Expect archive to be multivolume. With -c, create such an archive.
-N date, --newer=date, --after-date=date
Ignore files older than date.
-O, --to-stdout
Print extracted files to standard output.
-P, --absolute-names
Do not remove initial slashes (/) from input filenames.
-R, --block-number
Display archive's block number in messages.
-S, --sparse
Treat sparse files more efficiently when adding to archive.
-T file, --files-from=file
Consult file for files to extract or create.
-U, --unlink-first
Remove each existing file from the filesystem before extracting from the archive.
-V name, --label=name
Name this volume name.
-W, --verify
Check archive for corruption after creation.
-X file, --exclude-from file
Consult file for list of files to exclude.
-Z, --compress, --uncompress
Compress files with compress before archiving them, or uncompress them with uncompress before extracting them.
Examples
Create an archive of /bin and /usr/bin (c), show the command working (v), and store on the tape in /dev/rmt0:
tar cvf /dev/rmt0 /bin /usr/bin
List the tape's contents in a format like ls -l:
tar tvf /dev/rmt0
Extract the /bin directory:
tar xvf /dev/rmt0 /bin
Create an archive of the current directory and store it in a file backup.tar:
tar cvf - `find . -print` > backup.tar
(The - tells tar to store the archive on standard output, which is then redirected.)
Filter an archive through gzip, extracting the contents but leaving the original file compressed:
tar xvfz chapters.tar.gz
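The c (create), t (list), and x (extract) function options can also be exercised end-to-end in a scratch directory (all paths below are examples, not taken from the reference):

```shell
set -e
dir=$(mktemp -d)          # scratch directory for the round trip
cd "$dir"
mkdir src
echo "hello" > src/a.txt
tar cf backup.tar src                     # create an archive of src
tar tf backup.tar | grep -q 'src/a.txt'   # list it and check the entry
mkdir restore
cd restore
tar xf ../backup.tar                      # extract into a new directory
grep -q hello src/a.txt                   # the contents survived
cd /
rm -rf "$dir"
```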
howdoitechtk-blog · 8 years ago
An introduction to Linux's EXT4 filesystem
In previous articles about Linux filesystems, I wrote an introduction to Linux filesystems and covered some higher-level concepts, such as everything is a file. I want to go into more detail about the specifics of the EXT filesystems, but first, let's answer the question, "What is a filesystem?" A filesystem is all of the following:
Data storage: The primary function of any filesystem is to be a structured place to store and retrieve data.
Namespace: A naming and organizational methodology that provides rules for naming and structuring data.
Security model: A scheme for defining access rights.
API: System function calls to manipulate filesystem objects like directories and files.
Implementation: The software to implement the above.
This article focuses on the principal thing in the rundown and investigates the metadata structures that give the legitimate system to information stockpiling in an EXT filesystem.
EXT filesystem history
Although written for Linux, the EXT filesystem has its roots in the Minix operating system and the Minix filesystem, which predate Linux by about five years, being first released in 1987. Understanding the EXT4 filesystem is much easier if we look at the history and technical evolution of the EXT filesystem family from its Minix roots.
Minix
When writing the original Linux kernel, Linus Torvalds needed a filesystem but did not want to write one then. So he simply included the Minix filesystem, which had been written by Andrew S. Tanenbaum and was a part of Tanenbaum's Minix operating system. Minix was a Unix-like operating system written for educational purposes. Its code was freely available and appropriately licensed to allow Torvalds to include it in his first version of Linux.
Minix has the following structures, most of which are located in the partition where the filesystem is generated:
A boot sector in the first sector of the hard drive on which it is installed. The boot block includes a very small boot record and a partition table.
The first block in each partition is a superblock that contains the metadata that defines the other filesystem structures and locates them on the physical disk assigned to the partition.
An inode bitmap block, which determines which inodes are used and which are free.
The inodes, which have their own space on the disk. Each inode contains information about one file, including the locations of the data blocks, i.e., zones belonging to the file.
A zone bitmap to keep track of the used and free data zones.
A data zone, in which the data is actually stored.
For both types of bitmaps, one bit represents one specific data zone or one specific inode. If the bit is zero, the zone or inode is free and available for use, but if the bit is one, the data zone or inode is in use.
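The bitmap idea can be sketched with shell arithmetic. This is a toy model only: one example byte stands in for a bitmap, and the most significant bit is treated as zone 0 purely for illustration (real filesystems define their own bit ordering within each byte).

```shell
# Toy bitmap byte: 0xB0 = 1011 0000, so zones 0, 2, and 3 are "in use"
bitmap=$(( 0xB0 ))
for i in 0 1 2 3 4 5 6 7; do
  # Shift the byte so zone i's bit is in the low position, then mask it
  if [ $(( (bitmap >> (7 - i)) & 1 )) -eq 1 ]; then
    echo "zone $i: in use"
  else
    echo "zone $i: free"
  fi
done
```

The same test-one-bit logic is what the filesystem does when it searches the bitmap for a free zone or inode to allocate.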
What is an inode? Short for index node, an inode is a 256-byte block on the disk that stores data about the file. This includes the file's size; the user IDs of the file's user and group owners; the file mode (i.e., the access permissions); and three timestamps specifying the time and date that: the file was last accessed, last modified, and the data in the inode was last modified.
The inode also contains data that points to the location of the file's data on the hard drive. In Minix and the EXT1-3 filesystems, this is a list of data zones or blocks. The Minix filesystem inodes supported nine data blocks, seven direct and two indirect. If you'd like to learn more, there is an excellent PDF with a detailed description of the Minix filesystem structure and a quick overview of the inode pointer structure on Wikipedia.
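On a live Linux system, the stat command shows most of this inode metadata for any file; a quick sketch (the -c format syntax shown is GNU coreutils):

```shell
touch demo.txt            # create a fresh, zero-byte example file

stat demo.txt             # full dump: size, mode, owner, the timestamps, inode number
stat -c 'inode=%i size=%s mode=%a' demo.txt   # pick out individual fields
```

Everything stat prints here comes straight from the file's inode; the filename itself is not in the inode at all, which is why hard links work.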
EXT
The original EXT filesystem (Extended) was written by Rémy Card and released with Linux in 1992 to overcome some size limitations of the Minix filesystem. The primary structural changes were to the metadata of the filesystem, which was based on the Unix filesystem (UFS), which is also known as the Berkeley Fast File System (FFS). I found very little published information about the EXT filesystem that can be verified, apparently because it had significant problems and was quickly superseded by the EXT2 filesystem.
EXT2
The EXT2 filesystem was quite successful. It was used in Linux distributions for many years, and it was the first filesystem I encountered when I started using Red Hat Linux 5.0 back in about 1997. The EXT2 filesystem has essentially the same metadata structures as the EXT filesystem, but EXT2 is more forward-looking, in that a lot of disk space is left between the metadata structures for future use.
Like Minix, EXT2 has a boot sector in the first sector of the hard drive on which it is installed, which includes a very small boot record and a partition table. Then there is some reserved space after the boot sector, which spans the space between the boot record and the first partition on the hard drive, usually on the next cylinder boundary. GRUB2 (and possibly GRUB1) uses this space for part of its boot code.
The space in each EXT2 partition is divided into cylinder groups that allow for more granular management of the data space. In my experience, the group size usually amounts to about 8MB. Figure 1, below, shows the basic structure of a cylinder group. The data allocation unit in a cylinder group is the block, which is usually 4K in size.
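You can confirm the block size of a mounted filesystem with GNU stat's filesystem mode; on a typical ext4 root it reports 4096:

```shell
# -f queries the filesystem (statfs) rather than a file's inode;
# %s is the block size field (GNU coreutils stat)
stat -f -c 'block size: %s' /
```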
Figure 1: The structure of a cylinder group in the EXT filesystems
The first block in the cylinder group is a superblock, which contains the metadata that defines the other filesystem structures and locates them on the physical disk. Some of the additional groups in the partition will have backup superblocks, but not all. A damaged superblock can be replaced by using a disk utility such as dd to copy the contents of a backup superblock to the primary superblock. It does not happen often, but once, many years ago, I had a damaged superblock, and I was able to restore its contents using one of the backup superblocks. Fortunately, I had been foresighted and used the dumpe2fs command to dump the descriptor information of the partitions on my system.
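The recovery idea can be sketched without risking a real disk by building a small ext2 image in an ordinary file (this assumes mkfs.ext2 and dumpe2fs from e2fsprogs are installed; /dev/sdXN below is a placeholder device name, not a real one):

```shell
# Build an 8 MB ext2 filesystem inside a regular file -- no root needed
dd if=/dev/zero of=fs.img bs=1M count=8 status=none
mkfs.ext2 -F -q fs.img         # -F: proceed even though fs.img is not a block device

# Ask dumpe2fs where the primary and backup superblocks live
dumpe2fs fs.img 2>/dev/null | grep -i superblock

# On a real damaged partition you would hand one of those block numbers
# to e2fsck, for example:
#   e2fsck -b 32768 /dev/sdXN
```

Never run the repair step against a mounted filesystem; the image-file approach above is safe precisely because nothing real is at stake.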
Following is the partial output from the dumpe2fs command. It displays the metadata contained in the superblock, as well as data about each of the first two cylinder groups in the filesystem.
# dumpe2fs /dev/sda1
Filesystem volume name: boot
Last mounted on: /boot
Filesystem UUID: 79fc5ed8-5bbc-4dfe-8359-b7b36be6eed3
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 122160
Block count: 488192
Reserved block count: 24409
Free blocks: 376512
Free inodes: 121690
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Reserved GDT blocks: 238
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8144
Inode blocks per group: 509
Flex block group size: 16
Filesystem created: Tue Feb 7 09:33:34 2017
Last mount time: Sat Apr 29 21:42:01 2017
Last write time: Sat Apr 29 21:42:01 2017
Mount count: 25
Maximum mount count: -1
Last checked: Tue Feb 7 09:33:34 2017
Check interval: 0 (<none>)
Lifetime writes: 594 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: c780bac9-d4bf-4f35-b695-0fe35e8d2d60
Journal backup: inode blocks
Journal features: journal_64bit
Journal size: 32M
Journal length: 8192
Journal sequence: 0x00000213
Journal start: 0
Group 0: (Blocks 0-32767)
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-239
Block bitmap at 240 (+240)
Inode bitmap at 255 (+255)
Inode table at 270-778 (+270)
24839 free blocks, 7676 free inodes, 16 directories
Free blocks: 7929-32767
Free inodes: 440, 470-8144
Group 1: (Blocks 32768-65535)
Backup superblock at 32768, Group descriptors at 32769-32769
Reserved GDT blocks at 32770-33007
Block bitmap at 241 (bg #0 + 241)
Inode bitmap at 256 (bg #0 + 256)
Inode table at 779-1287 (bg #0 + 779)
8668 free blocks, 8142 free inodes, 2 directories
Free blocks: 33008-33283, 33332-33791, 33974-33975, 34023-34092, 34094-34104, 34526-34687, 34706-34723, 34817-35374, 35421-35844, 35935-36355, 36357-36863, 38912-39935, 39940-40570, 42620-42623, 42655, 42674-42687, 42721-42751, 42798-42815, 42847, 42875-42879, 42918-42943, 42975, 43000-43007, 43519, 43559-44031, 44042-44543, 44545-45055, 45116-45567, 45601-45631, 45658-45663, 45689-45695, 45736-45759, 45802-45823, 45857-45887, 45919, 45950-45951, 45972-45983, 46014-46015, 46057-46079, 46112-46591, 46921-47103, 49152-49395, 50027-50355, 52237-52255, 52285-52287, 52323-52351, 52383, 52450-52479, 52518-52543, 52584-52607, 52652-52671, 52734-52735, 52743-53247
Free inodes: 8147-16288
Group 2: (Blocks 65536-98303)
Block bitmap at 242 (bg #0 + 242)
Inode bitmap at 257 (bg #0 + 257)
Inode table at 1288-1796 (bg #0 + 1288)
6326 free blocks, 8144 free ino
The latest 4.13.9 source release of the Linux kernel is 780 MiB, but thanks to xz compression, the download is a much more manageable 96 MiB (an 88% reduction).
Before xz took over as the default compression format on kernel.org in 2013, following the "latest" link would have gotten you a bzip2 compressed file. The tar.bz2 would have been 115 MiB (-85%), but there was no defending the extra 20 MiB after xz caught up in popularity. bzip2 is all but displaced today.
bzip2 became the default in 2003, though it had long been an option over the less efficient gzip. However, since every OS, browser, language core library, phone and IoT lightswitch has built-in support for gzip, a 148 MiB (-81%) tar.gz remains an option even today.
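Those relative standings are easy to reproduce on a small scale. A quick sketch, assuming gzip and xz are installed (exact byte counts vary with tool versions and input data; kernel sources compress far better than this synthetic file):

```shell
# Generate some compressible data and compare the compressors discussed above
seq 1 100000 > data.txt

wc -c < data.txt               # original size
gzip -c data.txt | wc -c       # gzip: decent ratio, universal support
xz -c data.txt | wc -c         # xz: smaller still, at a higher CPU cost
```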
gzip itself started taking over in 1994, before kernel.org, and before the World Wide Web went mainstream. It must have been a particularly easy sell for the fledgling Linux kernel: it was made, used and endorsed by the mighty GNU project, it was Free Software, free of patent restrictions, and it provided powerful .zip style DEFLATE compression in a Unix friendly package.
Another nice benefit was that gzip could decompress other contemporary formats, thereby replacing contested and proprietary software.
Among the tools it could replace was compress, the de-facto Unix standard at the time. Created based on LZW in 1985, it was hampered by the same patent woes that plagued GIF files. The then-ubiquitous .Z suffix graced the first public Linux releases, but is now recognized only by the most long-bearded enthusiasts. The current release would have been 302 MiB (-61%) with compress.
Another even more obscure tool it could replace was compress‘s own predecessor, pack. This rather loosely defined collection of only partially compatible formats is why compress had to use a capital Z in its extension. pack came first, and offered straight Huffman coding with a .z extension.
With pack, our Linux release would have been 548 MiB (-30%). Compared to xz‘s 96 MiB, it’s obvious why no one has used it for decades.
Well, guess what: gzip never ended its support! Quoth the man page,
gunzip can currently decompress files created by gzip, zip, compress, compress -H or pack.
While multiple implementations existed, these were common peculiarities:
They could not be used in pipes.
They could not represent empty files.
They could not compress a file with only one byte value, e.g. "aaaaaa…"
They could fail on "large" files. "can’t occur unless [file size] >= [16MB]", a comment said dismissively, from the time when a 10MB hard drive was a luxury few could afford.
These issues stemmed directly from the Huffman coding used. Huffman coding, developed in 1952, is basically an improvement on Morse code, where common characters like "e" get a short code like "011", while uncommon "z" gets a longer one like "111010".
Since you have to count the characters to figure out which are common, you can not compress in a single pass in a pipe. Now that memory is cheap, you could mostly get around that by keeping the data in RAM.
Empty files and single-valued files hit an edge case: if you only have a single value, the shortest code for it is the empty string. Decompressors that didn’t account for it would get stuck reading 0 bits forever. You can get around it by adding unused dummy symbols to ensure a minimum bit length of 1.
A file over 16MB could cause a single character to be so rare that its bit code was 25+ bits long. A decompressor storing the bits to be decoded in a 32bit value (a trick even gzip uses) would be unable to append a new 8bit byte to the buffer without displacing part of the current bit code. You can get around that by using "package merge" length restricted prefix codes over naive Huffman codes.
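By contrast, modern gzip's DEFLATE sails through all three of these historical edge cases; a quick sanity check in the shell:

```shell
# pack could not run in pipes, could not represent empty files, and could
# choke on single-symbol or very large inputs. gzip handles all of them:
printf '' | gzip | gunzip | wc -c          # empty input round-trips; prints 0
printf 'aaaaaa' | gzip | gunzip            # single-symbol input round-trips
head -c 20000000 /dev/zero | gzip | gunzip | wc -c   # well past 16 MB; prints 20000000
```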
I wrote a Haskell implementation with all these fixes in place: koalaman/pack is available on GitHub.
During development, I found that pack support in gzip had been buggy since 2012 (version 1.6), but no one had noticed in the five years since. I tracked down the problem and I’m happy to say that version 1.9 will again restore full pack support!
Anyways, what could possibly be the point of using pack today?
There is actually one modern use case: code golfing.
This post came about because I was trying to implement the shortest possible program that would output a piece of simple ASCII art. A common trick is variations of a self-extracting shell script:
sed 1d $0|gunzip;exit <compressed binary data here>
You can use any available compressor, including xz and bzip2, but these were meant for bigger files and have game ruining overheads. Here’s the result of compressing the ASCII art in question:
raw: 269 bytes
xz: 216 bytes
bzip2: 183 bytes
gzip: 163 bytes
compress: 165 bytes
and finally, pack: 148 bytes!
I was able to save 15 bytes by leveraging gzip‘s forgotten legacy support. This is huge in a sport where winning entries are bytes apart.
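The container overhead that rules out xz and bzip2 here is easy to observe directly on tiny inputs; a rough sketch (byte counts vary slightly across versions, but the ordering holds):

```shell
# Fixed headers and trailers dominate when the payload is only a few bytes
printf 'hello' | gzip | wc -c     # roughly 25 bytes out for 5 bytes in
printf 'hello' | xz | wc -c       # roughly 60 bytes: the format alone loses the game
```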
Let’s have a look at this simple file format. Here’s an example pack file header for the word "banana":
1f 1e       -- Two byte magic header
00 00 00 06 -- Original, uncompressed length (6 bytes)
Next comes the Huffman tree. Building it is simple to do by hand, but too much for this post. It just needs to be complete, left-aligned, with eof on the right at the deepest level. Here’s the optimal tree for this string:
     /\
    /  a
   /\
  /  \
 /\   n
b  eof
We start by encoding its depth (3), and the number of leaves on each level. The last level is encoded minus 2, because the lowest level will have between 2 and 257 leaves, while a byte can only store 0-255.
03 -- depth
01 -- level 1 only contains 'a'
01 -- level 2 only contains 'n'
00 -- level 3 contains 'b' and 'eof', -2 as mentioned
Next we encode the ASCII values of the leaves in the order from top to bottom, left to right. We can leave off the EOF (which is why it needs to be in the lower right):
61 6e 62 -- "a", "n", "b"
This is enough for the decompressor to rebuild the tree. Now we go on to encode the actual data.
Starting from the root, the Huffman codes are determined by adding a 0 for ever left branch and 1 for every right branch you have to take to get to your value:
a   -> right            = 1
n   -> left+right       = 01
b   -> left+left+left   = 000
eof -> left+left+right  = 001
banana<eof> would therefore be 000 1 01 1 01 1 001, or when grouped as bytes:
16 -- 0001 0110
C8 -- 1100 1 (000 as padding)
And that’s all we need:
$ printf '\x1f\x1e\x00\x00\x00\x06'\
'\x03\x01\x01\x00\x61\x6e\x62\x16\xc8' | gzip -d
banana
Unfortunately, the mentioned gzip bug triggers due to failing to account for leading zeroes in bit code. eof and a have values 001 and 1, so an oversimplified equality check confuses one for the other, causing gzip to terminate early:
b
gzip: stdin: invalid compressed data--length error
However, if you’re stuck with an affected version, there’s another party trick you can do: the Huffman tree has to be canonical, but it does not have to be optimal!
What would happen if we skipped the count and instead claimed that each ASCII character is equally likely? Why, we’d get a tree of depth 8 where all the leaf nodes are on the deepest level.
It then follows that each 8 bit character will be encoded as 8 bits in the output file, with the bit patterns we choose by ordering the leaves.
Let’s add a header with a dummy length to a file:
$ printf '\x1F\x1E____' > myfile.z
Now let’s append the afforementioned tree structure, 8 levels with all nodes in the last one:
$ printf '\x08\0\0\0\0\0\0\0\xFE' >> myfile.z
And let’s populate the leaf nodes with 255 bytes in an order of our choice:
$ printf "$(printf '\\%o' {0..254})" | tr 'A-Za-z' 'N-ZA-Mn-za-m' >> myfile.z
Now we can run the following command, enter some text, and hit Ctrl-D to "decompress" it:
$ cat myfile.z - | gzip -d 2> /dev/null
Jr unir whfg pbaivaprq TMvc gb hafpenzoyr EBG13!<Ctrl+D>
We have just convinced GZip to unscramble ROT13!
Can you think of any other fun ways to use or abuse gzip‘s legacy support? Post a comment.