AGS data – The perils of the extended ASCII character set

The AGS data format is over 30 years old which, for a file format in the digital/data domain, is a venerable old age! Having said that, with help from a few upgrades over the years, it is still in great shape and is probably used more than it ever has been. However, as our digital processes get smarter, from time to time we uncover problems within the AGS format itself.

In this blog I am going to talk in some detail about an issue that is not new, but it has come to the fore recently, mainly as a result of work on the AGS validator which I have discussed in a past blog. The subject is character sets, in particular ASCII and the tricky topic of ‘extended ASCII’. I will also talk about encoding, which is not mentioned at all in the AGS format rules, but it turns out to be pretty damn important.

If you are reading this intro and thinking that this does not apply to your data, then please think again because what I am about to talk about could potentially affect any AGS file. In particular, if you regularly come across files or imported data infected with unexpected gobbledygook, then you should definitely read on because I will be explaining the root cause of this.

A couple more things before I start…

The AGS Data Management Working Group is responsible for maintaining the AGS data format. Whilst I am an active member of this group, this article presents my personal thoughts on the subject. It is not intended as an official view/opinion from the AGS DMWG.

This article is rather long and detailed, and quite technical in places. I will do my best to explain things in plain English, so you should not need to be a digital expert to follow this. However, you may want to grab a coffee first!

AGS Rule 1 – Are we extending ASCII or not?

We will start our journey at AGS format rule 1, which states:

“The data file shall be entirely composed of ASCII characters”

Sounds clear enough? Alas, no. Anyone who knows anything about ASCII will immediately realise that there is a problem here. In particular, some may be wondering: does this mean extended ASCII is permitted? Other more enlightened and/or mischievous readers may prefer to point out that there is no such thing as extended ASCII. I will address that nuance in the next section.

Let’s assume for now that there is an extended ASCII character set, in which case the answer to the original question is YES, use of the extended ASCII character set is permitted in AGS data. This is not clearly stated in the AGS 4.1.1 documentation but it can be inferred from the following:

In the old AGS3 and AGS3.1 documentation, Rule 1 included the following:

“The extended ASCII character set must not be used.”

The change log for AGS4 (v4.0.3) includes:

“The full ASCII character set now permitted (Rule 1).”

In other words, extended ASCII was not permitted in AGS3, but it is now in AGS4. In practice, that means that symbols such as ° ± ² µ ½ can now be used.

It is fair to say that the AGS docs could be a lot clearer.

So, we have sorted that out. Does that mean the end of this article already? I’m afraid not. We are just warming up.

But there is no such thing as extended ASCII!

If you look up ‘extended ASCII’ on the internet you will come across articles that deny its existence, but also some that affirm its right to live amongst us, albeit with a caveat.

To cut to the chase, my preferred interpretation of a convoluted situation is that extended ASCII does exist, but there is no single unambiguous definition of it. It turns out that many variants exist, some say over 200, although in practice there are only a few that the average user is likely to come across.

I am not going to go into full technical detail about why this is the case as there is much written about the subject online. If you wish to find out more then I recommend the article below. It is long and detailed, but also fairly entertaining, and it is still relevant today despite being about 20 years old. The only thing I would add is that Unicode and UTF-8 have become even more dominant since this was written.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The above article explains how the original ASCII character set comprises 128 characters at ‘code points’ 0 to 127, and the entire world is in agreement with regard to what these characters are. What is commonly known as ‘extended ASCII’ is code points 128 to 255. Here things go a bit awry as different character sets have been used for these same code points, at different times, in different places and by different people. In practice, the character set for this range was, and in some cases still is, defined by a ‘code page’. These code pages were/are specific to each and every computer, but in practice the same default code pages are used in a particular country or region of the world. For example, users in the UK are most likely to be using Windows code page 1252. Non-UK users may be using different code pages that accommodate foreign language characters.
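
To make this concrete, here is a minimal Python sketch (Python is simply my choice for illustration, and the four code pages are examples picked from Python’s standard codecs, not anything specified by AGS). It decodes the same single byte, value 233, under each code page to show that the character you get depends entirely on which code page is assumed.

```python
# One byte value in the 'extended ASCII' range (decimal 233, hex E9).
raw = bytes([0xE9])

# A few example code pages, using Python's standard codec names:
# Windows western Europe, original IBM PC, Windows Cyrillic, ISO Greek.
code_pages = ["cp1252", "cp437", "cp1251", "iso8859_7"]

# Decode the same byte under each code page.
decoded = {name: raw.decode(name) for name in code_pages}

for name, char in decoded.items():
    print(f"{name:>10}: {char!r} (Unicode code point U+{ord(char):04X})")
```

The byte in the file never changes. Only the interpretation does, which is why a file can look fine on one machine and wrong on another.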

The good news is that we now have something called Unicode, a universally accepted and adopted character set. This has been around for some time now and it is used by modern software and applications. However, the Unicode character set is much much larger than ASCII and extended ASCII. This means that it cannot be used to its full extent for AGS data at present.

So, what character set or code page does AGS require for extended ASCII? The answer is: undefined. It is left to the users of AGS data to determine and provide any further clarification that may be required. This is not written anywhere, but it is my understanding. I have been made aware of projects where such clarification has been provided to facilitate use of a non-standard character set to accommodate the characters required for a particular language. In practice, this is rarely explicitly considered and the vast majority of AGS users are likely just crossing their fingers and hoping for the best when venturing into the extended ASCII range. They may then huff and puff a bit when things go wrong. Of course, most of the time it does not matter as we do not often have to extend into this range, and many wisely choose not to. However, there is plenty of data out there that does use such characters, sometimes inadvertently, and as a result we often find our AGS files, databases and/or applications infected with apparent gobbledygook.

Having now identified that there is a potential problem, let’s explore it in more detail…

Extended ASCII in AGS files – a deep dive

In practice, there are three different ‘extended ASCII’ character sets that may be encountered by most users in the UK:

  • ISO-8859-1 (Latin-1)
  • Windows-1252 (western Europe) code page
  • Unicode

Comparing the above, we find that, for code points:

  • 0-127: they are all the same, i.e. Unicode = ISO-8859-1 = Windows-1252 = ASCII (original/base).
  • 128 – 159: all different! In Unicode these are non-printable (control) characters. In ISO-8859-1 (Latin-1) they are not used. Windows-1252 uses this range for some printable characters.
  • 160 – 255: all the same, i.e. Unicode = ISO-8859-1 = Windows-1252. This range includes: £ © ° ± ² ³ µ « » ¼ ½ ¾.
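
The comparison above can be checked with a short Python sketch. It relies on Python’s built-in "latin-1" and "cp1252" codecs as stand-ins for ISO-8859-1 and Windows-1252. The helper name `compare` is my own, and Python’s Latin-1 codec maps bytes 128-159 to the Unicode control characters rather than leaving them unused, so treat this as an approximation of the standards rather than a definitive test.

```python
def compare(byte_value):
    """Decode a single byte as Latin-1 and as Windows-1252."""
    raw = bytes([byte_value])
    latin1 = raw.decode("latin-1")
    # A few byte values are unassigned in Python's cp1252 codec, so
    # substitute U+FFFD for those instead of raising an error.
    cp1252 = raw.decode("cp1252", errors="replace")
    return latin1, cp1252

# 160-255: both decodings should agree with the Unicode code point itself.
same_160_255 = all(
    compare(value) == (chr(value), chr(value)) for value in range(160, 256)
)

# 128-159: collect the byte values where the two decodings disagree.
differ_128_159 = [
    value for value in range(128, 160) if compare(value)[0] != compare(value)[1]
]

print("160-255 identical:", same_160_255)
print("128-159 differing byte values:", len(differ_128_159), "of 32")
```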

If you wish to take a closer look, there is a useful comparison tool here: Windows-1252 vs ISO-8859-1 – ASCII Code (ascii-code.com)

In summary, all of the above are the same except in the code point range 128-159, which Windows-1252 uses for some printable characters that differ from ISO-8859-1 and Unicode (these characters are found in Unicode but at code points >255). These characters may be problematic, i.e. they may not render as expected due to confusion about the character set (code page) that should be used. Characters of interest in this range include:

  • Left and right ‘smart’ quote marks ‘ ’
  • Left and right ‘smart’ double quote marks “ ”
  • en and em dash – —
  • Bullet •

All of these are included in Unicode, but at code points beyond 255, so we would not be allowing them according to Rule 1. However, there are good alternatives available for the above in the ASCII/Unicode 0-255 set.
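
As an illustration of swapping in those alternatives, here is a minimal Python sketch. The mapping and the function name `to_plain_punctuation` are hypothetical, and the choice of replacement for each character (for example an asterisk for the bullet) is my own assumption rather than anything prescribed by AGS.

```python
# Hypothetical mapping from the troublesome characters to plain ASCII.
PLAIN_ALTERNATIVES = str.maketrans({
    "\u2018": "'",   # left single 'smart' quote
    "\u2019": "'",   # right single 'smart' quote
    "\u201c": '"',   # left double 'smart' quote
    "\u201d": '"',   # right double 'smart' quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2022": "*",   # bullet
})

def to_plain_punctuation(text):
    """Replace smart quotes, en/em dashes and bullets with ASCII characters."""
    return text.translate(PLAIN_ALTERNATIVES)

sample = "\u2018Firm\u2019 clay \u2013 depth 2\u20133 m"
print(to_plain_punctuation(sample))
```

Bear in mind that straight double quotes have a special role as field delimiters in AGS files, so something like this is best applied to individual text values before they are written out, not blindly to a whole file.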

Based on my own experience and on talking to colleagues across industry, it is the quote marks and en/em dashes that cause most trouble, and there is a good reason for this. It turns out that many of our day-to-day applications, such as MS Word, are by default set up to autocorrect normal straight quote marks to the so-called ‘smart’ slanty quote marks. Similarly, normal dashes can be unexpectedly turned into the en or em dash. Text from such applications sometimes gets pasted into databases or other applications, propagating these troublesome characters.

Some of the characters found within code points 160-255 are potential candidates for use in AGS files, such as ° ± ² ³ µ « » ¼ ½ ¾. Some of these relate to units and whilst it would be acceptable to use these in text fields, e.g. as part of remarks, they should never be used for the units defined in the UNIT header line. For example, you should be using Mg/m3, not Mg/m³. This is not formally spelt out by an AGS data format rule, but it is implied via the ‘suggested’ units and standard unit list, and there is a good reason for this.

The good reason is that this subset of characters can still be problematic, and you may come across AGS files that have multiple unexpected characters replacing these. The cause of this is related to something called encoding…

Encoding

Encoding is how we get from a character or code point number to the 0s and 1s that make up the file.

Once upon a time ASCII was an encoding standard. It is long obsolete as such, but it lives on because the ASCII encoding method (for code points 0-127) has, to all intents and purposes, been subsumed into both of the following.

What is commonly known as ANSI encoding* was once popular, and it still exists and may be encountered. If you were creating a text file a few years ago then the chances are that it would be encoded as ANSI. This is effectively the ‘extended ASCII’ encoding where the extended part is determined by the code page used by your computer. Obviously this means that there may be trouble ahead if you are using a code page that differs from that being used, at a later date, by others reading the same file!

UTF-8 is the standard encoding for Unicode. It has been dominant for a few years now, which is a good thing as it is unambiguous. If we had all been using Unicode and UTF-8 for our AGS files all along then there would be no need for this article.

* Strictly speaking, use of the term ANSI as an encoding is not correct. It’s complicated. However, many still refer to ANSI encoding and it is called this by most text editors, so I’m sticking with it.

I do have some good news. For standard ASCII characters (code points 0-127) the encoding used by ASCII, UTF-8 and ANSI is identical. If your file only uses these, then you should not have any problems.

Now the bad news. For extended ASCII code points 128-255, the UTF-8 and ANSI encodings are different. Very different, because ANSI uses one byte (8 bits, i.e. 8 x 0 or 1) whereas UTF-8 encoding uses two bytes in this range. The encoding differs even though both may be trying to represent the same character at the same code point.
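
The difference is easy to see at the byte level. The Python sketch below is only an illustration and assumes that ‘ANSI’ means Windows code page 1252, which is typical for UK users but not guaranteed. It encodes the degree sign (code point 176) both ways.

```python
char = "\u00b0"  # the degree sign, code point 176

ansi_bytes = char.encode("cp1252")  # 'ANSI', assuming code page 1252
utf8_bytes = char.encode("utf-8")

print("code point:", ord(char))
print("ANSI bytes :", ansi_bytes.hex(), f"({len(ansi_bytes)} byte)")
print("UTF-8 bytes:", utf8_bytes.hex(), f"({len(utf8_bytes)} bytes)")

# For a plain ASCII character all three encodings give the same single byte.
plain_is_identical = (
    "A".encode("ascii") == "A".encode("cp1252") == "A".encode("utf-8")
)
print("ASCII 'A' identical in all three:", plain_is_identical)
```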

Text files (and HTML) being created today are very likely to be using UTF-8. However, it is possible that some older software or applications are still using ANSI. Some legacy AGS files may be ANSI.

But surely we can deal with this as long as we know what the encoding is? Unfortunately, it is not as simple as that. It turns out that there is no sure-fire way of determining the encoding of a text file by just looking at it. Applications, including standard software such as MS Word and Excel, have to take a guess or make an assumption. Most of the time things work out ok, but this is not guaranteed. In particular, if an ANSI file is read as UTF-8 or vice versa then things can start going awry.
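
Here is a small Python sketch of what going awry looks like, again assuming that ANSI means code page 1252. It also includes a common ‘try UTF-8 first, then fall back’ approach. The function name `decode_with_fallback` is hypothetical, and the fallback is an educated guess rather than a way of detecting the encoding.

```python
text = "Mg/m\u00b3 at 20\u00b0"  # contains a superscript 3 and a degree sign

# UTF-8 bytes misread as Windows-1252: each extended character turns into two.
garbled = text.encode("utf-8").decode("cp1252")
print(garbled)

# Windows-1252 bytes misread as UTF-8: the lone bytes are not valid UTF-8.
try:
    text.encode("cp1252").decode("utf-8")
    ansi_read_as_utf8_failed = False
except UnicodeDecodeError:
    ansi_read_as_utf8_failed = True
print("ANSI read as UTF-8 failed:", ansi_read_as_utf8_failed)

def decode_with_fallback(data, fallback="cp1252"):
    """Try UTF-8 first; if that fails, assume a legacy code page (a guess)."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback, errors="replace"), fallback

print(decode_with_fallback(text.encode("cp1252")))
print(decode_with_fallback(text.encode("utf-8")))
```

The fallback works reasonably well because text in a legacy code page is rarely valid UTF-8 by accident, but it cannot tell one code page from another, so the `fallback` value still has to come from knowledge of where the file originated.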

For completeness I should mention that there are other variants of UTF encoding, but in practice we do not need to worry about these as you are very unlikely to come across them in an AGS data context.

What does AGS say about encoding? Nothing. It is silent. Let’s consider some of the implications of this…

Encoding, AGS data and the AGS validator

We can see from the above that we have a few issues to consider when processing an AGS file:

  • Extended ASCII is ambiguous, i.e. some characters may be incorrectly displayed or interpreted
  • Files may use different encodings, e.g. ANSI or UTF-8, resulting in unexpected outcomes if the wrong encoding is assumed
  • An application reading the file will not normally know, for sure, what the encoding is

These are problems that we also face when trying to validate an AGS file. In a previous blog I introduced the work of the AGS validator project. It is the ongoing work on the validator that has flushed out this issue and this article is based on some of its findings.

For the AGS validator, after careful consideration it was realised that effective and unambiguous validation can only be achieved if we make some assumptions:

  • File assumed to be encoded as UTF-8
  • The extended ASCII character set (code points 128-255) assumed to be the Unicode set

The validator will check whether each character appearing in the file falls within either the original ASCII range (0-127) or the extended ASCII range (128-255). Any character falling outside of both will be flagged up as invalid. Any within the extended ASCII range will be deemed to be valid, but an ‘FYI’ warning will be output, along with a general message explaining the assumptions made by the validator. The thinking here is that the validator will never be able to identify mis-interpreted characters, but it may be helpful to flag up characters that have the potential to be problematic, thus inviting the user to undertake their own investigations.
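
The logic described above can be sketched in a few lines of Python. To be clear, this is my own simplified illustration of the idea and not the validator’s actual code. The function name `classify_characters` and the shape of its output are hypothetical.

```python
def classify_characters(text):
    """Return (fyi, invalid) lists of (line number, column, character).

    Code points 0-127 are accepted silently, 128-255 are accepted but
    reported as 'FYI', and anything above 255 is reported as invalid.
    """
    fyi, invalid = [], []
    # Split on "\n" only, so that no other characters are swallowed as line breaks.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for column, char in enumerate(line, start=1):
            code_point = ord(char)
            if code_point <= 127:
                continue
            if code_point <= 255:
                fyi.append((line_no, column, char))
            else:
                invalid.append((line_no, column, char))
    return fyi, invalid

sample = '"LOCA_REM"\n"10\u00b0 slope \u2013 firm"\n'
fyi, invalid = classify_characters(sample)
print("FYI    :", fyi)
print("Invalid:", invalid)
```

In this sample the degree sign ends up in the FYI list and the en dash in the invalid list, which mirrors the behaviour described for the two ranges.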

The assumption of UTF-8 encoding should be safe for the large majority of files being created these days. However, when running the validator check function (via the Python library) there is an option to specify an alternative encoding to override the UTF-8 assumption, e.g. “cp1252” could be specified for an ANSI encoding based on code page 1252. However, if you do so you may find that characters in code point range 128-159 are still (wrongly) flagged as invalid. This is a quirk of the validator, but we do not expect this to be a commonly occurring situation in the real world. Furthermore, there is a workaround, as suggested by the validator documentation: simply re-save the file as UTF-8 (using a text editor) then re-run the validation.
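
If you would rather script the re-save than use a text editor, the following Python sketch shows one way of doing it with the standard library only. The function name `resave_as_utf8` and the file names are hypothetical, and the default of "cp1252" is an assumption: you need to know (or make an informed guess at) the real encoding of the source file.

```python
import os
import tempfile

def resave_as_utf8(source, destination, source_encoding="cp1252"):
    """Read a file using an assumed legacy encoding and write it out as UTF-8."""
    # newline="" leaves the original line endings untouched on read and write.
    with open(source, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(destination, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return len(text)

# Small demonstration using a temporary folder and a made-up two-line file.
with tempfile.TemporaryDirectory() as folder:
    legacy = os.path.join(folder, "legacy.ags")
    converted = os.path.join(folder, "converted.ags")
    with open(legacy, "wb") as f:
        f.write(b'"LOCA_REM"\r\n"Dip 10\xb0"\r\n')  # degree sign as one ANSI byte
    resave_as_utf8(legacy, converted)
    with open(converted, "rb") as f:
        converted_bytes = f.read()

print(converted_bytes)
```

If the source file contains bytes that are not defined in the assumed code page, the read will raise a `UnicodeDecodeError`, which is a useful hint that the assumption is wrong.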

There may, of course, be situations where it is known that an AGS file is using a different encoding and/or character set (code page), i.e. the assumptions made by the validator are not appropriate. In such cases, users should interpret the output from the validator accordingly and satisfy themselves as to whether the file is valid according to the AGS Rule 1.

AGS Rules – future changes?

A reasonable conclusion to draw from the above would be that the current AGS data format rules relating to character sets and encoding are ambiguous, and in need of clarification. The recent work undertaken for the AGS validator has brought this to the attention of the AGS Data Management Working Group. The approach adopted for the validator also provides a potential way forward.

Unfortunately, restrictions on what can and cannot be done with respect to changing the ‘rules’ of the AGS data format mean that it may be some time before any formal clarification can be provided in the documentation itself.

At the next major revision to the AGS format, i.e. AGS5, adoption of Unicode and UTF-8 will almost certainly be on the radar. However, there will also need to be consideration of whether existing commonly used software is capable of handling full Unicode.

Conclusions

This has been a long read, but hopefully I have managed to provide some clarity on this much misunderstood topic. In particular, I have tried to explain the root causes behind the appearance of unexpected characters when reading an AGS file.

As for how to avoid such problems, or how to deal with them when they arise, the main takeaways are:

  • The only way to be completely safe from the troubles described above is to avoid using characters in the extended ASCII range.
  • If you are using characters in the extended ASCII range, be clear (to future users) about which character set you are using. Ideally this should be the Unicode set. If not then you should ensure that users of your data know what the intent is.
  • Beware cut/paste from the likes of MS Word or Excel, bearing in mind their tendency to autocorrect quote marks and dashes. At the very least, turn off the auto-correction so that your quotes and dashes do not get messed up.
  • Make sure that your files are encoded as UTF-8. Most new files should be, but worth checking if you are not sure.
  • If the AGS validator is throwing up errors or warnings, take heed, but also look closely at your data. It may be that the data is ok but the assumptions that necessarily need to be made by the validator may be inappropriate.
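
For the first takeaway, a quick way of finding out whether a file strays beyond plain ASCII is to look at the raw bytes, which sidesteps the encoding question altogether. The Python sketch below is a minimal illustration. The function name `find_non_ascii_lines` is hypothetical and the sample data is made up.

```python
def find_non_ascii_lines(data):
    """Return the 1-based numbers of lines containing any byte above 127."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if not line.isascii()
    ]

# The same content saved as ANSI (code page 1252) and as UTF-8.
ansi_sample = b'"LOCA_ID"\n"Dip 10\xb0"\n"BH01"\n'
utf8_sample = b'"LOCA_ID"\n"Dip 10\xc2\xb0"\n"BH01"\n'

print(find_non_ascii_lines(ansi_sample))
print(find_non_ascii_lines(utf8_sample))
```

To run it on a real file you would read it in binary mode, e.g. `open(path, "rb").read()`, and pass the result to the function. An empty list means the file is pure ASCII, in which case none of the troubles described in this article should apply.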

Before departing, I should mention that there is some commonly used software (spoiler alert: it’s keyAGS) that appears to be a major culprit for generating gobbledygook in the extended ASCII range. Whilst the root cause of this is related to the discussion here, there are also some other things going on that exacerbate the problem. I am currently taking a closer look at this with a view to devoting a separate blog to it.
