AGS data – Don’t get bitten by the byte order mark (BOM)

Ever had an AGS file that refuses to import, or can’t get through an AGS checker/validator, but looks absolutely fine when you view it in a text editor? Or perhaps you have been told, maybe by some software, that your AGS file contains something called a byte order mark. If either of these applies, then read on. It may save you a lot of time and stress!

Before I dive into the details, I should point out that this is a problem that has been around for a few years, but my understanding is that it is rarely seen these days. There is a good reason for that, which I will expand on below. However, I am not aware of anything having been published on the subject in an AGS data context, so I thought that it would be helpful to share my knowledge.

The problem

I first came across this problem about five or six years ago. A colleague sent me a misbehaving AGS file to have a look at. Initially this seemed like no big deal as I was used to successfully fixing problems with AGS files. However, this one had me flummoxed. It was not importing into gINT (my database software) and the gINT AGS checker was either not flagging up a problem or was failing completely, I can’t remember which. All the while, the file itself looked perfectly ok when viewed in a text editor. I tried all sorts, but had to concede that I could not figure out what was wrong.

I returned to it a few days later. I can’t remember exactly what I did, but I suspect I was searching for dodgy (non-ASCII) characters given that I had seen problems with these in other files. Turns out it was not the same problem, but I gleaned enough information from this second look to set me on the trail of something I had never heard of at the time: the byte order mark (BOM). Turns out that I had one in my file, and that was the cause of the trouble.

Once I had diagnosed the problem, I quickly managed to figure out the solution, with a bit of help from the web. Implementing the fix was moderately painful as I had to download some software (Notepad++, more on that later). Fortunately, it is a much easier fix these days.

From discussions with others around the industry, I understand that the problem described here may manifest itself in different ways, depending on the software that you are using.

Bottom line – if you have a file that is refusing to check and/or import properly, but you cannot see anything wrong with it, then read on. This may be your problem, in which case you should be good to go in a few minutes’ time. If not, and it is something else, then best of luck!

What is a byte order mark?

I am not going to go into detail here as there is no point, and there are plenty of web pages out there that do a better job of explaining it than me. It even has its own Wikipedia page. Here is one of the official sounding definitions:

A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.

Source: Glossary (unicode.org)

Fully understand that? No, me neither. Try this…

In layman’s terms, a byte order mark (BOM) is a hidden character that appears at the top of a text file that tells your computer something about how that file should be handled.

Do you need it? No, not normally. In fact, hardly ever. Most text files you work with these days, including AGS files, will use something called UTF-8 encoding, in which case use of the BOM is optional.

Do you want one in your AGS file? NO! The first problem is that a BOM is not an ASCII character, so immediately your file will fall foul of AGS Rule 1, a bad start! However, a BOM can cause problems for unsuspecting software in other ways too. In my case it contaminated the top line of data which the gINT importer then tried to ignore. Unfortunately this had dire consequences for the rest of the import. So, you definitely do not want a BOM in your AGS file.
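For the more code-minded, here is a minimal Python sketch of how a BOM can contaminate the top line. The sample line is a made-up AGS4-style heading, not taken from a real file, and this only illustrates the general effect; it is not a description of how gINT or any other importer works internally.

```python
import codecs

# Hypothetical first line of an AGS file that was saved as "UTF-8 with BOM".
# codecs.BOM_UTF8 is the three bytes EF BB BF.
raw = codecs.BOM_UTF8 + b'"GROUP","PROJ"\r\n'

# A plain UTF-8 decode keeps the BOM as an invisible character, U+FEFF.
first_line = raw.decode("utf-8").splitlines()[0]

# The line no longer starts with a double quote, and it is no longer pure ASCII.
starts_with_quote = first_line.startswith('"')
is_ascii = first_line.isascii()

# Decoding with "utf-8-sig" drops a leading BOM if one is present.
clean_line = raw.decode("utf-8-sig").splitlines()[0]
```

Here `starts_with_quote` and `is_ascii` both come out as `False`, even though the line looks perfectly normal on screen, whereas `clean_line` is the heading you would expect.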

A bit of extra nerdy info: UTF-8 is one of the encodings defined by the Unicode Standard. ASCII is a subset of UTF-8, but there are characters in UTF-8 that are not part of ASCII. AGS data is currently required to comply with ASCII, although that may change one day. By contrast, the recently released AGSi format allows full UTF-8 encoding.

I reiterate that you cannot see a BOM in a text editor, even when you know where to look!
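That said, the BOM is easy to spot if you look at the raw bytes rather than the text. In UTF-8 it is always the three bytes EF BB BF at the very start of the file. The following is a minimal sketch in Python, assuming a UTF-8 file; the function name `has_utf8_bom` is my own invention and is not part of any AGS tool.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes (EF BB BF)."""
    # Open in binary mode so that nothing is decoded or hidden from us.
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

For example, `has_utf8_bom("my_file.ags")` (a hypothetical file name) would return `True` for a file saved as UTF-8 with BOM and `False` otherwise.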

How do I know if I have a BOM, and how do I fix it?

How can you see something that you cannot see? Easy peasy, when you know how.

First of all, a quick plug for the AGS Validator, which I have talked about in a previous blog. This is a new AGS-managed open-source tool for validation (what we wrongly used to call checking) of an AGS file. It is intended to take over from the ageing gINT and KeyAGS checkers. The AGS validator is available as a Python library and desktop application. There are also a couple of free-to-use web app versions that use this as their back-end, the first of which is by yours truly – see links below.

Digital Geotechnical – AGS validator app

British Geological Survey – AGS File Utilities Tool and API (bgs.ac.uk)

The AGS validator cannot 100% confirm that you have a byte order mark in your file, but if it suspects one, i.e. all the planets are aligning in that direction, then it will give you a warning that you may have a BOM, and that you should do something about it.

If you get such a warning, the simplest way to check for sure is to open your file in a compatible text editor. When I say compatible, I mean one that provides options relating to saving with or without a BOM. It used to be the case that only the more advanced editors, such as Notepad++, could do this. However, since 2019 the standard Windows Notepad has also been able to do this.

The next step may differ depending on your editor, but hopefully you will be able to figure it out based on the following example.

If you are using Windows Notepad, open the ‘Save As’ dialogue (under File). Then expand the ‘Encoding’ dropdown menu, found near the bottom of the window. It should look something like this…

[Screenshot: BOM in Notepad]

Take note of what entry is pre-selected. If it is UTF-8 with BOM, then jackpot! You have found the problem, and you can fix it by selecting the ‘UTF-8’ option instead, then re-save. Job done.

However, the more observant of you may have spotted that there is another way to identify whether you have a BOM before opening Save As. Check out the bottom right-hand corner of the screenshot, where Notepad’s status bar shows the current encoding!

In Notepad++, ‘Encoding’ is found on the main menu. Simply change the selection, then re-save.
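If you have a batch of files to fix, or simply prefer a script to a text editor, the same repair can be sketched in a few lines of Python. This is a minimal example under my own assumptions, not an official AGS utility: the function name `strip_utf8_bom` is hypothetical, and it only deals with the UTF-8 BOM. It works on raw bytes so that nothing else in the file is altered.

```python
import codecs


def strip_utf8_bom(src, dst):
    """Copy src to dst without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False if there was none.
    """
    with open(src, "rb") as f:
        data = f.read()

    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        # Drop only the three BOM bytes; everything else is left untouched.
        data = data[len(codecs.BOM_UTF8):]

    with open(dst, "wb") as f:
        f.write(data)
    return had_bom
```

Writing to a separate output file, rather than overwriting the original, means you still have the untouched file to go back to if the BOM turns out not to be the problem.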

Why does it happen?

I don’t know for sure, but after doing a bit of research I now have my suspicions.

AGS files generated from specialist commercial software are unlikely to include a BOM. If they did, we would all have known about it long before now. I suspect that BOMs have been introduced where AGS files have been assembled using a text editor, typically when stitching together different source files. For example, it is not uncommon for the fieldwork and laboratory data to be generated by different organisations. It is not best practice to splice them together like this. Much better to assemble the data within a single database, then create the AGS file. But it does happen, always has done and probably always will, albeit hopefully less often now as the software, and its users, get better.

However, what I describe above has been going on for probably as long as AGS data has existed. So why have we not always had this problem, and why don’t we see it much these days?

My best guess is that the root cause of BOM problems in AGS files is related to changes in the behaviour of Windows Notepad over time. Until recently, the default encoding used by Notepad was ANSI. UTF-8 got added as an option along the way. Unfortunately, it appears that Notepad’s UTF-8 encoding originally included a BOM, whether it was asked for or not. Therefore, if UTF-8 encoding was selected in Notepad, the BOM would appear.

Even so, why were we seeing problems if the default was ANSI, not UTF-8? I did a bit of digging around on this and found some reports of Notepad silently changing the encoding to UTF-8 (with its sneaky BOM) when text was pasted in from other software. Typically, the trigger would be non-standard characters in the pasted text that ANSI could not handle.

Whatever the cause, Microsoft addressed the problem in 2019. Since then, Notepad has offered users the choice of UTF-8 encoding with or without a BOM. It also now defaults to UTF-8 (without BOM) instead of ANSI.

Other text editors like Notepad++ had this functionality all along. However, many people still use the Windows default editor. One positive outcome from my BOM experience is that it introduced me to Notepad++, which is much better than Windows Notepad in many respects.

Summary

In this blog I have shown you how to diagnose and fix a byte order mark (BOM) problem in an AGS file, should you ever come across this.

I have provided some background information, which may have interested a few of you, but probably bored the rest!

It is probably fair to say that this article may be a few years late, but if I end up helping just one person then it will have been worthwhile. If this article has, indeed, helped you fix a real-world BOM problem, then it would be great if you could let me know.

Finally, a little plug for my business! If you are having any problems of any sort with your AGS data, then please get in touch. I may be able to help.

2 Comments

  1. James Tasker

    Hi Neil, this is something I came across myself a few months back and I too was flummoxed by it. With a lot of comparing between other AGS files and the aid of google/ChatGPT I managed to fix it…somehow. Until I read this post today, I still did not understand what the “UTF-8 with BOM” encoding was so thank you for elaborating on this! I also appreciate that a check for this issue has been included in the AGS validator. Another one to add to the many reasons it is my first port of call for all things .ags related!

    • Thank you for the comment. It sounds like we had a similar experience, initially. Pity that I did not manage to get this blog up a few months earlier when it would have been more useful to you.
