
The “Strings and Text are not the same” post on Musing Mortoray discusses the difference between “text” and “strings,” and it got me thinking. Rather than weigh down his (or her) comment section with a very long comment, I thought I’d go on a bit about it here.

I agree totally with the basic premise: that “text” and “strings” are different beasts. I also agree that text-handling depends on the text. There might be some difference in how we define text and string, and thinking about how I define them turned up a lot of thoughts on the matter. This isn’t intended as an opposition post (except on one point with regard to HTML). What follows is just one programmer’s opinion.

Let me start with the idea of text, which I take to be a generic, very inclusive, abstract class of physical object. The books on the shelves around me are filled with text. The CD and DVD covers have various (sometimes very inventive) forms of text on them. This post and the one that triggered it are both text objects. A hand-written letter is a form of text.

Almost by definition, language written down in readable form is text (for broad definitions of “written down” and “readable”). So far, I think, Mortoray and I are in full accord. Where we diverge may be in how we define a “string.”

For me, a string is a computer language term for a kind of data object. It has a generic, but notably concrete, definition (whereas text is generic and highly abstract). To see how a string fits into the picture, I have to go all the way back to the ones and zeros. (I know this will be obvious and old-hat to many, but bear with me; I’m laying groundwork here.)

Inside the computer it’s all just same-sized chunks of ones and zeros. (Technically, they’re not even that; they’re voltage levels interpreted as either zero or one.) Chunks are often stuck together (in various ways) to make bigger chunks. To make this useful, we interpret the chunks in meaningful ways. Taken individually and collectively, the chunks are data, the most generic form of information.
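To make the "interpretation" point concrete, here is a minimal Python sketch (Python is just a convenient choice; the byte values are arbitrary picks for illustration). The same four chunks are read as characters, as one integer, and as a floating-point number:

```python
import struct

# Four arbitrary bytes: the same "chunks" read three different ways.
chunk = bytes([0x54, 0x65, 0x78, 0x74])

as_text = chunk.decode("ascii")            # interpreted as four ASCII characters
as_int = int.from_bytes(chunk, "big")      # interpreted as one unsigned integer
(as_float,) = struct.unpack(">f", chunk)   # interpreted as a 32-bit IEEE 754 float

print(as_text)    # Text
print(as_int)     # 1415936116
print(as_float)   # some large float; the bits are the same, only the reading differs
```

Nothing in the bytes themselves says which reading is "right." That comes entirely from how we choose to interpret them.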

We can stop right here. Computation does not require anything more than the most fundamental data types (think Turing machine). Once we treat chunks as encoded numbers, that’s all we absolutely need. (Plus an important distinction between “data” (in the usual sense) and a special kind of data we call “code.”)

But to be more effective (and clear!) it’s handy to impose higher data abstractions on the chunks. To my way of thinking, data first breaks down into three generic classes: text (strings, chars, regex, etc.), numbers (integers, floats, complex, etc.) and other (dates, locations, images, etc.). These three main classes break down into various physical classes (some of which are named in the parentheses).

I do mean “class” not “type,” and I mean “class” in the abstract sense of “a kind of thing” (and not a class definition in an object-oriented language). We’re still talking data abstraction here; no metal involved. Unless referencing a specific language, the terms “string” or “integer” or “date” are still abstract and generic, but they’re a lot less abstract than “text” or “number” because we’re starting to get into semantics and physical format.

Dates, integers and strings all have specifics that affect how we store them as data. One obvious parameter is that the nature of the data controls how many chunks we need to represent that data. An interesting thing about strings is that, unlike most other data types, they have variable length.
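A quick sketch of the length difference, again in Python and assuming UTF-8 as the string encoding (my choice for the example, not a requirement):

```python
import struct

# Fixed-size types: the format alone tells us how many bytes are needed.
int_size = struct.calcsize("<i")     # 32-bit integer: always 4 bytes
float_size = struct.calcsize("<d")   # 64-bit float: always 8 bytes

# Strings: the size depends on the content (and on the chosen encoding).
short_size = len("hi".encode("utf-8"))
long_size = len("hello, world".encode("utf-8"))
accented_size = len("café".encode("utf-8"))   # 4 characters, but 5 bytes

print(int_size, float_size)                   # 4 8
print(short_size, long_size, accented_size)   # 2 12 5
```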

Another thing we can say about a string is that it has some sort of context or semantic; it means something. But all data has some context. All data means something. Strings, like integers, are general enough data types to be used in a variety of ways, so about all we can really say about strings is that they have length and meaning, and only the first point is special.

Of course, text also has length and a meaning. Strings and text are not different in this regard. But I think they do differ in a big way in level of abstraction, as covered above. It’s arguable that they even differ in kind. I believe we all agree on that much.

For one thing text (at least to me) implies multiple lines of text. Just about everything I would label as “text” does have multiple lines. On the other hand, I think of strings as short bits of text — single lines at best. There are many places in the computer world where a multi-line string is a special case. I definitely see string as a special kind of text. (There is even a common phrase: “a string of text.”)

If I were to read a text file, I might very well read it into an array of strings, particularly if the text file had a line-oriented structure (e.g. properties or config file). If the file wasn’t particularly line-oriented, say an XML or HTML file, I’d probably read the whole thing into a single string or parse it on the fly.
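As a rough sketch of those two habits in Python (the file names and contents are made up for the example, and written to a temporary directory so the snippet runs on its own):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Line-oriented file: read it into a list of strings, one per line.
    config_path = Path(tmp) / "app.properties"   # hypothetical config file
    config_path.write_text("name=demo\nmode=test\n", encoding="utf-8")
    lines = config_path.read_text(encoding="utf-8").splitlines()
    settings = dict(line.split("=", 1) for line in lines if line.strip())

    # Not line-oriented: line breaks carry no structure, so read it whole.
    page_path = Path(tmp) / "page.html"          # hypothetical HTML file
    page_path.write_text("<p>Hello,\nworld</p>\n", encoding="utf-8")
    whole = page_path.read_text(encoding="utf-8")

print(lines)      # ['name=demo', 'mode=test']
print(settings)   # {'name': 'demo', 'mode': 'test'}
print(len(whole)) # one string holding the entire file
```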

Which brings us to a crucial point of disagreement: the idea that HTML is not text. In my view, HTML absolutely is text. So are XML, XSD, XSLT, SQL, JSON, DDL et many alii. A possibly important aspect: they are texts in specific languages. (One crucial distinction: all are computer languages, but only XSLT is a computer programming language. Specifically, HTML is not a programming language!)

These languages are not disqualified as text because editor software ignorant of the language mangles the text. An HTML editor edits HTML as expected, just as an XML editor edits XML. They are text because they can be loaded into any text editor and be edited as text by a person knowledgeable in the language. Obviously, syntax-aware editors are better than simple text editors, and structured editors are better still. But that doesn’t mean the source text isn’t text. (To me the term “codes” just means “numbers” and is too ambiguous for use. What these languages do contain is tokens.)

In the mark-up languages (SGML, HTML, XML, etc.) the presence of meta-data tokens might lead one to classify these texts as non-textual, but I see them as texts in a specific language. There are language-specific editing considerations, as there might be in any language. The key, though, is that any text editor can edit any text file.

In fact, one of the great things about text is how universal it is. This is why XML is the monster hit it is; text with structure and data type! It’s possible that HTML, CSS and JavaScript (and HTTP itself!) all being text formats was instrumental in the huge success of “the web.”

The line is actually pretty fuzzy between text and structured, marked-up text. In the old days, we used *bold* and _italics_ meta-symbols. (There are websites that detect these and apply bold or italics accordingly.) One can even view punctuation, such as exclamation and question marks (or even the period), as meta-tokens that add meaning to the text. And I’ve seen coders use HTML tags (particularly made-up ones) in text <seriously>to make a point.</seriously> Did my using two made-up “HTML” tags make this not text? And what about the use of double-quotes to mean not really?
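For the curious, detecting the old *bold* and _italics_ meta-symbols takes only a couple of regular expressions. This is a minimal Python sketch; the function name and the exact matching rules are my own assumptions for illustration, not how any particular website does it, and it makes no attempt at HTML escaping:

```python
import re

def render_plain_markup(text: str) -> str:
    """Turn *bold* and _italics_ meta-symbols into <b> and <i> tags."""
    # The lookarounds keep markers inside words (snake_case, 2*3*4) untouched.
    text = re.sub(r"(?<!\w)\*([^*\s][^*]*)\*(?!\w)", r"<b>\1</b>", text)
    text = re.sub(r"(?<!\w)_([^_\s][^_]*)_(?!\w)", r"<i>\1</i>", text)
    return text

print(render_plain_markup("This is *bold* and _italic_ text."))
# This is <b>bold</b> and <i>italic</i> text.
print(render_plain_markup("snake_case_name"))
# snake_case_name
```

The input is text before the call and text after it; the only thing that changed is which tokens carry the meta-data.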

In my experience, in the computer world, text is the Yin to the Yang of binary. Data that isn’t binary is text. (If it’s not text, it’s usually something bizarre.) Text is code for “won’t blow up anything that handles text” or “won’t look like Martian when printed” (keeping in mind that coders are smart about what isn’t Martian). In the old days, text meant seven-bit ASCII (or EBCDIC or whatever). As we’ve become more global, we’ve grown towards Unicode, so now the definition of text is a bit broader. (Definitely requires graceful eight-bit handling.)

If it isn’t binary, it’s text. Yin/Yang. For me, that’s the bottom line.
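The "won’t blow up anything that handles text" test can be approximated in a few lines. This is only a heuristic sketch in Python, under my own assumptions (no NUL bytes, and the data decodes cleanly as UTF-8, which also covers seven-bit ASCII); the function name is hypothetical and real tools use more elaborate checks:

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: treat data as text if it has no NULs and is valid UTF-8."""
    if b"\x00" in data:
        return False          # NUL bytes are a classic sign of binary data
    try:
        data.decode("utf-8")  # plain ASCII is valid UTF-8 too
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_text(b"plain old ASCII\n"))              # True
print(looks_like_text("naïve café".encode("utf-8")))      # True
print(looks_like_text(b"\x89PNG\r\n\x1a\n\x00\x00"))      # False
```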

But this may be mostly terminology (or pedantry). I think the real issue Mortoray is addressing is that string is too generic and that, in context, most strings have deeper semantics than “string” so they should be more complex data types. On that point, I agree completely.

There are many places where this is addressed by software. I have an XML editor (XML Spy) that allows me to create and edit XML without ever seeing the source text. I’ve also used (but generally not liked) HTML WYSIWYG editors. To be honest, I typically use a text editor (gvim) for almost all flavors of text file. I like having full control of my XML and HTML.

The complaint that basic string operations are often inappropriate for structured or marked-up text is accurate, but I wouldn’t expect generic operations to apply to special cases. String operations are necessarily generic and intended as a foundation for general string handling. And they work fine in many general cases. But special cases demand special handling.

In object-oriented terms, handling HTML strings is a specialization of string handling and appropriately sub-classed. Most browsers do this when reading an HTML website. They convert the HTML text to a DOM object that reflects the tree structure of the page. If one is working with HTML or JSON or XML, it’s fairly easy to work with an object model and to consider source text as just for input and output.
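Here is a small sketch of that "source text is just for input and output" workflow, using Python’s standard-library ElementTree. I’m using a made-up, well-formed XML snippet rather than real-world HTML, and ElementTree’s tree is not a browser DOM; it just stands in for the general idea of an object model:

```python
import xml.etree.ElementTree as ET

# Input: source text (a hypothetical document, invented for this example).
source = "<page><title>Strings</title><body><p>one</p><p>two</p></body></page>"

# Parse the text into a tree of objects and work with the structure.
root = ET.fromstring(source)
paragraphs = [p.text for p in root.iter("p")]
root.find("title").text = "Strings and Text"

# Output: serialize the object model back to source text.
output = ET.tostring(root, encoding="unicode")

print(paragraphs)   # ['one', 'two']
print(output)
```

Between the parse and the serialize, no generic string operation ever touches the markup, which is the whole point.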

It’s difficult for a language to provide this natively, but many provide capabilities in their libraries. Even so, it’s not unusual for a coder to create sub-types that handle structured text in ways tailored to the application. Long-time coders often have personal libraries that provide advanced functionality they’ve found useful.
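A toy example of such a tailored type, sketched in Python. The `Version` class is entirely hypothetical (my own illustration, not from any library), and it assumes a simple three-part "major.minor.patch" format. It shows how a generic string operation (comparison) gives the wrong answer for a string with deeper semantics, and how a small type that knows the format fixes it:

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Version:
    """A version number, parsed from text like '1.10.0' (hypothetical type)."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Assumes exactly three dot-separated integers.
        major, minor, patch = (int(part) for part in text.strip().split("."))
        return cls(major, minor, patch)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"

# Generic string comparison is character-by-character, so it gets this wrong.
print("1.10.0" > "1.9.0")                                 # False
# The specialized type compares the numbers the text stands for.
print(Version.parse("1.10.0") > Version.parse("1.9.0"))   # True
print(str(Version.parse(" 1.10.0 ")))                     # 1.10.0
```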

To wrap it up, in my view, text is an abstraction, strings are computer objects, and complex text requires types that are knowledgeable about them. (In fact that last point is just basic OOD.)