Lecture 24: Data Formats and Encoding -- A Philosophy Lecture
Reflections on Data Encoding
Compare:
- Classic Internet Application Protocols
  Protocol messages are usually lines of printable ASCII text, using
  the telnet NVT convention for line endings. Data is either textual
  (hence transmitted as Telnet NVT lines), encoded into textual form
  (eg, Base64 for email attachments) or simply transmitted as binary
  (eg, images in HTTP) -- no generic rules apply across all protocols.
- SNMP-based Network Management
  Data and protocols are both described using ASN.1, and encoded for
  transmission using the TLV-style BER -- a binary format. The entire
  PDU (data and "header information") is a single BER entity. Note,
  incidentally, that ASN.1 technology is in widespread use outside
  network management; eg, in LDAP, X.500 (and related), Microsoft
  NetMeeting and in many industrial applications.
Both of these formats exemplify a principle whereby the protocol
message is encoded into a standardised or
canonical form for transmission. What "goes over the
wire" is in the same format, regardless of the type and characteristics
of each of the machines involved in the transfer[1]. This is a Big Idea.
[1] There are alternatives to this: we have
already seen a technique (way back in the telnet lecture) generically
called terminal emulation, whereby the sender of the
data converts it to the specific format expected by the receiver
before sending. The other approach is called (in some circles)
receiver-makes-right. Here the receiving software,
knowing the source of the data, converts it to its own format before
proceeding. This obviously fails if the source can't be
determined!
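The TLV (Tag-Length-Value) idea behind BER can be sketched in a few lines. This is a minimal illustration, assuming only the universal INTEGER type (tag 0x02) and definite-form lengths; the function names are invented for this example and it is not a complete BER encoder.

```python
def ber_length(n: int) -> bytes:
    """Encode a definite-form BER length field."""
    if n < 0x80:
        return bytes([n])                      # short form: a single byte
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body    # long form: count byte, then length


def ber_integer(value: int) -> bytes:
    """Encode a Python int as a BER INTEGER: tag 0x02, length, value."""
    # Smallest number of bytes that holds the value in two's complement.
    magnitude = value if value >= 0 else ~value
    size = magnitude.bit_length() // 8 + 1
    content = value.to_bytes(size, "big", signed=True)
    return b"\x02" + ber_length(len(content)) + content


print(ber_integer(1003421).hex())   # 02030f4f9d
```

Whatever machine produces these bytes, the receiver sees the same tag, length and big-endian value, which is the point of a canonical form.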
Parsing
ASCII text-based protocols have the advantage of human
readability, which has aided the debugging and development of
these protocols. Also, many other data types can easily be expressed in
ASCII -- for example, numeric data: eg, the ASCII string "2529" is
clearly an integer. Note, however, that
even such a simple system has potential pitfalls: think of the
textfile conventions of Unix systems, PCs and Macs
vis-a-vis the telnet NVT "line-of-text" convention used in these
protocols.
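The line-ending pitfall mentioned above can be made concrete. The following is a small sketch, assuming the local text uses one of the three common conventions (LF, CR LF or CR) and that the protocol wants CR LF terminated ASCII lines; the helper names are hypothetical.

```python
import re


def to_nvt(text: str) -> bytes:
    """Convert local text, whatever its line ending, to CRLF-terminated ASCII."""
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()                    # the text ended with a line terminator
    return b"".join(line.encode("ascii") + b"\r\n" for line in lines)


def parse_integer_line(line: bytes) -> int:
    """Read a number that was sent as a line of ASCII digits."""
    return int(line.rstrip(b"\r\n").decode("ascii"))


print(to_nvt("HELLO\nWORLD\n"))          # b'HELLO\r\nWORLD\r\n'
print(parse_integer_line(b"2529\r\n"))   # 2529
```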
Protocol messages in these classic Internet applications are structured
to conform to a grammar -- a set of syntax
rules. The receiver of such a message has to
parse it to discover its meaning. This can be compared
to the process whereby (eg) a Java source file is
compiled to a byte-code equivalent. The problem here
is that writing a parser is (still) considered to be a difficult
programming problem, and developers tend to avoid writing one if
possible...
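As an illustration of parsing against a grammar, here is a sketch for one very small rule: an SMTP-style reply line consisting of a three-digit code, a space (or a hyphen if more lines follow), free text and CR LF. The `Reply` type and `parse_reply` function are names chosen for this example; a real protocol needs many more rules than this.

```python
import re
from typing import NamedTuple

# reply-line = 3DIGIT ( SP / "-" ) text CRLF
REPLY_LINE = re.compile(rb"(\d{3})([ -])(.*)\r\n")


class Reply(NamedTuple):
    code: int      # numeric status
    more: bool     # True if further lines of the same reply follow
    text: str      # human-readable part


def parse_reply(line: bytes) -> Reply:
    """Parse one reply line, rejecting anything that breaks the grammar."""
    match = REPLY_LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"malformed reply line: {line!r}")
    code, separator, text = match.groups()
    return Reply(int(code), separator == b"-", text.decode("ascii"))


print(parse_reply(b"250 OK\r\n"))
```

Even this one rule needs explicit error handling for input that does not conform, which hints at why full parsers are seen as hard work.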
In contrast, an ASN.1/BER bytestream can be
interpreted using (in principle, at least) a somewhat simpler
pattern matcher. Such software is, in general, easier
to write -- it can be written using a "Finite State Machine" model, or
could even be as simple as a sequence of nested IF-statements. The
downside is a protocol that can't be tested using "human-readable"
messages. TANSTAAFL (There Ain't No Such Thing As A Free Lunch).
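The "sequence of nested IF-statements" style of decoder might look like the sketch below. It assumes single-byte tags, definite-form lengths and only three universal types (INTEGER, OCTET STRING and SEQUENCE); the function names are invented here, and a real BER decoder handles far more.

```python
def read_tlv(data: bytes, pos: int = 0):
    """Return (tag, value, next_pos) for the TLV that starts at data[pos]."""
    tag = data[pos]
    first = data[pos + 1]
    pos += 2
    if first < 0x80:                        # short form: the byte is the length
        length = first
    elif first == 0x80:
        raise ValueError("indefinite lengths are not handled in this sketch")
    else:                                   # long form: low 7 bits count length bytes
        count = first & 0x7F
        length = int.from_bytes(data[pos:pos + count], "big")
        pos += count
    value = data[pos:pos + length]
    if len(value) != length:
        raise ValueError("truncated TLV")
    return tag, value, pos + length


def decode(data: bytes) -> list:
    """Decode consecutive TLVs into Python values, recursing into SEQUENCEs."""
    items = []
    pos = 0
    while pos < len(data):
        tag, value, pos = read_tlv(data, pos)
        if tag == 0x02:                     # INTEGER
            items.append(int.from_bytes(value, "big", signed=True))
        elif tag == 0x04:                   # OCTET STRING
            items.append(bytes(value))
        elif tag == 0x30:                   # SEQUENCE: contents are more TLVs
            items.append(decode(value))
        else:
            raise ValueError(f"unsupported tag 0x{tag:02x}")
    return items


print(decode(bytes.fromhex("3009020101040470696e67")))   # [[1, b'ping']]
```

No grammar or tokeniser is involved: each step reads a tag, reads a length, and branches on the tag.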
Document Formats -- XML
We have concentrated, so far, on protocol formats, but
the data (or document) is also interesting. For example, the (usually)
ASCII HTML document is the basis of the World Wide Web. HTML is a
curious mixture of structural (or
semantic) markup, and markup elements used for
in-line presentational formatting. For example,
<h2>Header</h2>
is clearly a structural
markup, whereas
<b><i>important text</i></b>
is (generally speaking) simply an indication of how the author would
like the text displayed.
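HTML has been joined (via mechanisms such as Cascading Style Sheets
(CSS), which move presentation out of the document) by the far richer
XML (eXtensible Markup Language). Strictly speaking, XML did not evolve
from HTML; both derive from SGML.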
In XML, the details of both the meaning of the markup tags, and the
presentational aspects of the document have been separated from it. The
document itself contains only semantic (or structural)
information. Conceptually we have the notion of "Document as
Database".
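The "Document as Database" notion can be illustrated with a short sketch. The XML document below is invented for this example and contains only structural markup; the program then queries it much as it would query a table.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: structure only, no presentational markup.
DOC = """
<unit name="Computer Networks">
  <lecture number="23"><title>An Earlier Topic</title></lecture>
  <lecture number="24"><title>Data Formats and Encoding</title></lecture>
</unit>
"""

root = ET.fromstring(DOC)


def title_of(number: int) -> str:
    """Look up a lecture title by its number attribute."""
    node = root.find(f"lecture[@number='{number}']/title")
    if node is None:
        raise KeyError(number)
    return node.text


print(title_of(24))   # Data Formats and Encoding
```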
XML can be considered as a document-level canonical form. It has
already been used extensively in the Web, both as an adjunct to HTML and
as a replacement -- modern browsers can already process XML documents
using associated XSL style sheets. More importantly,
it is becoming clear that more complex "Web Services"
can, and will, be based on XML (see later).
Background: Client-Server Programming with RPC
Until now, this unit has only looked at (socket-based) protocols where
the details of the protocol are visible to the programmer. An
alternative paradigm is that of the Remote Procedure Call
(RPC). In this model, a programmer (using an
imperative or procedural programming
model) thinks of a service on a network server as though it were a
sub-routine (or procedure, or function[2]) in almost
exactly the same way he/she thinks of a local sub-routine.
An RPC application is built (compiled), as usual, but with external
(remote) procedures replaced with stub procedures. The
RPC system arranges for the stub procedure to transparently send
network messages to the remote procedure, and receive returned values.
Thus development of networked applications is, in theory at least, not
harder than development for a single machine.
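The stub idea can be sketched as follows. This is an illustration only: JSON stands in for the real wire encoding, the "network" is an ordinary function call, and all of the names (`make_stub`, `server_dispatch`, `PROCEDURES`) are invented for the example rather than taken from any RPC system.

```python
import json


# ---- "server" side ------------------------------------------------------
def add(a, b):
    return a + b


PROCEDURES = {"add": add}


def server_dispatch(request: bytes) -> bytes:
    """Unpack a request, call the named procedure, pack the result."""
    call = json.loads(request)
    result = PROCEDURES[call["proc"]](*call["args"])
    return json.dumps({"result": result}).encode("utf-8")


# ---- "client" side ------------------------------------------------------
def make_stub(name, transport):
    """Build a local function that forwards its arguments over `transport`."""
    def stub(*args):
        request = json.dumps({"proc": name, "args": list(args)}).encode("utf-8")
        reply = transport(request)      # a real stub would use the network here
        return json.loads(reply)["result"]
    return stub


# The transport is an in-process call, standing in for a network round trip.
remote_add = make_stub("add", server_dispatch)
print(remote_add(2, 3))   # 5
```

The caller writes `remote_add(2, 3)` exactly as for a local function; everything between the call and the result is hidden in the stub.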
The Unix RPC system (originally developed at Sun Microsystems) uses a
canonical form, the XDR (eXternal Data Representation) data encoding
system, for sending data across the network. It is quite a complex
specification: we will examine how one data type -- the integer -- is
handled.
[2] "Sub-routine" is an
historical generic term for a re-usable code-segment with formally
specified parameter passing conventions. The term
procedure was used for the same thing in Pascal, and
function in C.
Example: Integers in Unix RPC
We assume that an integer is 32 bits (4 bytes) in length. There are
(basically) two ways in which an integer can be stored in the memory of
a computer: with the Least Significant Byte in the
lowest numbered address (so-called Little-Endian
format), or with the Most Significant Byte at that
position (Big-Endian). The Intel (and compatible)
range of processors is Little-Endian, as was the Digital range of
CPUs; many others (eg, the Motorola 68000 and SPARC families) are
Big-Endian.
Take, for example, the integer 1003421 (decimal), which is 000f4f9d in
hexadecimal. We assume that this integer is stored at address X in
memory. In Little-Endian storage the byte at the "address of" the
integer has value 9d (hex): the bytes at X, X+1, X+2 and X+3 are 9d,
4f, 0f and 00. In Big-Endian storage the byte at the "address of" the
integer is 00 (hex): the same four addresses hold 00, 0f, 4f and 9d.
Software which desires to send (as raw bytes) such an integer as a
parameter to a remote procedure cannot simply read the bytes from
memory and transmit them, because the remote machine might use a
different byte-order. In XDR, the solution is to
(transparently) convert integers from their native format to
Big-Endian format for transmission, and transparently
convert them back at the other end to the appropriate native format.
Hence, two Big-Endian machines will incur no "translation overhead",
whereas two Little-Endian (eg, Intel) machines communicating will be
required to convert the byte order at each end of the communication.
It will be readily seen that, as mentioned, XDR uses canonical
forms for data transmission. More importantly, the required
conversions occur within the RPC sub-system, so the programmer
never needs to be aware of them. Their operation is
transparent.
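The byte orders discussed above are easy to see with Python's struct module. The sketch below assumes a 32-bit signed integer, as in the text; the two function names are made up for this example and are not part of any RPC library.

```python
import struct

N = 1003421   # 0x000f4f9d

# The same integer as it would sit in memory on each kind of machine.
print(struct.pack("<i", N).hex())   # 9d4f0f00  (Little-Endian)
print(struct.pack(">i", N).hex())   # 000f4f9d  (Big-Endian)


def xdr_encode_int(value: int) -> bytes:
    """Sender side: always put the integer on the wire in Big-Endian order."""
    return struct.pack(">i", value)


def xdr_decode_int(wire: bytes) -> int:
    """Receiver side: read four Big-Endian bytes back into a native integer."""
    (value,) = struct.unpack(">i", wire)
    return value


print(xdr_decode_int(xdr_encode_int(N)))   # 1003421
```

Both functions give the same result on any host, because the format string names the byte order explicitly instead of relying on the machine's native one.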
Extended RPC: "Distributed Object" Programming Models
The emergence of Object-Oriented Programming (OOP) --
particularly in languages such as C++ and Java -- changed the way in
which programmers thought
about RPC. Instead of executing a remote procedure/function, the
conceptual model became that of "networked objects", and thus
invocation of their object methods across the network.
The three major "frameworks" in this space have (historically) been:
- CORBA (Common Object Request Broker Architecture)
  Developed by the Object Management Group (OMG), this framework was
  the first attempt to create a "distributed object" environment.
  Based on the idea of an "Object Request Broker", it uses a protocol
  called the "Internet Inter-ORB Protocol (IIOP)". Available for most
  platforms.
- DCOM
  This framework was developed by Microsoft, and is specific to their
  platforms and language development environments, although Java is
  supported, and third-party companies have developed implementations
  for other platforms. The "Object Remote Procedure Call (ORPC)"
  protocol on which it's based is derived from the older DCE
  specification, a competitor to Sun's original RPC.
- Java/RMI
  Sun Microsystems developed this system to support its "Java
  Everywhere" model of programming -- it is supported only for the
  Java language, from release 1.1. The underlying protocol is called
  "Java Remote Method Protocol (JRMP)" and was (apparently) developed
  from the original Sun RPC.
Each of these frameworks (and their underlying protocols) is based on
the idea of serializing the objects to be transferred,
transparently to the developer. He/she does not need
to know the details of how the system is implemented, or what it's
doing "underneath". The mapping from a program's (system's) internal
data structures to (and from) what's sent over the network is
automatic.
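Serializing an object and rebuilding it at the far end can be sketched as below. JSON is used purely as a stand-in wire format; none of the frameworks above uses it. The `Account` class and both function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Account:
    owner: str
    balance: int
    history: list = field(default_factory=list)


def serialize(obj: Account) -> bytes:
    """Flatten the object's fields into bytes suitable for the network."""
    return json.dumps(asdict(obj)).encode("utf-8")


def deserialize(wire: bytes) -> Account:
    """Rebuild an equivalent object from the received bytes."""
    return Account(**json.loads(wire))


original = Account("alice", 100, [10, -5])
copy = deserialize(serialize(original))
print(copy == original, copy is original)   # True False
```

The receiver ends up with an equal but distinct object; in a distributed-object framework this mapping is done by the runtime rather than by hand.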
Future RPC: Web Services with SOAP & XML-RPC
The XML data model is rich enough to represent virtually any data
object. Initially, a group working at Microsoft came up with the idea
of doing Remote Procedure Calls using XML as the "serializing"
technology. Their original work has spun off to become the "XML-RPC" project,
which has the aim of "...remote procedure calling using HTTP as
the transport and XML as the encoding. XML-RPC is designed to be as
simple as possible, while allowing complex data structures to be
transmitted, processed and returned." XML-RPC is based on
HTTP's POST request for the "procedure call" and an ordinary HTTP
response to return the results.
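What goes into the HTTP POST body, and what comes back, can be inspected with the Python standard library's xmlrpc.client module without any network at all. The method name "sample.add" is an assumption made for this illustration.

```python
import xmlrpc.client

# Body of the HTTP POST: the "procedure call", serialized as XML.
request_body = xmlrpc.client.dumps((2, 3), methodname="sample.add")
print(request_body)

# What the server does on receipt: recover the parameters and method name.
params, method = xmlrpc.client.loads(request_body)

# Body of the HTTP response: a single return value, again as XML.
response_body = xmlrpc.client.dumps((sum(params),), methodresponse=True)

# What the client does on receipt of the response.
(result,), _ = xmlrpc.client.loads(response_body)
print(method, params, result)
```

In a real exchange the two bodies would travel inside an ordinary HTTP request and response; here they are simply passed as strings.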
A separate project team at Microsoft decided to extend the basic idea
of XML-based RPC into a much more elaborate protocol, calling it the
"Simple Object Access Protocol (SOAP)". It was submitted to W3C, and
SOAP 1.2 has since become a W3C Recommendation. It usually runs over
HTTP, although other transports (eg, SMTP) are possible, and allows
arbitrary objects to be encoded (or serialized). SOAP has the backing
of several influential companies (Microsoft, IBM, etc).
The (recently invented) expression "Web Services" is
based on SOAP, and describes a range of proposed "Business-to-Business"
XML-based services running over HTTP (port 80). Perhaps the most
significant aspect of SOAP-based Web Services is that both the protocol
(usually HTTP) and the core language (XML) are public standards, and
are well understood. Even more significant is that SOAP builds on the
knowledge gained from a decade of "The Web", and from this perspective
alone is likely to succeed.
So What's Wrong with XML?
Not much. Except that in general it creates BIG
datasets. In fact, the XML spec states: "Terseness in XML markup
is of minimal importance". Some typical numbers: a colleague's
recent ASCII database dump of about 9MB turned into 25MB in XML for
network transfer. Why is this a problem?
An oft-quoted(?) technology axiom states (approximately):
"Bandwidth and batteries do not follow Moore's Law". That
is, whilst CPUs roughly double in performance every 18 months, other
more "mundane" technologies don't. Some examples:
- Transferring data to "smart cards" and other embedded devices with
severely limited power, memory and I/O capacity.
- Transferring data to mobile devices. It's obviously more profitable
if a carrier can squeeze more information into the same airtime.
It's also better for battery life if airtime is minimised.
In other words, compactness in data encoding will always be important
in networking.
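The size inflation described above is easy to reproduce. The sketch below encodes the same made-up records once as comma-separated ASCII lines and once as XML; the data and the element names are invented, and the exact ratio will differ for real datasets.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: (id, name, score).
ROWS = [(i, f"user{i}", i * 10) for i in range(1000)]


def as_text(rows) -> bytes:
    """One comma-separated ASCII line per record."""
    return "".join(f"{a},{b},{c}\n" for a, b, c in rows).encode("ascii")


def as_xml(rows) -> bytes:
    """The same records with every field wrapped in a named element."""
    root = ET.Element("records")
    for a, b, c in rows:
        record = ET.SubElement(root, "record")
        ET.SubElement(record, "id").text = str(a)
        ET.SubElement(record, "name").text = b
        ET.SubElement(record, "score").text = str(c)
    return ET.tostring(root)


text_size = len(as_text(ROWS))
xml_size = len(as_xml(ROWS))
print(text_size, xml_size, round(xml_size / text_size, 1))
```

Every field pays for an opening and a closing tag, so short values are dominated by markup.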
Compact Encodings
So what's the best way to encode compact data?
- Answer #1:
  Compress the XML before transmission? Wrong. Why? Unless the
  document is reasonably large, the fixed overhead of typical
  compression formats (eg gzip) can actually make the data bigger. And
  lots of CPU power is needed at the receiver to decompress. This is a
  contentious issue, however.
- Answer #2:
  Ignore the problem. Unfortunately this is wrong too. The problem is
  that in XML the recipient is required to "parse" (in a slightly
  different sense of the word than before) the document to extract
  information. This can be compared to the traditional RPC approach
  where the RPC libraries map information directly to "internal" data
  structures. Parsing is a heavy consumer of CPU, and hence battery
  power. Note that there isn't universal agreement on this point
  either.
- Answer #3:
  Invent a standardised way of converting an XML entity into a new
  (compact) form for transmission. The XML Binary group is working on
  this possibility.
- Answer #4:
  Use an existing compact binary encoding, of which the best known and
  understood is probably ASN.1/BER!
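The claim in Answer #1 about small documents can be checked directly. The sketch below compresses a tiny XML fragment and a long, repetitive one with Python's gzip module; both documents are invented for the test, and where the break-even point falls depends on the data.

```python
import gzip

# A very small message and a large, repetitive one (both made up).
small = b"<temp>21</temp>"
large = (
    b"<readings>"
    + b"".join(b"<temp>%d</temp>" % (20 + i % 5) for i in range(5000))
    + b"</readings>"
)


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (above 1.0 means it grew)."""
    return len(gzip.compress(data)) / len(data)


print(len(small), len(gzip.compress(small)), round(ratio(small), 2))
print(len(large), len(gzip.compress(large)), round(ratio(large), 2))
```

The gzip header and trailer are a fixed cost, so the tiny message grows while the large one shrinks considerably.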
Montagues and Capulets: ASN.1 and XML[3]
One of the fascinating research efforts in this area has been
integrating the ASN.1 "view of the universe" with XML. Consider this:
- The modern way to describe the structure (and meaning) of an XML
document is by XSD -- XML
Schemas. An XSD is written in XML.
- The ASN.1 language is, of course, a schema
language too. In fact, it turns out that it's possible
(and for simple cases, trivial) to automatically convert an XSD
into an ASN.1 definition, and(?) vice-versa.
The ASN.1 community is now suggesting that ASN.1 is a better schema
language than XSD. A document/data entity which is described using
ASN.1 can be automatically mapped to textual XML for network transfer,
and an XER (XML
Encoding Rules) standard is now available. Alternatively, it can be
encoded using BER (or, more likely, one of its relatives such as DER, a
restricted subset of BER) into a compact binary format where this is
needed. The Fast Web Services initiative is now focusing on
commercialising this.
[3] The Montagues and Capulets were the
two feuding families in Shakespeare's play "Romeo and
Juliet". The comparison was (apparently) first made in this
paper (caution: link is MS Powerpoint document).
The tutorial for this lecture is
Tutorial #24.
Copyright © 2005 by
Philip Scott,
La Trobe University.