End tip mode
A tip every other day on POSIX and C. Start from the tip intro page.
Hey, thanks for reading. It was a fun challenge to produce a new tip every other day. I don't envy columnists whose job is to say something interesting every Monday, Wednesday, and Friday.
I'd resolved to keep at it until I reached a certain goal, which I reached just a few days ago (hint: see tip #83 on m4 macros to generate docbook XML). I'll let you know when there's a publication date.
So what's next for this blog? My real interest is in better modeling. I've blogged this before, and it still holds: I'm one of the people who learned a ton of computing technique for the purpose of going somewhere with it, and for me that somewhere is models of the world that go beyond simple linear fare for something more elaborate.
Multilevel models (which go under several other names) are built around taking one model and using its output as an input to another model. That seems like a simple enough sentence, but you're not going to get there until you can generate objects with the appropriate functions in them, and you may further need to conveniently extend an existing struct. Now that you're doing one potentially computationally expensive operation nested inside another, both had darn well better be fast, so you had better know how to thread and how to speed up your database. You're not going to write all this from scratch, so you'll need to know how to use libraries, which you can't do unless you can write a decent makefile.
It may not be evident, but I'm human, and I want to have an easy time at the computer, so within C's grammar, having functions that can take in lists of arbitrary length and functions that take named and optional arguments saved my life. Partially-featured debuggers are like walking around a dark room with sunglasses on (R debug(), I'm looking at you), so you'll be using GDB and you'll need to know how to print all these elaborate structures in the debugger. Having a decent shell and knowing how to use it will also make all this less painful.
And you'd darn well better have a decent documentation system, or else you'll get lost in your own maze.
So that's how I wound up here. The goal has always been the same: I want to be able to go beyond
computationally cheap models and more accurately portray the world around us. It's 2012,
and you'd think that we'd be able to do that without knowing anything about all this
low-level stuff, but we're not there yet. Maybe the stuff I'm working on will help us
to get there. I'll write more on the goal of better modeling in future, sporadic blog posts.
How to learn a new programming language
This is a sequence of a dozen or so steps for getting to know a new programming language.
It came to mind a few weeks ago, during an informal job-type interview. There's always that point in the interview where the interviewer asks, "So what programming languages do you know?" I know she was expecting something like "Yeah, I took some SAS classes"; I told her I had reasonable facility and had done at least some nontrivial work in C, C++, Java, S-PLUS and R, Scheme, Matlab and Octave, Perl, Python, Ruby, FORTRAN 77, and I forget a few.
I find some people like my interviewer, who get by on what they learned in school; and others who find my list of a dozen languages to be average, who see that it's the same thing over and over but with the parens in different places. The intent of this paper is to point out the commonalities, and to help you make the jump to being a computing badass who sees across languages.
This paper is a checklist, a series of exercises, and a series of suggestions of where to look when you first find out that you're going to have to do work in some platform that you never thought about until somebody told you it's going to be your best friend for the duration of the next project. You'll need to work with the documentation and tutorials for your target language to answer all the questions I ask here--and you'll really have to delve. Introductory tutorials tend to be advertisements for the easiest features of a language, but you can't seriously work until you've gotten to know the ugly and the missing parts as well.
In writing this, I contend that this checklist applies to any mainstream programming language (thus excluding specialized languages like TeX, sed, gnuplot, or SAS). When you discover inevitable exceptions, that's great, because revealing those exceptions will reveal the character of the language and what makes it different from (and hopefully better than) all the others. So the questions are easy and lowest-common-denominator, but once you find the right part of the manual that describes how to do the easy things, you're in the right place to see if your language does more.
OK, on to the checklist. I'll start with the basic platform, then move on to data types, functions, and larger blobs of code.
How do I print “Hello world” to the screen?
This was originally an exercise from Kernighan and Ritchie's 1978 C book, and the question it is really asking is the same in all languages: how do I set up the environment in which I have to work?
It is the question I can offer the least guidance on, because there are just so many ways in which these things happen: there are compilers, interpreters, integrated development environments (IDEs), a few languages that primarily run in your browser, and who knows what else.
Ex. Get Hello world to print on your screen.
How do I insert comments?
As essential as this is, some systems don't have an explicit comment mechanism, but just expect you to write a string or some other expression that evaluates to something innocuous.
Python. """This part prints Hello World:"""
Where's the documentation?
Your language may have a clever means of built-in documentation, which you will naturally want to get to know immediately. There may be manual pages, like the man perl set of pages or the POSIX-standard man pages for C libraries (try man 3 printf or man operator).
But the answer to this question also lies with your Web browser. The built-in and official documentation is probably a reference work, and learning from reference works is a bad idea: you also want didactic works that point out what's important in the reference, and you'll find those by checking in with your favorite search engine.
You will almost certainly be using external libraries or packages of some sort, so take some time to find the reference and/or tutorial documentation for those. They may or may not be related to the official documentation.
There's no explicit exercise for this one, because one of the goals of this tutorial is to get you to find and poke around the documentation, but if you want one, how about: Ex. Find the documentation for reading from/writing to a text file. The instructions may not be clear at this point, but you know where they are.
What are numbers like?
Your language has a data type that represents plain text (i.e. strings) and a type or two that represents numbers. We'll get to strings later, but there isn't much variation in number systems. The only serious point of variation is whether an operation on two integers always produces an integer--that is, does 8/3 turn out to be 2.66666 or just the truncated value of 2?
Ex. Check whether your language keeps a wall between integers and real numbers. What is 8/3? What about 8/3. or 8/(3+0.0)?
Python 2. 8/3
→2
Python 3. 8/3
→2.6666
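For comparison, here is a minimal sketch of the same check in C, which keeps the wall up: two integer operands give a truncated integer, and adding 0.0 to one operand is enough to get floating-point division. The function names int_quotient and real_quotient are made up for this illustration.

```c
/* Both operands are integers, so C truncates the quotient toward zero. */
int int_quotient(int a, int b) {
    return a / b;
}

/* b + 0.0 is a double, so the division is done in floating point. */
double real_quotient(int a, int b) {
    return a / (b + 0.0);
}
```

Calling int_quotient(8, 3) gives 2, while real_quotient(8, 3) gives a value a little above 2.66.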
What are the lists like?
I say list, but this could also be called an array or a vector. The big difference is that some of these types tend to have fixed length and some variable, but in all cases, they are an ordered collection of homogeneous items. Some languages don't even hold to the requirement that all elements of the list/array/vector have the same type, but we'll stick to that for now.
Ex. Create a list of five numbers, 1, 2, 3, 4, 5 (don't bother with a for loop or such cleverness yet; just type it out). Print them to the screen.
Ex. Double all of the elements of your list; print.
Were you able to double the elements in place, or did you have to make a copy (perhaps a copy with the same name)? Did you have to double each element individually, or were you able to write something like 2*my_list?
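As one data point, C sits at the "in place, one element at a time" end of this range: there is no 2*my_list, but nothing gets copied either. A minimal sketch, where double_in_place is a name invented for this example:

```c
#include <stddef.h>

/* Double every element where it sits; no copy of the array is made. */
void double_in_place(int *list, size_t len) {
    for (size_t i = 0; i < len; i++)
        list[i] *= 2;
}
```

The caller's array is modified directly, because the function receives a pointer to it rather than a copy.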
How do I declare a new variable?
Your language may make some smug claims that it doesn't need you to declare variables, but you sometimes do need a means of indicating the type of an item. Here is a snippet of code for you to implement, which starts with an empty list, and then grows it by appending one item at a time:
my_list = [ ]    #an empty list
for i=1 to 10:
    my_list = [my_list, i]
That first line is a declaration, even if you don't want to call it that.
The problem here is that all languages have some set of rules to automatically convert one type to another. The more types you have, and the more "do what I mean" the language tries to be, the more conversion rules you need (and every language has at least a few). On the first step in the loop, when you append 1 to an empty list, you need to know if you can refer to a not-yet-existent list, whether that empty list will somehow get typecast away, and whether your one-element list [1] remains that way or gets cast into an integer.
Ex. Let a variable x be an arbitrary integer. Write code like the sample above to incrementally build the list, [1, 2, 3, ..., x].
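Here is one hedged way the exercise might look in C, where the "empty list" is a NULL pointer and each append is a realloc. The name count_to and the choice to return NULL for x < 1 are assumptions made for this sketch, not the only reasonable design.

```c
#include <stdlib.h>

/* Build [1, 2, ..., x] one element at a time, growing the array on each step.
   Returns NULL if x < 1 or if memory runs out; the caller frees the result. */
int *count_to(int x) {
    int *list = NULL;                 /* the declaration of an empty list */
    for (int i = 1; i <= x; i++) {
        /* realloc(NULL, n) behaves like malloc(n), so the first pass works too */
        int *bigger = realloc(list, i * sizeof(int));
        if (!bigger) {
            free(list);
            return NULL;
        }
        list = bigger;
        list[i - 1] = i;
    }
    return list;
}
```

Note that the type question is settled at the declaration: the list is an int* from the first line, so there is no point at which a one-element list could be mistaken for an integer.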
Ex. Casting: assign the number ten to a text string ("10"), perhaps via a print-to-string function or a cast-to-string-type; convert the text string "10" to an integer variable.
C. char *str; asprintf(&str, "%i", 10); int n=atoi(str);
How are references handled?
Every operation that takes in a value and returns a value can either modify its input where it is, or make a copy of the input and mangle the copy to produce output. It is imperative that you know at every step of your program which is happening.
Some languages lack pointers/references/aliases, and so copy every single time. If your language does that, bear in mind that it will be slow for operations on large data sets, and when we get to function calls, remember that modifying the variable you passed in to a function is really just modifying a copy of the passed-in variable. You can sometimes use this to your advantage to write shorter functions that don't have to take care to prevent side-effects.
On the other end, some languages mostly lack copying, in the sense that they work with aliases/references/pointers by default. After you assign x = y, if you double x, then y will double along with it. But there is going to be some way to specify x = copy(y), which you should take note of for occasional cases when you'll need it.
Ex. [Impossible in some languages] You have a list [1 2 3 ...9 10]; give the list an alias as per the x = y example above. Verify that the alias works by changing the second element in the aliased list to 100, then print the original list.
Ex. Copy your list to a new variable with a new name. Change the second element in your new variable to 100; verify that the original list didn't change.
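A minimal sketch of both exercises in C, where an alias is just a second pointer to the same memory and a copy has to be requested explicitly with memcpy. The function alias_vs_copy is a hypothetical wrapper written for this example; it returns 1 if both checks come out as described.

```c
#include <string.h>

/* Returns 1 if writing through an alias changes the original
   and writing to a copy does not. */
int alias_vs_copy(void) {
    int original[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    int *alias = original;            /* x = y: a second name, same memory */
    alias[1] = 100;
    int alias_changed_original = (original[1] == 100);

    int copy[10];
    memcpy(copy, original, sizeof(original));   /* x = copy(y) */
    copy[1] = -1;
    int copy_left_original_alone = (original[1] == 100);

    return alias_changed_original && copy_left_original_alone;
}
```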
How do I handle strings of text?
If you stop to think about how they are handled in memory, you quickly realize that strings are hard. A number has a fixed memory footprint: you don't need twice as much memory to write 20 or 4 as you need to write 2. But Hello needs five slots in memory while Hi only needs two. If you wrote HHello, then fixing your typo means moving every item in the string over by one in the little array that is the sequence of letters.
C is famously terrible about hiding these details from you, to the point that you might want to check GLib for its smarter string library. Many languages (C family included) treat them like arrays, so you can do array-like operations. Some languages make heavy use of regular expressions for string manipulation, and if you're already good with regexes, then great--you can reduce your problem count with them.
Ex. Fix the typo: put HHello in a string and replace it with the string with the extra H lopped off. Were you able to do it in place, or did you have to copy to a new variable?
Perl. $x = "HHello"; $x =~ s/HH/H/;
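For contrast, here is a sketch of the array-style fix in C, where "moving every item over by one" is literally what happens. The helper drop_char is a name made up for this illustration, and it assumes the string lives in writable memory (an array, not a string literal).

```c
#include <string.h>

/* Remove the character at position pos by sliding the tail of the string,
   including the terminating '\0', one slot to the left. */
void drop_char(char *s, size_t pos) {
    size_t len = strlen(s);
    if (pos < len)
        memmove(s + pos, s + pos + 1, len - pos);
}
```

This edits the string in place; memmove is used rather than memcpy because the source and destination overlap.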
What are the structs like?
Arrays are for homogeneous items; structures, dictionaries, or hashes can be used for heterogeneous collections, where each item is a named element of the whole.
If your preferred language uses formally declared structures, you may one day find yourself in a language that uses a dictionary or hash--an array with names instead of numeric indices--to serve as your structure. Conversely, if you're used to dictionaries or hashes, bear in mind that some languages require small collections of heterogeneous items be declared in structures. For a long list of homogeneous items that happen to have the same type, you may have to just use a numeric index, generate a hash-type device using an array of key/value structs, or find a key/value system in the libraries.[1]
Ex. Write a structure, which we'll call the rational structure, with three parts representing a fraction: an integer numerator, an integer denominator, and a text name (like "5/6"). For now, just declare the type if necessary, fill an instance of such a structure for 5/6, and print the elements.
How do I write and call a function? Are function arguments copied in or pointed to?
The form of a function call doesn't change much from language to language. You define the function in one place, and call it with a form like new_fn(x) or, for LISP-inspired languages, a form like (new_fn x).
Ex. Reusing the code you wrote above, write a function that takes in an integer x, and returns a list [1, 2, ..., x].
Some languages allow you to write inline functions--nameless little routines for throwaway transformations of a list [a, b, c] into [f(a), f(b), f(c)]. This may be in the index under list comprehension, lambda functions, or anonymous functions. Those languages that allow you to do this kind of thing tend to rely heavily on the facility, so check that it's possible, and if it is, redo the above example about doubling your list using that feature.
Scheme. (define L (list 1 2 3 4 5))
(map (lambda (x) (* 2 x)) L)
How do I debug a function?
Here are the sort of things I mean by debugging:
- pausing your program at a certain point;
- getting the current value of any variables that exist in that function;
- jumping to a parent function and checking variable values there;
- stepping past the point where you paused, one line of code at a time.
There's diversity here: for C, you might use the GNU debugger, or your IDE might have a built-in hook; some interpreted languages have the facility built in to the interpreter; some have a debugging library that you import like any other library; JavaScript has some browser plugins; and bash has a verbose mode, which doesn't do all of the above tasks but is at least a start.
Ex. Write a program/script that calls your function to produce a list 100 elements long. Set a breakpoint that stops when your list is 34 items long and print the list as it looks at that moment.
The worst case is inserting print statements where you need to know a variable's value. It takes a few seconds to set up each print statement, and is annoying when you run and then find out that you needed one more variable's value or the same variable's value three lines down. If the debugging facilities I enumerated above really are not available for your language of choice, bear in mind that large projects may be a pain relative to how they'd work in other languages.
What are the scoping rules?
At the least, you need to know how to declare variables that are global to the entire program, and variables that are local to a function. Some languages go crazy from there; I count four different scoping systems that you'd have to bear in mind for C++ (file, curly brace delimited, object, namespace).
Ex. Rewrite your function to have an explicit iterator (i.e., use a for i=1 to N sort of loop). After calling the function, try to print the value of your iterator (i), and verify that it is not defined outside the function's context.
Ex. Make the iterator global, so that its value is N after the function is called. [Then revert to the prior version, because having an iterator as a global variable is absurdly bad form.]
In lexical scoping, variables that are not explicitly defined in a function, and so would normally be looked up in the global environment, are looked up in the environment where the function was defined. If you're in a lexically-scoped language, you can do some tricks to generate on-the-fly specialized functions, but well before you get creative with those methods you'll need to make sure you don't get confused and presume that you are looking at a global variable as it is now when you are actually looking at the version that was bound to the function when it was defined.
How do I maintain continuity across function calls?
To fix the idea, I'll start with the exercise:
Ex. Write a function so that on the first call, count() returns one, on the next call, count() returns two, and so on to infinity.
The cheap way of making this work is to use a global variable. C goes one step further with the static keyword. Some have an explicit syntax for continuations.
C. int count(){static int i=0; return ++i;}
Ex. Write a function that takes in a list or NULL/nil/0/whatever. If it gets a list, it returns the first item in the list and stores the list internally; if it gets a blank marker, then it returns the next item in the stored list; use the mod (or %) operator to cycle back to the beginning of the list if you hit the end.
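One possible C answer to this exercise, again leaning on static variables. This is only a sketch: next_item is a made-up name, C needs the length passed alongside the list, and the function assumes a list has been stored before the first NULL call.

```c
#include <stddef.h>

/* Pass a list to store it and get its first item; pass NULL to get the
   next item of the stored list, wrapping around at the end.
   Assumes a list has been stored before the first NULL call. */
int next_item(const int *list, size_t len) {
    static const int *stored;
    static size_t stored_len, position;
    if (list) {
        stored = list;
        stored_len = len;
        position = 0;
    } else {
        position = (position + 1) % stored_len;   /* cycle back at the end */
    }
    return stored[position];
}
```

Because the state lives in the function itself, there can only be one stored list at a time, which is exactly the limitation the next exercise works around.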
Ex. If your language has lexical scoping, write a function that takes in a list and returns a function. The returned function is as above: on each call, it will return the next item in the list, cycling back to the beginning as needed. Here's a sample use of the function you're going to write:
next_prime = generate_list_step_function([1 2 3 5 7 11 13])
non_prime = generate_list_step_function([4 6 8 9 10 12 14])
next_prime()
next_prime()
next_prime()
non_prime()
which will print 1 2 3 4. In some languages without lexical scoping, implementing this requires some creativity; in some it is impossible (in which case you'll probably wind up sending in the list every time).
Can I do text substitutions (i.e., macros)?
Your typical language mostly focuses on functions that generate their own space in which to work (as per the scope section); the text substitution simply replaces a blob of text with another blob of text.
There are some languages that have no macro processing abilities, some that have a preprocessor that takes the text that is your program file and converts pieces of text into other pieces of text, some that are sufficiently self-aware to convert a text string to operational code in real time via an eval-type function, and some that use lazy evaluation to leave your text as text until you want it to be evaluated.
Ex. Write a macro you'd call like this: call_function(your_function, 10) or like this: call_function("your_function", 10). The macro would then print Calling the your_function function to the screen and then execute your_function(10). Use this to call the list-generating function you wrote above.
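In C, the preprocessor covers the first calling form. The sketch below uses the # stringizing operator to turn the macro argument into text for the message while still using it as code for the call; triple is a hypothetical stand-in for your_function.

```c
#include <stdio.h>

/* #fn turns the macro argument into a string literal; fn(arg) stays as code.
   The comma operator makes the whole expression evaluate to fn(arg). */
#define call_function(fn, arg) \
    (printf("Calling the %s function\n", #fn), fn(arg))

int triple(int x) { return 3 * x; }   /* stand-in for your_function */
```

Writing call_function(triple, 10) prints the message with the name triple in it and then evaluates triple(10). The string form, call_function("your_function", 10), is not something the C preprocessor can do; that version needs an eval-type facility.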
OK, to this point, everything has been about the details of the language: how do I deal with types? Can I do clever tricks with scope and persistent variables? The rest of this is going to be about design: how can I introduce new nouns, and the verbs that those nouns can do? How do I package them so I can easily use them for the next project, and how do I use already packaged elements for today's project?
How do I load libraries/packages so I don't have to reinvent wheels?
The most basic sort of library inclusion is to simply have a mechanism for including a text file with some code verbatim at the head of the text file you're working with now.
Let me give you a few examples, because the point of the package concept is that not everybody has similar needs.
Ex. A: Make a hundred draws from a standard Normal distribution.
Ex. B: Load an XML document into memory, and print all of one type of element.
Ex. C: Open an empty SQLite database and create a table.
How do I set up functions that act on a specific structure?
So very much of the code in the world involves a single specialized data structure, and a set of functions that manipulate that structure. This, of course, is the basis of object-oriented coding, wherein those functions to manipulate a structure are a part of the structure itself. But before that happened, there were simple naming conventions, which worked just fine but weren't as attractive. Some languages have no discernible naming convention, in which case there's a null answer to this question: just put the functions that act on a structure wherever.
Ex. Write a rational_set function that takes in two integers and outputs a rational structure, a get_value function that returns the value of the fraction, a get_name function that returns the fraction as a text string, and an add or + function that adds two rationals. If appropriate, also define a free or destroy function. Add those functions to the object itself if appropriate.
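Here is a hedged sketch of the naming-convention approach in C, where every function that acts on the structure carries a rational_ prefix. The prefix on get_value, get_name, and add, the fixed 32-byte name buffer, and the decision not to reduce fractions to lowest terms are all choices made for this illustration. Since nothing is heap-allocated, no free function is needed.

```c
#include <stdio.h>

typedef struct {
    int numerator, denominator;
    char name[32];            /* text form, e.g. "5/6" */
} rational;

rational rational_set(int num, int den) {
    rational out = {.numerator = num, .denominator = den};
    snprintf(out.name, sizeof(out.name), "%d/%d", num, den);
    return out;
}

double rational_get_value(rational r) {
    return (double)r.numerator / r.denominator;
}

const char *rational_get_name(const rational *r) {
    return r->name;
}

/* a/b + c/d = (ad + cb)/(bd); the result is not reduced to lowest terms. */
rational rational_add(rational a, rational b) {
    return rational_set(a.numerator * b.denominator + b.numerator * a.denominator,
                        a.denominator * b.denominator);
}
```

For example, adding rational_set(1, 2) and rational_set(1, 3) gives a structure with numerator 5, denominator 6, and the name "5/6".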
How do I get an auxiliary structure that builds upon a base structure?
OK, so you wrote a structure above that has a fixed list of elements, but would now like to extend the structure in some way. As above, some systems will require that you declare the structure beforehand, so you will need to extend either via a structure-extending mechanism (new struct is a version of old struct), or a new structure that consists of some elements and an embedded old structure (new struct has a version of old struct). The wrapping-around approach works everywhere; the object-inheritance method of directly extending a structure is pretty common, and when it exists people will expect you to make some use of it.[2] Even in languages where extending a structure just means tacking another element onto the list, there are reasons for an object-extension grammar.
Ex. Use accepted custom to create a new structure, a signed rational, where all of the elements are positive, but there is another element sign, which is 1 if the fraction is positive, -1 if it is negative. Write new set/get/add functions that make use of the new structure. If you are comfortable with complex numbers, make this exercise more interesting by extending your rationals to complex rationals.
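A minimal sketch of the wrapping-around ("has a") approach in C, which has no inheritance to offer. The struct is redeclared here in a stripped-down form so the block stands alone; signed_rational and its functions are names invented for the example, and treating zero as positive is an assumption.

```c
#include <stdlib.h>

typedef struct { int numerator, denominator; } rational;

/* "Has-a" extension: the new struct wraps an old one and adds a sign. */
typedef struct {
    rational magnitude;     /* both elements kept non-negative */
    int sign;               /* 1 or -1; zero is treated as positive */
} signed_rational;

signed_rational signed_rational_set(int num, int den) {
    signed_rational out;
    /* negative only when the operands have opposite signs and num is nonzero */
    out.sign = (num != 0 && (num < 0) != (den < 0)) ? -1 : 1;
    out.magnitude.numerator = abs(num);
    out.magnitude.denominator = abs(den);
    return out;
}

double signed_rational_get_value(signed_rational s) {
    return s.sign * (double)s.magnitude.numerator / s.magnitude.denominator;
}
```

Any function written for the plain rational can still be handed s.magnitude, which is the main payoff of embedding the old structure rather than copying its fields.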
How do I package my own stuff?
Since you already loaded a package/library/file above, you have some idea of how packages work in your new language. There may or may not be extra steps to packaging your own.
Ex. Package the rational structures' declarations (if any) and the get/set/add functions in a
separate file (or files if you need a header or manifest or what-have-you).
To make all this work, you wrote tests that made sure that the functions
worked as planned; make sure those tests still work when you import/include the
structure and functions as a package or library.
Footnotes
[1] Dictionary, hash, or struct--they're all about as good, but there are some older languages that have absolutely no way to bundle a set of heterogeneous variables. They are obsolete, and should be used only when the situation really gives you no other choice.
[2] This is a claim about custom across communities, and therefore can only be roughly true. Some languages have a bolted-on object model that never received wide acceptance--some even have two bolted-on object models.
Tip 84: Use m4 to automate OOP boilerplate
level: macro hacker, bored with OOP
purpose: get to the good stuff
prerequisite: Last episode, where we used m4 to generate HTML
Every object needs a typedef, a new/copy/free function, and a header for public use. These functions tend to be pretty boilerplate, and every language that expects its users to generate new objects all day long has some means of autogenerating this stuff. Some have an elaborate catalog of templates; some expect that every IDE will have a means of generating all this crap via a text editor macro.
I wrote some C preprocessor macros to do all this at one point, but here's another approach, via m4, a little macro language which does nothing better than generate boilerplate.
Of course, you're still going to have to do some work, inventing the elements that go into the struct, and deciding how they get allocated, copied, and freed, but you have a framework to put that in for free. You might not like my boilerplate object declarations anyway. As with many of these blog entries, this is more to give you a potentially useful concept and enough code to start hacking it to work the way you do.
OK, enough with the caveats. Here's the m4 code for you. It's self-documenting, especially if you read last episode, where we used m4 to generate HTML.
m4_divert(-1)
MakeHeader(objname) generates a header file with a
typedef and new/copy/free elements.
MakeObject(objname) generates the boilerplate new/copy/free
functions themselves.
Usage:
echo "MakeHeader(objname)" | m4 -P objectify.m4 '-' > obj.h
echo "MakeObject(objname)" | m4 -P objectify.m4 '-' > obj.c
There are markers for where the actual content goes.
m4_changequote(`‹',`›')
m4_changecom(‹m4 comment:›)
m4_define(MakeHeader, m4_dnl
typedef struct $1 {
//Place elements here.
} $1;
$1 *$1_new(void);
$1 *$1_copy($1 *in);
void $1_free($1 *in);)
m4_define(MakeObject,#include <stdlib.h>
#include "$1.h"
$1 *$1_new( ){
$1 *out = malloc(sizeof($1));
*out = ($1){ };
return out;
}
$1 *$1_copy($1 *in){
$1 *out = malloc(sizeof($1));
*out = *in;
//Element pointers and other copying happens here
return out;
}
void $1_free($1 *in){
//free subelements here
free(in);
})
m4_divert(0)m4_dnl
Tip 83: Use m4 in the middle of your documents
level: macro hacker
purpose: eliminate any and all repetition
The m4 macro language is mostly interesting because its macros are intended to be put anywhere in any text file. We'll bulletproof the example in a little bit, but let's say that we have the macro file:
m4_divert(-1)
m4_define(Emph, <em>$1</em>)
m4_divert(1)</body></html>
m4_divert(0)<html><head><meta charset="utf-8" /></head><body>
and the text file
Welcome to my Emph(lovely) web site.
then after you run m4 -P macros.m4 text > out.html you'd wind up with:
<html><head><meta charset="utf-8" /></head><body>
Welcome to my <em>lovely</em> web site.
</body></html>
What just happened:
- We just automated a whole lot of cruft at once. There is just no way to produce HTML, XML, or even certain blocks of LaTeX without help from something that automates redundancy, and m4 will do that for you.
- The m4_define is the simplest macro definition: you give the macro definer a name and an expansion, where the expansion can have the usual positional parameters like $1, $2, .... That's all it takes for m4 to know what to do when it gets into your text and sees Emph(lovely).
- Diversions: m4_divert(-1) sends output to /dev/null. So after that line m4 is reading in macro definitions but isn't writing anything. m4_divert(1) stores output into buffer 1. m4_divert(0) writes to standard out, which should be the normal course of affairs. At the end of the text, the buffers get written to output in order, which is how we got the end tags at the end of the file. Or, write a footer.m4 to put on the command line after your main text. Use m4_undivert(1) to empty out buffer 1 sooner.
The expansions are aggressive: if your macro doesn't have parens after it, it'll still get
expanded, so if you happen to have Emph in plain text, that'll get turned into HTML
tags. If we're going to have m4 operate on an arbitrary text or code file, we need to make
certain that it doesn't surprise us. E.g., use macro names that don't make sense as standalone
strings. Notice also that we're using m4 -P, which puts that m4_ tag at the head of
every function name. Otherwise, if you use the word divert in your text, it gets
eaten. You may also find stray line breaks due to expansions; use m4_dnl to prevent
those (delete to new line). Here's an m4 file with some further protections and tricks built in:
m4_divert(-1)
m4_changequote(`‹',`›') # m4 eats all quote-endquote markers, so make sure
# they will never appear in your text by using odd ones.
# Notice how these aren't the plain <> signs;
# vim users, try :help digraph.
# I also wrote a vim macro to write (‹ and ›) for me.
# To avoid sad surprises, wrap all macro inputs in these.
m4_changecom(‹m4 comment:›) #Octothorpes appear in plain text.
#A macro to define new macros.
m4_define(newXML, ‹m4_define($1, <$2>‹$›‹1›</$2>)m4_dnl›)
newXML(Emph,em)
newXML(Pp,p)
m4_divert(1)</body></html>
m4_divert(0)<html><head><meta charset="utf-8" /></head><body>
# Let's throw some sample uses here, so we can test the
# m4 file with itself. When we're happy, move the m4 divert(0)
# line to the end so these get sent to /dev/null.
Pp(‹Dear reader,›)
Pp(‹HTML was Emph(originally) designed to be handwritten,
but now generating well-formed documents is just a pain.›)
The language does a few more tricks: optional arguments, if/thens, loops, but if you keep
it simple, you can fix a lot of annoyances with just a few lines of macro definitions.
Tip 82: Insert NA, NaN, and other markers into your data set
level: somewhat advanced data trickster
purpose: leave notes
The IEEE floating-point standard, which is how your computer represents numbers, has a form for NaN, which indicates a math error, like 0/0 or ln(-3).
In fact, it has a lot of forms for NaN: the sign bit can be zero or one, then the exponent is all ones, and the rest is nonzero, so you have a bunch of bits like this: S11111111MMMMMMMMMMMMMMMMMMMMMMM, where S is the sign and M the unspecified mantissa. (That is the layout of a 32-bit float; a double follows the same scheme with 11 exponent bits and 52 mantissa bits.)
That gives us a whole lot of room to play around, because we can specify those Ms to be anything we want (as long as it's nonzero, because zero mantissa indicates ± infinity, depending on the sign bit). Once we have a way to control those free bits, we can add all kinds of distinct semaphores into a cell in a numeric grid.
The little program below generates and uses an NA marker. The trick is primarily in
set_na, so focus your attention there first. We first produce a plain NaN by
calculating 0/0. (where the dot is important, because we need floating-point division,
not integer division). Then, we point a char* at the bit pattern, where char
is C's way of saying byte. Now that we can manipulate individual bytes of the
floating-point number, we set the third byte to match the character
a, where the third byte is comfortably in the middle of the bit pattern that is this NaN.
Now we have a bit pattern that is a NaN, but a very specific one, and one that the system
didn't generate. Now the is_na function just needs to check whether the bit
pattern of the number we're testing matches the special bit pattern that we made up. We
do this by treating both inputs as character strings and doing character-by-character
comparison.
#include <stdio.h>
#include <math.h> //isnan
#include <stdlib.h>

double ref; //the NA bit pattern; stays zero until set_na is first called.

int is_na(double in){
    if (!ref) return 0; //set_na was never called==>no NAs yet.
    //view both numbers as arrays of bytes and compare byte by byte.
    char *cc = (char *)(&in);
    char *cr = (char *)(&ref);
    for (size_t i=0; i< sizeof(double); i++)
        if (cc[i] != cr[i]) return 0;
    return 1;
}

double set_na(){
    if (!ref) {
        //take an NaN, fiddle with a byte to make it distinct.
        ref=0/0.;
        char *cr = (char *)(&ref);
        cr[2]='a';
    }
    return ref;
}

int main(){
    double x = set_na();
    double y = x;
    printf("Is x=set_na() NA? %i\n", is_na(x));
    printf("Is x=set_na() NAN? %i\n", isnan(x));
    printf("Is y=x NA? %i\n", is_na(y));
    printf("Is 0/0 NA? %i\n", is_na(0/0.));
    printf("Is 8 NA? %i\n", is_na(8));
}
discussion:
I assert that this is pretty portable. However, I can't get a fix on the rules for what NaN bit patterns get produced on every machine. Anybody have examples where the above would fail, and how?
My NA marker is a type of NaN; not all NaNs are NAs. This seems logical and correct to me: if the data is missing, we can't do math on it; a math error is from mis-processing extant data, not a missing data point. The R system does it the other way `round: NaN is a type of NA, but NA is not a type of NaN. They're smart people who have good reasons for (most of) what they do, but this seems backward to me. I'm not sure how I'd implement their version using the tool here.
I produced a single semaphore to store in a numeric data point, using the character a as the key element of the marker. Given that the alphabet continues on to b, c, ..., and A, B, ... are different characters entirely, we can insert a few dozen other distinct markers directly into our data set. SAS users love to do this, inserting .a through .z in their data sets for all sorts of exceptions.
I've asked many geeks whether they actually use the distinction between NaN and NA, like whether they write code of the form if (is.NA) do this; else if (is.NAN) do something else, and I couldn't find a single person who does. And so, a question for you, the reader: given the ability to place arbitrary semaphores in a matrix of numbers, what would you use it for?
on Wednesday, March 14th, Max Lybbert said
It's my understanding that the C Standard does not require IEEE floating point (or twos-complement signed arithmetic!), although it's a pretty safe bet that your platform provides IEEE floating point arithmetic.
on Wednesday, March 14th, Max Lybbert said
Oops, my template didn't make it through the mangler. There should be a "<double>" in the std::numeric_limits code.
on Thursday, March 15th, the author said
To back up your statement that `it's a pretty safe bet' that a given machine works with the IEEE 754, here's a Stack Overflow query enumerating the relatively exotic hardware where IEEE 754 isn't used.
on Saturday, March 17th, Max Lybbert said
Thinking more about this, further evidence that you can portably play with the guts of IEEE floating point numbers is the use of NaN boxing in Javascript implementations ( http://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations ).
Tip 81: Deprecate floats
level: basic numerics
purpose: Ignore lots of caveats
I don't know if you've noticed, but I stopped using float.
There's a lot of advice about how careful you have to be to avoid floating-point traps all the way along. Much of it is still valid today, but much of it is easy to handle quickly: use double instead of float.
For example, have a look at the caveat on p 24 of Writing Scientific Software, which advises users to avoid what they call the single-pass method of calculating variances.
They give an example which is ill-conditioned. I reprint their list of numbers below, and you can see that even though the numbers are in the tens of thousands, they differ mostly after the decimal. The authors get terrible results, with a variance that seems off by two orders of magnitude.
Apophenia uses the advised-against single-pass method, as does the GSL. How bad are the results? Not bad at all, actually.
Here's the code to run their example. I do the example twice: once with the ill-conditioned version, and once after subtracting 34,120 from every number, which thus gives us something that even a plain float can handle with full precision. We can be confident that the results given the not-ill-conditioned numbers are accurate.
#include <apop.h>

int main(){
    //the ill-conditioned list from the book.
    apop_data *d = apop_data_fill(apop_data_alloc(1, 6),
                        34124.75,
                        34124.48,
                        34124.90,
                        34125.31,
                        34125.05,
                        34124.98);
    double m, var;
    apop_matrix_mean_and_var(d->matrix, &m, &var);
    printf("mean: %.10g var: %.10g\n", m, var*6/5.);

    //the same list after subtracting 34,120 from every number.
    apop_data_fill(d,
                        4.75,
                        4.48,
                        4.90,
                        5.31,
                        5.05,
                        4.98);
    apop_matrix_mean_and_var(d->matrix, &m, &var);
    printf("mean: %.10g var: %.10g\n", m, var*6/5.);
}
- Apophenia returns the population variance; we scale to produce the sample variance, which the authors prefer.
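- I used %g as the format specifier in the printfs; that's the `general' form, which picks between fixed and exponential notation as needed and, like every printf floating-point specifier, accepts both floats and doubles.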
- Internally, apop_matrix_mean_and_var uses a long double, following the basic principle that you should keep your intermediate values one step more precise to prevent intermediate roundoffs from aggregating into problems. It used to just use a double, and the results weren't actually different.
Here are the results:
mean: 34124.91167 var: 0.07901676614
mean: 4.911666667 var: 0.07901666667
So the means are off by 34,120 but otherwise precisely identical (the .66666 would continue off the page if we let it), and the variances differ by one in the sixth significant digit, which is frankly not worth caring about. The ill-conditioning had no appreciable effect.
That, dear reader, is technological progress. Where a book from 2006 told us to take great care in implementing algorithms, all we had to do was throw twice as much space at the problem. If there's a speed difference between a program written with all doubles and one written with all floats, I certainly can't perceive it, and it's worth extra seconds to be able to ignore so many caveats.
long long int
Should we use long ints everywhere integers are used? The case isn't quite as open-and-shut. A float representation of π is less precise than a double representation of π, even though we're in the ballpark of three; but int and long int representations of numbers up to a couple billion are precisely identical. The range of an int goes up to about ±2.1 billion on a typical machine (I read that on some machines it can be scandalously short, like around 30,000, but I wonder if those are all obsolete at this point). If you think there's even a remote possibility that you have a variable that might multiply its way up to the billions (that's just 200×200×100×500), then you certainly need to use a long int or even a long long int, or else your answer won't just be imprecise, it'll be entirely wrong, as the value typically wraps around from +2.1 billion to -2.1 billion (formally, the C standard leaves signed overflow undefined, which is no better).
long ints aren't quite as immediate a drop-in replacement for ints as doubles are for floats. I suppose long int looks ugly all over the place, though you can get away fine with just writing long, and you'll need to modify all your printfs to use %li instead of %i. Have a look at /usr/include/limits.h for details; on my machine it says that int and long int are actually identical.
But, again, if there actually is a cost to using longs and long longs with great frequency, it's darn cheap relative to the cost of going over the max and rolling over to a negative number.
