Which characters cannot appear in http, https, or mailto URLs?

Right now, the format Dialog uses to define resources is:

(define resource @IDENT) URL [ ; ALT-TEXT]

Where IDENT is an identifier for the resource, URL is either a URL starting with http, https, or mailto or a path to a local file, and the part in brackets is optional.

There’s a feature of the Å-machine that’s currently unused, that allows each resource to also embed an “option string” that the interpreter can use however it wants. For example, the option string could specify whether a sound should be looped or not, or the volume it should play at.

I’d like to add this capability to Dialog—but since absolutely anything is legal in the alt-text, the option string will have to go before it. (I want to maintain backward compatibility, and that means not suddenly reinterpreting part of a user’s alt-text as an option string.)

In other words, I want to change the specification to:

(define resource @IDENT) URL [DELIM OPTION] [ ; ALT-TEXT]

Where DELIM is some sort of delimiter.

The question is, what should that delimiter be? It can’t be ;, for reasons of ambiguity. It can’t be anything that can legally appear in a URL, which rules out :. And it should be printable ASCII, for the sake of users’ sanity. I would prefer to also avoid @, #, $, (, ), [, ], {, }, since those always need to be escaped in Dialog source text, but they’re not entirely off the table.

The set of characters that’s legal in a URL seems like the sort of thing that should be easy to look up, but different sources have given me different answers. I figure IF people do enough with HTML that they’ve probably thought about this before!

My current inclination is to use !, but sources disagree on whether or not that’s a valid character in URLs. It looks like it is valid in NTFS filenames, but since it’s not valid in FAT filenames, I think it’s perfectly reasonable for Dialog to disallow it for the sake of portability. (I don’t think anyone wants their code to suddenly start acting differently when they put it on a flash drive!)

Or perhaps >, which is disallowed in URLs and filenames across most systems, but is perhaps less immediately memorable.

I don’t know anything about Dialog. But could it be something like ;;; (or another repeated string)?

Unfortunately not, since it’s currently possible that someone wants the alt text “;;” for their picture of two semicolons.

Of course, we are incrementing the major version number with the new release, so it’s okay to break backwards compatibility if we have a good reason.

Are you sure about > being widely disallowed? I just tested it and had no issue creating a file named .txt on an ext4 filesystem on my Linux desktop. I do have to escape the angle brackets on the command line, and I couldn’t copy it to my FAT32-formatted SD card, but I had no issue using such a filename on my system drive.

Edit: Okay, for some reason, the forum ate the word test inside angle brackets, corrupting the test filename I used.

All characters are legal in filenames except slash. Backslash may be a problem on Windows but I don’t know the details on that.

I believe all characters are usable in URLs. It’s just a question of how you have to quote them.

(I am somewhat dodging the question by talking about characters being usable in URLs. I think, digging through some discussion, that there are characters that are illegal in URLs but people still use them and browsers cope with it.)

(Control characters are an extra headache, particularly the null \000 character. I’m leaving that as an aside because I’m pretty sure you don’t want to use control characters for delimiters!)

Any reason you can’t add:

(define resource with options @IDENT) URL ; OPTS [ ; ALT-TEXT]

I think > might in fact be one of the most memorable characters in the IF space…

There are more characters «valid» in URLs and email addresses than are normally «used», but here is some info:

That would work, but it feels less syntactically elegant.

What about THISISTHEDIALOGURLDELIMINATORIFSOMEHOWYOUGOTTHISCOMBINATIONOFLETTERSINYOURURLWTFAREYOUDOING?

I mean, if you mandate a space between the URL and the delimiter, you could really choose any keyword as a delimiter, right? (We could assume there won’t be spaces in the URL, I think, or even mandate it.)

True! It sounds like there are basically no characters that are illegal in filenames, but Dialog can just mandate specific rules for the filenames used in assets (like, currently, not containing semicolons), since those are always under the author’s control. URLs and email addresses aren’t, so those rules matter more.

And those rules, to my understanding, forbid unescaped spaces (they need to be encoded as %20), right? So we could just use whitespace, in theory.
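For what it’s worth, here is a small Python sketch of what that encoding looks like in practice. The filename is made up for illustration, and this only shows that a properly encoded URL contains no raw space, not that every URL an author pastes in will already be encoded.

```python
from urllib.parse import quote, unquote

# Hypothetical asset name containing spaces.
raw = "my sound file.ogg"

# quote() percent-encodes characters that may not appear raw in a URL path.
encoded = quote(raw)
print(encoded)  # my%20sound%20file.ogg

# The encoded form has no whitespace left, so whitespace is free to act as a delimiter.
print(" " in encoded)  # False

# Decoding restores the original name.
print(unquote(encoded) == raw)  # True
```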

Pretty sure the vertical bar is disallowed in URLs, hence why it’s used to delimit the link label in wiki links.

The spec would like you to percent-encode it, but if you use a raw | character, browsers and servers will permit it.

The set of characters that’s legal in a URL seems like the sort of thing that should be easy to look up

I’m not an expert here, but it seems generic URL syntax is defined in RFC 1738. In section 5, you can see all of the allowed characters. It has since been updated by RFC 3986, which describes URI syntax in its Appendix A. (I’ve always been a little fuzzy on the distinction between URL and URI.)

Unless I’m missing something, any printable ASCII character can appear in a URL (or URI) except for space and backslash.

Spaces can be replaced by the percent encoding (“%20”). The plus sign (“+”) stands for a space only in form-encoded query strings.

Proper file-scheme URLs are supposed to use slashes to separate segments, even if the underlying filesystem uses backslashes to do so. (Any slash or backslash that’s literally part of a segment name is supposed to be percent-escaped.)

However, some other characters are deemed “unsafe” and thus should be percent encoded. In particular there’s “<” and “>”.

Both RFC 1738 and RFC 3986 suggest that if a URL needs to be distinguished from the surrounding context, it can be bookended with angle brackets (“<” and “>”). RFC 1738 also recommends prefixing it with “URL:” (e.g., “Go to <URL:http://www.example.com/index.html> to learn more.”), though RFC 3986 notes that the prefix is no longer recommended.

Perhaps your syntax could allow the angle brackets when necessary to disambiguate.
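To make the character question concrete, here is a rough Python sketch that lists the printable ASCII characters that never appear raw in a URI, going by the character classes in RFC 3986 Appendix A (unreserved, gen-delims, sub-delims, plus “%” for percent-encoding). The variable names are mine, and this only covers the spec’s generic syntax; it says nothing about what browsers or servers tolerate in practice.

```python
import string

# Character classes from RFC 3986, Appendix A.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")

# "%" is allowed only as the start of a percent-encoded triplet.
ALLOWED = UNRESERVED | GEN_DELIMS | SUB_DELIMS | {"%"}

# Printable ASCII runs from space (0x20) through tilde (0x7E).
PRINTABLE_ASCII = {chr(c) for c in range(0x20, 0x7F)}

# Whatever is left over cannot appear unencoded under the generic syntax.
never_raw = sorted(PRINTABLE_ASCII - ALLOWED)
print(never_raw)
# [' ', '"', '<', '>', '\\', '^', '`', '{', '|', '}']

print("!" in ALLOWED)  # True
```

By this reading, “!” is a legal URL character (it is a sub-delim), while space, “>”, and “|” are all outside the allowed set.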

That makes sense! I’m leaning more and more toward using a space or tab, whichever comes first (i.e. any non-newline whitespace).
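As a rough illustration of that rule (not how the Dialog compiler actually works), here is a Python sketch that splits the part after “(define resource @IDENT)” into URL, option string, and alt-text. The names ResourceDef and parse_resource_body are made up, and it assumes that the first semicolon starts the alt-text and that neither the URL nor the option string contains a semicolon.

```python
import re
from typing import NamedTuple, Optional


class ResourceDef(NamedTuple):
    """Hypothetical container for the three parts of a resource definition."""
    url: str
    options: Optional[str]
    alt_text: Optional[str]


def parse_resource_body(body: str) -> ResourceDef:
    """Split 'URL [WHITESPACE OPTION] [; ALT-TEXT]' into its parts."""
    # Assumption: everything after the first semicolon is alt-text.
    head, sep, alt = body.partition(";")
    alt_text = alt.strip() if sep else None

    # The first run of spaces or tabs separates the URL from the option string.
    parts = re.split(r"[ \t]+", head.strip(), maxsplit=1)
    url = parts[0]
    options = parts[1] if len(parts) > 1 else None
    return ResourceDef(url, options, alt_text)


print(parse_resource_body("https://example.com/a.mp3 loop ; A looping tune"))
print(parse_resource_body("cover.png; Cover art"))
print(parse_resource_body("cover.png"))
```

Existing definitions with no whitespace before the semicolon come out with no option string, which is the backward-compatible behavior.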

Thank you all!