
by ICWiener 4094 days ago

> It's a tradeoff, of course.

> Reasonable people can disagree about whether the tradeoff is worth it.

> But that's my point: reasonable people can disagree.

Sure, we disagree. I understand that you think there is a tradeoff, like in many design decisions. However, you did not tell me what benefits you expect from having the syntactical distinction.

> It syntactically distinguishes information that applies to the tag from information to which the tag applies.

That statement alone does not say why it is a good thing to distinguish both kinds of information syntactically. Being able to draw the distinction looks interesting, of course. But the fact that a distinction exists at the semantic level does not mean it should exist syntactically.

The syntax for attributes is unfortunately flawed, which is why ...

> Any language feature can be abused. The proper response IMO is to stop abusing the feature, not to eliminate it.

... is taking the problem completely backwards. No language feature was abused. In fact, attributes were working against a natural organization of information, and that is why, as a workaround, meta-data had to be expressed with tags. I don't expect you to agree with this, so consider a classic example of markup usage.

You want to represent a document with reviewers, publication dates and authors (those are meta-information, right?) as well as content (the actual text being stored). However, there is also meta-information about people (name, title) and dates (calendar). Where do we store structured meta-information about attributes?

Unfortunately, attributes cannot express structured information, and they cannot have meta-information attached to them. Here is what you get:

     <root>      
          <document author="id0"
                    published_in="ACM"
                    publication_date="2010/03/02"
                    publication_volume="345"
                    link="doi://1020301.202.301.1023"
                    reviewers_id="id1;id2;id3;id4">

            ... content ...

          </document>

          <peoples>
            <people id="id0"><name>John Doe</name>...</people>
            <people id="id1">...</people>
            <people id="id2">...</people>
            <people id="id3">...</people>
            <people id="id4">...</people>
          </peoples>
     </root>

Notice how the information about the publication is scattered across different attributes instead of being a single attribute with sub-components. (Was the document published in March or February? In which time zone?)

Authors are only indirectly referenced through identifiers, because the real structure cannot be easily expressed in attributes only. Also, a list of reviewers is actually a string with semicolon-separated identifiers.

And so, people are not just meta-information but tags with nested children, and you must have a "root" element around your document and a special "peoples" list. Just to be clear, having identifiers is not bad and can be a good way to model relationships. The problem is that you no longer have the choice of using different levels of meta-attributes.

Notice how the link to a "DOI" identifier is itself encoded in a string (this is a custom format just for this example), instead of using a more useful nested structure:

       (link (protocol doi)
             (path (digits 1020301 202 301 1023)))

Each time you use a string to encode structured information in an attribute with a custom mini-language, you are asking for trouble. Each of those strings now needs a dedicated parser, and each parser has to take care of escaping "special" characters.
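To make this concrete, here is a minimal Python sketch (not part of the original example) that reads a well-formed version of the flat attributes shown earlier. The helper names `parse_reviewers` and `parse_date` are hypothetical, and the year/month/day order is an assumption, which is exactly the ambiguity raised above. Each attribute ends up needing its own small parser:

```python
import xml.etree.ElementTree as ET
from datetime import date

# A well-formed, self-closing version of the <document> element above.
XML = '''<document author="id0"
          published_in="ACM"
          publication_date="2010/03/02"
          publication_volume="345"
          reviewers_id="id1;id2;id3;id4"/>'''

def parse_reviewers(value):
    """Hypothetical mini-parser for the semicolon-separated id list."""
    return [part for part in value.split(";") if part]

def parse_date(value):
    """Hypothetical mini-parser; ASSUMES year/month/day order."""
    year, month, day = (int(p) for p in value.split("/"))
    return date(year, month, day)

doc = ET.fromstring(XML)
reviewers = parse_reviewers(doc.get("reviewers_id"))   # list of id strings
published = parse_date(doc.get("publication_date"))    # datetime.date
volume = int(doc.get("publication_volume"))            # attributes are always strings
```

None of these conventions is visible to the XML parser itself: it only hands back strings, and the structure lives in code outside the document.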

You might say that it is unfortunate that attributes are "flat", and that some hierarchical way of expressing attributes would be preferable. But then you would have nested attributes as well as nested elements. Why not merge them into the same syntactic structure?

If you consider that identifiers are not necessary, or if your format allows for sharing common sub-expressions (like #1=(author), #1#), then you could go with this kind of data format:

       (document (author (name "John Doe") (job-title "Professor") (institution "MIT"))
                 (anchor
                      (link (protocol doi)
                            (path (digits 1020301 202 301 1023)))
                      (target blank-page))
                 (reviewers (reviewer (name "..."))
                            (reviewer (name "..."))
                            ...)
                 (encoding (utf 8))
                 (sections
  
                    ...))

Then you have multiple layers of "meta"-information instead of just two: "data" and "dumb meta-data". I agree that we disagree, but I do not think both approaches are equal. You talk about tradeoffs, but I really do not see anything useful in having attributes, whereas I can see the inconvenience they bring when trying to structure information in a meaningful way.


lisper 4094 days ago

> I really do not see anything useful in having attributes

That's because you chose examples that show attributes at their worst.

Suppose that instead of a document with a single author you instead had a document to which many people contributed, and you wanted to mark it up to show who wrote which section. Using attributes (and identifiers) you would have, e.g.

    <author id="Bob">[info about Bob]</author>
    <author id="Alice">[info about Alice]</author>
    <span author="Bob">This part was written by Bob</span>
    <span author="Alice">and this part was written by Alice</span>
    <span author="Bob">and this part was written by Bob again.</span>

This example also highlights why it is sometimes NECESSARY to use identifiers in order to produce the semantically correct structure. Suppose you put all the author information in-line as you suggest. The result would look something like this:

    <span><author><name>Bob</name>[info about Bob]</author>This part was written by Bob</span>
    <span><author><name>Alice</name>[info about Alice]</author>This part was written by Alice</span>
    <span><author><name>Bob</name>[info about Bob]</author>This part was written by Bob</span>

Were the first and third parts written by the same person, or by two different people whose names both happen to be Bob? If you put everything in-line there is no way to express that two pieces of structure are intended to be EQ to each other.
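The same distinction shows up in any language with object identity. A small Python sketch, using made-up dictionaries as stand-ins for the author records: structurally equal values are not necessarily the same object, and a shared reference (the role the identifier plays) is what makes sameness explicit.

```python
# Two inlined author records with identical content (hypothetical data).
bob_first = {"name": "Bob"}
bob_third = {"name": "Bob"}

# Structural equality cannot tell the two records apart...
same_content = bob_first == bob_third   # True
# ...but they are distinct objects, like two separately inlined <author> elements.
same_object = bob_first is bob_third    # False

# Referring to one record through an identifier makes identity explicit.
authors = {"Bob": {"name": "Bob"}}
spans = [("Bob", "part one"), ("Bob", "part three")]
shared = authors[spans[0][0]] is authors[spans[1][0]]   # True
```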


ICWiener 4094 days ago

Did you not read my reply, really?

I already mentioned that with a Lisp-like data format, shared sub-expressions could be denoted using CL's reader variables:

      (document
        #1=(author (id "Bob") ... )
        #2=(author (id "Alice") ... )
        (span (author #1#) "written by Bob")
        (span (author #2#) "written by Alice")
        (span (author #1#) "written by Bob"))

I do not claim that this is the most appropriate solution in all cases, just that we are not forced to introduce indirection levels when unnecessary. Now, if I am using Lisp and I want to introduce external references to authors described in other documents, I could introduce meta-data with an appropriate semantic structure:

       (external-element (pathname (directory (relative "path" "to")) 
                                   (type "lisp")
                                   (name "file")) 
                         (tree-path 2 1 3 2 2 3))

This would be a practical way to encode a precise location in a tree in an external file. And I could use this form everywhere I need to reference an object. Also, the tree-path notation is handy because there is no distinction between an attribute or an element, just which branch to take at each step from the root.
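As a rough illustration of the tree-path idea, here is a Python sketch over nested lists. The function name `follow`, the sample tree, and the choice of 0-based indices are all assumptions made for the example; the notation above does not say how branches are numbered.

```python
def follow(tree, path):
    """Walk a nested-list tree, taking child `index` at each step.

    ASSUMPTION: indices are 0-based and count every item of a list,
    including the leading symbol.
    """
    node = tree
    for index in path:
        node = node[index]
    return node

# An s-expression-like tree written as nested Python lists (made-up data).
document = ["document",
            ["author", ["name", "John Doe"], ["job-title", "Professor"]],
            ["encoding", ["utf", 8]]]

name = follow(document, [1, 1, 1])   # "John Doe"
utf = follow(document, [2, 1, 1])    # 8
```

The walk never needs to know whether a branch would have been an attribute or an element in XML; every step is just "take branch n".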

Now, with XML attributes, I would typically have an "xref" attribute. How can we model xref attributes? If we wanted structured data, we would need to create external tags for the same concepts as above, like <pathname>, create a local identifier for each xref, and refer to each xref indirectly through its local identifier, because attributes can only hold strings. I mean:

     <author xref="xref02"> 
     ...
     <xref id="xref02">
       <pathname> ... </pathname>
       <tree-path> ... </tree-path>
     </xref>

Or we do as everybody does and encode it as in XMI, ECORE, or any other custom format: with a complex string, hoping that HTML entities are properly escaped.

Besides, you failed to notice that you had <author> tags, which goes precisely against your idea that there should be a place for "meta-data" and a place for "data": effectively, authors are now part of the content of the document, and are not only meta-information.

If you think my examples are artificial, open the source code of this page, and observe how any kind of complex information written in attributes has to be properly escaped to bypass the limitation of stringly-typed data:

       reply?id=9556252&amp;goto=item%3Fid%3D9555880"

       href="vote?for=9556252&amp;dir=up&amp;auth=0UU000REDACTED000208d8b9f4a45575b4edea3779&amp;goto=item%3Fid%3D9555880"
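The two layers of escaping in the first value can be reproduced with Python's standard library. This is only a sketch of how such a string could be built, not a claim about how the site actually generates it; the variable names are made up.

```python
import html
from urllib.parse import quote

# Layer 1: the nested URL is percent-encoded so it can travel as one query value.
goto = quote("item?id=9555880", safe="")        # '?' -> %3F, '=' -> %3D

# Layer 2: the whole URL is HTML-escaped so it can sit inside an attribute.
url = "reply?id=9556252&goto=" + goto
attribute_value = html.escape(url, quote=True)  # '&' -> &amp;
```

Two different mini-languages are stacked inside a single string, and a consumer has to undo them in the right order.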

Notice how you need to escape HTML entities in inline JavaScript attributes (onclick) but not in script tags. Why is inline JavaScript not expressed with tags instead?

(see http://stackoverflow.com/questions/8749001/escaping-html-ent...).

Whatever example you choose, you cannot deny that attributes are not given the same rights as elements: they cannot contain structured data, and they cannot have meta-attributes themselves.


lisper 4094 days ago

> Did you not read my reply, really?

I did read it.

> I already mentioned that with a Lisp-like data format, shared sub-expressions could be denoted using CL's reader variables:

Yes, of course this is possible. But that's just a different way of implementing tags (and not even a very good one either because your tags are constrained to be numeric).

> we are not forced to introduce indirection levels when unnecessary

That's a tautology.

> I could use this form everywhere I need to reference an object.

Of course you could. Most problems have more than one reasonable solution. But pointing out one reasonable solution is irrelevant to the question of whether a different solution is also reasonable.

> your idea that there should be a place for "meta-data" and a place for "data"

That wasn't exactly my idea. What I said was that there was value in having a syntactic distinction between data and meta-data. But I didn't say that this distinction should be universal. In fact it is impossible to distinguish between data and metadata in general, so you can always come up with examples where a particular datum's role is ambiguous. That doesn't change the fact that in many practical circumstances, having a syntactic distinction is appropriate and useful.

> observe how any kind of complex information written in attributes has to be properly escaped

Again, citing circumstances where things fall apart does not change the fact that in many practical circumstances, having a syntactic distinction between data and meta-data is appropriate and useful.

If you choose to reply to this, please remember: I'm a Lisp fan. (Look at my HN user ID!) I hate XML. I much prefer S-expressions. When I have to deal with XML, the first thing I do is parse it into S-expressions. The world would be a better place if everything were S-exprs and no one used SGML or any of its devil-spawn syntaxes. But that's not the world we live in. In the world we live in, where markup languages exist and are required to have matching end tags, attributes are a defensible design.
