| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ssmoot 5533 days ago

I don't know how NFS keeps coming up. It's an entirely different use case. It doesn't help the credibility of a critique on networked block storage to harp on a vendor specific implementation of a technology that doesn't even operate in the same sphere.

An NFS server is very simple. With NFS on it's own VLAN, and some very basic QoS, there's no reason an NFS server should be the weak point in your infrastructure. Especially since it's resilient to disconnection on a flaky network.

If you're looking for 100% availability, sure, NFS is probably not the answer. If on the other hand you're running a website, and would rather trade a few bad requests for high-availability and portability, then NFS can be a great fit.

None of that has anything to do with EBS or block-storage though.

Joyent's position is that iSCSI was flaky for them because of unpredictable loads on under-performing equipment. The situation would degrade to the point that they could only attach a couple VM hosts to a pair of servers for example, and they were slicing the LUNs on the host, losing the flexibility networked block-storage provides for portability between systems.

Here's what we do:

We export an 80GB LUN for every running application from two SAN systems.

These systems are home-grown, based on Nexenta Core Platform v3. We don't use de-dupe since the DDT kills performance (and if Joyent was using it, then is local storage without it really a fair comparison?). We provide SSDs for ZIL and ARCL2.

These LUNs are then mirrored on the Dom0. That part is key. Most storage vendors want to create a black-box, bullet-proof "appliance". That's garbage. If it worked maybe it wouldn't be a problem, but in practice these things are never bullet-proof, and a failover in the cluster can easily mean no availability for the initiators for some short period of time. If you're working with Solaris 10, this can easily cause a connection timeout. Once that happens you must reboot the whole machine even if it's just one offline LUN.

It's a nightmare. Don't use Solaris 10.

snv_134 will reconnect eventually. Much smoother experience. So you zpool mirror your LUNs. Now you can take each SAN box offline for routine maintenance without issue. If one of them out-right fails, even with dozens of exported LUNs you're looking at a minute or two while the Dom0 compensates for the event and stops blocking IO.

These systems are very fast. Much faster than local storage is likely to me without throwing serious dollars at it.

These systems are very reliable. Since they can be snapshotted independently, and the underlying file-systems are themselves very reliable, the risk of data-loss is so small as to be a non-issue.

They can be replicated easily to tertiary storage, or offline incremental backup easily.

To take the system out, would require a network melt-down.

To compensate for that you spread link-aggregated connections across stacked switches. If a switch goes down, you're still operational. If a link goes down, you're still operational. The SAN interfaces are on their own VLAN, and the physical interfaces are dedicated to the Dom0. The DomU's are mapped to their own shared NIC.

The Dom0, or either of it's NICs is still a single point of failure. So you make sure to have two of them. Applications mount HA-NFS shares for shared media. You don't depend on stupid gimmicks like live-migration. You just run multiple app instances and load-balance between them.

You quadruple your (thinly provisioned) storage requirements this way, but this is how you build a bullet-proof system using networked storage (both block (iSCSI) and filesystem (NFS)) for serving web-applications.

If you pin yourself to local storage you have massive replication costs, you commit yourself to very weak recovery options. Locality of your data kills you when there's a problem. You're trading effective capacity planning for panic fixes when things don't go so smoothly.

This is why it takes forever to provision anything at Rackspace Cloud, and when things go wrong, you're basically screwed.

Because instead of proper planning, they'd rather not have to concern themselves with availability of your systems/data.

It's not a walk in the park, but if you can afford to invest in your own infrastructure and skills, you can achieve results that are better in every way.

Sure, you might not be able to load a dozen high-traffic Dom0's onto these SAN systems, but that matters mostly if you're trying to squeeze margins as a hosting provider. Their problems are not ours...

2 comments

chubot 5533 days ago

The point of the article is that you are taking an ancient interface and using it for something new. Millions of lines of code was written against that interface with old assumptions, and now you've moved it to a new implementation without changing any of it. Things are bound to go wrong.

When you move sqlite to NFS, for example, file locking probably won't work. There is nothing to tell you this.

It sounds like you have experience making NFS work well, but I don't see how anything you wrote addresses this point. In fact I think you're just echoing some of the article's points about "enterprise planning". AFAICT you come from the enterprise world and are advocating overprovisioning, which is fine, but not the same context.

link

ssmoot 5533 days ago

I work at a small shop who was badly burned by Sun/Oracle. :-)

It's not that I believe in overprovisioning I think. It's that if data is really that critical, and it's availability is critical, then that has to be taken into account during planning.

Everything fails at some point. The Enterprise Storage Vendors would have you believe their stuff doesn't. In practice it's pretty scary when the black box doesn't work as advertised anymore though _after_ you've made it the centerpiece of your operations.

So with those lessons learned, our replacement efforts took into account the level of availability we wanted to achieve.

I did go off on an NFS tanget. Sorry. But this article was about block-storage, which is a different beast from what you describe.

Seeing all networked storage lumped together is like seeing: FastCGI isn't 100% reliable, which is why I hate two-phase-commits.

link

edw 5533 days ago

I brought up NFS because it's an example of a service that implements an abstraction but does so in a way that undermines the assumptions of the implemented abstraction. I do not disagree that local disks are an unrealistic strategy for creating a scalable, fault-tolerant system. The disk abstraction is of limited utility when creating such systems, because "disk thinking" leads to giving in to seductive assumptions about the performance and reliability of the storage resources you have at your disposal.

link

mkramlich 5533 days ago

> I brought up NFS because it's an example of a service that implements an abstraction but does so in a way that undermines the assumptions of the implemented abstraction.

yep. NFS and the like make you more vulnerable to the Fallacies of Distributed Computing.

link