by Steven J. Owens (unless otherwise attributed)
Big data, machine learning and cloud computing are three related, to some degree even overlapping, things. This is oversimplifying (isn't everything?) but you can think of them this way:
a) Big data,
b) is enabled by machine learning,
c) which is enabled by cloud computing.
"The cloud" didn't come from machine learning or big data, but it's certainly playing a big role in making big data and machine learning fly. You could say that big data/machine learning are cloud computing's flagship app or killer app.
You could also say that cloud computing was a necessary precondition for the increasing use of big data and machine learning.
Cloud computing is essentially good ol' fashioned server farms plus lots of automation to address most of the problems of running and using a server farm:
It's that last one that's really the rub. Getting into detail beyond this quickly gets complicated.
There are two big takeaways to the cloud. First, that it enables you to tackle really big problems. Second, that it enables a pay-for-what-you-use model.
"Big problems" in the sense of both the amount of data and the amount of processing power.
There is no black magic or fundamental quantum leap in the theory of parallel processing or distributed computing. People have been trying to find solutions to fundamental hard problems in this space (distributed computing, parallel processing, etc) for decades. Cloud computing hasn't solved those problems; it simply doesn't care about the wasted resources.
Computers have become cheap enough and fast enough, and network connections between them fast enough, that the overhead in breaking up the problems across hundreds of computers has become acceptable.
"Pay for what you use" has become a thing because the same approach used to manage tackling big problems is also available to the cloud vendors to manage provisioning separate customers. Thus cloud customers, from small startups to large corporations, can rent as much cloud resources as they need from cloud vendors, and if they need more, can turn up the knob, while only paying for what they use.
Also, because of pay-for-what-you-use, you can scale up relatively cheaply and easily, as long as you design your app for the cloud from the start. This is important in general for "web scale" apps, and also because you can develop a big data app on a smaller data set, more cheaply, and then easily scale up.
"Scaling" is shortened from "scaling up" or "scaling down". In software, most of the time it's about "scaling up", and most of the time it's about a particular sense of scaling up: a tool is scalable if it can continue to produce results in the same amount of time even though the number of requests or the amount of work increases.
Cloud computing is about automating a lot of the headaches of using horizontal scaling for computer processing.
The metaphor I usually use to explain scaling is vehicles for a delivery service. There are two kinds of scaling, vertical and horizontal.
Vertical scaling is getting a bigger truck.
Horizontal scaling is getting more trucks.
With trucks, there are some advantages to vertical scaling and some disadvantages, and the same with horizontal scaling.
Larger trucks are actually more efficient at bulk transport (per pound or cubic foot), and you can handle either more packages or larger packages. You only need to hire one driver. Maintaining a single vehicle is cheaper than maintaining multiple vehicles.
On the other hand, as the trucks get bigger, they start to get a lot more expensive, and they start to need specialized support: commercial drivers license, diesel fuel instead of regular, specialized mechanics, more expensive parts. If it breaks, you're out of action. At some point you've got the biggest truck there is, that's all there is, there ain't no more. And believe me, by that point you're paying a lot!
https://www.google.com/search?q=Komatsu+960E-1&tbm=isch
Wait, did I say a lot?
https://www.google.com/search?q=nasa+crawler+transporter&tbm=isch
Horizontal scaling has other advantages. More flexibility; you can send trucks on two different jobs at once. If one truck breaks the other trucks still work. You can buy a lot of smaller trucks for the same amount as one of the bigger trucks. Parts and maintenance are easier and cheaper; and if you have to, it's a lot easier to just replace a smaller truck.
The big disadvantage is the maximum package size, and even that you can often work around, if you think hard about it, breaking packages down and sending parts on different trucks.
With computers, horizontal scaling tends to win, for a number of reasons. A big one is the speed at which computing capacity improves, and at which faster computers drop down into the commodity pricing range.
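In computing terms, "breaking packages down and sending parts on different trucks" looks roughly like the sketch below. This is only an illustration, not how any particular cloud does it: the function names are made up for this example, and threads on one machine are standing in for what would really be separate computers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_count):
    """Break one big 'package' into roughly equal smaller ones."""
    size = max(1, -(-len(items) // chunk_count))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def deliver(chunk):
    """The work a single 'truck' does; here it just adds up its share."""
    return sum(chunk)


def horizontal_sum(items, trucks=4):
    """Spread the chunks over several workers, then combine the partial results."""
    chunks = split_into_chunks(items, trucks)
    with ThreadPoolExecutor(max_workers=trucks) as pool:
        partial_sums = list(pool.map(deliver, chunks))
    return sum(partial_sums)


total = horizontal_sum(list(range(1, 101)))
```

The splitting and the recombining are the overhead you pay for going horizontal. The bet is that this overhead is small compared to the work itself.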
Virtualization plays a big role in how most clouds work. You can skip this part if you aren't interested, it's not critical. But you'll see these buzzwords come up a lot in discussions of the cloud, so it helps to have a general sense of them.
Virtualization is another hugely complex and sprawling topic, but here's a quick and simplified description.
Virtualization means hypervisors, for example VMware, Xen, VirtualBox, KVM, Parallels, and increasingly, these days, containers, the most popular example being Docker.
You can think of an operating system as a layer of software that goes between the hardware and the applications. The job of the OS is two things:
a) hide any nuances and differences in the hardware
b) juggle/manage the use of hardware resources
Virtualization is the ability to insert an extra layer between the OS and the hardware. That extra layer allows us to manipulate and manage the OS's use of resources in much the same way that the OS manages the application's use of resources.
Some of the big deal things that virtualization buys us are:
a) We can run multiple OS instances (which can also be different
flavors of OS) simultaneously and manage the resources they use.
b) We can save a snapshot of a running OS to disk, transfer it to a
whole new machine and start it back up.
c) We can set everything up and save a snapshot, then just run a copy
of that snapshot on each machine that we need.
d) cloud vendors can run multiple virtual instances on the same
machine, for different customers.
All of these make running a server farm a heck of a lot easier.
Virtualization has gotten a whole ton faster in recent years. I remember around 2000 it was a rule of thumb that running your OS on virtualization used up about half your resources, i.e. your virtual machine using all of your hardware resources was about half as powerful as your actual machine.
Today that's more like 10%, and in the meantime machines have gotten a whole lot faster, so now it's worth burning that 10% overhead for the server management advantages that virtualization gives you.
Part of this is the emergence of "bare metal" hypervisors, where the hypervisor is effectively its own (very stripped down) operating system. Instead of the picture looking like this:
a) hardware, which runs the
b) OS, which runs the
c) hypervisor program, which runs the
d) virtualized OS, which runs
e) your application
It looks like this:
a) hardware, which runs the
b) hypervisor program, which runs the
c) virtualized OS, which runs
d) your application
Another part of this is the emergence of hardware support in PC CPUs, starting with Intel's VT-x in 2005, which is currently available in most Intel chips. This pushes the dividing line between the programs deeper, into the hardware layer, which means it can be done more efficiently and more reliably.
A new twist on virtualization is "containers". Like virtualization in general, containers are really the second coming of an old twist, in shiny new clothes.
Containers run even more efficiently than normal virtualization, so you get more bang for your hardware buck. Containers get this bang by trading off flexibility. Multiple containers share underlying OS resources, so all of your containers have to run the same flavor OS. This greatly reduces the memory size of an app running in a container, versus an app running in a virtualized OS. There are, however, some potential security risks to that sharing, so containers aren't simply replacing all use of virtualization.
(And all of the above is, of course, an oversimplification. )
Different cloud vendors have different approaches to what the cloud is and how you work with it.
Amazon's initial cloud offering, for example, was EC2, (Elastic Compute Cloud). EC2 is basically the ability to click a button and spin up virtual server instances. It's your job to set up the OS, the app server and the applications on the EC2 instances, and your worry to figure out how the apps share work and so forth.
That's not really very "cloud". It may be a cloud from Amazon's point of view, because they're using a cloud to provide you with those EC2 instances. But from your point of view, in using EC2 instances, you still have most of the same setup and management hassles you would have with actual machines.
The big thing you got with EC2 is the ability to just order more, without having to deal with capital outlay, budgeting in advance, upgrading machinery, etc. That's very helpful, but at the time not technologically revolutionary. However, Amazon has several other services that can take over different chunks of the problem, and they've added more of these as time has gone by. That's where it really gets cloudy.
The idea is that you start from scratch and decompose your application needs into features provided by Amazon's different services. These services aren't tied to individual machine instances; they're just services, and to scale up you simply use more of them. You sidestep a bunch of the headaches of figuring out how to break up the work, rewriting your software to do it that way, managing starting and configuring new instances, etc.
As an example, in a StackExchange discussion about the different Amazon storage services a user suggested the following pattern for an image storing service. First I'll give you the original answer, then I'll repeat it with explanatory annotations.
http://stackoverflow.com/questions/2288402/should-i-persist-images-on-ebs-or-s3
"I have architected solutions on AWS for Stock photography sites which
stores millions of images spanning TB's of data, I would like to share
some of the best practice in AWS for your requirement:
P1) Store the Original Image file in S3 Standard option
P2) Store the reproducible images like thumbs etc [thumbnail images]
in the S3 Reduced Redundancy option (RRS) to save costs
P3) Meta data about images including the S3 URL can be stored in
Amazon RDS or Amazon DynamoDB depending upon the query
complexity. Query the entries from Amazon RDS. If your query is
complex it is also common practice to Store the meta data in Amazon
CloudSearch or Apache Solr.
P4) Deliver your thumbs to users with low latency using Amazon CloudFront.
P5) Queue your image conversion either thru SQS or RabbitMQ on Amazon EC2"
Here's the expanded version, with explanatory comments:
"P1) Store the Original Image file in S3 Standard option"
Amazon S3 (Simple Storage Service) is a file-oriented storage solution. You hand it a file, it stores it. You ask for the file, it gives you the file. You can't edit the file in place; you have to upload a new version of the file to replace the old version. It's reliable and cost-effective, and it can serve the file moderately quickly.
The original question was about Amazon EBS (Elastic Block Store) versus S3. EC2 instances go away completely when they're restarted or moved. They don't have any built-in system for storing their data and copying it to the new EC2 instance. EBS is the solution for that. EBS provides a file system you can mount on an EC2 server instance. The EC2 instance can use that EBS file system like a regular hard drive, and when the EC2 instance goes down or is restarted, the data sticks around on the EBS.
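To make the "hand it a file, get the file back, replace it to change it" model concrete, here is a toy in-memory version. This is emphatically not the S3 API; the class, its method names and the file path are invented for illustration, and it ignores everything that makes the real service interesting (durability, permissions, pricing).

```python
class ToyObjectStore:
    """A toy whole-file store: put a file under a name, get it back by name."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # Storing under an existing key replaces the old version outright.
        self._objects[key] = bytes(data)

    def get(self, key):
        # You always get back the stored file as it was last uploaded.
        return self._objects[key]


store = ToyObjectStore()
store.put("originals/cat.jpg", b"first upload")
store.put("originals/cat.jpg", b"second upload")  # "editing" means re-uploading
current = store.get("originals/cat.jpg")
```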
There's also Amazon Glacier, cheaper and much slower, for long-term storage of bulk data.
"P2) Store the reproducible images like thumbs etc [thumbnail images] in the S3 Reduced Redundancy option (RRS) to save costs"
S3 RRS is cheaper at the cost of being less reliable; it's better for easily replaced bulk data, for example the thumbnails, which are generated from the original images.
"P3) Meta data about images including the S3 URL can be stored in Amazon RDS or Amazon DynamoDB depending upon the query complexity. Query the entries from Amazon RDS."
Amazon RDS is the Relational Database Service. It's a cloud-based service to provide you several different flavors of relational database (MySQL, Oracle, Microsoft SQLServer, etc). You could just run your own installation of MySQL or whatever, on an EC2 instance. But then you'd have to deal with all of the configuration and management. Even with this offering, RDBMS don't scale too well at the high end of scaling (though they do go a fair distance).
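As a rough idea of what "meta data about images including the S3 URL" might look like in a relational database, here is a small sketch. It uses Python's built-in sqlite3 module as a local stand-in for a database hosted on RDS, and the table, the column names, the photographers and the URLs are all made up for the example.

```python
import sqlite3

# An in-memory SQLite database standing in for an RDS-hosted database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE images (
           id INTEGER PRIMARY KEY,
           title TEXT NOT NULL,
           photographer TEXT NOT NULL,
           s3_url TEXT NOT NULL
       )"""
)

# Hypothetical rows: the database holds facts about each image plus where
# the image lives, not the image bytes themselves.
rows = [
    ("Sunset", "alice", "s3://example-bucket/originals/sunset.jpg"),
    ("Harbor", "bob", "s3://example-bucket/originals/harbor.jpg"),
    ("Dunes", "alice", "s3://example-bucket/originals/dunes.jpg"),
]
conn.executemany(
    "INSERT INTO images (title, photographer, s3_url) VALUES (?, ?, ?)", rows
)
conn.commit()


def urls_by_photographer(connection, name):
    """Return the storage locations of every image by one photographer."""
    cursor = connection.execute(
        "SELECT s3_url FROM images WHERE photographer = ? ORDER BY id", (name,)
    )
    return [row[0] for row in cursor]


alice_urls = urls_by_photographer(conn, "alice")
```

The point of the split is that the database answers questions about the images, and the answer is a pointer to where the actual file is stored.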
Amazon DynamoDB is a NoSQL database, in essence a distributed key/value store. These are generally very fast and scale to very large sizes/requests, but don't usually support sophisticated query languages. NoSQL is a whole other, somewhat complicated topic. This is one of the better/simpler introductions to it that I've seen:
http://www.lampefamily.us/divconq_blog/2010/migrate-a-relational-database-structure-into-a-nosql-cassandra-structure-part-i/
"If your query is complex it is also common practice to Store the meta data in Amazon CloudSearch or Apache Solr."
Amazon CloudSearch is a search engine provided as a cloud service. Search engines are far more effective for unstructured data and freeform searching; databases are better for structured data and predefined queries.
Apache Solr is a featureful, open-source search engine application built around the Apache Lucene search engine. I think he's suggesting you could run your own Solr installation on an Amazon EC2 server instance, as Amazon didn't have a Solr cloud service at the time. However, in 2014 Amazon announced that their CloudSearch service is now based on Solr. Note that, as with RDS, CloudSearch does not offer complete API access to the underlying Solr APIs.
"P4) Deliver your thumbs to users with low latency using Amazon CloudFront."
Amazon CloudFront is a CDN (Content Delivery Network) service. A CDN speeds up web delivery of static data files by keeping copies of them on servers all around the web, so your browser is getting the file from a server that's closer to you and therefore faster, maybe even inside the data center of the same ISP you use to access the public internet.
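The "keeping copies of them" part is the whole trick, and it can be sketched in a few lines. This is a toy model, not how CloudFront is actually built; the class names, the region names and the file path are invented, and a real CDN also has to worry about expiring stale copies and routing you to the nearest server.

```python
class Origin:
    """The one place the file really lives (for example, your storage service)."""

    def __init__(self, files):
        self.files = files
        self.requests_served = 0

    def fetch(self, path):
        self.requests_served += 1
        return self.files[path]


class EdgeServer:
    """A server near the user that keeps copies of files it has already fetched."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, path):
        # Only go back to the origin the first time a file is requested here.
        if path not in self.cache:
            self.cache[path] = self.origin.fetch(path)
        return self.cache[path]


origin = Origin({"/thumbs/cat.jpg": b"thumbnail bytes"})
edges = {"east": EdgeServer(origin), "west": EdgeServer(origin)}

# Three users near the "east" edge and one near the "west" edge ask for the file.
for _ in range(3):
    edges["east"].get("/thumbs/cat.jpg")
edges["west"].get("/thumbs/cat.jpg")

origin_hits = origin.requests_served
```

Four requests came in, but the origin only had to serve the file once per edge server.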
"P5) Queue your image conversion either thru SQS or RabbitMQ on Amazon EC2"
Amazon SQS (Simple Queue Service) is a message queue service. You can think of a message queue as "email built to milspec". Message queues are used for communicating between different parts of your enterprise application. Your app hands a message to the message queue service and forgets about it. The message queue service, like the MTA (Mail Transport Agent) that your email client talks to, takes care of getting the message distributed to various destinations, reliably, quickly and scalably.
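The "hand off a message and forget about it" pattern can be sketched inside a single program using Python's standard queue and threading modules. To be clear, this is not SQS or RabbitMQ, and none of the names below come from either; an in-process queue and a few threads are standing in for a queue service and a fleet of worker machines.

```python
import queue
import threading


def run_workers(jobs, worker_count=3):
    """Hand each job to a shared queue and let a pool of workers drain it."""
    job_queue = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            job = job_queue.get()
            if job is None:  # sentinel: no more work for this worker
                break
            # A real worker would do something slow here.
            results.put(f"processed {job}")

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()

    # The producer drops messages on the queue without knowing or caring
    # which worker ends up handling each one.
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()

    finished = []
    while not results.empty():
        finished.append(results.get())
    return sorted(finished)


finished = run_workers(["job-1", "job-2", "job-3", "job-4"])
```

Because the producer and the workers only share the queue, you can add more workers without changing the producer at all.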
In this example, the messages are image conversion jobs, to be distributed to multiple EC2 server instances that do the work. The job wouldn't include the actual uploaded "Original Image" itself, but rather the location of the image in
There's a recurring theme if you read up on the various AWS cloud services, which is that you trade control and detailed access to the infrastructure in return for that scalability and ease of use. This is a general pattern from all over software, particularly in regard to ease of use. Another example where this is really obvious is what's called PaaS or "Platform as a Service".
By the way, there are similar acronyms for the other types, but they're more or less "backcronyms". Low level stuff, like EC2 by itself plus a few other options would be "Infrastructure as a Service". The services provided in the Stock Photography site example are "Software as a Service".
Note that SaaS had an earlier incarnation as "SAS", back in the early 2000s when people were first getting used to the idea of software products that lived on websites, like Gmail, or SalesForce. There still seems to be some confusion or disagreement between SaaS meaning, for example, Amazon RDS, or more consumer-level services. Technically, both are SaaS in the sense that they both provide an application in a service-like way.
Also note that these are all marketing terms and subject to a lot of typically "fluid" definitions we see so often in marketing.
There are a whole bunch of different "application server" technologies, like J2EE. Every single one of them involves the tradeoff that you design and build your app in a certain way, and the application server takes care of some of the plumbing and management for you, including some degree of scaling. PaaS is taking that same approach into the cloud world.
Again, the definition is a little fluid here. I take a somewhat strict view, which is more like Google App Engine. You're highly constrained in how you design the pieces of the app, but it scales automatically. I.e. by forcing you to break down the problem in a certain way, those chunks can be automatically, transparently farmed out to different machines, and you don't have to work on the plumbing or coordinating to make it happen.
This is, of course, a matter of design tradeoffs, and there's lots of debate about different ways to do it, and at what level. The longer I stay in this field, the more I begin to think everything about software boils down to that. Even in this essay, I see echoes of the same idea in the section on Virtualization, for example.