In my day-to-day life I need to access a number of different intranet web services. Some of these are on my local work intranet, but I also need to be able to get to things like temporary services hosted on Grid’5000. These websites are not available on the public Internet, so the standard procedure is to connect to a gateway that sits at the edge of the intranet and route traffic through that gateway using either a VPN or an SSH SOCKS proxy. This solution isn’t ideal, as it requires setting up the connection and possibly reconfiguring my browser whenever I want to access a site on one of these intranets. I can’t access work and Grid’5000 systems at the same time, and, when traffic is being routed through Grid’5000, I can’t access sites on the public Internet (this was a big problem when trying to troubleshoot a web application with my teammates by screen sharing over Google Hangouts).
In July 2015, the HARNESS project hosted a Software Carpentry bootcamp/workshop at the SAP headquarters in Feltham, near London Heathrow. This was an outreach activity with three distinct, but related, objectives: 1) to disseminate HARNESS project outcomes to the research community, 2) to bring together researchers interested in topics relevant to HARNESS, and 3) to provide skills and knowledge training as a public service, contributing toward the improvement of computational science practice within the European Union. This activity was particularly targeted at the “HPC”, “cloud research”, and “heterogeneous compute research” communities discussed in the HARNESS deliverable documentation. In addition to the standard topics on task automation, version control, and programming in Python, there were additional modules on cloud computing, FPGA data-flow engines, and distributed file systems. Representatives from both the Software Sustainability Institute and the European Grid Infrastructure attended to provide additional teaching support. The full list of topics covered was as follows:
- Automating Tasks with the Unix Shell (Alistair Grant, SSI)
- Version Control with Git (Alistair Grant, SSI)
- Building Programs with Python (Mark Stillwell, HARNESS, ICL)
- Managing Cloud Services with ConPaaS (Guillaume Pierre, HARNESS, UR1)
- Dataflow Programming with Maxeler (Peter Sanders, HARNESS, Maxeler)
- EGI Federated Cloud for Open Science (Diego Scardaci and Gergely Sipos, EGI)
- Distributed Filesystems with XtreemFS (Christoph Kleinweber, HARNESS, ZIB)
Lately I’ve been tinkering with some of my Ansible roles to improve the support for multiple-environment deployments. Previously, I’d been using a somewhat naive approach: simply including environment-specific task lists from tasks/main.yml using conditionals.
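A sketch of that conditional-include pattern (a reconstruction; the environment names and file names here are illustrative):

```yaml
# tasks/main.yml -- naive per-environment task lists (reconstructed sketch)
- include: deploy-production.yml
  when: deploy_env == "production"

- include: deploy-staging.yml
  when: deploy_env == "staging"
```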
However, there are a number of problems with this approach: 1) it requires keeping and maintaining a separate task list for each environment, 2) deployments to different environments are done sequentially rather than in parallel, and 3) it pollutes the Ansible output with lots of skipped tasks for unused environments.
In general, it is better to use environment-specific variables with a single task list shared across environments. This reduces the maintenance overhead, allows for parallel deployment, and keeps the output small.
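A minimal sketch of the variable-driven layout, assuming the standard group_vars convention (the variable names and file paths here are hypothetical):

```yaml
# group_vars/production.yml
app_port: 8080
app_debug: false

# group_vars/staging.yml defines the same variables with other values.

# tasks/main.yml -- one task list serves every environment
- name: render application config from environment variables
  template:
    src: app.conf.j2
    dest: /etc/app/app.conf
```

Because each environment only differs in variable values, a single play can target all environment groups at once and Ansible will deploy to them in parallel.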
I’m very excited to announce that for the past month I have been working on the arrangements for a cloud-oriented Software Carpentry event, and we are now ready to start recruiting participants! The event, which is going to be held at the SAP offices in Feltham (near London Heathrow) July 15-17, is being organized as a dissemination activity for the European FP7 HARNESS project. At this event we will extend the standard Software Carpentry curriculum of task automation, modular programming, and version control with additional modules on cloud computing, including deployment, configuration, and management of virtual machines. End-user-ready cloud computing software projects supported by HARNESS will be covered, particularly the ConPaaS runtime environment and the XtreemFS distributed filesystem. Participants will also learn about programming dataflow engines from Maxeler and how to access the public EGI Federated Cloud infrastructure for research. There will be a keynote on the HARNESS project and how it is helping to bring the power of heterogeneous accelerators to the cloud.
Participation is free, but only 40 places are available, so sign up now!
Details, including a link to the Eventbrite signup page, are available at the event website: https://harnesscloud.github.io/2015-07-15-feltham/
The authors of a recent article featured in Times Higher Education, “Is ‘academic citizenship’ under strain?”, argue persuasively that changes in policy and the funding environment are pushing academics to focus more and more on the primary tasks of teaching and research, at the expense of a large class of less widely known activities that are nonetheless essential to the academic enterprise. The points they make will resonate with any academic who has tried to find the time to make significant real-world contributions in today’s metric-focused environment. However, despite its aspirations, the article is not truly comprehensive: one critical area of activity neglected in the piece is that of developing and maintaining the software and infrastructure that enable modern computing-based research.
I’ve been working with Ansible quite a lot for the past year, but until this weekend I really hadn’t had much of a chance to get acquainted with Docker. There’s a lot of talk on the Internet about how these two pieces of technology “naturally” complement each other, but clearly there’s also a lot of confusion, and there are still a few hard interface problems that need to be sorted out.
When I first heard that Ansible could be used to both provision and deploy Docker containers, I naturally assumed that this meant I’d be able to adapt my existing Ansible workflow, possibly with the addition of some new modules, and have access to the power of Docker, but this hasn’t proven to be the case at all.
In fact, the two functionalities, deployment and provisioning, are completely disconnected from each other. To build images and deploy containers there are two new modules, docker_image and docker, which work pretty much like you’d expect. It should be noted that the docker_image module can really only be used to build images from existing Dockerfiles, and in fact there is little it can do to actually affect the internal workings of the image. To provision, that is configure, an image, an entirely different approach is required: the playbook or role needs to be copied into the container filesystem and then run as a deployment to localhost. The problem with this, of course, is that it makes it very difficult to coordinate configuration and state information. This probably has a lot to do with the approach to containerized deployment emphasized by Docker, wherein images are relatively static and configuration is done at deploy time by running scripts in linked containers. This may even be the best way to do things, but it does make it difficult to adapt existing playbooks and roles to work with Docker.
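For the build-and-deploy side, the pattern looks roughly like this (a sketch against the Ansible 1.x module interfaces as I understand them; the paths and names are hypothetical):

```yaml
- name: build an image from a directory containing an existing Dockerfile
  docker_image:
    path: /srv/build/myapp    # directory with the Dockerfile
    name: myapp
    state: present

- name: start a container from the freshly built image
  docker:
    image: myapp
    name: myapp-1
    state: started
```

Note that nothing here touches the inside of the image; everything interesting has to happen in the Dockerfile or in a separate provisioning run inside the container.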
The central thesis of my talk was that a modern devops approach, based on version-controlled configuration management implemented using industry-standard deployment tools and making use of continuous integration and rolling updates, is a must for creating robust distributed systems with lots of moving parts. Interactive configuration of cloud systems and virtual machines leads to unreproducible, poorly documented, error-prone “brittle” systems. It is the equivalent of collaborative authoring of documents by emailing word processor files, or of doing software version control by keeping a backup/ directory of numbered/dated zip archives. This leads to a “culture of fear” wherein people are afraid to try to fix known problems for fear of “making things worse”, and every rollout of a new API or set of features becomes a dreaded nightmare.
Overall, the talk was well received. We also discussed the Vagrant workflow, and how, while Docker can be used to provision containers that function like “lightweight” virtual machines, ideally each container should run only the minimum number of processes, following the single-responsibility principle.
I had a great time yesterday at the Workshop for Research Software Engineers put on by the Software Sustainability Institute and hosted at the Oxford e-Research Centre. While there, I had an interesting conversation with Ian Gent, author of “The Recomputation Manifesto”. You should head over to the site and read some of Ian’s articles, particularly this one, as he’s put quite a lot of thought into the topic. The gist of the idea is as follows (apologies to Ian for any misunderstandings; you should really go to his site after this one): 1) computational experiments are only valuable if they can be verified and validated, 2) in theory, it should be fairly easy to make computational science experiments, particularly small-scale computer science experiments, perfectly repeatable for all time, 3) in practice this is rarely done and reproduction/verification is really hard, 4) the best or possibly only way to accomplish this goal is to make sure that the entire environment is reproducible by packaging it in a virtual machine.
I have a few thoughts of my own on how we can better accomplish these goals after the break. The implementation of these ideas should be fairly simple and basically add up to extending or developing a few system utilities and systematically archiving distributions and updates on a service like figshare, but I believe that the benefits to the cause of improving the reproducibility of computational experiments would be enormous.
Most of us are familiar with the classical narrative of ongoing scientific progress: that each new discovery builds upon previous ones, creating an ever upward-rising edifice of human knowledge shaped something like an inverted pyramid. There’s also an idea that, in the semi-distant past of a few hundred years ago one person could know all the scientific knowledge of his (or her) time, but today the vast and ever-expanding amount of information available means that scientists must be much more specialized, and that as time passes they will become ever more specialized.
There is some truth to these ideas, but there are problems as well: when new knowledge is created, how is it added to the edifice? How do we make sure that future scholars will know about it and properly reference it in their own works? If a scientist must be incredibly specialized to advance knowledge, then what does he (or she) do when just starting out? How does one choose a field of research? And what happens when the funding for research into that area dries up? Contrary to what we learned in grade school, a scientist cannot choose to simply study, say, some obscure species of Peruvian moth and spend the next 40 years of summers in South America learning everything there is to know about it without also spending some time justifying that decision to colleagues and funding bodies.
A lot of my research is based around developing and testing heuristic algorithms for various computational problems. In particular, I try to develop and adapt vector-packing-based heuristics for scheduling virtual machines in distributed heterogeneous environments.
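To give a flavor of what vector packing means here (this is an illustrative sketch, not my actual research code), a first-fit heuristic assigns multi-dimensional resource demands, such as CPU and RAM, to the first machine with room in every dimension:

```python
def first_fit_vector_pack(items, capacity):
    """Assign d-dimensional demand vectors to bins using first fit.

    items: list of tuples, each a resource demand vector (e.g. CPU, RAM).
    capacity: tuple giving the per-dimension capacity of every bin.
    Returns a list of bins, each a list of item indices.
    """
    bins = []   # list of lists of item indices
    loads = []  # accumulated load vector for each bin
    for idx, item in enumerate(items):
        for b, load in enumerate(loads):
            # place the item in the first bin where every dimension fits
            if all(l + x <= c for l, x, c in zip(load, item, capacity)):
                bins[b].append(idx)
                loads[b] = tuple(l + x for l, x in zip(load, item))
                break
        else:
            # no existing bin fits; open a new one
            bins.append([idx])
            loads.append(item)
    return bins
```

Real heuristics differ mainly in how items are ordered and which candidate bin is chosen, but the feasibility check above is the common core.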
Toward this end, I’ve spent a fair amount of time investigating the large data trace released by Google in 2011. The trace is available through Google Cloud Storage and can be downloaded using the gsutil utility, which can be installed in a Python virtual environment via PyPI/pip. I want to be able to use this trace to generate synthetic problem instances for testing my heuristics.
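For reference, the download procedure amounts to something like the following; the bucket name shown is the one documented for the 2011 trace, so check Google’s trace documentation for the current location:

```shell
# install gsutil into an isolated Python virtual environment
virtualenv gsutil-env
. gsutil-env/bin/activate
pip install gsutil

# list the trace contents, then fetch one table in parallel
gsutil ls gs://clusterdata-2011-2/
gsutil -m cp -R gs://clusterdata-2011-2/task_events ./trace/
```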
If you spot any mistakes or errors in my code then please leave a comment or email me.