All Posts

What is a Research Software Engineer?

What is a Research Software Engineer (RSE)? There are too many definitions to give just one here, and you can find them elsewhere. Maybe the question you actually want answered is: how do I get value from Research Engineers the way Aalto Scientific Computing does? That’s what we’ll try to answer here, and through that we may arrive at a functional definition.

This page is written from the perspective of computational science - similar messages may apply to other fields. Note that computing and AI are in every field now.

When someone wants Research Engineers, it’s probably because they see something that is missing in the current academic system. Thus, to understand what we want, we need to understand the system. Below is rkdarst’s current mental model:

Academics are who we usually consider researchers. They do research, and are promoted based on articles published and citations received from other academics. Citations from academics tend to focus on innovation and novelty, so that’s what decides career paths.

Research Engineers (REs) focus on the practice and “structural integrity” of the research: the tools, the reproducibility, and more. They are more concerned with the work being done well than with pure novelty and citations. [0]

Research Software Engineers (RSEs) are a subset of Research Engineers, and I feel that the “software” is the least significant part there. Software is important, but so is data, computing, reproducibility, etc.

Particular examples of things that Research Engineers are good at include: reproducibility, maintaining software and data across academic generations, Open Science, programming, using large computer clusters, data security, and research ethics processes.

Researchers, in my mind, cover both of the above (and more). Industrial research teams would have both of the above and possibly even more different roles all working together on their problems. In universities, we tend to only consider the academics to be researchers.

So what’s a Research Engineer? To me, it’s defined mostly in terms of what is missing from the typical academic career path (of undergraduate → junior researcher → senior researcher). At all levels, I’ve seen research engineering under-valued and under-taught (not necessarily because it’s not wanted, but because it’s not novel science and there’s no time). Senior researchers (group leaders) often see the value, but don’t have the ability to train and supervise research engineers well.

Years before Aalto RSE started (~2017-2018), I saw a need for more basic skills (for example: version control to manage code) and worked to promote them in undergraduate programs. This basically didn’t work, because these skills were seen as not scientific and thus not something to be taught in academic courses (and if they were taught, the courses would be full of people looking for easy ECTS). While there certainly are study programs in software and software engineering, these are their own thing, and not part of data science or other fields that need computation. Software engineering programs also aren’t adapted for the unpredictability of research.

This was the prompt to start Aalto Research Software Engineers - if we can’t teach people these skills in study programs, we have to support them when they become researchers (and teach the skills via practical mentoring). This has worked out very well, as you can see by our rapid expansion and heavy usage.

Aalto RSE is essentially the collaborator our research groups need to do their top-level work. This system works very well, but are there other options?

The above leads to various ideas. Take your pick for what angle you want to approach the problem:

Better RE teaching in undergraduate programs:

As part of existing programs (is there time to teach this? Is there desire? On the other hand, RE skills are great for employment prospects)

As dedicated majors? (Some people are trying to make dedicated RSE study programs at different universities, and there is a value there. But if you ask me the best value is learning RE along with academic research in a different field)

Better RE teaching in graduate programs:

Many of the same things as above apply here, mainly the lack of time, and the necessity to spend time on novel research, not learning existing best practices.

Nurture REs within existing research groups:

Nothing stops group leaders from hiring students and postdocs who have chosen to focus on research engineering. This often happens when supervisors hire technical postdocs to manage the RE side of things. (The question is: can they be supervised and mentored well by academic supervisors, if they need to be home-grown?)

If group leaders hire good candidates, Aalto RSE can help mentor them. See the companion blog post RSE work rotations for one idea.

Recruit REs as professor-level group leaders, similar to how senior academics are recruited:

These people would focus on collaborating with others to make projects possible.

The university systems don’t seem set up to value these people, thus they don’t appear among the ranks. They could appear if they spent their careers chasing academic citations, but then would they be able to spend enough time on research engineering?

I think this is what some people mean when they say they want a RSE career path: a way to recruit senior academics who lead research engineering groups. I think the idea is good, but it’s not how universities are set up, so it’s a long way off. The value systems may not even match up.

Create parallel structures that support research engineering

That is what Aalto RSE has done. We are researchers, but we make new research possible by collaborating with academics, instead of trying to publish by ourselves. We are part of the services of the School of Science.

We also take it upon ourselves to do teaching and mentoring via co-working for all types of researchers (aspiring academic or research engineer). We can fill in the technical mentoring that many supervisors can’t provide.

I’ve seen many people interested in gaining research engineering competence for their organization. You need to develop an environment where these people can fill in the gaps you have.

Junior academics: encourage them to explore their technical skills. Show that there is value in this, even if it reduces the number of publications. Encourage them to get training (for example the Aalto RSE training). Give them time, encouragement, and career prospects to reach beyond the focus on papers.

Other support staff at universities and other organizations: don’t view them as limited-purpose supporters of an {infrastructure, service, process}. View them as supporters of research: let them holistically support research projects from many angles at once, rather than only in narrow silos with strict project reporting requirements.

You can hire dedicated staff to be REs, but it’s important that they are integrated into the local research environment. Most of our hires have been local staff who have grown into a new role, and I think this is how it should be.

Any of the above, especially the first two, require making time available for RE work, plus a clear vision and network. Aalto RSE (with the help of others in Finland) is planning a networking and onboarding program for new research engineers who wish to adopt this vision.

If you read this far, you probably see the value in research engineers and want them yourself. Just hiring someone, or changing someone’s job title to “RSE”, won’t magically solve your problem. It’s a whole mindset shift towards a multi-disciplinary research team.

What’s the right level for research engineers: permanent and experienced, or junior and learning? Probably a bit of both.

I know that “Research Engineer” is a job title that can have other definitions.

Read more ...


How busy is the cluster? A discussion

We occasionally get some questions: how busy is the cluster? How long do I have to wait? Is there some dashboard that can tell me?

The answer is, unfortunately, not so easy. Our cluster uses dynamic scheduling with a fairshare algorithm. All users have a fairshare priority, which decreases the more you have recently run. Jobs are ranked by priority (fairshare plus other factors) and scheduled in that order. If there are unschedulable holes between those jobs, the scheduler can take a job with a lower priority and fill them in (“backfilling”). So that gives us:

A small-enough job with a low priority might still be scheduled soon.

A higher priority user could submit something while you are waiting, and increase your wait time.

An existing job could end early, making other wait times shorter.

An existing job could end early, allowing some other higher priority jobs to run sooner, making backfilled jobs run later.

In short: there is no way to give an estimate of the wait time in the way people want. We’ve tried but haven’t found a way to answer the question well.

What can we know?

You can compare your fairshare factor with other users. If you run sshare you can see the fairshare (higher means higher priority). sprio shows the relative priority for all jobs (here, the raw values are multiplied by some factor and added). On Triton (the new install since May 2024), they mean the following:

The age factor maxes out at 7 days: it is zero when a job is first submitted, increasing to 10000 at 7 days old.

The fairshare factor is “1e7 × FairShare priority from sshare”

The FairShare value is computed based on the raw usage value: at each level of the share tree, it divides it up among the users so that those who have run less have a higher priority.

The usage value decays with a two-week half life.

The others are mostly constant.

Still: this is all very abstract, and what others submit has more effect on your wait time than your own priority does. The only thing you can control is using fewer resources.

This is quite cluster-dependent, so we’d recommend asking your own cluster’s staff how it is set up.

How can you get your jobs scheduled sooner? This may be your real question. There are two main things:

Use fewer resources. Make sure you don’t request more than you need (CPU, memory, GPUs) - this reduces the hit to your future fairshare. Of course, use everything you actually need: “saving for later” doesn’t give you more resources later.

Request fewer resources per job. This will let you be backfilled into the scheduling holes (see below).

If your job is small enough to be backfilled and there is a slot for it, it is scheduled very soon. srun --test-only [RESOURCE_REQUESTS] might give you some hint about when a job would be scheduled - it basically tries to schedule an empty job and reports the currently estimated start time. (It uses a JobID though, so don’t run it in a loop.)
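For example (a sketch; the resource numbers are made up, so substitute your real request):

    # Ask Slurm for an estimated start time without actually running anything:
    srun --test-only --time=02:00:00 --mem=4G --cpus-per-task=2 true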

If you are waiting in the normal priority order, nothing precise can be said, since the queue is always being re-shuffled. In the long run, you get a fair share of resources: if you haven’t used much lately, you have more now. Your wait time depends more on what other users submit (and their priorities) than on what you submit - and this is always changing. You can get some idea of how soon you’d be scheduled by looking at your priority relative to other users. Make your jobs as small and efficient as possible so they fit into the holes between other jobs and get scheduled as soon as possible. If you can break one big job into smaller pieces (less time, fewer CPUs, less memory) that depend on each other, then you can better fit in between all of the big jobs. See the Tetris metaphor in TTT4HPC.
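For example, a minimal sketch of splitting work into an array of smaller jobs that backfill more easily (the resource numbers and the processing script are made up for illustration):

    #!/bin/bash
    # Many small, short jobs fit into scheduling holes better than one huge job.
    #SBATCH --time=01:00:00
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=4G
    #SBATCH --array=0-19

    # Each array task handles one chunk of the work.
    python process_chunk.py --chunk-id "$SLURM_ARRAY_TASK_ID"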

If your need is “run stuff quickly for testing”, make sure the jobs are as short as possible. Ask your cluster staff about development or debugging partitions that may be of use, because those are the solution for quick tests.

This description was in an old version of our docs but has since been removed. The exact values are out of date. It’s included here for detailed reference anyway:

Triton queues are not first-in first-out, but “fairshare”. This means that every person has a priority. The more you run, the lower your user priority. As time passes, your user priority increases again. The longer a job waits in the queue, the higher its job priority goes. So, in the long run (if everyone is submitting a never-ending stream of jobs), everyone will get exactly their share.

Once there are priorities, then: jobs are scheduled in order of priority, then any gaps are backfilled with any smaller jobs that can fit in. So small jobs usually get scheduled fast regardless.

Warning: from this point on, we get more and more technical, if you really want to know the details. Summary at the end.

What’s a share? Currently shares are based on departments and their respective funding of Triton (sshare). It used to be that departments had a share, and then each member had a share of that department. But for complex reasons we have changed it so that it’s flat: each person has a share, and the shares of everyone in a department correspond to that department’s share. When you are below your share (relative to everyone else), you have higher priority, and vice versa.

Your priority goes down via the “job billing”: roughly time×power. CPUs are billed at 1/s (but older, less powerful CPUs cost less!). Memory costs 0.2/GB/s. But: you only get billed for the max of memory or CPU. So if you use one CPU and all the memory (so that no one else can run on that node), you get billed for all the memory but no CPU. The same goes for all CPUs and little memory. This encourages balanced use. (This also applies to GPUs.)
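A worked example of the “max of memory or CPU” rule, using illustrative weights (1 per CPU-second, 0.2 per GB-second - not necessarily the current values):

    # A 1-hour job with 4 CPUs and 16 GB of memory:
    cpu_billing=$(( 4 * 3600 ))                            # 14400
    mem_billing=$(awk 'BEGIN { print 0.2 * 16 * 3600 }')   # 11520
    # The job is billed the larger of the two (here the CPU term), not their sum.
    echo "CPU term: $cpu_billing, memory term: $mem_billing"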

GPUs also have a billing weight, currently tens of times higher than a CPU billing weight for the newest GPUs. (In general all of these can change; for the latest info, search for BillingWeights in /etc/slurm/slurm.conf.)

If you submit a long job but it ends early, you are only billed for the actual time you use (but the longer job might take longer to start at the beginning). Memory is always billed for the full reservation even if you use less, since it isn’t shared.

The “user priority” is actually just a record of how much you have consumed lately (the billing numbers above). This number goes down with a half-life decay of 2 weeks. Your personal priority is your share compared to that, so we get the effect described above: the more you (or your department) have run lately, the lower your priority.
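As a rough sketch of what the two-week half-life means (Slurm’s real fairshare computation has more detail; this is only the idea):

    # After t days, roughly usage * 0.5^(t/14) of the recorded usage remains.
    awk 'BEGIN { usage = 10000; t = 7; printf "%.0f\n", usage * 0.5^(t/14) }'   # ~7071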

If you want your stuff to run faster, the best way is to more accurately specify your time (this may let the job find a place sooner) and memory (this avoids needlessly wasting your priority).

While your job is pending in the queue, Slurm re-checks those metrics and recalculates the job priority regularly. If you are interested in details, take a look at the multifactor priority plugin page (general info) and the depth-oblivious fair-share factor for what we use specifically (warning: very in-depth page). On Triton, you can always see the latest billing weights in /etc/slurm/slurm.conf.

Numerically, job priorities range from 0 to 2^32-1. Higher is sooner to run, but really the number doesn’t mean much itself.

These commands can show you information about your user and job priorities:

slurm s: list of jobs per user with their current priorities

slurm full: as above, but almost all of the job parameters are listed

slurm shares: displays usage (RawUsage) and current fairshare (FairShare, higher is better) values for all users

sshare: raw data of the above

sprio: raw priority of queued jobs

slurm j <jobid>: detailed info for <jobid>, including priority, requested nodes, etc.

tl;dr: Just select the resources you think you need, and Slurm tries to balance things out so everyone gets their share. The best way to maintain high priority is to use resources efficiently so you don’t need to over-request.

Read more ...


RSE work rotations

Let’s say you want to start a Research (Software) Engineer team in your own unit. How do you set your new hires off on the right path? A proposal is outlined below.

This is a companion post to Future RSE collaboration in Finland.

You need to find the right person to hire for the role. Most likely, this means someone with the skills you need and the mindset to transition from doing their own work to making others’ work possible. You can find hiring resources on the Aalto RSE page and some brief thoughts in the companion post Future RSE collaboration in Finland.

Let’s say you have hired someone. What’s next?

This proposal is much easier for someone inside of Aalto University than outside, but possibly could be negotiated for others.

Your new hire works as part of the existing School of Science RSE team initially, perhaps ~1 year.

The hire is paid by, organizationally supervised in, and sits in your own unit. It is absolutely critical that they maintain close connections to your own unit; their membership in our team is only virtual. (Our team is remote-first, so this is easy.)

They focus on projects from your own unit, but as part of our daily flow. This could mean asking your audience to join our Scicomp garage for help and requesting that new big projects come via our project management systems.

Your new hire will learn all about how we work.

Your new hire will experience a tremendous diversity of projects and work with experts on them.

After the initial ~1 year period, we sit down and decide what is next. Does your new hire stay working as part of our team (with a greater focus on your own unit’s projects)? Or do they split off and start doing their own thing in your unit? Or some combination?

This gives you the most important part of our onboarding and training. There is no better way to develop the right mindset. If we split later, your staff will know who to ask for harder problems that come up later.

If this sounds interesting to you, contact the author of this article (first.last@aalto.fi or various chat systems).

Read more ...


Future RSE collaboration in Finland

The Aalto University School of Science has a successful Research Software Engineering service serving the whole university. This service has proven its value, and there is an increasing number of questions about how others in Finland can form their own teams and work together. This post gives some thoughts on the matter.

This page is the opinion of the author and not Aalto itself. It’s not an open offer for collaboration. The author is happy to help with any questions you may have (first.last@aalto.fi or various chat systems).

Universities have academics: the traditional core, making ideas and new results. Much research, even outside “computer science”, needs computational tools. However, the skills needed even for basic computation can be so complex that not all academics can master them to do cutting-edge research. A Research (Software) Engineer (RSE) can bridge that gap: academics focus on their primary work, and the RSE makes the computing seamless.

For more info, see the Aalto RSE site. This is not that different from research engineers supporting complex physical equipment.

We’ve found there are plenty of qualified people to hire. The harder part is mentoring them to transition from a researcher (focused on single projects with emphasis on own publications) to supporter (supporting a wide variety of people with respect and compassion). This transition needs active mentoring.

See the companion post about work rotations for RSE mentoring - if you are in Aalto University then start there.

You should decide if you want (a) wide-ranging support which may include helping with basics or (b) specialist support for a limited audience. I would argue that our most important impact is (a): this has gotten us the most benefit overall, and a steady stream of more advanced projects as work advances.

Let’s say you want your own RSE team at your own organization. How can you and Aalto RSE work together?

Even without joint funding, some of us Aalto people would be happy to talk and give some advice, and be part of a broader general network. For example:

How our team works, what makes it work, advice for your team

Joint RSE seminars to build skills, for example as part of FCCI Tech (fka Behind Triton), Nordic-RSE seminar series, or something new. Both of these are good for professional development and community.

All the advice and practices on the Aalto RSE site.

Professional networking and so on.

However, without funding, Aalto needs to focus on its own work.

With joint funding, it might be possible to make a collaboration.

Any higher-level collaboration needs to be discussed with management. Assuming these discussions go well, we might join a collaboration so that we can actually share projects between the teams. There should always be a strong local presence, because that gets the best value. This opens more possibilities.

The more experienced or larger teams could provide:

Mentoring possibilities for new research engineers and their teams (see RSE work rotations).

A base for professional networking.

A larger base of knowledge, for more advice and help with specialist problems. A very important part of our team is that for almost any problem, someone has seen it and can solve it quickly. Then we train others to solve it.

Joint support sessions such as our Scicomp garage, which allow a wider support base for problems (covering the previous point).

The newer or smaller teams could provide:

Funding via some joint project.

More staff around to help fill in the gaps when needed (these staff also get training from the projects they experience).

Specialty domain knowledge (both for support of academics and for professional development).

A collaboration with larger funding could have a joint project flow: there is one place to submit new project requests, and the right people in any organization will work on them.

We would welcome observers in our support sessions, especially from other staff at Aalto. The Nordic-RSE chat is also a good way to ask questions and see what we are up to for those outside Aalto University.

We know of various opportunities being considered for national (Finland) or international RSE collaborations. The above are some basic thoughts, but any model would be tailored to the actual funding and partners. There is definitely a benefit to starting off together.

For more information, contact the author at first.last@aalto.fi and read Research Software Engineers for more info.

Read more ...


Triton v3 is now default

Triton has a major update. You can read our previous info about this at Preparing for new Triton, and our “what has changed” in Triton issue #1593.

You might get SSH host key warnings.

It has the same name, and importantly the same user accounts and data, but all the software and operating system is changed. In particular:

All software modules are different

Any software which has been compiled will need to be re-compiled.

Triton’s previous operating system was released in 2014. Its security support runs out at the end of May 2024, so it has to be updated. Stability is good for research, so we try to reduce the number of changes.

We realize that a change is very disruptive and painful, especially since the expectation is that Triton never changes. But an old operating system makes problems for users too, and these have grown more and more over the years.

Most of the transition for different types of software is described in Triton issue #1593.

Read more ...


Triton v3 SSH host key warnings

When updating Triton, many users will get a message like this (or similar things if you use other SSH clients like PuTTY):
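With command-line OpenSSH, it looks roughly like this (the exact wording varies between versions):

    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that a host key has just been changed.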

SSH (Secure Shell) is made to be secure, and that means it verifies the server you are connecting to via its SSH host key. The key is represented by a fingerprint, like SHA256:OqCehC2lbHdl8mYGI/G9vlxTwew3H3KrvxKDkwIQy9Y. This means that the NSA or someone else can’t intercept the connection and get your password by pretending to be Triton. This is a good thing.

OpenSSH (the command line program on Linux, MacOS, Windows) saves these connection IDs (fingerprints) in $HOME/.ssh/known_hosts. Other programs may store the keys somewhere else.

The warning looks scary but the first thing to ask is “should the server I am connecting to have changed?”. If you have been directed to this blog post, then probably yes, it has. You should always think if the fingerprint should change, and if there is no reason for them to have changed, contact your administrators. You can usually verify the keys online, for example Triton ssh key fingerprints.

If you are on command line OpenSSH (Linux), it will propose a command that will remove the old host key:
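As a sketch, for Triton it is something like the following (use the exact command your client suggests, since it knows the right known_hosts file and hostname):

    ssh-keygen -R triton.aalto.fi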

For other programs, follow whatever prompts it might give to replace the host key fingerprint.

When you get a “The authenticity of host ‘triton.aalto.fi’ can’t be established” message, verify the SSH key fingerprints that are presented, then answer “yes” to permanently save them (until they change next; they can always be updated). The fingerprints for Triton v3 are:

Read more ...


Research Software Engineer project funding: what’s been working

The “Research Software Engineer” service provides technical collaborators for researchers to complement their scientific knowledge. Read about Aalto RSE here. The idea of this service was that it would be available to everyone, but projects that made extensive use of it would fund it themselves.

If you are a group leader reading this, Aalto RSE can help you release research code, debug it, make it reusable, rescue old code from former members and make it usable again, make it run on our cluster or CSC’s clusters, manage data, prepare data for easy use, and so on. If it’s not long (less than a month), our work is free; if it’s more than a month, the below applies.

When we started, we hoped for around 50% project funding. The idea was that a lot of the funding for the service would come from the research projects themselves. This hasn’t really worked out so well, because a) we accomplish the vast majority of our projects quickly, in less than a few weeks, and b) finance would understandably not like to deal with small transactions for small amounts of time.

What actually happened was that we basically have received only a small amount of the project funding we would have wanted. On the other hand, this also means we have supported a far wider variety of projects than we would have otherwise. It also means we are better accomplishing our other goal: tactical support right where and when it’s needed most, with the least amount of administrative overhead. This actually better matches our mission of helping the researchers who need us most.

For any long projects (more than a month or so), we still follow our original plan: we can receive funding from grants (or basic funding) to do long-term projects. This is usually 40-80% of an RSE’s time, spread out over more than a month (and it can also be bursty: lots of work at some times, waiting for the next task at other times). We have done this for projects, and we know we can do it in the future.

But there’s another thing that has worked well: retainer-type funding instead of project-based funding. You have extra funding that needs to be used? You know your group needs support but you can’t name a single specific project to use all the time? Hire RSEs on long-term retainers and we’re there for you as needed. You will always get priority for all the quick questions you have (in Scicomp garage or otherwise), and you get the highest priority for your medium projects, we can attend your other group meetings, and so on. As your team wants, we’ll make high-impact improvements here and there. This could be (for example) anywhere from 10-40% time over a long period.

We have worked out how to do both one-off projects and retainers with Finance. As far as external funders are concerned, our staff count as researchers, so we can use any funding you might have.

If you think you have a project or want us on retainer, let us know: Research Software Engineers or Starting a project with us.

This is a valid question. Compared to many RSE groups, we seem to focus on many small questions for a broad audience that knows a lot about the problems they need to solve. Thus, we can come in to something already set up well, provide help, and mostly back off and be available for maintenance long-term. The units that fund us (schools, departments) have been happy with this, so we’ve kept it up. On the other hand, we are pretty fast. There have been projects where a summer worker was going to be hired, which we ended up doing (learning the framework plus all the main tasks) in two weeks. The way we work together as a team also makes things quite fast. Thus, a project has to be quite deep in order to exceed a month of work.

Read more ...


Kickstart 2023 wrap-up and thoughts for the future

Our kickstart course came and went with very few problems. This post summarizes our general thoughts on the course and its format.

If you want to join the course next year (as an attendee, or as an organization that will send your learners to us and maybe co-teach), follow us on Mastodon. This is the third year we’ve done the livestream format, and it’s not likely to stop anytime soon.

This was originally written in June 2023 but publication was forgotten until 2024.

The course has run since around 2015 or so. Until mid 2020, it was always in-person only. Until (and including) 2022, it ran twice a year, January and June, but now it runs only in June (increased availability of videos + the material compensates). It runs in June so that it aligns with new summer research interns starting. Until around 2020, it was mostly about using the HPC cluster at Aalto University, but since then there has been more emphasis on day 1 covering generic skills needed for scientific computing and the big picture of things.

Our general feedback remains quite positive. Our streaming + co-teaching + collaborative notes format is still well received, and there seems to be little reason to go back, even for smaller-scale courses. Instead of just lectures, written material (tutorials and info on scicomp.aalto.fi) + livestream + videos is a good combination.

There is never enough time - not much else to say. Each year there is a different trade-off between how much we cover and how brief we are. (There are always people who say we should go more in-depth, and some who say we go too much in-depth. Such is life.)

Repetition is good, but not when it’s a sign that we can’t stop talking and keep saying the same thing over and over. The best lessons seemed to be the ones that were taught most quickly, since they had a high density of new information. We should strive to make more lessons faster, and leave details to the reading.

Because the teachers also do support, for anything difficult we can easily tell learners: “Do what you can, come by our SciComp garage to ask for help with anything else.” This overall reduces the demands of teaching: a person doesn’t have to know everything, just enough to get started and to know when they may need more help with more advanced tools. This really is good for both sides.

As usual, we expected our learners to read our shell crash course in advance. We also had a new tutorial on using the cluster from the shell. This helped some, but it was still a problem.

Reflection: this will always be a problem in any course that has a wide enough audience. We should accept and provide positive support for those not ready, and not try to exclude them. It’s OK to see a course and then strive to get the prerequisites later.

Internally, we had this thought of dividing the course in two: a basic part at the start of the summer, and an advanced part at the end of the summer - since brand new researchers may have trouble understanding everything. On the other hand, the fact we have videos means that people can come back and review the material when they are ready. So in some sense, learners can divide the course however they would like by stopping when they think it’s no longer necessary and coming back. This could be mentioned more explicitly in our introductions.

Attendance goes down day-by-day. This is definitely OK - it doesn’t hurt anyone. It’s expected that day 1 was suitable for the most people (even those not doing HPC work), and then the course topics got continually more specific as we went further and further.

As mentioned above, this is even expected and encouraged - better to have someone attend day 1 than not at all.

Our exercises are quite basic overall, but we got few complaints about this. Basic exercises are better than something too advanced or realistic that requires many things to come together.

This year, we tried to have a complete solution for every exercise (script and/or commands), even if it’s directly said above in the lesson. This seemed to be good, since for people very short of time, they still have some chance to copy and paste and do the exercises. For those passively following, they can at least see what would have been done.

Day 3 / end-of-course positive feedback (each "o" is one person voting for/agreeing with that option):

it’s great that the material is so easily accessible also after the course to go through things in my own pace again oo

Really good format with the streaming and the shared document for questions. ooooo

The cat kept me focused in the lecture

Live interaction with the instructes were very helpful and exercises were nice

I really appreaciate the instructors took the time to explain the jargons, instead of just letting them fly around. o

The fact that the instructors were really nice contributed to the good course experience. Thanks for that! o

(day 1) After studying remotely for 1,5 year and having lots of online classes, I highly appreciate the amazing audio quality here. Many thanks for that!

(day 1) The framework is better than any other workshop I’ve ever attended - in terms of interaction and audio quality. HackMD is great.

(day 1) The (twitch) vertical screen thing is genius and should be used in way more (online) lectures o

Most common negative feedback: not enough time! In fact, that’s almost the only thing to improve. Except we can’t add more time, so I think we did pretty well. And the videos/material allow follow-up.

Summer kickstart

How we did summer kickstart 2021

Read more ...


ASC development day, 2023 August

We had another development day (previous: ASC development day, 2023 March). It went mostly like the last one, and we have less important news for the world, but below is the summary anyway.

We have about 1550 people with accounts, with 202 new account requests in the last six months.

Most routine issues tend to be about software installation, which is good (this is the actually hard part, it’s good people ask us).

We are still on track for about 500 garage visits per year. We don’t try too hard to keep track of them all; we might capture about 75% of them.

The number of interactive and Jupyter users is increasing, while Open OnDemand use is decreasing. This is the wrong direction from what we’d like. We will open OOD to connections from all of Finland to make this easier.

Triton v3 is still on the way. This isn’t a new cluster, but a new operating system which individual nodes will be migrated to slowly (while maintaining the same accounts and data). Most of this happens in the background, but the change of the base operating system image will require most code to be recompiled, which will require attention from many users. The transition can be made slowly; both the old and new OS will run for the time being. There won’t be a change in the total amount of computing power.

An upcoming blog post will discuss this more, and the effects on users. Now is the time to start preparing. We still expect the transition to happen sometime in the autumn.

We are thinking to merge home and scratch directories, to make a common quota for both. This would improve usability by reducing the frequency of home quota affecting usage. We’d welcome any other usability suggestions.

Practically, we are using the chance to automate things even more, which should make it easier to manage in the future.

Teaching has gone well. For this academic year, we’d like to add back in a few smaller, special-purpose courses (not just to teach them, but also to get good quality video recordings for the future).

Goals:

Developing and delivering the “workflows” course with CodeRefinery

Short courses to record (e.g. rerun of debug series, once a week, record and publish).

Update the Debugging page, linking the different debugging course repositories.

LUMI is the new EU cluster with plentiful GPU resources. A user can essentially get as many GPU resources as they need with no waiting, but since the GPUs are AMD, there is some initial barrier. Our general feeling remains: “we won’t recommend our users directly go and use LUMI, but we recommend they talk with us first and we help them use it”.

Next steps:

Continue encouraging users to contact us.

RSEs will ask the top GPU user each week if they would like support with taking LUMI into use. We’ll go and do all the setup for them.

Slide on infoscreens around the buildings?

Read more ...


libwebp security vulnerability and computational scientists

Recently, a major security vulnerability (CVE-2023-5129) has been found in libwebp, an image decoding library for the .webp format. This is major, since this library is embedded in many apps and web browsers and allows remote code execution just by opening a file. For computational scientists, there is still some impact - and it’s harder to compensate for. In short, just by processing an image in the .webp format, someone can take over your computer or session.

libwebp is the current issue, but the problem is general: computational scientists often create software environments and use them for a long time. These environments aren’t usually browsing the web (the most likely attack vector here), but they do involve lots of code installed from different projects. How does one manage security in this case?

This post may be updated

If you use web browsers or apps on your own desktops, laptops, phones, etc., make sure to update them!

If you don’t use images in your research, there probably isn’t much impact.

If you do, this is what could happen:

You make a Python / Anaconda environment which uses libwebp somehow - directly installed through Conda, or some other application.

You download a dataset containing images. You process them as part of your research with the old environment.

The malicious image runs an exploit. It has access to your whole user account on that computer: extract any data, add SSH keys for remote access, corrupt/delete data (which might not be backed up from the cluster…).

Many things have to happen here, but it’s very possible. You could lose non-backed-up data or code, or other confidential or sensitive data could be compromised, since code from one project running under your user account has access to all projects in your account.

One would normally fix things by updating software. But when you are dealing with a research environment that can’t easily be updated, what should you do? This is the real question here.

It’s a multi-layered problem, and the answer will depend on your work. libwebp is what we are thinking about now, but the problem is general: there are other security problems that occasionally come up that can affect more scientific code. How do you prepare for next time?

Update your environments (conda, virtualenv, etc). You could try to see if libwebp is inside of them (conda list | grep webp), but especially for Pip packages it might not be apparent.

Make your environments reproducible: If you define your dependencies in requirements.txt (Python), environment.yml (conda), or whatever is suitable for your language, you can easily re-generate environments to bring everything up to date. (delete old one, re-create).

If you pin versions of dependencies (like numpy==1.20.0), it’s possible it can pull in older versions of other dependencies.

Containerize your workflows. If code runs inside of a container, it is kept isolated from the rest of the operating system and user account. (Containers aren’t usually designed for strict security, but it’s better than nothing.)

If you use pre-built modules on the cluster, try not to use old versions. We’ll update some recent modules, but we can’t update all of the old ones. At least webp is in the default anaconda modules.

If you write or maintain software in general, keep it up to date as much as reasonable! Don’t make others get into a place where they are having to use old versions of libraries to make it work.

In general, think about your dependencies. Be at least a little bit suspicious before you install random other software, that may possibly pull in lots of other dependencies. Of course, as a researcher, you may not have much choice.

These commands seem to be able to update an environment to a newer libwebp. It seems to work on newer environments, but we don’t know for sure. Instead of mamba, conda in theory works but is so slow it may not be practical:
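A sketch of the kind of commands meant here, assuming a conda environment named “myenv” (the name is a placeholder):

    # Update just libwebp in the environment:
    mamba update --name myenv libwebp
    # ...or bring everything in the environment up to date:
    mamba update --name myenv --all
    # ...or, if the environment is defined in environment.yml, re-create it:
    mamba env remove --name myenv
    mamba env create --name myenv --file environment.yml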

There is a major security vulnerability in libwebp. While the impact on computational scientists may not be that large, the bigger issue is the difficulty of keeping all of the environments up to date so that next time this happens, it’s easier to respond.

We hope to have more security recommendations for computational scientists in the future. If anyone is interested in collaborating on this, let us know.

Common apps which embed Chrome or libwebp: Chrome, Firefox, VSCode, Zulip, Slack, Discord… things that use Electron to embed a web browser are affected, and that’s many things.

Read more ...


Aalto public servers requiring passwords with SSH keys

From 2023-09-25, publicly accessible Aalto server login is changing and will now require a password in addition to SSH keys. This will have a significant usability impact on some users. This post is made as a landing page for users who need immediate, practical help and for whom the aalto.fi page isn’t findable or detailed enough. The official contact is the IT Services service desk

The reference page SSH has been updated to include detailed reference information for every common operating system and SSH client. Secure Shell is one of the standard methods of connecting to remote servers and it is important that users of all skill levels are able to use it securely.

This change is not from Science-IT, but since it will affect many of our users and is not being publicized or supported very much, we are preemptively doing some major user support.

What is not happening is: requiring locally encrypted SSH keys (although this is highly recommended).

What is happening: When you connect to an SSH server from outside Aalto networks, you will need to have an SSH key set up and send your Aalto password to the remote server interactively.

If you already have an SSH key set up, you’ll start to be asked to enter a password every time you connect.

You can always connect to the Aalto VPN in advance to prevent this, but there may be cases where this isn’t a practical solution.

If you do not have an SSH key set up, you should:

Follow SSH to generate an SSH key - we have heavily revised this page to cover almost every common SSH arrangement.

Place your SSH key on any common Aalto server (kosh, etc. - not Triton, since that doesn’t share home directories with the public servers). A command-line sketch follows after this list.

You could connect by VPN, and then use normal password to connect and add the key.

You could use https://vdi.aalto.fi with a Linux computer to place the key.

You could place the key while on an Aalto network (as usual, this means eduroam or aalto only from an Aalto computer).

You could use another computer that’s already set up with an SSH key to place the key.

The key will then be available on all common Aalto shell servers (and other workstations), since they share the home directory.
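A minimal sketch of those steps with command-line OpenSSH, using kosh as the example server (replace the username with your own):

    # Generate a key pair on your own machine (choose a strong passphrase when asked):
    ssh-keygen -t ed25519
    # Copy the public key to a common Aalto shell server; this works while you can
    # still log in with a password (from the Aalto network or VPN):
    ssh-copy-id yourusername@kosh.aalto.fi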

Re-read SSH, in particular the SSH key agent, ProxyJump and Multiplexing sections, to see how to configure your SSH to minimize the number of times you need to enter passwords.
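For example, a sketch of ~/.ssh/config entries that cut down repeated password prompts (the host aliases and username are illustrative; see the SSH page for the recommended setup):

    Host kosh
        HostName kosh.aalto.fi
        User yourusername

    Host triton
        HostName triton.aalto.fi
        User yourusername
        # Go via kosh to reach Triton from outside Aalto:
        ProxyJump kosh
        # Multiplexing: reuse one authenticated connection for later sessions:
        ControlMaster auto
        ControlPath ~/.ssh/control-%r@%h-%p
        ControlPersist 10m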

This was needed for security, as evidenced by recent history. Password-only login is simply not feasible anymore (and hasn’t been for some time). Removing passwords as an option is good security practice that most organizations should adopt these days.

But why an ssh key and remote password instead of a properly encrypted SSH key? An SSH key requires something you have (the key) and something you know (the password), doesn’t it? And doesn’t require sending a plaintext password to the remote server. This was decided by whoever is setting this up, probably partly due to the fact that it is not possible to enforce passwords on SSH keys via the server config.

In general (outside of Aalto), you should use SSH keys everywhere and be wary of ever sending plaintext passwords to remote servers (even in conjunction with an SSH key). Security is important, and by using SSH keys with local encryption of the key you are doing your part.

We apologize for the difficulty in getting work done and want to help you as much as possible (though Science-IT was not the ones that designed this or communicated it).

There are, unfortunately, some trivial workarounds that involve putting your password in plain text on your computer to script things. However, please note that writing passwords down (outside of password managers) is bad security practice and against the Aalto password guidelines. It is better to contact us to help design a better and more secure workflow, or to contact IT Services and ask them to consider other use cases.

Read more ...


Preparing for new Triton

Sometime in autumn of 2023 (e.g. October/November), we will do a major update of Triton: updating the basic operating system, and thus almost everything else. There are big benefits to this (a newer basic operating system and software), but such a basic update affects almost every user and, for a short time, will make a lot of work for almost everyone. This post gives advance warning and a chance to give feedback on how to make the update most usable.

This post is just advance warning and things to prepare already. All actual instructions will come later.

We will update the basic operating system from CentOS 7 to something else (Red Hat 9). We’ve ordered all new management hardware to make the backend more reliable and manageable. Along with this comes an update of the software build system, which should allow us to deploy software to our users even better. We’ll also update our configuration management system for more reproducibility.

We also hope to think about the usability of the new system: remove a lot of old options and add in new, simpler ways of doing what people need.

All data and storage will remain the same, so there is no big data migration needed.

The old and new clusters will be accessible at the same time (two different login nodes), with the same filesystems mounted (same data available) and some compute resources still available there, so that people can slowly migrate. But the old one won’t stay running too long, to avoid long maintenance effort or splitting of the resources.

The biggest problem with big cluster updates like this is reproducibility: does your work from a month ago still work a month from now? If not, this is a big problem. It’s even worse if there is a much longer gap before you come back to it (paper revisions, anyone?).

You could say there are two things that can go wrong with a cluster upgrade or change:

Specific software/code that needs to be compiled and installed: Software needs re-compiling for new clusters or new cluster OS updates.

Whole workflows: you need to make all the pieces work together. Different paths and workflow managers may need updating.

What you can do:

Manage any messes you have earlier rather than later. It’s better if you slowly clean up over time, so you can focus on the differences once the change happens.

Know what software you are using. It’s easier for us to re-install something we have already installed when someone can tell us the exact name and version that they are using.

Have tests for your software: some way to validate that it works correctly.

Contact Aalto RSE for hands-on help supporting the transition. Come to the garage early and often.

If there are any annoyances about Triton that you’d like us to consider for the upgrade, now is the time to let us know so we can plan them. We especially value feedback on usability problems.

Discuss with us in our chat, or open a Triton issue.

This post has been updated with minor corrections; the changes can be found in the git history.

Read more ...


The Aalto RSE hiring process

This post describes the hiring process of Aalto RSE. The goal is to make hiring more equitable by providing the background information so that everyone can apply successfully. For those not applying to us, it might still provide some valuable insight about how to market your skills as a PhD making a sideways career move. What’s said here may not apply to every organization, but it might give you some things to think about.

Disclaimer: This page is a rough average description of the past, not a promise to always do this in the future.

Aalto RSE has usually hired people who have postdoc experience and will transition to a more applied software/data/computing oriented role (as opposed to being focused on writing papers). For many people, we are their first experience of job applications post-degree, and thus they have to learn how to present their skills in a new, non-academic context.

One should start by reading about us - we have lots of information publicly available about what we do and how we think. This should be understood in order to do the next steps well.

The cover letter is the most important thing we read, and the first and most important filter. It’s read before the CV.

At the level we are at, almost everyone’s CV and achievements are effectively equivalent. Does it matter who got the most fancy papers? Who has the most awards? The classes people took? When most of a person’s knowledge has come from self-study, probably not. The cover letter is the chance to interpret your skills in the context of the job you are applying for.

When reading the cover letter, the first question we ask is “does this person know what they are applying to and know why they think they are a good fit?” (It’s always interesting to get letters which clearly don’t understand the job, but on the other hand it’s an easy filter.) The first paragraph should answer this question and signal that the rest of the letter will go into detail about why. Start with the most important information; don’t make it hard for us.

Beyond that, talk about interests and skills as relevant to the organization. Discuss special projects, including non-academic ones or random things that you are interested in (this is especially true for us, since we are the transition from academia to practical work). Our job advertisement gives you some specific ideas that you can talk about. Anything specifically important to the job should be pointed out here and not just left in the CV.

If you don’t exactly fit the stated job requirements: here is the chance to explain it. The job requirements have to say roughly what we need (to not waste people’s time when applying, and because our hiring decisions must be justifiable based on the requirements), but there are many cases where someone with different experience can accomplish our actual goal (as said in the job ad or found in your background research). A person who can explain this, and show that they are adaptable, will have a very good chance.

We have adopted some system of anonymous recruiting. We request that cover letters are submitted without identifying information (name, signature, etc) so that one person gives them numbers, and a broader group tries to take a non-biased look at them. After this initial impression, we bring in the rest of the application. Don’t make assumptions about what the reader will know about your background, just say it.

The letter should be as short as possible to get the information across. One page is usually about the shortest we get, and a bit less than two pages is typical. But if it’s engaging, we’ll read as much as you write. Remember, most important information first, don’t make us hunt for things.

Update 2024: Do you want to use AI to write your cover letter? Please think again. Since LLMs became a thing, cover letters have become harder to read, longer, and more generic-sounding. It’s better to write in your own voice and be shorter than rely on what AI gives you.

The CV serves as non-anonymous reference information, but they are hard to read and all look pretty similar. To be honest, we don’t worry that much about the format and contents here: get us basic factual information in the most efficient way. For our particular jobs, non-academic skills such as software/data tools are more important than scientific articles, etc. Remember, we are busy and have plenty of applications, make it easy to read.

Open Science isn’t just good for research, it’s good for you, too. If you can point to public repositories of work you have done, this is very useful. Things like Gitlab/Github profiles with activity and your own projects, links to data you have released, etc. They don’t have to be perfect - something is better than nothing. The best case would be a few projects which are well-done (and you know it and point them out to us), and plenty more stuff that may be of lower quality to show you can get simple stuff done simply. Not everyone is fortunate to have a field where they can practice open science throughout their career, but even publishing a project or two before they apply for a job with us is very useful.

Despite what the previous section said, we do try to dig through applications that seem on-topic but don’t say everything we are looking for, to give them the most fair shot we can.

We always need to heavily filter the list down. Some relevant filtering includes:

Do they know what job they are applying for? Can they connect their skills to the job?

Have they touched on the main points in our job advertisement and the linked “Become a RSE” page?

Are they interested in teaching, mentoring, and real collaborative projects? Do they know what kind of teaching and mentoring we do?

Is there enough knowledge about the research process?

Any relevant skills about this call’s particular topic (if there is any)?

How do their skills and experience match what our team is currently missing, regardless of the open call?

How similar has their previous work been to “research engineering” (helping the research process) instead of only focusing on academic promotion?

The recruitment team makes several passes over the applications, and we discuss how to filter them down. We try to get a good variety of candidates.

Sometimes there are initial recorded “video interviews”, which provide some initial familiarity in both directions before the actual interviews. We know these are non-interactive and that a recording isn’t a conversation, so this is harder than an interview, but we consider that when watching them. One shouldn’t worry too much about these, if we do them.

Our actual interviews are not designed to be stressful. We have some prepared questions and go through them in a friendly manner. You have a chance to ask us questions at the beginning and end (and any other time too). The questions are designed to hear about your experiences, not to trick or test you.

We don’t currently ask technical challenge questions. The number of things which you’d need to know is so broad, it’s more important that you can learn things quickly. Since we usually interview relatively advanced people, we can instead look at existing projects they have done and check references, without having to do a technical challenge. This may change depending on the type of candidates we are interviewing, but just like the main interviews we are more interested in how people think, rather than raw knowledge.

In the future, there might be more “meet the team” kind of events.

We want to respond to people as soon as possible, but there’s a simple fact: we don’t want to tell anyone “no” until we are very sure we have an acceptance (we don’t want to tell someone “no” and then hire them later), and we have very many qualified candidates. So there is often an unfortunately long delay in hearing back. We hope that everyone knows within a month, though (and ideally ~2 weeks if all goes well).

We get a relatively large number of applications, with a lot of good people. So far (before 2023), we have been hiring at a relatively high level: researchers with postdoc experience who have had some sort of RSE-like experience helping others with research (beyond only focusing on making papers for themselves) and technology. Don’t let this discourage you. There are many qualified applicants, so if you don’t get selected, that doesn’t mean that you were unqualified. We look at everyone, regardless of their level, for every position. The fit to our particular job is more important than anything else, so keep trying until you get the right fit - it’s just a numbers game.

For reference, this is an older job application text, so that you can see how the things above are integrated. (to be updated with the 2023 version soon)

[ standard header removed ]

Aalto Scientific Computing is looking for a

Research Software Engineer/Supporter

for a permanent, full-time position.

Are you more of a programmer than your researcher colleagues? Are you more of a researcher than commercial developers? Do you fit in both, but have a home in neither? Be a Research Software Engineer with us and find your home. If you are looking for a career path which combines the interesting parts of both fields, this is a good choice.

Aalto Scientific Computing is an elite “special forces” unit of Research IT, providing high-performance computing hardware, management, research support, teaching, and training. Our team consists of a core of PhD staff working with top researchers throughout the university. Our services are used by every school at Aalto University and known throughout Finland and the Nordics. All our work is open-source by default and we take an active part in worldwide projects.

In this position, you will:

Provide software development and consulting as a service, depending on demand from research groups.

Provide one-on-one research support from a software, programming, Linux, data, and infrastructure perspective: short-term projects helping researchers with specific tasks, so that the researchers gain competence to work independently.

As needed and depending on interest, teaching and other research infrastructure support.

Continually learn new skills as part of our team.

Primary qualifications: There are two main tracks, and candidates of diverse backgrounds are encouraged to apply – every candidate will be evaluated according to their own unique experiences.

PhD degree with research experience in some computational field and much knowledge of practical computing strategies for research, or

Software developer or computational scientist with a strong software/open source/Linux background, scientific computing experience, and some experience in research. Masters degree or similar experience.

This particular call emphasizes the ability to work in machine learning and AI environments. The ideal candidate will be working closely with machine learning researchers, and thus a background in machine learning is highly desirable.

Important skills:

Ability to tackle any problem with a researcher’s mindset and a developer’s passion for technology.

Experience or knowledge of the principles of open source software, open science, and software development tools such as version control.

Please see https://scicomp.aalto.fi/rse/become-a-rse/ for more information on what kind of skills we value - or more precisely what you are likely to learn.

What we offer:

You will join the dynamic Aalto Scientific Computing team, where you will learn from some of the best research IT specialists in Finland.

Co-working within top-quality research groups, getting experience in a wide variety of fields and developing an extensive network of scientific contacts. This includes contacts to the Aalto startup scene and community.

A way to be close to the research process while focusing on interesting computational problems and not the publication process.

Our program will offer you a chance to improve your software skills – you are expected to engage in plenty of professional development.

Open Source is our expectation. All (or most) of your code may be open source and may be added to your public CV, depending on the needs of researchers.

Salary will be according to experience, for a recently graduated PhD similar to a postdoc salary. Work hours are flexible, but are expected to sync with the audience being served. Primary workplace is Otaniemi, Espoo (Helsinki region), Finland. Aalto University has a hybrid work policy which allows 60% remote work possibility, and our team takes good advantage of this flexibility.

To apply successfully:

Please include a separate cover letter (~1-2 pages). Please try to write your cover letter avoiding information like name, gender, nationality or other demographic information that is not directly related to why you would be the right person for this position (this includes, for example, a signature on the letter) unless you think it benefits you. This will assist in anonymous recruitment possibilities. The letter should include for example:

Why being a Research Software Engineer is for you,

past research experience, if any

past technical teaching or mentoring experience,

past software development experience (even informal self-learning),

past Linux, command line, or scripting experience,

highlight one (or a few) collaborative projects you have taken part in and your role within them, and

what you bring and what you intend to learn.

A normal professional or academic CV including

a list of your technical and programming tools and level of proficiency (e.g. basic/proficient/expert). This is the time to show the breadth of your experience.

Github link or other public sample code. If not available, whatever is possible to demonstrate past programming experience. Please highlight one or two of your outstanding research software projects.

[ standard footer removed ]

Read more ...


Whisper deployed on Triton, LLMs coming

Whisper on Triton documentation

OpenAI Whisper is a tool for speech transcription. It works well and has potential applications in many different research and non-research use cases. Using it isn’t too hard - if you can install it and if you have a GPU. Often, the installation can become a big barrier, especially for “just testing”.

Luckily, we have a cluster with GPUs and a way to provide software for researchers. We’ve made Whisper available on the cluster as a module, so it’s trivial to use it for any audio data you may have. All one needs to do is run a single command (shown in the documentation linked below).

It might look complicated, but all you need to do is copy and paste: the first part requests the resources, the middle specifies your file, and the last part sets some standard options, like using our pre-downloaded model files. Yes - this still requires knowledge of how to use a cluster in general, but once you’ve got that knowledge, transcribing audio is trivial. We have a self-study course on cluster usage, and users can always drop by our daily garage and ask us for help.

See the Whisper on Triton documentation for more information on the use.

We are also preparing a way to do this through the cluster web interface Open OnDemand, which will remove most of the need to know how a cluster works and make the tool even more accessible to other communities.

We hope to make other tools available like this.

Whisper is just one of the latest tools, but you’ve probably noticed that large language models are very popular these days. There are, in fact, some that can run locally on our own cluster, and our goal is to deploy more of these so that they can be easily tested and used. The intention isn’t to make a replacement for existing LLM services, but to make internal use for testing, research, and development easier.

Local installs have various benefits, including lower cost (since we already own the hardware), being able to ensure reproducibility longer-term (since models are locally downloaded and preserved), and being able to use them without various registrations. The downside is that the most popular models aren’t available for local use.

Contact us if you need other models deployed, or if you have trouble using what’s already out there. We are still in an early phase, and there will probably be some difficulties in availability, accessibility, and reusability. Contact us early if you notice anything that’s not right. We both help install things and help use them, as a research engineering partner.

It’s clear that artificial intelligence and machine learning tools will become ever more critical for research across fields. The difficulty in deploying and using them could become a barrier, and that is where Aalto Scientific Computing comes in. It’s our goal to make sure the infrastructure that researchers need is ready and able to be used by everyone, not just those with classic HPC experience.

Here we go over some implementation details, which may help others who want to deploy similar things on their own clusters. If you just want to use things, you don’t need to read on.

We installed whisper in a container, so that all dependencies are packaged together and things are portable. The model definitions themselves are not included in the container, but mounted in. We try to find options that allow one to specify the model and model directory, so that the user can try out different models without downloading each one. The Lmod module file prints out some help when loaded.
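As an illustration of what “specify the model and model directory” means, here is a minimal sketch using the openai-whisper Python API. The paths are hypothetical, and the container and module wiring are left out.

```python
# Minimal sketch (hypothetical paths): load a pre-downloaded Whisper model
# from a shared directory instead of downloading it at run time.
import whisper

# download_root points at the shared, pre-downloaded model files;
# nothing is fetched from the network if the model is already there.
model = whisper.load_model("large-v2", download_root="/path/to/shared/whisper-models")

result = model.transcribe("interview.wav")
print(result["text"])
```

The module essentially bakes this kind of configuration into the defaults, so users don’t have to think about where the models live.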

We’ve got two versions installed: normal Whisper, and Whisper-diarization (which can identify speakers in the transcript).

Whisper and diarization both have multiple different implementations. It’s a bit of guesswork to figure out which one is the easiest to get running / works the best (not in terms of transcript quality, but ease of deployment in a container and with local models). This led to a switch to another implementation of diarization midway through, since the new one is more actively developed and seems overall slightly better. A lot of the work was fortunately transferable to the new implementation.

There were the common issues with getting the right dependencies in a container and getting the GPUs to work there. This is pretty standard by now.

Most implementations of Whisper want to download models when running. This might make sense for a general user, but doesn’t really make sense on a cluster. Depending on the implementation, getting it to use local models is not always trivial. Since GPU execution of diarization uses several models at once, there doesn’t seem to be a simple way to have it use local models at all without changing the code. It also required some sleuthing to find out where exactly the models are downloaded. If a code uses Hugging Face, its cache environment variables can be useful.
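For reference, the Hugging Face libraries respect a few cache-related environment variables. Here is a minimal sketch of pointing them at a shared, pre-downloaded model directory; the path is hypothetical, and which variables a given tool actually respects varies by version.

```python
# Sketch (hypothetical path): point Hugging Face caches at a shared model
# directory and forbid network downloads. Set these before importing the
# libraries (or export them in the job script).
import os

os.environ["HF_HOME"] = "/path/to/shared/hf-cache"             # general cache location
os.environ["TRANSFORMERS_CACHE"] = "/path/to/shared/hf-cache"  # respected by older transformers versions
os.environ["HF_HUB_OFFLINE"] = "1"                             # fail loudly instead of downloading
```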

Making a module that is easy and practical for users without losing options is usually a bit tricky: we want users to be able to do anything, for “the right thing” to happen automatically, and not to build some opaque framework to make it happen. Singularity-wrapper fortunately helps quite a bit by doing a lot of background work such as binding directories, setting GPU flags, etc. cleanly, without users having to care about it, while still giving the option to run the container straight through Apptainer/Singularity if finer control is necessary.

Testing whether the containers work is somewhat annoying. Diarization in particular saves a lot of cache files all over the place, which all need to be purged when testing GPU execution. Otherwise the GPU will stay idle, since everything it would do is already in the cache. This also affects clean-up after users run the code.

A minor inconvenience for us (but possibly a large one for users) is that the syntax for each Whisper CLI implementation tends to differ slightly. This makes swapping between implementations slightly annoying, since you have to check every time what the syntax for the flags is.

Read more ...


SciComp Kickstart - 2023 plans and yearly strategy

It’s time for our “kickstart course” - let’s talk about what that is, why, and why you might want to attend.

The full name is “Introduction to scientific computing and HPC” (high-performance computing), and it used to be called “HPC Kickstart” and was taught without the first day, thus the short name “kickstart” we still use. Some years day 1 had a different name, but was still taught together with days 2-3 as a package.

Our goal isn’t just to teach some skills, but to form a community around scientific computing - with researchers who have a common language to work together and help each other, supported by Aalto Scientific Computing in the background.

Course page in 2023.

Day 1 is not about high-performance computing, but the basic skills needed to do scientific computing: things like Linux usage, data management, and the types of tools available for different problems. It is for almost anyone doing any kind of programming or scientific computing work, regardless of background. These kinds of skills aren’t taught in academic degree programs. We teach them on day 1 because otherwise, new researchers have to learn them from each other or re-invent them.

Days 2 and 3 are about high-performance computing - more precisely, basic cluster usage, with a focus on the basics. This covers the kinds of tools our community usually uses.

The topics have been refined over many years of both teaching and supporting junior researchers. Because of the way academic careers work (a great diversity of paths), these topics (even day 1) aren’t just for new researchers - everyone can find something to learn or brush up on.

For the past years, we have been trying to keep up this yearly summer schedule, which usually happens during the first full workweek:

Monday: HR introductions and other formalities for new summer workers - many departments seem to do something like this. This may happen earlier than the Monday of the kickstart week, since sometimes that comes too late.

Tuesday afternoon: Kickstart course day 1, the general scientific computing introduction. Applicable to everyone doing scientific computing.

Wednesday-Thursday afternoons: The HPC cluster usage part, which fewer people will attend compared to Tuesday.

Friday: we don’t have scheduled programs on Fridays, but sometimes there are communities who host advanced tutorials here about what their local users need. In 2023, there is at least an advanced GPU course then.

We are aware that there is a scheduling conflict with the CS summer day which is scheduled on the Tuesday of the 2023 HPC kickstart course. We did contact every department in January/February, yet this was still a surprise to us. In past years, we have adjusted our schedule to similar events, but this is not possible this year despite our best efforts.

We will still try to support researchers as much as possible. Recordings of previous years are available on YouTube, and we also release videos the same evening as the course, precisely to support everyone regardless of these conflicts. You can still join us for days 2 and 3 even if you did not join day 1. However, please take particular care with the instructions about setting up the Triton connection in advance.

We hope that this blog post explains our goals to a larger audience so that we can reach even more people in the future and onboard young researchers even more systematically. You can reach us at scip@aalto.fi, and each spring we reach out to the main departments to schedule each summer’s course.

Read more ...


ASC development day, 2023 March

We recently had an internal “development day”, which is our new name for getting together to talk about longer-term plans. This is our second “development day”. Overall, it went well, and we think that we are on a good path. There are three particular focus areas for the future:

Teaching: This was also a focus last time, and probably will still be in the future. We are overall happy with our decision last time to focus less on many small/medium courses, and instead focus on large, collaborative courses and then focused, individualized support for advanced use cases. Smaller courses happen mainly when we see specific needs that can’t be filled other ways (or we make them large, open, collaborative courses if there is a broad need).

Triton v3: The software/OS/management side of our cluster will be almost completely reworked in the next year (we aren’t getting rid of any hardware just for this). This will take a fair amount of our time, but is needed because existing systems are starting to show their age.

LUMI usage: LUMI is a flagship project of EuroHPC and provides huge resources, available to the same people that can use Triton. Triton is still needed for the ease of use of everyday projects, but we should actively look for people who could benefit from LUMI and help them port their work there. Our recent evaluations led to the conclusion that our porting help is still needed there.

Teaching has long been one of the pillars of ASC’s support. It’s still needed, but the focus seems to be changing. No longer is a room with 10-20 (or even 50) people considered a lot. People seem both more able and willing to find advanced material themselves, and more in need of basic principles (git, Python for SciComp, etc.). Perhaps this is also partly caused by the remote work period emphasizing how all this material is available online anyway. Our basic philosophy:

Focus on large courses for new researchers, for example using the CodeRefinery MOOC strategy. This reaches the most people, helps beginners the most, produces high-quality open source material for asynchronous reference, and has good possibilities for co-teaching. Examples include CodeRefinery, our SciComp/HPC kickstart course, and Python for Scientific Computing.

Advanced, one-on-one, or small-group support via the SciComp garage and the Research Software Engineering service. This isn’t just for projects, but is also a useful service for people learning from other advanced material in their work - basically, we work as mentors. One-on-one support is both more rewarding for us and probably more useful to the user (relative to the time demands on both ends). Besides, advanced courses often aren’t offered right when people need them, so we are left in this position anyway.

What about small/medium-sized courses, and advanced courses?

The first two points above squeeze out medium-sized courses for the most part, in our opinion. By the time our audience is at an intermediate or advanced level, they seem to be able to figure things out themselves and ask for help when needed - if they can figure out what they need to do. This point deserves further study, though. Instead, we point to other existing material.

We will make sure that we have good recommendations for advanced self-study courses and generally chart out the resources so that our users don’t have to. This is mostly done by our Hands-on Scientific Computing course.

In the past, we have supported community members in giving courses on topics in which they are experts. We will continue this as appropriate (see the next point).

We will continue to offer on-demand courses taught by us if someone requests them, and other smaller courses if we see a strong need. Contact us!

Triton is our HPC cluster, and is notable for being a Ship of Theseus: it’s continually upgraded while remaining the same cluster. This has resulted in the software running it getting a bit out of date. This software was originally developed as part of broader partnerships, and as these partnerships have changed, we need to take more responsibility for it ourselves.

Users shouldn’t see any major change from this, though part of the work is improving our (user) software installation tools, which should mean increased responsiveness to software installation requests.

As said above, LUMI is a significant resource, yet our users have not come to us asking for help in using it. Over the past six months, we have found some Triton users who would benefit from it and helped extend their workflows to work on LUMI. We do this by first testing some applications ourselves, then looking at Triton usage for large users and reaching out directly.

Currently our focus is on GPU-intensive applications, which is made more interesting because LUMI has AMD GPUs. We’ve gotten local AMD GPUs for our own testing and in general are well prepared to support this.

While LUMI is a HPC system and has a typical HPC system interface, it serves so many different users that the software stack is very limited, so that most users need to install their own software and figure out how to run it on AMD GPUs. This is why we recommend most users access LUMI through us (we’re paid to save you time, after all), though of course anyone interested can use it directly.

Read more ...


Aalto SciComp stickers and patches

We have stickers (and patches!) to support Aalto Scientific Computing. (You can get them from our IT offices in CS, NBE, and Physics) But why invest in this? Well, it’s fun, but there should be a deeper reason.

While our main goal is to maintain the Aalto University Triton HPC cluster and provide courses and direct support to researchers, we cannot scale to solve all problems and make the best decisions without a community: you! Thus, our new promotional material is designed so that the members of our community can show their support for scientific computing at Aalto University. We hope that by providing a way for the community to show this interest, people can find - and support - each other better.

We have the typical hexagonal stickers, which you can use on all the typical sticker things.

We also have patches, for those who are interested - in Finland they are a big thing on [student overalls](https://en.wikipedia.org/wiki/Student_boilersuit), but you could also sew them on your backpack or purse. Please send us pictures to inspire us all! (Some have Velcro backing for that kind of attachment - ask us for that style.)

You may notice that some of the patches have a black background and some have a white background. A black background means “Ask me anything about the tools of scientific computing - I am happy to help, or at least point you in the right direction (as much as I can)!”

Here’s our idea:

Anyone may take the white background ones

Black background is for:

Aalto Scientific Computing team staff

Volunteers at our events (for example helpers at our workshops)

Anyone who is interested in using their time to help others in scientific computing (regardless of their skills)

(clever people will notice that the first two are included in the third, and actually anyone can be the third if they want).

The idea is that we, and our community, can’t work alone. Everyone needs to support each other in order to work at the level we want. The in-group experts are an undervalued resource in this, often not getting the credit or recognition they deserve for supporting everyone. This is our small way of recognizing those supporters, and we hope that in the future we can support them even more - both career-wise and in supporting others.

Yes, we should have gotten black-background stickers. We’ll do that next time…

Read more ...


What code has to teach us #1: the impact of implicit behavior

As Research Software Engineers (RSEs), we read and write a lot of code. In this series of blog posts, we are going to share some snippets that taught us important lessons, and thereby impart that wisdom unto you. These snippets are taken from actual research code, responsible for producing results that end up in peer-reviewed scientific articles. That is to say, results that we should have some confidence in to be correct. However, problems have a way of cropping up in the most unexpected places and when they do, there is a chance to learn from them.

I was in the metro zooming through Lauttasaari when I received an email from my professor that made my heart skip a beat. We had just submitted a paper to Nature Communications and were all still a little giddy about finally sending off the project we had been working on for 3 years. She and the first author had been chatting about the cool methods we had been using for the project and a question arose: were we 100% certain that we “removed copies of the selected stimuli from the train set”? If we hadn’t, we would have to quickly pull back our submission, but surely we had, right? I thought we did. At least, I distinctly remember writing the code to do it. Just to be on the safe side, I decided to double-check the code.

Below is the analysis script in question. It reads some data, performs some preprocessing, feeds it into a machine learning algorithm called zero_shot_decoding, and stores the output. I present it here to you in full, because there are many subtleties working together that make this situation so scary. The question I pose to you, dear reader, is this: were the highlighted lines (118–120) executed, or did we have to pull our submission?

The first thing that went through my head, as it probably went through yours, was: this code is so long and complicated, answering this seemingly simple question is going to take some time to figure out. And I won’t blame you for giving up right then and there. Hunched over my laptop while the metro passed through Ruoholahti, I tried to trace the logic of the script.

First problem: much of the behavior of the script is dictated by the command line arguments. Luckily, their values are saved in the output file, so I could check that they were correct.

Lesson: always err on the side of caution when deciding whether it is worth storing something in the result file.
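As a minimal sketch of what this looks like in practice (the argument names are hypothetical): dump the parsed command line arguments into the result file next to the results, so that the output records how it was produced.

```python
# Sketch (hypothetical arguments): store the exact command line arguments
# alongside the results, so the output file documents how it was produced.
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--alpha", type=float, default=0.5)
args = parser.parse_args()

results = {"accuracy": 0.0}  # ...whatever the analysis actually computes

with open("results.json", "w") as f:
    json.dump({"args": vars(args), "results": results}, f, indent=2)
```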

That brings us to the big if-statement. Did the correct branch execute? Well, that depends on what was in the m dictionary, which translates to what variables were defined in the MATLAB file used as input to the script. If we had used the wrong variable name, i.e. brainVecsReps instead of brainVecs, when creating the input file, the wrong branch would have executed and the script would have been happily computing the wrong thing. And we would never know. If we had used the wrong input file, or the wrong version of the input file, the wrong branch would have executed without any indication that something was wrong. So many opportunities for small mistakes to lead to a big error.

Lesson: have the user be explicit in what they want to do, so the script can check the user’s intent against the inputs and raise a nice big error if they screwed up. In this case, there should really have been either a command line parameter determining which branch to execute, or even better, this should have been four separate scripts.
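A sketch of that idea, with a hypothetical --mode flag: the user states what they want, and the script checks that intent against the contents of the input file and refuses to run on a mismatch.

```python
# Sketch (hypothetical mode names): the user declares their intent explicitly,
# and the script raises a big error if the input file doesn't match it.
import argparse
from scipy.io import loadmat

parser = argparse.ArgumentParser()
parser.add_argument("input_file")
parser.add_argument("--mode", required=True, choices=["single-trial", "repetitions"])
args = parser.parse_args()

m = loadmat(args.input_file)

expected = {"single-trial": "brainVecs", "repetitions": "brainVecsReps"}[args.mode]
if expected not in m:
    raise ValueError(
        f"--mode {args.mode} expects the variable '{expected}' in {args.input_file}, "
        f"but it only contains: {sorted(k for k in m if not k.startswith('__'))}"
    )
```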

In the end, I searched the logfile for the line Stability selection DISABLED, single-trial data, which, thankfully, was there, so the correct branch did execute.

Lesson: be liberal with print-statements (or other logging directives) in your scripts; cherish the resulting logfiles.
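A minimal sketch of the kind of logging that saved us here: write to a logfile with timestamps, and record which branch actually executed (the flag name is made up for illustration).

```python
# Sketch: log to a file so that you can verify afterwards which branch
# of the analysis actually ran.
import logging

logging.basicConfig(
    filename="analysis.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

use_stability_selection = False  # hypothetical flag, for illustration
if use_stability_selection:
    logging.info("Stability selection ENABLED")
else:
    logging.info("Stability selection DISABLED, single-trial data")
```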

I breathed a sigh of relief as the metro pulled into the central railway station.

This if-statement is a work of insanity. What was I thinking, determining what the script should do based on a mostly random naming scheme of some variables in a MATLAB file? I got lucky that time. But from that moment on, I would heed this lesson:

Read more ...