1 Introduction

These notes relate to my learning and teaching interests in several fields connected to scientific computing (mostly applied mathematics and machine learning) and its applications. This home page is the entry point; those interested in more content and links can navigate between the multiple pages below. It can also serve as a general guide for introducing scientific computing, as it tries to present the minimal skill set any scientific computing engineer or scientist should have:

Version control comes first, everything else is worthless without it; currently that means Git.
Next comes software documentation with Doxygen, Sphinx, and/or Documenter.jl.
A low(er) level programming language among C, C++, and Fortran, preferably all of them.
Scripting languages: as of 2024, Python is mandatory and Julia highly recommended.
Basic machine learning in one of the above scripting languages; everything is ML these days.
Shell automation: the basics of both Bash (or another UNIX shell) and PowerShell are required.
Typesetting equations, reports, and presentations (beamer) in LaTeX.
Domain-specific skills related to the field of study (CFD, DFT, MD, ML, …).

A sample setup of an operating system for scientific computing and practicing the above skills is provided in a section dedicated to Ubuntu.

Some technologies have been mainstream or important in the past, but nowadays some of them have already died or are becoming too niche to be put in such a list. That is the case of SVN for version control. As for programming languages in science, that is the case of matlabish (MATLAB, Octave, Scilab) environments, which are still used by controls and automation people, but are mostly incompatible with good software practices and should be discouraged.

It is also worth getting familiar with high-performance computing (HPC); on the Top 500 page you can get to know the most powerful computers on Earth. The specification benchmarking page allows you to check hardware specifications, which is useful when planning an investment in computing infrastructure. Lastly, when working on multi-user systems it is worth knowing about job management systems such as Slurm.

As a last word, I would like to remind you that it is humanly impossible to master everything at once; even after more than 10 years in the field, as of today I only have a basic grasp of the tools I do not use every day. Software and methods evolve, and unless you keep using a specific tool you simply cannot afford to keep up to date with it. That should not be a roadblock for a scientist in the long term. As you get used to scientific software, getting back to a good level with some tool you used in the past is quick (though not extremely fast in some cases), and learning new tools for which you already know the underlying science is trivial. Even exploring new fields becomes easy in some cases.

1.1 Scientific programming

At its core, serious scientific computing requires scientific programming. To that end, this page provides access to programming learning materials and related links. If this subject is new to you, to successfully follow the contents you should first learn a bit about the environment we will use, VS Code 3, and the basics of the command prompt in Chapter 2.1.

1.1.1 Coding practices

It is not worth learning any programming before being introduced to good practices. Many programmers I know write garbage that works for them only. It is impossible to have a healthy collaboration if code is not standardized, which is why I place this highly biased introduction here.

One of the reasons Guido van Rossum created Python is that he wanted code to be readable. You should be able to guess what some code is doing even without specific technical knowledge about the language. This is probably the main feature that made the language so popular in the scientific world.

Although they are applicable to Python, the practices recommended in the famous PEP8 can be extended to other languages, including Julia. You should read PEP8 religiously. That document describes how to write clean and maintainable Python code. When transposing that to Julia, the minimum you are expected to do is:

lines are limited to 79 characters
use spaces around all operators
consistent indentation with spaces
blank lines around structural blocks
lower case variable names
Pascal-case structure names
use underscore to separate words
document functions properly
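
As a minimal sketch of what these items look like in practice, the Python snippet below applies them to a toy problem. The names `IdealGas`, `mass_density`, and `GAS_CONSTANT`, as well as the numerical values, are hypothetical and chosen only for illustration; the same layout carries over to Julia with its own syntax.

```python
"""Toy module illustrating the naming and layout conventions listed above."""

from dataclasses import dataclass

# Module-level constant in upper case with underscores.
GAS_CONSTANT = 8.314462618  # J/(mol.K)


@dataclass
class IdealGas:
    """Pascal-case structure name; fields are lower case with underscores."""

    molar_mass: float  # kg/mol
    temperature: float  # K
    pressure: float  # Pa


def mass_density(gas: IdealGas) -> float:
    """Return the mass density in kg/m³ computed from the ideal gas law."""
    return gas.pressure * gas.molar_mass / (GAS_CONSTANT * gas.temperature)


# Hypothetical values, roughly representative of air at room conditions.
air = IdealGas(molar_mass=0.02897, temperature=298.15, pressure=101325.0)
print(f"Density of air: {mass_density(air):.3f} kg/m³")
```

Notice the spaces around operators, the blank lines separating structural blocks, and the docstrings attached to the module, the structure, and the function.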

When you code, remember that most of the time what you are doing will be reviewed or used by somebody else, and that person might not be in the mood to decipher the cryptic code you wrote; if that person is me, I will promptly refuse to help you with badly written code. For newcomers, it is always better to talk about this before you write your first lines, because once you stick to bad practices you will hardly ever leave them. Before you write something new, check if your ideas are also consistent with PEP20.

Julia has its own stylistic conventions that are simpler than PEP8; the main differences are the way to name functions (it recommends gluing words together with no underscore) and the exclamation mark ! indicating that a function modifies its argument(s). For function naming you may choose to stick to the PEP8 recommendation, which is my personal choice. The detailed document is found here.

Python code documentation is generally done with Sphinx. Julia has its own syntax which can be used to generate package documentation with help of Documenter.jl.

Important: Julia supports Unicode input, but its use is highly discouraged in modules. Unicode characters are better suited to write application scripts such as notebooks (in Pluto or Jupyter).

1.1.2 Scientific publishing

The following tools might be of interest for creating scientific content with embedded code.

1.2 SSH key generation

There is something almost inevitable in scientific computing: you will need to connect to some remote machine at some point. Most of the time, that is a daily activity, whether you connect to a remote computer or to an HPC cluster. The most common protocol for such connections is SSH. This section guides you through the generation of a key pair and authentication through VS Code.

Creating the keys: generate the SSH key pair locally (i.e. on your workstation); common options are:

-t rsa: key type (RSA is widely supported)
-b 4096: key length (more bits = stronger, recommended 4096)
-C : comment (usually your email)

When running the command, accept defaults for storage at ~/.ssh/id_rsa[.pub]; optionally add a passphrase for additional security (but then you will need to enter it each time you need to connect, so that’s undesirable if the only reason you are creating the SSH key is to have quick access to the server).

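# RSA key (widely supported):
ssh-keygen -t rsa -b 4096 -C "yourusername@your.server.com"
# Alternative, Ed25519 key (fixed size, so -b does not apply; stored at ~/.ssh/id_ed25519[.pub]):
ssh-keygen -t ed25519 -C "yourusername@your.server.com"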

If you have password access to the server and ssh-copy-id is available, run the following:

ssh-copy-id -i ~/.ssh/id_rsa.pub user@remote_host

Alternatively (in Windows PowerShell for instance, but reformat it in a single line or replace the trailing backslashes by backticks) manually append the public key to ~/.ssh/authorized_keys on the server:

cat ~/.ssh/id_rsa.pub | \
    ssh yourusername@your.server.com \
    "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"

As a last option do it by hand, but you risk breaking the format of authorized_keys.

Testing Linux server: before anything, try connecting with your identity:

ssh -i ~/.ssh/id_rsa yourusername@your.server.com

If that falls back to password authentication, connect normally to the server and make sure the permissions of both the SSH directory and the authorized keys file are correct before trying again:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
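
To verify the result without reading `ls -l` output by eye, the permission bits can also be inspected from Python. This is only a sketch to be run on the server; the helper `has_mode` is a hypothetical name introduced here, and the paths assume the default OpenSSH layout under your home directory.

```python
import stat
from pathlib import Path


def has_mode(path: Path, expected: int) -> bool:
    """Check whether the permission bits of `path` equal `expected`."""
    return stat.S_IMODE(path.stat().st_mode) == expected


# Assumed default locations; adapt if your server uses another layout.
ssh_dir = Path.home() / ".ssh"
targets = [(ssh_dir, 0o700), (ssh_dir / "authorized_keys", 0o600)]

for path, expected in targets:
    if not path.exists():
        print(f"{path}: missing")
    elif has_mode(path, expected):
        print(f"{path}: ok")
    else:
        print(f"{path}: expected mode {oct(expected)}")
```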

The SSH server may not have key authentication enabled, which can be inspected without opening the actual configuration file as follows (requires sudo rights):

sudo sshd -T | grep pubkeyauthentication

If it is not enabled, edit the configuration file (find the PubkeyAuthentication entry and set it to yes) and restart the service:

sudo vim /etc/ssh/sshd_config
sudo systemctl restart sshd

# Additional step for SELinux only:
restorecon -Rv ~/.ssh

Test again; upon new failure, try the verbose mode of SSH connection on your workstation:

ssh -v yourusername@your.server.com

while simultaneously connected to the server (with sudo rights) and reading the logs:

# Debian-based:
sudo tail -f /var/log/auth.log

# Under RHEL/CentOS/Fedora:
sudo tail -f /var/log/secure

Adding the key to VS Code: perform the following steps:

Install Remote-SSH extension
Press F1 and search for Remote-SSH: Open SSH Configuration File
Add an entry like the following (modifying the host name and user):

Host myserver
    HostName your.server.com
    User yourusername
    IdentityFile ~/.ssh/id_rsa

If the above fails to fill in the right user name (sometimes the Windows user name will appear) you can try the following workaround to enforce the user:

Host yourusername@your.server.com
    HostName your.server.com
    User yourusername
    IdentityFile ~/.ssh/id_rsa

About cluster usage: this is the only case in which you might want to store both public and private keys in the same .ssh directory; to navigate across nodes (assuming your $HOME directory is the same on all of them) you need both keys. For security reasons, please use a key pair different from the one you use to connect to the cluster.

1.3 Git version control

1.3.1 Common activities

Upon connecting to a computer for the first time, remember to configure:

git config --global user.email "walter.dalmazsilva@gmail.com"
git config --global user.name "Walter Dal'Maz Silva"

# Do not track file modes (rwx):
git config core.fileMode false

# (GitHub CLI for Linux optional)
gh auth login

Some daily life reminders (you will repeat these so often that they will be imprinted in your brain):

git status

git add *

git commit -m "some message"

git checkout '<branch-name>'

git branch -d '<branch-name>'

Line ending normalization: instructions provided in this thread; do not forget to add a .gitattributes file to the project with * text=auto for checking-in files as normalized. Then run the following:

git add --update --renormalize

To create a GitHub Pages (gh-pages) branch with no history do the following:

git checkout --orphan gh-pages
git reset --hard
git commit --allow-empty -m "fresh and empty gh-pages branch"
git push origin gh-pages

1.3.2 Adding submodules

Generally speaking, adding a submodule to a repository should be a simple matter of:

git submodule add 'https://<path>/<to>/<repository>.git'

Nonetheless this might fail, especially for large repositories; I faced this issue, which I tried to fix by increasing the buffer size as reported in the link. This solved the issue but led me to another problem, which could be solved by downgrading the HTTP protocol version.

The reverse operation cannot be fully automated as discussed here. In general you start with

git rm '<path-to-submodule>'

and then manually remove the history with

rm -rf '.git/modules/<path-to-submodule>'

git config --remove-section 'submodule.<path-to-submodule>'

For managing submodules, the following might be useful:

# You added the submodule in another computer, now sync it here:
git submodule update --init --recursive

# Sync submodule to head of remote:
git submodule update --remote --merge

1.3.3 Other tips

Version control in Windows: for Windows users, TortoiseGit adds the possibility of managing version control and other features directly from the file explorer.

1.4 General Tips

1.4.1 Running Jupyterlab from a server

Before running the server it is a good idea to generate the user configuration file:

jupyter-lab --generate-config

By default it will be located at ~/.jupyter/jupyter_lab_config.py. Now you can add your own access token that will simplify the following steps (and allow for reproducible connections in the future).

c.IdentityProvider.token = '<YOUR_TOKEN>'
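
If you do not have a token at hand, one way to create a hard-to-guess value is Python's standard `secrets` module. The snippet below is a small sketch; the length of 24 bytes is an arbitrary choice made here, not a Jupyter requirement.

```python
import secrets

# 24 random bytes rendered as 48 hexadecimal characters.
token = secrets.token_hex(24)
print(token)
```

Paste the printed value in place of `<YOUR_TOKEN>` and keep it private, since anyone holding it can access the server.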

The idea is illustrated in this thread; first on the server side you need to start a headless service as provided below. Once Jupyter starts running, copy the token it will generate if you skipped the user configuration step above.

jupyter-lab --no-browser --port=8080

On the local side (the computer from which you wish to edit the notebooks) establish an SSH tunnel mapping the port chosen to serve Jupyter:

# Notice that the ports below must be specified as:
# ssh -L <local_port>:localhost:<remote_port> <REMOTE_USER>@<REMOTE_HOST>
ssh -L 8080:localhost:8080 <REMOTE_USER>@<REMOTE_HOST>

Now you can browse to http://localhost:8080/ and enter the token you copied earlier or the one you added to the configuration file. Notice that you need to keep the terminal used to launch the port forwarding open while you work.

1.4.2 Using nteract with a virtual environment

# Activate the virtual environment:
. .venv\Scripts\Activate.ps1

# Give the kernel a unique name:
$kName = "your-kernel-name"

# Install the kernel:
python -m ipykernel install --user `
    --name $kName --display-name "Python ($kName)"

# Run nteract:
nteract.exe

1.4.3 Downloading from YouTube

Retrieving a video or playlist from YouTube can be automated with help of yt-dlp.

To get the tool working under Ubuntu you can do the following:

# Install Python venv to create a local virtual environment:
sudo apt install python3-venv

# Create an environment in a directory also named venv:
python3 -m venv venv

# Activate the local environment:
source venv/bin/activate

# Use pip to install the tool:
pip install -U --pre "yt-dlp[default]"

NOTE: alternative applications such as youtube-dl and pytube are now considered to be legacy, as discussed in this post.

1.4.4 Installing Python packages behind proxy

To install a package behind a proxy that intercepts SSL traffic, one can mark the package index hosts as trusted, so that pip accepts them even when certificate verification fails. This is done with the following options:

pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org <pkg>

1.4.5 Regular expressions

Regular expression (or simply regex) processing is a must-have skill for anyone doing scientific computing. Most programs produce results or logs in plain text and do not support extracting specific data from them. That is where regex becomes your best friend. Unfortunately, over the years many flavors of regex have appeared, each claiming to offer advantages or to be more formal than its predecessors. Due to this, learning regex is often language-specific (most of the time you create and process regex from your favorite language) and sometimes even package-specific. Needless to say, regex may be more difficult to master than assembly programming.

Useful tools:

Matching between two strings: match all characters between two strings with lookbehind and lookahead patterns. Notice that the lookbehind requires the enclosing string to have a fixed length (at least under PCRE). For processing WallyTutor.jl documentation I have used a more generic approach but less general than what is proposed here.
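
A minimal sketch of the lookaround idea using Python's `re` module is given below. The log line and the delimiters `residual=<` and `>` are made up for illustration; Python, like PCRE, requires the lookbehind to have a fixed width.

```python
import re

# Hypothetical log line produced by some solver.
log = "step=12 residual=<1.25e-06> status=ok"

# (?<=...) is a lookbehind and (?=...) a lookahead: both must match,
# but neither is part of the returned text.
pattern = re.compile(r"(?<=residual=<).*?(?=>)")

match = pattern.search(log)
value = float(match.group(0)) if match else None
print(value)
```

The lazy quantifier `.*?` stops at the first closing delimiter, which matters when several delimited fields appear on the same line.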

Match any character across multiple lines: as described here with (.|\n)*.
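
The snippet below sketches this in Python with made-up `BEGIN`/`END` markers. It compares the portable alternation with the `re.DOTALL` flag, which is specific to Python's `re` module (other flavors offer similar switches under different names).

```python
import re

# Hypothetical text with a block spanning several lines.
text = "BEGIN\nline one\nline two\nEND\ntrailing"

# Portable form: explicit alternation between any character and a newline.
portable = re.search(r"BEGIN(.|\n)*?END", text)

# Python-specific form: re.DOTALL makes `.` match newlines as well.
dotall = re.search(r"BEGIN.*?END", text, flags=re.DOTALL)

# Without the flag, `.` never crosses a newline, so nothing is found.
plain = re.search(r"BEGIN.*?END", text)

print(portable.group(0) == dotall.group(0), plain is None)
```

When the flavor offers such a flag it is usually preferable, since the alternation creates a capturing group and tends to be slower on long inputs.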

Regex in Julia: currently joining regexes in Julia might be tricky (because of escaping characters); a solution is proposed here and seems to work just fine with minimal extra coding.