Notebook does not start

Short overview of notebook spawn process

When a notebook is spawned (via the Start server button), JupyterHub submits a worker job to the NAF HTCondor cluster on behalf of the user.

  • the user sees a progress page while the job is pending

JupyterHub periodically (once per second) queries HTCondor for the job state (initially idle) and progresses once the job state changes.

The notebook spawn will fail if

  • the job was rejected for some reason (its state changed to removed)
  • the job stays in the idle state for too long (JupyterHub then removes the job)

The spawn progresses if the state changes to running, i.e. the notebook job has been started on a worker node.
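The pending-phase polling described above can be sketched as follows. All names and the timeout value are illustrative assumptions, not the actual JupyterHub/HTCondor spawner API:

```python
import time

# Hedged sketch of the hub-side pending poll (illustrative names only).
PENDING_POLL_INTERVAL = 1   # seconds, per the description above
IDLE_TIMEOUT = 120          # assumed value: give up if the job never leaves idle

def wait_for_running(query_job_state, timeout=IDLE_TIMEOUT, sleep=time.sleep):
    """Poll the batch system until the job runs, is removed, or times out."""
    waited = 0
    while waited < timeout:
        state = query_job_state()
        if state == "running":
            return "running"          # spawn progresses
        if state == "removed":
            return "failed: removed"  # job was rejected
        sleep(PENDING_POLL_INTERVAL)  # job still idle, keep waiting
        waited += PENDING_POLL_INTERVAL
    return "failed: idle timeout"     # hub removes the job
```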

JupyterHub now knows which worker the notebook is running on, but does not yet know the port the notebook listens on.

The hub now creates a server entry in its local database, attached to the user's spawner entry.

The spawn process then progresses to JupyterHub waiting for the notebook to communicate this port.

  • the port is chosen randomly on notebook startup and is sent to JupyterHub immediately

When JupyterHub receives the port, it completes the server entry and concludes the handshake by connecting to the notebook API exposed on that port.

Once that succeeds, a route to the notebook is added, and relevant API requests from the client are forwarded to that notebook.

JupyterHub continues to periodically query the HTCondor cluster for the job state (once every 30 seconds) and removes the server entry and route once the job state changes to removed.

My notebook does not start

In general, the errors shown in the browser when a notebook fails to start are sparse and generic.

For more information about what actually went wrong, check the contents of .jupyterhub.condor.err in your $HOME (usually your AFS home).

Check your quota

The most common reason for notebooks not starting is lack of free space in your AFS $HOME: check your quota and delete some files if you are running low, then restart the notebook.
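On AFS the quota can be inspected with the OpenAFS command `fs listquota $HOME`. As a minimal sketch, the usage percentage can be computed from that output like this (the sample output below mimics typical OpenAFS formatting; exact column layout may vary between versions):

```python
def afs_quota_usage(listquota_output):
    """Parse `fs listquota` output, return (used_kb, quota_kb, percent_used)."""
    header, data = listquota_output.strip().splitlines()[:2]
    fields = data.split()
    # Columns: Volume Name, Quota, Used, %Used, Partition (values in KB)
    quota_kb, used_kb = int(fields[1]), int(fields[2])
    return used_kb, quota_kb, 100.0 * used_kb / quota_kb

# Illustrative sample output (not real account data):
SAMPLE = """\
Volume Name                    Quota       Used %Used   Partition
user.someone                 1000000     950000   95%         60%
"""
```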

**Below we list some known issues and solutions**

Known problem: deadlocked notebook entry in the hub

In some cases the user's notebook state has been observed to be deadlocked to the point that it cannot be removed without the help of an admin.

Notebook shutdown is generally scheduled when the hub's HTCondor polling ends, i.e. when HTCondor no longer reports the job as running.

The scheduled routines differ between notebooks reconnected during server startup and notebooks created after server startup.

The shutdown process roughly consists of first removing the route from the proxy, then removing the server entry and its reference in the user's spawner entry from the hub database.

One known issue is that the route deletion may throw an exception (e.g. a timeout), which prevents the removal of the server entry. Even if the route was actually removed despite the timeout, it is quickly recreated because the server entry still exists.
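The shutdown ordering and the workaround of catching the proxy exception can be sketched as follows (illustrative function names, not the real JupyterHub internals):

```python
# Sketch: route removal may fail (e.g. proxy timeout), but the database
# cleanup should run regardless, so the server entry cannot get stuck.
def shutdown_notebook(delete_route, delete_server_entry, log=print):
    try:
        delete_route()                 # step 1: remove the proxy route
    except Exception as exc:           # e.g. a timeout talking to the proxy
        log(f"route deletion failed, continuing cleanup: {exc}")
    delete_server_entry()              # step 2: always drop the server entry
```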

An issue has been opened on the JupyterHub GitHub repository, containing a workaround that at least catches this kind of exception (but not for "reconnected" notebooks).

Another observed cause is a race condition: the HTCondor scheduler removes a notebook (e.g. because it exceeded its runtime) and the notebook slot is reassigned to another notebook too quickly. This is not yet fully understood, but it probably involves two routes to the same target being stored in the proxy, which prevents the removal of the older route. As a result, the original user's route remains active and connects that user to the new notebook without proper authorization.

In either case, the user ends up with an obsolete server entry in the database that cannot be removed normally: removal via the API (the button in the hub GUI) does not work, because the API handler first checks that the job is still running in HTCondor and does not proceed if it is not (probably assuming the server is already stopping or stopped).

A patch has been added (2026-02) to skip this initial job check and remove the server entry anyway, which should at least allow users to fix the problem on their own.
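The handler behaviour before and after the patch roughly amounts to the following (hypothetical names, not the actual patch code):

```python
# Sketch of the API delete handler: the old behaviour bailed out when the
# HTCondor job was no longer running, leaving the stale entry in place;
# the patched behaviour removes the entry regardless.
def delete_server(job_is_running, remove_server_entry, ignore_job_check=True):
    if not ignore_job_check and not job_is_running():
        return "skipped"        # old behaviour: entry stays stuck
    remove_server_entry()       # patched behaviour: remove it anyway
    return "removed"
```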

Before that, the situation could only be resolved by the admins:

  • remove the user account - it is restored with defaults on next user login
  • restart the hub

During a hub restart, all known servers are additionally checked for connectivity, which normally leads to bad servers/routes being removed.

Known problem: unresponsive CVMFS mounts

In December 2024 it was observed that (certain?) unresponsive CVMFS mounts prevent notebooks from reporting the local server port back to the hub (for reasons still unknown).

JupyterHub ultimately gives up after 120 seconds and cancels the job (reporting a timeout to the client), even though the notebook is ready and waiting to be used on a worker.

In addition, these nodes were not automatically removed from the server pool by the periodic health check, which was itself locked up by the hanging mount due to another bug (which should be fixed now).

This can cause users to repeatedly try to spawn a notebook, only to land on the same faulty node and experience a string of timeouts.

With the health-check bug fixed, the situation should have improved, but the exact cause of the unresponsive mounts is still open.

Known problem: Spawn failed: Server at http://<something>/api didn't respond in 60 seconds

Check .jupyterhub.condor.err in your $HOME for a message similar to:

[E 2025-04-14 11:54:31.812 SingleUserLabApp serverapp:2829] Failed to write server-info to /afs/desy.de/user/s/someone/.local/share/jupyter/runtime/jpserver-1898400.json: OSError(122, 'Disk quota exceeded')

Your disk quota is exceeded: either clean up $HOME or request more quota.

Another possible cause is a stale entry in your local Jupyter config that keeps the hub from connecting to your notebook. Look for

the entry open_browser = False

in the file .jupyter/jupyter_server_config.json

and remove it (JSON files cannot contain comments, so the entry has to be deleted rather than commented out).
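In the JSON config file this setting typically appears under the ServerApp section, as in the following fragment (the surrounding layout is an assumption based on the standard Jupyter Server config format); deleting the `"open_browser"` line has the intended effect:

```json
{
  "ServerApp": {
    "open_browser": false
  }
}
```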