condor_q errors and explanations

condor_q gives an error:

-- Failed to fetch ads from: <131.169.223.41:9618?addrs=131.169.223.41-9618+[2001-638-700-10df--1-29]-9618&alias=bird-htc-sched21.desy.de&noUDP&sock=schedd_1278605_4238> : bird-htc-sched21.desy.de SECMAN:2007:Failed to end classad message.

This error usually means that the scheduler (sched) in charge of managing your job queue is too busy with other work to answer the query.

The most common cause is that someone (it might as well be you yourself ;)) is submitting a lot of jobs with a typo in the executable path or, in the worst case, with a problem in the log file location: a typo in the logging path, a full filesystem, or a log file that has grown too large, e.g. in AFS.

In all these cases the scheduler tries to open the log file location and fails, which creates another log event, and so on. In the end the job fails and goes on hold, and the hold reason shows the failed path or logging problem. At this point no harm is done yet; the user should correct the problem and rerun the jobs. Unfortunately, things go south from time to time when users keep submitting thousands of the faulty jobs. This keeps the sched very busy, so busy in fact that not only can you no longer get through when requesting the state of your queue, but the monitoring also blacks out and no new jobs are negotiated or started. Running jobs are not affected, though.
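The path problems described above can often be caught before anything reaches the sched. The following is a minimal sketch, not an official tool: the function name `check_submit_paths` and the paths in the usage comment are hypothetical, and the check only covers typos and missing write permission. It does not detect a full filesystem or a file size limit such as the one in AFS.

```python
import os


def check_submit_paths(executable, log_path):
    """Return a list of problems; an empty list means both paths look usable.

    Hypothetical helper for illustration, to be run before condor_submit.
    """
    problems = []

    # A typo in the executable path shows up as a missing file.
    if not os.path.isfile(executable):
        problems.append(f"executable not found: {executable}")
    elif not os.access(executable, os.X_OK):
        problems.append(f"executable is not executable: {executable}")

    # The log file itself need not exist yet, but its directory must
    # exist and be writable.
    log_dir = os.path.dirname(os.path.abspath(log_path))
    if not os.path.isdir(log_dir):
        problems.append(f"log directory does not exist: {log_dir}")
    elif not os.access(log_dir, os.W_OK):
        problems.append(f"log directory is not writable: {log_dir}")

    return problems


# Usage (hypothetical paths):
#   for problem in check_submit_paths("./run_analysis.sh", "logs/job.log"):
#       print(problem)
```

Running such a check once per submit file is cheap compared to thousands of jobs going on hold for the same reason.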

Hence, in such a situation it is reasonable to assume that the jobs that were running before the sched stopped answering are doing fine, but that apart from this the sched is suffering a denial-of-service attack of sorts.

Another common problem is that people simply overload the sched with condor_q commands, either through submit frameworks that poll at high frequency or by running watch on the command line.
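If a framework really has to poll, it can at least do so slowly and back off when the sched stops answering. The sketch below illustrates the idea under stated assumptions: the names `query_queue` and `poll_slowly`, the 300 second default interval and the one hour cap are all hypothetical choices, and it assumes that condor_q exits with a non-zero status when it cannot reach the sched.

```python
import subprocess
import time


def query_queue():
    """Run condor_q once and return its text output.

    Raises CalledProcessError on a non-zero exit status (assumed here to
    be what happens when the sched does not answer).
    """
    result = subprocess.run(
        ["condor_q"], capture_output=True, text=True, check=True
    )
    return result.stdout


def poll_slowly(query, is_done, interval=300.0, max_polls=50, sleep=time.sleep):
    """Call query() at most max_polls times with a pause after each call.

    Returns the first output for which is_done(output) is true, or None
    if that never happens. After a failed query the pause doubles (capped
    at one hour) so that a busy sched is not hit even harder.
    """
    wait = interval
    for _ in range(max_polls):
        try:
            output = query()
        except (subprocess.CalledProcessError, OSError):
            wait = min(wait * 2, 3600.0)  # back off while the sched is busy
        else:
            if is_done(output):
                return output
            wait = interval  # sched answered, return to the normal pace
        sleep(wait)
    return None


# Usage (the completion test is a placeholder, adapt it to your jobs):
#   poll_slowly(query_queue, lambda text: "my_job_marker" not in text)
```

Where it fits the workflow, waiting on the job's own log file with condor_wait avoids querying the sched altogether.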