Long Term Archive Frequently Asked Questions

This page is for users who want to learn more about the LOFAR LTA (Long Term Archive). It answers the most common questions and provides help with difficulties in data retrieval. If the information on this page does not solve your problem, please submit a support request to the Radio Observatory helpdesk as detailed here (please read this before you do!).

Answers

General

I never worked with the LTA before. Where is a good starting point?

New users are advised to read through our tutorial page, the LTA How-To, which guides you through the whole process, from getting a user account and finding your data to the actual download.

Why is data retrieval so difficult?

It is important to understand that LOFAR data volumes are huge and that handling them requires different technologies from those we use in everyday life. For instance, LTA data is stored on magnetic tape and has to be copied to a hard drive ('staged') before it can be retrieved. Transferring these amounts of data within a reasonable time requires careful consideration and special tools. We try to make the LTA as convenient to use as possible, e.g. by providing HTTP downloads for users without a Grid certificate and portable Grid tools for those who want or need the extra performance. We are aware that data retrieval is quite close to the backend technology, and we hope to provide solutions with a higher level of abstraction in the future. However, it will always be necessary to prepare data for download, so you will always have to plan a bit ahead, sorry.

What is an appropriate amount of data to retrieve?

This depends. As a rule of thumb, we ask you to keep your requests below 5 TB in volume and smaller than 1'000 files. Also, the total file count in all your running requests should not exceed 5'000 files at any point in time.
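As a quick sanity check before submitting, the rule of thumb above can be written as a small shell function. This is only a sketch: check_request is a hypothetical helper name, not an LTA tool, and you have to supply the file count and the volume (in TB) of your planned request yourself.

```shell
# check_request <file_count> <volume_in_TB>
# Hypothetical helper: compares a planned request with the rule of thumb
# (below 5 TB and fewer than 1000 files). Returns non-zero if it is too large.
check_request() {
    LC_ALL=C awk -v n="$1" -v tb="$2" 'BEGIN {
        ok = 1
        if (n >= 1000) { print "too many files: " n " (rule of thumb: fewer than 1000)"; ok = 0 }
        if (tb >= 5)   { print "too much data: " tb " TB (rule of thumb: below 5 TB)"; ok = 0 }
        if (ok) print "request size looks reasonable"
        exit !ok
    }'
}
```

For example, 'check_request 800 3.5' passes, while 'check_request 1500 2' reports the file count and returns a non-zero exit code.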

In more detail, there are essentially two things to consider: The capabilities of your own system and the capabilities of the LTA services.

The most important thing to know about LTA capabilities is that the disk pool that temporarily holds your data, and from which it can be downloaded, has limited capacity. This means that the data you requested is only available for download for a limited time, since the space is needed for new requests at some point. Your data is only guaranteed to stay available for 7 days. It can be re-requested after that, but you should never request more data than you can download within a few days. In most cases, this is limited by the capabilities of your own system, especially your network connection (and your available local storage space, of course).
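To judge whether a request fits into the 7-day window, you can estimate the transfer time from the data volume and the download rate you actually achieve. The following is a minimal sketch; estimate_days is a hypothetical helper name, and it assumes decimal units (1 TB = 1,000,000 MB) and a constant rate, which real transfers rarely sustain.

```shell
# estimate_days <volume_in_TB> <rate_in_MB_per_s>
# Hypothetical helper: prints the estimated transfer time in days.
# Assumes decimal units (1 TB = 1,000,000 MB) and a constant download rate.
estimate_days() {
    LC_ALL=C awk -v tb="$1" -v rate="$2" 'BEGIN {
        seconds = tb * 1000000 / rate
        printf "%.1f\n", seconds / 86400   # 86400 seconds per day
    }'
}
```

For instance, 'estimate_days 5 12' (5 TB at roughly 12 MB/s) gives close to five days, which leaves little margin within the guaranteed 7 days.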

The second most important LTA limit is the number of files that can be processed at the same time. Some projects do not have much data volume, but the data is distributed over a very large number of files. With large file counts, the management of the request itself puts a lot of load on the system. There is a maximum queue size of 10'000 files for all user requests together. So make sure to occupy only a fraction of that, and wait until your earlier requests have finished (you will have been notified) before you submit new ones.

Note that the larger your request, the longer it takes until you can retrieve the first file. Also, please limit the number of requests running in parallel to a few, especially when they contain many files. In principle, we avoid introducing hard limits and rely on reasonable user behavior instead. This also means that you can block the system for a long time or, in the worst case, even bring it down. So please act responsibly, or we might have to enforce some limits in the future to keep the system available for other users. Be aware that we may cancel your request(s) in excessive cases to maintain LTA operation.

If you, by accident, staged some 100'000 files or 100 TB of data, please contact the Radio Observatory helpdesk, so that we can stop these requests, thanks!

What is all this SRM / 'staging' stuff about?

These are technical terms that refer to the storage backend of the LTA. Each of the three LTA sites (in Amsterdam, Juelich and Groningen) operates an SRM (Storage Resource Manager) system. Each SRM system consists of magnetic tape storage and hard disk storage. Both are addressed by a common file system, where each file has a specific locality: it can be on disk ('online'), on tape ('nearline'), or both. LTA data is usually on tape only. Since the tape is not directly accessible but sits on a library shelf, the data first has to be copied from tape to disk before it can be retrieved. This process is called 'staging'. You can only download the data while it is (also) on disk. (In physics terms, think of it as an excited state.) To save cost, the disk pool has limited capacity and is only meant for temporarily caching data that a user wants to access right now. After 7 days, all data is automatically 'released', which means that it may be deleted from the disk storage as soon as the space is required for other data. It then has to be staged again to become accessible again.

Usually, you don't have to worry about the details. But be aware that data retrieval is a two-step procedure: 1) preparation for download ('staging') and 2) the download itself. Also, take care not to request too much data at the same time.

Do I have to make new requests via the web catalog?

In principle, yes: this is the only supported procedure at the moment. There are development versions of programming interfaces that make it possible to query the catalog and talk to the staging service, e.g. from scripts. However, these are not generally available, are unsupported, and are still in development. If you are an 'expert user', are self-reliant enough to figure out how to work with this, and have a good reason, please contact the Radio Observatory helpdesk for some instructions and an emphatic admonition to take extra care.

There are different ways to download. Which one is the best?

That depends. In short: HTTP downloads (e.g. via wget) are the easiest; downloads via SRM tools can be faster and are encouraged for large amounts of data.

The SRM systems that the LTA sites operate are integrated in the Grid. To work with them directly, you need a Grid certificate. To allow users without a Grid certificate to download LTA data, we operate webservers as a frontend to the SRM backend. These webservers provide the requested data via HTTP downloads. The webservers are not particularly powerful machines and are meant for occasional users. If you retrieve huge amounts of data on a regular basis, please work with SRM directly, especially if you own a Grid certificate. We provide a portable Grid toolkit to make the setup as easy as possible.

You may want to read this FAQ Answer as well to make a decision: My downloads are too slow. What can I do?

For user instructions, refer to the LTA How-To page.

My downloads are too slow. What can I do?

First of all, check how slow your download really is. If wget shows an estimated remaining time (ETA) of several hours, this does not necessarily mean that the download is 'slow': some files in the LTA are simply huge. In most cases, your local network connection will be the bottleneck. For instance, a standard 'Fast Ethernet' (100 Mbit/s) network connection allows download speeds of around 12 MB/s at most. Our systems can handle that easily. If you can rule out your network connection as the bottleneck: there are different ways to download your data, and not all provide the same performance. In our experience, this is the order of performance, from slowest to fastest:

  • HTTP downloads are the slowest option. The speed is limited by the server's network connection (~120 MB/s), which is shared by all users, and by an upper limit per download (~30 MB/s) that exists for technical reasons. If your download maxes out at the per-download limit, you may try to start up to four downloads in parallel. Note: do not expect any performance benefit from more than four parallel transfers! In addition, there is a connection limit, which you may trigger if you start too many parallel downloads.
  • SRMCP is the faster option in most cases, since you work with the SRM backend directly. You may want to check out active gridftp transfer mode if you are located far away from the LTA sites.
  • SRMCP + copy-script seems to be the fastest solution available. It uses globus-url-copy, which is reported to have superior performance over the default srmcp transfer.
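For HTTP downloads, the advice to run at most four transfers in parallel can be sketched with xargs and wget. This is an illustration rather than an officially supported tool: download_parallel is a hypothetical helper name, and the input is assumed to be a plain text file with one download URL per line.

```shell
# download_parallel <url-list-file>
# Hypothetical helper: downloads the URLs listed in a text file (one per line),
# running at most four wget processes at the same time.
download_parallel() {
    # -n 1: pass one URL to each wget call
    # -P 4: run no more than four wget calls in parallel
    xargs -n 1 -P 4 wget < "$1"
}
```

wget is called without '-c' here, so a partial or broken file from an earlier run is not appended to; wget saves the new download under a different name instead.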

You may also want to read this FAQ Answer for further explanation: There are different ways to download. Which one is the best?

I want to contact Science Operations and Support. What information should I include?

You are welcome to contact Science Operations and Support in case of problems that you could not solve yourself. However, we kindly ask you to include all important information in your inquiry, so that we can quickly help you with your problem without too much back and forth:

  • It is absolutely essential that you include a clear answer to the following:
    1. What exactly did you try to do?
    2. What went wrong?
    3. When exactly did it fail (so that we can check the logs)?
  • If you are asking about a command that failed, please copy-paste the exact command that you executed together with the full terminal output. Some tools (like the srm commands) have a '-debug' option, which provides additional information, e.g. about your environment. It helps a lot if you use that option when you copy-paste your command output.
  • If you are using a script that somebody gave you, please note that we are not clairvoyants and have no idea what that script actually does. We can very likely not tell what went wrong from the output of some random script. Please check the LTA How-To to see whether the officially supported ways of data retrieval work for you. If they work, please ask the person who supplied the failing script why it fails. If the official ways don't work, please forget about your script for a moment and provide the output of the official tool that does not work for you.
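One way to collect the information listed above (the exact command, its full output, and the time of failure) is to wrap the failing command in a small logging function. The sketch below is only a suggestion; run_logged is a hypothetical helper name and not part of any LTA or SRM toolkit.

```shell
# run_logged <logfile> <command> [arguments...]
# Hypothetical helper: runs a command and records the UTC time, the command
# line, the combined stdout/stderr output and the exit code in a log file.
run_logged() {
    log=$1
    shift
    {
        printf 'Time (UTC): %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S')"
        printf 'Command: %s\n' "$*"
        "$@" 2>&1
        printf 'Exit code: %s\n' "$?"
    } | tee "$log"
}
```

You would call it as 'run_logged support.log <your-failing-command>' and attach the resulting file to your support request.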

Troubleshoot

I did not receive a mail notification that my request was scheduled!

If the LTA catalog did not show any error when you submitted your request, then it is safe to assume that your request was registered in our staging system. Usually, you should receive a notification mail about this within a few minutes. If you have not received the notification within an hour, our staging service may be down. Note that your request is not lost in this case and will be picked up once the service is back online. In urgent cases, or if you are not sure whether something went wrong while submitting your request, please contact the Radio Observatory helpdesk.

I did not receive a mail notification that my data is ready for retrieval! Has my request been lost?

Once you have received a notification that your request was scheduled, it is in our database and there is hardly any possibility that it got lost. Staging requests can take up to a day or two, but will finish a lot sooner in most cases. This depends on the size of your request, but also on how busy the storage systems are with other users' requests at the moment. Sometimes, the LTA storage systems are down for maintenance, which can delay the whole procedure. You can check for downtimes here.

It is not alarming when your request did not finish in 24 hours, even when your last request finished within 10 minutes. In urgent cases or if you did not receive a notification after 48 hours, please contact the Radio Observatory helpdesk.

I got an email that says my staging request has failed! What happened?

This means that the SRM server could not fulfill the request at all. This might mean that the system itself is fine, but none of the files from your request could be staged (e.g. because files are missing). Check the error message in your mail notification for details. The notification can also indicate that there is a general problem with the SRM system or with the staging service itself, i.e. something is broken or down for maintenance. We try to detect all temporary issues and only inform users if something is wrong with their request itself, but we cannot foresee all eventualities. If you cannot make sense of the error message, or don't know how to deal with it, please contact the Radio Observatory helpdesk.

If you used the xmlrpc interface to submit your request, please first check whether you made a mistake and e.g. entered the wrong SURLs.

Note: We get notified of these issues as well and will usually re-schedule requests that failed due to server issues once the problem has been solved. So please first check whether you got a 'Data ready for retrieval' notification for the same request id after the error notification. If you did, the problem has already been resolved.

I got an email that says my staging request was only partially successful! What's going on?

In general, this means that the SRM system works fine, but there was a problem processing your request. As a result, some of your files could be staged and some could not. Your mail notification should include a list of the files that could not be prepared for download, together with an error message that indicates the cause. If the error message says 'Incorrect URL: host does not match', this means that you combined files in one request that are stored at two different SRM locations (e.g. one file at surfSARA and one file at Target). When one SRM location gets the request, it can only stage its local files. To prevent this, you have to request the files from different locations in separate requests. Other messages should be self-explanatory, e.g. if a file is missing. If you cannot make sense of the error message, or don't know how to deal with it, please contact the Radio Observatory helpdesk.
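If you work with a list of SURLs (e.g. in a script), separating them by SRM location can be sketched as follows. This assumes that each line has the form srm://<host>[:<port>]/<path>; split_surls_by_host is a hypothetical helper name, and the host names in the usage example are placeholders, not real LTA servers.

```shell
# split_surls_by_host <surl-list-file> <output-directory>
# Hypothetical helper: writes one file per SRM host, <output-directory>/<host>.txt,
# so that each request only contains files from a single location.
split_surls_by_host() {
    list=$1
    outdir=$2
    mkdir -p "$outdir"
    # With "/" as field separator, field 3 of srm://host:port/path is "host:port".
    awk -F/ -v dir="$outdir" 'NF >= 3 && $3 != "" {
        host = $3
        sub(/:.*/, "", host)              # drop the port, if present
        print > (dir "/" host ".txt")
    }' "$list"
}
```

With a list that mixes srm://site-a.example.org:8443/... and srm://site-b.example.org:8443/... entries, 'split_surls_by_host surls.txt by_host' would create by_host/site-a.example.org.txt and by_host/site-b.example.org.txt.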

If you used the xmlrpc interface to submit your request, please first check whether you made a mistake and e.g. entered the wrong SURLs.

Note: We get notified of these issues as well and will usually re-schedule requests that failed due to server issues once the problem has been solved. So please first check whether you got a 'Data ready for retrieval' notification for the same request id after the error notification. If you did, the problem has already been resolved.

Oops! I made a mistake! How can I stop a request?

Unfortunately, this is currently not possible for you as a user. Stay calm and ask the Radio Observatory helpdesk to stop the request for you.

My files only contain some error message instead of data

Most errors should result in a 404/50x return code. However, some error messages are still returned as file content, so the downloaded file contains the message instead of your data. Please read the error message carefully. In many cases, it should give you some indication of what went wrong. If this does not help you, please contact the Radio Observatory helpdesk or retry after a few hours.

Important: If you use wget with option '-c', please note the following: wget does not check the contents of an existing file. When you restart wget with option '-c' (continue) to retrieve the failed files, it will append the later data chunk to the existing file, which contains the error message and not the first section of your data. To avoid corrupted data, make sure to delete the existing error files (they should be obvious from their small file size) before calling 'wget -ci' again. If you have already ended up with a corrupted file, you have to delete it and re-retrieve the whole file.
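Finding the small error files can be sketched with find. The helper below only lists candidates so that you can inspect them before deleting anything; list_suspect_files is a hypothetical name, and the default threshold of 10240 bytes is an assumption that you should adjust to your data.

```shell
# list_suspect_files <directory> [max_bytes]
# Hypothetical helper: lists regular files in <directory> that are smaller
# than max_bytes (default 10240, an assumed threshold) and may therefore
# contain an error message instead of data.
list_suspect_files() {
    dir=$1
    max_bytes=${2:-10240}
    # "-size -Nc" matches files smaller than N bytes
    find "$dir" -maxdepth 1 -type f -size -"${max_bytes}"c
}

# After checking the listed files (e.g. with 'less'), you could remove them with:
#   list_suspect_files . | while IFS= read -r f; do rm -- "$f"; done
```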

My data files are corrupted

Check whether the files are much smaller than you expect. Something might have gone wrong with the transfer. Please check the beginning of your files, e.g. with 'less'. If there is an error message, please refer to the answer to 'My files only contain some error message instead of data'. Otherwise, please try to re-retrieve an affected file. If this does not help, please contact the Radio Observatory helpdesk.

My downloads fail with error "All Ready slots are taken and Ready Thread Queue is full"

This usually means the SRM server system is overloaded and you should try again in a few hours.

My downloads don't start / time out

Maybe the SRM system is down for maintenance, please check http://web.grid.sara.nl/cgi-bin/lofar.py. If there is nothing going on, there is probably something wrong with the download service. Please try again a bit later and submit a support request to the Radio Observatory helpdesk, if the issue persists.

Http downloads randomly fail with "503 Service Temporarily Unavailable"

This can indicate that too many users are downloading at the same time. Please try again a bit later. There is also a limit on the number of simultaneous downloads that you can start yourself. Please limit yourself to four simultaneous downloads; the overall download rate will not improve with a larger number of connections.

When selecting a project it fails with "401 - No permission -- see authorization schemes"

This happens if your first attempt to select a project was made while you were not logged in. Please first select another tab, e.g. search, then try to select your project again.

SRM commands fail with error containing "Java heap space"

The SRM tools ignore the system's default Java heap space settings, and their own default is not very high. You are probably trying to process a long list of files. Either reduce the number of files in that request or increase the SRM-specific heap space by setting the environment variable 'SRM_JAVA_OPTIONS' to a higher value (e.g. '-Xms256m -Xmx256m'; default is '-Xms64m -Xmx64m').
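Both options can be sketched in a few lines of shell. The heap values are the example values from above; split_file_list is a hypothetical helper name, and the batch size you choose is up to you.

```shell
# Option 1: raise the heap space for the SRM tools in the current shell session
# (example values from this FAQ; the default is '-Xms64m -Xmx64m').
export SRM_JAVA_OPTIONS='-Xms256m -Xmx256m'

# Option 2: split a long file list into smaller batches.
# split_file_list <list-file> <lines-per-batch> <output-prefix>
# Hypothetical helper: creates <output-prefix>aa, <output-prefix>ab, ...
split_file_list() {
    split -l "$2" "$1" "$3"
}
```

For example, 'split_file_list files.txt 100 batch_' turns a list of 250 lines into three files, which you can then process one after another.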

SRM commands fail with error '426 Connection refused'

Your firewall is probably not allowing active ftp transfers. Make sure that you call srmcp with option '-server_mode=passive'.
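If you do not want to type the option every time, a small wrapper function can add it for you. This is just a convenience sketch; lta_srmcp is a hypothetical name, and all other arguments are passed through to srmcp unchanged.

```shell
# lta_srmcp <srmcp-arguments...>
# Hypothetical wrapper: calls srmcp in passive server mode and forwards
# all remaining arguments (source, destination, further options) as given.
lta_srmcp() {
    srmcp -server_mode=passive "$@"
}
```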

SRM commands fail with error 'srm client error: org.globus.gsi.CredentialException: proxy not found'

Ensure that you have run 'voms-proxy-init' to generate an up-to-date proxy file. If the error persists, the SRM tools are apparently not using the default proxy file location $HOME/.proxy, or you specified a non-standard proxy location in 'voms-proxy-init'. In that case:

  • Either set the X509_USER_PROXY environment variable to your .proxy file, e.g.
export X509_USER_PROXY=$HOME/.proxy
  • or pass -x509_user_proxy=<path-to-.proxy-file>, e.g.
srmcp -x509_user_proxy=$HOME/.proxy <rest-of-command>

SRM/Grid commands fail with error 'AC validation failed!' or 'no trusted path can be constructed'

This indicates an issue with creating a secure connection to the server. There is either an issue with your personal certificate/proxy/key or with the set of trusted server certificates.

  • Have you registered with the LOFAR VO? You can do that at https://voms.grid.sara.nl:8443/voms/lofar. This requires that your Grid certificate is installed in your browser (http://ca.dutchgrid.nl/info/browser).
  • Make sure your set of server certificates is up to date (see trusted CA certificates). If you use the portable Grid toolkit, you can use one of the included update scripts to update the certificates.
  • There also is a known issue with OpenJDK 7, which seems not to be capable of dealing with the certificates. Make sure to run Java provided by Oracle.
  • Maybe your private key uses an unsupported algorithm. You might want to try converting it with a command like this: 'openssl rsa -des3 -in .globus/userkey.pem -out .globus/userkey.pem' (back up your key file first, since this command overwrites it in place).

SRM/Grid commands fail and I cannot figure out why!

Retry with option '-debug', which will print a lot of debug information to stdout. If this does not help you figure out what is going wrong, submit a support request to the Radio Observatory helpdesk. (Please read this before you do!)
