Articles and Publications

A list of research articles and other publications about checkpointing and related topics, such as process migration. Please inform us of any new articles you find.
Thank you!

Scientific Publications on Checkpointing
Cooperative checkpointing: a robust approach to large-scale systems reliability, Adam J. Oliner, Larry Rudolph, Ramendra K. Sahoo, Proceedings of the 20th annual international conference on Supercomputing, 2006.
Cooperative Checkpointing Theory, Adam J. Oliner, Larry Rudolph, Ramendra K. Sahoo, IPDPS 2006.
Evaluating Cooperative Checkpointing for Supercomputing Systems, Adam J. Oliner, Ramendra K. Sahoo, IPDPS 2006.
On-the-Fly Kernel Updates for High-Performance Computing Clusters. Kristis Makris, Kyung Dong Ryu, The 2nd Workshop on System Management Tools for Large-Scale Parallel Systems. April 2006.
Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems, Adam J. Oliner, Ramendra K. Sahoo, José E. Moreira, Manish Gupta, IPDPS 2005.
Portable Library of Migratable Sockets. M. Bubak; D. Zbik; G.D. van Albada; K.A. Iskra and P.M.A. Sloot, Scientific Programming, vol. 9, nr 4/2001 pp. 211-222. IOS Press, ISSN 1058-9244, Amsterdam, The Netherlands, 2002.
The implementation of Dynamite - an environment for migrating PVM tasks. K.A. Iskra; F. van der Linden; Z.W. Hendrikse; B.J. Overeinder; G.D. van Albada and P.M.A. Sloot. Operating Systems Review, vol. 34, nr 3 pp. 40-55. Association for Computing Machinery, Special Interest Group on Operating Systems, July 2000.
An Analysis of Communication-Induced Checkpointing. L. Alvisi, E. Elnozahy, S. Rao, S. Husain, A. de Mel. Fault-tolerant Computing Symposium, 1999.
Process Hijacking. V. Zandy, B. Miller, M. Livny. IEEE International Symposium on High Performance Distributed Computing (HPDC), 1999.
A User-level Checkpointing Library for POSIX Threads Programs. W. Dieter, J. Lumpp, Symposium on Fault-Tolerant Computing Systems, 1999.
A Checkpointing Strategy for Scalable Recovery on Distributed Parallel Systems, V. Naik, S. Midkiff, J. Moreira, Supercomputing 1997.
Deriving Optimal Checkpoint Protocols for Distributed Shared Memory Architectures. L. Alvisi, K. Marzullo. Workshop in Theory and Practice in Distributed Systems, 1995.
Compiler-Assisted Memory Exclusion for Fast Checkpointing, J. Plank, M. Beck, G. Kingsley, IEEE Technical Committee on Operating Systems and Application Environments, 1995.
Ickp: A Consistent Checkpointer for Multicomputers, J. Plank, K. Li, IEEE Parallel and Distributed Technologies, 1994.

Other Publications on Checkpointing
EPCKPT report, E. Pinheiro, COPPE/UFRJ internal report. 1998.
CRAK report, H. Zhong, Columbia University report, 2000.

Scientific Publications on Fault-tolerance and Reliability
Exploring failure transparency and the limits of generic recovery, D. Lowell, S. Chandra, P. Chen, Proceedings of the Fourth Symposium on Operating Systems Design and Implementation (OSDI 2000), 2000.