Data storage and preservation (S. Ponce - CERN)

Risks for my data - Hardware

For disks
- probability of losing a disk per year: a few %, up to 10%
  - with 60K disks, that's around 10 disks per day, and all files on a lost disk are lost
- one unrecoverable bit error per 10^14 bits read/written
  - for 10 GB files, that's one file corrupted per 1000 files written

For tapes
- probability of losing a tape per year: 10^-4, and you recover most of the data on it
  - net result is a file loss rate of 10^-7 per year
- one unrecoverable bit error per 10^19 bits read/written
  - for 10 GB files, that's one file corrupted per 100M files written
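The orders of magnitude quoted for disks and tapes can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: the 5% annual disk loss is an assumed value inside the quoted "few % to 10%" range, file sizes use decimal gigabytes, and the function names are illustrative.

```python
# Back-of-the-envelope check of the quoted failure figures.
# Assumption: 5% annual disk loss, a value inside the quoted "few % .. 10%" range.
DISKS = 60_000
DISK_LOSS_PER_YEAR = 0.05
FILE_SIZE_BITS = 10 * 10**9 * 8  # one 10 GB file, decimal gigabytes


def disks_lost_per_day(n_disks, annual_loss_probability):
    """Expected number of disks lost per day in a pool of n_disks."""
    return n_disks * annual_loss_probability / 365


def files_per_bit_error(bits_per_error, file_size_bits=FILE_SIZE_BITS):
    """Number of files written, on average, per unrecoverable bit error."""
    return bits_per_error / file_size_bits


print("disks lost per day:", disks_lost_per_day(DISKS, DISK_LOSS_PER_YEAR))
print("disk: files per corrupted file:", files_per_bit_error(10**14))
print("tape: files per corrupted file:", files_per_bit_error(10**19))
```

With these inputs the disk loss comes out between 8 and 9 per day, and the bit error rates give 1250 files (disk) and 125 million files (tape) per corrupted file, consistent with the rounded "10 per day", "1000 files" and "100M files". The two tape figures taken together (10^-4 tape loss, 10^-7 file loss) imply that roughly one file in a thousand on a lost tape is not recovered.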
Parallelizing files' storage

1 Storage devices
2 Parallelizing files' storage
  - Striping
  - Introduction to Map/Reduce
3 Risks of data loss and corruption
4 Data consistency
5 Data safety
6 Conclusion
Why parallelize storage?

To work around limitations:
- individual device speed (think disk)
  - a file is typically stored on a single device
- network cards' speed
  - 1 Gbit networks are still present
  - network congestion on a node reduces the bandwidth per stream
- core network throughput
  - switches and routers are expensive
  - machines may have less throughput than their card(s) allow
- hot data congestion
  - and the black hole it can generate, as slower transfers allow more transfers to accumulate
Parallelizing through striping

Main idea
- use several devices in parallel for a single stream
- move the limitations up by summing performances

Basic striping: divide and conquer for storage
- split data into chunks (aka stripes) on different devices
- access in parallel

Figure: chunks of files A, B and C striped across five disks
  Disk 1: File A.1, File B.2, File C.4
  Disk 2: File A.2, File B.3, File C.5
  Disk 3: File A.3, File C.1, File C.6
  Disk 4: File A.4, File C.2
  Disk 5: File B.1, File C.3
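The placement in the striping figure can be reproduced by handing out chunks to disks in round-robin order, continuing from one file to the next. This is a minimal sketch under that assumption; the function name `stripe_round_robin` and the chunk counts per file are read off the figure, not taken from any particular storage system.

```python
from collections import defaultdict


def stripe_round_robin(files, n_disks):
    """Place chunks on disks in round-robin order, continuing across files.

    files: list of (name, n_chunks) pairs.
    Returns {disk_number: [chunk labels]}, with disks numbered from 1.
    """
    layout = defaultdict(list)
    slot = 0  # global chunk counter, shared by all files
    for name, n_chunks in files:
        for i in range(1, n_chunks + 1):
            disk = slot % n_disks + 1
            layout[disk].append(f"{name}.{i}")
            slot += 1
    return dict(layout)


# Files A, B and C with 4, 3 and 6 chunks, striped over 5 disks.
layout = stripe_round_robin([("A", 4), ("B", 3), ("C", 6)], n_disks=5)
for disk in sorted(layout):
    print(f"Disk {disk}:", ", ".join(layout[disk]))
```

Each file ends up spread over several disks, so a single stream reading file C can pull from all five disks at once instead of being limited by one device.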
RAID 0

- RAID stands for "Redundant Array of Inexpensive Disks"
- a set of configurations that employ the techniques of striping, mirroring, or parity to create large reliable data stores from multiple general-purpose computer hard disk drives (Wikipedia)

Useful RAID levels
- RAID 0: striping
- RAID 1: mirroring
- RAID 5: parity
- RAID 6: double parity

See the Data Safety part.

Can be implemented in hardware or software.
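As a quick illustration of the parity technique named above, here is a minimal sketch of single parity computed with a byte-wise XOR, which is the usual way to realise it. The helper `xor_blocks` and the sample stripes are hypothetical, and real RAID 5 implementations also rotate the parity block across the disks, which is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe per data disk (illustrative contents, all the same length).
data = [b"stripe-1", b"stripe-2", b"stripe-3"]
parity = xor_blocks(data)  # stored on one additional disk

# Simulate losing the second disk: rebuild its stripe from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

The cost is one extra disk for the whole set, compared with doubling the number of disks for mirroring.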