LSF07 Workshop Notes


  1. Monday Morning: Joint Session
    1. 09:00 Introduction - Ric Wheeler
    2. 09:15 EXT4 Updates - Mingming Cao
    3. 10:00 FS Repair & Scalability - Val
    4. Zach Brown - blktrace of fsck
    5. Zach: Vectorized file reads conversation
    6. 11:00 Libata Update - Tejun Heo & Jeff Garzik
    7. 11:45 RAID Updates - Mark Lord, Dan Williams & Yanling Qi
  2. Monday Afternoon: I/O Track
    1. 13:30 FC Storage Updates from Vancouver - James Smart
    2. 14:15 SATA/SAS Convergence - James Bottomley, Brian King, Doug Gilbert, Darrick Wong and Tejun Heo (scribbles below by djwong)
    3. 15:30 Request Based Multipathing - Hannes Reinecke
    4. 16:15 Reinit of Device After kexec/kdump - Fernando
    5. 17:30 I/O Support for Virtualization
    6. 18:15 Open Discussion/Wrap-Up
  3. Monday Afternoon: File System Track
    1. 13:30 FS Scalability & Storage Needs at the Bleeding Edge
    2. 14:15 EXT4 Online defragmentation - Takashi Sato (poor scribbles below by djwong)
    3. 15:30 Security Attributes - Michael Halcrow
    4. 15:30 Why Linux Sucks for Stacking - Josef Sipek
    5. 16:15 B-trees for shadowed FS - Ohad Rodeh
    6. 17:30 Explode - Fault Injection for Storage - Can Sar
    7. 18:15 Open Discussion/Wrap Up?
  4. Tuesday Morning: I/O Track
    1. 09:00 Unifying the Block Layer APIs - Tomonori Fujita
    2. 09:45 RDMA Applications to Storage - Roland
    3. 11:00 Block Guard - Martin K Petersen (scribbles below by djwong)
    4. 11:45 Changes to Storage Standards - Doug Gilbert (scribbles below by djwong)
  5. Tuesday Morning: FS Track
    1. 09:00 NFS Topics: Chuck Lever, James Fields, Sai Susarla & Trond
    2. 09:45 GFS Updates: Steven Whitehouse
    3. 09:45 OCFS2 Updates: Mark Fasheh
    4. 11:00 Enhancing the Linux Memory Arch for Heterogeneous Devices - Alexandros Batsakis
    5. 11:45 DualFS & Integration with High End Arrays - Juan Piernas Canovas
  6. Tuesday Afternoon: Joint Session
    1. 13:30 pNFS Object Storage Driver - Benny Halevy
    2. 14:15 OSD - APIs & Justification of Object Based Disks
    3. 14:15 SNIA - Erik Riedel
    4. 15:30 Hybrid Disks - Timothy Bisson
    5. 15:30 SSD - Dongjun Shin
    6. 16:15 Scaling Linux Storage to Petabytes - Sage Weil
  7. Topics for next meeting

 

Monday Morning: Joint Session

09:00 Introduction - Ric Wheeler

 

Introduction to the workshop

  • Looking for a workshop feel instead of presentations.
  • 50 people at the conference.
  • Ric picked the slots so he could bounce around between I/O and FS

 

Basic user requirements of I/O

  • Data set is complete (no lost files or objects)
  • Bytes placed in files are correct and in order
  • Utilize storage as completely as possible

 

Open Questions

  • How do you validate no lost files or objects?
    • How do you optimize?
  • Verifying data integrity
    • How do you know the bytes are right
    • Played with reverse mappings in ReiserFS4
  • Utilize disk
    • High count of small objects kills utilization

 

How IO can help meet these user requirements

  • Communicate non-retryable requests back to user space
  • Technology by Seagate to validate in the disk
  • Survival modes for drives
    • Val: Never spin down; backups of dying drives are possible
    • Ric: IO Coalescing is great - ???
    • T'so: 8 retries on each 4KB block when a 64KB region goes bad is terrible; need to communicate this
    • Retry handling needs to be more robust

 

Performance testing

  • Testing high file counts
  • Simulating customer file systems is hard
    • Need nasty filesystems that have been in service for a long time
    • Anonymizing data difficult

 

09:15 EXT4 Updates - Mingming Cao

 

Motivation for EXT4

  • Primary purpose: to support filesystems greater than 16TB we need to move beyond 32-bit block numbers
  • Add sub-second (nanosecond) timestamp resolution
  • Fix 32768 limit on subdirectories
    • T'so: Stupid limit in ext3 - easy to fix
  • Fix performance limitations

 

Why fork?

  • On disk format changes required - Linus said no changes in production FS
  • Observation: 2.5-style development for filesystems allows lots of experimentation

 

News

  • indirect block map moved to extents
  • 48 bit block numbers
  • JBD2 - ???

 

Removing indirect block map (ext2/3)

  • inefficient for large files
  • extra read for every 1024 blocks
  • disk extents format used for new inodes w/ -o extents
  • 12-byte ext4_extent structure (rough sketch below)

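For reference, a rough sketch in C of the 12-byte on-disk extent record mentioned above. Field names follow the ext4 extent design as it was being presented; treat this as illustrative rather than the authoritative kernel definition.

    #include <stdint.h>

    /* Rough sketch of the 12-byte on-disk extent record described above.
     * Illustrative only, not the authoritative kernel definition. */
    struct ext4_extent_sketch {
        uint32_t ee_block;    /* first logical block the extent covers */
        uint16_t ee_len;      /* number of blocks covered by the extent */
        uint16_t ee_start_hi; /* high 16 bits of the 48-bit physical block number */
        uint32_t ee_start_lo; /* low 32 bits of the 48-bit physical block number */
    };                        /* 4 + 2 + 2 + 4 = 12 bytes */

The split of the physical block number into ee_start_hi/ee_start_lo is where the 48-bit limit discussed below comes from.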
 

Extent tree required for more than 3 extents in i_data

  • Root and leaf nodes
  • inode flag to mark extents vs ext3 indirect
  • convert to a b-tree of extents for >3 extents
  • last found extent cached

 

 

Block number size

  • 64 bit block number considered initially
  • 48-bit is large enough: 2^48 blocks x 4KB per block = 2^60 bytes (1 EB)

 

Meta-data changes

  • 64-bit metadata changes
  • ??? missed all points

 

64 bit JBD2

  • Forked jbd to support 64 bit block numbers

 

New defaults for ext4

  • some ext3 features are enabled by default on ext4

 

Plan

  • WIP proposed on ext3 mailing list
    • efficient multiple block allocation
    • Persistent file preallocation - don't need to write zeros to guarantee space
    • nanosecond timestamps
  • Other
    • greater than 32k subdirectories
      • Ping T'so on the issue of 32k subdirs
    • discussion on scaling fsck - specify initialized groups
      • check on the ext3 mailing list for this
    • Larger file of 16TB - limited by

 

Extended attributes

  • some folks want more than 4KB of attributes - request coming from the Samba group

 

T'so: Vista is using more extended attributes - storing ACLs in them

T'so: Storing ext. attributes in the inode makes ACL access faster

Mingming: Doubling the size of the inode to store ext. attributes in the inode. The inode in ext3 has a pointer available to point to a data block that stores up to 4k of additional attributes.

 

  • Fancy security modules need to store ext. attributes

 

Dave C.: XFS can be tuned to 2k inodes.

Val: What is the performance of 2k inodes?

Russell: it sucks; you pay when you stat

T'so: Offline tasks: 1) Collate data on the use of extended attributes - SELinux small, ACLs big 2) Find out what Samba is using, etc.

Dave C.: Some filesystems have ext. attribute limits, e.g. 64KB on XFS

Halcrow: eCryptFS could have an arbitrary length

 

  • Applications need to pay attention to the ext. attribute size limits (see the sketch below)

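As a concrete illustration of those limits, a minimal user-space sketch that probes how large an attribute a filesystem will accept, using the standard setxattr(2)/getxattr(2) calls. The attribute name, the test file and the 8KB test size are arbitrary choices for this example.

    /* Minimal sketch: probe the extended attribute size limit of a filesystem. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/xattr.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile";
        size_t len = 8192;              /* larger than ext3's ~4KB in-block limit */
        char *value = malloc(len);

        if (!value)
            return 1;
        memset(value, 'x', len);

        if (setxattr(path, "user.example", value, len, 0) != 0) {
            /* E2BIG or ENOSPC are the usual "value too large" answers */
            printf("setxattr of %zu bytes failed: %s\n", len, strerror(errno));
        } else {
            /* getxattr with a zero-size buffer reports the stored value's size */
            ssize_t stored = getxattr(path, "user.example", NULL, 0);
            printf("stored %zd bytes\n", stored);
        }
        free(value);
        return 0;
    }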
 

  • Other directions - Scale
  • 64 bit inode number?
    • Userspace may assume 32 bit inode from stat()

 

Questions

 

Dave C.: Cache most recently read extent? Curious what input went into that decision

Mingming: Very simple caching scheme

Unknown: Compare performance of JFS extents to ext4 extents? Extents are less efficient on fragmented filesystems.

T'so: How fragmented are you going to get? Extents are relatively compact

Halroy: Need to see if it sucks badly vs indirect nodes

Chuck Lever: Growing and shrinking online?

T'so: Online defrag is the first step towards online shrinking

Lever: Thought about how we are going to manage large amounts of storage

T'so: ZFS taught us that we need to look at it from the sys admin's point of view. Desperately need admin tools

Hellwig: It is a user space problem; we need hooks here and there for user space to manage storage better

Henson: We disagree, need to get to the mindset that "I am not going to make the user partition the disk"

Caching dir contents in memory

Read: http://lwn.net/Articles/194868/

 

 

10:00 FS Repair & Scalability - Val

 

Waiting for Val's laptop to boot

  • Rhinestones for laptops - Val
  • fsck data from runs on her laptop
  • details in the paper
  • Seek latency is the dominating factor

 

Presentation of results and conversation

 

T'so: Doing very aggressive caching in e2fsck

David C.: Memory usage is a problem; looked at this in 1998 for XFS

Val: Has a story about fsck running out of address space - Ping for story

Val: ReiserFS doesn't think about the fsck problem

David C.: What type of disk? There are differences between OS disks and data disks

A: 10% better fsck on OS disk

Crowd: The forced fsck every 30 boots is bad for desktops - instructions to fix

Val: metadata bitmap is a new favorite idea - Ping

Ming: How do we test these performance hints?

 

 

Zach Brown - blktrace of fsck

 

Introduction

  • Background: did OCFS2 repair tool
  • Unpack kernel src tree (test workload)
  • Point: fsck averages 12MB/s while streaming reads get 26MB/s

 

T'so: Basically any speedups we want for fsck need to be disk layout changes. With extents we don't have to iterate over these horrible indirect blocks. Tradeoff: stream vs fsck times.

Zach: want a system call to push down disjoint reads. Need a vectored block read.

T'so: if I had working read ahead in block devices I could do better for fsck

Paper on faster BSD fsck from 1983

Dave: Threaded XFS repair. Does internal cache, does direct I/O.

Ric: Does the fsck know how many spindles it has?

Val: The parallelization limit is the number of disk arms

Dave: You do get to the point of diminishing returns

Val: I/O bound?

Dave: Yes

T'so: Want to have a working read ahead call.

Q: Why not RAID so there are no errors

Val: The idea of not having any errors doesn't work because of A) filesystem bugs B) sys admin errors.

Russell: Absolutely ....

Issue: trying to parallelize I/O

Val: Need to have I/O people tell us about the number of spindles

Ric: We lie to you on purpose

Ric: what we want is to know the places where we can do parallelization

Ric: goes back to the idea of reverse mappings

T'so: want to lay out your FS so that you are driving your disks equally.

Dave: XFS will tell you

T'so: If you have the bitmaps on disk0 that one gets hammered.

Zach: (pointing to the slide) this is just a piece of the terrible story that is Linux FS repair.

 

Zach: Vectorized file reads conversation

Need to create a system call for describing all of the blocks that you want to read and where in memory you want to put the individual blocks.

Application: e2fsck, where you know the block locations of the dentries but you have to read the dentry blocks one at a time. During Zach's blktrace the plateau of throughput is due to this problem.

Application: Oracle's perspective: a DB is just a big on-disk file. It would be helpful to have this system call available also.
 
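The call being asked for did not exist at the time; the following is a hypothetical illustration of the idea, and of what callers like e2fsck have to do today. The struct and function names are invented for this sketch.

    /* Hypothetical "vectored block read": the caller describes many disjoint
     * (offset, buffer) pairs up front.  Today this can only be emulated with
     * one pread(2) per element, which is exactly the one-block-at-a-time
     * pattern the blktrace showed; the proposed system call would hand the
     * whole vector to the kernel in one go so it could be sorted and merged. */
    #include <sys/types.h>
    #include <unistd.h>

    struct block_read {          /* invented request element */
        off_t  offset;           /* byte offset on the device or file */
        void  *buf;              /* where to place the data */
        size_t len;              /* how much to read */
    };

    /* Fallback: issue the requests one at a time. */
    static int read_vector_emulated(int fd, struct block_read *v, int n)
    {
        for (int i = 0; i < n; i++) {
            ssize_t got = pread(fd, v[i].buf, v[i].len, v[i].offset);
            if (got != (ssize_t)v[i].len)
                return -1;       /* short read or error */
        }
        return 0;
    }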

11:00 Libata Update - Tejun Heo & Jeff Garzik

 

Support

  • ATAPI-
  • C/H/S support - two people use such ancient drives
  • NCQ - queuing for ATA, 32-tag command queuing
  • FUA - force to the media immediately, used for barrier implementations
  • IDE is famous for its caching.
  • SCSI SAT - reuse error handling; most installers know how to handle SCSI, no need to go to distros for installer support

  • CompactFlash, adding support for devices that perform well in the ATA space

 

Hardware support

  • Most of the drivers are PCI
  • AHCI - has a DMA ring, on old devices used to write to I/O ports and hope
  • SATA II 3.0 Gbps, NCQ, ..., link speed to
  • eSATA - transparent to kernel support
    • SATA cables are lame, not well shielded and bending causes errors

 

Software Features

  • Conversion to new EH (error handling) almost complete - done by Tejun
  • Hotplug, "warm plug"
  • Improved diagnostics
  • Suspend/resume - not a lot of driver coverage, future work here?
  • HDIO_xxx compatibility. Mark Lord: not quite 100%

 

Accomplishments

  • Fedora 7 test 1 disabled IDE driver, using libata for PATA and SATA
  • 60+ host controllers
  • Engineering support from
    • controller vendors
    • hard drive vendors
    • integrators
    • large users
  • Limitation we would like to get rid of: partition labels ??
  • Q: Partition issue for PATA vs SATA?
  • Jeff: one solution overflow into 32 bits
  • Hellwig: Problem is in block layer, wants to have all FTs in contiguous space.
  • ATA community is coming together, happy pills for all

 

Future

  • Driver API
    • sane initialization model, like net driver or SCSI model,
      • allocate, register, unallocate, free
    • Greg blessed kobjs all over the place.
  • Refine error handling; there are a ton of errors that the drive, bus, and system can throw.
    • Should we retry for 5 minutes or pass off to the filesystem in 5 seconds?
  • Sysfs support, coming
  • Port multipliers
    • big on SCSI SAT; it is like an Ethernet hub for SATA.
    • SCSI SAS is more interesting: it has expanders and network-like routing tables.
  • Powersave, Pavel Machek
  • Host protected area - ATA has a window, and then a special area on disk. ???

 

Block layer future

  • NV cache
    • pinned means that those sectors are connected to some blocks.
    • unpinned: caching spins down the hard drive when idle.
  • Synchronization between request queues
    • Need to deal with Simplex: multiple ATA ports can only do one command at a time
    • host queue, SATA, etc. "Request queue group"
  • Move ATA block devices from SCSI to block layer
    • Get rid of overhead of emulating SCSI
  • Make SCSI block devs, transports more generic.
  • Barriers suck currently.
  • I/O rate guarantees
  • Better error information back to filesystem

 

Jeff: I think that I/O and filesystems need to share more information, not just throwing EIO

Dave: Knowing what type of error, like hard media errors vs soft errors. Hard errors are lost blocks. Soft errors, like a path error, require some time to recover.

T'so: Pull a fibre channel disk and plug it into a hub. It may be down for 30 seconds, and in the meantime ext3 has remounted itself read-only. But you can't block forever.

Dave: XFS on fibre channel had a configurable timeout. Doesn't throw EIO until the timeout is reached.

Martin K.: Tried to decouple errors from transport.

Dave: The danger: if we have to shut down a filesystem, then we have to fix it.

Hellwig: a lot of the soft shutdown things are racy.

Lord: what additional information should we pass?

Dave: what type of error, persistent or temporary, media or otherwise, device unplugged.

Dave: We don't get an unplug at all.

Martin K.: Volunteered to write up a list of the types of errors that we need.

T'so: Have a certain support in the VFS.

Dave: right now we have a 1-to-1 mapping between a block device and a filesystem. We need infrastructure to support different mappings.

Hellwig: we need to separate out the unplug/plug events.

T'so: Do we support having a USB key unplugged and replugged with a new major and minor dev?

Jeff: Two types of unplug events: hardware locking vs just unplugging hardware. What information is relevant to pass up to the filesystem?

Chris Mason: Question of who owns the spindles? FS or IO

Dave: we need to get a range of addresses of where the failure is happening.

Dave: You have a hardware RAID 5 that has a disk failure. Tell the filesystem that one of the spindles is down.

Val: the idea of having a pipe.

Hellwig: If we have this idea of having an I/O path or pipe we can also include performance information and errors.

Chris Mason: It is funny; we aren't really good at handling the information we have now.

 
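Martin volunteered to write up the real list; purely as an illustration of the kind of classification the discussion above asks for (hard media error vs. recoverable transport error vs. device gone, plus the affected range), a sketch might look like the following. All names here are invented; no such kernel API existed at the time.

    /* Illustrative only: richer error information passed from the block layer
     * up to the filesystem instead of a bare -EIO.  Names are invented. */
    enum io_error_class {
        IO_ERR_MEDIA_HARD,      /* sectors are gone; data in that range is lost */
        IO_ERR_TRANSPORT_SOFT,  /* path/link error; may recover after a delay */
        IO_ERR_DEVICE_GONE,     /* device unplugged or fenced */
    };

    struct io_error_report {
        enum io_error_class cls;
        unsigned long long  start_sector;   /* range the failure applies to */
        unsigned long long  nr_sectors;
        int                 retry_after_ms; /* hint: 0 means do not retry */
    };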

Barriers

  • They suck.
    • Right now flush cache is our barrier, very painful for performance
    • Wants communication from hardware
      • when data hits write-back cache
      • when it hits disk
      • We need SCSI link commands.

 

Ric: Linux community not good at driving the standards community.

Val: Anyone willing to throw themselves on that grenade.

Jeff: Currently we are flushing to the disk.

Mason: We should think about cache management as a whole.

Ric: The way forward is to model this with an emulation of a disk, and then propose getting this into hardware.

Jeff: FUA bit

Chuck Lever: support for iSCSI in libata.

Chuck Lever: support for SATA over Ethernet.

Jeff: it is lame.

 

11:45 RAID Updates - Mark Lord, Dan Williams & Yanling Qi

 

Lord: Much of what I wanted to talk about is mostly error handling.

Dan: Particularly interested in MD RAID. Offloading the XOR and ... into an acceleration engine. Now he has a generic memory offload system. Requires changes to MD to tell the system to use these engines asynchronously.

async_acopy async_xor

Dave: An offload API would be helpful for many applications. SGI hardware has support for hardware zeroing. Background page zeroing would be a use case.

Jeff: Promise XF4 - a lot of this conversation missed

Hellwig: What we want is a common API to talk to all RAIDs. We want a raid class.

Hellwig: We separate out the RAID operations from the device. raid-ops?

Dan: iop13xx is the device with the XOR/Copy engine

Ric: You are offloading because the NAS device is a really low-powered device and you want to offload this expensive work.

Ric: we could do checksumming in hardware for speeding up FSCK?

 

Yanling: Auto RAID - LSI

 

Hellwig: we have support for filesystem freezing already.

Need Ric to explain RAID sparse allocation

 

Monday Afternoon: I/O Track

13:30 FC Storage Updates from Vancouver - James Smart

14:15 SATA/SAS Convergence - James Bottomley, Brian King, Doug Gilbert, Darrick Wong and Tejun Heo (scribbles below by djwong)

 

  • libata will become a block layer client _only_ for ATA disks; the SCSI interface will remain for ATAPI devices
  • Need to implement a stackable EH for SCSI/SAS/ATA to route exceptions to the appropriate parties
  • dougg: The SATL spec lists some translations of SAS <-> ATA error codes
  • jgarzik (?): Marvell SAS driver coming = wider use of libsas in kernel
  • Refactor the libata EH to be able to deal with individual scsi_cmnds coming from libsas (instead of being one big function like it is now)
  • libata wants to use the new libata EH handling scheme that sas_ata doesn't use right now
  • Would it be useful to translate the ata$HOST:$DEV in printks into $host:$channel:$target:$lun format?
  • Other chatter about using the IDENTIFY command and the D2H FIS so that libsas can acknowledge the existence of an ATA device even if it's currently reserved by something else

 

15:30 Request Based Multipathing - Hannes Reinecke

16:15 Reinit of Device After kexec/kdump - Fernando

17:30 I/O Support for Virtualization

18:15 Open Discussion/Wrap-Up

 

Monday Afternoon: File System Track

13:30 FS Scalability & Storage Needs at the Bleeding Edge

14:15 EXT4 Online defragmentation - Takashi Sato (poor scribbles below by djwong)

 

  • Three types of defrag: a single file, a whole directory, or free space
  • Roughly 15-30% speed increase by defragging files
  • Free space defrag useful for shrinking filesystems
  • Strategy: Make a new inode, copy data into sequential runs of blocks, then reassign the file (rough sketch after this list)
  • Q: What if someone modifies the file during defrag? Kill the temporary inode
  • Other questions: OSX hot file clustering/bootcache

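A very rough user-space sketch of the copy step in the strategy above, assuming that a donor file preallocated with posix_fallocate(3) comes out mostly contiguous. The final step, swapping the donor's blocks into the original inode, is the kernel-side part of the ext4 work and is only indicated by a comment here.

    /* Rough sketch of the defrag copy step: preallocate a donor file, copy the
     * data into it, then let the kernel swap the donor's (hopefully contiguous)
     * blocks into the original inode.  That last step is the ext4-specific
     * kernel work and is not shown. */
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int copy_into_donor(const char *orig_path, const char *donor_path)
    {
        int in = open(orig_path, O_RDONLY);
        int out = open(donor_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        struct stat st;
        char buf[64 * 1024];
        ssize_t got;

        if (in < 0 || out < 0 || fstat(in, &st) < 0)
            return -1;
        if (posix_fallocate(out, 0, st.st_size) != 0)   /* ask for the space up front */
            return -1;
        while ((got = read(in, buf, sizeof(buf))) > 0)
            if (write(out, buf, got) != got)
                return -1;
        /* ...kernel step: atomically move the donor's blocks into the original
         * inode, discarding the old fragmented blocks. */
        close(in);
        close(out);
        return got < 0 ? -1 : 0;
    }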
 

15:30 Security Attributes - Michael Halcrow

 

  • Original design was that every file would have ext. attribute with encryption information.
  • Stacked filesystem sits on top of lower FS
  • Duplicates every inode, dentry, etc of lower FS
  • XML language to define the users/objects and encrypt policy.
  • Open Question: Is it a good idea to overload SELinux for eCryptFS?
  • Most common question on the SELinux mailing list is how to disable it

 

Val: It doesn't make sense to do the mapping on execution. Instead, I want to do it per directory, per mount, time of day, but not on execution.

Hellwig: we really want the user to get a namespace for encryption on login. Most common use case.

What would you want in order to ensure certain files are encrypted?

 

Trying to find use cases:

  • Lost USB keys
  • Lost laptop problem

 

T'so: Maybe the threat model actually pushes this up to the application level.

Halcrow: Pushing the problem to the application layer is a key management nightmare.

T'so: The question is where the right layer is. Maybe this is a question of having a key management library for Linux.

Halcrow: This is a safety-of-secondary-storage issue

Halcrow: We have a problem of losing laptops and losing financial data. I want to be able to push a policy to a laptop for running a trusted configuration.

Argument ensues.

Val: Maybe we should change the use case to protecting against a stolen environment. The per mount point use case is the simplest.

Halcrow: Per user per mount point name spaces.

 

 

15:30 Why Linux Sucks for Stacking - Josef Sipek

 

Page cache coherency

  • mucking with the lower filesystem can cause things to go bad
  • the upper filesystem doesn't know when the lower one was changed

 

Hellwig: The user interface is totally broken. Layered filesystems should be mounted internally only.

Val: delete test?

Hellwig: Much better user interface to not expose the lower filesystem by default.

Halcrow: there are patches pushed by Morton that do that.

Sipek: Stacking can get arbitrarily complex

Hellwig: No they can't, we have this thing called kernel stacks

Sipek: Beware of cycles, must be a DAG. Walk up, sync down.

 

 

Code sharing for stacking file systems

  • fsstack (fs/stack.c)
  • Simple inode attribute copying functions
  • What should be added?

 

Hellwig: most everything in eCryptFS should be pushed into stack.c

 

Other issues

  • Lockdep doesn't like us right now. dget() happening recursively.

 

Hellwig: Different key for separate super blocks. ???

Halcrow: No propagation of locks to the lower filesystem

Hellwig: What do you lock for range locks on the underlying filesystem.

Mason: I would only do locks at the top layer.

 

on disk format

  • At OLS T'so suggested having an on disk format
  • storing whiteouts and persistent inode data
  • Currently white outs are stored as .wh.
  • Have a prototype based on ext2

 

16:15 B-trees for shadowed FS - Ohad Rodeh

 

Motivation

  • useful for ZFS and WAFL
  • used in research prototype of an object-disk

 

Current methods

  • Filesystem is a tree of fixed sized pages
  • In case of crash: revert to previous stable checkpoint and replay the log

 

Shadowing

  • Two roots, some pages are shared.
  • Snapshots are easy with shadowing, create new root

 

B-Trees

  • B-trees are used by many filesystems to represent file and directories
    • XFS, JFS, ReiserFS, SAN.FS
    • Guarantee logarithmic-time key-search, insert, remove

 

Challenges

  • Challenges to multi-threading
    • changes propagate up to the root
    • the root is a contention point
  • In a regular b-tree, leaves can be linked to their neighbors
    • if you are doing shadowing, you end up shadowing all of its neighbors
    • this means you copy the entire tree

 

Write in place b-tree

  • just modify the tree in place

 

Alternate shadowing approach

  • Pages all have a virtual address that never changes.
  • There is a table of virtual-to-physical mappings
    • In order to modify page P at L1 you copy, update, swap
  • Pros
    • Avoids the ripple effect of shadowing
    • Used b-link trees, very good concurrency
  • Cons
    • Requires an additional persistent data structure
    • Performance of accessing map is critical

 

Requirements from shadowed b-tree

  • need good concurrency
  • work well w/ shadow
  • deadlock avoidance
  • guaranteed space/memory
  • Solution: tree has to be balanced

 

Intuition: Shadow from the top down

  • lock-coupling for concurrency: grab parent, then child, then release the parent
  • Proactive splits, split nodes that are full while recursing down

 

remove-key

  • lock coupling
  • proactive merge/shuffle
  • shadow on the way down
    • Pros/cons of the scheme
    • Effectively lose two keys per node due to proactive split/merge
    • Need loose bounds on number of entries per node

 

Cloning

  • p is a b-tree, q a clone
    • p and q should share as many pages as possible
    • creating q from p should have little overhead
    • should support many clones (clone p many times)
    • clones should be first class, should be possible to clone q as well as p

 

Naive clone

  • make a complete copy

 

WAFL free-space

  • with a map of 32 bits per data block we get 32 clones
  • to support 256 clones, 32 bytes are needed per data block
  • to clone a volume we need to make pass through entire free-space

 

Challenges

  • How do you support a million clones w/o huge free-space map

 

Main idea

  • modify free space so it will keep a ref count per block
  • ref count counts how many times a page is pointed to
  • zero means free

 

Cloning a tree

  • Copy root p into a new root
  • increment free-space counter for first child
  • Before modifying page N, it is marked dirty
    1. informs run-time system that N is about to be modified
    2. gives a chance to shadow if necessary
  • If ref-count == 1, page can be modified in place

 

Deleting a tree

  • ref-count > 1: decrement the ref-count and stop downward traversal
    • Node is shared with other trees
  • ref-count(N) == 1: delete the node

 
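A compact sketch of the clone and delete rules described in the two lists above, assuming an in-memory node with a child array and a reference count kept per page by the free-space map. This illustrates the algorithm as presented, not the prototype's actual code.

    /* Sketch of ref-count based cloning and deletion of a shadowed tree. */
    #include <stdlib.h>

    #define MAX_KIDS 16

    struct page_node {
        int               refcount;     /* kept by the free-space map; 0 = free */
        int               nr_kids;
        struct page_node *kids[MAX_KIDS];
    };

    /* Clone a tree: copy only the root and take a reference on each child. */
    static struct page_node *clone_tree(struct page_node *root)
    {
        struct page_node *q = malloc(sizeof(*q));
        if (!q)
            return NULL;
        *q = *root;                     /* new copy of the root page only */
        q->refcount = 1;
        for (int i = 0; i < q->nr_kids; i++)
            q->kids[i]->refcount++;     /* children are shared, not copied */
        return q;
    }

    /* Delete a tree: stop descending as soon as a node is still shared. */
    static void delete_tree(struct page_node *n)
    {
        if (--n->refcount > 0)
            return;                     /* still referenced by another tree */
        for (int i = 0; i < n->nr_kids; i++)
            delete_tree(n->kids[i]);
        free(n);                        /* refcount hit zero: page is free */
    }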

T'so: this is also known as... ???

Zach: It looks hard to do repair.

Hellwig: Nick Piggins implemented lock coupling for page cache

 

Useful Papers Ref 23 and 25:

http://www.cs.huji.ac.il/%7Eorodeh/papers/ibm-techreport/H-0245.pdf

 

17:30 Explode - Fault Injection for Storage - Can Sar

 

Introduction

  • idea: model checking
  • fast, easy to use
    • Runs on Linux and FreeBSD
    • 200 lines of C++ code
  • general, real: checks live systems
  • effective:
    • checked 10 Linux filesystems, 3 version control systems
  • Bugs in all, 36 in total, mostly data loss

 

core idea: explore all choices

  • bugs are triggered by corner cases, so drive execution down tricky corner-case paths
  • choose(N): conceptual N-way fork, returns K in the Kth child (toy sketch after this list)
    • instrumented 7 functions w/ choose
  • found JFS bug with simple test

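A toy illustration of the choose(N) idea referenced in the list above: the test calls choose(n) at each decision point, and a driver re-runs the test for every combination of choices, which is the "conceptual N-way fork". Purely illustrative, not EXPLODE's actual implementation.

    #include <stdio.h>

    #define MAX_CHOICES 16

    static int trace[MAX_CHOICES];   /* choice taken at each decision point */
    static int arity[MAX_CHOICES];   /* how many alternatives existed there */
    static int depth;                /* decision points seen in this run */
    static int replay;               /* prefix of choices to replay */

    static int choose(int n)
    {
        arity[depth] = n;
        if (depth >= replay)
            trace[depth] = 0;        /* new decision point: take branch 0 first */
        return trace[depth++];
    }

    static void test(void)
    {
        int a = choose(2);           /* e.g. "does the write fail?" */
        int b = choose(3);           /* e.g. "which crash point?" */
        printf("explored a=%d b=%d\n", a, b);  /* a real checker would run fsck, etc. */
    }

    int main(void)
    {
        for (;;) {
            depth = 0;
            test();                              /* one run = one combination */
            int i = depth - 1;                   /* backtrack: advance the deepest choice */
            while (i >= 0 && trace[i] + 1 >= arity[i])
                i--;
            if (i < 0)
                break;                           /* every combination explored */
            trace[i]++;
            replay = i + 1;                      /* replay the fixed prefix next run */
        }
        return 0;
    }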
 

Discussion: May be ported to python, c

 

ext2 fsync bug

 

load A into memory, truncate, create B which reuses a page from A. fsync causes B to point to the same indirect block as A.

Dave: Problem is that we have small tests for individual problems, and this seems to be a very general solution.

Val: Package it up and make it easy to use.

T'so: It would be interesting to look into how hardware-failure-type errors affect filesystems.

T'so: Didn't know whether having a bit-flipping error case was interesting

Dave: Filesystems are getting large enough that bit errors are getting important.

Dave: Barrier support would be interesting.

 

http://marc.theaimsgroup.com/?l=linux-fsdevel&m=117148291716485&w=2

 

18:15 Open Discussion/Wrap Up?

 

Dave: Academics in Australia feel that a publication is incomplete if they don't open source the code. Otherwise you can't reproduce the results.

Val: Make the change at Freenix in the fall.

Can: It is often a bit of work to open source and support the code.

Val: If you release useful code then people will contribute

Sunnyvale/UnionFS guy: SNIA BOF on Tracing and replay

 

Research topics

  • Filesystem scrubbers
  • Create a wiki page with ResearchTopics on linuxfs.pbwiki.com

 

Tuesday Morning: I/O Track

09:00 Unifying the Block Layer APIs - Tomonori Fujita

09:45 RDMA Applications to Storage - Roland

11:00 Block Guard - Martin K Petersen (scribbles below by djwong)

 

  • 520-byte sectors with a twist--the extra 8 bytes are for protection data ("DIF"); rough struct sketch after this list
    • 2 bytes for a CRC
    • 4 bytes to list the LBA sector number (apparently writes to the wrong sector are common)
    • 2 bytes for an application-specific tag
  • How do we use the 2 byte app-specific tag? Maybe we need input from, say, filesystem people?
  • Various levels of protection: none; guard+crc; guard+crc+lba; guard
  • Different HBAs allow or disallow user access to the DIF areas
  • DIF requires new commands--can we emulate this with older 520B sector disks? Probably not too hard to modify
  • Block layer changes -- per bio callback to ask for protection data and/or calculate the tags. There's no need to tie ourselves to SCSI
  • The standard CRC algorithm is pretty slow; could we use SSE4 instructions?
  • What if we want more protection data? 4K protection page per 256K of I/O, maybe?

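A rough C sketch of the 8-byte protection tuple described in the first bullet above. The fields are big-endian on the wire; plain integer types are used here purely for illustration, and the exact field ordering should be checked against the T10 DIF documents.

    #include <stdint.h>

    /* Rough sketch of the 8 bytes of protection data ("DIF") that accompany
     * each 512-byte sector in a 520-byte formatted block. */
    struct dif_tuple_sketch {
        uint16_t guard_tag; /* CRC over the 512 data bytes */
        uint16_t app_tag;   /* application-specific tag (use still being discussed) */
        uint32_t ref_tag;   /* expected LBA, catches writes to the wrong sector */
    };                      /* 2 + 2 + 4 = 8 bytes of protection data */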
 

11:45 Changes to Storage Standards - Doug Gilbert (scribbles below by djwong)

 

  • List of standards committees:
    • T10: SCSI/SAS/SPI @ t10.org
    • T11: FC @ t11.org
    • T13: PATA/ATA @ t13.org; SATA @ sata-io.org
    • SFF: plugs, small form factor devices, etc
    • IETF: iSCSI, iSER; also IEEE, InfiniBand, SNIA (RAID)
  • Don't look at just the published standards; have a look at the last few drafts that have revision history and links to the justifications for the changes made
  • SAM-4: > 16,000 LUNs, change "initiator port" to "I_T nexus"
  • SSC3: Security chatter (dougg didn't elaborate much more than that)
  • OSD2: Obsolete OSD1 calls; align commands to certain byte boundaries
  • SPC4:
    • "initiator port" -> "I_T nexus" change
    • Access to log subpages: statistics and performance data
    • Report identifying information (labels printed outside, I think?)
    • Encapsulated SCSI opcodes
    • Obsolete items: linked commands and basic task management
  • SBC-3:
    • Background media verification
    • 4K sectors
    • log pages
    • Grouping of commands to simplify (perhaps allow aggregation of?) logging and data collection purposes
    • ORWRITE
    • Write uncorrectable sector
  • SAS-2:
    • 6Gbps links that are 2 multiplexed 3Gbps connections
    • Zones in the SAS domain
    • Self-configuring expanders
    • Multiple affiliations for SATA devices
  • SATA 2.6:
    • Slimline and micro connectors; mini SATA
    • NCQ command priority
    • Enhanced BIST activate FIS/signature FIS
  • SAT-2:
    • ATAPI translations
    • NCQ control functions
    • Persistent reservations
    • ATA security
    • LUN to port multiplier mappings
    • 4K blocks and ORWRITE
    • NV cache translation
    • End to end data protection

 

Tuesday Morning: FS Track

09:00 NFS Topics: Chuck Lever, James Fields, Sai Susarla & Trond

 

Server Issues

 

Bruce: Race: When there are two writes within the same second, the second client doesn't see the second write when doing its initial open.

 

  • Nanosecond time granularity does not resolve race
    • Actual timesource is just jiffies

 

NFSv4 change attribute requirements

  • Must change whenever ctime changes
  • Must be consistent across server reboots
  • Not consistent across files

 

NFSv4 change attribute non-requirements

  • Not required to be consistent across files (unlike ctime)
  • No units (doesn't have to measure number or size of writes)

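The discussion below converges on a counter combined with something that changes on every server reboot. A minimal sketch of such a change attribute; the 32/32 bit split and the names are arbitrary choices for this illustration.

    #include <stdint.h>

    /* Minimal sketch: change attribute = per-boot generation + in-memory counter. */
    struct change_attr_state {
        uint32_t boot_generation;  /* bumped once per server reboot, stored on disk */
        uint32_t counter;          /* bumped in memory on every modification */
    };

    static uint64_t change_attr_next(struct change_attr_state *s)
    {
        s->counter++;              /* every change yields a new, larger value */
        return ((uint64_t)s->boot_generation << 32) | s->counter;
    }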
 

Bruce: Easy to come up with solutions that don't quite work. Need to work across server reboots.

Val: On Solaris you could cheat and make sure that the nano time field at least changes

Crowd: Counters are the solution.

Dave: the problem with that is that you have to write to disk every time you modify.

Val: Should we update the number on reboot?

Dave: The fact that it has to be on disk makes it a bit more expensive.

Erez: It seems like the problem should be solved with a callback

Dave: Imagine every client does that

Mason: Use something that changes every time the server reboots, and keep the counter in memory.

Val: This is an expensive way to do it for legacy filesystems.

Bruce: What you don't want is the client to see data that is half-way written without also seeing the change attribute change.

Ohad: Isn't this supposed to be solved with delegations?

Trond: Delegation isn't a cache validation but a cache acceleration tool.

Mason: If the FS crashes you want to invalidate all of the caches.

Trond: Consider the case where you have hundreds of NFS roots; it is the pathological case.

Bruce: Samba and Apache have the same problem.

 

NFSv4 ACLs

Trond: The basic message we have about this is we may want v4 ACLs on Linux. It looks like they are becoming the new standard.

Dave: there is a patch for ext3 for NFSv4 ACLs. The question is how we match POSIX and NFSv4 ACLs.

Bruce: difficulty is the synchronization of the mode

Val: Use the POSIX standard recommendations

Bruce: The nice thing with POSIX ...

Val: We have the implementation of POSIX ACLs but no one uses it.

BSD is considering the use of NFSv4 ACLs for all filesystems. They don't have the legacy problem we have.

 

Exporting cluster-coherent locks

  • already a file->lock method
  • need to modify NLM and NFS to call it
  • allow async return of results
  • Bruce proposes a new async API for locking

 

Client scalability

  • readdir scalability issues
  • get rid of redundant storage
  • dentry+cookie lookup table

 

VFS: intents

  • get rid of redundant lookups in operations like rename and link

 

Trond: Main issue is that you have to pass them along

 

pNFS

Raul: pNFS is a method by which v4 clients can directly access data by talking directly to the storage device. You ask for the layout and get it. A number of companies are looking at pNFS.

 

  • The sort of things we will wind up seeing
    • the storage backend could go through the block subsystem
      • potentially hit the elevators
    • could end up getting bursty traffic patterns

 

09:45 GFS Updates: Steven Whitehouse

 

GFS Introduction

  • 64 bit symmetric cluster fs
  • Took over development in Oct 2005
  • About 1.2Mb of code
  • Various changes and cleanups as a result of upstream
  • Accepted into 2.6.19
  • bug fixes
  • Current code 716k

 

important changes

  • GFS used to have a journaling system that appended a header to the files
  • different file journal layout, same sort of system as ext3
    • allows mmap, splice, etc to journaled files
  • A metadata filesystem for access to special files
    • journals, rindex, fuzzy statfs, quotas
  • locking is at the page cache level (GFS was at syscall level)
    • faster, supporting new syscalls, eg splice
  • readpages() support, some writepages() support
    • to be expanded in the future
  • supports the ext3 standard ioctl lsattr, chattr

 

problem we have been looking at is directory structure

  • GFS2 structure based on "Extendible Hashing", by Fagin, Sept 1979
  • Small directories are packed into the directory inode
  • Hashed dirs based on extendible hashing, but
    • Uses a CRC32 hash in the dcache to reduce the number of times we hash a filename (small sketch after this list)

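A small user-space illustration of the cached-hash idea in the last bullet, using zlib's crc32(). Whether GFS2 uses this exact CRC polynomial is not recorded in these notes, so treat the choice of hash as an assumption; the point is only that the hash is computed once and reused.

    /* Cache a filename hash alongside the dentry so the directory code does
     * not have to rehash the name on every lookup.  Link with -lz. */
    #include <string.h>
    #include <zlib.h>

    struct cached_dentry {
        const char    *name;
        unsigned long  name_hash;   /* computed once, reused for every lookup */
    };

    static void cache_name_hash(struct cached_dentry *d, const char *name)
    {
        d->name = name;
        d->name_hash = crc32(0L, (const unsigned char *)name, strlen(name));
    }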
 

Problems with readdir

  • The way we do the splitting of the dentries we could get a reorder on a new directory insert
  • Have to sort each and every hash chain even if only one entry is read

 

Val: One of the most annoying things about filesystems is getting readdir to work properly. Shrinking and growing directories can break it.

Val: You can do extendible hashing and have stable points.

Ohad: the cursor is only 32 bits, so if you have more than 4G of entries it is broken; it will truncate.

Val: I think we are kind of saying that readdir is broken.

 

readdir is also used in other places aside from getdents64

  • NFS readdirplus
  • NFS getname doesn't need a defined order of entries, so we could potentially avoid the sorting operation

 

T'so: I can think of two things to do:

1) Substantially increase the size of the cookie returned by telldir

2) Get into the right committees to deprecate telldir/seekdir in POSIX

I would not be opposed to eliminating the call altogether

 

09:45 OCFS2 Updates: Mark Fasheh

 

Introduction

  • Shared disk cluster filesystem
  • Took a number of ideas from other filesystems

 

  • Try to get node ops local to one node. Allocation hashing.
  • Development focus on adding features
  • Generic OCFS2 b-tree code, will make ext. attr support with b-trees easy
  • Mounting OCFS2 as local makes OCFS2 act local

 

  • Filesystem is designed for large allocations

 

Fasheh: perform_write is great

http://www.ussg.iu.edu/hypermail/linux/kernel/0612.2/0218.html

Dave: batches prepare_write and ...

Fasheh: writes an entire range, no page locks

Dave: No documentation on perform_write,

Fasheh: ocfs2.git has an example implementation

Fasheh: One thing I am concerned with for perform_write is whether other paths in the kernel can use it.

Dave: What does invalidate page mean? *laughs*

Ric: Forced unmount is a very useful thing.

Dave: XFS allows you to shut it down and then unmount.

Val: There should be a generic support in vnode

Fasheh: We need to see if we can avoid fencing in some cases.

Ric: Fencing is useful in some cases

Ric: Fencing is telling the storage device to just ignore foo host.

Fasheh: OCFS2 needs better fencing. I guess I was trying to see if there is any queued IO for a device.

Hellwig: We should have fencing in the VFS. I liked the way that GFS did it before it went closed.

Ric: Most popular user request?

Fasheh: Forced unmount

Hellwig: funmount should be a vfs feature

Ric: Favorite feature?

Fasheh: Feature they like is easy setup.

Russell: Performance hot buttons?

Fasheh: Our inodes don't fit in a block.

 

  • Each node has its own journal

 

Future work

  • Mixing extent data and ext. attributes
    • can make for very complex code
  • I would like to move to GFS's DLM

 

GFS vs OCFS2

  • we have the edge in stability
  • we handily beat GFS in performance

 

http://en.wikipedia.org/wiki/Distributed_lock_manager

 

Business case

  • The original business case was hosting an Oracle home on it.

 

An interesting conversation would be which filesystems are better for what, particularly focusing on the clustering filesystems.

 

ANNOUNCE: Val, Nick Piggins is interested in a VM/FS workshop.

 

11:00 Enhancing the Linux Memory Arch for Heterogeneous Devices - Alexandros Batsakis

 

  • IPoIB is 3 times slower than RDMA
  • writes that occur at the same time as pdflush syncs take up to 1.5 seconds

 

Val: How much time is elapsed in both

 

  • Traded one big write congestion for many small ones by tuning the pdflush ratio

 

  • Clients are heterogeneous
  • Client-server network is heterogeneous
  • But... flushing policy is system-wide and static

 

Dave: you can set the pdflush ratio per cpuset thanks to a patch by Hellwig

Alexander: wants to have it per device, not per cpuset

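The global knob being discussed is vm.dirty_ratio (alongside vm.dirty_background_ratio); a per-device equivalent did not exist. A tiny sketch of reading and lowering the global value through /proc (needs root); the value 10 is an arbitrary example.

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/dirty_ratio", "r+");
        int ratio;

        if (!f || fscanf(f, "%d", &ratio) != 1)
            return 1;
        printf("current vm.dirty_ratio: %d%%\n", ratio);

        rewind(f);
        fprintf(f, "10\n");    /* example: start flushing earlier, in smaller bursts */
        fclose(f);
        return 0;
    }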
 

  • ratio makes RDMA faster on 1GB RAM systems than on 2GB RAM systems.

 

Lever: We need to have some sort of self tuning system for this issue.

 

  • Can writeback be storage-aware?
    • Not only an RDMA issue - 10GbE etc

 

Dave: People tend to expect the clients to do the caching.

 

  • Need communication from the server about how often it can write, the load, etc.

 

Erez: Where is the pathological case that is causing this.

Dave: Why is pdflush blocking the other writes from going on?

Trond: Hopefully Peter Zijlstra's patch will fix the situation? Should be setting the non-blocking flag for the flush.

PATCH nfs: fix congestion control

Perhaps the pathological case is that we are contending on the waitqueue implemented in the NFS client.

 

11:45 DualFS & Integration with High End Arrays - Juan Piernas Canovas

 

Introduction

  • Better performance in almost all cases than traditional journaling filesystems
  • must design the filesystem to better take advantage of storage tech
  • meta-data management is a key design issue
  • traditionally meta-data is written synchronously, with fsck for repair
  • Current: log the last meta-data updates, asynchronous meta-data writes
  • DualFS uses a log
  • Separation papers
    • Muller and Pasquale
    • Ruemmler and Wilkes

 

Design

  • Separates data and metadata
  • Proves separation can improve performance

 

Motivation

  • Presented a distribution of data/metadata traffic for different workloads

 

Conclusions

    • Meta data represents high percentage of total I/O time
    • Writes predominant
    • Requests are almost never sequential

 

Diagram: Two separate filesystems for data and metadata, on separate drives, partitions or zones. The problem with this layout is that reading a normal file is inefficient; a solution to the problem comes later.

 

Data Device

  • Like Ext2 w/o meta-data blocks
  • Groups
    • Grouping is per dir
    • Related blocks kept together
    • File layout for optimizing seq. access.

 

Dir. affinity

  • Select the parent's dir if the best one is not good enough
    • (does not have at least %foo free blocks)
  • Data blocks

 

Val: We are trying to figure out if the allocator is the reason you are seeing the difference in performance. What percentage of the improvements is due to those changes?

 

Meta-data Device

  • Meta-data: i-nodes, indirect blocks, directory data blocks, symbolic links, bitmaps, superblock copies
  • Organized as a log-structured file system, like BSD-LFS
  • Meta-data elements have same format as ext2/ext3
  • big change is how the meta data is written to disk.

 

Diagram: Layout of meta-data, divided into number of segments

Erez: Found a case where log-structured filesystems make sense. The cost of the cleanup won't kill the log filesystem because meta-data is much smaller.

T'so: the one exception to small meta-data is one large directory, like squid. How many entries in the directory?

Sorin: We tested with 50,000 1MB files

T'so: Ok, so small

Juan: The win is that you can write the metadata sequentially on disk.

T'so: The question is that traditionally log filesystems have traded read performance for write performance. I am trying to find bad cases, like git on a cold cache.

Ric: Can we dynamically grow and shrink the meta-data and data partitions? You have to decide the use of the FS ahead of time.

Ric: Worst case, find -exec md5 {}, perhaps? Did the tar speed go fast?

Dave: If you get over a 100MB extent it doesn't matter anyway because you have reached the I/O limit of the disk.

Dave: How do you fix the deadlock condition where you have no free segments left to run the cleaner?

T'so: Basically you need to freeze the filesystem and let the cleaner do its work, reserving enough memory for the cleaner to do its work.

Val: I would like to talk about the fact that you have a meta-data block that is static.

 

http://sourceforge.net/project/showfiles.php?group_id=187143

 

VERY quick rundown of the performance

 

Tuesday Afternoon: Joint Session

13:30 pNFS Object Storage Driver - Benny Halevy

 

What is the problem we are trying to solve

  • Part of the IETF NFSv4.1 draft
  • Scaling out problem
    • Want many clients to talk directly to storage devices in parallel w/o server
  • Skipping the server has been availble in a number of clustered file systems
  • Why implement this again.
    • Proprietary protocols bad
    • Interoperabiliity is good for everybody

 

http://playground.sun.com/pub/nfsv4/webpage/

 

pNFS comes in three different flavours:

  • Files
  • Blocks
  • Objects

 

http://www.nfsv4-editor.org/draft-08/draft-ietf-nfsv4-minorversion1-08.txt

 

Explanation of layout4 data structure and pnfs_osd_layout4

 

Object based storage overview

  • Basically decoupling the storage and the namespace
  • OSDs are cool

 

Diagram: the client asks the security manager for authorization. The SM hands the client a capability and the client uses it to sign requests to the object store. The object store and the SM have a shared secret.

 

  • OSD commands are a bit chubby: 200 bytes long.

 

What we need as far as Kernel support

  • Linux wants bi-dir SCSI commands
  • Emulex wants to replace its own proprietary protocol
    • should see them in kernel in another month.
    • patches for block, scsi, iscsi
    • tested on iscsi -> IET and IBM OSD initiator -> IBM OSD target sim
  • need support for large variable length cdbs

 

idea of the patches

- idea: add an API to access the current I/O-related information as uni-directional w/ little existing code change. Then add bi-directional read and write buffers.

 

Todo

  • bidi residual bytes
  • OSD initiator library

 

Design

  • (p)NFS client
  • pnfs-obj layout driver
  • OBJ RAID
  • Flow control (global and per-device)
  • OSD initiator

 

Jeff: Doesn't this explode the complexity of NFS a lot? Why not stay with files?

Benny: The applications see a POSIX fs.

Jeff: The wonderful thing about NFS was that it was entirely interoperable and now we have to support a ton of devices.

Benny: Benefits are scalability because of local block access and the security model.

Jeff: This is not optional; we have to support all of this crap. Now NFS has to support SCSI and RAID.

Ric: What I am hearing is that you want performance proof?

T'so: There is a real question of who is going to be using this. This may end up as a high-end plaything.

Erik: Our intent is to have objects be the interface to storage devices.

T'so: Linux works really well when we have a lot of developers with commodity hardware in their hands.

Erik: pNFS is taking on a lot bigger problems than OSD. The three layout types of pNFS are there for legacy usage.

T'so: I will eagerly await the day when I can go to Fry's and pick up an OSD.

Hellwig: There is actually very little code to support object based storage.

Dave: What is the complexity of the pNFS server? We need the security manager, object handling, etc. There are some interesting complexity questions there.

 

14:15 OSD - APIs & Justification of Object Based Disks

 

OSDs are cool

  • Block based storage doesn't make it possible to do the offload of processing.
  • OSD is about pushing a bit of the filesystem down to the hardware.

 

Hellwig: This is a lot of marketing bullshit.

Trond: You should be careful of talking about pNFS security and OSD security in the same breath.

Jeff: What I really want to see is an OSD fs.

T'so: I agree that this linear 512 or 4,096 byte sector addressing is so 1970s, but I need the cheaper hardware.

 

Upper level driver discussion

 

Jeff: An upper level driver isn't needed. OSDFS would talk SCSI directly.

Erik: Are we missing out on error handling by not having an upper level driver?

Hellwig: You basically want a library for handling the OSD.

T'so: He is objecting to putting it in the SCSI layer.

Hellwig: We should use sgio and sg devices for passing down the SCSI commands to the hardware. We have a SCSI pass-through that can be used for management in userspace.

 

A number of interesting OSD applications for research:

  • sf.net/projects/intel-iscsi
    • User-level OSD daemon
    • kernel modules for
    • iSCSI initiator code

 

T'so: Interesting research question about GIT in an object store

Ric: companies do redundancy-based hardware

 

14:15 SNIA - Erik Riedel

 

Introduction

  • SCSI has served us well for 25 years
  • Moving from SBC scsi block command to OSD
  • We will make these devices real

 

OSD Commands, OSD-1 r10 as ratified

  • Important: Read, write, create, remove, get attr, set attr
  • Imp. security: Auth, integrity, set key, set master key

 

Motivation

  • Basically we feel we need to continue increasing the capacity of drives.
  • Objects have attributes attached to them, like extended attributes on disk.

 

Security

  • separation of the security manager and the object drive
    • leave a lot of the complexity of security mechanisms off the object drive
  • Keys are per partition.

 

Dave: How does the OSD deal with fragmentation?

Dave: If there is something wrong with the fragmentation then we have to rely on the disk to defragment. How do you avoid fragmentation?

T'so: If you tell a hard-core filesystem person that you are going to take care of all of the inode allocation problems then they will ask how it is implemented.

Dave: What you are telling me is that you are holding us hostage to implement what we need.

Lord: What happens when you create one large object and manage it like a regular disk?

Erik: That is fine.

 

Future

  • D to D migration of data
  • Snapshots

 

www.snia.org/tech_activities/workgroups/osd/

 

15:30 Hybrid Disks - Timothy Bisson

 

NVCache split into two sections

  • pinned and unpinned.
  • Host controls the pinned set
    • Pin one or more LBAs to flash
  • Device control unpinned set as a cache

 

New mode - NV Cache power mode

  • Redirect I/O to NVCache, manufacturer spin-down algorithm
  • Pinned set management independent of this mode

 

Host commands

  • add logical blocks to pinned set
  • remove ""
  • query pinned set
  • query pinned set misses
  • flush nv cache

 

Ric: How big are the areas?

Timothy: 128-256MB

 

Leveraging Hybrid Disks: Power

  • Block Layer
    • integrate w/ laptop mode
      • Enable NV cache power mode
      • Remove 10 minute write-back threshold
  • FS Layer
    • Pin all metadata
      • Issue: spin-up command on any metadata read request

 

Leveraging Hybrid Disks: Performance

  • Issue: how do you share this space w/ two different filesystems
  • FS Layer
    • journal, randomly accessed files, boot files, swap file
  • Block layer
    • Selective write cache for rotating media
    • Pin LBAs resulting in long seeks and leave in the request queue

 

Lord: Does the pinned section survive reboots?

Timothy: Unknown

Waiting on Erik for the answer.

Q: How many write cycles can it survive?

Timothy: it takes about one year to fry the disk if you are using it as a write cache.

Jeff: It is all about capacity. I can see the journal and boot files in the NVCache but there is too little space for use as a good journal.

Chuck: I see the wear leveling thing is going to make it impossible to use this as a place for the journal. The journal is write-mostly.

T'so: I am really excited about this technology but it needs more space in the next generations.

 

Open Questions

  • Layer conflicts
  • Who should be in charge, block layer or FS layer?
  • FS APIs to allow filesystems to leverage MD cache

 

Pinned and unpinned sets are specified in the ATA spec.

 

15:30 SSD - Dongjun Shin

  • SSD is data storage on RAM or flash

 

  • Why SSD? low power consumption, performance
    • Characteristic: no spin delay, no seek time, lower power consumption, read is faster

 

16:15 Scaling Linux Storage to Petabytes - Sage Weil

 

  • Based on object storage paradigm.
  • Goal: scale to exabytes, tens of thousands of drives, POSIX
  • Scalability forces us to solve management problems: highly tolerant of failure, trivially add and remove storage or servers, the system rebalances data/meta-data

  • Most FS components in userspace

 

CRUSH

  • data distribution with a function
    • Function to calculate the data distribution across the OSDs (illustrative sketch below)

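CRUSH itself is a hierarchical pseudo-random placement algorithm; as a much simpler stand-in that only shows the idea that placement is a computed function rather than a lookup table, here is a hash-based sketch. This is not the actual CRUSH algorithm, and a real scheme must also avoid placing two replicas on the same OSD.

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t mix(uint64_t x)          /* cheap 64-bit hash mixer */
    {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33;
        return x;
    }

    /* Any client or OSD can compute where a replica lives from
     * (object id, replica number, number of OSDs) alone. */
    static unsigned place(uint64_t object_id, unsigned replica, unsigned nr_osds)
    {
        return (unsigned)(mix(object_id ^ ((uint64_t)replica << 56)) % nr_osds);
    }

    int main(void)
    {
        for (unsigned r = 0; r < 3; r++)     /* three replicas of object 42 */
            printf("replica %u of object 42 -> osd %u\n", r, place(42, r, 100));
        return 0;
    }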
 

Audience: Sounds like Lustre

 

OSDs have awareness of where they are and distribute the work of health monitoring, recovery and rebalancing.

 

http://ceph.sourceforge.net

 

Highlight key problems in the notes

 

17:30 Open Discussion/Wrap Up?

 

Discussion point: Was it a good idea to bring the SATA/SCSI/FS people together?

Show of hands: most people thought it was good

Hellwig: We need to try and do a storage code workshop.

Ric: We stayed remarkably on time, but we didn't go into the depth we could have.

T'so: We brought together a number of groups (developers, fs, storage); should we start a new mailing list?

Val: We need someone to keep the people motivated.

Discussion: How do we reach out to students looking for work?

What happened to comp.research

Storage research mailing list, Erik.

Ric: May need to be more clear on how we select and choose people. Was this too big?

Audience: Ok, but not good for hacking

Ric: Maybe we should have 3 days. One on I/O, one on fs, one combined.

*agreeing noises*

T'so: Cabaret seating was more effective at Kernel Summit

Russell: White boards are nice

Ric: BOFs will be hard without those, heh

Ric: FS is in a bit more fragile state than storage.

Val: General announcement: Working on ChunkFS full time now.

Ric: Bringing up the point of how error values from disk should be handled.

T'so: One issue is that we need to know what we should be pushing out to standards bodies: 1) telldir/seekdir 2) 4k/1k sectors. How do we get engaged? Do we have a few sacrificial people who do this work for us?

Erik: If there are proposals that people would like to have brought forward, Seagate can help with that to some degree. We have put proposals through the system on behalf of others.

Ric: Are people going to die when we change sector sizes?

Everyone: nope nope

T'so: I want to make sure we have a forum to continue this going forward.

Discussion: Best place to bring everyone together again?

Ric: OLS?

T'so: I don't know how many people will show up to OLS because of the Cambridge move of the kernel summit. USENIX would be willing to help out with rooms a day or two before OLS. Need to know in the next couple of weeks.

Val: IEEE mass storage, Kernel Summit

T'so: Annual technical conference for USENIX in June

 

Topics for next meeting

  • VFS extensions
  • VM/FS issues
