Monday Morning: Joint Session
09:00 Introduction - Ric Wheeler
Introduction to the workshop
- Looking for a workshop feel instead of presentations.
- 50 people at the conference.
- Ric picked the slots so he could bounce around between I/O and FS
Basic user requirements of I/O
- Set complete
- Bytes placed in files are correct and in order
- Utilize storage as completely as possible
Open Questions
- How do you validate no lost files or objects?
- Verifying data integrity
- How do you know the bytes are right
- Played with reverse mappings in Reiser4
- Utilize disk
- High count of small objects kills utilization
How IO can help meet these user requirements
- Communicate non-retryable requests back to user space
- Technology by Seagate to validate in the disk
- Survival modes for drives
- Val: Never spin down; backups of dying drives are possible
- Ric: IO Coalescing is great - ???
- T'so: 8 retries on 4KB blocks when 64KB goes bad is terrible; need to communicate this
- Retry handling needs to be more robust
Performance testing
- Testing high file counts
- Simulating customer file systems is hard
- Need nasty filesystems that have been in service for a long time
- Anonymizing data difficult
09:15 EXT4 Updates - Mingming Cao
Motivation for EXT4
- Primary purpose: to support filesystems larger than 16TB we need to move beyond 32-bit block numbers
- Add sub-second resolution timestamps
- Fix the 32768 limit on subdirectories
- T'so: Stupid limit in ext3 - easy to fix
- Fix performance limitations
Why fork?
- On disk format changes required - Linus said no changes in production FS
- Observation: 2.5-style development for filesystems - lots of experimentation
News
- indirect block map moved to extents
- 48 bit block numbers
- JBD2 - ???
Removing indirect block map (ext2/3)
- inefficient for large files
- extra read for every 1024 blocks
- disk extents format used for new inodes w/ -o extents
- 12-byte ext4_extent structure
Extent tree required for more than 3 extents in i_data
- Root and leaf nodes
- inode flag to mark extents vs ext3 indirect
- convert to b-tree extents for >3 extents
- last found extent cached
Block number size
- 64 bit block number considered initially
- 48-bit is large enough: 2^48 blocks of 4KB = 2^60 bytes (1EB)
Meta-data changes
- 64-bit metadata changes
- ??? missed all points
64 bit JBD2
- Forked jbd to support 64 bit block numbers
New defaults for ext4
- some ext3 feature enabled by default on ext4
Plan
- WIP proposed on ext3 mailing list
- efficient multiple block allocation
- Persistent file allocation - don't need to write zeros to guarantee space
- nanosecond timestamps
- Other
- more than 32k subdirectories
- Ping T'so on the issue of 32k subdirs
- discussion on scaling fsck - specify initialized groups
- check on the ext3 mailing list for this
- Larger file of 16TB - limited by
Extended attributes
- some folks want more than 4kb attributes - Coming from Samba group
T'so: Vista is using more extended attributes - storing ACLs in
T'so: Storing ext. attributes in inode makes ACLs access faster instead
Mingming: doubling the size of the inode to store ext. attributes in the
inode. inode in ext3 has a pointer available to point to a data block to
store up to 4k of additional attributes.
- Fancy security modules, need to store ext attributes
Dave C.: XFS can be tuned to 2k inodes.
Val: What is the performance of 2k inodes?
Russell: it sucks; you pay when you stat
T'so: Offline tasks 1) Collate data on use of extended attributes, SELinux
small, ACLs big 2) This is what Samba is using, etc.
Dave C.: Some filesystems have ext. attribute limits, e.g 64k on XFS
Halcrow: eCryptFS could have an arbitrary length
- Applications need to pay attention to the limit of the ext. attributes
- Other directions - Scale
- 64 bit inode number?
- Userspace may assume 32 bit inode from stat()
Questions
Dave C.: Cache most recently read extent? Curious what input went into that decision
Mingming: Very simple caching scheme
Unknown: Compare performance of JFS extents to ext4 extents? Extents less
efficient on fragmented filesystems.
T'so: How fragmented are you going to get? Extents are relatively compact
Halcrow: Need to see if it sucks badly vs indirect blocks
Chuck Lever: Growing and shrinking online?
T'so: Online defrag is the first step towards online shrinking
Lever: Thought about how we are going to manage large amounts of storage
T'so: ZFS taught us that we need to look at it from the sysadmin's point of
view. Desperately need admin tools.
Hellwig: It is a user space problem, we need hooks here and there for user
space to manage storage better
Henson: We disagree, need to get to the mindset that "I am not going to make
the user partition the disk"
Caching dir contents in memory
Read: http://lwn.net/Articles/194868/
10:00 FS Repair & Scalability - Val
Waiting for Val's laptop to boot
- Rhinestones for laptops - Val
- fsck data from runs on her laptop
- details in the paper
- Seek latency is the dominating factor
Presentation of results and conversation
T'so: Doing very aggressive caching in e2fsck
David C.: Memory usage is a problem looked at this in 1998 for XFS
Val: Has a story about fsck running out of address space- Ping for story
Val: ReiserFS doesn't think about fsck problem
David C.: What type of disk? There are differences between OS disk and data disks
A: 10% better fsck on OS disk
Crowd: fsck every 30 boots is bad for desktops - need instructions to fix
Val: metadata bitmap is a new favorite idea- Ping
Ming: How do we test these performance hints?
Zach Brown - blktrace of fsck
Introduction
- Background: did OCFS2 repair tool
- Unpack kernel src tree
- Point: fsck averages 12MB/s while streams have 26MB/s
T'so: Basically any speed ups we want for FSCK need to be disk layout
changes. With extents we don't have to iterate over these horrible indirect
blocks. Tradeoff: stream vs fsck times.
Zach: system call to push disjoint reads. Need a vectored block read.
T'so: if I had working read ahead in block devices I could do better for fsck
Paper on faster BSD fsck from 1983
Dave: Threaded XFS repair. Does internal cache, does direct I/O.
Ric: Does the fsck know how many spindles it has?
Val: Parallelization limit is the number of disk arms
Dave: Do get to the point of diminishing returns
Val: I/O bound?
Dave: Yes
T'so: Want to have a working read ahead call.
Q: Why not RAID so there are no errors
Val: The idea of not having any errors doesn't work because of A)
filesystem bugs, B) sysadmin errors.
Russell: Absolutely ....
Issue: trying to parallelize I/O
Val: Need to have I/O people tell us about the number of spindles
Ric: We lie to you on purpose
Ric: what we want is the places where we can do parallelization
Ric: goes back to the idea of reverse mappings
T'so: want to layout your FS so that you are driving your disks equally.
Dave: XFS will tell you
T'so: If you have the bitmaps on disk0 that one gets hammered.
Zach: (pointing to the slide) this is just a piece of the terrible story that
is Linux FS repair.
Zach: Vectorized file reads conversation
Need to create a system call for describing all of the blocks of memory that
you want to read and where in memory you want to put the individual blocks.
Application: e2fsck, where you know the block locations of the dentries but
have to read them one at a time. During Zach's blktrace the throughput
plateau was due to this problem.
Application: Oracle's perspective: a DB is just a big on-disk file. It would
be helpful to have this system call available also.
11:00 Libata Update - Tejun Heo & Jeff Garzik
Support
- ATAPI-
- C/H/S support - only two people still use such ancient hardware
- NCQ- queuing for ATA, 32 tag command queuing
- FUA- Force to the media immediately, barrier implementations
- IDE is famous for its caching.
- SCSI SAT- reuse error handling, most installers know how to handle SCSI no
need to go to distros for installer support
- CompactFlash, adding support for devices that perform well in the ATA space
Hardware support
- Most of the drivers are PCI
- AHCI - has a DMA ring, on old devices used to write to I/O ports and hope
- SATA II 3.0 Gbps, NCQ, ..., link speed to
- eSATA - transparent to kernel support
- SATA cables are lame, not well shielded and bending causes errors
Software Features
- Conversion to new EH (error handling) almost complete - done by Tejun
- Hotplug, "warm plug"
- Improved diagnostics
- Suspend/resume - not a lot of driver coverage; future work here?
- HDIO_xxx compatibility. Mark Lord: not quite 100%
Accomplishments
- Fedora 7 test 1 disabled IDE driver, using libata for PATA and SATA
- 60+ host controllers
- Engineering support from
- controller vendors
- hard driver vendors
- integrators
- large users
- Limitation we would like to get rid of: partition labels ??
- Q: Partition issue for PATA vs SATA?
- Jeff: one solution overflow into 32 bits
- Hellwig: Problem is in block layer, wants to have all FTs in contiguous space.
- ATA community is coming together, happy pills for all
Future
- Driver API
- sane initialization model, like net driver or SCSI model,
- allocate, register, unregister, free
- Greg blessed kobjs all over the place.
- Refine error handling, ton of errors that drive, bus, system can throw.
- Should we retry for 5 minutes or pass off to the filesystem in 5 seconds?
- Sysfs support, coming
- Port multipliers
- big on SCSI SAT; it is like an Ethernet hub for SATA.
- SCSI SAS is more interesting, has expanders, network like routing table.
- Powersave, Pavel Machek
- Host protected area - ATA has a window, and then a special area on disk. ???
Block layer future
- NV cache
- pinned means that those sectors are connected to some blocks.
- unpinned: caching spins down the hard drive when idle.
- Synchronization between request queues
- Need to deal with Simplex: multiple ATA ports can only do one command at a time
- host queue, SATA, etc. "Request queue group"
- Move ATA block devices from SCSI to block layer
- Get rid of overhead of emulating SCSI
- Make SCSI block devs, transports more generic.
- Barriers suck currently.
- I/O rate guarantees
- Better error information back to filesystem
Jeff: I think that I/O and filesystems need to share more information, not
just throwing EIO.
Dave: Knowing what type of error, like hard media errors vs soft errors. Hard
errors are lost blocks. Soft errors like a path error require some time to
recover.
T'so: Pull a fibre channel disk and plug it into a hub. Maybe down for 30 seconds,
and in the mean time ext3 has remounted itself read only. But you can't block
forever.
Dave: XFS on fibre channel we had a configurable timeout. Doesn't throw EIO until
the timeout is reached.
Martin K.: Tried to decouple errors from transport.
Dave: The danger: if we have to shut down a filesystem, then we have to fix
it.
Hellwig: a lot of the soft shutdown things are racy.
Lord: what additional information should we pass?
Dave: what type of error, persistent or temporary, media or otherwise, device
unplugged.
Dave: We don't get an unplug at all.
Martin K.: Volunteered to write up a list of the type of errors that we need.
T'so: Have a certain support in the VFS.
Dave: right now we have a 1-to-1 mapping between a block device and filesystem.
We need infrastructure to support different mappings.
Hellwig: we need to separate out the unplug/plug events.
T'so: Do we support having a USB key and replug with a new major and minor
dev?
Jeff: Two types of unplug events. Hardware locking vs just unplug hardware.
What information is relevant to pass up to the filesystem.
Chris Mason: Question of who owns the spindles? FS or IO
Dave: we need to get a range of addresses of where the failure is happening.
Dave: You have a hardware RAID 5 that has a disk failure. Tell the filesystem
that one of the spindles is down.
Val: the idea of having a pipe.
Hellwig: If we have this idea of having an I/O path or pipe we can also
include information about performance information and errors.
Chris Mason: It is funny we aren't really good at handling the information we
have now.
Barriers
- They suck.
- Right now flush cache is our barrier, very painful for performance
- Wants communication from hardware
- when data hits write-back cache
- when it hits disk
- We need SCSI link commands.
Ric: Linux community not good at driving the standards community.
Val: Anyone willing to throw themselves on that grenade.
Jeff: Currently we are flushing to the disk.
Mason: We should think about cache management as a whole.
Ric: The way forward is to model this with an emulation of a disk. And then
propose to getting this into hardware.
Jeff: FUA bit
Chuck Lever: support for iSCSI in libata.
Chuck Lever: support for SATA over Ethernet.
Jeff: it is lame.
11:45 RAID Updates - Mark Lord, Dan Williams & Yanling Qi
Lord: Much of what I wanted to talk about is mostly error handling.
Dan: Particularly interested in MD RAID. Offloading the XOR and ... into an
acceleration engine. Now he has a generic memory offload system.
Requires changes to MD to tell the system to use these engines asynchronously.
async_memcpy, async_xor
Dave: An offload API would be helpful for many applications. SGI hardware has
support for hardware zeroing. Background page zeroing would be a use case.
Jeff: Promise XF4 - a lot of this conversation missed
Hellwig: What we want is a common API to talk to all raids. We want a raid
class.
Hellwig: We separate out the RAID operations from the device. raid-ops?
Dan: iop13xx is the device with the XOR/Copy engine
Ric: You are offloading because the NAS device is a really low powered device
and you want to offload this expensive work.
Ric: we could do checksumming in hardware for speeding up FSCK?
Yanling: Auto RAID - LSI
Hellwig: we have support for filesystem freezing already.
Need Ric to explain RAID sparse allocation
Monday Afternoon: I/O Track
13:30 FC Storage Updates from Vancouver - James Smart
14:15 SATA/SAS Convergence - James Bottomley, Brian King, Doug Gilbert, Darrick Wong and Tejun Heo (scribbles below by djwong)
- libata will become a block layer client _only_ for ATA disks; the SCSI interface will remain for ATAPI devices
- Need to implement a stackable EH for SCSI/SAS/ATA to route exceptions to the appropriate parties
- dougg: The SATL spec lists some translations of SAS <-> ATA error codes
- jgarzik (?): Marvell SAS driver coming = wider use of libsas in kernel
- Refactor the libata EH to be able to deal with individual scsi_cmnds coming from libsas (instead of being one big function like it is now)
- libata wants to use the new libata EH handling scheme that sas_ata doesn't use right now
- Would it be useful to translate the ata$HOST:$DEV in printks into $host:$channel:$target:$lun format?
- Other chatter about using the IDENTIFY command and the D2H FIS so that libsas can acknowledge the existence of an ATA device even if it's currently reserved by something else
15:30 Request Based Multipathing - Hannes Reinecke
16:15 Reinit of Device After kexec/kdump - Fernando
17:30 I/O Support for Virtualization
18:15 Open Discussion/Wrap-Up
Monday Afternoon: File System Track
13:30 FS Scalability & Storage Needs at the Bleeding Edge
14:15 EXT4 Online defragmentation - Takashi Sato (poor scribbles below by djwong)
- Three types of defrag: a single file, a whole directory, or free space
- Roughly 15-30% speed increase by defragging files
- Free space defrag useful for shrinking filesystems
- Strategy: Make new inode, copy data into sequential runs of blocks, then reassign the file
- Q: What if someone modifies the file during defrag? Kill the temporary inode
- Other questions: OSX hot file clustering/bootcache
15:30 Security Attributes - Michael Halcrow
- Original design was that every file would have ext. attribute with encryption information.
- Stacked filesystem sits on top of lower FS
- Duplicates every inode, dentry, etc of lower FS
- XML language to define the users/objects and encrypt policy.
- Open Question: Is it a good idea to overload SELinux for eCryptFS?
- Most common question on the SELinux mailing list is how to disable it
Val: It doesn't make sense to do the mapping on execution. Instead, I want
to do it per directory, per mount, by time of day, but not per execution.
Hellwig: we really want the user to get a namespace for encryption on login.
Most common use case.
How would you ensure certain files are encrypted?
Trying to find use cases:
- Lost USB keys
- Lost laptop problem
T'so: Maybe the threat model actually pushes this up to the application
level.
Halcrow: Pushing the problem to the application layer is a key management
nightmare.
T'so: The question is where the right layer is. Maybe this is a question of
having a key management library for Linux.
Halcrow: This is safety of secondary storage issue
Halcrow: We have a problem of losing laptops and losing financial data. I
want to be able to push a policy to a laptop for running a trusted configuration.
Argument ensues.
Val: Maybe we should change the use case to protecting against a stolen
environment. The per mount point use case is the simplest.
Halcrow: Per user per mount point name spaces.
15:30 Why Linux Sucks for Stacking - Josef Sipek
Page cache coherency
- mucking with the lower filesystem can cause things to go bad
- the upper filesystem doesn't know when the lower was changed
Hellwig: The user interface is totally broken. Layered filesystems should be
mounted internally only.
Val: delete test?
Hellwig: Much better user interface to not expose the lower filesystem by
default.
Halcrow: there are patches pushed by Morton that do that.
Sipek: Stacking can get arbitrarily complex
Hellwig: No they can't, we have this thing called kernel stacks
Sipek: Beware of cycles, must be a DAG. Walk up, sync down.
Code sharing for stacking file systems
- fsstack (fs/stack.c)
- Simple inode attribute copying functions
- What should be added?
Hellwig: most everything in eCryptFS should be pushed into stack.c
Other issues
- Lockdep doesn't like us right now. dget() happening recursively.
Hellwig: Different key for separate super blocks. ???
Halcrow: No propagation of locks to the lower filesystem
Hellwig: What do you lock for range locks on the underlying filesystem.
Mason: I would only do locks at the top layer.
on disk format
- At OLS T'so suggested having an on disk format
- storing whiteouts and persistent inode data
- Currently white outs are stored as .wh.
- Have a prototype based on ext2
16:15 B-trees for shadowed FS - Ohad Rodeh
Motivation
- useful for ZFS and WAFL
- used in research prototype of an object-disk
Current methods
- Filesystem is a tree of fixed sized pages
- In case of crash: revert to previous stable checkpoint and replay the log
Shadowing
- Two roots, some pages are shared.
- Snapshots are easy with shadowing, create new root
B-Trees
- B-trees are used by many filesystems to represent file and directories
- XFS, JFS, ReiserFS, SAN.FS
- Guarantee logarithmic-time key-search, insert, remove
Challenges
- Challenges to multi-threading
- changes propagate up to the root
- the root is a contention point
- In a regular b-tree, leaves can be linked to their neighbors
- if you are doing shadowing, you end up shadowing all of its neighbors
- this means you copy the entire tree
Write in place b-tree
- just modify the tree in place
Alternate shadowing approach
- Pages all have a virtual address that never changes.
- There is a table of virtual to physical
- In order to modify page P at L1 you copy, update, swap
- Pros
- Avoids the ripple effect of shadowing
- Used b-link trees, very good concurrency
- Cons
- Requires an additional persistent data structure
- Performance of accessing map is critical
Requirements from shadowed b-tree
- need good concurrency
- work well w/ shadow
- deadlock avoidance
- guaranteed space/memory
- Solution: tree has to be balanced
Intuition: Shadow from the top down
- lock-coupling for concurrency: grab the parent, then the child, then release the parent
- Proactive splits, split nodes that are full while recursing down
remove-key
- lock coupling
- proactive merge/shuffle
- shadow on the way down
- Pros/cons of the scheme
- Effectively lose two keys per node due to proactive split/merge
- Need loose bounds on number of entries per node
Cloning
- p is a b-tree, q a clone
- p and q should share as many pages as possible
- speed creating q from p should have little overhead
- number of clones, clone p many times
- clones should be first class, should be possible to clone q as well as p
Naive clone
WAFL free-space
- with a map of 32 bits per data block we get 32 clones
- to support 256 clones, 32 bytes are needed per data block
- to clone a volume we need to make a pass through the entire free-space map
Challenges
- How do you support a million clones w/o huge free-space map
Main idea
- modify free space so it will keep a ref count per block
- ref count counts how many times a page is pointed to
- zero means free
Cloning a tree
- Copy root p into a new root
- increment free-space counter for first child
- Before modifying page N, it is marked dirty
- informs run-time system that N is about to be modified
- gives a chance to shadow if necessary
- If ref-count == 1, page can be modified in place
Deleting a tree
- ref-count > 1: decrement the ref-count and stop downward traversal
- Node is shared with other trees
- ref-count(N) == 1: delete the node
T'so: this is also known as... ???
Zach: It looks hard to do repair.
Hellwig: Nick Piggin implemented lock coupling for the page cache
Useful Papers Ref 23 and 25:
http://www.cs.huji.ac.il/%7Eorodeh/papers/ibm-techreport/H-0245.pdf
17:30 Explode - Fault Injection for Storage - Can Sar
Introduction
- idea: model checking
- fast, easy to use
- Runs on Linux and FreeBSD
- 200 lines of C++ code
- general, real: checks live systems
- effective:
- checked 10 Linux filesystems, 3 version control systems
- bugs in all, 36 in total, mostly data loss
core idea: explore all choices
- bugs are triggered by corner cases, so drive execution down tricky paths
- choose(N): conceptual N-way fork, returns K in the Kth child
- instrumented 7 functions w/ choose
- found a JFS bug with a simple test
Discussion: May be ported to python, c
ext2 fsync bug
load A into memory, truncate, create B uses page from A. fsync causes B to
point to the same indirect block as A.
Dave: Problem is that we have small tests for individual problems, and this
seems to be a very general solution.
Val: Package it up and make it easy to use.
T'so: It would be interesting to look into how hardware failure type errors
affect filesystems.
T'so: Didn't know whether having bit flipping error case was interesting
Dave: Filesystems are getting large enough that bit errors are getting
important.
Dave: Barrier support would be interesting.
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=117148291716485&w=2
18:15 Open Discussion/Wrap Up?
Dave: Academics in Australia feel that a publication is incomplete if
they don't open-source the code; otherwise you can't reproduce the results.
Val: Make the change at Freenix in the fall.
Can: It is often a bit of work to open source and support the code.
Val: If you release useful code then people will contribute
Sunnyvale/UnionFS guy: SNIA BOF on Tracing and replay
Research topics
- Filesystem scrubbers
- Create a wiki page with ResearchTopics on linuxfs.pbwiki.com
Tuesday Morning: I/O Track
09:00 Unifying the Block Layer APIs - Tomonori Fujita
09:45 RDMA Applications to Storage - Roland
11:00 Block Guard - Martin K Petersen (scribbles below by djwong)
- 520 byte sectors with a twist--the extra 8 bytes are for protection data ("DIF")
- 2 bytes for a CRC
- 4 bytes to list the LBA sector number (apparently writes to the wrong sector are common)
- 2 bytes for an application-specific tag
- How do we use the 2 byte app-specific tag? Maybe we need input from, say, filesystem people?
- Various levels of protection: none; guard+crc; guard+crc+lba; guard
- Different HBAs allow or disallow user access to the DIF areas
- DIF requires new commands--can we emulate this with older 520B sector disks? Probably not too hard to modify
- Block layer changes -- per bio callback to ask for protection data and/or calculate the tags. There's no need to tie ourselves to SCSI
- The standard CRC algorithm is pretty slow; could we use SSE4 instructions?
- What if we want more protection data? 4K protection page per 256K of I/O, maybe?
11:45 Changes to Storage Standards - Doug Gilbert (scribbles below by djwong)
- List of standards committees:
- T10: SCSI/SAS/SPI @ t10.org
- T11: FC @ t11.org
- T13: PATA/SATA @ t13.org
- SFF: plugs, small form factor devices, etc
- IETF: iSCSI, iSER, IEEE, Infiniband, SNIA (RAID)
- Don't look at just the published standards; have a look at the last few drafts that have revision history and links to the justifications for the changes made
- SAM-4: > 16,000 LUNs, change "initiator port" to "I_T nexus"
- SSC3: Security chatter (dougg didn't elaborate much more than that)
- OSD2: Obsolete OSD1 calls; align commands to certain byte boundaries
- SPC4:
- "initiator port" -> "I_T nexus" change
- Access to log subpages: statistics and performance data
- Report identifying information (labels printed outside, I think?)
- Encapsulated SCSI opcodes
- Obsolete items: linked commands and basic task management
- SBC-3:
- Background media verification
- 4K sectors
- log pages
- Grouping of commands to simplify (perhaps allow aggregation of?) logging and data collection purposes
- ORWRITE
- Write uncorrectable sector
- SAS-2:
- 6Gbps links that are 2 multiplexed 3Gbps connections
- Zones in the SAS domain
- Self-configuring expanders
- Multiple affiliations for SATA devices
- SATA 2.6:
- Slimline and micro connectors; mini SATA
- NCQ command priority
- Enhanced BIST activate FIS/signature FIS
- SAT-2:
- ATAPI translations
- NCQ control functions
- Persistent reservations
- ATA security
- LUN to port multiplier mappings
- 4K blocks and ORWRITE
- NV cache translation
- End to end data protection
Tuesday Morning: FS Track
09:00 NFS Topics: Chuck Lever, James Fields, Sai Susarla & Trond
Server Issues
Bruce: Race: When there are two writes within the same second, the second
client doesn't see the second write when doing its initial open.
- Nanosecond time granularity does not resolve race
- Actual timesource is just jiffies
NFSv4 change attribute requirements
- Must change whenever ctime changes
- Must be consistent across server reboots
- Not consistent across files
NFSv4 change attribute non-requirements
- Not required to be consistent across files (unlike ctime)
- No units (doesn't have to measure number or size of writes)
Bruce: Easy to come up with solutions that don't quite work. Need to work
across server reboots.
Val: On Solaris you could cheat and make sure that the nanosecond time field
at least changes
Crowd: Counters are the solution.
Dave: the problem with that is that you have to write to disk every time you
modify.
Val: Should we update the number on reboot.
Dave: The fact that it has to be on disk makes it a bit more expensive.
Erez: It seems like the problem should be solved with a callback
Dave: Imagine every client does that
Mason: If you have something that changes every time the server reboots. And
keeping the counter in memory.
Val: This is an expensive way to do it for legacy filesystems.
Bruce: What you don't want is the client to see data that is halfway
written without also seeing the change attribute change.
Ohad: Isn't this supposed to be solved with delegations?
Trond: Delegation isn't a cache validation but a cache acceleration tool.
Mason: If the FS crashes you want to invalidate all of the caches.
Trond: Consider the case where you have hundreds of NFS roots, it is the
pathological case.
Bruce: Samba and Apache have the same problem.
NFSv4 ACLs
Trond: The basic message we have about this is we may want v4 ACLs on Linux.
It looks like they are becoming the new standard.
Dave: there is a patch for ext3 for nfs v4 acls. The question is how we match
posix and nfsv4 ACLs.
Bruce: difficulty is the synchronization of the mode
Val: Use the POSIX standard recommendations
Bruce: The nice thing with POSIX ...
Val: We have the implementation of POSIX ACLs but no one uses it.
BSD is considering the use of NFSv4 ACLs for all filesystems. They don't have
the legacy problem we have.
Exporting cluster-coherent locks
- already a file->lock method
- need to modify NLM and NFS to call it
- allow async return of results
- Bruce proposes a new async API for locking
Client scalability
- readdir scalability issues
- get rid of redundant storage
- dentry+cookie lookup table
VFS: intents
- get rid of redundant lookups in operations like rename and link
Trond: Main issue is that you have to pass them along
pNFS
Raul: pNFS is a method by which v4 clients can access data directly by
accessing the storage device. You ask for the layout and get it. A number of
companies are looking at pNFS.
- The sort of things we will wind up seeing
- the storage backend could go through the block subsystem
- potentially hit the elevators
- could end up getting bursty traffic patterns
09:45 GFS Updates: Steven Whitehouse
GFS Introduction
- 64 bit symmetric cluster fs
- Took over development in Oct 2005
- About 1.2Mb of code
- Various changes and cleanups as a result of upstream
- Accepted into 2.6.19
- bug fixes
- Current code 716k
important changes
- GFS used to have a journaling system that appended a header to the files
- different file journal layout, same sort of system as ext3
- allows mmap, splice, etc to journaled files
- A metadata filesystem for access to special files
- journals, rindex, fuzzy statfs, quotas
- locking is at the page cache level (GFS was at syscall level)
- faster, supporting new syscalls, eg splice
- readpages() support, some writepages() support
- to be expanded in the future
- supports the ext3 standard ioctl lsattr, chattr
problem we have been looking at is directory structure
- GFS2 structure based on "Extendible Hashing" by Fagin et al., Sept 1979
- Small directories are packed into the directory inode
- Hashed dirs based on extendible hashing, but
- Uses a CRC32 hash in the dcache to reduce the number of times we hash a filename
Problems with readdir
- The way we do the splitting of the dentries we could get a reorder on a new directory insert
- Have to sort each and every hash chain even if only one entry is read
Val: One of the most annoying things about filesystem is getting readdir to
work properly. Shrinking and growing directories can break it.
Val: You can do extendible hashing and have stable points.
Ohad: the cursor is only 32 bits, so if you have more than 4G entries it
is broken; it will truncate.
Val: I think we are kind of saying that readdir is broken.
readdir is also used in other places aside from getdents64
- NFS readdirplus
- NFS getname doesn't need a defined order of entries, so we could potentially avoid the sorting operation
T'so: I can think of two things to do:
1) Substantially increase the size of the cookie returned by telldir
2) Get into the right committees to deprecate telldir/seekdir in POSIX
I would not be opposed to eliminating the call altogether
09:45 OCFS2 Updates: Mark Fasheh
Introduction
- Shared disk cluster filesystem
- Took a number of idea from other filesystems
- Try to get node ops local to one node. Allocation hashing.
- Development focus on adding features
- Generic OCFS2 b-tree code, will make ext. attr support with b-trees easy
- Mounting OCFS2 as local makes OCFS2 act local
- Filesystem is designed for large allocations
Fasheh: perform_write is great
http://www.ussg.iu.edu/hypermail/linux/kernel/0612.2/0218.html
Dave: batches prepare_write and ...
Fasheh: writes an entire range, no page locks
Dave: No documentation on perform_write,
Fasheh: ocfs2.git has an example implementation
Fasheh: One thing I am concerned with perform_write is that other paths in the
kernel can use it.
Dave: What does invalidate page mean? *laughs*
Ric: Forced unmount is a very useful thing.
Dave: XFS allows you to shut it down and then unmount.
Val: There should be a generic support in vnode
Fasheh: We need to see if we can avoid fencing in some cases.
Ric: Fencing is useful in some cases
Ric: Fencing is telling the storage device to just ignore foo host.
Fasheh: OCFS2 needs better fencing. I guess I was trying to see if there is any
queued IO for a device.
Hellwig: We should have fencing in VFS. I liked the way that GFS did it before
it went closed.
Ric: Most popular user request?
Fasheh: Forced unmount
Hellwig: funmount should be a vfs feature
Ric: Favorite feature?
Fasheh: Feature they like is easy setup.
Russell: Performance hot buttons?
Fasheh: Our inodes don't fit in a block.
- Each node has its own journal
Future work
- Mixing extent data and ext. attributes
- can make for very complex code
- I would like to move to GFS's DLM
GFS vs OCFS2
- we have the edge in stability
- we handily beat GFS in performance
http://en.wikipedia.org/wiki/Distributed_lock_manager
Business case
- Original business case was to host an Oracle home on it.
An interesting conversation would be on which filesystems are better for what,
particularly focusing on the clustering filesystems.
ANNOUNCE: Val: Nick Piggin is interested in a VM/FS workshop.
11:00 Enhancing the Linux Memory Arch for Heterogeneous Devices - Alexandros Batsakis
- IPoIB is 3 times slower than RDMA
- writes that occur at the same time as pdflush syncs take up to 1.5 seconds
Val: How much time is elapsed in both
- Traded one big write congestion w/ small ones by tuning pdflush ratio
- Clients are heterogeneous
- Client-server network is heterogeneous
- But... flushing policy is system-wide and static
Dave: you can set the pdflush ratio on per cpu set thanks to a patch by Hellwig
Alexandros: wants to have it per device, not per cpu set
- ratio makes RDMA faster on 1GB RAM systems than on 2GB RAM systems.
Lever: We need to have some sort of self tuning system for this issue.
- Can writeback be storage-aware?
- Not only an RDMA issue - 10GbE etc
Dave: People tend to expect the clients to do the caching.
- Need communication from the server about how often it can write, the load, etc.
Erez: Where is the pathological case that is causing this?
Dave: Why is pdflush blocking the other writes from going on?
Trond: Hopefully Peter Zijlstra's patch will fix the situation? Should be
setting the non-blocking flag for the flush.
PATCH nfs: fix congestion control
Perhaps the pathological case is that we are contending on the waitqueue
implemented in the NFS client.
11:45 DualFS & Integration with High End Arrays - Juan Piernas Canovas
Introduction
- Better performance in almost all cases than traditional journaling filesystems
- must design filesystems to better take advantage of storage tech
- meta-data management is a key design issue
- traditionally meta-data was written in a synchronous way, relying on fsck
- Current: log last meta-data updates, asynchronous meta-data writes
- DualFS uses a log
- Separation papers
- Muller and Pasquale
- Ruemmler and Wilkes
Design
- Separates data and metadata
- Proves separation can improve performance
Motivation
- Presented a distribution of data/metadata traffic for different workloads
Conclusions
- Meta-data represents a high percentage of total I/O time
- Writes predominant
- Requests are almost never sequential
Diagram: Two separate filesystems for data and metadata. Separate drives,
partitions or zones. The problem with this layout is that reading a normal
file is inefficient; a solution to this problem comes later.
Data Device
- Like Ext2 w/o meta-data blocks
- Groups
- Grouping is per dir
- Related blocks kept together
- File layout for optimizing seq. access.
Dir. affinity
- Select the parent's dir if the best one is not good enough
- (does not have at least %foo free blocks)
- Data blocks
Val: We are trying to figure out if the allocator is the reason you are seeing
the difference in performance. What percentage of the improvements are due to
those changes.
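The allocator rule in the notes is terse, so this sketch assumes one plausible reading: prefer the parent directory's group for locality, and fall back to the emptiest group when the parent's group does not have at least a threshold of free blocks. All names and the threshold are hypothetical.

```python
def pick_group(free_blocks, parent_group, min_free=100):
    """free_blocks: free-block count per allocation group.
    Returns the index of the group to allocate from."""
    if free_blocks[parent_group] >= min_free:
        return parent_group                      # keep the dir's files together
    # parent's group too full: take the group with the most free blocks
    return max(range(len(free_blocks)), key=lambda g: free_blocks[g])

assert pick_group([500, 10, 40], parent_group=0) == 0   # locality wins
assert pick_group([10, 500, 40], parent_group=0) == 1   # fallback to emptiest
```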
Meta-data Device
- Meta-data: i-nodes, indirect blocks, directory data blocks, symbolic links, bitmaps, superblock copies
- Organized as a log-structured file system, like BSD-LFS
- Meta-data elements have same format as ext2/ext3
- big change is how the meta data is written to disk.
Diagram: Layout of meta-data, divided into number of segments
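The log-structured layout above can be reduced to a toy model (all names and the segment size are hypothetical): metadata is never overwritten in place, updates are appended to the current segment of a log, so scattered metadata writes become sequential I/O.

```python
SEGMENT_BLOCKS = 4          # blocks per log segment, arbitrary for the demo

class MetadataLog:
    def __init__(self):
        self.segments = [[]]                      # the log, newest segment last

    def append(self, block):
        """Append-only write: no seeks back to a fixed on-disk location."""
        if len(self.segments[-1]) == SEGMENT_BLOCKS:
            self.segments.append([])              # segment full: open the next
        self.segments[-1].append(block)

log = MetadataLog()
for i in range(10):                               # ten metadata updates
    log.append(("inode", i))

assert len(log.segments) == 3                     # 10 blocks fill 3 segments
assert log.segments[0] == [("inode", 0), ("inode", 1), ("inode", 2), ("inode", 3)]
```

A real implementation additionally needs the cleaner discussed below, which reclaims segments whose blocks have been superseded.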
Erez: Found a case where log structured filesystems make sense. The cost of
the cleanup won't kill the log filesystem because meta data is much smaller.
T'so: the one exception to small meta-data is one large directory, like squid.
How many entries in the directory?
Sorin: We tested with 50,000 1MB files
T'so: Ok, so small
Juan: The win is that you can write the metadata sequentially on disk.
T'so: The question is traditionally log fs have traded read performance for
write performance. I am trying to find bad cases, like git on a cold cache.
Ric: Can we dynamically grow and shrink the meta-data and data partition? You
have to decide the use of the FS ahead of time.
Ric: Worst case, find -exec md5 {}, perhaps? Did the tar speed go fast?
Dave: If you get over a 100MB extent it doesn't matter anyway because you have
reached the I/O limit of the disk.
Dave: How do you fix the deadlock condition that you have no free segments
left to run the cleaner?
T'so: Basically you need to freeze the filesystem and let the cleaner do its
work, reserving enough memory for the cleaner to do its work.
Val: I would like to talk about the fact that you have a meta data block that
is static.
http://sourceforge.net/project/showfiles.php?group_id=187143
VERY quick rundown of the performance
Tuesday Afternoon: Joint Session
13:30 pNFS Object Storage Driver - Benny Halevy
What is the problem we are trying to solve
- Part of the IETF NFSv4.1 draft
- Scaling out problem
- Want many clients to talk directly to storage devices in parallel w/o server
- Skipping the server has been available in a number of clustered file systems
- Why implement this again?
- Proprietary protocols bad
- Interoperability is good for everybody
http://playground.sun.com/pub/nfsv4/webpage/
pNFS comes in three different flavours:
http://www.nfsv4-editor.org/draft-08/draft-ietf-nfsv4-minorversion1-08.txt
Explanation of layout4 data structure and pnfs_osd_layout4
Object based storage overview
- Basically decoupling the storage and the namespace
- OSDs are cool
Diagram: client asks the security manager for authorization. SM hands client a
capability and uses it to sign requests to object store. object store and SM
have a shared secret.
- OSD commands are a bit chubby: 200 bytes long.
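The capability flow in the diagram can be sketched with an HMAC: the security manager derives a signing key for the client from the capability and the secret it shares with the object store, so the store can verify requests without talking to the manager. This is a rough sketch; the field layout and the use of SHA-1 here are assumptions, not the OSD wire format.

```python
import hmac, hashlib

SHARED_SECRET = b"sm-and-osd-shared-secret"      # provisioned out of band

def issue_capability(object_id, rights):
    """Security manager: hand the client a capability plus a signing key
    derived from it (the client never sees the shared secret)."""
    cap = f"{object_id}:{rights}".encode()
    cap_key = hmac.new(SHARED_SECRET, cap, hashlib.sha1).digest()
    return cap, cap_key

def sign_request(cap_key, request):
    """Client: sign each OSD request with the capability key."""
    return hmac.new(cap_key, request, hashlib.sha1).digest()

def osd_verify(cap, request, sig):
    """Object store: re-derive the capability key from the shared secret
    and check the request signature."""
    cap_key = hmac.new(SHARED_SECRET, cap, hashlib.sha1).digest()
    want = hmac.new(cap_key, request, hashlib.sha1).digest()
    return hmac.compare_digest(sig, want)

cap, key = issue_capability("obj42", "read")
req = b"READ obj42 offset=0 len=4096"
assert osd_verify(cap, req, sign_request(key, req))      # authorized request
assert not osd_verify(cap, b"WRITE obj42", sign_request(key, req))  # tampered
```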
What we need as far as Kernel support
- Linux wants bi-dir SCSI commands
- Emulex wants to replace its own proprietary protocol
- should see them in the kernel in another month.
- patches for block, scsi, iscsi
- tested on iscsi -> IET and IBM OSD initiator -> IBM OSD target sim
- need support for large variable-length CDBs
Idea of the patches
- add an API to access the current I/O-related information as
uni-directional w/ little existing code change, then add bi-directional read
and write buffers
Todo
- bidi residual bytes
- OSD initiator library
Design
- (p)NFS client
- pnfs-obj layout driver
- OBJ RAID
- Flow control (global and per-device)
- OSD initiator
Jeff: Doesn't this explode the complexity of NFS a lot? Why not stay with files?
Benny: The applications see a POSIX fs.
Jeff: The wonderful thing about NFS was that it was entirely interoperable and
now we have to support a ton of devices.
Benny: Benefits are scalability because of local block and the security model.
Jeff: This is not optional, we have to support all of this crap. Now NFS has to
support SCSI and RAID.
Ric: What I am hearing is that you want performance proof?
T'so: There is a real question of who is going to be using this. This may end
up as a high end plaything.
Erik: Our intent is to have objects be the interface to storage devices.
T'so: Linux works really well when we have a lot of developers with commodity
hardware in their hands.
Erik: pNFS is taking on a lot bigger problems than OSD. The three layouts of
pNFS are there for legacy usage.
T'so: I will eagerly wait the day when I can go to Fry's and pick up an OSD.
Hellwig: There is actually very little code to support object based storage.
Dave: What is the complexity of the pNFS server? We need the security manager,
object handling, etc. There are some interesting complexity questions there.
14:15 OSD APIs & Justification - Object-Based Disks
OSDs are cool
- Block-based storage doesn't make it possible to offload processing.
- OSD is about pushing a bit of the filesystem down to the hardware.
Hellwig: This is a lot of marketing bullshit.
Trond: You should be careful of talking about pNFS security and OSD security in
the same breath.
Jeff: What I really want to see is an OSD fs.
T'so: I agree that this linear 512 or 4,096 byte is so 1970s but I need the
cheaper hardware.
Upper level driver discussion
Jeff: An upper level driver isn't needed. OSDFS would talk SCSI directly.
Erik: Are we missing out on error handling by not having an upper level driver?
Hellwig: You basically want a library for handling the OSD.
T'so: He is objecting to putting it in the SCSI layer.
Hellwig: We should use sgio and sg devices for passing down the SCSI commands
to the hardware. We have a SCSI pass-through that can be used for management
in userspace.
A number of interesting OSD applications for research.
- sf.net/projects/intel-iscsi
- User level OSD daemon
- kernel modules for
- iSCSI initiator code
T'so: Interesting research question about GIT in an object store
Ric: companies do redundancy-based hardware
14:15 SNIA - Erik Riedel
Introduction
- SCSI has served us well for 25 years
- Moving from SBC (SCSI block commands) to OSD
- We will make these devices real
OSD Commands, OSD-1 r10 as ratified
- Important: Read, write, create, remove, get attr, set attr
- Imp. security: Auth, integrity, set key, set master key
Motivation
- Basically we feel we need to continue increasing the capacity of drives.
- Objects have attributes attached to them, like extended attributes on disk.
Security
- separation of the security manager and the object drive
- leaves a lot of the complexity of security mechanisms off the object drive
- Keys are per partition.
Dave: How does the OSD deal with fragmentation?
Dave: If there is something wrong with the fragmentation then we have to rely
on the disk to defragment. How do you avoid fragmentation?
T'so: If you tell a hard core filesystem person that you are going to take
care of all of the inode allocation problems then they will ask how it is
implemented.
Dave: What you are telling me is that you are holding us hostage to implement
what we need.
Lord: What happens when you create one large object and manage it like a
regular disk?
Erik: That is fine.
Future
- D to D migration of data
- Snapshots
www.snia.org/tech_activities/workgroups/osd/
15:30 Hybrid Disks - Timothy Bisson
NVCache split into two sections
- pinned and unpinned.
- Host controls the pinned sets
- Pin one or more LBAs to flash
- Device control unpinned set as a cache
New mode - NV Cache power mode
- Redirect I/O to NVCache, manufacturer spin-down algorithm
- Pinned set management independent of this mode
Host commands
- add logical blocks to pinned set
- remove ""
- query pinned set
- query pinned set misses
- flush nv cache
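The host commands listed above can be modeled as a toy pinned-set manager. This is a hypothetical host-side sketch, not the ATA command set itself; the capacity check and the miss counter are assumptions about device behavior.

```python
class NVCache:
    def __init__(self, capacity_lbas):
        self.capacity = capacity_lbas
        self.pinned = set()
        self.misses = 0          # pinned-set misses, queryable by the host

    def pin(self, *lbas):
        """Add logical blocks to the pinned set."""
        if len(self.pinned) + len(set(lbas) - self.pinned) > self.capacity:
            raise ValueError("pinned set would exceed NV cache capacity")
        self.pinned.update(lbas)

    def unpin(self, *lbas):
        """Remove logical blocks from the pinned set."""
        self.pinned.difference_update(lbas)

    def read(self, lba):
        """A read outside the pinned set counts as a miss (and, in power
        mode, would spin up the rotating media)."""
        if lba not in self.pinned:
            self.misses += 1
        return lba in self.pinned

cache = NVCache(capacity_lbas=4)
cache.pin(0, 1, 2)           # e.g. pin journal or boot blocks
assert cache.read(1)         # served from flash
assert not cache.read(9)     # miss: would hit the disk
assert cache.misses == 1
```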
Ric: How big areas?
Timothy: 128-256MB
Leveraging Hybrid Disks: Power
- Block Layer
- integrate w/ laptop mode
- Enable NV cache power mode
- Remove 10 minute write-back threshold
- FS Layer
- Pin all metadata
- Issue: spin-up command on any metadata read request
Leveraging Hybrid Disks: Performance
- Issue: how do you share this space w/ two different filesystems
- FS Layer
- journal, random-access files, boot files, swap file
- Block layer
- Selective write cache for rotating media
- Pin LBAs resulting in long seeks and leave in the request queue
Lord: Does the pinned section survive reboots?
Timothy: Unknown
Waiting on Erik for the answer.
Q: How many write cycles can it survive?
Timothy: it takes about one year to fry the disk if you are using it as a
write cache.
Jeff: It is all about capacity. I can see the journal and boot files in the
nvcache but there is too little space for use as a good journal.
Chuck: I see the wear leveling thing is going to make it impossible to use this
as a place for the journal. The journal is write mostly.
T'so: I am really excited about a technology but it needs more space in next
generations.
Open Questions
- Layer conflicts
- Who should be in charge, block layer or FS layer?
- FS APIs to allow filesystems to leverage the NV cache
The pinned and unpinned sets are specified in the ATA spec.
15:30 SSD - Dongjun Shin
- SSD is data storage on RAM or flash
- Why SSD? low power consumption, performance
- Characteristics: no spin delay, no seek time, lower power consumption, reads are faster
16:15 Scaling Linux Storage to Petabytes - Sage Weil
- Based on object storage paradigm.
- Goal: scale to exabytes, tens of thousands of drives, POSIX
- Scalability forces solving management problems: highly failure tolerant,
trivially add and remove storage or servers,
system rebalances data/meta-data
- Most FS components in userspace
CRUSH
- data distribution with a function
- Function to calculate the data distribution of the OSDs
Audience: Sounds like Lustre
OSDs have awareness of where they are and distribute the work of health,
recovery and rebalance.
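The "distribution with a function" idea can be sketched with rendezvous (highest-random-weight) hashing, a simplification of the real CRUSH algorithm, which also handles weights and hierarchical failure domains. Placement is a pure function of the object name and the OSD list, so any client computes it locally without asking a central server; names here are illustrative.

```python
import hashlib

def place(obj, osds, replicas=2):
    """Return the `replicas` OSDs responsible for `obj`, computed purely
    from the object name and the cluster's OSD list."""
    def score(osd):
        return hashlib.sha1(f"{obj}/{osd}".encode()).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = ["osd0", "osd1", "osd2", "osd3"]
assert place("myobject", osds) == place("myobject", osds)   # deterministic
assert len(place("myobject", osds)) == 2
# Adding an OSD only remaps objects that now rank it highest, so the
# cluster rebalances incrementally instead of reshuffling everything.
```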
http://ceph.sourceforge.net
Highlight key problems in the notes
17:30 Open Discussion/Wrap Up?
Discussion point: Was it a good idea to bring the SATA/SCSI/FS people together?
Show of hands: most people thought it was good
Hellwig: We need to try and do a storage code workshop.
Ric: We stayed remarkably on time, but we didn't go into the depth when we
could have.
T'so: We brought together a number of groups: developers, fs, storage. Should
we start a new mailing list?
Val: We need someone to keep the people motivated.
Discussion: How do we reach out to students looking for work?
What happened to comp.research?
Storage research mailing list, Erik.
Ric: May need to be more clear on how we select and choose people. Was this
too big?
Audience: Ok, but not good for hacking
Ric: Maybe we should have 3 days. One on I/O, one on fs, one combined.
*agreeing noises*
T'so: Cabaret seating was more effective at Kernel Summit
Russell: White boards are nice
Ric: BOFs will be hard without those, heh
Ric: FS is in a bit more fragile state than storage.
Val: General announcement: Working on ChunkFS full time now.
Ric: Bringing up the point of how error values from disk should be handled.
T'so: One issue is that we need to know what we should be pushing out to
standards bodies 1) telldir/seekdir 2) 4k/1k sectors. How do we get engaged?
Do we have a few sacrificial people who do this work for us?
Erik: If there are proposals that people would like to have brought forward,
Seagate can help with that to some degree. We have put proposals through the
system on behalf of others.
Ric: Are people going to die when we change sector sizes?
Everyone: nope nope
T'so: I want to make sure we have a forum to continue this going forward.
Discussion: Best place to bring everyone together again?
Ric: OLS?
T'so: I don't know how many people will show up to OLS because of the Kernel
Summit's move to Cambridge. USENIX would be willing to help out with rooms a
day or two before at OLS. Need to know in the next couple of weeks.
Val: IEEE mass storage, Kernel Summit
T'so: Annual technical conference for USENIX in June
Topics for next meeting
- VFS extensions
- VM/FS issues