aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2016-03-25ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ↵Xue jiufei1-28/+54
ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et I found that jbd2_journal_restart() is called in some places without keeping things consistently before. However, jbd2_journal_restart() may commit the handle's transaction and restart another one. If the first transaction is committed successfully while another not, it may cause filesystem inconsistency or read only. This is an effort to fix this kind of problems. This patch (of 3): The following functions will be called while truncating an extent: ocfs2_remove_btree_range -> ocfs2_start_trans -> ocfs2_remove_extent -> ocfs2_truncate_rec -> ocfs2_extend_rotate_transaction -> jbd2_journal_restart if jbd2_journal_extend fail -> ocfs2_rotate_tree_left -> ocfs2_remove_rightmost_path -> ocfs2_extend_rotate_transaction -> ocfs2_unlink_subtree -> ocfs2_update_edge_lengths -> ocfs2_extend_trans -> jbd2_journal_restart if jbd2_journal_extend fail -> ocfs2_et_update_clusters -> ocfs2_commit_trans jbd2_journal_restart() may be called and it may happened that the buffers dirtied in ocfs2_truncate_rec() are committed while buffers dirtied in ocfs2_et_update_clusters() are not, the total clusters on extent tree and i_clusters in ocfs2_dinode is inconsistency. So the clusters got from ocfs2_dinode is incorrect, and it also cause read-only problem when call ocfs2_commit_truncate() with the error message: "Inode %llu has empty extent block at %llu". We should extend enough credits for function ocfs2_remove_rightmost_path and ocfs2_update_edge_lengths to avoid this inconsistency. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2/dlm: move lock to the tail of grant queue while doing in-place convertxuejiufei1-0/+6
We have found a bug when two nodes doing umount one after another. 1) Node 1 migrate a lockres that has 3 locks in grant queue such as N2(PR)<->N3(NL)<->N4(PR) to N2. After migration, lvb of the lock N3(NL) and N4(PR) are empty on node 2 because migration target do not copy lvb to these two lock. 2) Node 3 want to convert to PR, it can be granted in __dlmconvert_master(), and the order of these locks is unchanged. The lvb of the lock N3(PR) on node 2 is copyed from lockres in function dlm_update_lvb() while the lvb of lock N4(PR) is still empty. 3) Node 2 want to leave domain, it will migrate this lockres to node 3. Then node 2 will trigger the BUG in dlm_prepare_lvb_for_migration() when adding the lock N4(PR) to mres with the following message because the lvb of mres is already copied from lock N3(PR), but the lvb of lock N4(PR) is empty. "Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u" [akpm@linux-foundation.org: tweak comment] Signed-off-by: xuejiufei <xuejiufei@huawei.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: solve a problem of crossing the boundary in updating backupsjiangyiwen1-1/+1
In update_backups() there exists a problem of crossing the boundary as follows: we assume that lun will be resized to 1TB(cluster_size is 32kb), it will include 0~33554431 cluster, in update_backups func, it will backup super block in location of 1TB which is the 33554432th cluster, so the phenomenon of crossing the boundary happens. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix occurring deadlock by changing ocfs2_wq from global to localjiangyiwen7-33/+32
This patch fixes a deadlock, as follows: Node 1 Node 2 Node 3 1)volume a and b are only mount vol a only mount vol b mounted 2) start to mount b start to mount a 3) check hb of Node 3 check hb of Node 2 in vol a, qs_holds++ in vol b, qs_holds++ 4) -------------------- all nodes' network down -------------------- 5) progress of mount b the same situation as failed, and then call Node 2 ocfs2_dismount_volume. but the process is hung, since there is a work in ocfs2_wq cannot beo completed. This work is about vol a, because ocfs2_wq is global wq. BTW, this work which is scheduled in ocfs2_wq is ocfs2_orphan_scan_work, and the context in this work needs to take inode lock of orphan_dir, because lockres owner are Node 1 and all nodes' nework has been down at the same time, so it can't get the inode lock. 6) Why can't this node be fenced when network disconnected? Because the process of mount is hung what caused qs_holds is not equal 0. Because all works in the ocfs2_wq are relative to the super block. The solution is to change the ocfs2_wq from global to local. In other words, move it into struct ocfs2_super. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_listJoseph Qi2-1/+13
When master handles convert request, it queues ast first and then returns status. This may happen that the ast is sent before the request status because the above two messages are sent by two threads. And right after the ast is sent, if master down, it may trigger BUG in dlm_move_lockres_to_recovery_list in the requested node because ast handler moves it to grant list without clear lock->convert_pending. So remove BUG_ON statement and check if the ast is processed in dlmconvert_remote. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2/dlm: fix race between convert and recoveryJoseph Qi1-1/+10
There is a race window between dlmconvert_remote and dlm_move_lockres_to_recovery_list, which will cause a lock with OCFS2_LOCK_BUSY in grant list, thus system hangs. dlmconvert_remote { spin_lock(&res->spinlock); list_move_tail(&lock->list, &res->converting); lock->convert_pending = 1; spin_unlock(&res->spinlock); status = dlm_send_remote_convert_request(); >>>>>> race window, master has queued ast and return DLM_NORMAL, and then down before sending ast. this node detects master down and calls dlm_move_lockres_to_recovery_list, which will revert the lock to grant list. Then OCFS2_LOCK_BUSY won't be cleared as new master won't send ast any more because it thinks already be authorized. spin_lock(&res->spinlock); lock->convert_pending = 0; if (status != DLM_NORMAL) dlm_revert_pending_convert(res, lock); spin_unlock(&res->spinlock); } In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set (res is still in recovering) or res master changed (new master has finished recovery), reset the status to DLM_RECOVERING, then it will retry convert. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()Ryan Ding1-4/+8
The code should call ocfs2_free_alloc_context() to free meta_ac & data_ac before calling ocfs2_run_deallocs(). Because ocfs2_run_deallocs() will acquire the system inode's i_mutex hold by meta_ac. So try to release the lock before ocfs2_run_deallocs(). Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix disk file size and memory file size mismatchRyan Ding1-10/+17
When doing append direct write in an already allocated cluster, and fast path in ocfs2_dio_get_block() is triggered, function ocfs2_dio_end_io_write() will be skipped as there is no context allocated. As a result, the disk file size will not be changed as it should be. The solution is to skip fast path when we are about to change file size. Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_writeRyan Ding1-6/+17
Take ip_alloc_sem to prevent concurrent access to extent tree, which may cause the extent tree in an unstable state. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix ip_unaligned_aio deadlock with dio work queueRyan Ding5-36/+9
In the current implementation of unaligned aio+dio, lock order behave as follow: in user process context: -> call io_submit() -> get i_mutex <== window1 -> get ip_unaligned_aio -> submit direct io to block device -> release i_mutex -> io_submit() return in dio work queue context(the work queue is created in __blockdev_direct_IO): -> release ip_unaligned_aio <== window2 -> get i_mutex -> clear unwritten flag & change i_size -> release i_mutex There is a limitation to the thread number of dio work queue. 256 at default. If all 256 thread are in the above 'window2' stage, and there is a user process in the 'window1' stage, the system will became deadlock. Since the user process hold i_mutex to wait ip_unaligned_aio lock, while there is a direct bio hold ip_unaligned_aio mutex who is waiting for a dio work queue thread to be schedule. But all the dio work queue thread is waiting for i_mutex lock in 'window2'. This case only happened in a test which send a large number(more than 256) of aio at one io_submit() call. My design is to remove ip_unaligned_aio lock. Change it to a sync io instead. Just like ip_unaligned_aio lock, serialize the unaligned aio dio. [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi] Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: code clean up for direct ioRyan Ding2-144/+10
Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write: * remove append dio check: it will be checked in ocfs2_direct_IO() * remove file hole check: file hole is supported for now * remove inline data check: it will be checked in ocfs2_direct_IO() * remove the full_coherence check when append dio: we will get the inode_lock in ocfs2_dio_get_block, there is no need to fall back to buffer io to ensure the coherence semantics. Now the drop dio procedure is gone. :) [akpm@linux-foundation.org: remove unused label] Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix sparse file & data ordering issue in direct ioRyan Ding1-517/+346
There are mainly three issues in the direct io code path after commit 24c40b329e03 ("ocfs2: implement ocfs2_direct_IO_write"): * Does not support sparse file. * Does not support data ordering. eg: when write to a file hole, it will alloc extent first. If system crashed before io finished, data will corrupt. * Potential risk when doing aio+dio. The -EIOCBQUEUED return value is likely to be ignored by ocfs2_direct_IO_write(). To resolve above problems, re-design direct io code with following ideas: * Use buffer io to fill in holes. And this will make better performance also. * Clear unwritten after direct write finished. So we can make sure meta data changes after data write to disk. (Unwritten extent is invisible to user, from user's view, meta data is not changed when allocate an unwritten extent.) * Clear ocfs2_direct_IO_write(). Do all ending work in end_io. This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4 test cases of ltp. For performance improvement, see following test result: ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/. The original way: + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct 16384+0 records in 16384+0 records out 4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s After this patch: + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct 16384+0 records in 16384+0 records out 4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: record UNWRITTEN extents when populate write descRyan Ding4-5/+106
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. There is still one issue in the direct write procedure. phase 1: alloc extent with UNWRITTEN flag phase 2: submit direct data to disk, add zero page to page cache phase 3: clear UNWRITTEN flag when data has been written to disk When there are 2 direct write A(0~3KB),B(4~7KB) writing to the same cluster 0~7KB (cluster size 8KB). Write request A arrive phase 2 first, it will zero the region (4~7KB). Before request A enter to phase 3, request B arrive phase 2, it will zero region (0~3KB). This is just like request B steps request A. To resolve this issue, we should let request B knows this cluster is already under zero, to prevent it from steps the previous write request. This patch will add function ocfs2_unwritten_check() to do this job. It will record all clusters that are under direct write(it will be recorded in the 'ip_unwritten_list' member of inode info), and prevent the later direct write writing to the same cluster to do the zero work again. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: return the physical address in ocfs2_write_clusterRyan Ding1-15/+13
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Direct io needs to get the physical address from write_begin, to map the user page. This patch is to change the arg 'phys' of ocfs2_write_cluster to a pointer, so it can be retrieved to write_begin. And we can retrieve it to the direct io procedure. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: do not change i_size in write_end for direct ioRyan Ding1-33/+46
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Append direct io do not change i_size in get block phase. It only move to orphan when starting write. After data is written to disk, it will delete itself from orphan and update i_size. So skip i_size change section in write_begin for direct io. And when there is no extents alloc, no meta data changes needed for direct io (since write_begin start trans for 2 reason: alloc extents & change i_size. Now none of them needed). So we can skip start trans procedure. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: test target page before change itRyan Ding1-6/+26
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Direct io data will not appear in buffer. The w_target_page member will not be filled by direct io. So avoid to use it when it's NULL. Unlinke buffer io and mmap, direct io will call write_begin with more than 1 page a time. So the target_index is not sufficient to describe the actual data. change it to a range start at target_index, end in end_index. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: use c_new to indicate newly allocated extentsRyan Ding1-10/+12
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. There is a problem in ocfs2's direct io implement: if system crashed after extents allocated, and before data return, we will get a extent with dirty data on disk. This problem violate the journal=order semantics, which means meta changes take effect after data written to disk. To resolve this issue, direct write can use the UNWRITTEN flag to describe a extent during direct data writeback. The direct write procedure should act in the following order: phase 1: alloc extent with UNWRITTEN flag phase 2: submit direct data to disk, add zero page to page cache phase 3: clear UNWRITTEN flag when data has been written to disk This patch is to change the 'c_unwritten' member of ocfs2_write_cluster_desc to 'c_clear_unwritten'. Means whether to clear the unwritten flag. It do not care if a extent is allocated or not. And use 'c_new' to specify a newly allocated extent. So the direct io procedure can use c_clear_unwritten to control the UNWRITTEN bit on extent. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: add ocfs2_write_type_t type to identify the caller of writeRyan Ding3-13/+22
Patchset: fix ocfs2 direct io code patch to support sparse file and data ordering semantics The idea is to use buffer io(more precisely use the interface ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work beyond block size. And clear UNWRITTEN flag until direct io data has been written to disk, which can prevent data corruption when system crashed during direct write. And we will also archive a better performance: eg. dd direct write new file with block size 4KB: before this patchset: 2.5 MB/s after this patchset: 66.4 MB/s This patch (of 8): To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Remove unused args filp & flags. Add new arg type. The type is one of buffer/direct/mmap. Indicate 3 way to perform write. buffer/mmap type has implemented. direct type will be implemented later. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: o2hb: fix double free bugJunxiao Bi1-2/+2
This is a regression issue and caused the following kernel panic when do ocfs2 multiple test. BUG: unable to handle kernel paging request at 00000002000800c0 IP: [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160 PGD 7bbe5067 PUD 0 Oops: 0000 [#1] SMP Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront CPU: 2 PID: 4044 Comm: mpirun Not tainted 4.5.0-rc5-next-20160225 #1 Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014 task: ffff88007a521a80 ti: ffff88007aed0000 task.ti: ffff88007aed0000 RIP: 0010:[<ffffffff81192978>] [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160 RSP: 0018:ffff88007aed3a48 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001991 RDX: 0000000000001990 RSI: 00000000024000c0 RDI: 000000000001b330 RBP: ffff88007aed3a98 R08: ffff88007d29b330 R09: 00000002000800c0 R10: 0000000c51376d87 R11: ffff8800792cac38 R12: ffff88007cc30f00 R13: 00000000024000c0 R14: ffffffff811b053f R15: ffff88007aed3ce7 FS: 0000000000000000(0000) GS:ffff88007d280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000002000800c0 CR3: 000000007aeb2000 CR4: 00000000000406e0 Call Trace: __d_alloc+0x2f/0x1a0 d_alloc+0x17/0x80 lookup_dcache+0x8a/0xc0 path_openat+0x3c3/0x1210 do_filp_open+0x80/0xe0 do_sys_open+0x110/0x200 SyS_open+0x19/0x20 do_syscall_64+0x72/0x230 entry_SYSCALL64_slow_path+0x25/0x25 Code: 05 e6 77 e7 7e 4d 8b 08 49 8b 40 10 4d 85 c9 0f 84 dd 00 00 00 48 85 c0 0f 84 d4 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 3c 01 75 b6 49 63 RIP kmem_cache_alloc+0x78/0x160 CR2: 00000002000800c0 ---[ end trace 823969e602e4aaac ]--- Fixes: a4a1dfa4bb8b("ocfs2/cluster: fix memory leak in o2hb_region_release") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25drivers/input: eliminate INPUT_COMPAT_TEST macroAndrew Morton4-9/+7
INPUT_COMPAT_TEST became much simpler after commit f4056b52845283 ("input: redefine INPUT_COMPAT_TEST as in_compat_syscall()") so we can cleanly eliminate it altogether. Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25oom, oom_reaper: protect oom_reaper_list using simpler wayTetsuo Handa2-8/+2
"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it by simply checking tsk->oom_reaper_list != NULL. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25oom: make oom_reaper freezableMichal Hocko1-0/+2
After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25oom: make oom_reaper_list single linkedVladimir Davydov2-9/+8
Entries are only added/removed from oom_reaper_list at head so we can use a single linked list and hence save a word in task_struct. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25oom, oom_reaper: disable oom_reaper for oom_kill_allocating_taskMichal Hocko2-1/+7
Tetsuo has reported that oom_kill_allocating_task=1 will cause oom_reaper_list corruption because oom_kill_process doesn't follow standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue the same task multiple times - e.g. by sacrificing the same child multiple times. This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag which is set in oom_kill_process atomically and oom reaper is disabled if the flag was already set. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, oom_reaper: implement OOM victims queuingMichal Hocko2-17/+22
wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, oom_reaper: report success/failureMichal Hocko1-2/+15
Inform about the successful/failed oom_reaper attempts and dump all the held locks to tell us more who is blocking the progress. [akpm@linux-foundation.org: fix CONFIG_MMU=n build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address spaceMichal Hocko3-27/+50
When oom_reaper manages to unmap all the eligible vmas there shouldn't be much of the freable memory held by the oom victim left anymore so it makes sense to clear the TIF_MEMDIE flag for the victim and allow the OOM killer to select another task. The lack of TIF_MEMDIE also means that the victim cannot access memory reserves anymore but that shouldn't be a problem because it would get the access again if it needs to allocate and hits the OOM killer again due to the fatal_signal_pending resp. PF_EXITING check. We can safely hide the task from the OOM killer because it is clearly not a good candidate anymore as everyhing reclaimable has been torn down already. This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE and thus hold off further global OOM killer actions granted the oom reaper is able to take mmap_sem for the associated mm struct. This is not guaranteed now but further steps should make sure that mmap_sem for write should be blocked killable which will help to reduce such a lock contention. This is not done by this patch. Note that exit_oom_victim might be called on a remote task from __oom_reap_task now so we have to check and clear the flag atomically otherwise we might race and underflow oom_victims or wake up waiters too early. Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, oom: introduce oom reaperMichal Hocko4-13/+162
This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25sched: add schedule_timeout_idle()Andrew Morton2-0/+12
This will be needed in the patch "mm, oom: introduce oom reaper". Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25[IA64] Enable preadv2 and pwritev2 syscalls for ia64Tony Luck3-1/+5
New system calls added in: f17d8b35452cab31a70d224964cd583fb2845449 vfs: vfs: Define new syscalls preadv2,pwritev2 Signed-off-by: Tony Luck <tony.luck@intel.com>
2016-03-25Fix permissions of drivers/power/avs/rockchip-io-domain.cRafael J. Wysocki1-0/+0
The permissions of this file were modified by commit (f447671b9e4f PM / AVS: rockchip-io: add io selectors and supplies for rk3399) by mistake, so fix them. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-25libceph: use KMEM_CACHE macroGeliang Tang1-8/+2
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25ceph: use kmem_cache_zallocGeliang Tang2-2/+2
Use kmem_cache_zalloc() instead of kmem_cache_alloc() with flag GFP_ZERO. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25rbd: use KMEM_CACHE macroGeliang Tang1-8/+2
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25ceph: use lookup request to revalidate dentryYan, Zheng2-0/+35
If dentry has no lease, ceph_d_revalidate() previously return 0. This causes VFS to invalidate the dentry and create a new dentry for later lookup. Invalidating a dentry also detach any underneath mount points. So mount point inside cephfs can disapear mystically (even the mount point is not modified by other hosts). The fix is using lookup request to revalidate dentry without lease. This can partly solve the mount points disapear issue (as long as the mount point is not modified by other hosts) Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: kill ceph_get_dentry_parent_inode()Yan, Zheng2-20/+5
use vfs helper dget_parent() instead Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: fix security xattr deadlockYan, Zheng8-11/+125
When security is enabled, security module can call filesystem's getxattr/setxattr callbacks during d_instantiate(). For cephfs, d_instantiate() is usually called by MDS' dispatch thread, while handling MDS reply. If the MDS reply does not include xattrs and corresponding caps, getxattr/setxattr need to send a new request to MDS and waits for the reply. This makes MDS' dispatch sleep, nobody handles later MDS replies. The fix is make sure lookup/atomic_open reply include xattrs and corresponding caps. So getxattr can be handled by cached xattrs. This requires some modification to both MDS and request message. (Client tells MDS what caps it wants; MDS encodes proper caps in the reply) Smack security module may call setxattr during d_instantiate(). Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL to us. So just make setxattr return error when called by MDS' dispatch thread. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: don't request vxattrs from MDSYan, Zheng1-2/+4
It's uselese because MDS reply does not carry any vxattr. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: fix mounting same fs multiple timesYan, Zheng1-18/+15
Now __ceph_open_session() only accepts closed client. An opened client will tigger BUG_ON(). Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: remove unnecessary NULL checkYan, Zheng1-2/+2
If page->mapping is NULL, releasepage() callback does not get called. Remove the unnecessary NULL check to make static code analysis tool happy Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: avoid updating directory inode's i_size accidentallyYan, Zheng1-0/+4
Directory inode's i_size is used by readdir cache. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: fix race during filling readdir cacheYan, Zheng1-2/+7
Readdir cache uses page cache to save dentry pointers. When adding dentry pointers to middle of a page, we need to make sure the page already exists. Otherwise the beginning part of the page will be invalid pointers. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25libceph: use sizeof_footer() moreIlya Dryomov1-16/+3
Don't open-code sizeof_footer() in read_partial_message() and ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now possible to use con_out_kvec_add() in prepare_write_message_footer(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2016-03-25ceph: kill ceph_empty_snapcIlya Dryomov4-34/+6
ceph_empty_snapc->num_snaps == 0 at all times. Passing such a snapc to ceph_osdc_alloc_request() (possibly through ceph_osdc_new_request()) is equivalent to passing NULL, as ceph_osdc_alloc_request() uses it only for sizing the request message. Further, in all four cases the subsequent ceph_osdc_build_request() is passed NULL for snapc, meaning that 0 is encoded for seq and num_snaps and making ceph_empty_snapc entirely useless. The two cases where it actually mattered were removed in commits 860560904962 ("ceph: avoid sending unnessesary FLUSHSNAP message") and 23078637e054 ("ceph: fix queuing inode to mdsdir's snaprealm"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: fix a wrong comparisonAnton Protopopov1-1/+1
A negative value rc compared to the positive value ENOENT in the finish_read() function. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: replace CURRENT_TIME by current_fs_time()Deepa Dinamani4-6/+6
CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_fs_time() instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25ceph: scattered page writebackYan, Zheng1-109/+196
This patch makes ceph_writepages_start() try using single OSD request to write all dirty pages within a strip unit. When a nonconsecutive dirty page is found, ceph_writepages_start() tries starting a new write operation to existing OSD request. If it succeeds, it uses the new operation to writeback the dirty page. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25libceph: add helper that duplicates last extent operationYan, Zheng2-0/+24
This helper duplicates last extent operation in OSD request, then adjusts the new extent operation's offset and length. The helper is for scatterd page writeback, which adds nonconsecutive dirty pages to single OSD request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: enable large, variable-sized OSD requestsIlya Dryomov3-19/+32
Turn r_ops into a flexible array member to enable large, consisting of up to 16 ops, OSD requests. The use case is scattered writeback in cephfs and, as far as the kernel client is concerned, 16 is just a made up number. r_ops had size 3 for copyup+hint+write, but copyup is really a special case - it can only happen once. ceph_osd_request_cache is therefore stuffed with num_ops=2 requests, anything bigger than that is allocated with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which means either num_ops=1 or num_ops=2 for use_mempool=true - all existing users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: osdc->req_mempool should be backed by a slab poolIlya Dryomov1-2/+2
ceph_osd_request_cache was introduced a long time ago. Also, osd_req is about to get a flexible array member, which ceph_osd_request_cache is going to be aware of. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Privacy Policy