
Updated

•

11 years ago

blocking-b2g: 1.3? → 1.3+

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Updated

•

11 years ago

Target Milestone: --- → 1.3 Sprint 5 - 11/22

Reporter

Comment 3

•

11 years ago

I also saw that multiple compositor threads get created randomly in the FFOS 1.3 JB port, but only one compositor thread is doing composition. I am not sure whether this is related to this bug or not. For example, I can see the following output on the FFOS 1.3 JB port (running on a multicore device), where only compositor thread ID 871 is doing composition:

$ adb shell top -t | grep Compositor
307 871 3 0% S 182112K 87192K root Compositor /system/b2g/b2g
307 871 3 0% S 189660K 97632K root Compositor /system/b2g/b2g
307 871 0 3% S 227736K 77704K root Compositor /system/b2g/b2g
307 1042 0 0% S 227736K 77704K root Compositor /system/b2g/b2g
307 871 0 6% S 228080K 74104K root Compositor /system/b2g/b2g
307 871 0 5% S 227056K 72176K root Compositor /system/b2g/b2g
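To make the per-thread picture in the output above easier to read, here is a minimal sketch that groups the CPU samples by thread ID. The column layout (PID, TID, PR, CPU%, S, VSS, RSS, UID, Thread, Proc) is an assumption inferred from the sample lines, and `cpu_by_thread` is a made-up helper name, not part of any tool.

```python
from collections import defaultdict

# Sample lines copied from the `top -t` output quoted in this comment.
SAMPLE = """\
307 871 3 0% S 182112K 87192K root Compositor /system/b2g/b2g
307 871 3 0% S 189660K 97632K root Compositor /system/b2g/b2g
307 871 0 3% S 227736K 77704K root Compositor /system/b2g/b2g
307 1042 0 0% S 227736K 77704K root Compositor /system/b2g/b2g
307 871 0 6% S 228080K 74104K root Compositor /system/b2g/b2g
307 871 0 5% S 227056K 72176K root Compositor /system/b2g/b2g
"""


def cpu_by_thread(top_output, thread_name="Compositor"):
    """Map thread ID -> list of CPU% samples for threads with the given name."""
    samples = defaultdict(list)
    for line in top_output.splitlines():
        fields = line.split()
        # Assumed layout: PID TID PR CPU% S VSS RSS UID Thread Proc
        if len(fields) < 10 or fields[8] != thread_name:
            continue
        tid = int(fields[1])
        cpu = int(fields[3].rstrip("%"))
        samples[tid].append(cpu)
    return dict(samples)
```

With the sample above, thread 1042 shows up once at 0% CPU while thread 871 accounts for every non-zero sample, which matches the observation that only 871 is compositing.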

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 4

•

11 years ago

Hey Ben, you know something about IPC right? Could you please help aim Tapas in the right direction here if possible?

Flags: needinfo?(bent.mozilla)

Comment 5

•

11 years ago

I don't know what to make of these logs really. There should definitely only be one compositor thread in the parent process but many other threads (including many threads from other processes) can send layers transactions to it. It looks to me like the IO thread is just being flooded by way too many layers requests. Tentatively moving this over to GFX in the hopes that these logs will mean more to the layers team.

Component: General → Graphics: Layers

Flags: needinfo?(bent.mozilla)

Product: Firefox OS → Core

bhavana bajaj [:bajaj]

Comment 6

•

11 years ago

(In reply to ben turner [:bent] (needinfo? encouraged) from comment #5)
> I don't know what to make of these logs really. There should definitely only
> be one compositor thread in the parent process but many other threads
> (including many threads from other processes) can send layers transactions
> to it. It looks to me like the IO thread is just being flooded by way too
> many layers requests.
>
> Tentatively moving this over to GFX in the hopes that these logs will mean
> more to the layers team.

NI :milan here to see if he has any ideas or can help with an assignee to investigate this.

Flags: needinfo?(milan)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 7

•

11 years ago

What kind of device is this? Do we have access to it? CJ, is there one in Taipei, can somebody like Peter take a look?

Flags: needinfo?(milan) → needinfo?(cku)

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Reporter

Comment 8

•

11 years ago

Can you please try with any multicore device? You may need to reboot multiple times to reproduce this issue.

Steps to reproduce:
1) Flash any multicore device with the JB port of FFOS 1.3.
2) Reboot it, then check whether the homescreen displays all icons and whether it pans.
3) If the homescreen shows all icons in step 2, reboot again and repeat step 2.

If you can reproduce the issue in step 2, please run |adb shell top -t| and you will see Gecko_IOThread taking 50% CPU. You can also see that the homescreen is hanging inside MessageChannel::SendAndWait():
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#487

Basically, the homescreen sends an IPDL request to b2g and waits forever for the reply.

u459114

Comment 9

•

11 years ago

Peter, please cooperate with performance team on this issue

Flags: needinfo?(pchang)

Flags: needinfo?(cyu)

u459114

Updated

•

11 years ago

Flags: needinfo?(cku)

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Reporter

Comment 10

•

11 years ago

I am trying to debug this further myself. I see that the homescreen's synchronous IPDL requests are blocked in MessageChannel::SendAndWait(Message* aMsg, Message* aReply) [1], which waits in WaitForSyncNotify() [2] forever. It happens randomly, only between the homescreen and b2g processes, during boot on a multicore device.

[1] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#487
[2] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1176
[3] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#908

For a successful synchronous IPDL request, I saw the following happen:
1) The homescreen process calls [1].
2) The b2g process receives the synchronous message in MessageChannel::DispatchSyncMessage(const Message& aMsg) [3] and calls the appropriate message handler based on the message type.

Can anyone tell me what happens between these two steps? I am curious about the implementation details, which may help me debug a possible race condition.

Flags: needinfo?(bent.mozilla)

Comment 11

•

11 years ago

I tried to reproduce this on a Nexus 4 with JB gonk and m-c gecko but didn't see the problem. From your log, it seems the b2g process fails to deliver the IPC request to the actor on the main thread. There is also another process sending a request but never getting the response:

01-01 02:10:48.289 1061 1061 I Gecko : IPDL::PContent::SendPCrashReporterConstructor Entered

Even if the compositor thread is being flooded, we should not see consistently high CPU usage on Gecko_IOThread. It looks like some file descriptor the IO thread is monitoring is causing the spin. Could you turn on IPC logging along with your logging instrumentation? We need to see whether the IO threads in the two processes really send and receive the message over the wire. Thanks.

Flags: needinfo?(cyu)

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 12

•

11 years ago

And also, reproducing the problem with the profiler will be very helpful.

Comment 13

•

11 years ago

Yeah, I don't know what to recommend yet. I'd try what Cervantes suggests; the environment variable you need to set is MOZ_IPC_MESSAGE_LOG. Could it be that the main thread is blocking while trying to do a synchronous layer transaction and that the compositor has a bunch of other layer transactions to process before it gets there?

Flags: needinfo?(bent.mozilla)

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Reporter

Comment 14

•

11 years ago

Attached file IPC logs — Details

I saw that the homescreen is stuck in the following IPC request:

[time:22687463344][940][PBrowserChild] Sending Msg_GetDefaultScale([TODO])

Can anyone answer my question in comment 10?

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Reporter

Updated

•

11 years ago

Flags: needinfo?(cyu)

Comment 15

•

11 years ago

(In reply to Tapas Kumar Kundu from comment #10)

1. mozilla::ipc::MessageChannel::Send() calls mLink->SendMessage(). For IPC, mLink is of class mozilla::ipc::ProcessLink.
2. mozilla::ipc::ProcessLink::SendMessage() posts a task to the IPC thread to run Transport::Send() (alias of IPC::Channel::Send()).
3. IPC::Channel::Send() calls IPC::Channel::ChannelImpl::Send().
4. IPC::Channel::ChannelImpl::Send() calls its OutputQueuePush() to enqueue the message and then ProcessOutgoingMessages() to flush the message.
5. ProcessOutgoingMessages() calls sendmsg() to send the message over the socketpair file descriptor.

--------------------------- process boundary ---------------------------------

6. The IPC thread message loop calls IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking().
7. IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking() calls its ProcessIncomingMessages(), which calls recvmsg() to read the message from the socketpair file descriptor.
8. IPC::Channel::ChannelImpl::ProcessIncomingMessages() calls mozilla::ipc::ProcessLink::OnMessageReceived() of its listener_.
9. mozilla::ipc::ProcessLink::OnMessageReceived() calls mozilla::ipc::MessageChannel::OnMessageReceived() of its mChan.
10. mozilla::ipc::MessageChannel::OnMessageReceived() posts a mozilla::ipc::MessageChannel::DequeueTask to the main thread's message loop.
11. mozilla::ipc::MessageChannel::DequeueTask() calls mozilla::ipc::MessageChannel::RefCountedTask::Run(), which calls an instantiation of the RunnableFunction or RunnableMethod template, which calls an instantiation of the DispatchToMethod template.
12. DispatchToMethod calls mozilla::ipc::MessageChannel::OnMaybeDequeueOne(), which calls mozilla::ipc::MessageChannel::DispatchMessage().
13. mozilla::ipc::MessageChannel::DispatchMessage() calls DispatchSyncMessage(), which calls mListener->OnMessageReceived(). mListener is the protocol actor.
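The path above can be sketched as a toy model, which may help when reasoning about where a synchronous request can stall. This is not Gecko code: it uses two threads in one process instead of two processes, and `run_sync_request`, `handler`, and the `io_thread_stalled` switch are all hypothetical names. The timeout exists only so the sketch terminates; the wait described in comment 10 has no such limit.

```python
import queue
import socket
import threading


def run_sync_request(payload, handler, timeout=2.0, io_thread_stalled=False):
    """Send one message across a socketpair and block for the reply.

    Returns the reply bytes, or None if nothing arrives within `timeout`.
    """
    sender_sock, receiver_sock = socket.socketpair()
    main_loop = queue.Queue()  # stands in for the receiver's main-thread message loop

    def receiver_io_thread():
        # Steps 6-10: read the message off the socket and post it to the main thread.
        if io_thread_stalled:
            return  # model an IO thread that never services the socket
        main_loop.put(receiver_sock.recv(4096))

    def receiver_main_thread():
        # Steps 11-13: dequeue, dispatch to the actor (`handler`), send the reply.
        try:
            message = main_loop.get(timeout=timeout)
        except queue.Empty:
            return
        receiver_sock.sendall(handler(message))

    workers = [threading.Thread(target=receiver_io_thread),
               threading.Thread(target=receiver_main_thread)]
    for worker in workers:
        worker.start()

    # Steps 1-5: write the message to the socket, then wait for the reply.
    sender_sock.sendall(payload)
    sender_sock.settimeout(timeout)
    try:
        reply = sender_sock.recv(4096)
    except socket.timeout:
        reply = None

    for worker in workers:
        worker.join()
    sender_sock.close()
    receiver_sock.close()
    return reply
```

Calling it with `io_thread_stalled=True` shows the failure mode discussed in this bug: the message is written successfully, but the sender never gets a reply because nothing on the receiving side reads the socket.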

Flags: needinfo?(cyu)

Preeti Raghunath(:Preeti)

Comment 16

•

11 years ago

(In reply to Tapas Kumar Kundu from comment #14)
> Created attachment 8337162 [details]
> IPC logs
>
> I saw that homescreen is stuck in following IPC request:
> [time:22687463344][940][PBrowserChild] Sending Msg_GetDefaultScale([TODO])
>
> Can anyone suggest for my comment in #comments 10?

Sorry, my memory is blurred. This IPC log is also from the actors, not from the IPC thread, so we still don't know what's going on on the IPC thread. Maybe you can output the errno and dump the IPC message headers before sendmsg() and after recvmsg().
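As a rough illustration of that kind of instrumentation, the sketch below logs a message header before sendmsg() and after recvmsg(), and records the errno if either call fails. The 12-byte header layout (payload length, routing id, message type) is invented for the example and is not the real Gecko/Chromium IPC header; `send_logged` and `recv_logged` are hypothetical helpers.

```python
import socket
import struct

# Hypothetical header: payload length, routing id, message type.
# This is NOT the real IPC header layout; it only illustrates the logging.
HEADER = struct.Struct("<IiI")


def log_header(direction, data, log):
    length, routing, msg_type = HEADER.unpack_from(data)
    log.append(f"{direction} len={length} routing={routing} type={msg_type}")


def send_logged(sock, routing, msg_type, payload, log):
    """Log the header, then send header + payload with sendmsg()."""
    data = HEADER.pack(len(payload), routing, msg_type) + payload
    log_header("send", data, log)
    try:
        return sock.sendmsg([data])
    except OSError as e:
        log.append(f"send errno={e.errno} ({e.strerror})")
        raise


def recv_logged(sock, log, bufsize=4096):
    """Receive with recvmsg(), log the header, and return the payload."""
    try:
        data, _ancdata, _flags, _addr = sock.recvmsg(bufsize)
    except OSError as e:
        log.append(f"recv errno={e.errno} ({e.strerror})")
        raise
    log_header("recv", data, log)
    return data[HEADER.size:]
```

Comparing the "send" entries from one process with the "recv" entries from the other would show whether a given message actually crossed the wire.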

Comment 17

•

11 years ago

Cervantes, since you have the most knowledge here, could you please take ownership of this bug?

Flags: needinfo?(cyu)

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Comment 18

•

11 years ago

Tapas, on which device are you seeing this problem? If this device is not available in Taipei, can I have remote access to your device to investigate? You could set up a tunnel and email me the access details, and I can take a look.

Assignee: nobody → cyu

Flags: needinfo?(cyu) → needinfo?(tkundu)

Peter Chang[:pchang]

Updated

•

11 years ago

Flags: needinfo?(pchang)

Reporter

Comment 19

•

11 years ago

Hi Cervantes, I sent you an email invitation to an online meeting with me. I think this is the best way to resolve it quickly. Thanks a lot for your help :)

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Reporter

Updated

•

11 years ago

Flags: needinfo?(cyu)

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Reporter

Updated

•

11 years ago

Flags: needinfo?(tkundu)

Summary: Gecko IPC hangs during boot for multicore device → Gecko IPC hangs during boot on jb-gonk

Comment 20

•

11 years ago

Attached image The stack trace of epoll_wait failure — Details

Flags: needinfo?(cyu)

Comment 21

•

11 years ago

In the debug session, we kept seeing the backtrace in the attachment on b2g's IO thread (5 out of 6 samples), which looks quite suspicious. It looks like the thread is spinning because of an epoll_wait() failure. The backtrace shows libevent generating a warning about the epoll_wait() failure; see http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/third_party/libevent/epoll.c#413

Since the error is "Invalid argument", I think this is the result of strerror(EINVAL). From the epoll_wait() man page:

"EINVAL epfd is not an epoll file descriptor, or maxevents is less than or equal to zero."

Please check the parameters passed to epoll_wait() at http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/third_party/libevent/epoll.c#407 to see why epoll_wait() fails.
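The EINVAL case quoted from the man page can be reproduced in isolation. The sketch below (Linux only) wraps a pipe descriptor with `select.epoll.fromfd()`, which does not check that the descriptor is an epoll fd, and then polls it in a bounded loop. `spin_on_bad_epfd` is a hypothetical name, and the loop is only a model of an event loop that logs the failure and retries; it is not libevent's code.

```python
import errno
import os
import select


def spin_on_bad_epfd(max_iterations=1000):
    """Poll a non-epoll fd repeatedly and return the errno of each failure."""
    read_fd, write_fd = os.pipe()
    # fromfd() accepts any descriptor number; read_fd is a pipe, not an epoll fd.
    ep = select.epoll.fromfd(read_fd)
    failures = []
    try:
        for _ in range(max_iterations):
            try:
                # Timeout 0 keeps the sketch from blocking; the failure is
                # immediate regardless of the timeout.
                ep.poll(0)
            except OSError as e:
                failures.append(e.errno)
    finally:
        ep.close()          # closes read_fd
        os.close(write_fd)
    return failures
```

Every iteration fails immediately with `errno.EINVAL` instead of blocking, so a loop that keeps retrying never sleeps. That would be consistent with the steady CPU usage seen on Gecko_IOThread.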

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Comment 22

•

11 years ago

To be more specific, please check that the file descriptor is a valid epollfd, like:

lrwx------ root root 2013-11-27 06:00 36 -> anon_inode:[eventpoll]

under /proc/13849/fd, where 13849 is b2g's pid and 36 is the epollfd.

If the failure is due to an invalid file descriptor, you should see the argument passed to epoll_wait() pointing to a nonexistent entry, or to one that is not "anon_inode:[eventpoll]". If the failure is due to an invalid maxevents, you should see it in gdb.
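The same check can be scripted. This is a minimal sketch for Linux with /proc mounted; `fd_target`, `is_epoll_fd`, and `demo` are hypothetical helper names, and the demo inspects the current process rather than b2g.

```python
import os
import select


def fd_target(fd, pid="self"):
    """Return the symlink target of /proc/<pid>/fd/<fd>."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")


def is_epoll_fd(fd, pid="self"):
    """True if the descriptor is listed as an epoll instance under /proc."""
    return fd_target(fd, pid) == "anon_inode:[eventpoll]"


def demo():
    """Compare a real epoll fd with a pipe fd in the current process."""
    ep = select.epoll()
    read_fd, write_fd = os.pipe()
    try:
        return is_epoll_fd(ep.fileno()), is_epoll_fd(read_fd)
    finally:
        ep.close()
        os.close(read_fd)
        os.close(write_fd)
```

For another process, pass its pid, for example `is_epoll_fd(36, pid=13849)` with the numbers from the listing above; reading another process's fd directory needs sufficient privileges.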

Reporter

Comment 23

•

11 years ago

Hi Cervantes, I shared logs with you by email . Please let me know when you are comfortable/free . I would like to see it together again. Thanks a lot for your help :)

Flags: needinfo?(cyu)

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Comment 24

•

11 years ago

From the log, there is something interesting. Under /proc/<b2g pid>/fd:

lr-x------ root root 1970-01-01 04:43 0 -> anon_inode:sync_fence

It looks like the graphics driver is using file descriptor 0. The following is from the instrumentation added to check the epoll_wait() arguments:

01-01 04:38:29.699 303 677 I Gecko : ALERTB2G EPOLLB2G WARN epoll_dispatch L:414 : epoll_wait: fd: 0, no_of_events=1, maxevents 32, timeout -1

So the IO thread is calling epoll_wait() with a file descriptor that is not an epoll fd, and nothing bails out on this error. We just keep getting EINVAL when we epoll_wait() on this fd. There are two problems here:

1. File descriptor 0 is used as anon_inode:sync_fence. This file descriptor should always be /dev/null if b2g is launched by init, or point to stdin if it is launched from the shell, like this:

lrwx------ root root 2013-11-27 23:09 0 -> /dev/pts/0

I think you need to go back to strace or QView to check why file descriptor 0 is closed and then used by the driver.

2. libevent's epollfd is changed to 0. We can find out right away why the epollfd used by libevent is 0 with a hardware watchpoint in gdb. Does the platform support this?

Flags: needinfo?(cyu)

Reporter

Comment 25

•

11 years ago

Thanks a lot for your help. I found that stdin is closed by a bug in the MDP HAL. We will upstream a fix for that soon; I have already asked my colleague to do so. I also verified that the IPC hang no longer occurs (I tried 20 times) after fixing the stdin-closing bug.

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Comment 26

•

11 years ago

Good to hear that this is fixed. But I'll be interested in how a bug in MDP HAL would ever result in the MessagePumpLibevent using 0 in epoll_wait.

Reporter

Comment 27

•

11 years ago

Cervantes Yu, the MDP HAL closes fd 0 at least two times without opening it. This happens during boot. When the MDP HAL closes it the first time, epoll_wait() starts polling on fd 0, and it hangs when the MDP HAL closes it the second time. The MDP HAL runs as part of the b2g process, and Gecko_IOThread also runs as part of the b2g process. It seems to me that this is what causes the high CPU usage by Gecko_IOThread, because it keeps polling on fd 0 continuously. Thanks a lot for your help :)
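One way to read this explanation: descriptor numbers are handed out lowest-free-first, so once a component closes a number it does not own, the next open in the process can take that number while the original owner still holds it. The sketch below (Linux only) models the second close with an ordinary descriptor instead of fd 0; `reuse_epoll_fd_number` is a hypothetical name, and /dev/null stands in for the sync_fence descriptor seen in the logs.

```python
import errno
import os
import select


def reuse_epoll_fd_number():
    """Close an epoll fd behind its owner's back and let another open reuse it.

    Returns (epoll fd number, number given to the new open, errno from poll).
    """
    ep = select.epoll()
    epfd = ep.fileno()

    # A buggy component closes a descriptor it does not own...
    os.close(epfd)
    # ...and the next open in the process gets the lowest free number.
    new_fd = os.open(os.devnull, os.O_RDONLY)

    err = None
    try:
        ep.poll(0)  # `ep` still uses the old number, which is no longer an epoll fd
    except OSError as e:
        err = e.errno

    if new_fd == epfd:
        ep.close()        # closes the /dev/null descriptor now sitting at epfd
    else:
        os.close(new_fd)  # number was not reused (not expected in this sketch)
    return epfd, new_fd, err
```

In a single-threaded run the new descriptor gets the same number as the closed epoll fd, and the poll fails with `errno.EINVAL`, the same error reported for fd 0 in the instrumentation log.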

Assignee

Comment 28

•

11 years ago

Attached patch Initialize acquire and release fence fds of HWC_BACKGROUND layer. (obsolete) — Details — Splinter Review

Thanks Tapas and Cervantes Yu for pointing out the fd 0 issue. The HAL relies on the framework to set proper acquire fence fds on all HWC layers, so the acquire and release fence fds of the HWC_BACKGROUND layer need to be initialized as well. Otherwise fd 0, which is a valid fd, gets closed in closeAcquireFds() during the hwc set call, and this leads to a bad state if fd 0 is being used as stdin. There is one more path in the HAL that can also lead to fd 0 being closed. We are also upstreaming the HAL fix.
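A small model of why the initialization matters, under stated assumptions: `Layer` is a hypothetical stand-in for an HWC layer, and `close_acquire_fds` assumes the HAL closes any acquire fence fd that is not negative. The real closeAcquireFds() is not shown in this bug, so this only illustrates the idea. The `close` callable is injected so the sketch can record what would be closed without touching real descriptors.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for an HWC layer; only the fence fields are modeled."""
    name: str
    acquire_fence_fd: int = 0   # zero-filled memory reads as fd 0, a valid descriptor
    release_fence_fd: int = 0


def init_fences(layer):
    """Mark both fences as 'no fence' (-1), as the patch title describes."""
    layer.acquire_fence_fd = -1
    layer.release_fence_fd = -1
    return layer


def close_acquire_fds(layers, close):
    """Assumed HAL behavior: close every acquire fence fd that looks valid."""
    for layer in layers:
        if layer.acquire_fence_fd >= 0:
            close(layer.acquire_fence_fd)
            layer.acquire_fence_fd = -1
```

With an uninitialized layer the recorded close list is `[0]`, meaning stdin would have been closed. After `init_fences()` nothing is closed.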

Updated

•

11 years ago

Assignee: cyu → sushilchauhan

Assignee

Updated

•

11 years ago

Component: Graphics: Layers → General

Product: Core → Firefox OS

Assignee

Updated

•

11 years ago

Attachment #8341380 - Flags: review?(dwilson)

Diego Wilson [:diego]

Comment 29

•

11 years ago

Comment on attachment 8341380 [details] [diff] [review]
Initialize acquire and release fence fds of HWC_BACKGROUND layer.

Review of attachment 8341380 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM

Attachment #8341380 - Flags: review?(dwilson) → review+

Diego Wilson [:diego]

Comment 30

•

11 years ago

Please add the bug number to the patch description

Component: General → Widget: Gonk

Product: Firefox OS → Core

Assignee

Comment 31

•

11 years ago

Attached patch Bug 940851 - Gecko IPC hangs during boot on jb-gonk. r=dwilson (obsolete) — Details — Splinter Review

Uploading HG friendly patch.

Attachment #8341380 - Attachment is obsolete: true

Attachment #8341419 - Flags: review+

Assignee

Updated

•

11 years ago

Keywords: checkin-needed

Comment 32

•

11 years ago

nit: would be nice if the commit message described what the patch was doing rather than just regurgitating the title of this bug.

Assignee

Comment 33

•

11 years ago

Attached patch Bug 940851 - Initialize acquire & release-fence fds of HWC_BACKGROUND layer. r=dwilson — Details — Splinter Review

Initialize the acquire and release fence fds of the HWC_BACKGROUND layer. Otherwise fd 0, which is a valid fd, gets closed during the hwc set call, and this leads to a bad state if fd 0 is being used as stdin. The HAL relies on the framework to set proper acquire fence fds on all HWC layers.

Attachment #8341419 - Attachment is obsolete: true

Attachment #8341461 - Flags: review+