Closed
Bug 940851
Opened 11 years ago
Closed 11 years ago
Gecko IPC hangs during boot on jb-gonk
Categories
(Core Graveyard :: Widget: Gonk, defect, P1)
Tracking
(blocking-b2g:1.3+, firefox28 fixed)
Tracking | Status | |
---|---|---|
firefox28 | --- | fixed |
People
(Reporter: tkundu, Assigned: sushilchauhan)
Details
Attachments
(5 files, 2 obsolete files)
I observed that multicore device with FFOS 1.3 freezes just after bootup. It displays homescreen but without any icons and also we cannot pan homescreen.
Further debugging shows that Homescreen sends synchronous IPC reuest to b2g process and waits for acknowledgement from b2g . But B2G process never replies back to Homescreen process.
I also observed that Gecko_IOThread (used by IPC) is consuming 50% CPU continuously.
This issue happens randomly and it is easier to reproduce with |adb shell setprop debug.composition.type dyn |
Please note that you may need to reboot multiple times to see this issue on multicore device (JB port of FFOS 1.3).
I attached logcat for IPC transaction during boot.
I also see that device has following process running during this hangs.
adb shell b2g-ps
APPLICATION USER PID PPID VSIZE RSS WCHAN PC NAME
b2g root 307 1 214792 71636 ffffffff 00000000 S /system/b2g/b2g
Homescreen u0_a995 995 307 70904 25316 ffffffff 00000000 S /system/b2g/plugin-container
plugin-containe root 1061 307 48784 11436 ffffffff 00000000 S /system/b2g/plugin-container
Reporter | ||
Comment 1•11 years ago
|
||
it shows that gecko IO thread's cpu usage is very high when this happens
Reporter | ||
Comment 2•11 years ago
|
||
I can also attach the patch which I used to generate this logs. Please let me know if it helps anybody
Reporter | ||
Updated•11 years ago
|
Severity: normal → blocker
blocking-b2g: --- → 1.3?
Priority: -- → P1
Updated•11 years ago
|
blocking-b2g: 1.3? → 1.3+
Updated•11 years ago
|
Target Milestone: --- → 1.3 Sprint 5 - 11/22
Reporter | ||
Comment 3•11 years ago
|
||
I also saw that multiple compositor thread gets created randomly in FFOS 1.3 JB port. But only one compositor thread is doing composition. I am not sure whether this is related to this bug or not.
For Ex: I can see following output in FFOS 1.3 JB port (running on multicore device). But only compositor thread id 871 is doing composition.
$ adb shell top -t | grep Compositor
307 871 3 0% S 182112K 87192K root Compositor /system/b2g/b2g
307 871 3 0% S 189660K 97632K root Compositor /system/b2g/b2g
307 871 0 3% S 227736K 77704K root Compositor /system/b2g/b2g
307 1042 0 0% S 227736K 77704K root Compositor /system/b2g/b2g
307 871 0 6% S 228080K 74104K root Compositor /system/b2g/b2g
307 871 0 5% S 227056K 72176K root Compositor /system/b2g/b2g
Comment 4•11 years ago
|
||
Hey Ben, you know something about IPC right? Could you please help aim Tapas in the right direction here if possible?
Flags: needinfo?(bent.mozilla)
I don't know what to make of these logs really. There should definitely only be one compositor thread in the parent process but many other threads (including many threads from other processes) can send layers transactions to it. It looks to me like the IO thread is just being flooded by way too many layers requests.
Tentatively moving this over to GFX in the hopes that these logs will mean more to the layers team.
Component: General → Graphics: Layers
Flags: needinfo?(bent.mozilla)
Product: Firefox OS → Core
Comment 6•11 years ago
|
||
(In reply to ben turner [:bent] (needinfo? encouraged) from comment #5)
> I don't know what to make of these logs really. There should definitely only
> be one compositor thread in the parent process but many other threads
> (including many threads from other processes) can send layers transactions
> to it. It looks to me like the IO thread is just being flooded by way too
> many layers requests.
>
> Tentatively moving this over to GFX in the hopes that these logs will mean
> more to the layers team.
NI :milan here to see if he has any ideas or can help with an assignee to investigate this.
Flags: needinfo?(milan)
Comment 7•11 years ago
|
||
What kind of device is this? Do we have access to it? CJ, is there one in Taipei, can somebody like Peter take a look?
Flags: needinfo?(milan) → needinfo?(cku)
Reporter | ||
Comment 8•11 years ago
|
||
Can you please try with any multicore device ? You may need to reboot multiple times to produce this issue
Steps to produce:
1) Flash any multicore device with JB port of FFOS 1.3
2) Reboot it . Then try to see whether homescreen is displaying all icons or it is panning or not.
3) If you see that homescreen is showing all icons in step-2 then reboot it again and follow step-2
If you can produce this issue in step-2 then please run |adb shell top -t| and you will see Gecko_IOThread is taking 50% CPU usage . You call also see that homescreen is hanging here : inside MessageChannel::SendAndWait()
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#487
Bsaically, homescreen is sending some IPDL request to b2g and waiting forever for reply in IPC request.
Peter, please cooperate with performance team on this issue
Flags: needinfo?(pchang)
Flags: needinfo?(cyu)
Reporter | ||
Comment 10•11 years ago
|
||
I am trying to debug it further myself. I am seeing that homescreen synchronous IPDL request are blocked at
MessageChannel::SendAndWait(Message* aMsg, Message* aReply)[1] and it waits on function WaitForSyncNotify()[2] forever. It happens randomly only between homescreen and b2g process during boot on multicore device
[1] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#487
[2] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1176
[3] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#908
I saw for a successful synchronous IPDL request, following thing happens
1) homescreen process calls [1] .
2) b2g process receives this synchronous message in MessageChannel::DispatchSyncMessage(const Message& aMsg) [3] and calls appropriate message handler based on message type
Can anyone tell me what happens between above two steps. I am curious to know implementation details which may help me to debug possible race condition .
Flags: needinfo?(bent.mozilla)
Comment 11•11 years ago
|
||
I tried to reproduce this on Nexus 4 with JB gonk and m-c gecko but didn't see the problem.
From your log, it seems the b2g process fails to deliver the IPC request to the actor on the main thread. There is also another process sending a request but never getting the response:
01-01 02:10:48.289 1061 1061 I Gecko : IPDL::PContent::SendPCrashReporterConstructor Entered
Even the composition thread is flooding, we should not see the consistent high cpu on the Gecko_IOThread. It looks like that some file descriptor the IO thread is monitoring is causing the spin.
Could you turn on IPC logging with your logging instrumentation. We need to see whether the IO threads in the 2 processes really send/recv the message over the wire. Thanks.
Flags: needinfo?(cyu)
Comment 12•11 years ago
|
||
And also, reproducing the problem with the profiler will be very helpful.
Yeah, I don't know what to recommend yet. I'd try what Cervantes suggests, the environment variable you need to set is MOZ_IPC_MESSAGE_LOG.
Could it be that the main thread is blocking while trying to do a synchronous layer transaction and that the compositor has a bunch of other layer transactions to process before it gets there?
Flags: needinfo?(bent.mozilla)
Reporter | ||
Comment 14•11 years ago
|
||
I saw that homescreen is stuck in following IPC request:
[time:22687463344][940][PBrowserChild] Sending Msg_GetDefaultScale([TODO])
Can anyone suggest for my comment in #comments 10?
Reporter | ||
Updated•11 years ago
|
Flags: needinfo?(cyu)
Comment 15•11 years ago
|
||
(In reply to Tapas Kumar Kundu from comment #10)
1. mozilla::ipc::MessageChannel::Send() calls mLink->SendMessage(). For IPC, mLink is of class mozilla::ipc::ProcessLink.
2. mozilla::ipc::ProcessLink::SendMessage() post a task to the IPC thread to run Transport::Send() (alias of IPC::Channel::Send()).
3. IPC::Channel::Send() calls IPC::Channel::ChannelImpl::Send().
4. IPC::Channel::ChannelImpl::Send() calls its OutputQueuePush() to enqueue the message and then ProcessOutgoingMessages() to flush the message.
5. ProcessOutgoingMessages() calls sendmsg() to send the message to over the socketpair file descriptor.
--------------------------- process boundary ---------------------------------
6. The IPC thread message loop calls IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking().
7. IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking() calls its ProcessIncomingMessages(), which calls recvmsg() to read the message from the socketpair file descriptor.
8. IPC::Channel::ChannelImpl::ProcessIncomingMessages() calls mozilla::ipc::ProcessLink::OnMessageReceived() of its listener_.
9. mozilla::ipc::ProcessLink::OnMessageReceived() calls mozilla::ipc::MessageChannel::OnMessageReceived() of its mChan.
10. mozilla::ipc::MessageChannel::OnMessageReceived() posts a mozilla::ipc::MessageChannel::DequeueTask to the main thread's message loop.
11. mozilla::ipc::MessageChannel::DequeueTask() calls mozilla::ipc::MessageChannel::RefCountedTask::Run(), which calls an instantiation of RunnableFunction or RunnableMethod template, which calls in instantiation of DispatchToMethod template.
12. DispatchToMethod calls mozilla::ipc::MessageChannel::OnMaybeDequeueOne(), which calls mozilla::ipc::MessageChannel::DispatchMessage()
13. mozilla::ipc::MessageChannel::DispatchMessage() calls DispatchSyncMessage(), which calls mListener->OnMessageReceived(). mListener is the protocol actor
Flags: needinfo?(cyu)
Comment 16•11 years ago
|
||
(In reply to Tapas Kumar Kundu from comment #14)
> Created attachment 8337162 [details]
> IPC logs
>
> I saw that homescreen is stuck in following IPC request:
> [time:22687463344][940][PBrowserChild] Sending Msg_GetDefaultScale([TODO])
>
> Can anyone suggest for my comment in #comments 10?
Sorry, my memory is blurred. This IPC log is also from the actors, not on the IPC thread. So we still don't know what's going on on the IPC thread. Maybe you can output the errno and dump the IPC message headers before sendmsg() and after recvmsg().
Comment 17•11 years ago
|
||
Cervantes,
Since you have the most knowledge here, can you please be the owner of the same?
Flags: needinfo?(cyu)
Comment 18•11 years ago
|
||
Tapas, on which device are you seeing this problem?
If this is device is not available in Taipei, can I have remote access to your device to investigate? You may set up a tunnel and email me the access and I can take a look.
Assignee: nobody → cyu
Flags: needinfo?(cyu) → needinfo?(tkundu)
Updated•11 years ago
|
Flags: needinfo?(pchang)
Reporter | ||
Comment 19•11 years ago
|
||
Hi Cervantes,
I sent you email for joining meeting with me online . I think that this is the best way to resolve it quickly. Thanks a lot for your help :)
Reporter | ||
Updated•11 years ago
|
Flags: needinfo?(cyu)
Reporter | ||
Updated•11 years ago
|
Flags: needinfo?(tkundu)
Summary: Gecko IPC hangs during boot for multicore device → Gecko IPC hangs during boot on jb-gonk
Comment 20•11 years ago
|
||
Flags: needinfo?(cyu)
Comment 21•11 years ago
|
||
In the debug session, we kept seeing backtrace of in the attachment on b2g's IO thread (5 out of 6 samples). This looks like quite suspicious.
It looks like we are having a spinning because of epoll_wait() failure. The backtrace is about libevent generating a warning of epoll_failure, see
http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/third_party/libevent/epoll.c#413
Since the error is "Invalid argument", I think this is the result of strerror(EINVAL). From epoll_wait() man page:
"EINVAL epfd is not an epoll file descriptor, or maxevents is less than or equal to zero."
Please check the parameters in calling epoll_wait() in
http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/third_party/libevent/epoll.c#407
to see why we have epoll_wait() failure.
Comment 22•11 years ago
|
||
To be more specific, please check that the file descriptor is a valid epollfd like:
lrwx------ root root 2013-11-27 06:00 36 -> anon_inode:[eventpoll]
under /proc/13849/fd, where 13849 is b2g's pid, and 36 is the epollfd.
If the failure is due to invalid file descriptor, you should see that the argument passed to epoll_wait pointing to a nonexisting one, or not "anon_inode:[eventpoll]"
If the failure is due to invalid maxevents, you should see it from gdb.
Reporter | ||
Comment 23•11 years ago
|
||
Hi Cervantes,
I shared logs with you by email . Please let me know when you are comfortable/free . I would like to see it together again.
Thanks a lot for your help :)
Flags: needinfo?(cyu)
Comment 24•11 years ago
|
||
From the log, there are something interesting:
under /proc/<b2g pid>/fd:
lr-x------ root root 1970-01-01 04:43 0 -> anon_inode:sync_fence
It looks like the graphics driver is using file descriptor 0.
and the following is from instrumentation to check for epoll_wait() arguments:
01-01 04:38:29.699 303 677 I Gecko : ALERTB2G EPOLLB2G WARN epoll_dispatch L:414 : epoll_wait: fd: 0, no_of_events=1, maxevents 32, timeout -1
So the IO thread is using epoll_wait() with a file descriptor that is not an epoll fd. And no one bails out from this error. We just keep getting EINVAL when we epoll_wait() this fd.
There are 2 problems with this:
1. file descriptor 0 is used as anon_inode:sync_fence. This file descriptor should always be /dev/null if b2g is launched by init, or points to stdin if it is launced from the shell like this
lrwx------ root root 2013-11-27 23:09 0 -> /dev/pts/0
I think you need to go back to use strace or QView to check why file descriptor 0 is closed and then used by the driver.
2. libevent's epollfd is changed to 0. We can know why the epollfd used by libevent is 0 right away with hardware watchpoint in gdb. Does the platform support this?
Flags: needinfo?(cyu)
Reporter | ||
Comment 25•11 years ago
|
||
Thanks a lot for your help . I found that stdin is closed by a bug in MDP HAL. We will upstream a fix for that soon. I already asked my colleague for that .
I also verified that IPC hang issue is not coming anymore ( I tried 20 times) after fixing stdin closing bug.
Comment 26•11 years ago
|
||
Good to hear that this is fixed. But I'll be interested in how a bug in MDP HAL would ever result in the MessagePumpLibevent using 0 in epoll_wait.
Reporter | ||
Comment 27•11 years ago
|
||
Cervantes Yu ,
MDP HAL closes fd 0 atleast two times without opening it. It happens during boot. when MDP HAL closes it 1st time , epoll_wait starts polling on fd 0 and it hangs when MDP closes it 2nd time. MDP HAL is running as a part of b2g process and Gecko_IOThread is also running part of b2g process. It seems to me that this is causing high CPI usage by Gecko_IOThread because it keeps polling on fd 0 continuously.
Thanks a lot for your help :)
Assignee | ||
Comment 28•11 years ago
|
||
Thanks Tapas and Cervantes Yu for pointing out fd 0 issue.
HAL relies on framework to set proper acquire fence fds on all HWC layers. So acquire and release fence fds of HWC_BACKGROUND layer need to be initialized as well. Otherwise fd 0 gets closed in closeAcquireFds() during hwc set call which is a valid fd and it leads to bad state if fd 0 is being used as stdin. There is 1 more path which can also lead to closing of fd 0 in HAL. We are also up-streaming the HAL fix.
Updated•11 years ago
|
Assignee: cyu → sushilchauhan
Component: Graphics: Layers → General
Product: Core → Firefox OS
Attachment #8341380 -
Flags: review?(dwilson)
Comment 29•11 years ago
|
||
Comment on attachment 8341380 [details] [diff] [review]
Initialize acquire and release fence fds of HWC_BACKGROUND layer.
Review of attachment 8341380 [details] [diff] [review]:
-----------------------------------------------------------------
LGTM
Attachment #8341380 -
Flags: review?(dwilson) → review+
Comment 30•11 years ago
|
||
Please add the bug number to the patch description
Component: General → Widget: Gonk
Product: Firefox OS → Core
Assignee | ||
Comment 31•11 years ago
|
||
Uploading HG friendly patch.
Attachment #8341380 -
Attachment is obsolete: true
Attachment #8341419 -
Flags: review+
Keywords: checkin-needed
Comment 32•11 years ago
|
||
nit: would be nice if the commit message described what the patch was doing rather than just regurgitating the title of this bug.
Assignee | ||
Comment 33•11 years ago
|
||
Initialize acquire and release fence fds of HWC_BACKGROUND layer. Otherwise fd 0 gets closed during hwc set call which is a valid fd and it leads to bad state if fd 0 is being used as stdin. HAL relies on framework to set proper acquire fence fds on all HWC layers.
Attachment #8341419 -
Attachment is obsolete: true
Attachment #8341461 -
Flags: review+
Comment 34•11 years ago
|
||
Keywords: checkin-needed
Comment 35•11 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
status-firefox28:
--- → fixed
Updated•11 years ago
|
Flags: in-moztrap-
Updated•6 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•