
ryzen 1700 on taichi x370 no OC random hard lock

Printed From: ASRock.com
Category: Technical Support
Forum Name: AMD Motherboards
Forum Description: Question about ASRock AMD motherboards
URL: https://forum.asrock.com/forum_posts.asp?TID=5595
Printed Date: 11 May 2024 at 4:26am
Software Version: Web Wiz Forums 12.04 - http://www.webwizforums.com


Topic: ryzen 1700 on taichi x370 no OC random hard lock
Posted By: ioio
Subject: ryzen 1700 on taichi x370 no OC random hard lock
Date Posted: 15 Jul 2017 at 8:04am
cpu / mobo:  ryzen 1700 on taichi x370
Bios adjustments:  SMT disabled, OPcache disabled
BIOS Ver:  2.40.  I noticed a new one came out today.  I'll try this out next week.
System:  Ubuntu Server 16.04 64bit + HWE + mainline kernel 4.11.9
Memory: Crucial tech 16GB x2 (CT2K16G4DFD8213) operating in dual channel mode
PSU: New EVGA 750 GQ (GQ-0750-V1)
Root and Boot installed on ADATA SU800 M.2 2280 128GB (ASU800NS38-128GT-C)
No RAID 
Additional storage drives installed, I can provide details if requested.
video:  nvidia quadro NVS 420.  (the issue still occurs with no video card installed)
temps:  I've not done a detailed analysis of the CPU temps.  I can't check them remotely yet (I'm usually remote).  GPU core is 41C 
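
For checking temperatures over SSH, here is a minimal Python sketch that reads the kernel's hwmon files. It assumes the running kernel exposes sensors under /sys/class/hwmon as temp*_input files in millidegrees Celsius; whether a CPU sensor shows up at all on this board depends on the kernel's drivers. The function name read_hwmon_temps is made up for illustration.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"<chip>/<sensor file>": degrees Celsius} for each readable sensor."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the chip has no "name" file.
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor present but not readable; skip it
            temps[f"{chip_name}/{sensor.name}"] = millidegrees / 1000.0
    return temps
```

Calling read_hwmon_temps() and printing the returned dictionary is enough for a quick remote check. An empty result only means the kernel exposes no temperature sensors, not that the hardware has none.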

I'm experiencing a hard lockup after 3-9 hours of runtime.  The screen goes black, keyboard lights turn off, the network goes away.  The only option is a power cycle.  I see no hardware errors in syslog, just some complaints from docker and VirtualBox about the mainline kernel.  After one crash I noted a Dr. Debug code of 90.  I'll take more samples next week.

The system does not lock up when I schedule a reboot every 3 hours in cron.  It has currently been running for 1.5 days since cron was set up.
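
Since a hard lock leaves nothing in syslog, one way to measure how long each run lasted is a heartbeat file that is flushed to disk at a fixed interval; the last line written before the power cycle gives the approximate uptime at the lockup. The following is a sketch only. The log file name, the 60 second interval, and the function names are assumptions, not something used in this thread.

```python
import os
import time
from datetime import datetime, timezone


def read_uptime_seconds(path="/proc/uptime"):
    """The first field of /proc/uptime is the number of seconds since boot."""
    with open(path) as f:
        return float(f.read().split()[0])


def heartbeat_line(uptime_seconds, now=None):
    """Format one log line: UTC timestamp plus uptime in hours."""
    now = now or datetime.now(timezone.utc)
    return f"{now.isoformat()} uptime_h={uptime_seconds / 3600.0:.2f}"


def run(log_path="heartbeat.log", interval_seconds=60):
    """Append a heartbeat forever; fsync so the line survives a hard lock."""
    while True:
        with open(log_path, "a") as log:
            log.write(heartbeat_line(read_uptime_seconds()) + "\n")
            log.flush()
            os.fsync(log.fileno())
        time.sleep(interval_seconds)
```

Calling run() from a small launcher script started at boot would leave a trail of lines, and the final uptime_h value before each gap shows how long the system ran before it locked up.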

No attempt has been or will be made to overclock the hardware.  BIOS config is stock except for SMT disabled and OPcache disabled.

Load on the system is very light (as seen with htop).  Current roles are:
  >  LVM managing 4 4TB disks
  >  recording 8 512k video streams over NFS
  >  constant file sync + compression with 2 other servers (syncthing)
  >  file sharing via samba domain membership 

When researching I found several close-sounding matches to my issue that seem to have been resolved with a kernel and BIOS upgrade, which I've tried.  I also see that programmers are running into trouble when compiling, but I've not tried that.  The majority of hits relate to overclocking, which I'm not interested in at this time.  I would be willing to adjust settings or underclock to resolve this issue.

I'm hoping to overcome these crashes!  Early next week I'm planning to reseat/swap/test out memory, reflash/reset bios, and try an alternative NIC.  I'd appreciate any suggestions or advice to resolve the issue.

thanks



Replies:
Posted By: MisterJ
Date Posted: 15 Jul 2017 at 8:29am
" rel="nofollow - ioio, I cannot find your memory in the qualified list for Taichi.  What slots are you using (A2 & B2)?  Have you done a Load UEFI Defaults?  I am curious why you are running with SMT and OPcache disabled?  I had a similar problem (see signature for specs) and it turned out to be bad memory slots A1 and A2. A MB RMA solved the problem.  Have you opened a ticket with ASRock?  Please do, if not.  Can you beg, borrow or steal some memory on the qualified list?  For testing, try one stick in A1 and if it fails, try A2.  Continue until all slots are tested and if all fail, then try the other stick.  Good luck and enjoy, John.


-------------
Fat1 X399 Pro Gaming, TR 1950X, RAID0 3xSamsung SSD 960 EVO, G.SKILL FlareX F4-3200C14Q-32GFX, Win 10 x64 Pro, Enermx Platimax 850, Enermx Liqtech TR4 CPU Cooler, Radeon RX580, BIOS 2.00, 2xHDDs WD


Posted By: ioio
Date Posted: 16 Jul 2017 at 3:50am
John,

Thank you for sharing your experience.  I just put some qualified RAM on order  (HX421C14FBK2/8).  I should have it to test on Tuesday.  On Monday I'll try other slot configurations as you suggest.  

I went with non-qualified RAM because it's ECC at a decent price.   The 2 qualified ECC modules are scarce and expensive.

I disabled SMT and OPcache because my research found there might be problems with them on Linux.  AMD is giving that instruction to some users in their support forums.  It's affecting programmers with specific workloads.  I don't think I'm affected by this issue but I tried turning them off regardless.  I did have to upgrade to an unsupported kernel for a stable system; I think that is a must for all Ubuntu LTS/Ryzen users because of Ryzen's SMT implementation.  I will reset back to defaults on Monday and continue testing.

I'll open a ticket with ASRock support.  

I'll post back if it gets resolved.  Thanks again for the help.


Posted By: MisterJ
Date Posted: 16 Jul 2017 at 4:10am
Thanks for the update, ioio.  I did not realize your RAM was ECC.  Few here use it, I suspect.  You may be exploring almost virgin territory.  Hopefully you can get a refund on one or the other memories.  I'll keep an eye for updates.  Thanks and enjoy, John.




Posted By: wardog
Date Posted: 16 Jul 2017 at 9:56am
ioio wrote:

On Monday I'll try other slot configurations as you suggest.


A2 and B2 are stated in your manual and online on the MB's Specification tab as a table beside Memory.

Crucial says they are compatible, BUT they are not ECC sticks. Both Crucial and other sites confirm this.

http://www.crucial.com/usa/en/x370-taichi/CT9993048#productDetails

Product Specifications
Brand Crucial
Form Factor UDIMM
Total Capacity 32GB Kit (16GBx2)
Warranty Limited Lifetime
Specs DDR4 PC4-17000 • CL=15 • Dual Ranked • x8 based • Unbuffered • NON-ECC • DDR4-2133 • 1.2V
Series Crucial
ECC NON-ECC
Module Qty 2
Speed 2133 MT/S
Voltage 1.2V
DIMM Type Unbuffered


Posted By: ioio
Date Posted: 16 Jul 2017 at 11:26am
Good catch, wardog.  Somehow I got the impression that it was ECC.  Oops!  It looks like I need CT9993113 for ECC.  I'll try that if I can get the system stable on certified modules.  

This saved me time down the road, when I would have been trying to figure out why ECC tests fail.

This system is recording and syncing files non-stop so I would like the extra protection ECC provides.
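
Once ECC modules are installed, one way to see from Linux whether the kernel is tracking ECC events is the EDAC sysfs tree. The sketch below assumes the kernel's EDAC driver for the platform is loaded and exposes ce_count and ue_count files under /sys/devices/system/edac/mc/mc*; the function name is made up for illustration.

```python
from pathlib import Path


def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"corrected": n, "uncorrected": m}, ...} per memory controller."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # controller directory without readable counters
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts
```

An empty result means no memory controller is registered with EDAC. That is what one would expect with non-ECC modules, but it can also mean the driver is simply not loaded, so it is a hint rather than proof.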

Thanks! 


Posted By: ioio
Date Posted: 18 Jul 2017 at 4:58am
This morning I applied BIOS update 3.00 and reset UEFI to defaults.  The system has been stable under load for 6.5 hours.  I haven't seen it pass 10 hours without crashing, so I'll have a better idea of the status tomorrow.  


Posted By: MisterJ
Date Posted: 18 Jul 2017 at 8:01am
Thanks, ioio.  BIOS 3.00 does seem to help in the memory area, judging from the posts here.  I assume you continue to run your initial RAM.  Enjoy, John.




Posted By: ioio
Date Posted: 28 Jul 2017 at 11:49pm
I think the system is running stable now.  I'm currently at 50 hours of uptime.  I think the key adjustment was to disable Global C-state Control in the BIOS.  Here are some notes:
  • I have not yet installed the certified RAM.  Still using the CT2K16G4DFD8213
  • I tried different RAM configurations. For this 50 hour run, I had one stick in slot B2
  • I just booted with the recommended RAM configuration of A2 & B2 to see if it is stable with disabled c-state control
  • Upgrading to BIOS 3.00 did not fix the crashing
  • I found some threads on Reddit from Debian users with similar issues that reported success disabling global c-state control
  • I'm running mainline kernel 4.11.9-041109-generic.  Kernel 4.10 was just released to Ubuntu 16.04 with HWE activated.  I will try switching to that if things stay stable.

I'll report back on the c-state fix after more verification. 


Posted By: ioio
Date Posted: 31 Jul 2017 at 11:01pm
No crashes since I disabled Global C-state Control.  Approaching 60 hours with the same RAM configuration I was using when I started this thread.
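
To see which idle states the kernel itself reports, the cpuidle sysfs directory can be read, as in the Python sketch below. It assumes the kernel exposes state*/name and state*/usage files under /sys/devices/system/cpu/cpu0/cpuidle. How the BIOS Global C-state Control option changes this list is not confirmed anywhere in this thread, so treat the output as a view of what the kernel sees and nothing more. The function name is hypothetical.

```python
from pathlib import Path


def cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(state name, times entered), ...] in state-index order."""
    states = []
    state_dirs = sorted(Path(cpu_dir).glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    for state in state_dirs:
        try:
            name = (state / "name").read_text().strip()
            usage = int((state / "usage").read_text())
        except (OSError, ValueError):
            continue  # skip states with missing or unreadable files
        states.append((name, usage))
    return states
```

Comparing the output before and after changing the BIOS option, on the same kernel, would show whether the deeper states are still being entered.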


Posted By: ket
Date Posted: 01 Aug 2017 at 12:28am
Sounds like the root cause of the problem is how CPU voltage is adjusted with C-states. Manually setting CPU voltage to 1.22-1.25 V may be another solution to your problem.


