Hacker News
by pks2006 3418 days ago
I have always wanted to apply deep learning to my day-to-day work. We build our own hardware that runs Linux on an Intel CPU and then launches a virtual machine containing our proprietary code. Our code generates a lot of system logs, which vary with the boot sequence, ambient temperature, software config, etc. We spend a significant amount of time going over these logs when issues are reported. In many cases there is a 1-to-1 mapping from issue to logs, but frequently RCA'ing the issue requires knowing how the system works and correlating that with the logs generated. We have tons of these logs that could be used as a training set. Any clues on how we can put all this together so that RCA'ing an issue needs as little human involvement as possible?
2 comments

Use dumber ML first; try some random forests. Not because they're necessarily better or worse, but because DL requires an enormous amount of knowledge and fiddliness, and what you probably want is for the bulk of the actual work to go into setting up the data for ML, not into hyperparameter and architecture fiddling.
Before you try any sort of ML, explore your data [1, 2]. Exploratory data analysis may very well tell you that there is absolutely no point in making a fancy predictive model at all. If a few heuristics get you 90% of the way to an optimal solution, then don't even bother to start on machine learning, unless that last 10% is going to provide significant value.

[1] https://en.wikipedia.org/wiki/Exploratory_data_analysis

[2] From one of my mentors: http://www.unofficialgoogledatascience.com/2016/10/practical...
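
As a rough illustration of that kind of exploration, here is a minimal sketch, assuming each reported issue already has a list of raw log lines attached. The log lines, issue names, and the `template` and `templates_per_issue` helpers are all made up for illustration. The idea is just to mask the variable parts of each line and count which message templates show up under which issue.

```python
import re
from collections import Counter, defaultdict

def template(line):
    """Mask hex values and numbers so lines that differ only in values collapse together."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", line)

def templates_per_issue(records):
    """records: iterable of (issue_label, [log lines]). Returns {issue: Counter of templates}."""
    counts = defaultdict(Counter)
    for issue, lines in records:
        counts[issue].update(template(line) for line in lines)
    return counts

# Hypothetical toy data, not real logs.
records = [
    ("fan_failure", ["temp sensor 3 read 91C", "fan 2 rpm 0"]),
    ("fan_failure", ["temp sensor 1 read 88C", "fan 0 rpm 0"]),
    ("vm_boot_hang", ["vm start timeout after 30 s", "temp sensor 2 read 45C"]),
]
counts = templates_per_issue(records)
for issue, c in counts.items():
    print(issue, c.most_common(2))
```

If a template only ever appears under a single issue, a plain grep-style rule may already cover that issue without any model.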

I agree with this - try the "simpler" solutions first to see if they'll model what you need. No sense in getting lost in more complex methods if a simpler solution will suffice.
What you could do is assemble the data in tabular form so that your data is in the shape:

    Issue     System log
    --------  ------------
    issue_1   corresponding system log
    issue_2   corresponding system log
    issue_3   corresponding system log
    issue_4   corresponding system log
    issue_5   corresponding system log
Once you've done that, you can train some sort of classifier on it, e.g. something like [1]. There's a bunch of stuff you want to do to make sure you're not overfitting (I'd scale your data & use 5-fold cross validation), but that would get you started.

[1]: http://scikit-learn.org/stable/tutorial/text_analytics/worki...
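
To make that concrete, here is a minimal sketch with scikit-learn, assuming the table has been loaded as one string per log plus one issue label per log. The toy logs and issue names are invented, and the random forest is just one possible choice of classifier. TF-IDF already length-normalizes each row, which stands in for the scaling step here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: one string per log, one issue label per log.
logs = (
    [f"boot ok fan rpm 0 temp high sensor {i}" for i in range(5)]
    + [f"boot ok vm start timeout retry {i}" for i in range(5)]
)
issues = ["fan_failure"] * 5 + ["vm_boot_hang"] * 5

# Turn each log into TF-IDF word features, then fit a random forest on them.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# 5-fold cross-validation: each fold is held out once and scored on accuracy.
scores = cross_val_score(model, logs, issues, cv=5)
print(scores.mean())
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so nothing from the held-out logs leaks into the features.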

First, great answer, and thanks for your time and response! For some issues, though, the RCA depends on the order of the syslogs. For some complex issues, the path the code took changes the order of the syslog entries, and hence the RCA. I guess I will have to spend some time working syslog order into the table format you are suggesting.
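
One cheap way to get order into a flat table is to count n-grams of consecutive log events instead of single events. This is only a sketch, assuming each log has already been reduced to a sequence of event names; the names below and the `ngram_counts` helper are hypothetical.

```python
from collections import Counter

def ngram_counts(events, n=2):
    """Count runs of n consecutive events, so the order of entries becomes a feature."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))

# Hypothetical event names: same events, different order.
path_a = ["init", "load_cfg", "start_vm", "error"]
path_b = ["init", "start_vm", "load_cfg", "error"]

print(ngram_counts(path_a, 1) == ngram_counts(path_b, 1))  # True: same bag of events
print(ngram_counts(path_a, 2) == ngram_counts(path_b, 2))  # False: bigrams differ
```

The two code paths look identical as a bag of events but differ once pairs are counted. scikit-learn's text vectorizers expose the same idea on word tokens through their `ngram_range` parameter.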
If it's possible to split the log into a more granular format than what fnbr has suggested, it can potentially be used with more complex models: keep the issue as the "label" and keep the "system log" (or a hash representation?) as a feature, but if each log entry can be broken up into further data points, those can be useful in other ML methods.

Then again, if the log entry has a somewhat set length (or can be truncated), you could feed that in as the input to a CNN (one input node/neuron per character), and the output layer could consist of the issue labels. I'm not sure what, if anything, that could net you; perhaps an unknown log could be fed to the trained network, which could then classify it as one of the existing issues?
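
The input side of that idea can be sketched without any deep learning library. This assumes a fixed character set (`string.printable` here, which is my own choice) and a hypothetical `encode_fixed` helper that truncates or pads every entry to the same length; the network itself is left out.

```python
import string

ALPHABET = string.printable  # assumed character set; id 0 is reserved for padding/unknown
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

def encode_fixed(text, length):
    """Truncate or zero-pad text to exactly `length` integer character ids."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in text[:length]]
    return ids + [0] * (length - len(ids))

def one_hot(ids, vocab_size=len(ALPHABET) + 1):
    """Expand ids into a length x vocab_size matrix of 0/1, one row per character."""
    return [[1 if j == i else 0 for j in range(vocab_size)] for i in ids]

# Hypothetical log entry, padded to 16 characters.
x = encode_fixed("fan 2 rpm 0", 16)
matrix = one_hot(x)
print(len(x), len(matrix[0]))
```

Each log entry becomes a fixed-size matrix, which is the shape a 1D convolutional layer would expect as input.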

If you can upload a sample log, I'd be happy to take a look and try to provide some more specific guidance (email's in profile).

I did some work using stack traces to predict duplicate bug reports, so I'm somewhat familiar with a similar problem.