khoa's internet hideout


Software Engineer @ Autodesk. Cal '15 -- Go Bears! Data Enthusiast. Loves tennis, martial arts, water sports, and salsa.

So I heard you're an aspiring Golden Bear Data Scient-ish

This post is inspired by William Chen's post on Quora for Harvard.

Disclaimer: I haven't taken all of these classes, and some of what follows is hearsay. YMMV. This list is also not meant to be exhaustive.

So you're at Berkeley, and you want to do Data Science? Great! Here are some classes that you may find helpful to prepare for a career in Data Science.

Computer Science

  • CS 61A/B/C: Structure and Interpretation of Computer Programs, Data Structures, Machine Structures
  • CS 170: Algorithms
  • CS 188: Artificial Intelligence
  • CS 186 / IEOR 115 / Info 257: Database
  • CS 169: Software Engineering (can be substituted by an internship)

Comments: CS 61A/B/C are foundational for every CS student. I can't imagine graduating from Berkeley CS without CS 170. CS 188 introduces students to a variety of modern AI research topics, including search & heuristics, constraint satisfaction, reinforcement learning, graphical models, and machine learning.

I don't think one can do data science without a basic understanding of databases and SQL. CS 186 and IEOR 115 teach students precisely that. The former goes deep into database construction/programming, while the latter is more of a survey of common database topics and is more suitable for non-CS majors.
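
To give a taste of what "basic SQL" means here, below is a minimal sketch using Python's built-in sqlite3 module. The table, names, and unit counts are all made up for illustration; none of it comes from either class.

```python
import sqlite3

# Hypothetical toy table, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollments (student TEXT, course TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?, ?)",
    [
        ("ana", "CS 186", 4),
        ("ana", "Stat 134", 4),
        ("ben", "CS 186", 4),
        ("ben", "IEOR 115", 3),
        ("cy", "Stat 134", 4),
    ],
)

# Count students per course, largest first, ties broken alphabetically.
course_counts = conn.execute(
    """
    SELECT course, COUNT(*) AS n_students
    FROM enrollments
    GROUP BY course
    ORDER BY n_students DESC, course ASC
    """
).fetchall()

# Total units per student, collected into a dict.
units_by_student = dict(
    conn.execute("SELECT student, SUM(units) FROM enrollments GROUP BY student")
)
conn.close()

print(course_counts)
print(units_by_student)
```

If GROUP BY, aggregates, and ORDER BY already feel routine, you probably have the baseline I'm talking about.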

CS 169 teaches the life-cycle of software development and allows students to work on a semester-long project. If you have already had an internship, chances are you won't learn as much in this class, but I would still recommend taking it if you have an empty slot in your schedule.

Probability & Statistics

  • Stat 134: Probability Theory (CS 70, CS 174, EE 126, or IEOR 172 may also suffice)
  • Stat 135 / IEOR 165: Statistics / Statistical Inference
  • Stat 151A: Linear Modeling
  • Stat 152: Survey Sampling
  • Stat 153: Time Series Analysis

Comments: Stat 134 and 135 are crucial for any statistics work, including data science. There are many 150-series classes at Cal (most are meant to be taken after 134 and 135), but if you only have time for a couple, I would recommend the ones above. Again, YMMV.

If Stat 134 doesn't fit your schedule, any of the other four classes may also work. What are the differences among them? AFAIK, CS 70 teaches Discrete Mathematics alongside Probability Theory for CS majors, but doesn't go very deep into the latter. CS 174, Randomized Algorithms and Discrete Probability, is an advanced course that is best taken after CS 170 (e.g. you will be doing Randomized Min Cut, Quick Select, etc. in the first couple of weeks of class). IEOR 172 leans toward Operations Research applications. I don't have any personal experience with EE 126, though I've heard it's a notoriously difficult beast.
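
For a flavor of the randomized algorithms mentioned above, here is a rough sketch of Quick Select with a random pivot. It's my own illustrative version (the function name and the list-copying style are assumptions, not anything from CS 174), and it trades the usual in-place partitioning for readability.

```python
import random


def quickselect(items, k, rng=random):
    """Return the k-th smallest element (0-indexed) of items.

    Expected linear time thanks to the random pivot; copies the input
    rather than partitioning in place, to keep the sketch short.
    """
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    candidates = list(items)
    while True:
        pivot = rng.choice(candidates)
        smaller = [x for x in candidates if x < pivot]
        equal = [x for x in candidates if x == pivot]
        larger = [x for x in candidates if x > pivot]
        if k < len(smaller):
            # The answer lies among the elements below the pivot.
            candidates = smaller
        elif k < len(smaller) + len(equal):
            # The pivot itself is the k-th smallest.
            return pivot
        else:
            # Skip everything <= pivot and search the rest.
            k -= len(smaller) + len(equal)
            candidates = larger


data = [7, 2, 9, 4, 4, 1]
print(quickselect(data, 2))  # the median-ish element of the toy list
```

The probability part of the class is about arguing why the random pivot gives linear expected running time, which is where the Stat 134-style material comes in.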

Unlike the lower division CS classes, don't worry about the lower division Stat classes. They are mostly for non-majors and/or students who haven't learned basic stats in high school. If you have completed a year of Calculus, it should be fine to jump straight to Stat 134.

Data Analysis & Data Mining

  • Stat 133 / Stat 243: Concepts in Computing with Data / Statistical Computing
  • Stat 157: Collaborative and Reproducible Data Science
  • CS 194-16: Data Science
  • Info 290T-3: Data Mining

Comments: Stat 133 is a gentle introduction to programming for statistics majors, using R. If you already have some nontrivial programming experience, it should be fine to skip it, since Stat 135 also teaches students R from scratch for the lab portion of the class.

Stat 157 is a new course that was first offered in Fall 2013. The initial run was quite rough, but I think the class will become a valuable learning experience for undergrads in the near future. It teaches the good practices of reproducible research, which are essential if your goal is to go to grad school.

CS 194-16 is also a new course (though it was previously offered in a different format by Cloudera's Founder and Chief Scientist, Jeff Hammerbacher). There's currently a cap on enrollment because the material is still being developed, but it should be fine in future semesters. You learn the nuts and bolts of data science in this class, from scraping and preparing data to mining it for patterns.

Finally, Info 290T is a fun and relaxing course that teaches you the basics of Data Mining, often taught by Yelp's Engineering Managers. The pace is very gentle, and students work on a collaborative final project that analyzes a dataset of their choice. The class doesn't go very deep into the theory behind many algorithms, however.

Optimization & Bad-Ass Math

  • Math 110: (Advanced) Linear Algebra
  • EE 127 / EE 227B: Optimization / Advanced Optimization
  • Stat 150 / IEOR 161 / IEOR 263A: Stochastic Processes

Comments: Surprisingly (to me at least), Linear Algebra is everywhere in modern AI topics, from Machine Learning to Computer Graphics/Vision. Math 54, the lower division Linear Algebra & Differential Equations course, introduces the basic concepts, and Math 110 builds on those and teaches many fundamental LA theorems, including the Spectral Theorem (which you will see a lot when analyzing Gaussian models) and Jordan Form. Upper division mathematics classes are notoriously mind-bending and difficult, but if you have time for just one, Math 110 is the one.

Optimization, which borrows heavily from Linear Algebra (SVD, PCA, etc.), is also an essential class. It is best taken concurrently with a Machine Learning class, and there's some overlap between it and Math 110.

Stat 150 personally fries my brain. I wouldn't recommend taking it unless you have had advanced training in mathematics (e.g. Real and Complex Analysis). You will be able to recite Markov chains like the multiplication table coming out of this class. The course also touches on Markov Chain Monte Carlo (MCMC), an advanced sampling topic that is also taught (briefly) in CS 188 and CS 281A.
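
If you haven't met a Markov chain yet, here is a tiny sketch of the kind of object these classes study: a transition matrix, and the long-run (stationary) distribution you get by applying it repeatedly. The two-state "weather" matrix and the function names are made up for illustration and aren't taken from any course material.

```python
def step(dist, P):
    """Advance a probability distribution one step: returns dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, iters=1000, tol=1e-12):
    """Approximate the stationary distribution by repeated stepping.

    Assumes the chain is well behaved (irreducible and aperiodic),
    so that the iteration actually converges.
    """
    n = len(P)
    dist = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        nxt = step(dist, P)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    return dist


# Hypothetical chain: rows are the current state, columns the next state.
# State 0 = sunny, state 1 = rainy; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)
print(pi)  # roughly 5/6 sunny, 1/6 rainy
```

You can check the answer by hand: in the long run the flow from sunny to rainy (0.1 times the sunny share) must balance the flow back (0.5 times the rainy share), which gives shares of 5/6 and 1/6.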

Machine Learning & Statistical Learning

  • CS 189 / Stat 154: Machine Learning
  • CS 281A / Stat 241A: Statistical Learning Theory (Probabilistic Graphical Models)
  • CS 281B / Stat 241B: Advanced Topics in Statistical Learning Theory
  • Info 290-10: Machine Learning in Education

Comments: If you have to pick between CS 189 and Stat 154, I would go with the former hands down. Both teach students the fundamentals of machine learning, and even though you might not be on the Big Data/Machine Learning bandwagon, chances are you will find yourself applying some ML concepts in the future. It's a must-take IMO.

CS 281A is a very theoretical beast (it's a graduate-level course, after all), and students are expected to be solid in Linear Algebra and Probability Theory from day one. You learn to apply concepts from both topics to graphical models in order to analyze a variety of real-world probabilistic inference problems.

Other Related Topics

There are many popular AI topics that share a lot in common with Data Science. Whether it's Text Mining or Computer Vision, you will likely find yourself working with a bunch of messy data and trying to analyze it using similar mathematical and statistical concepts. Again, this list of related topics is not exhaustive.

  • CS 288 / Info 256: Natural Language Processing
  • CS 287: Robotics
  • CS 280: Computer Vision
  • CS 184 / CS 194-26: Computer Graphics / Computational Photography
  • Info 202: Information Retrieval

Final Words

There are obviously a ton of other classes that are helpful for a career in Data Science, and the ones suggested above are not enough to complete either a CS or Stat degree. You are encouraged to take as many of the suggested classes as possible. In addition, any other upper division or graduate CS/Stat/Math classes will never hurt.

Besides classes, don't forget to hack on a personal project or two and analyze an interesting dataset of your choice. Kaggle will be one of your best friends. There are tons of inspirational posts for a Data Science career on Quora and other websites, so I'm gonna stop here. Feel free to let me know if you have any suggestions to add to the list above. And Go Data Science Bears!


About the author

Khoa Tran

Berkeley, CA

