Highlights from KDD 2020

The 26th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2020), perhaps the most prestigious academic Data Science conference, took place last week. It was held entirely online because of the travel restrictions brought by COVID-19. I watched many of the presentations and talked to people. Here I summarize what I learned.

Online Conferences

I tried to make friends. We all need friends.

First of all, did the virtual format work? Overall, I still prefer physical conferences, but there are positive aspects to the online version, notably:

  • It is much cheaper. I had already spent a lot to go to AAAI 2020 last February, so I had not planned to attend KDD this year. Thanks to the much cheaper online format, however, I changed my plans and participated.
  • In some ways, it is easier to talk to people. After all, anyone is just one message away within the conference app. I had very productive conversations and learned a lot in this manner. In fact, I was the third most chatty person.

As for the negative aspects:

  • It is not so easy to prioritize the conference when you are in your natural environment, where work and home duties are much closer. So whatever stressors you have in your life, they remain with you.
  • The coffee breaks do not work because you aren’t really synchronized with other participants. You can’t see them and you don’t walk out of a room with them.
  • Casual conversation is difficult, and no one will say anything that might get them in trouble, since everything is recorded.
  • Exhibitors and sponsors probably suffered. A “virtual booth” seems far less effective, at least in the format that was used.
  • It simply is not as fun. We are three-dimensional creatures; space is important and traveling clears the mind, so in a sense this was all quite inhuman.

In summary, I learned a lot and kept my money, which is great, but I would still prefer to travel to the next edition! That said, note that AAAI 2021 will also be held virtually next February, so it should be easy and cheap to attend.

Finally, it is worth noting that there is considerable room for improving the virtual experience. The time has come for a proper online conference tool. The ones used by KDD (Vfairs and Whova) got the job done, but are far too imperfect to make this a sustainable format, in my opinion. Surely some entrepreneur will take the bait.

Industry and Data Science

There were many presentations, panels and conversations regarding the relationship between academia, research and industrial practice.

  • Manuela Veloso's talk on AI for intelligent financial services was to me one of the most insightful. Having spent decades in academia and recently moved to J.P. Morgan as Head of AI Research, she is in a privileged position to comment on the potential and challenges of bringing AI research thinking into more traditional companies. Three main takeaways for Data Scientists in this respect:
    • Act as a scientist, even if that is not business as usual for the company. There’s no point in hiring you if they will not let you work, so take a stand.
    • Executive leadership trust and support are essential. You won’t be able to work properly otherwise. This means, for example, not deploying prototypes in production before they are ready, even if they are already impressive (see her Mondrian project).
    • In return, recognize that business people know the business, not you (initially, anyway). We should all learn from each other. Otherwise, it doesn't work.
  • Some very interesting research she talked about:
    • Mondrian, a project for trading recommendations based on images (see below in Time Series section).
    • Synthetic data generation (see below in Time Series section).
    • AI pptX, an automated generator of PowerPoint presentations. I guess this is the ultimate business application for AI.
  • Students are educated through well-defined problems, but that almost never happens in practice. As a result, they are seldom ready to face industry challenges. I might add that Kaggle-style training has the same problem.
  • If a Data Scientist spends only 50% of their time cleaning data, they are lucky. Unfortunately, and as we should all know by now, data quality and access problems are usually inescapable and often underestimated.
  • One of the hardest things is to create good metrics. In fact, “metric/goal/objective function engineering” is something to be taken quite seriously. See also the tutorial on User Metrics and the paper A dirty dozen: twelve common metric interpretation pitfalls in online controlled experiments.
  • See Causal Meta-Mediation Analysis Inferring Dose-Response Function From Summary Statistics of Many Randomized Experiments. See also Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned.
  • Although experimentation is the best way to optimize prices, this is not always possible due to business or technical limitations. Therefore, it is important to have techniques to do so through causal inference alone. See the case of Walmart in Price Investment using Prescriptive Analytics and Optimization in Retail.

I talked to some other participants in the chat rooms and these conversations gave me some important insights, such as:

  • A lot of companies out there, even well-established ones, have no proper Data Science methodology. It is often ad-hoc, unsystematic, crazy work.
  • One solution is to define a strategy first, and then systematically execute it. This is what I am doing at my current management position and what others recommended as well. Everybody seems to basically have the same advice: have a guiding technical strategy, work incrementally, ensure users are observed, heard and involved, take feedback, repeat and improve.
  • Price optimization through experimentation can bring huge profits, but it is challenging to convince business stakeholders to pursue this strategy.
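To make the price-experimentation idea concrete, here is a minimal sketch of what a two-arm price test can buy you. The numbers, function names, and the constant-elasticity demand model are all hypothetical illustrations, not anything presented at the conference:

```python
import math

def estimate_elasticity(price_a, demand_a, price_b, demand_b):
    """Log-log (constant-elasticity) estimate from a two-arm price test."""
    return (math.log(demand_b) - math.log(demand_a)) / (math.log(price_b) - math.log(price_a))

# Hypothetical A/B test: arm A at $10 sells 1000 units, arm B at $11 sells 930.
eps = estimate_elasticity(10.0, 1000, 11.0, 930)

def revenue(price, base_price, base_demand, elasticity):
    """Predicted revenue under the constant-elasticity demand model."""
    return price * base_demand * (price / base_price) ** elasticity

# Scan candidate prices and pick the revenue-maximizing one.
candidates = [9.0 + 0.25 * i for i in range(17)]  # $9.00 .. $13.00
best = max(candidates, key=lambda p: revenue(p, 10.0, 1000, eps))
```

With inelastic demand like this (elasticity above -1), the model pushes toward the top of the scanned range, which is exactly the kind of counterintuitive-to-stakeholders result that makes the conversation hard.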


Iterative cultural change. It is working for J.P. Morgan.


Great companies support all of their services with advanced technology and analytics.

This has to be the ultimate business application for Artificial Intelligence. If a super-human AI were to go rogue and consume all of the universe’s resources over-optimizing something, as Dr. Bostrom fears, I think endless PowerPoint decks would be a more suitable madness than endless paperclips.
Another take on how to build technologies that actually solve practical problems. EM = Entity Matching.

Do things manually at first to understand how it works. Then automate it. This is what I have been telling everyone to do.

Recommender Systems and Information Retrieval

Recommender systems are really at the center of our lives these days. At least in my life anyway, now that forced home-office makes me order way more food through apps than I used to just a few months ago. Their presence in the conference reflects this growing importance. There are several related themes as well that in one way or another are connected to recommendations (e.g., Reinforcement Learning).

The algorithm can use the graph to know which attributes to use to query users.



Adam is the only one who benefits here. Everyone else loses. Alice will never be found, even though she is just as good as Adam; the difference could be mere noise. The company hiring will not find a sufficiently varied pool of candidates. The platform itself will suffer, as it will be less useful for the majority of its customers.
Marketplaces are not just for food and transportation. In fact, aren’t markets one of the oldest and most developed human institutions? It is only natural that their digital counterparts should mirror this tradition and complexity.
Note how some metrics conflict with others in this correlation chart. Sometimes trade-offs are necessary.
No matter how much you love chocolate cake, I bet you couldn’t eat it exclusively for the rest of your life.


The natural experiments that arise from application execution are all confounders: they influence both the metric at hand and the business KPI (e.g., gross merchandise volume). This makes the metric effect estimation difficult.
If we know how offline validation metrics affect the final business KPIs, we can choose better validations to perform and make everyone happier.


Automatic Machine Learning, or AutoML, got a lot of attention too.

AutoML steps. This cycle was presented by more than one person, so I suppose it is standard in the field.
Note how easy it is to specify the options for each step of the pipeline. The result is very readable. Reminds me a lot of Process Algebras.
Apparently Lale covers all the important steps in AutoML.
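The "specify the options for each step" style of pipeline definition can be sketched in plain Python. To be clear, this is not the actual Lale API, just a stdlib illustration of the underlying idea: declare candidate operators per step, then enumerate the concrete pipelines an AutoML optimizer would score:

```python
from itertools import product

# Hypothetical search space: candidate operators for each pipeline step.
search_space = {
    "impute": ["mean", "median"],
    "scale":  ["standard", "minmax", "none"],
    "model":  ["logreg", "tree", "knn"],
}

def enumerate_pipelines(space):
    """Yield every concrete pipeline (one choice per step), in step order."""
    steps = list(space)
    for combo in product(*(space[s] for s in steps)):
        yield dict(zip(steps, combo))

# 2 * 3 * 3 = 18 candidate pipelines to score with cross-validation.
pipelines = list(enumerate_pipelines(search_space))
```

A real AutoML tool would of course prune this space with a smarter search than brute-force enumeration, but the declarative spec itself is what makes the result so readable.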


An elegant way to put it.


Time Series

I really like time series. They are everywhere and their analyses can help in many different ways, from fighting a pandemic to getting rich in the stock market.

Note how the past is gradually incorporated from layer to layer.
Active learning is keeping us safe. Who knew?


This tool helps humans input what they consider strange. A great practical example of how to combine human and machine capabilities.


Well, maybe when you feared there was a problem, there really was a problem and the flight attendant lied to you! One more reason to serve passengers some hard liquor during takeoff.




Using well-known Convolutional Neural Networks to classify images of time series. Brilliant! How come I did not think about this before?


Apparently it works well. I bet there are difficulties in executing the trade, but that was not covered.
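The Mondrian talk did not detail the exact encoding, but the general trick of turning a series into a picture for a CNN can be sketched very simply. This is a hypothetical rasterization (one pixel per time step, like a tiny line chart), not the project's actual method:

```python
def series_to_image(series, height=16):
    """Rasterize a 1-D series into a binary height x len(series) grid,
    the kind of input a small CNN image classifier could consume."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    image = [[0] * len(series) for _ in range(height)]
    for x, value in enumerate(series):
        # Map the value to a row (row 0 = top of the chart).
        y = int((hi - value) / span * (height - 1))
        image[y][x] = 1
    return image

img = series_to_image([1, 3, 2, 5, 4, 5, 1, 0], height=8)
```

More sophisticated encodings (e.g., Gramian angular fields or recurrence plots) exist for exactly this purpose, but even a crude raster already lets an off-the-shelf image classifier see the shape of the series.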


I was so glad to see that synthetic time series generation is being studied in a practical setting. The fact that they are using multi-agent simulation to this end is a nice bonus.


It is important to ensure that synthetic data is similar to real data on selected metrics.
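The "similar on selected metrics" check can be made concrete with a small fidelity report. The metrics chosen here (mean, standard deviation, lag-1 autocorrelation) and the 20% tolerance are my own illustrative assumptions, not what the presenters used:

```python
import statistics

def lag1_autocorr(xs):
    """Lag-1 autocorrelation, a simple temporal-structure fingerprint."""
    mu = statistics.fmean(xs)
    num = sum((a - mu) * (b - mu) for a, b in zip(xs, xs[1:]))
    den = sum((x - mu) ** 2 for x in xs)
    return num / den

def similarity_report(real, synthetic, tol=0.2):
    """Compare real vs. synthetic series on a few chosen metrics;
    each entry is (real_value, synthetic_value, within_tolerance)."""
    metrics = {
        "mean": statistics.fmean,
        "stdev": statistics.stdev,
        "lag1_autocorr": lag1_autocorr,
    }
    report = {}
    for name, fn in metrics.items():
        r, s = fn(real), fn(synthetic)
        report[name] = (r, s, abs(r - s) <= tol * max(abs(r), 1e-9))
    return report
```

The important design point is that the metric list is explicit: you decide up front which statistical properties the synthetic data must preserve, rather than hoping it matches everything.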

Natural Language Processing

NLP has enjoyed considerable progress over the last few years, so of course it was well represented here. In particular:

  • Embeddings are everywhere, not only in NLP.
  • The text summarization tutorial presented the basics of Transformers and the latest techniques for summarization, based for instance on the BERT language model through BertSum.
    • Summarization can be either extractive (i.e., select some sentences) or abstractive (i.e., paraphrase the text).
    • The task can be seen as a form of translation from and to the same language, but with a length constraint.
    • Some useful data sources for summarization training, because their articles contain human-crafted summaries: CNN, Daily Mail, The New York Times, XSum.
  • Microsoft hosts a very nice repository with “easy” to use NLP recipes for different types of problems. This includes the summarization techniques mentioned above.
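To make the extractive/abstractive distinction concrete, here is a toy extractive summarizer. It is nowhere near BertSum, just a classic frequency-based baseline I am sketching for illustration: score each sentence by the corpus frequencies of its words and keep the top ones in original order:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Tiny frequency-based extractive summarizer: score each sentence by
    the average frequency of its words, keep the top n in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        toks = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Re-emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```

Note the contrast with the abstractive approach: nothing here can produce a word that was not already in the input, which is exactly why abstractive models need a full sequence-to-sequence (translation-like) setup.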

Healthcare and COVID-19

There were also various workshops dedicated to healthcare, but I could not attend them. What I did pay attention to was some of the COVID-19 talks.

  • There were several healthcare-related workshops, if you feel like looking through their content.
  • Unsurprisingly, COVID-19 was extensively discussed.
    • A lot of smart and well-intentioned people are working on modeling the pandemic. I’m afraid, however, that these results, including my own unpublished model, can’t really be trusted yet. At best, they must be interpreted by experts in the context of their domain knowledge and as support for other reasoning methods. The reason is that there’s no way to properly validate these models. They are constantly adapted to account for the latest facts, which means their predictions cannot go too far into the future (i.e., they are not validated beyond short-term predictions). However, they are helping us learn a lot about computational modeling, and hopefully at some point we will have modern and reliable epidemic models ready for the next pandemic.
    • I also think that, since there is no strict, formal control of predictions, it is actually hard to check which models are really working.



