Some of the most interesting papers, articles, and podcasts that I consumed during the week.
August 18-25, 2024
Interview with the CEO of GitHub: Yet another great podcast from Decoder; there are lots of interesting tidbits in here and it comes highly recommended. One that stood out for me: they use GPT-3.5 Turbo for auto-completion because of its low-latency and accuracy requirements, and for other scenarios like chat they use GPT-4o. An example of how a single application might require different models depending on the context.
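A minimal sketch of that routing pattern, in case it is not obvious how it looks in practice. This is my own hypothetical illustration, not GitHub's implementation; only the two model names come from the podcast:

```python
# Hypothetical per-scenario model router: trade latency against
# capability. Only the model names come from the podcast; the routing
# logic is an illustrative sketch, not GitHub's actual code.
LOW_LATENCY_MODEL = "gpt-3.5-turbo"   # fast and cheap: inline auto-completion
HIGH_CAPABILITY_MODEL = "gpt-4o"      # slower but stronger: chat, explanations

def pick_model(scenario: str) -> str:
    """Route a request to a model based on the scenario's latency budget."""
    if scenario == "autocomplete":
        # Completions must arrive within keystroke-level latency budgets.
        return LOW_LATENCY_MODEL
    # Chat and other interactive scenarios tolerate more latency.
    return HIGH_CAPABILITY_MODEL

print(pick_model("autocomplete"))  # gpt-3.5-turbo
print(pick_model("chat"))          # gpt-4o
```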
Can AI scaling continue through 2030? (https://epochai.org/blog/can-ai-scaling-continue-through-2030): Fascinating paper on whether it is technically feasible for the current rapid pace of AI training scaling, approximately 4x per year, to continue through 2030. It looks at the four major factors that constrain scaling: power availability, chip manufacturing capacity, data scarcity, and the “latency wall” (sequential operations within a training run impose a natural limit). If scaling does continue at this pace, we are likely to see a leap as drastic as the one from GPT-2 (rudimentary text generation) to GPT-4 (sophisticated problem-solving abilities) between 2019 and 2023.
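To put that growth rate in perspective, a back-of-the-envelope calculation (my own arithmetic, not from the paper; only the ~4x/year rate comes from the Epoch post):

```python
# Cumulative training-compute growth at ~4x/year (rate from the Epoch
# post; the projections below are my own illustration, not the paper's).
rate = 4.0
for years, label in [(4, "roughly the GPT-2 to GPT-4 window, 2019-2023"),
                     (6, "2024 through 2030")]:
    print(f"{years} years at {rate:.0f}x/year -> ~{rate ** years:,.0f}x total compute ({label})")
# 4 years at 4x/year -> ~256x total compute (roughly the GPT-2 to GPT-4 window, 2019-2023)
# 6 years at 4x/year -> ~4,096x total compute (2024 through 2030)
```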
Learning with synthetic data: Meanwhile, in a corner of X (aka Twitter) there is a fun discussion going on about the human brain and learning. Humans excel at ‘sample efficiency’, i.e. we build generalized models of the world from limited data (experiences). Is that all? Can we think of humans as having long context windows, using in-context learning to navigate the world, then dumping the context into a vector store (the brain), generating synthetic data from it (dreams), and using that data for continued pre-training of the network? If all this sounds reductionist, it is; after all, we are all betting our careers on AI, and eventually AGI!
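If you indulge the analogy, the speculative loop looks something like this. Every name here is a hypothetical placeholder for the metaphor, not a real training pipeline:

```python
import random

# Tongue-in-cheek sketch of the "humans as LLMs" analogy above.
# Everything here is a placeholder for the metaphor, nothing more.

def observe_world(n: int) -> list[str]:
    """In-context learning: a day's worth of experiences in the context window."""
    return [f"experience-{random.randint(0, 99)}" for _ in range(n)]

def dream(memory: list[str], n: int) -> list[str]:
    """Synthetic data generation: dreams remix stored memories."""
    return [" + ".join(random.sample(memory, 2)) for _ in range(n)]

memory: list[str] = []                 # the vector store (the brain)
for day in range(3):
    memory.extend(observe_world(8))    # dump the day's context into memory
    synthetic = dream(memory, 4)       # dreams as synthetic training data
    print(f"day {day}: 'continued pre-training' on {synthetic[:2]} ...")
```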
August 11-17, 2024
Causal Agent based on LLM: TLDR: A framework for building agents that can perform causal analysis on observational data, to determine whether the relationship between observations and outcomes is causation or mere correlation. Very relevant for lots of enterprise use-cases, from marketing (e.g. measuring the impact of a programmatic campaign on click-through rates) to healthcare (e.g. measuring the association between a clinical observation and the presence of a condition); work that is done today by data scientists with semi-automated tools.
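As a toy illustration of why causal analysis matters here, a minimal confounder-adjustment sketch on synthetic data. This is my own example of the general problem, not the paper's framework:

```python
import random

# Synthetic illustration (not the paper's framework): a naive comparison
# of click-through rates is confounded when 'engaged' users both see the
# campaign more often AND click more, inflating the raw difference.
random.seed(0)
users = []
for _ in range(100_000):
    engaged = random.random() < 0.3                        # hidden confounder
    treated = random.random() < (0.7 if engaged else 0.2)  # campaign exposure
    base = 0.10 if engaged else 0.02                       # engaged users click more
    lift = 0.01 if treated else 0.0                        # true campaign effect: +1pp
    clicked = random.random() < base + lift
    users.append((engaged, treated, clicked))

def ctr(rows):
    return sum(clicked for _, _, clicked in rows) / len(rows)

treated = [u for u in users if u[1]]
control = [u for u in users if not u[1]]
print(f"naive lift:    {ctr(treated) - ctr(control):+.3f}")  # far above +0.010

# Backdoor adjustment: compare within engagement strata, then average
# the per-stratum differences weighted by stratum size.
adjusted = 0.0
for e in (True, False):
    stratum = [u for u in users if u[0] == e]
    t = [u for u in stratum if u[1]]
    c = [u for u in stratum if not u[1]]
    adjusted += (ctr(t) - ctr(c)) * len(stratum) / len(users)
print(f"adjusted lift: {adjusted:+.3f}")                     # close to +0.010
```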
Open-source AI: Meta’s manifesto: TLDR: Meta’s vision of how the AI model ecosystem will evolve. While they position the Llama family’s approach as similar to Linux, there is an important difference: Meta does not intend to release the training data, which is the true IP. Nevertheless, this is expected to spawn lots of spin-offs; most relevant for us might be the push for smaller, cheaper models, which may in turn deliver the ROI that is currently a key barrier to adoption.
How far will we go with AI virtual assistants? (Podcast) TLDR: Interview with the CEO of Replika, a service that creates AI companions that can feel very real. Fun to listen to, and it also makes you think: how far will we go with AI in our personal lives? It is reminiscent of the movie ‘Her’, where the main character falls in love with an AI agent. Is this the future we are building? The Decoder podcast is one of my favorites.
August 4-10, 2024
Apple Intelligence Foundation Language Models: TLDR: Two types of models (on-device and on-server) with continued pre-training. The interesting part is their ‘Responsible AI’ implementation: starting with a taxonomy, navigating tradeoffs between helpfulness and harmlessness with SFT and RLHF, and the practice of ‘red teaming’. Fascinating how their approach is shaped by the fact that they serve a massive, diverse user base, with all the associated risks.
Super Tiny Language Models: TLDR: The idea is to figure out ways to build slimmed-down versions of LLMs, with far fewer parameters, that deliver similar levels of effectiveness. Interesting to see some of the strategies used; more importantly, we should expect these models to become more prevalent over time as their price/performance makes them more viable. Related to the work from Apple.
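To make the parameter math concrete, here is a rough transformer parameter count showing one common slimming lever, tying the input embedding to the output head. The config values and the choice of technique are my own illustration, not necessarily the paper's:

```python
# Rough transformer parameter count (illustrative config, not from the
# paper) showing why tying the input embedding to the output head is a
# big win at tiny scale, where the vocabulary dominates the budget.
vocab, d_model, n_layers = 32_000, 512, 8

embed = vocab * d_model          # input token embedding table
head = vocab * d_model           # output projection, if left untied
per_layer = 12 * d_model ** 2    # ~4*d^2 attention (Q,K,V,O) + ~8*d^2 MLP (4x expansion)
body = n_layers * per_layer

untied = embed + head + body
tied = embed + body              # weight tying: the head reuses the embedding matrix
print(f"untied: {untied / 1e6:.1f}M  tied: {tied / 1e6:.1f}M  "
      f"saved: {(untied - tied) / untied:.0%}")
# untied: 57.9M  tied: 41.5M  saved: 28%
```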
LLMs with Structured Data: TLDR: Getting LLMs to work with structured data is a well-known problem. This explores a multi-step process to narrow down the context (‘Learning to Reduce’). There is interesting applicability with agents: getting to the correct reduced context is the main challenge with large structured datasets, and solving it helps the LLM perform more accurately. Closely watching this space, given the huge opportunity across multiple use-cases.
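A minimal sketch of the context-reduction idea, heavily simplified from anything the paper does: filter a large table down to the rows and columns relevant to the question before prompting the LLM. The keyword matching below is a stand-in for a learned reducer:

```python
# Naive context reduction for tables (my simplification, not the paper's
# 'Learning to Reduce' method): keep only rows and columns that overlap
# with terms in the question, then prompt the LLM with the small table.
table = {
    "city":    ["Paris", "Lyon", "Nice", "Lille"],
    "country": ["France", "France", "France", "France"],
    "pop_m":   [2.1, 0.5, 0.3, 0.2],
    "mayor":   ["A", "B", "C", "D"],
}
question = "What is the population of Nice?"
terms = set(question.lower().replace("?", "").split())

# Keep columns whose name prefixes a question term, plus any column
# containing a matching cell value (so the key column survives).
keep_cols = [c for c in table
             if any(t.startswith(c.split("_")[0]) for t in terms)
             or any(str(v).lower() in terms for v in table[c])]
# Keep rows where some cell matches a question term.
keep_rows = [i for i in range(len(table["city"]))
             if any(str(table[c][i]).lower() in terms for c in table)]

reduced = {c: [table[c][i] for i in keep_rows] for c in keep_cols}
print(reduced)  # {'city': ['Nice'], 'pop_m': [0.3]}
```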
Analysis of Redshift fleet: TLDR: The main argument is to move away from narrow benchmarking (e.g. query execution) toward multiple workload categories. What was revealing: write-heavy data pipelines are prominent, workloads vary over time (in both load and type), queries are repetitive, and most properties of queries and workloads follow very long-tailed distributions. A great reflection on how data management is rapidly changing.
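To see why long-tailed distributions break mean-based reasoning about workloads, a tiny simulation on synthetic runtimes (my own illustration, not Redshift data):

```python
import random
import statistics

# Synthetic illustration (not Redshift data): with a long-tailed
# (lognormal) runtime distribution, the tail drags the mean far above
# the median, so percentiles, not averages, should drive benchmarking.
random.seed(42)
runtimes = sorted(random.lognormvariate(0.0, 2.0) for _ in range(100_000))

p50 = runtimes[len(runtimes) // 2]
p99 = runtimes[int(len(runtimes) * 0.99)]
mean = statistics.fmean(runtimes)
print(f"median: {p50:.2f}s  mean: {mean:.2f}s  p99: {p99:.2f}s")
# Typical output: the mean sits several times above the median,
# and p99 is far above both.
```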