Bin Yu is Chancellor’s Professor in the Departments of Statistics and of Electrical Engineering & Computer Science at the University of California at Berkeley. Her current research interests focus on statistics and machine learning theory, methodologies, and algorithms for solving high-dimensional data problems. Her group is engaged in interdisciplinary research with scientists from genomics, neuroscience, and remote sensing. She obtained her B.S. degree in Mathematics from Peking University in 1984, and her M.A. and Ph.D. degrees in Statistics from the University of California at Berkeley in 1987 and 1990, respectively. She held faculty positions at the University of Wisconsin-Madison and Yale University and was a Member of Technical Staff at Bell Labs, Lucent. She was Chair of the Department of Statistics at UC Berkeley from 2009 to 2012, and is a founding co-director of the Microsoft Lab on Statistics and Information Technology at Peking University, China, and Chair of the Scientific Advisory Committee of the Statistical Science Center at Peking University. She is a Member of the U.S. National Academy of Sciences and a Fellow of the American Academy of Arts and Sciences. She was a Guggenheim Fellow in 2006, an Invited Speaker at ICIAM in 2011, and the Tukey Memorial Lecturer of the Bernoulli Society in 2012. She was President of IMS (Institute of Mathematical Statistics) in 2013-2014, and will be the Rietz Lecturer of IMS in 2016.
The many facets of a data science project to answer: how are organs formed?

Genome-wide data reveal an intricate landscape where gene actions and interactions in diverse spatial areas are common both during development and in normal and abnormal tissues. Understanding local gene networks is thus key to developing treatments for human diseases. Given the size and complexity of recently available systematic spatial data, defining the biologically relevant spatial areas and modeling the corresponding local biological networks present an exciting and ongoing challenge. It requires the integration of biology, statistics, and computer science; that is, it requires data science. In this talk, I present results from a current project co-led by biologist Erwin Frise from Lawrence Berkeley National Lab (LBNL) to answer the fundamental systems biology question in the talk title. My group (Siqi Wu, Antony Joseph, Karl Kumbier) collaborates with Dr. Frise and other biologists (Ann Hammonds) of the Celniker Lab at LBNL, which generates the Drosophila embryonic spatial expression image data. We leverage our group's prior research experience in computational neuroscience to apply appropriate ideas from statistical machine learning and create a novel image representation that decomposes spatial data into building blocks (or principal patterns). These principal patterns provide an innovative and biologically meaningful approach to the interpretation and analysis of large complex spatial data. They are the basis for constructing local gene networks, and we have been able to reproduce almost all the links in the Nobel Prize-winning (local) gap-gene network. In fact, the Celniker Lab is running knock-out experiments to validate our predictions of gene-gene interactions. Moreover, to understand the image decomposition algorithm, we have derived sufficient and almost necessary conditions for its local identifiability in the noiseless and complete case. Finally, we are collaborating with Dr. Wei Xue from Tsinghua University to devise a scalable open software package to manage the acquisition and computation of image data, designed in a manner that will be usable by biologists and expandable by developers.

Graduated from Peking University in 1990 with a doctorate in science. Formerly an associate research fellow, research fellow, doctoral supervisor, head of the software laboratory, and chief scientist for the software direction at the Institute of Computing Technology, Chinese Academy of Sciences. From 2000, participated in founding the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC); from 2002, Chief Engineer of the Shanghai Stock Exchange; from 2012, Chairman of Shanghai Stock Communication Co., Ltd. Main research interests: design of in-memory distributed transaction processing systems, natural language processing and information retrieval, and information security.

1. The global ecosystem of the big data industry chain 2. Big data technologies and applications 3. Big data business modeling on distributed architectures 4. A big data modeling application scenario: user profiling 5. Case studies of big data business modeling

Dr. Xinyue Ye’s research focuses on the development, implementation, and application of space-time analytics for big social data. His work won the national first-place award for "research and analysis" from the US University Economic Development Association in 2011, and he received the emerging scholar award from the AAG’s Regional Development and Planning Specialty Group in 2012. He has co-edited eight journal special issues and published about 50 journal articles. Dr. Ye has been the founding director of the Computational Social Science Lab at Kent State University since 2013. Recent and current federal research projects include the University Center Program (Department of Commerce), Coastal Ohio Wind (Department of Energy), Comparative Space-Time Dynamics (National Science Foundation), and Spatiotemporal Modeling of Human Dynamics Across Social Media and Social Networks (National Science Foundation). Dr. Ye’s research is closely related to the mission of R and open source computation through his work on computational social science, especially social media analytics, spatial social network analysis, and firm-level spatial economic analysis. Dr. Ye received his Ph.D. in Geography from the University of California, Santa Barbara.
Open Source Comparative Spatiotemporal Dynamics

A powerful analytical framework for identifying research gaps and frontiers is fundamental to the comparative study of spatiotemporal phenomena throughout the social sciences. The multiple dimensions and scales of socioeconomic dynamics pose numerous challenges for the application and evaluation of public policies in the comparative context. At the same time, research in the fields of temporal GIS and spatial econometrics has generated many novel space-time methods. However, the strengths of these spatiotemporal modeling methods have rarely been utilized to their full potential because the characteristics and structure of space-time datasets vary greatly across fields of study. Hence, efforts are duplicated and many critical gaps remain unexplored. This talk aims to contribute to the comparative analysis of the dynamics of spatial inequality. Achieving a more balanced territorial distribution of wealth is among the biggest challenges for public policy design. Comparative analysis of spatial economies will reveal the dynamics of spatial economic structures, such as the emergence and evolution of poverty traps and convergence clubs, enabling economies to benefit from each other’s experiences and lessons learned. More specifically, I will develop a methodological framework for comparing spatial inequality dynamics and an open source toolkit that can be used to systematically analyze and assess the differences between two socioeconomic systems. This framework will not only pave the way toward models for explaining such inequality but also provide a vehicle for projective studies. The open source approach allows a broader community to incorporate additional advances in research inquiry for specific goals, thus facilitating interdisciplinary collaboration.
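Comparative analysis of spatial inequality dynamics typically starts from an inequality index computed per region and per period. As a minimal illustrative sketch (not the talk's toolkit), the Gini coefficient of a regional income distribution can be computed as:

```python
def gini(values):
    """Gini coefficient of a list of non-negative incomes,
    via the standard mean-difference formula on sorted values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # sum of rank * value for sorted values (ranks are 1-based)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n

# Perfect equality gives 0; concentrating all income raises the index
print(gini([1, 1, 1, 1]))    # 0.0
print(gini([0, 0, 0, 100]))  # 0.75
```

A comparative framework like the one described would track how such an index, computed for each spatial unit, evolves over time in two socioeconomic systems.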

Applications of Transfer Learning to Click-Through Rate Prediction in Advertising

Member of the ECharts team and author of ECharts-X. Currently focused on front-end graphics and visualization.

A brief introduction to and demonstration of ECharts; the current status of the project, including its GitHub star count; an introduction to the ECharts-X branch, with demonstrations of the globe visualization and 3D plots; and new features already added or coming soon in ECharts 3.0.

Collective Attention Flows on the Web

In the information age, human attention has become a scarce resource, and understanding how collective attention flows across the sea of information resources is important. We model collective attention flows as open flow networks. In the first study, we embed the network built from Indiana University clickstream data into a high-dimensional space and show how attention is distributed over websites. Second, using behavioral data from Stack Exchange, the largest ask-answer community, we show that the flows of collective attention along various paths in the network may determine the success of an online community. Third, we study 30,000 online forums from Baidu Tieba and show that forums, like organisms, have a metabolism and obey a generalized Kleiber law. The scaling exponent of the Kleiber law can be treated as a novel and stable indicator of the stickiness of a given forum.
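The generalized Kleiber law mentioned above is a power-law relation, and the scaling exponent of such a law is conventionally estimated by least squares on the log-log scale. A minimal sketch of that estimation step (illustrative only, not the authors' pipeline):

```python
import math

def fit_power_law(xs, ys):
    """Estimate theta and c in y ~ c * x**theta by ordinary
    least squares on log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    theta = sxy / sxx           # slope on log-log scale = exponent
    c = math.exp(my - theta * mx)
    return theta, c

# Exact power-law data recovers the exponent and prefactor
theta, c = fit_power_law([1, 2, 4, 8],
                         [3, 3 * 2**0.75, 3 * 4**0.75, 3 * 8**0.75])
print(round(theta, 6), round(c, 6))  # 0.75 3.0
```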

Least Squares Estimation of Spatial Autoregressive Models for Large-Scale Social Networks

Owing to the rapid development of online social networking sites, the usefulness of the spatial autoregressive model has been recognized, and the model is widely used to explore social network structure. However, traditional estimation methods are practically infeasible when the network is huge (e.g., Facebook, Twitter, Sina Weibo, WeChat). We propose a novel least squares estimation (LSE) approach whose computational complexity is linear in the network size for sparse networks. Under certain regularity conditions, we show theoretically that the proposed least squares estimator is $\sqrt{n}$-consistent and asymptotically normal. In addition, the proposed method can be readily applied to sampled network data. Numerical studies based on both simulated and real datasets are presented.
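For intuition, the spatial autoregressive model takes the form $Y = \rho W Y + X\beta + \varepsilon$, where $W$ is a row-normalized network adjacency matrix. The toy sketch below, which is not the proposed LSE, simulates a noiseless SAR response on a 4-node network and recovers $(\rho, \beta)$ by a plain least squares regression of $Y$ on $(WY, X)$; with noise, $WY$ is correlated with $\varepsilon$, so this naive regression is inconsistent in general, which is part of what a carefully constructed estimator must address.

```python
# SAR model: Y = rho * W Y + X * beta + eps (eps = 0 here for illustration).

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Row-normalized adjacency matrix of a toy 4-node directed network
W = [[0.0, 0.5, 0.5, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0, 0.0]]
X = [1.0, 2.0, -1.0, 0.5]
rho, beta = 0.3, 1.5

# Generate Y by fixed-point iteration (converges since |rho| < 1)
Y = [0.0] * 4
for _ in range(200):
    Y = [rho * wy + beta * x for wy, x in zip(matvec(W, Y), X)]

# Least squares of Y on (WY, X) via the 2x2 normal equations
Z = matvec(W, Y)
a11 = sum(z * z for z in Z)
a12 = sum(z * x for z, x in zip(Z, X))
a22 = sum(x * x for x in X)
b1 = sum(z * y for z, y in zip(Z, Y))
b2 = sum(x * y for x, y in zip(X, Y))
det = a11 * a22 - a12 * a12
rho_hat = (b1 * a22 - b2 * a12) / det
beta_hat = (a11 * b2 - a12 * b1) / det
print(rho_hat, beta_hat)  # ~0.3 and ~1.5
```

Note that each node only contributes terms over its neighbors, which is why estimators built from such sums can cost only O(edges) on a sparse network.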

Applications of R in Algorithmic Trading and High-Frequency Trading

1. The current state of algorithmic trading in China 2. Applications of R in algorithmic trading 3. Research approaches to, and development trends in, high-frequency trading

Distillation of News Flow into Analysis of Stock Reactions

News carries information about market moves. The gargantuan plethora of opinions, facts, and tweets on financial business offers the opportunity to test and analyze the influence of such text sources on future directions of stocks. It also creates, though, the necessity of distilling, via statistical technology, the informative elements of this prodigious and indeed colossal data source. Using mixed text sources from professional platforms, blog forums, and stock message boards, we distill sentiment variables via different lexica. These are employed for an analysis of stock reactions: volatility, volume, and returns. Increased (negative) sentiment influences volatility as well as volume, and this influence is contingent on the lexical projection and differs across GICS sectors. Based on review articles on 100 S&P 500 constituents for the period October 20, 2009 to October 13, 2014, we project onto the BL, MPQA, and LM lexica and use the distilled sentiment variables to forecast individual stock indicators in a panel context. Exploiting different lexical projections and different stock reaction indicators, we aim to answer the following research questions: (i) Are the lexica consistent in their analytic ability to produce stock reaction indicators, including volatility, detrended log trading volume, and returns? (ii) To what degree is there an asymmetric response given the sentiment scales (positive vs. negative)? (iii) Does news about high-attention firms diffuse faster and result in more timely and efficient stock reactions? (iv) Is there a sector-specific reaction to the distilled sentiment measures? We find significant incremental information in the distilled news flow, although the three lexica are not consistent in their analytic ability. Based on confidence bands, an asymmetric, attention-specific, and sector-specific response of stock reactions is diagnosed.
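The lexical projection step can be illustrated with a toy scorer; the real BL, MPQA, and LM lexica contain thousands of graded entries, so the word lists below are purely hypothetical:

```python
# Hypothetical mini-lexicon; real lexica (BL, MPQA, LM) are far larger
POS = {"gain", "growth", "beat", "strong", "upgrade"}
NEG = {"loss", "miss", "weak", "downgrade", "risk"}

def sentiment(text):
    """Net sentiment per word: (positive hits - negative hits) / length."""
    words = text.lower().split()
    pos = sum(w in POS for w in words)
    neg = sum(w in NEG for w in words)
    n = len(words)
    return (pos - neg) / n if n else 0.0

print(sentiment("strong quarter with earnings beat"))     # 0.4
print(sentiment("downgrade on weak guidance and risk"))   # -0.5
```

A panel analysis like the one described would aggregate such scores per firm and day, under each lexicon, before regressing stock reaction indicators on them.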

C++/R工作环境配置

ESG 经济情境产生器之系统开发

2014.07-present: Associate Professor and master's supervisor, School of Computer Science, Nanjing University of Posts and Telecommunications. 2012.12-2014.06: Lecturer and master's supervisor, School of Computer Science, Nanjing University of Posts and Telecommunications. 2012.06-2012.12: Senior Engineer, Huawei Beijing Research Institute. 2010.09-2012.05: Senior Researcher, Kuang-Chi Institute of Advanced Technology, Shenzhen. 2010.04-2010.07: Research Assistant, Department of Electrical and Computer Engineering, Duke University. 2009.02-2010.03: Research Scholar, Department of Statistical Science, Duke University, concurrently a researcher in the sequential Monte Carlo methodology program at the Statistical and Applied Mathematical Sciences Institute (a U.S. national institute).
Adaptive Annealed Importance Sampling for Bayesian Multimodal Posterior Exploration

In this talk, I describe an algorithm that adaptively provides mixture summaries of multimodal posterior distributions in the context of Bayesian inference. This work was motivated by an astrophysical problem, extrasolar planet detection, in which computing the stochastic integrals required for Bayesian model comparison is challenging. The difficulty comes from highly nonlinear models that lead to multimodal posterior distributions. An importance sampling procedure is used to estimate the integrals, translating the task into finding a parametric approximation of the posterior. A mixture proposal distribution is used to capture the multimodal structure of the posterior, and the parameters of the mixture proposal are tailored by a proposed iterative delete/merge/add process that works in tandem with an expectation-maximization step. The efficiency of the proposed method is tested via both simulation studies and real exoplanet data analysis. The results were published in a flagship astrophysics journal.
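The importance sampling idea can be sketched as follows: draw from a Gaussian-mixture proposal and average the ratio of the unnormalized target to the proposal density. In this toy example (not the adaptive algorithm itself), the two-component proposal happens to be exactly proportional to the bimodal target, so the importance weights are constant; that is the ideal outcome the delete/merge/add adaptation works toward.

```python
import math, random

random.seed(0)

def target_unnorm(x):
    # Unnormalized bimodal density with modes at -3 and +3
    return math.exp(-0.5 * (x - 3.0) ** 2) + math.exp(-0.5 * (x + 3.0) ** 2)

def mixture_pdf(x, comps):
    # comps: list of (weight, mean, sd) Gaussian components
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in comps)

def sample_mixture(comps):
    u, acc = random.random(), 0.0
    for w, m, s in comps:
        acc += w
        if u <= acc:
            return random.gauss(m, s)
    return random.gauss(comps[-1][1], comps[-1][2])

# Two-component proposal matched to the two modes of the target
proposal = [(0.5, -3.0, 1.0), (0.5, 3.0, 1.0)]
n = 20000
weights = []
for _ in range(n):
    x = sample_mixture(proposal)
    weights.append(target_unnorm(x) / mixture_pdf(x, proposal))
Z_hat = sum(weights) / n  # estimate of the normalizing constant
print(Z_hat)  # close to 2 * sqrt(2*pi) ≈ 5.013
```

With a poorly matched proposal (e.g., a single Gaussian on one mode) the weights become wildly variable, which is precisely the failure mode an adaptive mixture scheme is designed to detect and repair.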

Analysis and Visualization of Geographic Spatiotemporal Data in R

Data analyst at J.D. Power, guest lecturer at Zhejiang University, and core member of the Capital of Statistics (统计之都) community. Holds a master's degree in bioinformatics from Sichuan University and the Academy of Military Medical Sciences, with work experience across the Internet, automotive, pharmaceutical, and agricultural industries, and is proficient in R, HTML5/CSS3, Python, and JavaScript development. Holds one national invention patent and four software copyrights, and has published two papers in the renowned international journal Bioinformatics and one in Nucleic Acids Research.
When R Meets Visualization, Data Acquires a Different Flavor

As a language for data analysis and statistical modeling, R has become the tool of choice in data science; data visualization, as a key form of data analysis and of presenting analytical results, gives data new life. When R, a powerful tool for analysis and modeling, meets visualization, data gains endless possibilities and a distinctive flavor. This talk covers the main mechanisms for data visualization in R (especially dynamic, interactive visualization): for ordinary users, the visualization resources and approaches available (extension packages such as leaflet and DT); for developers, how to use the htmlwidgets/shiny/RMarkdown frameworks to build custom data visualization products quickly, simply, and effectively.

GIS + R Is Accelerating Commercial Applications of Geographic Information

What is GIS? Geography + information + systems + ...? What are the application scenarios for GIS? Data management, visualization, outlet planning, site-selection evaluation, logistics optimization, channel management, and more. What development tools exist for commercial GIS applications? ArcGIS, MapInfo, QGIS, PostGIS, R, ... What commercial products has 辰智咨询 (Chenzhi Consulting) built with GIS? 商圈秀, 叠趣, a customer-visit system, and a full-volume data analysis platform. How does 辰智咨询 develop products with GIS + R? GIS serves as the container for spatial data and the underlying development platform, while R provides the algorithms for spatial modeling and a powerful part of the visualization toolkit.

Applications of Learning to Rank in RTB

Learning to rank has many applications in RTB (real-time bidding). A DSP (demand-side platform) must assess the value of each individual user and win the impression at an appropriate price. Because of competition, the same prediction accuracy with a different ranking can produce completely different auction outcomes, so the model must focus on users with higher conversion (or click-through) rates. We will discuss applications of learning to rank to click-through rate, conversion rate, and recommender systems, and analyze how various evaluation metrics affect the results.
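One standard metric for comparing rankings that contain the same items in a different order is NDCG; a minimal sketch (illustrative, not the specific metrics discussed in the talk):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain for a ranked list of relevance grades
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_rels):
    """NDCG: DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal else 0.0

# Same set of relevance grades ("same accuracy"), different order,
# very different metric values
print(round(ndcg([3, 2, 0, 1]), 4))  # 0.9854
print(round(ndcg([0, 1, 2, 3]), 4))  # 0.6138
```

The two calls score identical relevance sets, mirroring the abstract's point that equal accuracy with a different ranking can yield completely different outcomes.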

Developing, Deploying, Using, and Sharing R on the Seven Bridges Genomics Platform

Seven Bridges Genomics is a leading bioinformatics cloud-computing company, providing cloud storage and analysis solutions for genomic analysis to companies and research institutions around the world, including the U.S. and U.K. national governments. Bioinformatics is the spark struck when biology, computer science, and mathematical statistics collide, and SBG is likewise committed to supporting open source communities and languages: the rabix project, built on Docker and the Common Workflow Language standard for pipeline description, makes it easy to develop, deploy, and share open source software both locally and on the SBG cloud. This talk introduces the platform's support for R, including loading files in the cloud, using RStudio, developing, deploying, and sharing R packages with Docker and rabix, and connecting to existing tools in the application library to carry out data analysis, mining, and statistical analysis of biological data.

imputeR: A General Imputation Framework

Enrolled as a master's student in applied mathematics at Xi'an University of Architecture and Technology in 2006, and joined Xi'an Eurasia University in 2009.

Rainbow 7 (Peking University)

LightLDA: Making Super-Large Topic Models Possible

When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens --- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly-efficient $\mathcal{O}(1)$ Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers; 2) a model-scheduling scheme to handle the big model challenge, where each worker machine schedules the fetch/use of sub-models as needed, resulting in a frugal use of limited memory capacity and network bandwidth; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed. These contributions are built on top of the Petuum open-source distributed ML framework, and we provide experimental evidence showing how this development puts massive data and models within reach on a small cluster, while still enjoying proportional time cost reductions with increasing cluster size.
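Constant-time sampling from a discrete distribution, a classic building block behind $\mathcal{O}(1)$ proposal generation in samplers of this kind, is achieved with Walker's alias method: after $\mathcal{O}(K)$ preprocessing, each draw costs $\mathcal{O}(1)$. A generic sketch (not LightLDA's implementation):

```python
import random

def build_alias(probs):
    """Walker's alias method: O(K) table construction for O(1) draws."""
    k = len(probs)
    scaled = [p * k for p in probs]
    alias = [0] * k
    prob = [0.0] * k
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]          # keep s with this probability
        alias[s] = l                 # otherwise redirect to l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:          # leftovers are full cells
        prob[i] = 1.0
    return prob, alias

def draw(prob, alias):
    # One uniform cell choice plus one biased coin flip: O(1)
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

random.seed(1)
prob, alias = build_alias([0.5, 0.3, 0.2])
counts = [0, 0, 0]
for _ in range(100000):
    counts[draw(prob, alias)] += 1
print([c / 100000 for c in counts])  # roughly [0.5, 0.3, 0.2]
```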

Natural Language Processing in a Deep Way

Semi-supervised Document Classification through a Bayesian Hierarchical Model of Latent Topics

Integrating word segmentation with text classification

One of the founders and an administrator of the Chinese Python community, devoted to community service for Python and related communities, and widely known in the community as "大妈". Designer and host of O.B.P (Open Book Proj., a Chinese open-book initiative for Python) and of 蟒营 (PythoniCamp); organizes and hosts various online and offline activities; led the compilation of 《可爱的Python》 (Lovely Python); and persists in using the Pythonic spirit to bring people in China into the FLOSS world to learn, share, and create...
R or Py: That Is the Question

The R language contains many ingeniously designed constructs (pipes, for example) that exploit the language's flexible syntax to achieve almost magical effects. This talk will attempt to "reinvent the wheel", rebuilding these pieces of "dark magic" as concisely as possible so that the audience can understand the mechanisms behind them. Topics are expected to include: (1) functions and custom operators; (2) closures and environments; (3) lazy evaluation; (4) parsing and building syntax.
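As a taste of what "reinventing the wheel" can look like, here is a pipe rebuilt with a custom operator, written in Python as an analogue (the talk itself works in R, where a pipe such as %>% is defined as an ordinary user-level operator):

```python
# A magrittr-style pipe rebuilt via operator overloading:
# Pipe(x) | f | g unwraps to g(f(x)), read left to right.
class Pipe:
    def __init__(self, value):
        self.value = value

    def __or__(self, fn):
        # x | f evaluates f(x) and stays pipeable
        return Pipe(fn(self.value))

result = (Pipe([3, 1, 2])
          | sorted
          | (lambda xs: [x * 10 for x in xs])
          | sum).value
print(result)  # 60
```

The point mirrors the talk's: the "magic" is just ordinary language machinery (here operator dispatch; in R, user-defined operators plus lazy evaluation) composed carefully.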

Optimization Methods in the R Language

R is a professional statistical computing environment, but it is also a very flexible development platform. Optimization was not originally one of R's strengths, but as the language has grown in popularity in recent years, many authors have integrated excellent optimization tools into the R environment, making it considerably more powerful. Drawing on the speaker's work experience, this talk introduces optimization methods commonly used in industry, including nonlinear programming, linear programming, mixed-integer nonlinear programming, and genetic algorithms, together with their implementations in the R environment, and compares them with commercial software in terms of application scenarios and runtime performance.

Progress on the Julia Language and Its Domain-Oriented Support Environment

Julia's syntax is very friendly to the computing domain, with strong support for distributed high-performance computing and extensible mathematical libraries. The first part of this talk introduces new syntax features of Julia and the basic structure of its programs. The second part focuses on Julia's support for parallel computing and on how to build and use domain-oriented tool libraries. The third part reports recent progress on the OpenBLAS library. Finally, the talk describes a high-performance Julia cloud programming environment for enterprise computing, illustrated with recent attempts to build domain applications based on deep learning.

rjulia: Another Route to Improving R's Computational Efficiency

rjulia combines the strengths of R and Julia, offering another route to improving R's computational efficiency and a convenient tool for data analysts who want to use the two together. Users can write the compute-intensive parts in Julia, without resorting to C/Fortran or Rcpp to build extension packages, which lowers the difficulty of writing and debugging code while still delivering the speedup. In addition, Julia's parallel computing can be used for big-data processing and distributed computation, complementing R's still-incomplete support for distributed computing.

Collaboration between Spark, R, and Python in Data Modeling

1. Internet wealth management is showing a trend toward asset diversification: various financial, quasi-financial, and non-financial assets are packaged into new wealth-management products through Internet finance platforms. 2. The characteristics of different assets determine product design, risk profile, and investor perception. 3. What is the logic by which different assets become "Internetized", and what are the trends? 4. Similarities and differences between Internet wealth-management products and asset securitization: market comparison and substitution relationships.

ICA: A Study of the Estimation Order of Independent Components