Spark Thrift Server
- A service that provides JDBC/ODBC connectivity.
- It runs as an ordinary Spark application; the only difference is that this application accepts JDBC/ODBC connections,
- so it can be inspected and monitored through the application's 4040 web UI.
Connecting with beeline
!connect jdbc:hive2://ser-01:10015
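The same endpoint can also be queried programmatically. A minimal sketch using the PyHive package (an assumption; any HiveServer2-compatible client works), with the host and port taken from the beeline command above:

```python
# Minimal sketch: query the Spark Thrift Server from Python via PyHive.
# Assumes `pip install pyhive[hive]`; host/port match the beeline example.
from pyhive import hive

conn = hive.connect(host='ser-01', port=10015)  # same endpoint as beeline
cursor = conn.cursor()
cursor.execute('SHOW TABLES')
for row in cursor.fetchall():
    print(row)
conn.close()
```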
Connecting to YARN
spark-shell --master yarn
spark-submit --master yarn --deploy-mode client examples/src/main/python/pi.py
spark-submit --master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.5.jar
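The pi.py script referenced above estimates π by Monte Carlo sampling. A minimal sketch of the same idea (not the exact bundled script):

```python
# Sketch of a Monte Carlo pi estimate, in the spirit of
# examples/src/main/python/pi.py (not the exact bundled script).
import random
from pyspark import SparkContext

sc = SparkContext(appName="PiSketch")
n = 1000000

def inside(_):
    # Sample a point in the unit square; count it if it falls in the circle.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0

count = sc.parallelize(range(n)).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()
```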
Problems
How do you index into an Array[String] in Spark (Scala)? Elements are accessed with parentheses (the apply method): map(x => (x(0),1))
val rdd2 = rdd.map(x => x.split("\\^")).map(x => (x(0),1)).take(10)
rdd2: Array[(String, Int)] = Array((460010187619746,1), (460010255255352,1), (460010319500136,1), (460010339514283,1), (460010751106661,1), (460010993315713,1), (460011042271600,1), (460011042272309,1), (460011057659235,1), (460011102532419,1))
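The PySpark equivalent uses plain list subscripts; note that Python's str.split takes a literal separator, so the caret needs no regex escaping, unlike Scala's split. A sketch assuming an analogous RDD named rdd of caret-delimited strings:

```python
# PySpark equivalent of the Scala snippet above: split on '^' and key by
# the first field. Assumes `rdd` holds caret-delimited strings.
rdd2 = (rdd.map(lambda x: x.split("^"))
           .map(lambda x: (x[0], 1))
           .take(10))
```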
Missing Python package: ImportError: No module named sklearn.cluster
- The Python distribution lives locally on the client machine; ship it with the job:
--conf spark.yarn.dist.archives=/home/hadoop/python37 \
--conf spark.pyspark.driver.python=/home/hadoop/python37/bin/python \
--conf spark.pyspark.python=/home/hadoop/python37/bin/python \
Source: https://blog.csdn.net/yawei_liu1688/article/details/112304595
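To confirm that executors actually pick up the shipped interpreter, a quick check can be submitted with the configuration above (hypothetical verification script; the app name and logic are mine):

```python
# Quick check that executors use the shipped interpreter: print
# sys.executable on the driver and from inside a task.
import sys
from pyspark import SparkContext

sc = SparkContext(appName="WhichPython")

def executor_python(_):
    import sys  # imported on the executor side
    return sys.executable

print("driver:  ", sys.executable)
print("executor:", sc.parallelize([0]).map(executor_python).first())
sc.stop()
```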
Anaconda installation (Tsinghua open-source mirror: https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/) -- the steps below require internet access
1. Run the installer: bash Anaconda3-5.3.1-Linux-x86_64.sh
2. Press Enter
3. Type: yes
4. You can accept the default install location here by pressing Enter, or supply your own. I chose a custom directory: /tmp/software/anaconda3
5. Type: yes
Then add Anaconda to the environment: vim /etc/profile
export ANACONDA_HOME=/tmp/software/anaconda3
export PATH=$ANACONDA_HOME/bin:$PATH
export PYSPARK_PYTHON=$ANACONDA_HOME/bin/python
# Test
pyspark
# Pitfall: after installing scikit-learn, conda still reports "no module"
conda install scikit-learn
# Fix: re-run the Anaconda installer with the -u flag to overwrite the
# existing install at /home/bonc_zj/anaconda2
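A quick sanity check that the Anaconda interpreter and scikit-learn are visible on both sides of a PySpark job (hypothetical check script, not from the source):

```python
# Sanity check after the Anaconda setup: confirm sklearn imports on
# both driver and executors (run via pyspark or spark-submit).
import sklearn
from pyspark import SparkContext

sc = SparkContext(appName="SklearnCheck")

def executor_version(_):
    import sklearn  # imported on the executor side
    return sklearn.__version__

print("driver sklearn:  ", sklearn.__version__)
print("executor sklearn:", sc.parallelize([0]).map(executor_version).first())
sc.stop()
```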
On a machine with internet access, install the packages and bundle the environment as py27.tar.gz.
spark-submit using the Python environment inside py27.tar.gz:
spark-submit --master yarn --deploy-mode client \
--conf spark.yarn.dist.archives=hdfs://zjltcluster/share/external_table/share/external_table/app_bonc_zj/hdfs/hivedb/test_mean_shift/py27.tar.gz#python27 \
--conf spark.pyspark.driver.python=./python27/py27/bin/python \
--conf spark.pyspark.python=./python27/py27/bin/python \
testMeanShift.py
# Two pitfalls
# 1. The extraction path
#    The `#python27` suffix means py27.tar.gz is extracted into a directory named python27.
#    When referencing the interpreter, remember the archive was built from the py27 directory itself, so the path is ./python27/py27/bin/python
# 2. testMeanShift.py must point at the same extraction path as in 1:
import os
# Must match the extraction path declared in spark.yarn.dist.archives (pitfall 1)
os.environ['PYSPARK_PYTHON'] = './python27/py27/bin/python'
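The full testMeanShift.py is not shown in these notes. A minimal sketch of what such a job might look like, assuming scikit-learn's MeanShift is fit on data collected to the driver; the input path and feature format are hypothetical:

```python
# Hypothetical sketch of testMeanShift.py: run scikit-learn's MeanShift
# inside a PySpark job using the shipped py27 environment.
import os
# Set before SparkContext is created, so executors use the shipped interpreter.
os.environ['PYSPARK_PYTHON'] = './python27/py27/bin/python'

import numpy as np
from pyspark import SparkContext
from sklearn.cluster import MeanShift

sc = SparkContext(appName="testMeanShift")

# Hypothetical input: caret-delimited numeric features, collected to the
# driver because scikit-learn itself is not distributed.
points = (sc.textFile("hdfs:///tmp/features.txt")  # path is an assumption
            .map(lambda line: [float(v) for v in line.split("^")])
            .collect())

model = MeanShift().fit(np.array(points))
print("cluster centers:", model.cluster_centers_)
sc.stop()
```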
After submitting with spark-submit:
ImportError: No module named sklearn.cluster.mean_shift_
Fix: repackage the environment so the mean-shift module is included.
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
Cause: Python version mismatch between the packaged interpreter and Spark. Spark 1.6.2 supports Python 2.6+, so rebuild the environment with Python 2.7:
conda create -n mlpy_env --copy -y -q python=2.7 numpy scikit-learn
Repackage with Python 2.7 + scikit-learn + numpy.
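To catch this mismatch before PySpark's namedtuple patching fails, a defensive check can go at the top of the job script (hypothetical guard, not from the source):

```python
# Guard against the Python-version mismatch behind the namedtuple()
# TypeError above: fail fast if the interpreter is not the packaged 2.7.
import sys

if sys.version_info[:2] != (2, 7):
    raise RuntimeError(
        "Expected the bundled Python 2.7 environment, got %s; "
        "check spark.pyspark.python / PYSPARK_PYTHON." % sys.version.split()[0])
```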